
The Hero Sysadmin

Analogous to the Hero Developer, a hero sysadmin is the single individual who understands the operational infrastructure so deeply – and often exclusively – that the organization's long-term stability depends on their presence. Their knowledge is undocumented, their tooling is non-standard, and their workflows are tightly coupled to personal habits rather than shared operational processes.

While the hero sysadmin may be efficient individually, their impact on long-term organizational health is usually negative. The bus factor drops to 1, onboarding becomes difficult, and system resilience declines over time.

Don't be fooled if a sysadmin leaves and everything keeps working – the systems usually only start breaking down later, triggered by hardware failures, software updates, or seemingly insignificant changes.

Here are some indicators that your team may have (or is creating) a hero sysadmin:

  • Opaque Infrastructure – critical systems rely on undocumented scripts, cron jobs, or "secret" directories known only to one person.
  • No Reproducible Playbooks – server setups, deployments or recovery steps exist only in someone's memory, not in version control.
  • Personal Toolchains – tools, aliases, or patched binaries only present on the sysadmin’s machine.
  • Resistance to Delegation – tasks that could be shared are "too complicated," "too risky," or "just quicker if I do it."
  • Single Point of Failure – planned vacations trigger anxiety, discussions about on-call rotations stall, and incidents pile up in their absence.
  • Maintenance Surprises – updates or migrations fail because "only they knew" about a certain dependency, override, or workaround.

A sysadmin maintains a set of provisioning scripts and services that only function correctly on their personal workstation, or only when run in some arcane order that isn't documented anywhere. They might rely on environment variables and configuration files stored in their home directory.

When another team member attempts to reproduce the setup on a different machine, deployments inexplicably fail, services misbehave, or dependencies cannot be resolved. That person has to recreate the environment either by reverse-engineering the original sysadmin's machine or by reading through the scripts to work out what belongs in the config files.

This is the classic "works on my machine" anti-pattern: the sysadmin's environment has diverged from every other machine, and the organization becomes dependent on a fragile, hard-to-reproduce setup tied to a single person.
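
A first step away from this pattern is to make the hidden assumptions explicit: instead of silently reading dotfiles from a personal home directory, a provisioning script can declare the configuration it needs up front and fail loudly when something is missing. The sketch below illustrates that idea in Python; the variable names (DEPLOY_TARGET, DB_URL, ARTIFACT_VERSION) and the deploy.sh call are placeholders, not a real deployment interface.

  #!/usr/bin/env python3
  """Provisioning entry point that declares its configuration explicitly."""
  import os
  import subprocess
  import sys

  # Everything the script needs is named here -- no reliance on dotfiles or
  # environment quirks of one particular workstation. (Placeholder names.)
  REQUIRED_VARS = ["DEPLOY_TARGET", "DB_URL", "ARTIFACT_VERSION"]

  def main() -> int:
      missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
      if missing:
          # Fail loudly and early instead of misbehaving later on another machine.
          print("Missing required configuration: " + ", ".join(missing), file=sys.stderr)
          return 1

      # Placeholder for the actual provisioning steps; the exit code propagates
      # so callers (humans or CI) see when something goes wrong.
      result = subprocess.run(["./deploy.sh", os.environ["DEPLOY_TARGET"]], check=False)
      return result.returncode

  if __name__ == "__main__":
      sys.exit(main())

The point is not the language but the contract: nothing about the script depends on whose machine it runs on, and every input it needs is visible in version control.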

How to Prevent This: Ephemeral Environments

To avoid this failure mode, infrastructure should be reproducible, declarative, and environment-agnostic. One effective approach is to use ephemeral environments: short-lived setups – containers, throwaway virtual machines, CI-provisioned runners – that are built from code on demand and destroyed again afterwards (a minimal sketch follows the list below).

Ephemeral environments ensure:

  • No hidden local state or personal configurations.
  • Every change must be codified, not improvised.
  • Setup works consistently across machines, clouds, and CI systems.
  • "Rebuilding" becomes normal. If it can't be rebuilt automatically, it doesn't ship.

A production synchronization process stops functioning three weeks after a sysadmin leaves. Investigation reveals a cron job on an old utility server, running under the sysadmin's personal user account – an account that was deleted during offboarding, as the security guidelines require. No one knew the job existed.
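
One cheap safeguard against this kind of surprise is to audit which accounts own scheduled jobs. The sketch below assumes a Linux host where crontab -l -u is available and the script runs with enough privileges to query other users' crontabs; the set of accounts that are allowed to own jobs is a placeholder you would adapt locally.

  #!/usr/bin/env python3
  """List per-user crontabs so jobs hiding under personal accounts become visible."""
  import pwd
  import subprocess

  # Placeholder: accounts that are allowed to own scheduled jobs.
  SERVICE_ACCOUNTS = {"root", "backup", "deploy"}

  def main() -> None:
      for entry in pwd.getpwall():
          # Skip accounts with nologin/false shells; we only care about login accounts.
          if entry.pw_shell.endswith(("nologin", "false")):
              continue
          result = subprocess.run(["crontab", "-l", "-u", entry.pw_name],
                                  capture_output=True, text=True)
          if result.returncode == 0 and result.stdout.strip():
              tag = "ok" if entry.pw_name in SERVICE_ACCOUNTS else "PERSONAL ACCOUNT"
              print(entry.pw_name + " [" + tag + "]:")
              print(result.stdout.rstrip())

  if __name__ == "__main__":
      main()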

A server suddenly fails to boot after a kernel update. The previous sysadmin had installed a custom kernel module without documentation, and no one knows how it was built or why it was needed.

Backups quietly fail for months because the "backup script" was actually a bash script maintained manually on one machine. It was never checked into Git and stopped working when a path changed, but the failure never showed up in monitoring. The script also didn't return proper exit codes, so even tracking down the wrong path was nontrivial.
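
The failure would have surfaced much earlier if the backup had been wrapped in something that checks exit codes and leaves a status where monitoring actually looks. Below is a minimal sketch of that idea; the rsync invocation, the paths, and the JSON status file read by a monitoring check are assumptions for illustration, not the original script.

  #!/usr/bin/env python3
  """Backup wrapper that fails loudly and leaves a machine-readable status behind."""
  import datetime
  import json
  import subprocess
  import sys
  from pathlib import Path

  STATUS_FILE = Path("/var/lib/backup/last_run.json")  # read by a monitoring check (placeholder path)

  def main() -> int:
      # Illustrative backup command; source and target are placeholders.
      cmd = ["rsync", "-a", "--delete", "/srv/data/", "backup-host:/backups/data/"]
      result = subprocess.run(cmd, capture_output=True, text=True)

      status = {
          "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
          "command": cmd,
          "exit_code": result.returncode,
          "stderr_tail": result.stderr[-2000:],
      }
      STATUS_FILE.parent.mkdir(parents=True, exist_ok=True)
      STATUS_FILE.write_text(json.dumps(status, indent=2))

      if result.returncode != 0:
          # Propagate the failure so cron mail, systemd, or CI can see it.
          print("backup failed with exit code " + str(result.returncode), file=sys.stderr)
      return result.returncode

  if __name__ == "__main__":
      sys.exit(main())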

To avoid creating or depending on hero sysadmins:

  • Document everything in a shared system: Runbooks, architecture diagrams, onboarding notes.
  • Enforce version control for scripts, configs, and infrastructure-as-code.
  • Regularly test disaster recovery and system rebuild procedures.
  • Implement shared on-call rotations and knowledge transfer sessions.
  • Encourage collaborative ownership rather than gatekeeping.