I like Bash the way I like duct tape: it is fast, it is everywhere, and it can hold a surprising amount of weight. I also fear it, mostly because I have personally written scripts that looked fine at noon and became a minor disaster at 3AM.
These notes are the guardrails I keep re-learning in my lab. I am not claiming any of this is novel. The theme is: assume the script will run when you are tired, on a machine you have not touched in months, with inputs that are slightly wrong.
start with strict mode, but do not worship it
The usual baseline is set -euo pipefail. It is not magic, but it shifts the default from "keep going quietly" to "stop and tell me something is wrong". In my experience that is already a big reliability win.
Still, set -e has edge cases. A command that fails inside an if or while condition, on the left side of && or ||, or anywhere in a pipeline without pipefail, does not count as a failure. And patterns like local var=$(cmd) quietly mask the exit status of cmd behind the exit status of local.
I treat strict mode as a seatbelt, not as the roll cage.
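To make those edge cases concrete, here is a small sketch you can run. The demo function is invented for illustration; the point is which failures strict mode catches and which it silently consumes:

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. A failure tested by if/while/&&/|| does not trigger set -e;
#    the failing status is "consumed" by the condition.
if grep -q needle <<<"haystack"; then
  echo "found"
else
  echo "not found, and the script keeps going"
fi

# 2. Without pipefail, a pipeline only reports the LAST command's status.
#    "false | true" would sail past plain set -e; pipefail makes it count,
#    so here we check it explicitly instead of letting it kill the script.
if ! false | true; then
  echo "pipefail surfaced the failure"
fi

# 3. "local var=$(cmd)" hides cmd's failure behind local's exit status.
#    Declare first, assign second, so set -e can see the assignment fail.
demo() {
  local out
  out=$(echo ok)    # a real failure here would now stop the script
  echo "$out"
}
demo
```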
example: a reusable bash header
```bash
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

log() { printf '%s %s\n' "$(date -Is)" "$*"; }
die() { log "ERROR: $*"; exit 1; }
require_cmd() { command -v "$1" >/dev/null 2>&1 || die "missing command: $1"; }

# defensive: avoid running in unexpected working dirs
[[ -n "${PWD:-}" ]] || die "PWD is empty"
```
validate inputs like an untrusted API
Most of my overnight failures are not "Bash did something weird". They are "my variables were empty", "a path did not exist", or "a command returned a partial result". So I try to make argument parsing and validation boring and explicit.
Two rules that help me: never assume an environment variable exists, and never assume a path is safe. If the script would be dangerous with an empty string, check for empty strings.
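As a sketch of what "boring and explicit" looks like, here is a validation helper. The name validate_args and the BACKUP_ROOT variable are hypothetical, not part of any real script above:

```bash
#!/usr/bin/env bash
set -euo pipefail

# validate_args SRC DST: all the input checks in one place,
# nonzero return with a readable message on anything suspicious.
validate_args() {
  [[ $# -eq 2 ]] || { echo "usage: validate_args <src-dir> <dst-dir>" >&2; return 2; }
  local src=$1 dst=$2
  [[ -n "$src" && -n "$dst" ]] || { echo "empty argument" >&2; return 1; }
  [[ -d "$src" ]] || { echo "not a directory: $src" >&2; return 1; }
  [[ "$src" != "$dst" ]] || { echo "src and dst are the same" >&2; return 1; }
}

# Environment: ${VAR:?message} aborts loudly if VAR is unset OR empty.
# : "${BACKUP_ROOT:?BACKUP_ROOT must be set}"   # enable when the var is required

validate_args /tmp /var && echo "looks sane"
```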
the safest rm is the one you never run
I try to avoid deletion when I can. When I cannot, I wrap it in checks that feel redundant.
If you ever write rm -rf $DIR unquoted, you have already lost: word splitting and globbing can turn one path into several, and an unset variable silently becomes no argument at all. Quotes are necessary, and sanity checks are the real protection.
```bash
DIR="${DIR:-}"
[[ -n "$DIR" ]] || die "DIR is empty"
[[ "$DIR" != "/" ]] || die "refusing to operate on /"
[[ -d "$DIR" ]] || die "DIR not found: $DIR"
rm -rf -- "$DIR"
```
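One extra guard I sometimes layer on top of those checks: canonicalize the path and refuse to delete anything outside an allowed root. safe_rm is a hypothetical helper, a sketch rather than a vetted tool:

```bash
#!/usr/bin/env bash
set -euo pipefail

# safe_rm DIR ROOT: delete DIR only if it resolves to somewhere under ROOT.
# realpath collapses ../ tricks and symlink indirection before we compare.
safe_rm() {
  local dir root
  dir=$(realpath -- "$1") || return 1
  root=$(realpath -- "$2") || return 1
  [[ "$dir" != "$root" ]] || { echo "refusing to delete the root itself" >&2; return 1; }
  [[ "$dir" == "$root"/* ]] || { echo "refusing: $dir is outside $root" >&2; return 1; }
  rm -rf -- "$dir"
}

# usage sketch:
# safe_rm "$HOME/scratch/run-42" "$HOME/scratch"
```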
prefer arrays when building commands
If I find myself constructing a command in a string, I stop and rethink. Strings force you to re-implement shell parsing, which ends in pain. Arrays keep arguments safe, even when paths have spaces or weird characters.
```bash
cmd=(rsync -a --delete --numeric-ids --info=stats2,progress2 "$SRC/" "$DST/")
log "running: ${cmd[*]}"
"${cmd[@]}"
```
This is not only about safety. It also improves observability. Logging the expanded command tells me what the script actually did, not what I think it did.
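One caveat on logging with ${cmd[*]}: it flattens the array, so a path containing a space logs the same as two separate arguments. When I want a log line I can paste back into a shell, I reach for printf %q. A small demonstration, with a made-up path:

```bash
#!/usr/bin/env bash
set -euo pipefail

cmd=(touch "/tmp/grid logs/run 1.txt")

# "${cmd[*]}" flattens the array, so the space inside the path is
# indistinguishable from a space between arguments:
echo "flat: ${cmd[*]}"

# printf %q re-quotes each element, producing a line you can paste
# back into a shell and get the same argv:
printf -v pretty '%q ' "${cmd[@]}"
echo "quoted: $pretty"
```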
use traps to clean up, and to explain failure
In my lab, scripts often create temporary files, mount things, or acquire locks. When they fail mid-way, they can leave behind state that makes the next run behave differently. That is a subtle way to turn a one-off issue into a recurring one.
A simple trap can help, but again I try not to be fancy. I want two behaviors: remove temp files and print a message that includes the line number.
example: temp dir + error trap
```bash
tmpdir=""
cleanup() {
  local rc=$?
  # an if, not "&&": a failing && list in a trap can itself trip set -e
  if [[ -n "$tmpdir" ]]; then
    rm -rf -- "$tmpdir"
  fi
  exit "$rc"
}
on_err() {
  local rc=$?
  log "failed at line ${BASH_LINENO[0]} (rc=$rc)"
  exit "$rc"
}
set -o errtrace   # without this, the ERR trap does not fire inside functions
trap cleanup EXIT
trap on_err ERR
tmpdir=$(mktemp -d)
```
idempotency: make the second run safe
The difference between a "script" and an "automation primitive" is how it behaves when re-run. In the grid, reruns happen. A VM reboots. A disk fills. A network link flaps. If a script cannot be re-run safely, I treat it as a fragile one-shot.
Idempotency can be as simple as writing outputs to a new directory with a timestamp, or checking whether a step has already completed. For example, when building an artifact, I prefer "write to temp, then move into place" so partial results do not masquerade as success.
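The "write to temp, then move into place" step can be sketched like this. build_report and its output path are invented for the example, and the atomicity claim assumes the temp file and the target live on the same filesystem:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Build the artifact in a temp file next to the target, then mv into
# place. Within one filesystem, mv is a rename(2), which is atomic:
# readers see either the old file or the new one, never a half-write.
build_report() {
  local target=$1 tmp
  tmp=$(mktemp "${target}.XXXXXX")
  {
    echo "generated: demo"
    echo "host: $(hostname)"
  } >"$tmp"
  mv -- "$tmp" "$target"
}

# Re-running is safe: the finished file is simply replaced wholesale.
build_report /tmp/report.txt
```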
locks: the easiest race condition is the one you prevent
Cron jobs, systemd timers, and manual runs overlap more often than you expect. If the script touches shared state, I usually add a lock. I do not care whether it is the most elegant lock implementation. I care that it is obvious.
In a pinch I use flock because it is simple and it does not require extra daemons.
example: a lock with flock
```bash
LOCKFILE="/var/lock/grid-backup.lock"
exec 9>"$LOCKFILE"
flock -n 9 || die "another run is already in progress"
log "lock acquired"
```
what worked / what broke
what worked
- Making the script noisy. Logging inputs and major steps looks verbose, but it makes failures diagnosable.
- Failing early on missing prerequisites. A clean "missing command: jq" beats a half-written file.
- Idempotent outputs. Writing to a temp directory and moving into place reduced "partial success" bugs.
- Locks for shared jobs. Preventing overlap was easier than debugging overlap.
what broke
- Assuming strict mode would catch everything. It catches a lot, but not all logic bugs.
- Overusing clever Bash. When I get fancy with parsing, I regret it later.
- Not testing with weird paths. Spaces and newlines show up when you least want them.
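A cheap way to test with weird paths is to generate them on purpose. This harness is a sketch: it builds a scratch directory of hostile filenames, then walks it NUL-delimited, because NUL is the only separator a filename cannot contain:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Create a scratch dir full of hostile names: spaces, leading dashes,
# newlines, glob characters. If a script survives these, plain names are easy.
scratch=$(mktemp -d)
touch "$scratch/plain.txt"
touch "$scratch/with space.txt"
touch "$scratch/-starts-with-dash"
touch "$scratch/"$'new\nline'
touch "$scratch/*glob*"

# NUL-delimited iteration survives every legal filename.
count=0
while IFS= read -r -d '' f; do
  count=$((count + 1))
done < <(find "$scratch" -mindepth 1 -print0)

echo "saw $count files"
rm -rf -- "$scratch"
```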
closing thought
In my lab, "defensive Bash" is mostly about respecting future me. If a script can delete data, reboot a host, or change network state, it deserves a few extra lines. Those lines buy you sleep.