I like ZFS mostly for one reason: it turns “I think I broke it” into “I can probably roll it back”. That is a subtle but meaningful change in how I experiment. It is easier to be brave when you have a decent way to undo.
These notes are from my homelab, where ZFS is doing two jobs: hosting VM disks and hosting shared data (datasets for model artifacts, embeddings, logs, and backup staging). I am not running a massive storage array, so my lessons are small and practical.
snapshots are not backups, but they are still useful
I repeat this to myself because it is easy to forget: snapshots live on the same pool. If the pool dies, your snapshots die with it. That means snapshots do not replace a second copy.
SNAPSHOT TIMELINE                        BACKUP COPY
┌─────────────────────────────┐
│   t1 ──── t2 ──── t3        │          ┌───────> offsite
│   │       │       │         │          └───────> separate pool
└───│───────│───────│─────────┘
 (local) (local) (local)
What they do replace is panic. If I accidentally delete a directory or I apply a bad config change, snapshots give me a local rewind. For anything beyond local mistakes, I still want replication or a backup system.
dataset layout matters more than snapshot frequency
My first ZFS mistake was treating the whole pool like one filesystem. Snapshots are per dataset, so dataset boundaries are how you decide what rolls back together. I now try to split datasets along operational lines:
- configs that change slowly and should be snapshotted before upgrades,
- artifacts (models, wheels, containers) that are large and mostly append-only,
- state (databases, queues) that need more careful backup semantics,
- scratch that can be deleted aggressively.
This is not a perfect taxonomy, but it keeps me from doing one snapshot policy for everything.
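To make the split concrete, here is a tiny sketch of mapping dataset names to policies. The dataset names (tank/configs and so on) and the policy labels are made up for illustration; the point is that the policy decision hangs off the dataset boundary, not the pool.

```shell
#!/usr/bin/env bash
# Hypothetical mapping from dataset name to snapshot policy.
# Dataset names and policy labels are illustrative, not real.
policy_for() {
  case "$1" in
    */configs*)   echo "pre-upgrade-only" ;;  # snapshot before changes
    */artifacts*) echo "daily" ;;             # large, mostly append-only
    */state*)     echo "app-aware-backup" ;;  # needs quiesced backups
    */scratch*)   echo "none" ;;              # disposable, never snapshot
    *)            echo "daily" ;;             # conservative default
  esac
}

policy_for "tank/configs"     # -> pre-upgrade-only
policy_for "tank/ml/scratch"  # -> none
```

A real setup would read this mapping from a config file or ZFS user properties, but the case statement captures the idea: one policy per behavior class, not one policy per pool.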
naming: make it diffable and sortable
I use timestamped snapshot names with a short reason label. The reason label helps later when I am trying to remember why a snapshot exists. The timestamp keeps ordering obvious.
My pattern is something like: auto-YYYYMMDD-HHMM for scheduled jobs and
pre-upgrade-YYYYMMDD when I am about to do something risky.
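The useful property of zero-padded, big-endian timestamps is that, within one label, plain lexicographic sort is also chronological sort, so ordinary text tools order snapshots correctly without parsing dates. A quick check with made-up snapshot names:

```shell
#!/usr/bin/env bash
# Illustrative snapshot names. Within one label, lexicographic order
# equals chronological order because YYYYMMDD-HHMM is zero-padded.
printf '%s\n' \
  "tank/apps@auto-20240312-0900" \
  "tank/apps@auto-20240311-2300" \
  "tank/apps@auto-20240312-0830" | sort | tail -n 1
# prints the newest: tank/apps@auto-20240312-0900
```

Note the caveat: once labels differ (auto vs pre-upgrade), the label sorts first, so cross-label ordering should come from zfs list -s creation, not from the name.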
example: snapshot script sketch
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

# args: dataset (required), label (default: auto)
DATASET="${1:?dataset required}"
LABEL="${2:-auto}"

# zero-padded timestamp so snapshot names sort chronologically
TS=$(date +%Y%m%d-%H%M)
SNAP="${DATASET}@${LABEL}-${TS}"

zfs snapshot "$SNAP"
echo "created $SNAP"
pruning: keep less than you think, but keep it predictably
Snapshots feel free until they are not. In a small grid, I would rather keep a short, predictable retention policy than a huge history that slowly eats the pool.
I do not have a single best policy. Roughly, I like something like:
- hourly snapshots for a day on active datasets,
- daily snapshots for a week,
- weekly snapshots for a month,
- and then a hard stop.
The exact numbers are less important than the existence of a cap. If you do not choose a cap, time will choose one for you.
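The capping logic itself is simple enough to sketch in pure shell: given an oldest-first list of snapshots, keep the newest N and print the rest as pruning candidates. The names and the cap here are illustrative; a real script would feed this from zfs list and pipe the candidates to zfs destroy.

```shell
#!/usr/bin/env bash
set -euo pipefail
KEEP=2  # illustrative cap; real policies would keep more

# Oldest-first snapshot list (made-up names). In a real script this
# would come from: zfs list -H -t snapshot -o name -s creation -d 1 "$DATASET"
snaps="tank/apps@auto-20240310-0900
tank/apps@auto-20240311-0900
tank/apps@auto-20240312-0900
tank/apps@auto-20240313-0900"

total=$(printf '%s\n' "$snaps" | wc -l)

# Everything except the newest $KEEP is a pruning candidate.
printf '%s\n' "$snaps" | head -n "$((total - KEEP))"
# (a real script would run: zfs destroy "$snap" on each printed line)
```

Keeping the destroy step out of the loop until you have eyeballed the candidate list is a cheap safety habit.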
send/recv: replication that feels tangible
The first time I successfully used zfs send and zfs recv, it clicked for me that ZFS can be a
real replication tool. It is not always the right tool, and it can be dangerous if you do not understand
what it overwrites, but it is powerful.
In my lab I use it in a conservative way: I replicate important datasets to a second box on a schedule. I assume the second box can also fail, but at least it is a different failure domain.
example: incremental send sketch
#!/usr/bin/env bash
# example: send the newest snapshot incrementally
set -euo pipefail
SRC="tank/apps"
DST="backup/apps"

# the newest snapshot that already exists (and that the target should
# already have from the previous run); -d 1 avoids descending into children
prev=$(zfs list -H -t snapshot -o name -s creation -d 1 "$SRC" | tail -n 1)

# create a new snapshot for this run
now="${SRC}@rep-$(date +%Y%m%d-%H%M)"
zfs snapshot "$now"

# send everything between prev and now; assumes prev exists on $DST
zfs send -I "$prev" "$now" | zfs recv -u "$DST"
I treat this as a sketch, not a drop-in script. The important part is understanding what -I does (it sends the new snapshot plus every intermediate snapshot since the one you name, and the receiver must already have that starting snapshot) and testing on a non-critical dataset first. Replication is a sharp tool.
restore drills: the only honest test
I do not trust a snapshot policy until I have used it to restore something. In a lab, the easiest drill is to create a temporary clone, mount it somewhere, and check the files. If you can do that calmly, you can probably recover during a real incident.
For VM datasets in Proxmox, restores can involve the UI or the CLI. Either way, I want at least one practiced path written down.
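The checking step of the drill can be as boring as a recursive diff between the live data and the mounted clone. A sketch, using temp directories as stand-ins for the real mountpoints (in an actual drill, live and drill would be the dataset's mountpoint and the clone's):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-ins for the live dataset mountpoint and the mounted clone.
live=$(mktemp -d)
drill=$(mktemp -d)
echo "model-v3" > "$live/manifest.txt"
cp "$live/manifest.txt" "$drill/manifest.txt"

# The actual drill check: does the restored copy match the original?
if diff -r "$live" "$drill" >/dev/null; then
  status="drill OK: clone matches"
else
  status="drill FAILED: investigate before trusting this path"
fi
echo "$status"

rm -rf "$live" "$drill"
```

For large artifact datasets, checksumming a sample of files is a reasonable substitute for a full recursive diff.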
what worked / what broke
what worked
- Splitting datasets by behavior. Backups got clearer and rollbacks got less scary.
- Snapshot before upgrades. It reduced the risk of “I should not have done that”.
- Short retention with a hard cap. The pool stays healthy and predictable.
- Replication to a second box. It is not perfect, but it is a real second copy.
what broke
- Letting snapshots grow without pruning. Eventually something fills, and you notice too late.
- Assuming send/recv is safe by default. It can overwrite state if you are sloppy.
- Snapshotting everything the same way. Databases and logs do not behave like model artifacts.
closing thought
ZFS does not remove risk, but it changes the shape of it. In the grid, I am willing to trade a little complexity for the ability to roll back. The price is that I have to be disciplined about datasets and pruning.