Proxmox notes: small habits that make the cluster feel calm

I like Proxmox because it lets me keep my grid in one place without pretending I am running a full enterprise virtualization platform. It is not perfect, and I have learned to be careful about upgrades, storage decisions, and "just click around" configuration.

These are my running notes. They are aimed at the recurring problems in my lab: rebuilding a node, cloning an environment for an experiment, and not losing data when I inevitably change my mind. I am writing from a small-cluster perspective: a few machines, a mix of SSDs and spinning rust, and a desire for predictable behavior more than maximum density.

naming and inventory: boring details that prevent confusion

The first thing I do on a new Proxmox install is commit to a naming scheme. It sounds cosmetic, but it affects your ability to read logs and to reason about backups. I try to keep:

  • Hostnames short and stable (grid-01, grid-02, etc.).
  • VM IDs grouped by role (100s for infra, 200s for experiments, 300s for throwaways).
  • VM names descriptive enough to be useful during an incident (rag-api-01, ingest-worker).

None of this is enforced by Proxmox. The payoff is that the question "what is 217?" comes up less often.

VM ID RANGES
═══════════════════════════════════════
100s  │ infra (DB, API, proxy, etc.)
200s  │ experiments (testing, one-offs)
300s  │ throwaways (can be nuked anytime)
═══════════════════════════════════════

storage: choose boring defaults, then iterate

Storage is where Proxmox stops being a UI and becomes a set of real tradeoffs. In my lab, I am usually choosing between local ZFS, local LVM-thin, and some network storage. I have a soft preference for local ZFS for VM disks when the host has enough RAM.

The reason is not performance. It is operational clarity. ZFS snapshots are visible, sendable, and testable. When I mess up a change, the "go back" button is more real.

That said, I do not assume ZFS solves everything. It can amplify mistakes if you are casual about dataset layout and quotas. It also changes how you think about ARC and memory pressure, so I leave real headroom rather than let the cache and the guests compete for RAM.
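
example: dataset layout and an ARC cap

A minimal sketch of what "leaving headroom" looks like in practice. The pool name rpool (the Proxmox installer default), the dataset names, and the quota and ARC values are all assumptions; adjust them to your hardware.

```shell
# sketch: one dataset per role, with quotas, so one runaway
# experiment cannot eat the whole pool (names and sizes are assumptions)
zfs create -o quota=200G rpool/vm-infra
zfs create -o quota=100G rpool/vm-experiments

# cap the ARC so the cache and the VMs stop fighting for RAM
# (8 GiB here, expressed in bytes; applies at next boot)
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u
```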

templates and cloud-init: less clicking, fewer snowflakes

A VM template is the closest thing I have to a lab "standard image". I keep one Debian template with cloud-init enabled and a minimal package set. When I want a new VM, I clone the template, set the network, and let cloud-init do the rest.

It is tempting to build a perfect golden image. I do not. I just want something that boots, updates cleanly, and can be provisioned by a small script. The more you bake in, the more you forget what is inside.
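
example: building the template in the first place

For completeness, a sketch of how a template like this gets built. The Debian cloud image URL, the storage name local-zfs, and the template ID 9000 are assumptions; the qm subcommands are the standard ones.

```shell
# sketch: turn a Debian cloud image into a cloud-init template
wget https://cloud.debian.org/images/cloud/bookworm/latest/debian-12-genericcloud-amd64.qcow2

qm create 9000 --name debian-template --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0
qm importdisk 9000 debian-12-genericcloud-amd64.qcow2 local-zfs
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-zfs:vm-9000-disk-0
qm set 9000 --ide2 local-zfs:cloudinit --boot order=scsi0
qm set 9000 --serial0 socket --vga serial0

# once converted, the template can only be cloned, which is the point
qm template 9000
```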

example: quick VM creation via qm

# example: create a VM from a cloud-init template and start it
# treat as a sketch; adjust the template ID, storage names, and bridge
VMID=220
qm clone 9000 "$VMID" --name "exp-llm-${VMID}" --full 1

qm set "$VMID" --memory 8192 --cores 4
qm set "$VMID" --net0 virtio,bridge=vmbr0
qm set "$VMID" --ipconfig0 ip=dhcp
# note: the option is --sshkeys (plural) and expects a file path
qm set "$VMID" --ciuser neo --sshkeys ~/.ssh/id_ed25519.pub

qm start "$VMID"

networking: keep it simple until you cannot

I have burned time on fancy network topologies that did not pay back. For most of my workloads, I need only:

  • a stable management network for Proxmox itself,
  • a bridge for VMs to reach the LAN,
  • optional segmentation for experiments that might behave badly.

My heuristic is: if I cannot explain the topology in two sentences, I am making it too clever. Complexity is fine when it solves a clear problem. It is costly when it is there "just in case".
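
example: the whole topology in one file

A sketch of /etc/network/interfaces for the setup above. The NIC name eno1 and the addresses are assumptions; the single-bridge layout is the Proxmox installer default.

```
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

# one bridge for management and VMs: explainable in one sentence
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.10/24
    gateway 192.168.1.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    # optional segmentation: make the bridge VLAN-aware and tag
    # experimental VMs instead of adding more bridges
    # bridge-vlan-aware yes
    # bridge-vids 2-4094
```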

backups: test restores, not just backup jobs

Proxmox Backup Server (PBS) is a good fit for a small grid. The key operational point is that a green backup job is not the same thing as a working restore. I try to do a practice restore after any major change: new PBS version, new storage target, or a new VM layout.

I also keep a clear policy for "how far back do I care". For experiments, daily backups for a week might be enough. For services, I keep longer retention. The point is to decide, not to accrete an infinite history by accident.
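
example: retention as part of the backup job

One way to make that decision explicit is to encode it in the job itself. The storage name pbs and the exact keep counts are assumptions; --prune-backups is a standard vzdump option.

```shell
# experiments: a week of dailies, nothing more
vzdump 220 --storage pbs --mode snapshot --prune-backups keep-daily=7

# services: longer retention that thins out over time
vzdump 101 --storage pbs --mode snapshot \
  --prune-backups keep-daily=7,keep-weekly=4,keep-monthly=6
```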

example: a restore drill checklist in text

# example: a simple runbook snippet I keep in the repo
# 1) pick a non-critical VM
# 2) restore to a new VMID
# 3) boot in an isolated network
# 4) verify: ssh works, service starts, data exists
# 5) delete the restored copy (or keep it as a cold spare)
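
Roughly the same drill as commands. The backup volume ID, the throwaway VMID 9220, and the isolated bridge vmbr1 are assumptions; I check the actual volume ID with pvesm list on the PBS storage first.

```shell
# 2) restore to a new VMID instead of touching the original
qmrestore pbs:backup/vm/220/2024-06-01T02:00:00Z 9220 --storage local-zfs

# 3) boot on an isolated bridge so it cannot collide with the original
qm set 9220 --net0 virtio,bridge=vmbr1
qm start 9220

# 4) verify by hand: ssh in, check the service, spot-check the data

# 5) clean up once the drill is done
qm stop 9220
qm destroy 9220
```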

patching and upgrades: slow down and read the room

Proxmox upgrades are usually fine. The failures I have seen are correlated with "I upgraded everything right before I needed it". I now avoid upgrading the cluster on the same day as a big model deployment, a storage migration, or any other high-risk change.

My current habit is to upgrade one node, wait a day, then continue. That is not optimal for speed, but it is optimal for noticing breakage while I still remember what changed.
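
example: the per-node upgrade loop

What "one node, then wait" looks like on the box itself. The commands are standard Proxmox/Debian ones; the log path is just my habit.

```shell
# check cluster health before touching anything
pvecm status

# upgrade this node only
apt update
apt full-upgrade

# record what changed so tomorrow-me can correlate breakage
pveversion -v > /root/upgrade-$(date +%F).log

# reboot if the kernel changed, then watch for a day before the next node
```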

what worked / what broke

what worked

  • VM templates with cloud-init. It reduced "hand-built snowflake" VMs.
  • Practice restores. A restore drill turned vague confidence into specific confidence.
  • Slow upgrades. One node at a time made failures smaller and easier to diagnose.
  • Simple network defaults. Fewer clever bridges meant fewer mysterious outages.

what broke

  • Underestimating storage decisions. Changing storage later is possible, but it is never free.
  • Too many experimental VMs. Sprawl makes backups and monitoring feel noisy.
  • Assuming the UI is the truth. The UI is helpful, but I still keep key commands documented.

closing thought

Proxmox is at its best when you treat it as an operational tool, not as a toy. When I keep things boring and test restores, the grid feels calm. When I chase cleverness, I pay for it later.