I do not have a formal “yearly review” process for the grid, but I do like doing an inventory now and then. It helps me separate what actually matters from what was just a temporary fascination. The LLM world changes quickly. The parts that keep services alive change slowly.
This is a snapshot of my current stack. It is not a recommendation. It is more like: if you dropped me into a new rack with a weekend to rebuild, these are the decisions I would probably make again.
the stable foundation: boring wins
The pieces that stayed are the ones that fail predictably. That is not a joke. When you operate a small system without a full-time team, predictability is worth more than elegance.
┌─────────────────────────────────────┐
│ EXPERIMENTAL LAYER │
│ (LLMs, quants, agents, tooling) │
├─────────────────────────────────────┤
│ STABLE LAYER (boring) │
│ Linux ── systemd ── nginx ── ZFS │
│ (changes only when it breaks) │
└─────────────────────────────────────┘
- Linux as the base layer. I keep it simple and avoid exotic tweaks.
- systemd for long-running services and timers. Not because it is trendy, but because it is everywhere.
- nginx as the front door. It is boring, fast, and I understand how to debug it.
- Proxmox for virtualization. It lets me compartmentalize experiments without losing visibility.
- ZFS where it makes sense. Snapshots changed how I take risks.
what changed: the LLM layer is still moving
The biggest changes were around local inference and tooling. I have not found a “final” runner. I have found workable runners, each with their own sharp edges. The practical improvement this year was that I started treating model runners like services: versioned configs, explicit ports, systemd units, and logs that I can grep.
I also got stricter about moving artifacts around. If a model is going to be used for more than a day, I want it on predictable storage, with a manifest. Otherwise I end up re-downloading a few hundred gigabytes because I forgot what I did.
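As a sketch of what I mean by a manifest: a small script that records the file name, size, content hash, and runner version next to the artifact. The function name and file layout here are my own invention, not a real tool.

```python
import hashlib
import json
from pathlib import Path

def write_manifest(model_path: str, runner_version: str) -> dict:
    """Record enough about a model artifact to reproduce it later."""
    p = Path(model_path)
    digest = hashlib.sha256()
    # hash in 1 MiB chunks so multi-gigabyte files do not blow up memory
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    manifest = {
        "file": p.name,
        "bytes": p.stat().st_size,
        "sha256": digest.hexdigest(),
        "runner_version": runner_version,
    }
    # manifest lives next to the artifact: model.gguf -> model.gguf.manifest.json
    Path(str(p) + ".manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

The hash is the important part: when behavior changes, it tells me whether the weights changed or only my config did.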
service shape: small APIs with hard edges
Early on I tried to build a single “AI server” that did everything. It became an untestable blob. What works better in my lab is splitting responsibilities: one service that runs inference, one service that does retrieval, and thin glue in front.
This also makes scaling less dramatic. If inference is overloaded, I can move just that unit to a GPU box. If retrieval is slow, I can change the index without touching the model runner.
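The "thin glue" can be almost nothing. A sketch, with the retrieval and inference services stubbed out as plain callables (the names and prompt shape are illustrative, not from any real framework):

```python
from typing import Callable, List

def make_rag_handler(
    retrieve: Callable[[str], List[str]],
    infer: Callable[[str], str],
) -> Callable[[str], str]:
    """Thin glue: ask one service for context, then prompt the other."""
    def handle(question: str) -> str:
        context = "\n".join(retrieve(question))
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        return infer(prompt)
    return handle
```

Because retrieval and inference are just two functions here, swapping the index or moving the model runner to another box changes one callable, not the glue.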
example: a systemd unit for a local inference server
[Unit]
Description=Local LLM API
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=neo
WorkingDirectory=/srv/llm-api
ExecStart=/usr/local/bin/llm-server --host 127.0.0.1 --port 8080
Restart=on-failure
RestartSec=2
# basic resource hygiene
MemoryMax=16G
CPUQuota=250%

[Install]
WantedBy=multi-user.target
This is intentionally plain. In a small environment, “plain” is easier to carry between machines.
the front door: nginx stays because it does not surprise me
I continue to use nginx as the edge. I like being able to terminate TLS, set timeouts, and route to different backends without rewriting my apps. I also like having a single place to add basic controls like request size limits.
example: a tiny reverse proxy block
server {
    listen 443 ssl;
    server_name llm.grid.lan;

    # cert paths are examples; nginx will not start a `listen 443 ssl`
    # block without a certificate and key
    ssl_certificate     /etc/nginx/certs/llm.grid.lan.crt;
    ssl_certificate_key /etc/nginx/certs/llm.grid.lan.key;

    location /v1/ {
        # the trailing slash strips the /v1/ prefix before proxying
        proxy_pass http://127.0.0.1:8080/;
        proxy_http_version 1.1;
        proxy_read_timeout 300s;
        client_max_body_size 2m;
    }
}
observability: enough logs to explain the weird stuff
I do not run a full metrics stack everywhere. What I do run is a consistent logging story. If something fails, I want to be able to answer:
- which host ran the request,
- which model and runner version handled it,
- how long it took (roughly),
- and whether it failed in a predictable way.
For a homelab, journald plus a few structured log lines is often enough. It is not glamorous, but it turns “the model got weird” into a debuggable event.
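What a structured log line looks like in practice: one JSON object per request, emitted through the normal logger so journald picks it up. The field names are my own convention, not a standard.

```python
import json
import logging
import time

def log_request(logger: logging.Logger, host: str, model: str,
                runner_version: str, duration_s: float, status: str) -> None:
    """One grep-able JSON line per request: who ran it, with what, how long."""
    logger.info(json.dumps({
        "ts": time.time(),
        "host": host,
        "model": model,
        "runner": runner_version,
        "duration_s": round(duration_s, 3),
        "status": status,
    }, sort_keys=True))
```

With this in place, `journalctl -u llm-api | grep '"status": "timeout"'`-style queries answer most of the questions above without a metrics stack.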
habits that mattered more than tools
The most useful stack improvements were not new software. They were habits. I now treat changes as a sequence of small, reversible moves. If a change is not reversible, I make it smaller.
- Write down the command that creates an environment or starts a service.
- Do restore drills on backups and snapshots. Green checks are not proof.
- Keep manifests for models and configs so I can reproduce behavior later.
- Do upgrades slowly and avoid stacking multiple risky changes in one day.
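The restore-drill habit is the one I would automate first. A minimal sketch of the idea, using a plain tar archive as a stand-in for whatever backup mechanism you actually run (for ZFS the equivalent is receiving a snapshot somewhere else and diffing):

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def restore_drill(source_dir: str) -> bool:
    """Prove a backup restores: archive, extract elsewhere, compare hashes."""
    src = Path(source_dir)
    with tempfile.TemporaryDirectory() as work:
        # "backup": archive the directory contents
        archive = shutil.make_archive(f"{work}/backup", "gztar", root_dir=src)
        # "restore": extract into a fresh location
        restored = Path(work) / "restored"
        shutil.unpack_archive(archive, restored)
        # verify every file survived the round trip byte-for-byte
        for f in src.rglob("*"):
            if f.is_file():
                twin = restored / f.relative_to(src)
                if (not twin.exists()
                        or hashlib.sha256(f.read_bytes()).digest()
                        != hashlib.sha256(twin.read_bytes()).digest()):
                    return False
    return True
```

The point is not the tooling; it is that "the backup job succeeded" and "I can get my files back" are different claims, and only the drill tests the second one.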
what worked / what broke
what worked
- systemd for everything long-running. It standardized restarts and logging.
- nginx as a stable edge. Reverse proxying made backend swaps less painful.
- ZFS snapshots + replication. It made experiments feel safer.
- smaller services. Splitting inference from retrieval improved debugging.
what broke
- Chasing runner performance too early. The bottleneck was often disk or cold-start, not tokens per second.
- Letting artifacts drift. Untracked model versions created confusing “it changed” failures.
- Assuming defaults are safe. Timeouts and memory limits matter when you run big models.
closing thought
The grid is less about having the newest tools and more about having tools you can operate. My 2026 goal is to keep the base boring and to treat the LLM layer as experimental, with guardrails.