“Which runner should I use?” is the LLM version of “which distro should I install.” The honest answer is “it depends.” The useful answer is: it depends on what kind of pain you prefer.
In my lab I bounce between three camps: Ollama, llama.cpp, and vLLM. They overlap, but they are not the same tool with different stickers. They make different tradeoffs about packaging, control, and what they assume about your hardware.
my baseline assumptions
I am writing this from a homelab perspective, not a “we have a fleet and a platform team” perspective. The grid is a handful of machines, mostly Linux, with a mix of CPU-only and GPU-capable boxes. I care about:
- Time-to-first-token and general responsiveness more than peak tokens per second.
- Operability: logs, restarts, updates, and not inventing a new snowflake every week.
- Portability: I want to move a model and a config between machines without a ritual.
- Good enough throughput: one user most of the time, a few users sometimes.
Ollama: frictionless, until you want to look under the hood
Ollama is what I reach for when I want a model running in minutes. In my lab it is the “developer convenience layer.” It feels like an app store: pull a model, run it, and do not think too hard about file formats.
The tradeoff is that it is opinionated. That is fine when you are exploring. It becomes annoying when you want to control exactly how the server starts, how it binds, or how models are stored.
My practical guidance: use Ollama when you are still answering basic questions like “Do I even like this model class?” and “Is this workflow worth automating?” Once you have a stable workload, you might outgrow it.
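That exploration loop is genuinely short. These are the standard Ollama subcommands; the model name here is a placeholder, not a recommendation.

```shell
# example: the Ollama exploration loop
# "llama3" is a placeholder model name; use whatever you are evaluating
ollama pull llama3        # fetch the model into Ollama's store
ollama run llama3         # interactive session to get a feel for it
ollama list               # see which models are on disk
```

The catch, as above: where that store lives and how the server binds are Ollama's decisions, not yours, until you go digging for environment variables.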
llama.cpp: portable power tools
llama.cpp is the one I trust when I want something to keep working. It is not pretty, and it is not trying to be a whole platform. It is just an engine that runs and keeps getting better.
In my lab, the killer feature is the ecosystem around GGUF. If I can put a model in a single file, move that file around, and run it on weird hardware, I can treat the model like any other artifact.
The tradeoff is that you own the knobs. That includes the knobs that can hurt you. If you set context too large, or offload too aggressively, you will discover the boundaries of your RAM and VRAM in a very personal way.
example: a boring llama.cpp server command
This is the shape of what I run in my lab. The flags vary by machine. Treat this as an example, not gospel.
# example: start a llama.cpp HTTP server
# adjust --threads / -ngl / --ctx-size to your box
./llama-server \
-m /models/your-instruct-model.gguf \
--host 0.0.0.0 \
--port 8080 \
--threads 8 \
--ctx-size 8192 \
--temp 0.7
The boring part is the point. When the process is predictable, I can wrap it with systemd, monitor it, and treat it like infrastructure.
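That systemd wrapping can be a single small unit file. A minimal sketch; the paths, flags, and unit name are assumptions for illustration:

```ini
# example: /etc/systemd/system/llama-server.service (sketch; paths are assumptions)
[Unit]
Description=llama.cpp server
After=network-online.target

[Service]
ExecStart=/opt/llama/llama-server -m /models/your-instruct-model.gguf --host 0.0.0.0 --port 8080
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

The point of putting it in a file is the same as the point of the boring command: it is text I can version, diff, and copy to the next machine.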
vLLM: throughput and concurrency, with a bigger surface area
vLLM feels like the “serious server” option. It is really good when you have multiple concurrent requests and you care about keeping GPUs busy. If your problem is queueing and batching, vLLM is usually in the conversation.
In my lab, I use it when I am doing more batch-like tasks or when I am stress-testing a workflow that will later have multiple clients. The cost is that it pulls you into a larger Python stack, with more dependencies, and more ways to get yourself into environment trouble.
I do not treat that as a moral failing. It is just a different shape of operational risk. With vLLM I spend less time tuning low-level flags, and more time keeping the runtime environment stable.
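For comparison of shapes, starting vLLM's OpenAI-compatible server looks roughly like this. The model name is a placeholder, and flags vary between vLLM versions, so treat this as a sketch:

```shell
# example: vLLM OpenAI-compatible server (model name is a placeholder)
# assumes vllm is installed in the active virtualenv
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --max-model-len 8192
```

Notice what is missing compared to the llama.cpp command: no thread counts, no offload layers. The tuning surface moved from process flags into the Python environment around the process.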
how I pick (a practical decision tree)
- I want quick experiments: Ollama.
- I want a service I can operate and move around: llama.cpp.
- I want throughput with concurrency on GPU: vLLM.
the hidden axis: packaging and upgrades
Runner choice is not just speed. It is also how your future self upgrades it. In my lab, upgrades happen at bad times, usually right after I fix something else. So I care about:
- Where the model files live and whether I can back them up sanely.
- How configs are represented (files I can version control beat UI toggles).
- How I roll back if a new build changes behavior.
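The rollback story can be as dumb as versioned directories plus a `current` symlink. This is a self-contained sketch using a temp directory; in real life the base path would be wherever you keep builds, and the build names here are made up:

```shell
# example: versioned build dirs + a "current" symlink for cheap rollback
# /tmp/llama-demo and the b4000/b4100 names are illustrative
set -euo pipefail

base=/tmp/llama-demo
mkdir -p "$base/b4000" "$base/b4100"

# "upgrade": atomically repoint current at the new build
ln -sfn "$base/b4100" "$base/current"

# behavior regressed? "rollback": repoint at the old build
ln -sfn "$base/b4000" "$base/current"

readlink "$base/current"
```

A systemd unit that launches `current/llama-server` never needs editing; upgrades and rollbacks are one `ln -sfn` plus a restart.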
example: a tiny “runner sanity” script
This is a pattern I keep around. It is not sophisticated, but it catches “the service is dead” and “the service is alive but slow” in a way that is easy to repeat.
#!/usr/bin/env bash
# example: quick health + latency check
# replace URL with your runner endpoint
set -euo pipefail
URL="http://127.0.0.1:8080/"
printf "checking %s\n" "$URL"
/usr/bin/time -f "elapsed=%es" curl -fsS -o /dev/null "$URL"
No fake precision. I am not claiming this is a benchmark. It is just a guardrail so I notice when an update makes the service feel like it is wading through mud.
what worked / what broke
what worked
- Standardizing on one model directory in my lab made everything easier. I can swap runners without re-downloading artifacts.
- Keeping commands in text (README + systemd unit files) beat “I swear I clicked the right toggles.”
- Measuring TTFT informally (same prompt, same model, same machine) is enough to spot regressions.
what broke
- Over-tuning too early. I wasted time chasing theoretical speed while my actual bottleneck was slow storage and cold starts.
- Assuming GPU offload “just works”. In practice it depends on memory headroom and the model. When it fails, it fails loudly.
- Environment drift when experimenting with Python stacks. vLLM is powerful, but it rewards you for using clean virtual environments and documenting what you did.
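The "clean virtual environments" part is cheap to do properly. A sketch of the pattern I mean; the venv path and file names are assumptions:

```shell
# example: one venv per vLLM experiment, with the result recorded
python3 -m venv ~/venvs/vllm-exp1
. ~/venvs/vllm-exp1/bin/activate
pip install vllm                  # or pin a known-good version explicitly
pip freeze > requirements.lock    # record exactly what you ended up with
```

The `requirements.lock` file is the part that pays off later: when an experiment works, you know what it ran on, and when it breaks after an upgrade, you have something to diff.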
closing thoughts
If you want a single takeaway: pick one runner that matches your current constraints, learn its failure modes, and automate the boring parts. You can always switch later. In my lab, I do.