GGUF Quantization Notes (without lying to myself)

Quantization is why local models feel practical in 2025. It is also why it is easy to accidentally gaslight yourself. You download three GGUF files with similar names, try them for ten minutes each, and then declare one “better” based on vibes and one lucky answer.

These are my lab notes for staying honest. I am not doing research here. I am trying to run assistants locally without turning the rig into a space heater, while keeping quality at a level that does not annoy me.

what I mean by “quantization” in practice

In the GGUF world, the quant is basically the compression setting for the model weights. Lower-bit quants are smaller and often faster to run, but you are trading away some fidelity. The trade is not linear and it is not always obvious.
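To make "smaller" concrete, a back-of-envelope size estimate helps. This is a sketch, not a formula from any spec: the 8B parameter count and ~4.5 effective bits per weight are assumptions I picked for illustration, and real GGUF files run somewhat larger because of metadata and per-block scale factors.

```shell
# rough GGUF footprint: parameters * bits-per-weight / 8 bytes.
# 8e9 params and 4.5 bits/weight are illustrative assumptions;
# real files carry extra overhead (metadata, block scales).
awk -v params=8e9 -v bpw=4.5 'BEGIN { printf "%.1f GB\n", params * bpw / 8 / 1e9 }'
# prints "4.5 GB"
```

The same arithmetic explains why dropping from an ~8-bit to an ~4-bit quant roughly halves the file: the bits-per-weight term dominates everything else.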

In my lab I think about it as three buckets:

  • “Lean” quants (roughly Q3 to Q4-ish): when I need it to fit or I need speed.
  • “Daily driver” quants (roughly Q4 to Q5-ish): usually the sweet spot for home hardware.
  • “Almost full” (roughly Q6 to Q8-ish): when I care more about quality than footprint.

I am intentionally vague because names and schemes evolve. My actual method is to test a small set under identical conditions and keep the smallest one that stays stable.

my selection workflow (fast, repeatable, not fancy)

I pick a model family I like, then I pull two or three quants of that same model. I try to keep the rest of the variables fixed: same runner, same context size, same prompt, and ideally the same machine.

The first pass is not about correctness. It is about failure modes. Does it load reliably? Does it OOM? Does it become a latency monster after a few turns? Those are the questions that matter in an ops notebook.
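Before any of those questions, I do a dumb pre-flight pass: what is actually in the directory, and how big is each file. A quant that does not fit in RAM/VRAM will fail no matter what flags I tweak. This is a sketch; `list_quants` and the `MODEL_DIR` path are my own conventions, not part of any tool.

```shell
#!/usr/bin/env bash
# sketch: pre-flight pass over a model directory before blaming any quant.
# MODEL_DIR is a hypothetical path; point it at wherever your .gguf files live.
set -u

list_quants() {
  local dir="$1" f
  for f in "$dir"/*.gguf; do
    # unmatched glob leaves the literal pattern; bail out politely
    [ -e "$f" ] || { echo "no .gguf files in $dir"; return 0; }
    # size first: if it will not fit, nothing downstream matters
    printf '%s\t%s\n' "$(du -h "$f" | cut -f1)" "$f"
  done
}

list_quants "${MODEL_DIR:-/models}"
```

Ten seconds of `du` output up front has saved me from more than one "why is it swapping" debugging session.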

example: keeping a “model manifest” file

I keep a tiny text file next to my models. It is not a database. It is just enough to remember what I downloaded and why.

# example: /models/MANIFEST.txt
# (this is intentionally human-readable)

# family: some-8b-instruct
# runner: llama.cpp (HTTP server)
# notes: daily driver for tool-ish prompts

some-8b-instruct-q4.gguf  # smaller, good speed, slightly more hallucination in my tests
some-8b-instruct-q5.gguf  # better instruction-following, still fits comfortably
some-8b-instruct-q8.gguf  # quality test, slower load, not always worth it

The big win is not the file itself. The win is that future me does not have to re-learn why there are seven similarly named blobs in that directory.

what I measure (roughly) so I do not chase ghosts

I avoid pretending I have a perfect benchmark suite. I do not. I have a life. What I do have is a handful of repeatable checks that tell me if a quant is usable.

1) time-to-first-token and cold start

TTFT is what your brain perceives as “snappy.” Cold start matters because I reboot boxes, I redeploy services, and sometimes I accidentally kill the process.

In my lab, a quant that loads reliably and gives a decent TTFT usually beats a heavier quant that is slightly smarter but feels sluggish.
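Measuring TTFT does not need tooling. curl's write-out variable `%{time_starttransfer}` reports seconds until the first response byte, which for a streaming endpoint is close enough to "time to first token" for lab notes. This is a sketch assuming an OpenAI-compatible server at a local URL; the endpoint, model name, and `measure_ttft` helper are all my own placeholders.

```shell
#!/usr/bin/env bash
# sketch: approximate TTFT with curl's built-in timers.
# API_BASE and "local-model" are assumptions; adjust for your server.
set -u
API_BASE="${API_BASE:-http://127.0.0.1:8080/v1}"

measure_ttft() {
  # time_starttransfer = seconds from request start to first response byte
  curl -sS -o /dev/null -w '%{time_starttransfer}\n' \
    -H 'Content-Type: application/json' \
    -d '{"model":"local-model","stream":true,"max_tokens":8,"messages":[{"role":"user","content":"Say hi."}]}' \
    "$API_BASE/chat/completions"
}

# guard: a missing server should not masquerade as a slow quant
measure_ttft || echo "server not reachable at $API_BASE"
```

Run it a few times per quant; the first run after a restart is your cold-start number, the rest are your steady-state TTFT.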

2) “can it follow a boring instruction”

I test a few stable prompts: parse a config, write a cautious shell snippet, summarize notes. Not because those are glamorous, but because they are my actual workload.

example: a tiny prompt harness

This is the kind of thing I run from a terminal. It is not a benchmark. It is a sanity loop.

# example: send the same prompt to a local OpenAI-compatible endpoint
# adjust the URL and model name to your server
#!/usr/bin/env bash
set -euo pipefail

API_BASE="http://127.0.0.1:8080/v1"
MODEL="local-model"

# unquoted heredoc delimiter so $MODEL expands into the request body
curl -sS "$API_BASE/chat/completions" \
  -H 'Content-Type: application/json' \
  -d @- <<JSON
{
  "model": "$MODEL",
  "messages": [
    {"role": "system", "content": "You are a cautious assistant."},
    {"role": "user", "content": "Explain, briefly, how to rotate logs for a small service. Include 3 bullet points."}
  ],
  "temperature": 0.4
}
JSON

The content of the prompt is not sacred. The stability is. If a quant starts ignoring constraints, repeating itself, or turning everything into a manifesto, I notice quickly.
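Some of those constraints can be checked mechanically instead of by squinting. The prompt above asks for 3 bullet points, and counting bullets in the reply is one line of shell. `count_bullets` is a made-up helper for this sketch, not part of any API, and it only catches the crudest failures; it still beats re-reading twenty replies by hand.

```shell
#!/usr/bin/env bash
# sketch: mechanical check for the "3 bullet points" constraint.
# count_bullets is a hypothetical helper, not part of any tool.
set -u

count_bullets() {
  # count lines that start like a bullet; grep -c exits 1 on zero matches,
  # so swallow that to keep the count usable under `set -e`
  grep -c '^[-*•] ' <<<"$1" || true
}

reply=$'- rotate daily\n- compress old archives\n- keep seven days of history'
echo "bullets: $(count_bullets "$reply")"
# prints "bullets: 3"
```

A quant that drifts from "3 bullets" to "an essay with headings" fails this check long before I would have consciously noticed the drift.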

common gotchas I keep relearning

quantization is not the only variable

Runners have flags. Context size changes memory pressure. Offload settings change the bottleneck. It is easy to blame “the quant” when the real issue is “I changed three knobs at once.”
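One cheap defense against the three-knobs problem is to record the exact invocation next to the model, so "the quant got worse" can be checked against "I also changed the context size." This is a sketch of my own convention; `log_run` and the MANIFEST path are not features of llama.cpp or any runner.

```shell
#!/usr/bin/env bash
# sketch: wrap the server launch so the exact flags land in the manifest.
# log_run and the MANIFEST path are personal conventions, nothing standard.
set -u
MANIFEST="${MANIFEST:-/models/MANIFEST.txt}"

log_run() {
  # record timestamp + full command line, then actually run it
  printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$MANIFEST"
  "$@"
}

# usage (hypothetical flags):
# log_run ./llama-server -m /models/some-8b-instruct-q5.gguf -c 8192 --port 8080
```

Now every run leaves a one-line audit trail, and comparing two sessions starts with `diff`-ing two manifest lines instead of with memory.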

small models can feel smarter when they are fast

This sounds weird, but it shows up in my workflow. If the model responds quickly, I iterate more. More iterations can beat slightly higher quality per response. In that sense, speed is a quality multiplier.

what worked / what broke

what worked

  • Keeping two “approved” quants per model family: one daily driver and one quality check.
  • Testing with my real prompts instead of random trivia questions.
  • Documenting the runner flags alongside the model, so I can reproduce behavior later.

what broke

  • Benchmark brain: optimizing for tokens-per-second while my problem was cold start latency.
  • Assuming “bigger is always better”: heavier quants sometimes reduced stability in my setup.
  • Mixing changes: swapping runner builds and quants at the same time made debugging miserable.

the rule I try to follow

Pick the smallest quant that reliably follows instructions for your workload. If you cannot tell the difference in your day-to-day tasks, you probably do not need the heavier file. And if you can tell the difference, keep the heavier file for the jobs that actually benefit.