Prompt hygiene sounds like a soft skill. In my lab it became an ops discipline. The moment I started using local models for real workflows, prompts stopped being fun one-off spells and started being configuration.
Configuration needs structure, versioning, and guardrails. Otherwise you get the worst kind of failure: the system still runs, but it behaves differently each day, and you cannot explain why.
the problems I was trying to solve
- drift: prompts accrete random instructions over time.
- leakage: secrets and personal data end up in logs or prompt files.
- fragility: a small wording change breaks a workflow that depends on structured output.
- overfitting: prompts become tailored to one model and fall apart when you switch quants or runners.
None of these are glamorous. They are the reasons people give up and say "LLMs are unreliable." Sometimes the model is unreliable. Sometimes the prompt is.
[Input] → [ REDACT → VALIDATE → STRUCTURE ] → [Clean Output]
              ^          ^           ^
           removes     checks      forces
           secrets     types    consistency
rule 1: treat prompts like code
In my lab, prompts live in a repo. That means diff, history, and the ability to roll back. It also means I can answer "when did this start happening?"
I also keep prompts small. Big prompts feel powerful until you have to debug them. When something goes wrong, you want a small surface area.
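The payoff is concrete: ordinary git commands answer "when did this start happening?". Here is a self-contained demo you can run anywhere (the paths are illustrative; in real life only the last two commands matter, run inside your actual prompt repo):

```shell
# example: a throwaway demo of why prompts-in-git pays off
repo=$(mktemp -d); cd "$repo"; git init -q
mkdir -p prompts/tasks
printf 'Summarize the notes.\n' > prompts/tasks/summarize_notes.txt
git add -A; git -c user.name=demo -c user.email=d@e commit -qm "initial prompt"
printf 'Summarize the notes. Use bullet points.\n' > prompts/tasks/summarize_notes.txt
git add -A; git -c user.name=demo -c user.email=d@e commit -qm "add bullet rule"

# "when did this start happening?" -> the history of one prompt file
git log --oneline --follow -- prompts/tasks/summarize_notes.txt

# roll the prompt back to its previous known-good state
git checkout HEAD~1 -- prompts/tasks/summarize_notes.txt
```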
rule 2: separate system policy from task instructions
I keep a stable "policy" prompt that rarely changes: tone, safety, citation rules, tool calling rules. Then I keep task prompts that are specific to an action: summarize, extract, generate.
This separation makes it easier to switch models. When a model is stubborn, I adjust the task prompt instead of rewriting my whole worldview.
example: a simple prompt directory layout
# example: prompt repo structure
prompts/
  policy/
    base_system.txt
    tool_calling_rules.txt
  tasks/
    summarize_notes.txt
    extract_todos.txt
    write_bash_function.txt
  templates/
    json_schema_response.txt
  README.md
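In practice I assemble the final system prompt from these pieces at call time, so the combined text never lives in a file that can drift. A sketch, assuming the layout above (the concatenation order is my convention, not a requirement):

```shell
# example: assemble policy + task into one prompt at call time
# (paths match the layout above; the order is a convention, not a rule)
assemble_prompt() {
  local task="$1"
  cat prompts/policy/base_system.txt \
      prompts/policy/tool_calling_rules.txt \
      "prompts/tasks/${task}.txt"
}

# usage: assemble_prompt summarize_notes | <your model runner>
```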
rule 3: bake in failure behavior
The worst output is a confident wrong answer. My prompts explicitly allow the model to say "I do not know" and to ask a question. If a workflow requires structured output, I tell the model what to do when it cannot comply.
For example: if it cannot produce valid JSON, it should return an error object. That way my downstream code can handle it.
example: "valid JSON or error object" instruction
# example: structured-output guardrail
Return ONLY valid JSON.
If you cannot comply, return:
{"error": "invalid_input", "message": "Explain what is missing."}
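On the consuming side, this contract is cheap to enforce. A minimal sketch with jq, assuming the error shape above (the exit-code mapping is my choice):

```shell
# example: enforce the "valid JSON or error object" contract downstream
# exit codes: 0 = usable JSON, 1 = model sent the error object, 2 = garbage
check_output() {
  local out="$1"
  if ! printf '%s' "$out" | jq . >/dev/null 2>&1; then
    return 2   # not JSON at all: retry or log, never parse
  elif printf '%s' "$out" | jq -e 'type == "object" and has("error")' >/dev/null; then
    return 1   # the model followed the fallback path
  fi
  return 0
}
```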
rule 4: build redaction into the workflow
Local-first does not mean "no secrets ever touch the prompt." It means you have more control. In my lab I still redact tokens, API keys, and hostnames in shared logs. I do not want to train myself to paste secrets everywhere.
A very boring pattern works: pre-process input through a redaction step before it hits the model, and store the unredacted original separately if you really need it.
example: a rough redaction filter
# example: redact common secret shapes (not perfect)
# use as a pre-processing step before sending text to a model
sed -E \
  -e 's/(AKIA[0-9A-Z]{16})/[REDACTED_AWS_KEY]/g' \
  -e 's/([A-Za-z0-9_\-]{20,})/[REDACTED_TOKEN]/g' \
  -e 's/(password\s*[:=]\s*)[^\s]+/\1[REDACTED]/gi'
This does not catch everything. It is still worth doing because it catches the easy stuff.
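The other half of the pattern, keeping the unredacted original, can be a single tee before the filter. A sketch (the private/ path and the filter argument are illustrative):

```shell
# example: keep the raw input privately, send only the redacted copy onward
# ($@ is whatever redaction filter you use, e.g. the sed script above)
capture_and_redact() {
  mkdir -p private
  local stamp; stamp=$(date +%s)
  tee "private/raw-$stamp.txt" | "$@"
}

# usage: capture_and_redact ./redact.sh < notes.txt | <your model runner>
```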
the boring knobs: temperature, top-p, and why I pin them
Prompt hygiene is not only words. If you change sampling settings between runs, your prompt becomes a moving target. In my lab I keep a default profile for "automation" work: relatively low temperature, and I only change it when I am explicitly exploring.
The practical rule is: if the output feeds a script, pin the generation settings in code. Otherwise you will eventually debug a failure that is just randomness.
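Pinning the settings can be as small as one function that builds every automation payload, so no script hand-types a temperature. A sketch for an OpenAI-compatible endpoint (the model name and exact values are illustrative; pick yours once and stop touching them):

```shell
# example: one pinned "automation" profile every scripted call goes through
automation_payload() {
  jq -n --arg p "$1" '{
    model: "local-model",
    messages: [{role: "user", content: $p}],
    temperature: 0.2,   # low: this output feeds scripts
    top_p: 0.9,
    seed: 42            # pin the seed too, if your runner honors it
  }'
}

# usage: automation_payload "Extract the todos." \
#   | curl -sS "$API" -H 'Content-Type: application/json' -d @-
```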
regression tests for prompts (yes, really)
I do not have a full test suite, but I do keep a small directory of "golden prompts." They are short and they cover my core tasks. When I swap a model, or change a prompt template, I run the golden set and scan the output.
The goal is not perfect equality. The goal is catching obvious breakage: missing fields, format changes, or the model suddenly refusing to follow constraints.
example: a tiny golden-prompt runner
# example: run prompts and save outputs for diffing
set -euo pipefail
API="http://127.0.0.1:8080/v1/chat/completions"
OUTDIR="./out"
mkdir -p "$OUTDIR"
for p in prompts/golden/*.txt; do
  base=$(basename "$p" .txt)
  # build the payload with jq so quotes and newlines in the prompt survive
  jq -n --rawfile prompt "$p" \
    '{model: "local-model", messages: [{role: "user", content: $prompt}], temperature: 0.3}' \
    | curl -sS "$API" -H 'Content-Type: application/json' -d @- \
    > "$OUTDIR/$base.json"
  printf 'ran %s\n' "$base"
done
diffing outputs without lying to yourself
The trap is eyeballing raw JSON and deciding it is "basically the same." I try to normalize the output into something diff-friendly, usually plain text with the fields I care about. Even if the model is non-deterministic, the structure should stay stable.
# example: normalize chat completions into just the assistant text
# (adjust the jq path to match your API; out_before/ and out_after/
#  hold the runs you want to compare)
set -euo pipefail
for run in out_before out_after; do
  mkdir -p "$run.txt"
  for f in "$run"/*.json; do
    jq -r '.choices[0].message.content' "$f" > "$run.txt/$(basename "${f%.json}").txt"
  done
done
# diff only the normalized text, never the raw JSON
git diff --no-index -- out_before.txt/ out_after.txt/ || true
When the diff explodes, I take it as a signal to shrink the prompt, not to add more rules. In practice, the smallest prompts are easier to keep stable across model swaps.
what worked / what broke
what worked
- Versioned prompts with diffs and rollbacks.
- Small, composable prompts instead of one giant instruction block.
- Explicit failure paths so downstream code can react sanely.
what broke
- Prompt bloat: every time something failed, I wanted to add another rule.
- Model-specific hacks: fixes that only worked for one quant and made others worse.
- Ignoring input quality: garbage input makes prompt hygiene irrelevant.
closing thought
Prompt hygiene is mostly about respecting your own time. If a prompt powers a real workflow, treat it like a config file. Small, auditable, and boring. The grid runs on boring.