The Evaluation Mindset: Designing for Model Testing
Why prompt evaluation is as critical as generation. This section introduces levels of difficulty, failure design, YAML contracts, reproducibility scaffolds, and evaluation goals for structured LLM testing.
Created with my Doodle into Midjourney
AI helped shape the words here, but the ideas, experiments, and code are 100% human-made. This is part 3 of a series on prompt engineering—turning intuition into engineering.
Post 3: The Evaluation Mindset: Designing for Model Testing
Creating prompts is only half the battle. The real work starts when you use those prompts to test large language models consistently across versions, architectures, domains, and production scenarios.
This section explains the philosophy and structure of the evaluation system. Prompts are treated not as isolated curiosities but as benchmarkable, reproducible, domain-specific test cases. A prompt that works once is interesting. A prompt that holds under variation is engineering.
Prompt Evaluation vs Prompt Generation
Prompt generation is the art of writing useful input for an LLM.
Prompt evaluation is the science of determining how reliably that input produces expected, interpretable, and bounded output.
- Prompt generation asks: How do I get the best possible result?
- Prompt evaluation asks: What does this prompt reveal about the model that answered it?
Key distinction: Evaluation is not creative. It is empirical.
Our evaluation prompts are:
- Domain-specific
- Instructionally scoped
- Outcome-aware
- Repeatable
And they must reveal a measurable property of model behavior.
Levels Are Lenses
Every prompt type spans 10 levels of difficulty. This is not just a gradient of hard-to-harder. Each band has a different shape of difficulty that targets different cognitive behaviors.
Complexity Bands
Level band | Characteristics | Example capabilities | Metadata tags |
---|---|---|---|
1–3 | Lexical substitution, factual recall, short completions | Vocabulary swap, fact lookup, direct mapping | complexityLevel=low |
4–6 | Multi-part responses, reasoning with examples, ordering logic | Summarization plus ordering, structured transforms | complexityLevel=mid |
7–9 | Compound instructions, synthesis across formats, goal planning | Multi-step reasoning, cross-evidence synthesis | complexityLevel=high |
10 | Long-context navigation, multi-turn inference, cross-domain | State tracking across 2K+ tokens, dialog memory | complexityLevel=max |
Each example is tagged with `complexityLevel`, `estimatedTokens`, and `wordCount` so downstream tools can match test difficulty to model capacity.
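For example, a downstream tool can read that frontmatter and select only the prompts in a given complexity band. A minimal sketch, assuming prompts are stored as Markdown files with `---`-delimited YAML frontmatter; the directory layout and helper names are illustrative, not part of the repo:

```python
from pathlib import Path
import yaml  # PyYAML

def read_frontmatter(path: Path) -> dict:
    """Return the YAML frontmatter of a prompt file as a dict."""
    text = path.read_text(encoding="utf-8")
    if not text.startswith("---"):
        return {}
    # Frontmatter sits between the first two "---" markers.
    _, header, _ = text.split("---", 2)
    return yaml.safe_load(header) or {}

def select_band(prompt_dir: str, low: int, high: int) -> list[Path]:
    """Keep prompts whose complexityLevel falls inside a band, e.g. 4-6."""
    selected = []
    for path in Path(prompt_dir).glob("*.md"):
        meta = read_frontmatter(path)
        if low <= int(meta.get("complexityLevel", 0)) <= high:
            selected.append(path)
    return selected

# Example: pull the mid band (levels 4-6) from a hypothetical fewshot folder.
mid_band = select_band("prompts/fewshot", 4, 6)
```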
Designing for Failure (On Purpose)
Not every prompt should succeed. Some are designed to be too hard, too ambiguous, or too long for the target context. These stress tests surface brittleness early.
Failure Modes I Observe
Failure type | What it looks like | What it reveals |
---|---|---|
Overconfidence | Fabricated citations and false certainty | Hallucination tendency and calibration |
Inconsistency | Different answers to minor rephrasings | Stability of reasoning under paraphrase |
Control break | Ignores system or role instructions | Instruction adherence and prompt loyalty |
Ambiguity collapse | Arbitrary choice among valid readings | Disambiguation strategy and bias |
Why it matters: Robustness is not about always getting it right. It is about knowing what breaks, when, and why.
Prompts that break cleanly teach us something. Prompts that break inconsistently get flagged for redesign.
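The inconsistency row is the cheapest of these to automate: ask the same question in minor rewordings and measure agreement. A minimal sketch; `call_model` is a placeholder for whatever client you use, and exact match after normalization is the crudest possible agreement measure:

```python
# Stress test for the "inconsistency" failure mode: rephrase, rerun, compare.
def call_model(prompt: str) -> str:
    """Placeholder for the real model client."""
    raise NotImplementedError

def normalize(answer: str) -> str:
    # Collapse whitespace and case so trivial formatting differences don't count.
    return " ".join(answer.lower().split())

def paraphrase_stability(variants: list[str]) -> float:
    """Share of paraphrases whose answer matches the first variant's answer."""
    answers = [normalize(call_model(v)) for v in variants]
    reference = answers[0]
    return sum(a == reference for a in answers) / len(answers)

# Example: three rewordings of one factual question.
variants = [
    "List the three primary colors.",
    "Name the three primary colours.",
    "What are the 3 primary colors?",
]
# stability = paraphrase_stability(variants)  # 1.0 means fully stable
```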
YAML: The Evaluation Contract
Each prompt file carries full YAML frontmatter. This is not decoration. It is the machine-readable contract that binds the test to its evaluation logic.
Example YAML
title: Few-shot Prompt Example 7
model: gpt-4o
promptType: fewshot
tokenizer: cl100k_base
complexityLevel: 7
estimatedTokens: 405
wordCount: 288
promptFormat: chatML
useCase: internalTesting
domains: [programming, math]
tags: [fewshot, reasoning, promptEvaluation]
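Because the frontmatter is a contract, it should be validated before any test run. A minimal sketch, assuming PyYAML; the required-field list mirrors the example above, and the range check is an illustrative rule rather than the harness's real logic:

```python
import yaml

# Fields the evaluation harness depends on; taken from the example frontmatter.
REQUIRED = {"title", "model", "promptType", "tokenizer",
            "complexityLevel", "estimatedTokens", "wordCount",
            "promptFormat", "useCase", "domains", "tags"}

def validate_contract(frontmatter: str) -> list[str]:
    """Return a list of problems; an empty list means the contract is usable."""
    meta = yaml.safe_load(frontmatter) or {}
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - meta.keys())]
    if "complexityLevel" in meta and not 1 <= int(meta["complexityLevel"]) <= 10:
        problems.append("complexityLevel must be between 1 and 10")
    return problems
```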
How the contract is used
Capability | YAML fields that drive it | Outcome |
---|---|---|
Select tests by type | `promptType`, `promptFormat` | Batch runs for few-shot, CoT, ToT, ReAct |
Match tokenizer constraints | `tokenizer`, `estimatedTokens` | Avoids boundary truncation and unfair failures |
Target a domain | `domains`, `tags` | Narrow evaluation to Python, logic, finance, and more |
Scope difficulty | `complexityLevel`, `wordCount` | Aligns test shape to model context size |
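The tokenizer row is the easiest to automate. A minimal sketch, assuming tiktoken is available; the 256-token completion reserve and the 10% drift warning are my own choices, not part of the contract:

```python
import tiktoken

def fits_context(prompt_body: str, meta: dict, context_window: int) -> bool:
    """Re-count the prompt with the declared tokenizer and refuse tests that would truncate."""
    enc = tiktoken.get_encoding(meta.get("tokenizer", "cl100k_base"))
    actual = len(enc.encode(prompt_body))
    declared = int(meta.get("estimatedTokens", actual))
    # Flag drift between the declared estimate and the real count.
    if abs(actual - declared) > 0.1 * actual:
        print(f"warning: estimatedTokens looks stale ({declared} vs {actual})")
    # Reserve headroom for the completion so the boundary itself is never the failure.
    return actual < context_window - 256
```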
Prompt Structure for Reproducibility
Each example includes:
- A fully written, domain-specific prompt
- Escalation from level 1 to level 10 with fresh logic at each level
- No template clones
- Token and word estimates
- Explicit format marker: instruction, chatML, CoT, multi-turn
Required Structure Fields
Field | Purpose | Notes |
---|---|---|
`promptBody` | The actual instruction or chat turns | Stored in source file, referenced by YAML |
`expectedShape` | Format constraints for output | Freeform, JSON schema, table, or bullet list |
`assertions` | Checks for evaluation | Regex, JSON schema, keyphrase, or custom function |
`scaffoldType` | Structural frame used | System, role, contextual, or none |
Rule: You cannot version a prompt if you do not version its intent and expected shape.
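To make those fields actionable, something has to execute them. A minimal sketch of an assertion runner; only regex and keyphrase checks are shown, JSON-schema validation would slot in the same way, and the assertion dict format (`type`, `pattern`, `phrase`) is an assumption for illustration:

```python
import json
import re

def check_output(output: str, expected_shape: str, assertions: list[dict]) -> list[str]:
    """Return a list of failed checks; empty means the output passes."""
    failures = []
    # Shape check: here only "json" is handled; other shapes would add branches.
    if expected_shape == "json":
        try:
            json.loads(output)
        except json.JSONDecodeError:
            failures.append("output is not valid JSON")
    # Assertion checks from the prompt's contract.
    for a in assertions:
        if a["type"] == "regex" and not re.search(a["pattern"], output):
            failures.append(f"regex not matched: {a['pattern']}")
        elif a["type"] == "keyphrase" and a["phrase"].lower() not in output.lower():
            failures.append(f"keyphrase missing: {a['phrase']}")
    return failures

# Example contract fragment this would consume:
# expectedShape: json
# assertions:
#   - {type: regex, pattern: '"total":\s*\d+'}
#   - {type: keyphrase, phrase: fibonacci}
```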
Run Parameters and Controls
Determinism and fairness require explicit run controls. I pin defaults, then vary them intentionally.
Parameter | Default | When I vary it | Why it matters |
---|---|---|---|
`temperature` | 0.2 | Creativity or exploration tests | Affects diversity and stability |
`top_p` | 1.0 | Sampling studies | Changes tail behavior of token choice |
`max_tokens` | task-specific | Boundary tests | Truncation detection |
`seed` | fixed per batch | Reproducibility checks | Repeatable comparisons |
`tools_enabled` | false by default | ReAct and agent tests | Tool hallucination control |
`stop_sequences` | explicit per task | Format compliance tests | Prevents run-on outputs |
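Pinning is easier to enforce when the defaults live in one object. A minimal sketch assuming a Python harness; the `RunParams` dataclass and the sweep below are illustrative wrappers, with the defaults taken from the table:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RunParams:
    temperature: float = 0.2
    top_p: float = 1.0
    max_tokens: int | None = None        # set per task
    seed: int = 1234                      # fixed per batch
    tools_enabled: bool = False
    stop_sequences: tuple[str, ...] = ()  # explicit per task

BASELINE = RunParams()

# Vary one knob at a time and keep everything else pinned,
# so any behavior change is attributable to that knob.
creativity_sweep = [replace(BASELINE, temperature=t) for t in (0.2, 0.7, 1.0)]
```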
Scoring and Instrumentation
Outputs are scored with both automated and human-in-the-loop checks. The goal is to measure shape, truth, and cost.
Metric | Definition | Measurement |
---|---|---|
Accuracy | Correctness against ground truth | Unit tests, golden answers, or programmatic checks |
Completeness | Presence of all required parts | Keyphrase or section checks |
Format compliance | Output matches expected shape | JSON schema or regex assertions |
Token efficiency | Useful output per token | Tokens to reach valid answer |
Hallucination risk | Unsupported claims or citations | Heuristic flags and human spot checks |
Determinism | Stability across reruns | Variance over fixed seeds |
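Determinism is the simplest of these to instrument. A minimal sketch that treats byte-identical reruns as the stability measure; `call_model` is a placeholder for the real client, and exact match is deliberately strict:

```python
from collections import Counter

def call_model(prompt: str, seed: int) -> str:
    """Placeholder for the real model client."""
    raise NotImplementedError

def determinism(prompt: str, seed: int = 1234, runs: int = 5) -> float:
    """Fraction of reruns that reproduce the most common output."""
    outputs = [call_model(prompt, seed=seed) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs
```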
Filtering by Evaluation Goals
Prompts are tagged so that test runs can be goal-driven rather than monolithic.
Goal | Tags and domain filters | What it reveals |
---|---|---|
Reasoning depth | `chainofthought`, `planning`, `react` | Whether multi-step logic holds or collapses |
Code-specific ability | `python`, `programming`, `fewshot` | Syntax adherence and runtime plausibility |
Tool use simulation | `toolformer`, `rag`, `react` | Reliability of API-style calls and tool paths |
Resistance to hallucination | `selfask`, `instruction`, `cot` | Confidence calibration and citation quality |
Structured output | `multi-turn`, `chatML`, `reflective` | Schema control and output discipline |
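In code, goal-driven selection is just a mapping from goals to the tag sets above plus an intersection test. A sketch, assuming each prompt's metadata has already been loaded from its YAML frontmatter:

```python
# The GOALS mapping restates the table; prompt metadata comes from the frontmatter.
GOALS = {
    "reasoning_depth": {"chainofthought", "planning", "react"},
    "code_ability": {"python", "programming", "fewshot"},
    "tool_use": {"toolformer", "rag", "react"},
    "hallucination_resistance": {"selfask", "instruction", "cot"},
    "structured_output": {"multi-turn", "chatML", "reflective"},
}

def select_by_goal(prompts: list[dict], goal: str) -> list[dict]:
    """Keep prompts whose tags or domains overlap the goal's filter set."""
    wanted = GOALS[goal]
    return [
        p for p in prompts
        if wanted & (set(p.get("tags", [])) | set(p.get("domains", [])))
    ]
```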
Anti-Patterns I Avoid
I deliberately avoid practices that inflate scores without improving reliability.
Anti-pattern | Why I avoid it | What I do instead |
---|---|---|
Synthetic cloning of templates | Looks like coverage without new behavior | Write new logic per level and type |
Blind token padding | Longer prompts that do nothing better | Short, targeted scaffolds with measurable effect |
Placeholder injection | Meta-prompts about prompts | Real instructions with explicit assertions |
Unpinned run params | Hidden variance across runs | Fixed seeds and documented parameter sweeps |
What Comes Next
With the evaluation mindset and contract in place, the next section turns technical. I move from prompt-level design to token-aware evaluation: measuring token cost, tokenizer alignment across model families, and performance at boundary lengths.
Because even the strongest prompt fails if it silently breaks at 2,049 tokens.