Start Here — The Test Harness Pattern
The Problem We're Solving
You ask an LLM to do multi-step work: analyze a codebase, find bottlenecks, suggest optimizations, verify the suggestions, and write a report.
What usually happens:
- Agent writes a wall of text
- You can't verify whether it actually did the work
- No structured output (just narrative)
- If something's wrong, you can't tell where it failed
- Hard to reuse or automate
What you wish happened:
- Agent produces evidence of each step
- You can verify that the work was done
- Outputs are machine-readable
- Failures are clear and debuggable
- System is repeatable and composable
The Solution: Test Harness Pattern
Think of it as a CI/CD pipeline for LLM work: an orchestrator creates an isolated session directory, runs the phases in order, validates each phase's output before moving on, and aggregates the final results.
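A minimal sketch of that orchestrator loop in Python. It assumes each phase is a plain callable that receives the session directory and returns the phase's JSON contract as a dict; in a real harness, each callable would prompt an LLM agent with that phase's instructions. `run_workflow`, `PhaseFailed`, `make_stub`, `final.json`, and the five phase names (borrowed from the task in the opening example) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical phase signature: receives the session directory, returns the phase's JSON contract.
Phase = Callable[[Path], dict]


class PhaseFailed(Exception):
    """Raised when a phase's result does not pass the orchestrator's checks."""


def run_workflow(phases: list[tuple[str, Phase]], session_dir: Path) -> dict:
    """Run phases in order, stop at the first bad result, then aggregate the summaries."""
    summaries = {}
    for name, phase in phases:
        result = phase(session_dir)
        # Minimal gate: the phase must report completion and point at a report that exists.
        if result.get("status") != "complete":
            raise PhaseFailed(f"{name}: status is {result.get('status')!r}")
        if not Path(result.get("report_path") or "").is_file():
            raise PhaseFailed(f"{name}: report file not found")
        summaries[name] = result.get("phase_summary", {})
    # Aggregate every phase summary into one final, machine-readable result.
    final = {"phases": summaries}
    (session_dir / "final.json").write_text(json.dumps(final, indent=2))
    return final


def make_stub(name: str) -> Phase:
    """Stand-in for an LLM-backed phase: writes a report file and returns the contract."""
    def phase(session_dir: Path) -> dict:
        report = session_dir / f"{name}.md"
        report.write_text(f"# {name}\n")
        return {"status": "complete", "report_path": str(report), "phase_summary": {"phase": name}}
    return phase


if __name__ == "__main__":
    names = ["analyze", "find_bottlenecks", "suggest_optimizations", "verify", "report"]
    with tempfile.TemporaryDirectory() as tmp:
        final = run_workflow([(n, make_stub(n)) for n in names], Path(tmp))
        print(list(final["phases"]))
        # ['analyze', 'find_bottlenecks', 'suggest_optimizations', 'verify', 'report']
```

Raising an exception at the first failed check is what makes the run stop early: later phases never build on output that was not verified.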
Key Concepts
1. Session Directory
Every workflow run gets its own isolated folder containing timestamped reports. That folder is your evidence trail: you can inspect exactly what was done.
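One way to create such a folder, sketched in Python under the assumption that sessions live under a common root directory and are named after the workflow plus a UTC timestamp. `create_session_dir`, the naming scheme, and the `reports/` subfolder are illustrative choices, not part of the reference implementation.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def create_session_dir(root: Path, workflow: str) -> Path:
    """Create a fresh, timestamped directory for a single workflow run."""
    # UTC timestamp with microseconds, so back-to-back runs are unlikely to collide.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S_%fZ")
    session = root / f"{workflow}_{stamp}"
    # exist_ok=False refuses to reuse a folder, so evidence from two runs never mixes.
    session.mkdir(parents=True, exist_ok=False)
    # Hypothetical layout: each phase writes its report into reports/.
    (session / "reports").mkdir()
    return session


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        session = create_session_dir(Path(tmp) / "sessions", "perf_audit")
        # Name format: perf_audit_<YYYYMMDDTHHMMSS>_<microseconds>Z
        print(session.name)
```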
2. Strict JSON Contracts
Every phase MUST return a JSON object with three fields: status, report_path, and phase_summary. Because the output is machine-readable, the orchestrator can validate it programmatically instead of interpreting narrative text.
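The contract can be made explicit in code. This Python sketch parses an agent's raw reply into a typed record and rejects anything that is malformed or missing a field. `PhaseResult` and `parse_phase_output` are hypothetical names, and requiring `phase_summary` to be a JSON object is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseResult:
    """The three fields every phase must return."""
    status: str
    report_path: str
    phase_summary: dict


def parse_phase_output(raw: str) -> PhaseResult:
    """Parse an agent's raw reply, rejecting anything that breaks the contract."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("phase output must be a JSON object")
    missing = {"status", "report_path", "phase_summary"} - set(data)
    if missing:
        raise ValueError(f"missing contract fields: {sorted(missing)}")
    if not isinstance(data["phase_summary"], dict):
        raise ValueError("phase_summary must be a JSON object")
    return PhaseResult(data["status"], data["report_path"], data["phase_summary"])


if __name__ == "__main__":
    raw = '{"status": "complete", "report_path": "reports/phase1.md", "phase_summary": {"hotspots": 3}}'
    print(parse_phase_output(raw))
```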
3. Validation Gates
After each phase, the orchestrator runs four checks: Is the JSON valid? Is the status "complete"? Does the report file exist on disk? Are the required summary keys present? If any check fails, the orchestrator stops the workflow immediately.
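The four checks translate directly into code. In this Python sketch, `validate_phase` collects every failure so the error message says exactly where a phase went wrong, and `gate` turns any failure into an exception that halts the run. Both function names, `GateFailed`, and the per-phase `required_keys` argument are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


class GateFailed(RuntimeError):
    """Raised to halt the workflow when a phase fails validation."""


def validate_phase(raw: str, required_keys: set[str]) -> list[str]:
    """Run the four gate checks and return every failure found (empty list means pass)."""
    # Check 1: is the JSON valid?
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    errors = []
    # Check 2: is the status "complete"?
    if data.get("status") != "complete":
        errors.append(f"status is {data.get('status')!r}, expected 'complete'")
    # Check 3: does the report file exist on disk?
    report = data.get("report_path")
    if not isinstance(report, str) or not Path(report).is_file():
        errors.append(f"report file not found: {report!r}")
    # Check 4: are the required summary keys present?
    summary = data.get("phase_summary")
    if not isinstance(summary, dict):
        errors.append("phase_summary is missing or not an object")
    else:
        missing = required_keys - set(summary)
        if missing:
            errors.append(f"phase_summary is missing keys: {sorted(missing)}")
    return errors


def gate(phase_name: str, raw: str, required_keys: set[str]) -> dict:
    """Return the parsed output if it passes; otherwise stop the workflow."""
    errors = validate_phase(raw, required_keys)
    if errors:
        raise GateFailed(f"[{phase_name}] " + "; ".join(errors))
    return json.loads(raw)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        report = Path(tmp) / "phase1.md"
        report.write_text("findings\n")
        good = json.dumps({"status": "complete", "report_path": str(report),
                           "phase_summary": {"hotspots": 3}})
        print(validate_phase(good, {"hotspots"}))  # []
        print(validate_phase('{"status": "partial"', {"hotspots"}))
```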
4. The Verification Phase
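This is what makes the pattern powerful: one phase runs an actual script instead of asking the LLM for more analysis, compares the script's output against the earlier phases' conclusions, and reports which conclusions were confirmed, which were revised, and what turned up unexpectedly. This is empirical validation: the results come from executing code, not from the model's own reasoning.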
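A sketch of the verification step for the performance task from the opening example, in Python. It assumes the earlier phases named a set of suspected bottleneck functions; the script times those functions for real with `timeit` and sorts each claim into confirmed, revised, or unexpected. The candidate functions, the 10 ms threshold, and the `measure`/`reconcile` helpers are hypothetical.

```python
import timeit
from typing import Callable


def measure(funcs: dict[str, Callable[[], object]], repeats: int = 5) -> dict[str, float]:
    """Actually run each candidate and keep its best wall-clock time in seconds."""
    # min() over several repeats filters out one-off noise such as other processes.
    return {name: min(timeit.repeat(fn, number=1, repeat=repeats)) for name, fn in funcs.items()}


def reconcile(suspected: set[str], timings: dict[str, float], threshold: float) -> dict[str, list[str]]:
    """Compare the earlier phases' suspected bottlenecks with what was measured."""
    measured_slow = {name for name, seconds in timings.items() if seconds >= threshold}
    return {
        "confirmed": sorted(suspected & measured_slow),   # predicted and measured slow
        "revised": sorted(suspected - measured_slow),     # predicted, but not borne out
        "unexpected": sorted(measured_slow - suspected),  # measured slow, never predicted
    }


if __name__ == "__main__":
    # Hypothetical candidates standing in for functions in the analyzed codebase.
    candidates = {
        "parse_config": lambda: sum(range(1_000)),
        "build_index": lambda: sorted(range(300_000), key=lambda x: -x),
        "render_page": lambda: "".join(str(i) for i in range(100)),
    }
    timings = measure(candidates)
    # In the harness, this set would come from the earlier phases' phase_summary fields.
    suspected = {"parse_config", "build_index"}
    print(reconcile(suspected, timings, threshold=0.01))
```

A fixed threshold is a deliberately crude rule chosen to keep the sketch short; comparing against a baseline run would be a more robust alternative.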
5. Reference Instruction Docs
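Each phase has a corresponding instruction file that tells the agent exactly what to do. Because every run uses the same instructions, each phase behaves consistently and returns output in the same shape, even though LLM responses are not strictly deterministic.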
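A Python sketch of how an orchestrator might turn those files into prompts, assuming one Markdown file per phase in an instructions directory. The file names, `PHASE_FILES`, `build_prompt`, and the wording of the contract reminder are hypothetical; the point is that the instruction text stays fixed and only the run-specific paths are filled in.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: one Markdown instruction file per phase, numbered in run order.
PHASE_FILES = {
    "analyze": "01_analyze.md",
    "find_bottlenecks": "02_find_bottlenecks.md",
    "suggest_optimizations": "03_suggest_optimizations.md",
    "verify": "04_verify.md",
    "report": "05_report.md",
}

CONTRACT_REMINDER = (
    'Reply with JSON only: {"status": "complete", "report_path": "<path>", '
    '"phase_summary": {...}}'
)


def build_prompt(phase: str, instructions_dir: Path, session_dir: Path) -> str:
    """Combine a phase's fixed instruction file with the run-specific details."""
    instructions = (instructions_dir / PHASE_FILES[phase]).read_text(encoding="utf-8")
    report_path = session_dir / "reports" / f"{phase}.md"
    # Only the session paths change between runs; the instructions themselves never do.
    return f"{instructions}\n\nWrite your report to: {report_path}\n{CONTRACT_REMINDER}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        docs = Path(tmp) / "instructions"
        docs.mkdir()
        (docs / "01_analyze.md").write_text("List every module and its responsibilities.\n", encoding="utf-8")
        print(build_prompt("analyze", docs, Path(tmp) / "session"))
```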
The Mental Model
Think of it as a scientific experiment protocol:
- Hypothesis phases (1-3): analyze and form conclusions
- Verification phase (4): run an experiment to test those conclusions
- Conclusion phase (5): synthesize the validated findings
The key insight: this pattern turns "LLM wrote some text" into "LLM executed a validated procedure with evidence and structured outputs." That's the difference between a chatbot and a production system.
What You'll Learn
By the end of this lab, you'll be able to:
- Understand the test harness pattern and why it's powerful
- Navigate the reference implementation
- Modify phases and reference docs for your needs
- Build your own multi-phase workflow from scratch
- Debug when phases fail or return invalid outputs
- Deploy production-ready workflows using this pattern
Full Guide on GitHub
The complete guide includes detailed diagrams, terminology tables, and extended examples: