Start Here — The Test Harness Pattern

Goal: Understand the mental model before diving into code

The Problem We're Solving

You ask an LLM to do multi-step work: analyze a codebase, find bottlenecks, suggest optimizations, verify the suggestions, and write a report.

What usually happens:

Agent writes a wall of text
You can't verify if it actually did the work
No structured output (just narrative)
If something's wrong, you can't tell where it failed
Hard to reuse or automate

What you wish happened:

Agent produces evidence of each step
You can verify work was done
Outputs are machine-readable
Failures are clear and debuggable
System is repeatable and composable

The Solution: Test Harness Pattern

Think of this like a CI/CD pipeline for LLM work. An orchestrator creates an isolated session directory, spawns phases in order, validates outputs before continuing, and aggregates final results.
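To make the orchestrator's job concrete, here is a minimal sketch of that loop. It assumes each phase is a plain callable that receives the session directory and returns raw JSON text; the names run_workflow and Phase are hypothetical, and a real phase would call an LLM agent rather than a local function.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# Hypothetical phase signature: receives the session directory, returns raw JSON text.
Phase = Callable[[Path], str]


def run_workflow(phases: List[Tuple[str, Phase]], base_dir: Path) -> Dict[str, dict]:
    """Run phases in order and stop at the first one that does not complete."""
    # mkdtemp creates a fresh, uniquely named folder for this run.
    session_dir = Path(tempfile.mkdtemp(prefix="session-", dir=str(base_dir)))
    results: Dict[str, dict] = {}
    for name, phase in phases:
        output = json.loads(phase(session_dir))  # invalid JSON raises here
        if not isinstance(output, dict) or output.get("status") != "complete":
            raise RuntimeError(f"phase {name!r} did not report status 'complete'")
        results[name] = output  # aggregate per-phase results for the final report
    return results
```

Only the status check is shown here; the fuller set of checks is described under Validation Gates.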

Key Concepts

1. Session Directory

Every workflow run gets its own isolated folder with timestamped reports. This is your evidence trail — you can inspect what was actually done.
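One way to set this up is sketched below. The directory and file naming scheme (session-<timestamp>, <phase>-<timestamp>.md) and both function names are assumptions for illustration, not the layout of the reference implementation.

```python
from datetime import datetime, timezone
from pathlib import Path


def _utc_stamp() -> str:
    # Microsecond resolution keeps names unique across quick successive runs.
    return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")


def create_session_dir(base_dir: Path) -> Path:
    """Create a fresh, timestamped session directory under base_dir."""
    session_dir = base_dir / f"session-{_utc_stamp()}"
    # exist_ok=False: reusing an existing session directory fails loudly.
    session_dir.mkdir(parents=True, exist_ok=False)
    return session_dir


def new_report_path(session_dir: Path, phase_name: str) -> Path:
    """Return a timestamped report path inside the session directory."""
    return session_dir / f"{phase_name}-{_utc_stamp()}.md"
```

Because nothing is ever written outside the session directory, deleting or archiving a run is a single folder operation.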

2. Strict JSON Contracts

Every phase MUST return a JSON object with three fields: status, report_path, and phase_summary. This makes outputs machine-readable so the orchestrator can validate them programmatically.
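As an illustration, the contract could be modeled like this. The three field names come from the text above; the PhaseResult class, the choice of a dictionary for phase_summary, and the example values are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class PhaseResult:
    status: str        # "complete" is the value the orchestrator looks for
    report_path: str   # where the phase wrote its report
    phase_summary: Dict[str, Any] = field(default_factory=dict)  # assumed to be a dict

    def to_json(self) -> str:
        """Serialize to the JSON text a phase hands back to the orchestrator."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical example values, for illustration only.
example = PhaseResult(
    status="complete",
    report_path="reports/phase1_analysis.md",
    phase_summary={"files_scanned": 42},
)
print(example.to_json())
```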

3. Validation Gates

After each phase, the orchestrator checks: Is the JSON valid? Is the status "complete"? Does the report file exist on disk? Are required summary keys present? If anything fails, it stops immediately.
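The four checks can be sketched as a single gate function. GateError and check_gate are hypothetical names, and the sketch assumes phase_summary is a JSON object whose required keys are supplied by the caller.

```python
import json
from pathlib import Path
from typing import Iterable


class GateError(Exception):
    """Raised when a phase output fails a validation gate."""


def check_gate(raw: str, required_summary_keys: Iterable[str]) -> dict:
    # Check 1: is the JSON valid?
    try:
        output = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GateError(f"invalid JSON: {exc}") from exc
    if not isinstance(output, dict):
        raise GateError("output is not a JSON object")
    # Check 2: is the status "complete"?
    if output.get("status") != "complete":
        raise GateError("status is not 'complete'")
    # Check 3: does the report file exist on disk?
    report = output.get("report_path")
    if not isinstance(report, str) or not Path(report).is_file():
        raise GateError(f"report file not found: {report!r}")
    # Check 4: are the required summary keys present?
    summary = output.get("phase_summary")
    if not isinstance(summary, dict):
        raise GateError("phase_summary is not a JSON object")
    missing = sorted(set(required_summary_keys) - set(summary))
    if missing:
        raise GateError(f"missing summary keys: {missing}")
    return output
```

Raising on the first failed check is what gives the "stops immediately" behavior: the exception message names the exact gate that failed.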

4. The Verification Phase

This is what makes the pattern powerful: one phase runs an actual script (not LLM analysis), compares script output against earlier conclusions, and reports what was confirmed, revised, or unexpected. Empirical validation — the script doesn't lie.
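A rough sketch of that comparison follows. It assumes the script prints a flat JSON object of measurements to stdout and that earlier conclusions are stored as a matching dictionary of claims; run_script and compare_findings are hypothetical names.

```python
import json
import subprocess
from typing import Any, Dict, Sequence


def run_script(cmd: Sequence[str]) -> Dict[str, Any]:
    """Run a real script and parse its stdout as JSON measurements."""
    # check=True raises CalledProcessError if the script exits non-zero.
    proc = subprocess.run(list(cmd), capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


def compare_findings(claims: Dict[str, Any], measured: Dict[str, Any]) -> Dict[str, Any]:
    """Sort results into confirmed, revised, and unexpected buckets."""
    # Claims the script did not measure are left out of every bucket in this sketch.
    confirmed = sorted(k for k in claims if k in measured and claims[k] == measured[k])
    revised = {
        k: {"claimed": claims[k], "measured": measured[k]}
        for k in claims
        if k in measured and claims[k] != measured[k]
    }
    unexpected = {k: v for k, v in measured.items() if k not in claims}
    return {"confirmed": confirmed, "revised": revised, "unexpected": unexpected}
```

Exact equality is used for simplicity; numeric measurements such as timings would normally need a tolerance instead.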

5. Reference Instruction Docs

Each phase has a corresponding instruction file that tells the agent exactly what to do. Fixed instructions make runs far more repeatable. LLM output is still not strictly deterministic, which is one reason the validation gates exist.
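Loading a phase's instructions might look like the sketch below. The one-file-per-phase naming (<phase_name>.md in a docs directory) and both function names are assumptions, not the reference implementation's layout.

```python
from pathlib import Path


def load_instructions(docs_dir: Path, phase_name: str) -> str:
    """Read the instruction file for a phase, failing loudly if it is missing."""
    path = docs_dir / f"{phase_name}.md"  # hypothetical naming convention
    if not path.is_file():
        raise FileNotFoundError(f"no instruction file for phase {phase_name!r}: {path}")
    return path.read_text(encoding="utf-8")


def build_prompt(instructions: str, session_dir: Path) -> str:
    """Combine the fixed instructions with the per-run session directory."""
    return f"{instructions.rstrip()}\n\nSession directory: {session_dir}\n"
```

Keeping the instructions in files rather than inline strings means a change to a phase's behavior shows up as a reviewable diff.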

The Mental Model

Think of it as a scientific experiment protocol:

Hypothesis Phase (1-3): Analyze, form conclusions
Verification Phase (4): Run experiment to test hypothesis
Conclusion Phase (5): Synthesize validated findings

The key insight: this pattern turns "LLM wrote some text" into "LLM executed a validated procedure with evidence and structured outputs." That's the difference between a chatbot and a production system.

What You'll Learn

By the end of this lab, you'll be able to:

Understand the test harness pattern and why it's powerful
Navigate the reference implementation
Modify phases and reference docs for your needs
Build your own multi-phase workflow from scratch
Debug when phases fail or return invalid outputs
Deploy production-ready workflows using this pattern

Full Guide on GitHub

The complete guide includes detailed diagrams, terminology tables, and extended examples: