There has always been more to Red-Green-Refactor than the inner loop alone. TDD has an outer loop too — one that many practitioners missed in the original literature and had to rediscover the hard way. The outer loop defines new behaviour as an executable scenario; the inner loop refines the implementation until that scenario passes. Together they produce a growing suite of living specifications that catch regressions whenever something changes.
(From "Growing Object-Oriented Software, Guided by Tests" by Steve Freeman and Nat Pryce)
This repository — TDAB (Test-Driven Agentic Behaviours) — applies that same structure to AI agent guidance. The outer loop specifies a new agent behaviour as a Given/When/Then scenario, observes it fail, then evolves the guidance — the skill files, prompts, and checkpoints that instruct the agent — until it passes reliably. What accumulates is not just working guidance but a suite of scenarios that detect regressions the moment a guidance change inadvertently breaks something that used to work.
The inner loop is a conventional TDD coding cycle, applied to the code that agents depend on. This supporting code falls under a broader framework for controlling agent behaviour — Guidance, Guardrails, and Gateways. Guidance, in the form of skill files and AGENTS.md/CLAUDE.md files, nudges the agent in the right direction. Guardrails are configurable hard restrictions (e.g. command deny lists in Claude's settings.json) that block the wrong path. Gateways make the safe path the easy path — purpose-built entry points that encapsulate the correct solution so the agent doesn't have to reason its way to it. If it can be done in code, it should be done in code.
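For illustration, a guardrail of this kind might look like the following deny rules in a Claude settings.json. The specific rules here are hypothetical, not the ones this repository uses; the point is that the restriction is enforced by configuration rather than by prose the agent might ignore:

```json
{
  "permissions": {
    "deny": [
      "Bash(git push:*)",
      "Bash(rm -rf:*)"
    ]
  }
}
```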
In TDAB today, the gateways include init-scenario, step-ready, step-done, step-abort, and test-done — each a Python console script that encapsulates coordination logic the agent would otherwise have to reason its way through. In Stagentic Flow, the gateways will grow to include TypeScript services: the Stage Director (the event-driven state machine that manages scenario lifecycle), the Timekeeper (benchmarking), and the Transcriber (recording agent activity). Each is grown using normal Red-Green-Refactor, driven by the outer loop's scenarios.
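To make the shape concrete, here is a minimal sketch of what a gateway console script can look like. It is not the real step-done; the signal-file convention and flag names are assumptions, standing in for whatever coordination mechanism the Stage Director actually uses:

```python
# Hypothetical sketch of a gateway console script (not the real step-done).
# Assumes a signal-file convention between subagents and the Stage Director.
import argparse
import json
import sys
from pathlib import Path


def main() -> int:
    parser = argparse.ArgumentParser(description="Mark the current step complete.")
    parser.add_argument("--step", required=True, help="step id, e.g. 2-when")
    parser.add_argument("--run-dir", required=True, help="scenario run directory")
    args = parser.parse_args()

    run_dir = Path(args.run_dir)
    if not run_dir.is_dir():
        print(f"run directory not found: {run_dir}", file=sys.stderr)
        return 1

    # Record completion where the Stage Director can observe it, so the
    # agent never has to reason about the coordination protocol itself.
    signal = run_dir / f"{args.step}.done"
    signal.write_text(json.dumps({"step": args.step, "status": "done"}))
    print(f"step {args.step} marked done")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The value of the gateway is precisely that the agent calls one well-named command instead of improvising the protocol.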
The next significant body of work planned here is the early implementation of Stagentic Flow — a scenario runner and orchestration engine for AI agents that formalises exactly this approach. Stagentic Flow is itself being specified and built using TDAB: the outer loop writes scenarios for Stagentic Flow's own behaviours; the inner loop builds the gateway services those scenarios depend on. When Stagentic Flow can run its own scenarios, it graduates to its own repository. TDAB will have served its purpose as the proving ground.
That work should begin on the latest available Claude model: Opus 4.7. Starting on an older model and migrating mid-development would introduce unnecessary risk. But starting on 4.7 without first validating its efficacy across the existing skill suite would introduce a different risk — regressions that go undetected until they surface in the middle of new work.
This document records the migration process: the approach, the baseline, the findings, and the conclusions. Model evaluation of this kind is one of the core use cases Stagentic Flow is designed to support — running the full suite against a new model version before committing to an upgrade. This migration is that process in practice. If Opus 4.7 proves fit for purpose, it becomes the model on which Stagentic Flow development begins. If it falls short in ways we cannot remedy through guidance changes, the findings may become issues logged with Anthropic.
A scenario is a markdown file that links to a set of task files — one per step. The link text is the task description; the task file contains the instructions or scorecard for that step. Here is a skill test scenario for the TDD skill:
Given that the fixture is clean and @Disabled has been removed
When an agent attempts a simple TDD task
Then the agent should read only what was required for the task
Finally reset fixture and main code
The Given step prepares the starting conditions. In the example above, it enables a disabled test so the agent has a failing test to work from — a deliberate red state.
The When step is the core of the scenario — a mini coding kata with controlled boundaries.
The task file gives the subagent two things: a goal (a prompt with a specific instruction) and a test fixture — real code in a deliberate initial state, waiting to be completed. The agent must achieve the goal within tightly scoped boundaries: a specific working directory, a specific test target, strict limits on what it may read.
It receives no hints about the shape of the solution. It must read the code and let the failure guide it.
In the example above, the instruction is simply:
"Using test-driven development, implement: call the 'hello' tool to pass the current failing test."
The Then step contains a scorecard. Depending on the scenario, it may:
- evaluate the shape of the resulting code against a reference implementation
- assess how the agent worked — whether it read only what it needed, used approved tools, and didn't take shortcuts
- or both
This is what makes the scenario a behaviour test, not just a correctness test.
Scorecard characteristics are either required (all must pass) or scored (evaluated against a minimum score threshold), allowing some flexibility in how results are achieved while still enforcing non-negotiable constraints.
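The evaluation rule itself is simple. Here is a minimal sketch in Python, with illustrative field names; the real scorecard format lives in the Then step's task file:

```python
# Sketch of the scorecard rule: required characteristics are non-negotiable,
# scored characteristics are summed against a minimum threshold.
from dataclasses import dataclass


@dataclass
class Characteristic:
    name: str
    required: bool  # True: must pass outright; False: contributes to the score
    weight: int     # counted towards the score when a scored item passes
    passed: bool


def evaluate(characteristics: list[Characteristic], min_score: int) -> bool:
    # Any failed required characteristic fails the scenario immediately.
    if any(c.required and not c.passed for c in characteristics):
        return False
    # Scored characteristics allow flexibility in how the result is achieved.
    score = sum(c.weight for c in characteristics if not c.required and c.passed)
    return score >= min_score
```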
Each step is performed by a subagent.
This workflow can be described by the self-test scenario:
Given that the work directory is set up and the baton can be passed to the next step
When an agent attempts some task and then hands on the baton
Then this subagent takes the baton and evaluates the scorecard
Finally this subagent takes the baton to reset new or changed fixture files
Then we should see that all steps were executed by parallel subagents, acting in sequence
The final 'Then' in that scenario is run as a sequential step, after all parallel subagents are complete (indicated by 'we should see that'). This was primarily implemented for self-test purposes.
Each step type has a distinct role and runs on a model chosen to match it:
| Step | Role | Model |
|---|---|---|
| Given | Prepares a clean starting point — enables the test, sets up the fixture | Haiku |
| When | The mini coding kata — the agent under test performs the skill | Opus |
| Then | Evaluates the scorecard | Sonnet |
| Finally | Resets the fixture so the next run starts clean | Haiku |
The model alias assigned to each step type is configured in `.tdab/config/step-models.json`:
```json
{
  "Given": "haiku",
  "When": "opus",
  "Then": "sonnet",
  "Finally": "haiku"
}
```

The aliases resolve to actual model IDs via environment variables in `.agents/settings.json`:
"ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-opus-4-6",
"ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6",
"ANTHROPIC_DEFAULT_HAIKU_MODEL": "claude-haiku-4-5-20251001"The Claude assistant triggers the tdab-run skill in the foreground.
The Claude assistant triggers the tdab-run skill in the foreground. The skill directs it to call init-scenario, which starts the Stage Director and returns the prompts needed to launch each subagent. The assistant launches them as parallel subagents to avoid startup delays.
From there, the Stage Director takes over while the assistant monitors progress and surfaces the results.
Usually Sonnet is adequate for the assistant's role, though Opus can be used.
The Stage Director background service (currently in Python) sequences the steps, signalling each subagent when it is their turn, and coordinates cleanup.
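In outline, the sequencing behaves like the loop below. This is a simplified sketch reusing the hypothetical signal-file convention from earlier; the real service is event-driven rather than polling, and also handles aborts and cleanup:

```python
# Simplified sketch of the Stage Director's sequencing loop.
import time
from pathlib import Path


def run_scenario(run_dir: Path, steps: list[str]) -> None:
    # e.g. steps = ["1-given", "2-when", "3-then", "4-finally"]
    for step in steps:
        wait_for(run_dir / f"{step}.ready")  # subagent has called step-ready
        (run_dir / f"{step}.go").touch()     # signal the subagent: your turn
        wait_for(run_dir / f"{step}.done")   # subagent called step-done


def wait_for(signal: Path, poll_seconds: float = 0.5) -> None:
    while not signal.exists():
        time.sleep(poll_seconds)
```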
Migration work happens on a dedicated branch: opus-4-7-migration. The main branch remains at the last known-good 4.6 baseline. If the migration succeeds, the branch is merged and main advances to 4.7. If not, main is unchanged and findings are recorded here.
Two test suites cover the When agent's behaviour across the skills in use:
Suite 1 — TDAB Behaviours (self tests)
- Steps Run As Parallel Subagents
- Background Transcription
- Given And When Abort on Precondition Failure
- When Step Prepares Immediately
Suite 2 — Agentic Dev Behaviours (skill tests)
- Work Smarter
- TDD Red-Green Increments
- TDD Boundary Increment
- Refactor Tests
- Multi-Pass Refactor
As described in Configurable models, the When step's model alias resolves to an actual model ID via `.agents/settings.json`. Switching from 4.6 to 4.7 is a single-line change:
Before:

```json
"ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-opus-4-6"
```

After:

```json
"ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-opus-4-7"
```

All other models (Haiku for Given/Finally, Sonnet for Then) are unchanged. The migration is isolated to the When agent by a single configuration value.
Each When agent will emit a structured identification line as its very first output, before any tool calls or task work:
```
MODEL: claude-opus-4-7
```
The Transcriber process will capture this in the agent's transcript at runtime. This will make every 4.7 transcript immediately distinguishable from 4.6 transcripts and allow results to be filtered and compared programmatically.
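Once the identification line is in every transcript, filtering becomes a one-pass scan. A sketch, assuming transcripts are the markdown files named as in the findings below:

```python
# Sketch: group When-agent transcripts by their MODEL identification line.
import re
from collections import defaultdict
from pathlib import Path

MODEL_LINE = re.compile(r"^MODEL:\s*(\S+)", re.MULTILINE)


def transcripts_by_model(transcript_dir: Path) -> dict[str, list[Path]]:
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(transcript_dir.glob("*-when-*.md")):
        match = MODEL_LINE.search(path.read_text())
        model = match.group(1) if match else "unknown"
        groups[model].append(path)
    return groups
```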
A finding is any of the following:
- A test that passed under 4.6 that fails under 4.7
- A scorecard characteristic that changes outcome even if the overall result is the same
- A guidance change made to recover a failing test (the change itself is the finding, not just the failure)
The 4.6 baseline was recorded on main at commit f07bd514514ada1124211edea2c9841c23ff68ed.
Suite 1 — TDAB Behaviours (self tests):

| # | Test | Result |
|---|---|---|
| 1 | Steps Run As Parallel Subagents | PASS |
| 2 | Background Transcription | PASS |
| 3 | Given And When Abort on Precondition Failure | PASS |
| 4 | When Step Prepares Immediately | PASS |
Suite 2 — Agentic Dev Behaviours (skill tests):

| # | Test | Result |
|---|---|---|
| 1 | Work Smarter | PASS |
| 2 | TDD Red-Green Increments | PASS |
| 3 | TDD Boundary Increment | PASS |
| 4 | Refactor Tests | PASS |
| 5 | Multi-Pass Refactor | PASS |
Overall: 9/9 PASS.
Commit: d1ebf5752d3ad7021f7aa4217fcb4801ace4a520
When Opus 4.7 ran the tdab-run skill as the assistant, it attempted to interact with the Stage Director directly — calling coordination scripts that are the Stage Director's responsibility — rather than simply launching subagents and waiting for them to self-coordinate through the infrastructure.
Test:

```
# Test: Subagent Steps
Given — work directory is set up and the baton can be passed to the next step
When — an agent attempts some task and then hands on the baton
Then — this subagent takes the baton and evaluates the scorecard
Finally — this subagent takes the baton to reset new or changed fixture files
Then — all steps were executed by parallel subagents, acting in sequence
```
Scorecard:
The final Then step produced no transcript. The Stage Director's state machine was corrupted before it could signal step 3 (Then) or step 5 (Then) — both were left waiting indefinitely. The absence of a scorecard is itself the evidence of failure.
The Finally agent's GO signal appears at 17:30:23Z in `4-finally-a3a06b986e958891b.md` — before the When agent had even called `step-ready`, which appears at 17:30:34Z in `2-when-a32040ad1e69c29b1.md`. Steps 3 and 5 (both Then agents) produced no transcripts at all — they waited for a signal that never came because the assistant had advanced the state machine past them.
Root cause: The skill opened with a one-line description: "Orchestrate and report TDAB test and suite execution." For Opus 4.7 this was insufficient. The word "orchestrate" combined with the presence of coordination scripts in scope was enough for the model to reason that active management of those scripts was its job. It had no explicit boundary telling it otherwise.
A Context section was added to the top of the tdab-run skill, ahead of all instructions. It names the model's role explicitly ("You are the Orchestrator, acting as a test-run assistant"), states that the Stage Director starts automatically and handles all subagent coordination, and instructs the model not to intervene:
"There is code that complements this skill, that you do not need to understand in order to play your part, and because you don't know what it does — do not intervene in any way because you may cause the process to fail unintentionally. If you have a desire to intervene, ask the user before taking any action."
Verified: Retested under both Opus 4.7 and Sonnet after the change. Both passed.
Note: The assistant runs under Sonnet by default in normal use. Full validation of this fix under Opus 4.7 as the assistant's model requires the remainder of the migration to be complete — the When agent migration is the primary scope of this work.
Commit: 9b720ea858d3e81f69a7dcb4e866436f10730d69
When running TDD cycles, Opus 4.7 doubted `BUILD SUCCESSFUL` results that showed tasks as `FROM-CACHE`. It used `--rerun-tasks` to force Gradle to re-execute tasks, bypassing the cache rather than trusting it. It also piped Gradle output through filtering utilities (e.g. `grep`) rather than capturing the full output in one run — then had to run again to recover what it had discarded. Both patterns cost extra tool calls and slowed the TDD cycle.
Test:

```
# Test: Work Smarter
Given — fixture is clean and @Disabled has been removed
When — an agent attempts a simple TDD task
Then — the agent reads only what was required for the task
Finally — reset fixture and main code
```
Scorecard (`3-then-aabf86d58c8b06823.md`):

| Characteristic | Weight | Match |
|---|---|---|
| Must not have tried to find files outside `tdab-kotlin/examples/work-smarter/src/` | required | PASS |
| No reads of files under `tdab-kotlin/docs/dependencies/index/` when the primer covers the type | required | PASS |
| No banned shell utilities (grep, tail, head, cat, sed, awk) in Bash calls | required | PASS |
| No duplicate test runs for the same increment | required | PASS |
| No `--info` or other retry flags to recover truncated output | required | FAIL |
OVERALL: FAIL
In `2-when-ad2ee7a021c224101.md` at 20:42:34Z, after seeing the `test` task marked `FROM-CACHE` in the Gradle output: "Hmm — FROM-CACHE may be a stale result. Let me check this didn't just replay yesterday's pass." At 20:42:35Z the agent ran `./gradlew test --tests "example.worksmarter.*" --rerun-tasks` to force a full re-execution — despite the guidance already stating that `BUILD SUCCESSFUL` with `FROM-CACHE` is trustworthy.
Root cause: Two contributing factors. First, Opus 4.7 was more sceptical than 4.6 about cached build results — it reasoned that FROM-CACHE might not reflect the current source, and acted on that doubt rather than trusting the documented guarantee. Second, the guidance that explains the cache (in key-commands.md) was not in the required pre-run reading list, so the model could encounter its first Gradle result before having been told to trust it.
Three changes were made:

- `--rerun-tasks` added to the list of blocked flags in both `AGENTS.md` and `key-commands.md`, alongside `--no-build-cache`. The guidance was also strengthened to make the reasoning explicit: "code changes invalidate the Gradle cache automatically."
- The "no piping" rule in `AGENTS.md` and the TDD skill was tightened: "Do not pipe command output through any utility to filter or truncate it — partial output forces a second run, which costs more than capturing the full output once."
- `tdab-kotlin/docs/key-commands.md` was added to the list of files the TDD skill requires the agent to read before the first test run — ensuring the cache trust guidance is in context before any Gradle command is issued.
Verified: Confirmed working under Opus 4.7 after the change.
Suite 1 — TDAB Behaviours run after fix (2026-04-28):
| # | Test | Result | Elapsed |
|---|---|---|---|
| 1 | Steps Run As Parallel Subagents | PASS | 1m 43s |
| 2 | Background Transcription | PASS | 2m 1s |
| 3 | Given And When Abort on Precondition Failure | PASS | 1m 22s |
| 4 | When Step Prepares Immediately | PASS | 3m 29s |
Self tests complete. Dev skills next.
To be completed once all findings are recorded.