@antonymarcano
Last active April 28, 2026 12:52
Claude Opus 4.7 TDAB Migration

Context

The outer and inner loops

There has always been more to Red-Green-Refactor than the inner loop alone. TDD has an outer loop too — one that many practitioners missed in the original literature and had to rediscover the hard way. The outer loop defines new behaviour as an executable scenario; the inner loop refines the implementation until that scenario passes. Together they produce a growing suite of living specifications that catch regressions whenever something changes.

Inner and outer feedback loops in TDD

(From "Growing Object-Oriented Software, Guided by Tests" by Steve Freeman and Nat Pryce)

This repository — TDAB (Test-Driven Agentic Behaviours) — applies that same structure to AI agent guidance. The outer loop specifies a new agent behaviour as a Given/When/Then scenario, observes it fail, then evolves the guidance — the skill files, prompts, and checkpoints that instruct the agent — until it passes reliably. What accumulates is not just working guidance but a suite of scenarios that detect regressions the moment a guidance change inadvertently breaks something that used to work.

Guidance, Guardrails, and Gateways

The inner loop is a conventional TDD coding cycle, applied to the code that agents depend on. This supporting code falls under a broader framework for controlling agent behaviour — Guidance, Guardrails, and Gateways. Guidance, in the form of skill files and AGENTS.md/CLAUDE.md files, nudges the agent in the right direction. Guardrails are configurable hard restrictions (e.g. command deny lists in Claude's settings.json) that block the wrong path. Gateways make the safe path the easy path — purpose-built entry points that encapsulate the correct solution so the agent doesn't have to reason its way to it. If it can be done in code, it should be done in code.
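As an illustration of the guardrails layer, a command deny list in Claude's settings.json can take the following shape (the specific rules here are hypothetical examples, not this repository's actual configuration):

```json
{
  "permissions": {
    "deny": [
      "Bash(rm -rf:*)",
      "Bash(git push --force:*)"
    ]
  }
}
```

Unlike guidance, which the model can misread or ignore, a deny rule is enforced before the command runs.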

In TDAB today, the gateways include init-scenario, step-ready, step-done, step-abort, and test-done — each a Python console script that encapsulates coordination logic the agent would otherwise have to reason its way through. In Stagentic Flow, the gateways will grow to include TypeScript services: the Stage Director (the event-driven state machine that manages scenario lifecycle), the Timekeeper (benchmarking), and the Transcriber (recording agent activity). Each is grown using normal Red-Green-Refactor, driven by the outer loop's scenarios.
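The real gateway scripts are not reproduced in this document, but the coordination pattern they encapsulate can be sketched as follows. The function names, file layout, and signal format below are illustrative assumptions, not TDAB's actual implementation:

```python
import json
import time
from pathlib import Path

# Hypothetical sketch of a "step-ready" style gateway: the step's
# subagent announces readiness with a marker file and then blocks
# until the coordinator drops a GO marker, so the agent never has
# to reason about sequencing itself.
def signal_ready(run_dir: Path, step: str) -> Path:
    """Record that a step's subagent is ready for coordination."""
    run_dir.mkdir(parents=True, exist_ok=True)
    marker = run_dir / f"{step}.ready.json"
    marker.write_text(json.dumps({"step": step, "ready_at": time.time()}))
    return marker

def wait_for_go(run_dir: Path, step: str, timeout: float = 5.0) -> bool:
    """Block until the coordinator signals GO for this step, or time out."""
    go = run_dir / f"{step}.go"
    deadline = time.time() + timeout
    while time.time() < deadline:
        if go.exists():
            return True
        time.sleep(0.05)
    return False
```

The point of the gateway is that the correct protocol lives in code: the agent calls one entry point instead of improvising coordination.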

Stagentic Flow

The next significant body of work planned here is the early implementation of Stagentic Flow — a scenario runner and orchestration engine for AI agents that formalises exactly this approach. Stagentic Flow is itself being specified and built using TDAB: the outer loop writes scenarios for Stagentic Flow's own behaviours; the inner loop builds the gateway services those scenarios depend on. When Stagentic Flow can run its own scenarios, it graduates to its own repository. TDAB will have served its purpose as the proving ground.

Why this migration, why now

That work should begin on the latest available Claude model: Opus 4.7. Starting on an older model and migrating mid-development would introduce unnecessary risk. But starting on 4.7 without first validating its efficacy across the existing skill suite would introduce a different risk — regressions that go undetected until they surface in the middle of new work.

This document records the migration process: the approach, the baseline, the findings, and the conclusions. Model evaluation of this kind is one of the core use cases Stagentic Flow is designed to support — running the full suite against a new model version before committing to an upgrade. This migration is that process in practice. If Opus 4.7 proves fit for purpose, it becomes the model on which Stagentic Flow development begins. If it falls short in ways we cannot remedy through guidance changes, the findings may become issues logged with Anthropic.


How It Works

A scenario is a markdown file that links to a set of task files — one per step. The link text is the task description; the task file contains the instructions or scorecard for that step. Here is a skill test scenario for the TDD skill:


Test: Work Smarter

Given that the fixture is clean and @Disabled has been removed

When an agent attempts a simple TDD task

Then the agent should read only what was required for the task

Finally reset fixture and main code


Given

Prepares the starting conditions. In the example above, it enables a disabled test so the agent has a failing test to work from — a deliberate red state.

When

The core of the scenario — a mini coding kata with controlled boundaries.

The task file gives the subagent two things: a goal (a prompt with a specific instruction) and a test fixture — real code in a deliberate initial state, waiting to be completed. The agent must achieve the goal within tightly scoped boundaries: a specific working directory, a specific test target, strict limits on what it may read.

It receives no hints about the shape of the solution. It must read the code and let the failure guide it.

In the example above, the instruction is simply:

"Using test-driven development, implement: call the 'hello' tool to pass the current failing test."

Then

Contains a scorecard. Depending on the scenario, it may:

  • evaluate the shape of the resulting code against a reference implementation
  • assess how the agent worked — whether it read only what it needed, used approved tools, and didn't take shortcuts
  • or both

This is what makes the scenario a behaviour test, not just a correctness test.

Scorecard characteristics are either required (all must pass) or scored (evaluated against a minimum score threshold), allowing some flexibility in how results are achieved while still enforcing non-negotiable constraints.
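That split between required and scored characteristics could be evaluated along these lines (a minimal sketch; the field names and the 0.8 threshold are assumptions for illustration, not TDAB's actual code):

```python
# Required characteristics must all pass; scored ones contribute a
# weighted proportion that must meet a minimum threshold.
def evaluate_scorecard(characteristics, min_score=0.8):
    required = [c for c in characteristics if c["weight"] == "required"]
    scored = [c for c in characteristics if c["weight"] != "required"]

    if any(not c["passed"] for c in required):
        return "FAIL"  # one required miss fails the whole scenario

    if scored:
        total = sum(c["weight"] for c in scored)
        achieved = sum(c["weight"] for c in scored if c["passed"])
        if achieved / total < min_score:
            return "FAIL"
    return "PASS"
```

This is what gives the flexibility described above: a scored miss can be absorbed, a required miss cannot.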

Steps as Subagents

Each step is performed by a subagent.

This workflow can be described by the self-test scenario:


Test: Subagent Steps

Given that the work directory is set up and the baton can be passed to the next step

When an agent attempts some task and then hands on the baton

Then this subagent takes the baton and evaluates the scorecard

Finally this subagent takes the baton to reset new or changed fixture files

Then we should see that all steps were executed by parallel subagents, acting in sequence


The final 'Then' in that scenario is run as a sequential step, after all parallel subagents are complete (indicated by 'we should see that'). This was primarily implemented for self-test purposes.

Configurable models

Each step type has a distinct role and runs on a model chosen to match it:

| Step | Role | Model |
|---|---|---|
| Given | Prepares a clean starting point: enables the test, sets up the fixture | Haiku |
| When | The mini coding kata: the agent under test performs the skill | Opus |
| Then | Evaluates the scorecard | Sonnet |
| Finally | Resets the fixture so the next run starts clean | Haiku |

The model alias assigned to each step type is configured in .tdab/config/step-models.json:

```json
{
  "Given": "haiku",
  "When": "opus",
  "Then": "sonnet",
  "Finally": "haiku"
}
```

The aliases resolve to actual model IDs via environment variables in .agents/settings.json:

```json
"ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-opus-4-6",
"ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6",
"ANTHROPIC_DEFAULT_HAIKU_MODEL": "claude-haiku-4-5-20251001"
```
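The two-stage lookup (step type to alias, alias to model ID via environment variable) could be sketched like this. The file path and variable naming come from the document; the function itself and its error handling are illustrative assumptions:

```python
import json
import os

# Resolve a step type (e.g. "When") to a concrete model ID by reading
# the alias from step-models.json and then the environment variable
# named ANTHROPIC_DEFAULT_<ALIAS>_MODEL.
def resolve_model(step_type: str,
                  config_path: str = ".tdab/config/step-models.json") -> str:
    with open(config_path) as f:
        alias = json.load(f)[step_type]           # e.g. "opus"
    env_var = f"ANTHROPIC_DEFAULT_{alias.upper()}_MODEL"
    model_id = os.environ.get(env_var)
    if model_id is None:
        raise KeyError(f"{env_var} is not set")
    return model_id
```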

Orchestration

The Assistant

The Claude assistant triggers the tdab-run skill in the foreground.

The skill directs it to call init-scenario, which starts the Stage Director and returns the prompts needed to launch each subagent. The assistant launches them as parallel subagents to avoid startup delays.

From there, the Stage Director takes over while the assistant monitors progress and surfaces the results.

Usually Sonnet is adequate for the assistant's role, though Opus can be used.

The Stage Director

The Stage Director background service (currently in Python) sequences the steps, signalling each subagent when it is its turn, and coordinates cleanup.
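The sequencing behaviour described above can be sketched as a minimal state machine: signal one step at a time, in order, advancing only when the current step reports done. The class name and event shape are illustrative assumptions; the real Stage Director is not shown in this document:

```python
# Minimal sequencing sketch: each GO signal is recorded in order, and
# an out-of-order completion is rejected rather than corrupting state.
class StageDirector:
    def __init__(self, steps):
        self.steps = list(steps)   # e.g. ["Given", "When", "Then", "Finally"]
        self.current = 0
        self.signals = []          # record of GO signals, oldest first

    def start(self):
        self._signal_current()

    def step_done(self, step):
        if step != self.steps[self.current]:
            raise RuntimeError(f"out-of-order completion: {step}")
        self.current += 1
        if self.current < len(self.steps):
            self._signal_current()

    def _signal_current(self):
        self.signals.append(("GO", self.steps[self.current]))
```

The value of keeping this in code is exactly what Finding 1 below illustrates: if the assistant advances the sequence itself, later steps wait for a signal that never comes.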


Approach

Branch strategy

Migration work happens on a dedicated branch: opus-4-7-migration. The main branch remains at the last known-good 4.6 baseline. If the migration succeeds, the branch is merged and main advances to 4.7. If not, main is unchanged and findings are recorded here.

Test suites

Two test suites cover the When agent's behaviour across the skills in use:

Suite 1 — TDAB Behaviours (self tests)

  1. Steps Run As Parallel Subagents
  2. Background Transcription
  3. Given And When Abort on Precondition Failure
  4. When Step Prepares Immediately

Suite 2 — Agentic Dev Behaviours (skill tests)

  1. Work Smarter
  2. TDD Red-Green Increments
  3. TDD Boundary Increment
  4. Refactor Tests
  5. Multi-Pass Refactor

Version pinning

As described in Configurable models, the When step's model alias resolves to an actual model ID via .agents/settings.json. Switching from 4.6 to 4.7 is a single-line change:

Before:

"ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-opus-4-6"

After:

"ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-opus-4-7"

All other models (Haiku for Given/Finally, Sonnet for Then) are unchanged. The migration is isolated to the When agent by a single configuration value.

Model identification

Each When agent will emit a structured identification line as its very first output, before any tool calls or task work:

MODEL: claude-opus-4-7

The Transcriber process will capture this in the agent's transcript at runtime. This will make every 4.7 transcript immediately distinguishable from 4.6 transcripts and allow results to be filtered and compared programmatically.
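Programmatic filtering on that identification line could look like the following sketch. The `MODEL: <id>` convention is from this document; the helper names and any transcript structure beyond that line are assumptions:

```python
import re

# Match a "MODEL: <id>" line anywhere in a transcript (it is expected
# to be the very first output, but MULTILINE keeps the match robust).
MODEL_LINE = re.compile(r"^MODEL:\s*(\S+)", re.MULTILINE)

def transcript_model(transcript_text: str):
    """Return the model ID declared in a transcript, or None."""
    match = MODEL_LINE.search(transcript_text)
    return match.group(1) if match else None

def filter_by_model(transcripts: dict, model_id: str) -> list:
    """Names of transcripts produced by the given model."""
    return [name for name, text in transcripts.items()
            if transcript_model(text) == model_id]
```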

What counts as a finding

  • A test that passed under 4.6 that fails under 4.7
  • A scorecard characteristic that changes outcome even if the overall result is the same
  • A guidance change made to recover a failing test (the change itself is the finding, not just the failure)

Baseline — Claude Opus 4.6

Recorded on main at commit f07bd514514ada1124211edea2c9841c23ff68ed.

Suite 1 — TDAB Behaviours

| # | Test | Result |
|---|---|---|
| 1 | Steps Run As Parallel Subagents | PASS |
| 2 | Background Transcription | PASS |
| 3 | Given And When Abort on Precondition Failure | PASS |
| 4 | When Step Prepares Immediately | PASS |

Suite 2 — Agentic Dev Behaviours

| # | Test | Result |
|---|---|---|
| 1 | Work Smarter | PASS |
| 2 | TDD Red-Green Increments | PASS |
| 3 | TDD Boundary Increment | PASS |
| 4 | Refactor Tests | PASS |
| 5 | Multi-Pass Refactor | PASS |

Overall: 9/9 PASS.


Findings — Claude Opus 4.7

Finding 1 — The assistant intervened in Stage Director coordination

Commit: d1ebf5752d3ad7021f7aa4217fcb4801ace4a520

Observed behaviour

When Opus 4.7 ran the tdab-run skill as the assistant, it attempted to interact with the Stage Director directly — calling coordination scripts that are the Stage Director's responsibility — rather than simply launching subagents and waiting for them to self-coordinate through the infrastructure.

Test:

# Test: Subagent Steps

Given  — work directory is set up and the baton can be passed to the next step
When   — an agent attempts some task and then hands on the baton
Then   — this subagent takes the baton and evaluates the scorecard
Finally — this subagent takes the baton to reset new or changed fixture files
Then   — all steps were executed by parallel subagents, acting in sequence

Scorecard:

The final Then step produced no transcript. The Stage Director's state machine was corrupted before it could signal step 3 (Then) or step 5 (Then) — both were left waiting indefinitely. The absence of a scorecard is itself the evidence of failure.

Evidence

The Finally agent's GO signal appears at 17:30:23Z in 4-finally-a3a06b986e958891b.md — before the When agent had even called step-ready, which appears at 17:30:34Z in 2-when-a32040ad1e69c29b1.md. Steps 3 and 5 (both Then agents) produced no transcripts at all — they waited for a signal that never came because the assistant had advanced the state machine past them.

Root cause: The skill opened with a one-line description: "Orchestrate and report TDAB test and suite execution." For Opus 4.7 this was insufficient. The word "orchestrate" combined with the presence of coordination scripts in scope was enough for the model to reason that active management of those scripts was its job. It had no explicit boundary telling it otherwise.

Fix

A Context section was added to the top of the tdab-run skill, ahead of all instructions. It names the model's role explicitly ("You are the Orchestrator, acting as a test-run assistant"), states that the Stage Director starts automatically and handles all subagent coordination, and instructs the model not to intervene:

"There is code that complements this skill, that you do not need to understand in order to play your part, and because you don't know what it does — do not intervene in any way because you may cause the process to fail unintentionally. If you have a desire to intervene, ask the user before taking any action."

Verified: Retested under both Opus 4.7 and Sonnet after the change. Both passed.

Note: The assistant runs under Sonnet by default in normal use. Full validation of this fix under Opus 4.7 as the assistant's model requires the remainder of the migration to be complete — the When agent migration is the primary scope of this work.

Finding 2 — Opus 4.7 did not trust the Gradle build cache

Commit: 9b720ea858d3e81f69a7dcb4e866436f10730d69

Observed behaviour

When running TDD cycles, Opus 4.7 doubted BUILD SUCCESSFUL results that showed tasks as FROM-CACHE. It used --rerun-tasks to force Gradle to re-execute tasks, bypassing the cache rather than trusting it. It also piped Gradle output through filtering utilities (e.g. grep) rather than capturing the full output in one run — then had to run again to recover what it had discarded. Both patterns cost extra tool calls and slowed the TDD cycle.

Test:

# Test: Work Smarter

Given  — fixture is clean and @Disabled has been removed
When   — an agent attempts a simple TDD task
Then   — the agent reads only what was required for the task
Finally — reset fixture and main code

Scorecard (3-then-aabf86d58c8b06823.md):

| Characteristic | Weight | Match |
|---|---|---|
| Must not have tried to find files outside tdab-kotlin/examples/work-smarter/src/ | required | PASS |
| No reads of files under tdab-kotlin/docs/dependencies/index/ when the primer covers the type | required | PASS |
| No banned shell utilities (grep, tail, head, cat, sed, awk) in Bash calls | required | PASS |
| No duplicate test runs for the same increment | required | PASS |
| No --info or other retry flags to recover truncated output | required | FAIL |

OVERALL: FAIL

Evidence

In 2-when-ad2ee7a021c224101.md at 20:42:34Z, after seeing test FROM-CACHE in the Gradle output: "Hmm — FROM-CACHE may be a stale result. Let me check this didn't just replay yesterday's pass." At 20:42:35Z the agent ran ./gradlew test --tests "example.worksmarter.*" --rerun-tasks to force a full re-execution — despite the guidance already stating that BUILD SUCCESSFUL with FROM-CACHE is trustworthy.

Root cause: Two contributing factors. First, Opus 4.7 was more sceptical than 4.6 about cached build results — it reasoned that FROM-CACHE might not reflect the current source, and acted on that doubt rather than trusting the documented guarantee. Second, the guidance that explains the cache (in key-commands.md) was not in the required pre-run reading list, so the model could encounter its first Gradle result before having been told to trust it.

Fix

Three changes were made:

  1. --rerun-tasks added to the list of blocked flags in both AGENTS.md and key-commands.md, alongside --no-build-cache. The guidance was also strengthened to make the reasoning explicit: "code changes invalidate the Gradle cache automatically."

  2. The "no piping" rule in AGENTS.md and the TDD skill was tightened: "Do not pipe command output through any utility to filter or truncate it — partial output forces a second run, which costs more than capturing the full output once."

  3. tdab-kotlin/docs/key-commands.md was added to the list of files the TDD skill requires the agent to read before the first test run — ensuring the cache trust guidance is in context before any Gradle command is issued.
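A guardrail version of the blocked-flag rule, checking a proposed Gradle command before it runs, could be as simple as the following sketch (the function and its use are illustrative; the flag list mirrors the guidance above):

```python
# Flags the guidance blocks: both bypass or distrust the build cache.
BLOCKED_FLAGS = {"--rerun-tasks", "--no-build-cache"}

def violates_flag_guidance(command: str) -> bool:
    """True if a Gradle command uses a flag the guidance blocks."""
    return any(flag in command.split() for flag in BLOCKED_FLAGS)
```

Guidance alone asks the model to comply; a check like this would make non-compliance impossible, in line with "if it can be done in code, it should be done in code."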

Verified: Confirmed working under Opus 4.7 after the change.

Suite 1 — TDAB Behaviours run after fix (2026-04-28):

| # | Test | Result | Elapsed |
|---|---|---|---|
| 1 | Steps Run As Parallel Subagents | PASS | 1m 43s |
| 2 | Background Transcription | PASS | 2m 1s |
| 3 | Given And When Abort on Precondition Failure | PASS | 1m 22s |
| 4 | When Step Prepares Immediately | PASS | 3m 29s |

Self tests complete. Dev skills next.


Conclusions

To be completed once all findings are recorded.
