Summary of all token-cost optimizations shipped, with the evidence that motivated each and the measured or projected impact. Intended for task reporting (RSM / Linear).
Pilot 11 (first instrumented run): ~$102.90 total. Manager ran on Opus for the entire session, including mechanical phases (file IO, jq calls, idle wave-wait). Token capture was manual and inaccurate — Tester model was assumed to be Sonnet but transcript analysis later showed ~50% of Tester calls went to Opus.
Pilot 17 (latest comparable): ~$19.50 total. ~81% reduction. Sonnet Manager + Sonnet Planner + Haiku Testers. 9/10 recall (10/10 is the amended-harness baseline).
Problem: The Manager ran on Opus for the entire session. Pilot 11 token analysis showed ~66% of the $70.81 Manager bill came from ~11 minutes of high-cognition work in Phase 1.5 (static analysis) and Phase 3 (charter generation). The other 34% was mechanical: file IO, prompt assembly, jq merging, and idle wave-wait — none of which needs Opus.
Fix: Created a dedicated planner-opus subagent pinned to model: opus. It fires twice per pilot — once for static analysis (~3 min), once for charter generation (~4 min). The Manager itself defaults to Sonnet 4.6 for everything else.
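The pin itself is a few lines of agent frontmatter. A minimal sketch (the description text here is illustrative, not the shipped file):

```markdown
---
name: planner-opus
description: High-cognition planning only (Phase 1.5 static analysis, Phase 3 charter generation)
model: opus
---
```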
Impact: ~30–40% Manager cost reduction per pilot with no recall regression (validated across Pilots 12–15 on magellan-backups). The isolation also eliminates Opus-rate cache creation for the entire Manager conversation — the planner gets a fresh context on each dispatch and pays only for what it reads.
Problem: Transcript analysis (via the token instrumentation shipped in 5030a83) revealed that the prior assumption "Testers run on Sonnet" was wrong. Claude Code routes subagents dynamically; in one measured session, 79 subagents split ~50% Opus / 40% Sonnet / 2% Haiku.
Fix: Added model: sonnet to .claude/agents/tester.md frontmatter. Claude Code's Agent tool honors this as the default; per-charter model: overrides remain available.
Impact: Sonnet rates are ~40% of Opus across all token categories (input, output, 5m cache-create, 1h cache-create, cache-read). A 6-Tester wave at Opus rates costs ~$32–40; at Sonnet rates ~$13–16. Validated across three plugin shapes with no recall regression.
Problem: Testers returned full prose summaries of findings — PQIP tables, anchor verdict lists, evidence excerpts — directly into the Manager's conversation context. Pilot 14 traced a 3× Manager cost spike to this: Manager cc1h (1-hour cache creation) jumped from $0.58 to $10.43 between Pilots 13 and 14 despite the same number of Testers. Each Tester summary block was absorbed into the Manager context and re-cached at 1-hour TTL.
Fix: Testers now return exactly one line:

```
status=<status> report=<absolute-path-to-report.json>
```
The report.json is the source of truth. aggregate-reports.mjs reads all reports at Phase 5 and produces final-report.md. The Manager reads the aggregated output — not per-Tester returns.
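A sketch of the aggregation shape in Node, assuming a `runs/<id>/reports/` layout and `charter`/`status`/`findings` fields in `report.json` (all illustrative; the shipped script's schema may differ):

```js
// aggregate-reports.mjs (sketch): merge per-Tester report.json files into one
// final-report.md so the Manager reads a single artifact, not N prose returns.
import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

const runDir = process.argv[2];             // e.g. runs/<id>
const reportsDir = join(runDir, "reports"); // assumed layout

const reports = readdirSync(reportsDir)
  .filter((f) => f.endsWith(".json"))
  .map((f) => JSON.parse(readFileSync(join(reportsDir, f), "utf8")));

const out = ["# Final report", ""];
for (const r of reports) {
  out.push(`## ${r.charter}: ${r.status}`);             // assumed field names
  for (const f of r.findings ?? []) out.push(`- ${f.summary}`);
  out.push("");
}
writeFileSync(join(runDir, "final-report.md"), out.join("\n"));
```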
Impact: Manager cc1h projected back to ~$1.50 (from $10.43) for an 8-Tester run. The fix also removes the semantic risk of summaries drifting from the actual report content.
Problem: Tester dispatch prompts were ~60 lines of inline guidance — driver-specific notes ("write a Playwright spec, not MCP calls"), schema enum reminders, validation step reminders, evidence minimums. All of this was already in .claude/agents/tester.md and the driver skill files, so it was repeated in every dispatch, inflating the Manager's cache.
Fix: Dispatch prompts reduced to ~12 lines: the Tester's role spec and driver skill cover everything else. An explicit anti-pattern table documents what NOT to add to dispatch prompts and why.
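For illustration only, a dispatch prompt in the reduced shape; every detail below is hypothetical, and the real template ships with the Manager spec:

```text
You are a Tester. Charter: /abs/path/runs/<id>/charters/03-breadth.md
Driver skill: playwright. Site: <provisioned-site-url>
Execute the charter, write report.json to the path it names, then return exactly:
status=<status> report=<absolute-path-to-report.json>
```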
Impact: ~50 lines × 7 charters = ~350 lines of Manager output saved per run. At Sonnet Manager rates the absolute dollar impact is small, but the discipline prevents the pattern from re-inflating as new instructions get added over time.
Problem: Pilot 12 spent ~30 of ~105 Manager turns on work that didn't need an LLM: iterating over charter slugs to read site metadata, looping through teardown calls, writing 8 separate markdown charter files. Each turn paid Sonnet Manager rates plus accumulated cache-write overhead.
Fix: Three scripts moved this work out of the LLM:
- `provision-charters.sh` — bulk parallel site provisioning
- `teardown-all-sites.sh` — bulk teardown in one call (replaces N sequential site-delete calls)
- `generate-charter-files.mjs` — renders `coverage.md` + per-charter markdown from a single `charter-set.json` (planner writes one JSON file instead of 8 markdown files); sketched below
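A sketch of the renderer, assuming `charter-set.json` carries a `charters` array with `slug`/`title`/`body` fields (names illustrative):

```js
// generate-charter-files.mjs (sketch): expand one charter-set.json into
// coverage.md plus per-charter markdown, replacing 8 LLM file-write turns.
import { readFileSync, writeFileSync, mkdirSync } from "node:fs";
import { join } from "node:path";

const setPath = process.argv[2];              // e.g. runs/<id>/charter-set.json
const outDir = process.argv[3] ?? "charters";
const { charters } = JSON.parse(readFileSync(setPath, "utf8"));
mkdirSync(outDir, { recursive: true });

const coverage = ["# Coverage", ""];
for (const c of charters) {
  // slug/title/body are assumed field names in the planner's JSON schema
  writeFileSync(join(outDir, `${c.slug}.md`), `# ${c.title}\n\n${c.body}\n`);
  coverage.push(`- ${c.slug}: ${c.title}`);
}
writeFileSync(join(outDir, "coverage.md"), coverage.join("\n") + "\n");
```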
Impact: ~30 Manager turns saved per run. Planner Opus turn count reduced from ~8 file-writes to 1. Manager no longer reads individual charter files during dispatch — Testers read their own charters.
Problem: Early token capture used the Agent tool's total_tokens surface only — no input/output split, no cache-tier breakdown, no per-model attribution. This produced the incorrect "Sonnet Testers" assumption described in §2.
Fix: scripts/capture-run-tokens.mjs parses Claude Code's local transcript files (~/.claude/projects/<proj-hash>/*.jsonl), which contain per-message usage blocks with full breakdown: input_tokens, output_tokens, cache_creation_input_tokens (5m + 1h tiers), cache_read_input_tokens, and model name. Writes runs/<id>/token-usage.json with per-LLM × per-tier × per-subagent detail plus cost estimate.
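The core of the capture is a fold over per-message usage blocks. A sketch, with field paths as observed in local transcripts (treat them as assumptions, not a stable API):

```js
// capture-run-tokens.mjs (sketch): fold per-message usage from Claude Code's
// transcript JSONL into per-model totals across all cache tiers.
import { readFileSync } from "node:fs";

const totals = {}; // model -> { input, output, cc5m, cc1h, cacheRead }

for (const file of process.argv.slice(2)) { // pass the *.jsonl paths
  for (const line of readFileSync(file, "utf8").split("\n")) {
    if (!line.trim()) continue;
    const msg = JSON.parse(line);
    const u = msg.message?.usage;           // usage block location as observed
    if (!u) continue;
    const model = msg.message.model ?? "unknown";
    const t = (totals[model] ??= { input: 0, output: 0, cc5m: 0, cc1h: 0, cacheRead: 0 });
    t.input     += u.input_tokens ?? 0;
    t.output    += u.output_tokens ?? 0;
    t.cc5m      += u.cache_creation?.ephemeral_5m_input_tokens ?? 0;
    t.cc1h      += u.cache_creation?.ephemeral_1h_input_tokens ?? 0;
    t.cacheRead += u.cache_read_input_tokens ?? 0;
  }
}
console.log(JSON.stringify(totals, null, 2));
```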
Why this matters: Without accurate instrumentation, optimizations can't be measured and claims about where the cost goes are unreliable. The Opus→Sonnet Tester switch (§2) was discovered via this tool, not by assumption.
Problem: Planner was pinned to Opus. Some pilots (cost-bounded experiments, rapid-draft passes) don't need Opus-level recall precision.
Fix: planner_model: field in MISSION.md (or MAGELLAN_PLANNER_MODEL env var) selects haiku | sonnet | opus. Three variants initially shipped as separate files; later consolidated to one planner.md with model selected at dispatch.
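Assuming the field sits in MISSION.md frontmatter (placement illustrative):

```markdown
---
planner_model: sonnet   # haiku | sonnet | opus; MAGELLAN_PLANNER_MODEL env var also accepted
---
```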
| Variant | Model | Validated recall |
|---|---|---|
| `planner-opus` | Opus 4.7 | 10/10 (regression-test plugin) |
| `planner-sonnet` | Sonnet 4.6 | 9/10 (Pilot 17, magellan-backups) |
| `planner-haiku` | Haiku 4.5 | Not validated — known recall regression risk |
Impact: Pilot 17 (Sonnet planner + Haiku Testers + Sonnet Manager) ran at ~$19.50 vs Pilot 11's $102.90 — ~81% reduction — while holding 9/10 recall.
Problem: After the Sonnet default (§2), Testers still ran at Sonnet rates even on low-complexity breadth charters that don't need Sonnet-level reasoning; Haiku rates are ~10% of Opus (Sonnet is ~40%).
Fix: tester_model: haiku in MISSION.md. Charter-level model: override available for mixed waves (Haiku for low-complexity breadth charters, Sonnet for depth/AND-list charters).
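A mixed-wave sketch, assuming the override lives in charter frontmatter (placement and filename hypothetical):

```markdown
---
# 05-backups-and-list-depth.md: keep this depth charter on Sonnet
model: sonnet   # wave default is tester_model: haiku in MISSION.md
---
```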
Impact: Pilot 17 validated Haiku Testers at 9/10 recall on magellan-backups. The one miss (Issue 9 — scale-sensitive query) was a known chronic miss class addressed by Reinforcement 3 (shipped in ef3205b).
Problem: Five agent variant files drifted after forking: tester-opus.md, tester-haiku.md, planner-sonnet.md, planner-haiku.md, and the original tester.md/planner-opus.md. Drift caused quality regressions: Haiku/Opus Tester variants were missing env_warnings, lens loading, and env-blocker triage. Duplication also meant Phase 1.5 and Phase 3 specs existed in both test-plugin.md and planner-opus.md, diverging over time.
Fix (8 sub-commits, squashed):
- Delete 4 variant files (3 Tester, 1 Planner): `tester-opus.md`, `tester-haiku.md`, `planner-sonnet.md`, `planner-haiku.md` — 1,930 lines deleted
- Move Phase 3 charter-authoring rules from `test-plugin.md` into `planner.md` (single authoritative source)
- Move Phase 1.5 static-analysis spec from `test-plugin.md` into `planner.md`
- Split operator reference content out of `AGENTS.md` → `docs/operating-guide.md`
- Tighten browser-driver skills: shared dialog/upload/snapshot rules across all three driver skill files
File-level deltas:
- `test-plugin.md`: 1,475 → 1,124 lines (−23.8%)
- `AGENTS.md`: 366 → 216 lines (−41%)
- Net: ~2,160 lines removed across 24 files
Measured cost impact: total cost moved +1.8% (within run-to-run variance). The audit's original "save N lines = save tokens" framing was wrong on magnitude — cached prompt size is a small cost lever. Cache reads account for ~50% of the total bill and are not affected by prompt size. The drivers that dominate cost are: model selection, turn count, wave size, and context inflation (addressed by §3).
The consolidation's primary value is correctness (drift eliminated) and maintainability (one place to edit), not direct token savings.
From the audit data across Pilots 11–18:
| Lever | Impact | Notes |
|---|---|---|
| Model selection (Opus→Sonnet→Haiku) | High | Sonnet is 40% of Opus; Haiku is ~10% of Opus across all cache tiers |
| Context inflation prevention | High | Tester prose returns inflated Manager cc1h 18× in one pilot |
| Turn count reduction | Medium | ~30 turns saved by script-driven phases; each turn has output + cache-write cost |
| Prompt size reduction | Low | Cache reads are billed per token regardless; reducing prompt size saves only uncached input reads |
| Parallel wave dispatch | No direct token impact | Reduces wallclock, not token count |
The largest single optimization was the Sonnet Tester default (§2) combined with planner subagent isolation (§1). Together they changed the model mix from ~50% Opus to ~10% Opus (planner only), which explains most of the $102.90 → $19.50 reduction.