@alopezari
Created April 30, 2026 11:07

Magellan — Token Efficiency Optimizations

Summary of all token-cost optimizations shipped, with the evidence that motivated each and the measured or projected impact. Intended for task reporting (RSM / Linear).


Baseline

Pilot 11 (first instrumented run): ~$102.90 total. Manager ran on Opus for the entire session, including mechanical phases (file IO, jq calls, idle wave-wait). Token capture was manual and inaccurate — Tester model was assumed to be Sonnet but transcript analysis later showed ~50% of Tester calls went to Opus.

Pilot 17 (latest comparable): ~$19.50 total. ~81% reduction. Sonnet Manager + Sonnet Planner + Haiku Testers. 9/10 recall (10/10 is the amended-harness baseline).


Optimization 1 — Planner subagent isolation (commit b95af9e)

Problem: The Manager ran on Opus for the entire session. Pilot 11 token analysis showed ~66% of the $70.81 Manager bill came from ~11 minutes of high-cognition work in Phase 1.5 (static analysis) and Phase 3 (charter generation). The other 34% was mechanical: file IO, prompt assembly, jq merging, and idle wave-wait — none of which needs Opus.

Fix: Created a dedicated planner-opus subagent pinned to model: opus. It fires twice per pilot — once for static analysis (~3 min), once for charter generation (~4 min). The Manager itself defaults to Sonnet 4.6 for everything else.
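Pinning a subagent to a model is a one-line frontmatter field in its agent file. A minimal sketch of what the .claude/agents/planner-opus.md frontmatter might look like (only the model: opus pin is from this report; the description text is invented):

```markdown
---
name: planner-opus
description: High-cognition planning only: Phase 1.5 static analysis and Phase 3 charter generation
model: opus
---
```

The body of the file then carries the planner's role spec; the Manager dispatches it via the Agent tool and otherwise stays on Sonnet.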

Impact: ~30–40% Manager cost reduction per pilot with no recall regression (validated across Pilots 12–15 on magellan-backups). The isolation also eliminates Opus-rate cache creation for the entire Manager conversation — the planner gets a fresh context each dispatch, pays only for what it reads.


Optimization 2 — Sonnet Tester default (commit e8c8094)

Problem: Transcript analysis (via the token instrumentation shipped in 5030a83) revealed that the prior assumption "Testers run on Sonnet" was wrong. Claude Code routes subagents dynamically; in one measured session, 79 subagents split ~50% Opus / 40% Sonnet / 2% Haiku.

Fix: Added model: sonnet to .claude/agents/tester.md frontmatter. Claude Code's Agent tool honors this as the default; per-charter model: overrides remain available.

Impact: Sonnet rates are ~40% of Opus across all token categories (input, output, 5m cache-create, 1h cache-create, cache-read). A 6-Tester wave at Opus rates costs ~$32–40; at Sonnet rates ~$13–16. Validated across three plugin shapes with no recall regression.
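The wave math above can be sketched numerically. The ~40% ratio and the $32–40 Opus wave figure are taken from this section; treating every Tester as a flat per-Tester cost is a simplification:

```javascript
// Back-of-envelope Tester wave cost. The ~40% Sonnet/Opus rate ratio and
// the $32–40 Opus wave figure come from the report; a flat per-Tester
// cost (ignoring per-charter variance) is an assumption.
const SONNET_TO_OPUS = 0.4; // Sonnet rates ≈ 40% of Opus across token categories

function waveCost(testers, perTesterCost) {
  return testers * perTesterCost;
}

const opusWave = waveCost(6, 6);              // $36, midpoint of the $32–40 range
const sonnetWave = opusWave * SONNET_TO_OPUS; // $14.40, inside the $13–16 range
```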


Optimization 3 — File-backed Tester returns (commit f395200)

Problem: Testers returned full prose summaries of findings — PQIP tables, anchor verdict lists, evidence excerpts — directly into the Manager's conversation context. Pilot 14 traced a 3× Manager cost spike to this: Manager cc1h (1-hour cache creation) jumped from $0.58 to $10.43 between Pilots 13 and 14 despite the same number of Testers. Each Tester summary block was absorbed into the Manager context and re-cached at 1-hour TTL.

Fix: Testers now return exactly one line:

status=<status> report=<absolute-path-to-report.json>

The report.json is the source of truth. aggregate-reports.mjs reads all reports at Phase 5 and produces final-report.md. The Manager reads the aggregated output — not per-Tester returns.
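A minimal parser for that one-line return contract, as the Manager side might consume it (the field names are from the contract above; the validation rules, e.g. requiring an absolute path, are an assumption):

```javascript
// Parse the one-line Tester return: "status=<status> report=<absolute-path>".
// Anything else is treated as a malformed return.
function parseTesterReturn(line) {
  const m = line.trim().match(/^status=(\S+) report=(\S+)$/);
  if (!m) throw new Error(`malformed Tester return: ${JSON.stringify(line)}`);
  const [, status, report] = m;
  // Assumption: report paths must be absolute so any phase can read them.
  if (!report.startsWith("/")) throw new Error(`report path must be absolute: ${report}`);
  return { status, report };
}
```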

Impact: Manager cc1h projected back to ~$1.50 (from $10.43) for an 8-Tester run. The fix also removes the semantic risk of summaries drifting from the actual report content.


Optimization 4 — Slim Tester dispatch prompts (commit f395200)

Problem: Tester dispatch prompts were ~60 lines of inline guidance — driver-specific notes ("write a Playwright spec, not MCP calls"), schema enum reminders, validation step reminders, evidence minimums. All of this was already in .claude/agents/tester.md and the driver skill files, so it was repeated in every dispatch, inflating the Manager's cache.

Fix: Dispatch prompts reduced to ~12 lines: the Tester's role spec and driver skill cover everything else. An explicit anti-pattern table documents what NOT to add to dispatch prompts and why.
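Purely for illustration, a dispatch prompt in the slimmed style might read as follows (every placeholder below is invented; the one-line return contract is the one from the file-backed-returns fix):

```markdown
You are a Magellan Tester.

Charter: runs/<run-id>/charters/<charter-slug>.md (read it yourself)
Driver: per the charter's driver field; load that driver skill.

Everything else (report schema, evidence minimums, validation steps,
driver-specific guidance) lives in your role spec and the driver skill.
Do not expect it here.

Return exactly one line:
status=<status> report=<absolute-path-to-report.json>
```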

Impact: ~50 lines × 7 charters = ~350 lines of Manager output saved per run. At Sonnet Manager rates the absolute dollar impact is small, but the discipline prevents the pattern from re-inflating as new instructions get added over time.


Optimization 5 — Script-driven mechanical Manager phases (commit 4b4e0fe)

Problem: Pilot 12 spent ~30 of ~105 Manager turns on work that didn't need an LLM: iterating over charter slugs to read site metadata, looping through teardown calls, writing 8 separate markdown charter files. Each turn paid Sonnet Manager rates plus accumulated cache-write overhead.

Fix: Three scripts moved this work out of the LLM:

  • provision-charters.sh — bulk parallel site provisioning
  • teardown-all-sites.sh — bulk teardown in one call (replaces N sequential site-delete calls)
  • generate-charter-files.mjs — renders coverage.md + per-charter markdown from a single charter-set.json (planner writes one JSON file instead of 8 markdown files)

Impact: ~30 Manager turns saved per run. Planner Opus turn count reduced from ~8 file-writes to 1. Manager no longer reads individual charter files during dispatch — Testers read their own charters.


Optimization 6 — Accurate token instrumentation (commits 428d365, 5030a83)

Problem: Early token capture relied only on the total_tokens figure surfaced by the Agent tool — no input/output split, no cache-tier breakdown, no per-model attribution. This produced the incorrect "Sonnet Testers" assumption described in §2.

Fix: scripts/capture-run-tokens.mjs parses Claude Code's local transcript files (~/.claude/projects/<proj-hash>/*.jsonl), which contain per-message usage blocks with full breakdown: input_tokens, output_tokens, cache_creation_input_tokens (5m + 1h tiers), cache_read_input_tokens, and model name. Writes runs/<id>/token-usage.json with per-LLM × per-tier × per-subagent detail plus cost estimate.

Why this matters: Without accurate instrumentation, optimizations can't be measured and claims about where the cost goes are unreliable. The Opus→Sonnet Tester switch (§2) was discovered via this tool, not by assumption.


Optimization 7 — Planner model configurability (commits ab4b487, 6a07b65)

Problem: Planner was pinned to Opus. Some pilots (cost-bounded experiments, rapid-draft passes) don't need Opus-level recall precision.

Fix: planner_model: field in MISSION.md (or MAGELLAN_PLANNER_MODEL env var) selects haiku | sonnet | opus. Three variants initially shipped as separate files; later consolidated to one planner.md with model selected at dispatch.
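The selection order can be sketched as a small resolver (the regex for the planner_model: line, and the assumption that the MISSION.md field takes precedence over the env var, are both guesses at the format; only the three allowed values and the names MISSION.md / MAGELLAN_PLANNER_MODEL come from the source):

```javascript
// Sketch of planner model resolution: MISSION.md field, then the env
// var, then the opus default. The MISSION.md line format and the
// precedence order are assumptions.
const ALLOWED_MODELS = new Set(["haiku", "sonnet", "opus"]);

function resolvePlannerModel(missionText, env = process.env) {
  const m = missionText.match(/^planner_model:\s*(\S+)/m);
  const choice = (m && m[1]) || env.MAGELLAN_PLANNER_MODEL || "opus";
  if (!ALLOWED_MODELS.has(choice)) throw new Error(`unknown planner model: ${choice}`);
  return choice;
}
```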

| Variant | Model | Validated recall |
|---|---|---|
| planner-opus | Opus 4.7 | 10/10 (regression-test plugin) |
| planner-sonnet | Sonnet 4.6 | 9/10 (Pilot 17, magellan-backups) |
| planner-haiku | Haiku 4.5 | Not validated — known recall regression risk |

Impact: Pilot 17 (Sonnet planner + Haiku Testers + Sonnet Manager) ran at ~$19.50 vs Pilot 11's $102.90 — ~81% reduction — while holding 9/10 recall.


Optimization 8 — Haiku Tester support (commit a947a0b)

Fix: tester_model: haiku in MISSION.md. Charter-level model: override available for mixed waves (Haiku for low-complexity breadth charters, Sonnet for depth/AND-list charters).
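The override chain for mixed waves can be sketched in a few lines (the precedence order, charter-level model: over mission-level tester_model: over the Sonnet default, is inferred from the text, not confirmed):

```javascript
// Sketch of Tester model resolution for mixed waves. Precedence is an
// assumption: charter override > mission-level tester_model > sonnet default.
function resolveTesterModel(charter = {}, mission = {}) {
  return charter.model || mission.tester_model || "sonnet";
}
```

A mixed wave then just sets model: on the depth/AND-list charters and leaves the breadth charters to inherit the mission-level choice.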

Impact: Pilot 17 validated Haiku Testers at 9/10 recall on magellan-backups. The one miss (Issue 9 — scale-sensitive query) was a known chronic miss class addressed by Reinforcement 3 (shipped in ef3205b).


Optimization 9 — Agent file consolidation and de-duplication (commit 15a1d44)

Problem: Five agent variant files drifted after forking: tester-opus.md, tester-haiku.md, planner-sonnet.md, planner-haiku.md, and the original tester.md/planner-opus.md. Drift caused quality regressions: Haiku/Opus Tester variants were missing env_warnings, lens loading, and env-blocker triage. Duplication also meant Phase 1.5 and Phase 3 specs existed in both test-plugin.md and planner-opus.md, diverging over time.

Fix (8 sub-commits, squashed):

  • Delete 4 variant files (3 Tester, 1 Planner): tester-opus.md, tester-haiku.md, planner-sonnet.md, planner-haiku.md — 1,930 lines deleted
  • Move Phase 3 charter-authoring rules from test-plugin.md into planner.md (single authoritative source)
  • Move Phase 1.5 static-analysis spec from test-plugin.md into planner.md
  • Split operator reference content out of AGENTS.md into docs/operating-guide.md
  • Tighten browser-driver skills: shared dialog/upload/snapshot rules across all three driver skill files

File-level deltas:

  • test-plugin.md: 1,475 → 1,124 lines (−23.8%)
  • AGENTS.md: 366 → 216 lines (−41%)
  • Net: ~2,160 lines removed across 24 files

Measured cost impact: total cost moved +1.8% (within run-to-run variance). The audit's original "save N lines = save tokens" framing was wrong on magnitude — cached prompt size is a small cost lever. Cache reads account for ~50% of the total bill and are not affected by prompt size. The drivers that dominate cost are: model selection, turn count, wave size, and context inflation (addressed by §3).

The consolidation's primary value is correctness (drift eliminated) and maintainability (one place to edit), not direct token savings.


What Actually Moves the Token Bill

From the audit data across Pilots 11–18:

| Lever | Impact | Notes |
|---|---|---|
| Model selection (Opus → Sonnet → Haiku) | High | Sonnet is ~40% of Opus; Haiku is ~10% of Opus across all cache tiers |
| Context-inflation prevention | High | Tester prose returns inflated Manager cc1h 18× in one pilot |
| Turn-count reduction | Medium | ~30 turns saved by script-driven phases; each turn has output + cache-write cost |
| Prompt-size reduction | Low | Cache reads are billed per token regardless; reducing prompt size saves only uncached input reads |
| Parallel wave dispatch | None (tokens) | Reduces wallclock, not token count |

The largest single optimization was the Sonnet Tester default (§2) combined with planner subagent isolation (§1). Together they changed the model mix from ~50% Opus to ~10% Opus (planner only), which explains most of the $102.90 → $19.50 reduction.
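As a rough sanity check on that claim, a toy model using only the rate ratios from the lever table (Sonnet ≈ 0.4× Opus, Haiku ≈ 0.1× Opus) and the quoted Opus shares, with token volume held fixed and spread uniformly, which is a deliberate oversimplification:

```javascript
// Toy model-mix check: hold token volume fixed, change only which model
// processes it. Rate ratios are from the lever table; the example mixes
// (50% Opus before, 10% Opus after) are from the paragraph above; the
// uniform-volume assumption is mine.
const RATE = { opus: 1.0, sonnet: 0.4, haiku: 0.1 };

function relativeCost(mix) {
  // mix: fraction of token volume handled by each model, summing to 1
  return Object.entries(mix).reduce((cost, [model, frac]) => cost + frac * RATE[model], 0);
}

const before = relativeCost({ opus: 0.5, sonnet: 0.5 });            // 0.70
const after = relativeCost({ opus: 0.1, sonnet: 0.3, haiku: 0.6 }); // 0.28
// Mix alone gives roughly a 0.4x bill; the remaining gap down to
// $19.50 comes from turn-count and context-inflation work (§3, §5).
```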
