FND-042: Context Exhaustion Recovery — Architecture Analysis
| <!DOCTYPE html> | |
| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | |
| <title>FND-042 — Context Exhaustion Recovery</title> | |
| <style> | |
| :root { | |
| --bg: #0a0a0f; | |
| --bg-surface: #12121a; | |
| --bg-card: #16161f; | |
| --bg-elevated: #1a1a25; | |
| --border: #2a2a3a; | |
| --border-bright: #3a3a4f; | |
| --text: #c8c8d4; | |
| --text-dim: #888899; | |
| --text-bright: #e8e8f0; | |
| --amber: #ffb000; | |
| --cyan: #00d4ff; | |
| --green: #00ff88; | |
| --red: #ff3333; | |
| --magenta: #ff44aa; | |
| --blue: #4488ff; | |
| } | |
| * { margin: 0; padding: 0; box-sizing: border-box; } | |
| body { | |
| background: var(--bg); | |
| color: var(--text); | |
| font-family: 'JetBrains Mono', 'Fira Code', 'SF Mono', 'Cascadia Code', monospace; | |
| font-size: 0.8rem; | |
| line-height: 1.6; | |
| padding: 2rem; | |
| max-width: 1200px; | |
| margin: 0 auto; | |
| } | |
| h1 { | |
| color: var(--amber); | |
| font-size: 1.4rem; | |
| letter-spacing: 0.15em; | |
| text-transform: uppercase; | |
| border-bottom: 2px solid var(--amber); | |
| padding-bottom: 0.5rem; | |
| margin-bottom: 0.5rem; | |
| } | |
| h2 { | |
| color: var(--cyan); | |
| font-size: 1rem; | |
| letter-spacing: 0.1em; | |
| text-transform: uppercase; | |
| margin: 2.5rem 0 1rem; | |
| border-bottom: 1px solid var(--border); | |
| padding-bottom: 0.4rem; | |
| } | |
| h3 { | |
| color: var(--amber); | |
| font-size: 0.85rem; | |
| margin: 1.5rem 0 0.75rem; | |
| } | |
| h4 { | |
| color: var(--text-bright); | |
| font-size: 0.8rem; | |
| margin: 0.75rem 0 0.5rem; | |
| } | |
| a { color: var(--cyan); text-decoration: none; } | |
| a:hover { text-decoration: underline; } | |
| .doc-meta { | |
| color: var(--text-dim); | |
| font-size: 0.7rem; | |
| letter-spacing: 0.1em; | |
| text-transform: uppercase; | |
| margin-bottom: 2rem; | |
| line-height: 1.8; | |
| } | |
| .doc-meta span { | |
| display: inline-block; | |
| margin-right: 2rem; | |
| } | |
| /* Section numbers */ | |
| .section-num { | |
| display: inline-block; | |
| color: var(--amber); | |
| font-size: 1.8rem; | |
| font-weight: 700; | |
| opacity: 0.3; | |
| margin-right: 0.75rem; | |
| vertical-align: middle; | |
| } | |
| /* Metrics boxes */ | |
| .metrics { | |
| display: grid; | |
| grid-template-columns: repeat(auto-fit, minmax(160px, 1fr)); | |
| gap: 0.75rem; | |
| margin: 1.5rem 0; | |
| } | |
| .metric { | |
| background: var(--bg-card); | |
| border: 1px solid var(--border); | |
| padding: 1rem; | |
| text-align: center; | |
| } | |
| .metric-value { | |
| font-size: 1.6rem; | |
| font-weight: 700; | |
| color: var(--cyan); | |
| display: block; | |
| } | |
| .metric-label { | |
| font-size: 0.65rem; | |
| color: var(--text-dim); | |
| text-transform: uppercase; | |
| letter-spacing: 0.1em; | |
| display: block; | |
| margin-top: 0.25rem; | |
| } | |
| .metric-sub { | |
| font-size: 0.6rem; | |
| color: var(--text-dim); | |
| display: block; | |
| } | |
| /* Tables */ | |
| table { | |
| width: 100%; | |
| border-collapse: collapse; | |
| margin: 1rem 0; | |
| font-size: 0.75rem; | |
| } | |
| th { | |
| background: var(--bg-elevated); | |
| color: var(--amber); | |
| text-align: left; | |
| padding: 0.5rem 0.75rem; | |
| border: 1px solid var(--border); | |
| text-transform: uppercase; | |
| font-size: 0.7rem; | |
| letter-spacing: 0.05em; | |
| } | |
| td { | |
| padding: 0.5rem 0.75rem; | |
| border: 1px solid var(--border); | |
| vertical-align: top; | |
| } | |
| .cell-good { color: var(--green); } | |
| .cell-warn { color: var(--amber); } | |
| .cell-bad { color: var(--red); } | |
| /* Callout boxes */ | |
| .unknown-callout { | |
| background: var(--bg-card); | |
| border-left: 3px solid var(--amber); | |
| padding: 1rem 1.25rem; | |
| margin: 0.75rem 0; | |
| font-size: 0.78rem; | |
| } | |
| .unknown-callout strong:first-child { | |
| color: var(--amber); | |
| } | |
| .boundary-box { | |
| background: var(--bg-card); | |
| border: 1px solid var(--red); | |
| padding: 1rem 1.25rem; | |
| margin: 1rem 0; | |
| font-size: 0.78rem; | |
| color: var(--red); | |
| text-align: center; | |
| text-transform: uppercase; | |
| letter-spacing: 0.1em; | |
| } | |
| /* Solution cards */ | |
| .solution { | |
| background: var(--bg-card); | |
| border: 1px solid var(--border); | |
| margin: 1.5rem 0; | |
| overflow: hidden; | |
| } | |
| .solution-header { | |
| display: flex; | |
| align-items: center; | |
| gap: 1rem; | |
| padding: 1rem 1.25rem; | |
| background: var(--bg-elevated); | |
| border-bottom: 1px solid var(--border); | |
| } | |
| .solution-num { | |
| font-size: 1.8rem; | |
| font-weight: 700; | |
| color: var(--amber); | |
| opacity: 0.4; | |
| min-width: 2.5rem; | |
| } | |
| .solution-title { | |
| color: var(--text-bright); | |
| font-size: 0.9rem; | |
| font-weight: 600; | |
| } | |
| .solution-subtitle { | |
| color: var(--text-dim); | |
| font-size: 0.72rem; | |
| margin-top: 0.15rem; | |
| } | |
| .solution-body { | |
| padding: 1.25rem; | |
| } | |
| .card-badge { | |
| font-size: 0.6rem; | |
| padding: 0.2rem 0.6rem; | |
| border-radius: 2px; | |
| text-transform: uppercase; | |
| letter-spacing: 0.1em; | |
| font-weight: 600; | |
| white-space: nowrap; | |
| margin-left: auto; | |
| } | |
| .badge-green { background: rgba(0,255,136,0.15); color: var(--green); border: 1px solid rgba(0,255,136,0.3); } | |
| .badge-cyan { background: rgba(0,212,255,0.15); color: var(--cyan); border: 1px solid rgba(0,212,255,0.3); } | |
| .badge-red { background: rgba(255,51,51,0.15); color: var(--red); border: 1px solid rgba(255,51,51,0.3); } | |
| .badge-amber { background: rgba(255,176,0,0.15); color: var(--amber); border: 1px solid rgba(255,176,0,0.3); } | |
| .badge-magenta { background: rgba(255,68,170,0.15); color: var(--magenta); border: 1px solid rgba(255,68,170,0.3); } | |
| .badge-blue { background: rgba(68,136,255,0.15); color: var(--blue); border: 1px solid rgba(68,136,255,0.3); } | |
| /* Pros/cons */ | |
| .pros-cons { | |
| display: grid; | |
| grid-template-columns: 1fr 1fr; | |
| gap: 1rem; | |
| margin: 1rem 0; | |
| } | |
| .pros h4 { color: var(--green); } | |
| .cons h4 { color: var(--red); } | |
| .pros ul, .cons ul { | |
| list-style: none; | |
| padding: 0; | |
| } | |
| .pros li::before { content: "✓ "; color: var(--green); } | |
| .cons li::before { content: "✗ "; color: var(--red); } | |
| .pros li, .cons li { | |
| margin: 0.4rem 0; | |
| font-size: 0.75rem; | |
| } | |
| /* Algorithm flow boxes */ | |
| .algo-flow { | |
| background: var(--bg); | |
| border: 1px solid var(--border); | |
| padding: 1rem; | |
| margin: 0.75rem 0; | |
| overflow-x: auto; | |
| } | |
| .algo-flow pre { | |
| color: var(--text); | |
| font-size: 0.72rem; | |
| line-height: 1.5; | |
| margin: 0; | |
| white-space: pre; | |
| } | |
| /* Code blocks */ | |
| .code-block { | |
| background: var(--bg); | |
| border: 1px solid var(--border); | |
| padding: 1rem; | |
| margin: 0.75rem 0; | |
| overflow-x: auto; | |
| font-size: 0.72rem; | |
| line-height: 1.5; | |
| } | |
| code { | |
| background: var(--bg-elevated); | |
| padding: 0.1rem 0.3rem; | |
| font-size: 0.75rem; | |
| color: var(--cyan); | |
| } | |
| /* Review blocks */ | |
| .review-block { | |
| background: var(--bg-card); | |
| border: 1px solid var(--border); | |
| margin: 1rem 0; | |
| padding: 1.25rem; | |
| } | |
| .review-header { | |
| display: flex; | |
| align-items: center; | |
| gap: 0.75rem; | |
| margin-bottom: 0.75rem; | |
| } | |
| .review-badge { | |
| font-size: 0.6rem; | |
| padding: 0.25rem 0.75rem; | |
| text-transform: uppercase; | |
| letter-spacing: 0.1em; | |
| font-weight: 700; | |
| } | |
| .review-badge-gemini { background: var(--blue); color: #fff; } | |
| .review-badge-claude { background: var(--magenta); color: #fff; } | |
| .review-title { | |
| font-size: 0.8rem; | |
| color: var(--text-bright); | |
| } | |
| .review-block p { | |
| margin: 0.5rem 0; | |
| font-size: 0.78rem; | |
| } | |
| /* Two-column layout */ | |
| .two-col { | |
| display: grid; | |
| grid-template-columns: 1fr 1fr; | |
| gap: 1.5rem; | |
| margin: 1rem 0; | |
| } | |
| .col-card { | |
| background: var(--bg-card); | |
| border: 1px solid var(--border); | |
| padding: 1rem 1.25rem; | |
| } | |
| .col-card h4 { | |
| margin-top: 0; | |
| } | |
| .col-card ul { | |
| list-style: none; | |
| padding: 0; | |
| } | |
| .col-card li { | |
| margin: 0.4rem 0; | |
| font-size: 0.75rem; | |
| } | |
| /* Algorithm catalog cards */ | |
| .algo-card { | |
| background: var(--bg-card); | |
| border: 1px solid var(--border); | |
| margin: 1rem 0; | |
| padding: 1.25rem; | |
| } | |
| .algo-card-header { | |
| display: flex; | |
| align-items: center; | |
| gap: 0.75rem; | |
| margin-bottom: 0.5rem; | |
| } | |
| .algo-card-title { | |
| font-size: 0.85rem; | |
| color: var(--text-bright); | |
| font-weight: 600; | |
| } | |
| .algo-card p { | |
| margin: 0.4rem 0; | |
| font-size: 0.75rem; | |
| } | |
| .applicability { | |
| margin-top: 0.5rem; | |
| font-size: 0.72rem; | |
| color: var(--text-dim); | |
| } | |
| .applicability strong { color: var(--cyan); } | |
| .tag { | |
| display: inline-block; | |
| font-size: 0.6rem; | |
| padding: 0.1rem 0.4rem; | |
| margin: 0.15rem 0.1rem; | |
| border-radius: 2px; | |
| background: rgba(0,212,255,0.1); | |
| color: var(--cyan); | |
| border: 1px solid rgba(0,212,255,0.2); | |
| } | |
| .tag-red { | |
| background: rgba(255,51,51,0.1); | |
| color: var(--red); | |
| border-color: rgba(255,51,51,0.2); | |
| } | |
| /* Gemini validation box */ | |
| .validation-box { | |
| background: var(--bg-card); | |
| border: 1px solid var(--red); | |
| margin: 0.75rem 0; | |
| padding: 1rem 1.25rem; | |
| } | |
| .validation-box .review-badge { | |
| margin-bottom: 0.5rem; | |
| display: inline-block; | |
| } | |
| /* Timeline / Roadmap */ | |
| .timeline { | |
| margin: 1.5rem 0; | |
| padding-left: 1.5rem; | |
| border-left: 2px solid var(--border); | |
| } | |
| .timeline-phase { | |
| font-size: 0.85rem; | |
| font-weight: 600; | |
| margin: 1.5rem 0 0.5rem; | |
| position: relative; | |
| } | |
| .timeline-phase::before { | |
| content: ''; | |
| position: absolute; | |
| left: -1.85rem; | |
| top: 0.35rem; | |
| width: 10px; | |
| height: 10px; | |
| border-radius: 50%; | |
| background: var(--border-bright); | |
| border: 2px solid var(--bg); | |
| } | |
| .timeline p, .timeline ul { | |
| font-size: 0.78rem; | |
| margin: 0.4rem 0; | |
| } | |
| .timeline ul { | |
| list-style: none; | |
| padding: 0; | |
| } | |
| .timeline li::before { | |
| content: "→ "; | |
| color: var(--text-dim); | |
| } | |
| .ref { | |
| display: inline-block; | |
| font-size: 0.6rem; | |
| padding: 0.1rem 0.4rem; | |
| background: rgba(0,212,255,0.08); | |
| color: var(--cyan); | |
| border: 1px solid rgba(0,212,255,0.15); | |
| margin: 0.1rem; | |
| } | |
| /* Architecture diagram */ | |
| .arch-diagram { | |
| background: var(--bg-card); | |
| border: 1px solid var(--border); | |
| padding: 1.5rem; | |
| margin: 1.5rem 0; | |
| overflow-x: auto; | |
| } | |
| .arch-row { | |
| display: flex; | |
| gap: 0.75rem; | |
| margin: 0.5rem 0; | |
| align-items: stretch; | |
| } | |
| .arch-box { | |
| background: var(--bg); | |
| border: 1px solid var(--border); | |
| padding: 0.75rem 1rem; | |
| flex: 1; | |
| font-size: 0.7rem; | |
| } | |
| .arch-box-title { | |
| color: var(--amber); | |
| font-weight: 600; | |
| font-size: 0.72rem; | |
| text-transform: uppercase; | |
| margin-bottom: 0.3rem; | |
| } | |
| .arch-box-cyan .arch-box-title { color: var(--cyan); } | |
| .arch-box-green .arch-box-title { color: var(--green); } | |
| .arch-box-red .arch-box-title { color: var(--red); } | |
| .arch-box-magenta .arch-box-title { color: var(--magenta); } | |
| .arch-full { | |
| background: var(--bg-elevated); | |
| border: 1px solid var(--border); | |
| padding: 0.75rem 1rem; | |
| font-size: 0.7rem; | |
| margin: 0.5rem 0; | |
| text-align: center; | |
| } | |
| .arch-label { | |
| text-align: center; | |
| color: var(--text-dim); | |
| font-size: 0.65rem; | |
| margin: 0.25rem 0; | |
| } | |
| /* Fuel gauge */ | |
| .fuel-bar { | |
| display: flex; | |
| height: 1.5rem; | |
| margin: 0.75rem 0; | |
| border: 1px solid var(--border); | |
| overflow: hidden; | |
| font-size: 0.6rem; | |
| } | |
| .fuel-segment { | |
| display: flex; | |
| align-items: center; | |
| justify-content: center; | |
| color: #000; | |
| font-weight: 600; | |
| letter-spacing: 0.05em; | |
| } | |
| /* Sub-agent lib cards */ | |
| .lib-card { | |
| background: var(--bg-card); | |
| border: 1px solid var(--border); | |
| padding: 1rem 1.25rem; | |
| margin: 0.75rem 0; | |
| } | |
| .lib-card-header { | |
| display: flex; | |
| align-items: center; | |
| gap: 0.75rem; | |
| margin-bottom: 0.5rem; | |
| } | |
| .lib-card-title { | |
| font-size: 0.85rem; | |
| font-weight: 600; | |
| color: var(--text-bright); | |
| } | |
| /* Footer */ | |
| footer { | |
| margin-top: 3rem; | |
| padding-top: 1.5rem; | |
| border-top: 1px solid var(--border); | |
| color: var(--text-dim); | |
| font-size: 0.65rem; | |
| line-height: 1.8; | |
| text-align: center; | |
| } | |
| footer strong { color: var(--text); } | |
| /* Data structures */ | |
| .data-struct { | |
| background: var(--bg); | |
| border: 1px solid var(--border); | |
| padding: 1rem; | |
| margin: 0.75rem 0; | |
| font-size: 0.72rem; | |
| overflow-x: auto; | |
| } | |
| .data-struct-title { | |
| color: var(--amber); | |
| font-size: 0.7rem; | |
| text-transform: uppercase; | |
| letter-spacing: 0.1em; | |
| margin-bottom: 0.5rem; | |
| } | |
| /* S6 key differences table */ | |
| .diff-table td:first-child { | |
| color: var(--amber); | |
| font-weight: 600; | |
| white-space: nowrap; | |
| } | |
| @media (max-width: 768px) { | |
| body { padding: 1rem; font-size: 0.75rem; } | |
| .metrics { grid-template-columns: repeat(2, 1fr); } | |
| .pros-cons { grid-template-columns: 1fr; } | |
| .two-col { grid-template-columns: 1fr; } | |
| .arch-row { flex-direction: column; } | |
| } | |
| </style> | |
| </head> | |
| <body> | |
| <h1>Context Exhaustion Recovery</h1> | |
| <p style="color:var(--text-dim); font-size:0.85rem; margin-bottom:0.25rem;">Station Continuity Architecture — Issue #21 / #42</p> | |
| <div class="doc-meta"> | |
| <span>DOCUMENT: FND-042-ANALYSIS</span> | |
| <span>DATE: 2026-03-09</span> | |
| <span>STATUS: DISCOVERY</span> | |
| <span>CLASSIFICATION: ARCHITECTURE</span> | |
| </div> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <!-- SECTION 00: EXECUTIVE SUMMARY --> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <h2><span class="section-num">00</span>Executive Summary</h2> | |
| <p><strong>Problem:</strong> When a station (Claude Code, 200K context) exhausts its context window after 1+ hours of work, all accumulated reasoning and progress is lost. The supervisor (Gemini, 1M context) must decide how to recover — ranging from simply re-dispatching with a smaller task (zero implementation) to building sophisticated continuation machinery. This document presents <strong>8 architectural options</strong> with tradeoffs, informed by 8 industry algorithms, 60+ research papers, and multi-model validation. Solutions range from ~0 LOC (task scoping) to ~350 LOC (full PCW with LLM summarization). Each has different cost, complexity, and quality profiles. All recovery solutions operate supervisor-side — stations are black-box CLI processes communicating via stdin/stdout JSON.</p> | |
| <div class="metrics"> | |
| <div class="metric"> | |
| <span class="metric-value">200K</span> | |
| <span class="metric-label">Station Context</span> | |
| <span class="metric-sub">tokens (bottleneck)</span> | |
| </div> | |
| <div class="metric"> | |
| <span class="metric-value">1M</span> | |
| <span class="metric-label">Supervisor Context</span> | |
| <span class="metric-sub">tokens (not bottleneck)</span> | |
| </div> | |
| <div class="metric"> | |
| <span class="metric-value">8</span> | |
| <span class="metric-label">Algorithms Analyzed</span> | |
| <span class="metric-sub">from industry</span> | |
| </div> | |
| <div class="metric"> | |
| <span class="metric-value">60+</span> | |
| <span class="metric-label">Papers Surveyed</span> | |
| <span class="metric-sub">in #40 research</span> | |
| </div> | |
| <div class="metric"> | |
| <span class="metric-value">8</span> | |
| <span class="metric-label">Solutions Evaluated</span> | |
| <span class="metric-sub">architecture options</span> | |
| </div> | |
| </div> | |
| <div class="boundary-box"> | |
| Stations are BLACK BOX external CLI processes. Each station is a Claude Code CLI binary running as a separate OS process (<code style="background:none;color:var(--red)">exec.Command</code>). Foundry communicates with it via stdin/stdout JSON — nothing else. We cannot access, read, or modify the station's internal context window, token arrays, or memory. | |
| </div> | |
| <p style="font-size:0.78rem; margin-top:1rem;">Every solution in this document operates <strong>OUTSIDE</strong> the station — in the Go supervisor process (<code>processManager</code> in <code>internal/agent/agentrun.go</code>). When this document discusses "compaction", "sliding window", or "summarization", it refers to operations on our <strong>supervisor-side mirror</strong> of the station's conversation, NOT on the station's internal state.</p> | |
| <p style="font-size:0.78rem; margin-top:0.75rem;"><strong>Data source:</strong> The handler callback in <code>newStationTool()</code> intercepts every message from the station as agentrun parses stdout JSON. This is the ONLY input to all recovery mechanisms. <strong>Recovery method:</strong> Kill the station process, start a fresh one, inject a continuation prompt built from externally-captured data. The station never knows it was replaced.</p> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <!-- SECTION 01: DECISION CONTEXT --> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <h2><span class="section-num">01</span>Decision Context — Constraints, Costs, and Unknowns</h2> | |
| <p>Before evaluating solutions, the reader needs to understand the constraints that shape the decision space, the costs involved, and the unknowns that remain. This section provides the information needed to make an informed judgment about which solution (or combination of solutions) is right.</p> | |
| <h3>Hard Constraints</h3> | |
| <table> | |
| <tr> | |
| <th>Constraint</th> | |
| <th>Value</th> | |
| <th>Source</th> | |
| <th>Impact</th> | |
| </tr> | |
| <tr> | |
| <td>Station context window</td> | |
| <td>200K tokens</td> | |
| <td>Claude Code CLI / Anthropic API</td> | |
| <td>This is the bottleneck. Cannot be increased without Anthropic changing it.</td> | |
| </tr> | |
| <tr> | |
| <td>Supervisor context window</td> | |
| <td>1M tokens</td> | |
| <td>Gemini 3.1 Pro via ADK</td> | |
| <td>NOT a bottleneck for normal use. Becomes relevant only in 100+ turn sessions.</td> | |
| </tr> | |
| <tr> | |
| <td>Station is a black box</td> | |
| <td>CLI process via exec.Command</td> | |
| <td>Architecture (agentrun)</td> | |
| <td>Cannot read/modify station internals. Communication: stdin/stdout JSON only.</td> | |
| </tr> | |
| <tr> | |
| <td>Observability: handler callback</td> | |
| <td>Sees tool calls, thinking, results, errors</td> | |
| <td>agentrun message types</td> | |
| <td>Rich external signal, but not the same as internal model state.</td> | |
| </tr> | |
| <tr> | |
| <td>Fuel gauge accuracy</td> | |
| <td>ContextUsedTokens on MessageResult only</td> | |
| <td>agentrun v0.3.0</td> | |
| <td>Usage data arrives at end of turn, not continuously. Mid-turn exhaustion is a blind spot.</td> | |
| </tr> | |
| <tr> | |
| <td>Existing infra: runOneShot()</td> | |
| <td>Separate fresh LLM call (already used for title generation)</td> | |
| <td>internal/agent/agent.go</td> | |
| <td>Available but miscalibrated — configured for <code>titleMaxOutputTokens = 40</code>. Summarization needs 500-2000 output tokens, different prompts, different error handling. Not drop-in.</td> | |
| </tr> | |
| <tr> | |
| <td><strong>Single build station</strong></td> | |
| <td>One build station active per session</td> | |
| <td>Architecture (processManager)</td> | |
| <td><strong>Critical:</strong> Recovery is always sequential. No background pre-warming. Supervisor already sits in the dispatch loop between turns — it IS the natural recovery layer.</td> | |
| </tr> | |
| <tr> | |
| <td>Supervisor has no size awareness</td> | |
| <td>Decomposes by function (station routing), not by size (fitting in 200K)</td> | |
| <td>coder.md.tpl, stationInput</td> | |
| <td><strong>S0 gap:</strong> System prompt says "decompose tasks into station assignments" meaning draft→build→review routing, not task sizing. <code>stationInput</code> is <code>{ task: string }</code> — no token budget, no sizing metadata. Supervisor has zero awareness of the 200K station limit. S0 requires teaching new behavior, not leveraging existing capability.</td> | |
| </tr> | |
| <tr> | |
| <td>No structured exhaustion signal</td> | |
| <td>RunTurn() error doesn't distinguish context exhaustion from crashes</td> | |
| <td>agentrun / handler callback</td> | |
| <td><strong>Prerequisite:</strong> Cannot build any recovery without detecting WHY the station died. <code>StopMaxTokens</code> detection path is undefined.</td> | |
| </tr> | |
| <tr> | |
| <td>CLI JSON truncation bug</td> | |
| <td>Stdout truncated at 4K/6K/8K/16K char boundaries</td> | |
| <td><a href="https://github.com/anthropics/claude-code/issues/2904">claude-code#2904</a></td> | |
| <td>Breaks ALL handler-based solutions. Truncated JSON causes parser failures, corrupts S1 buffer, blinds S7 fuel gauge.</td> | |
| </tr> | |
| <tr> | |
| <td>CLI zombie process bug</td> | |
| <td>Process hangs indefinitely after emitting final result</td> | |
| <td><a href="https://github.com/anthropics/claude-code/issues/25629">claude-code#25629</a></td> | |
| <td>Frequent replacement risks process table saturation. Requires PGID-based kill + escalating SIGINT→SIGKILL.</td> | |
| </tr> | |
| </table> | |
| <h3>Cost Model</h3> | |
| <table> | |
| <tr> | |
| <th>Operation</th> | |
| <th>Cost</th> | |
| <th>Frequency</th> | |
| <th>Notes</th> | |
| </tr> | |
| <tr> | |
| <td>S0: Task scoping (prompt engineering)</td> | |
| <td class="cell-good">$0 runtime</td> | |
| <td>—</td> | |
| <td>Zero Go code, zero runtime cost. Requires non-trivial prompt engineering — supervisor currently has no size-aware decomposition. May also need context usage feedback in station results.</td> | |
| </tr> | |
| <tr> | |
| <td>S1: Deterministic buffer</td> | |
| <td class="cell-good">~$0</td> | |
| <td>Every handler message</td> | |
| <td>In-memory struct append. CPU cost negligible.</td> | |
| </tr> | |
| <tr> | |
| <td>S6: runOneShot() per compaction</td> | |
| <td class="cell-warn">~$0.01-0.05</td> | |
| <td>Every ~5 turns per station</td> | |
| <td>Small model, short prompt. Estimate 1K-5K input tokens, 500-2K output.</td> | |
| </tr> | |
| <tr> | |
| <td>S6: runOneShot() failure</td> | |
| <td class="cell-warn">Falls back to deterministic</td> | |
| <td>Unknown</td> | |
| <td>If LLM summary fails, PCW degrades gracefully to S1-style buffer.</td> | |
| </tr> | |
| <tr> | |
| <td>Replacement: fresh station start</td> | |
| <td class="cell-warn">~3-5s latency</td> | |
| <td>Per exhaustion event</td> | |
| <td>Claude Code CLI startup time. Operator sees a brief "Starting..." phase.</td> | |
| </tr> | |
| <tr> | |
| <td>Replacement: git state capture</td> | |
| <td class="cell-good">&lt;1s</td> | |
| <td>Per replacement</td> | |
| <td>git diff --stat, git status. Fast on typical repos.</td> | |
| </tr> | |
| </table> | |
| <h3>Unknowns and Open Questions</h3> | |
| <div class="unknown-callout"> | |
| <strong>How often do stations actually exhaust context?</strong> | |
| No empirical data exists yet. In typical use, the build station (act mode, reading + writing files) is most | |
| likely to exhaust after 20-40 tool calls over 1+ hours. Draft/inspect/review (plan mode, read-only) rarely approach | |
| the limit. <strong>If exhaustion is rare (1 in 50 sessions), S0 or S1 may be sufficient. If frequent (1 in 5), S6 is justified.</strong> | |
| Priority should be calibrated to actual frequency — consider shipping S1 first and measuring before investing in S6. | |
| </div> | |
| <div class="unknown-callout"> | |
| <strong>Is task scoping the real root cause?</strong> | |
| If the supervisor dispatches overly broad tasks ("implement the entire feature"), stations exhaust quickly. | |
| Better task scoping ("implement the parser, stop before validation") may reduce exhaustion to near-zero. | |
| However, the supervisor currently has <strong>no concept of context limits</strong> — its "decompose" behavior routes by function | |
| (which station), not by size (fitting in 200K). Teaching size-aware decomposition requires non-trivial prompt engineering | |
| and possibly feedback mechanisms (context usage reported in station results). Not "zero effort" — but still zero Go code. | |
| </div> | |
| <div class="unknown-callout"> | |
| <strong>Is git state sufficient as continuation context?</strong> | |
| The codebase IS the state. A fresh station with <code>git diff --stat</code> + <code>git status</code> + | |
| the original task may be enough context for continuation — especially if the task is well-scoped. The question | |
| is whether the overhead of maintaining a PCW is worth it over a simpler git-first approach. | |
| </div> | |
| <div class="unknown-callout"> | |
| <strong>How good is LLM rolling summarization in practice?</strong> | |
| Amp retired compaction because "recursive summaries distorted reasoning." Our PCW's rolling summaries face the | |
| same risk: summary of summary of summary can drift. Mitigations exist (overlap, protected fields, generation cap), | |
| but their effectiveness is unproven for this use case. <strong>Deterministic approaches (S1) avoid this risk entirely.</strong> | |
| Both external reviewers (Gemini, Claude) independently recommended skipping S6 until data proves S1 is insufficient. | |
| </div> | |
| <div class="unknown-callout"> | |
| <strong>What signal does RunTurn() return on context exhaustion?</strong> | |
| Currently undefined. The handler cannot distinguish "context exhausted" from "network error" or "CLI crash." | |
| <code>agentrun.StopMaxTokens</code> exists but the detection path through the handler is not wired. | |
| <strong>This is the #1 prerequisite — everything else depends on reliably detecting WHY the station died.</strong> | |
| Force exhaustion in a test environment and inspect error/message types before building any recovery. | |
| </div> | |
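Until agentrun exposes a structured signal, the detection path will likely be heuristic: classify the stop from whatever the handler last observed. A sketch of that classification — the `StopMaxTokens` value mirrors the document, but the error strings and the `classifyStop` helper are assumptions, not a confirmed agentrun API:

```go
package main

import "strings"

// StopCause is the structured signal this callout says is missing: WHY the
// station stopped, distinguishing exhaustion from crashes.
type StopCause int

const (
	CauseUnknown StopCause = iota
	CauseContextExhausted
	CauseProcessCrash
)

// classifyStop is a hypothetical heuristic: inspect the last stop reason and
// the process exit error the handler observed. The matched strings are
// illustrative placeholders to be replaced once real exhaustion errors have
// been captured in a test environment.
func classifyStop(lastStopReason, exitErr string) StopCause {
	switch {
	case lastStopReason == "StopMaxTokens",
		strings.Contains(exitErr, "context window"):
		return CauseContextExhausted
	case exitErr != "":
		return CauseProcessCrash
	default:
		return CauseUnknown
	}
}

func main() {}
```

Whatever form the final detection takes, forcing an exhaustion in a disposable repo and recording the exact error/message sequence should come before writing this switch for real.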
| <div class="unknown-callout"> | |
| <strong>What happens to half-finished file edits on mid-turn death?</strong> | |
| The most dangerous practical failure mode. When a station dies mid-turn, files may contain half-written functions, | |
| truncated JSON, incomplete imports. <code>git diff --stat</code> shows changes but cannot indicate completeness. | |
| A fresh station receiving "Modified: validator.go (+23 lines)" has no way to know those 23 lines are incomplete. | |
| <strong>May require pre-turn file snapshots or <code>git stash</code> of current turn's changes before replacement.</strong> | |
| </div> | |
| <div class="unknown-callout"> | |
| <strong>Should the supervisor be the recovery layer instead of processManager?</strong> | |
| With a single build station, the supervisor already sits in the sequential dispatch loop. If the station returns | |
| an error, the supervisor can inspect git state, decompose remaining work, and re-dispatch — using its 1M context. | |
| This makes "transparent replacement" inside the tool closure potentially unnecessary. The key implementation | |
| becomes signal detection (structured NEEDS_CONTINUATION result) rather than in-tool replacement machinery. | |
| </div> | |
| <h3>Decision Framework</h3> | |
| <p>Choose your approach based on your risk tolerance and on how often stations actually exhaust their context:</p> | |
| <table> | |
| <tr> | |
| <th>If...</th> | |
| <th>Then consider...</th> | |
| <th>Why</th> | |
| </tr> | |
| <tr> | |
| <td>Exhaustion is rare and tasks can be scoped smaller</td> | |
| <td class="cell-good"><strong>S0: Task scoping</strong></td> | |
| <td>Zero Go code, zero runtime cost. Requires prompt engineering — supervisor currently lacks size awareness.</td> | |
| </tr> | |
| <tr> | |
| <td>Exhaustion happens but simple continuation is enough</td> | |
| <td class="cell-good"><strong>S1: Deterministic buffer</strong></td> | |
| <td>~200 LOC, zero LLM cost, predictable. Ship in days.</td> | |
| </tr> | |
| <tr> | |
| <td>Continuation quality matters and you want proven patterns</td> | |
| <td class="cell-warn"><strong>S1 → S6: Incremental enhancement</strong></td> | |
| <td>Start with S1, add LLM summarization when data shows it's needed.</td> | |
| </tr> | |
| <tr> | |
| <td>Exhaustion is frequent and tasks are inherently large</td> | |
| <td class="cell-warn"><strong>S6: Full PCW</strong></td> | |
| <td>~350 LOC, LLM cost per cycle, but highest quality continuation.</td> | |
| </tr> | |
| <tr> | |
| <td>You want supervisor control over recovery strategy</td> | |
| <td><strong>S5: Supervisor handoff</strong></td> | |
| <td>Supervisor can adjust (smaller tasks, different station). Explicit, debuggable.</td> | |
| </tr> | |
| <tr> | |
| <td>You want the station to wrap up cleanly before replacement</td> | |
| <td><strong>S7: Graceful handoff</strong></td> | |
| <td>Warn at threshold, let station finish current work, then replace.</td> | |
| </tr> | |
| </table> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <!-- SECTION 02: HOW FOUNDRY WORKS --> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <h2><span class="section-num">02</span>How Foundry Works</h2> | |
| <h3>What Is Foundry</h3> | |
| <p>Foundry is a terminal application (TUI) that orchestrates autonomous software development. A Supervisor LLM (Gemini, 1M context) talks to the user, reasons about what to do, and delegates work to stations — external agent CLI processes that do the actual coding, reviewing, and testing. The Supervisor never writes code itself; it only dispatches and interprets results.</p> | |
| <h3>What Is a Station</h3> | |
| <p>A station is a real OS process — specifically a Claude Code CLI binary launched via <code>exec.Command</code>. Foundry has four stations: <strong>draft</strong> (plan a spec), <strong>build</strong> (write code), <strong>inspect</strong> (validate quality), <strong>review</strong> (code review). Each runs as a separate process with its own 200K token context window. When the Supervisor decides to delegate, it makes an ADK tool call which triggers the Go code in <code>newStationTool()</code> to spawn or reuse a station process.</p> | |
| <h3>How Communication Works</h3> | |
| <p>The agentrun library manages the station process lifecycle. Communication is via stdin/stdout JSON: Foundry sends prompts to the station's stdin using <code>proc.Send()</code>, and the station streams responses to stdout as newline-delimited JSON messages. The agentrun library parses each JSON line into typed Go messages: <code>MessageText</code> (with tool call data), <code>MessageThinking</code> (reasoning), <code>MessageResult</code> (final result + token usage), <code>MessageError</code>. A handler callback receives every parsed message in the <code>newStationTool()</code> closure — this Go function running in the supervisor process sees everything the station does in real-time.</p> | |
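The newline-delimited JSON stream described above can be sketched as follows. The `wireMsg` field names are hypothetical — agentrun's real message schema may differ — but the shape of the loop (one JSON object per line, each handed to the handler) is the pattern in question, and it shows why the truncation bug (claude-code#2904) is fatal: a cut line is simply unparseable.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// wireMsg mirrors the newline-delimited JSON frames described above.
// Hypothetical field names; agentrun's real schema may differ.
type wireMsg struct {
	Type              string `json:"type"` // "text" | "thinking" | "result" | "error"
	Content           string `json:"content"`
	ContextUsedTokens int    `json:"context_used_tokens,omitempty"`
}

// decodeStream parses one JSON object per line, the way a handler sees the
// station's stdout. A truncated line (claude-code#2904) surfaces as an error
// rather than a silent corrupt message.
func decodeStream(stdout string, handler func(wireMsg)) error {
	sc := bufio.NewScanner(strings.NewReader(stdout))
	for sc.Scan() {
		var m wireMsg
		if err := json.Unmarshal(sc.Bytes(), &m); err != nil {
			return fmt.Errorf("truncated or malformed frame: %w", err)
		}
		handler(m)
	}
	return sc.Err()
}

func main() {
	stream := `{"type":"thinking","content":"plan the edit"}
{"type":"result","content":"done","context_used_tokens":184000}`
	_ = decodeStream(stream, func(m wireMsg) {
		fmt.Println(m.Type, m.ContextUsedTokens)
	})
}
```

Note that `ContextUsedTokens` arrives only on the final result frame — which is exactly the mid-turn blind spot the fuel-gauge constraint describes.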
| <h3>The Data Flow</h3> | |
| <div class="algo-flow"><pre> | |
| User → Supervisor (Gemini, ADK runner) → ADK tool call: "run station build" | |
| | | |
| v | |
| newStationTool() — Go code in supervisor process | |
| | | |
| +-- processManager.getOrStart() → spawns Claude Code CLI via exec.Command | |
| | OR reuses existing process | |
| | | |
| +-- agentrun.RunTurn(process, prompt, handler) | |
| | | | |
| | +-- proc.Send(prompt) → writes JSON to station's STDIN | |
| | | | |
| | +-- proc.Output() channel → reads JSON from station's STDOUT | |
| | | | |
| | +-- MessageText → handler sees tool calls (Read, Edit, Bash...) | |
| | +-- MessageThinking → handler sees reasoning content | |
| | +-- MessageError → handler sees errors | |
| | +-- MessageResult → handler sees final result + ContextUsedTokens | |
| | | |
| +-- handler callback captures EVERY message into processManager state | |
| | (activity log, fuel gauge, process info → published to UI via pubsub) | |
| | | |
| +-- Returns result to Supervisor → Supervisor interprets and may dispatch again</pre></div> | |
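<p>Condensed to one function shape, the dispatch path looks like the sketch below. All names are illustrative stand-ins for the real getOrStart/RunTurn plumbing:</p>

```go
package main

import "fmt"

// process is a stand-in for an agentrun station process handle.
type process struct{ station string }

// getOrStart stands in for processManager.getOrStart(): reuse a live
// process for the station, or spawn a fresh Claude Code CLI.
func getOrStart(station string) *process { return &process{station: station} }

// runTurn stands in for agentrun.RunTurn(): send the prompt over stdin,
// drain the stdout message stream through the handler, return the result.
func runTurn(p *process, prompt string) (string, error) {
	return fmt.Sprintf("[%s] completed: %s", p.station, prompt), nil
}

// runStation is the shape of the Go tool the Supervisor's ADK tool call
// lands in (the real newStationTool() closure also wires the handler,
// fuel gauge, and pubsub updates).
func runStation(station, task string) (string, error) {
	proc := getOrStart(station)
	return runTurn(proc, task)
}

func main() {
	res, err := runStation("build", "implement JWT token parser")
	if err != nil {
		panic(err)
	}
	fmt.Println(res) // prints "[build] completed: implement JWT token parser"
}
```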
| <h3>Architecture Diagram</h3> | |
| <div class="arch-diagram"> | |
| <div class="arch-full" style="border-color:var(--text-dim); color:var(--text-dim);">OPERATOR</div> | |
| <div class="arch-full" style="border-color:var(--amber); color:var(--amber);"> | |
| <div style="font-weight:700;">SUPERVISOR (Gemini via ADK)</div> | |
| <div style="font-size:0.65rem; color:var(--text-dim);">1M CONTEXT · ROLE-SEALED · DELEGATES ONLY</div> | |
| <div style="font-size:0.65rem; color:var(--text-dim);">runner.Run() → processEvent() → pubsub → UI</div> | |
| <div style="margin-top:0.5rem; display:flex; gap:0.5rem; justify-content:center;"> | |
| <span class="tag">STEERING PLUGIN</span> | |
| <span class="tag">NOTIFY PLUGIN</span> | |
| <span class="tag">ADK ARTIFACTS<br><span style="font-size:0.5rem">SQLite-backed<br>cross-station</span></span> | |
| </div> | |
| </div> | |
| <div class="arch-label">tool call tool call tool call tool call</div> | |
| <div class="arch-row"> | |
| <div class="arch-box"> | |
| <div class="arch-box-title">DRAFT</div> | |
| <div>Claude Code · Plan Mode</div> | |
| <div style="color:var(--text-dim);">200K context window</div> | |
| <div style="color:var(--green);">70% · 140K tokens</div> | |
| </div> | |
| <div class="arch-box"> | |
| <div class="arch-box-title">INSPECT</div> | |
| <div>Claude Code · Plan Mode</div> | |
| <div style="color:var(--text-dim);">200K context window</div> | |
| <div style="color:var(--green);">30% · 60K tokens</div> | |
| </div> | |
| <div class="arch-box"> | |
| <div class="arch-box-title" style="color:var(--amber);">BUILD</div> | |
| <div>Claude Code · Act Mode</div> | |
| <div style="color:var(--text-dim);">200K context window</div> | |
| <div style="color:var(--amber);">↺ 90% · 180K tokens</div> | |
| </div> | |
| <div class="arch-box"> | |
| <div class="arch-box-title">REVIEW</div> | |
| <div>Claude Code · Plan Mode</div> | |
| <div style="color:var(--text-dim);">200K context window</div> | |
| <div style="color:var(--green);">10% · 20K tokens</div> | |
| </div> | |
| </div> | |
| <div class="arch-full" style="border-color:var(--cyan); color:var(--cyan); font-size:0.68rem;"> | |
| <strong>processManager</strong> (per station) — getOrStart() · stop() · handleUsage()<br> | |
| <span style="color:var(--text-dim);">Fuel gauge · Activity log · Resume ID · Replacement logic</span> | |
| </div> | |
| <div class="arch-full" style="border-color:var(--magenta); color:var(--magenta); font-size:0.68rem;"> | |
| <strong>PSEUDO CONTEXT WINDOW</strong><br> | |
| <span style="color:var(--text-dim);">SUPERVISOR-SIDE data structure · Built from handler callback messages<br> | |
| Compaction runs HERE (Go), not in station</span> | |
| </div> | |
| <div class="arch-full" style="border-color:var(--cyan); font-size:0.68rem;"> | |
| <span style="color:var(--cyan);"><strong>agentrun</strong></span> <span style="color:var(--text-dim);">— Engine · Process · RunTurn() · Send() · Output() · Message types</span><br> | |
| <span style="color:var(--text-dim);">Claude Streamer + InputFormatter · ContextUsedTokens · StopMaxTokens · ResumeID</span> | |
| </div> | |
| <div class="arch-full" style="border-color:var(--red); color:var(--red); font-size:0.68rem;"> | |
| <strong>External Claude Code CLI Processes</strong> — stdin/stdout JSON · 200K limit · opaque internal context<br> | |
| <span style="font-weight:700;">BLACK BOX — we CANNOT modify anything inside this layer</span> | |
| </div> | |
| </div> | |
| <h3>What We Control vs. Don't Control</h3> | |
| <div class="two-col"> | |
| <div class="col-card" style="border-color:var(--green);"> | |
| <h4 style="color:var(--green);">✓ We Control</h4> | |
| <p style="color:var(--text-dim); font-size:0.65rem; text-transform:uppercase; letter-spacing:0.1em;">Full Access</p> | |
| <ul> | |
| <li><strong>Process lifecycle</strong> — start, stop, replace via processManager</li> | |
| <li><strong>Every message</strong> — handler callback sees tool calls, thinking/reasoning, errors, results</li> | |
| <li><strong>Fuel gauge</strong> — ContextUsedTokens on MessageResult</li> | |
| <li><strong>Prompt injection</strong> — proc.Send() via stdin JSON protocol</li> | |
| <li><strong>ADK artifacts</strong> — SQLite-backed, cross-station, persisted</li> | |
| <li><strong>Resume ID</strong> — clear for fresh start, pass for continuity</li> | |
| <li><strong>Supervisor notification</strong> — notify + steering plugins</li> | |
| <li><strong>Git state</strong> — diff, status, stash from Go</li> | |
| </ul> | |
| </div> | |
| <div class="col-card" style="border-color:var(--red);"> | |
| <h4 style="color:var(--red);">✗ We Don't Control</h4> | |
| <p style="color:var(--text-dim); font-size:0.65rem; text-transform:uppercase; letter-spacing:0.1em;">Black Box</p> | |
| <ul> | |
| <li><strong>Station's internal context</strong> — can't read or modify token arrays</li> | |
| <li><strong>Station's internal state</strong> — see tool calls AND thinking (MessageThinking), but not what the model "remembers" from earlier context</li> | |
| <li><strong>/compact quality</strong> — opaque eviction, may drop critical context</li> | |
| <li><strong>Claude Code's internal API calls</strong> — abstracted by agentrun</li> | |
| <li><strong>Token counting mid-turn</strong> — usage only on MessageResult</li> | |
| <li><strong>What the station "remembers"</strong> — context rot is internal</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <!-- SECTION 03: INDUSTRY ALGORITHM CATALOG --> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <h2><span class="section-num">03</span>Industry Algorithm Catalog</h2> | |
| <div class="boundary-box" style="border-color:var(--cyan); color:var(--cyan); font-size:0.72rem;"> | |
| REFERENCE DESIGNS — these algorithms run inside their respective tools, NOT inside Foundry stations | |
| </div> | |
| <p>Eight context management algorithms from production systems, research papers, and open-source projects — studied as reference designs for our supervisor-side mirror. Each algorithm below manages its tool's OWN internal context. We adapt their patterns (sliding window, per-type pruning, rolling summaries) to operate on our external mirror of the station conversation, not on the station itself.</p> | |
| <!-- Gemini CLI --> | |
| <div class="algo-card"> | |
| <div class="algo-card-header"> | |
| <span class="algo-card-title">Gemini CLI Compaction</span> | |
| <span class="card-badge badge-cyan">2-Layer</span> | |
| </div> | |
| <p style="color:var(--text-dim);">Source: google-gemini/gemini-cli (TypeScript, Apache 2.0)</p> | |
| <p><strong>Algorithm:</strong> Tool Output Masking (continuous, every turn: protect newest 50K, mask older with head/tail 250ch previews) + Chat Compression (at 50% capacity: split 70/30, LLM generates <code>&lt;state_snapshot&gt;</code> XML, self-verification probe, anchored iteration).</p> | |
| <p><strong>Key Innovation:</strong> Two-pass summarization with self-verification. Structured XML output with rigid schema (overall_goal, active_constraints, key_knowledge, artifact_trail, file_system_state, recent_actions, task_state).</p> | |
| <div class="applicability"><strong>Applicability:</strong> The <code>state_snapshot</code> schema and anchored iteration are directly applicable to continuation prompts. The masking pattern maps to our context buffer trimming. <span class="tag">continuation prompt schema</span></div> | |
| </div> | |
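<p>The masking half of this scheme is easy to make concrete. A sketch assuming the 250-character head/tail preview described above (the marker text in the middle is invented):</p>

```go
package main

import "fmt"

// maskOutput replaces the middle of an old tool output with a marker,
// keeping head/tail previews, in the spirit of Gemini CLI's tool output
// masking. Outputs at or under 2*preview bytes pass through untouched.
func maskOutput(out string, preview int) string {
	if len(out) <= 2*preview {
		return out
	}
	omitted := len(out) - 2*preview
	return fmt.Sprintf("%s…[%d chars masked]…%s", out[:preview], omitted, out[len(out)-preview:])
}

func main() {
	long := ""
	for i := 0; i < 100; i++ {
		long += "0123456789"
	}
	masked := maskOutput(long, 250)
	fmt.Println(len(long), len(masked) < len(long)) // prints "1000 true"
}
```

<p>In the real scheme only outputs outside the protected newest-50K window get masked; the function above is the per-output step.</p>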
| <!-- ADK Python --> | |
| <div class="algo-card"> | |
| <div class="algo-card-header"> | |
| <span class="algo-card-title">ADK Python Compaction</span> | |
| <span class="card-badge badge-cyan">2-Algorithm</span> | |
| </div> | |
| <p style="color:var(--text-dim);">Source: adk-python/src/google/adk/apps/compaction.py</p> | |
| <p><strong>Sliding Window:</strong> Count unique invocation_ids since last compaction. At interval: LlmEventSummarizer generates summary, appended as event. Content builder replaces raw events with summary.</p> | |
| <p><strong>Token Threshold:</strong> Check prompt_token_count from usage metadata. Split index protects function call/response pairs. Rolling summary: previous compaction prepended as seed.</p> | |
| <p><strong>Go Gap:</strong> ADK Go has no compaction as of v0.6.0 (Mar 2026). Google plans official support (#298, collaborator confirmed "early March"). Community PR #300 is open but in a conflicting merge state with unresolved critical review issues. Community lib <code>achetronic/adk-utils-go</code> (31★, Apache-2.0) works TODAY as a pure ADK plugin — recommended interim.</p> | |
| <div class="applicability"><strong>Applicability:</strong> Manages agent's OWN internal conversation history. For Foundry: we adapt these algorithms to run on our supervisor-side mirror (the PCW), not on the station itself. Sliding window, token estimation, summarization prompt, tool pair safety — all portable to the PCW as reference designs. <span class="tag">algorithm reference for our supervisor-side mirror</span></div> | |
| </div> | |
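<p>The tool pair safety idea is worth pinning down: a compaction boundary must never leave a function call in the summarized region while its response stays in the kept region. A simplified backward-walking version (the real safeSplitIndex in adk-utils-go walks bidirectionally):</p>

```go
package main

import "fmt"

// safeSplit adjusts a desired split index over an event list so the kept
// region (events[split:]) never begins with an orphaned tool response.
// kinds is a simplified event stream: "text", "call", or "response".
func safeSplit(kinds []string, want int) int {
	i := want
	for i > 0 && kinds[i] == "response" {
		i-- // pull the matching call into the kept region too
	}
	return i
}

func main() {
	kinds := []string{"text", "call", "response", "text", "call", "response"}
	fmt.Println(safeSplit(kinds, 2)) // prints "1": the call/response pair stays together
}
```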
| <!-- CMV --> | |
| <div class="algo-card"> | |
| <div class="algo-card-header"> | |
| <span class="algo-card-title">CMV Deterministic Trimming</span> | |
| <span class="card-badge badge-cyan">3-Pass</span> | |
| </div> | |
| <p style="color:var(--text-dim);">Source: arXiv 2602.22402 · CosmoNaught/claude-code-cmv</p> | |
| <p><strong>Algorithm:</strong> 3-pass deterministic: (1) Remove tool outputs beyond window, (2) Remove tool inputs beyond window, (3) Remove complete tool call/response pairs. No LLM needed.</p> | |
| <p><strong>Results:</strong> Up to 86% reduction for tool-heavy sessions. Evaluated on 76 real Claude Code sessions.</p> | |
| <div class="applicability"><strong>Applicability:</strong> Maps to our context buffer trimming. Recent N tool calls: full detail. Older: name + 1-line summary. <span class="tag">buffer trimming</span></div> | |
| </div> | |
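<p>The three passes are deterministic enough to sketch directly. Token costs are approximated by character counts here; the real implementation's accounting differs:</p>

```go
package main

import "fmt"

// ToolCall is a minimal record of one tool invocation (illustrative fields).
type ToolCall struct {
	Name, Input, Output string
	Kept                bool
}

func cost(calls []ToolCall) int {
	total := 0
	for _, c := range calls {
		if c.Kept {
			total += len(c.Name) + len(c.Input) + len(c.Output)
		}
	}
	return total
}

// trim applies CMV's three passes to calls older than `window`, stopping as
// soon as the approximate cost fits the budget: (1) drop old outputs,
// (2) drop old inputs, (3) drop the old call/response pairs entirely.
func trim(calls []ToolCall, window, budget int) []ToolCall {
	cut := len(calls) - window
	if cut < 0 {
		cut = 0
	}
	for pass := 1; pass <= 3 && cost(calls) > budget; pass++ {
		for i := 0; i < cut; i++ {
			switch pass {
			case 1:
				calls[i].Output = ""
			case 2:
				calls[i].Input = ""
			case 3:
				calls[i].Kept = false
			}
		}
	}
	return calls
}

func main() {
	calls := []ToolCall{
		{Name: "Read", Input: "main.go", Output: "…10KB of file contents…", Kept: true},
		{Name: "Edit", Input: "patch", Output: "ok", Kept: true},
	}
	trimmed := trim(calls, 1, 20)
	fmt.Println(trimmed[0].Output == "", trimmed[1].Output) // prints "true ok"
}
```

<p>No LLM is involved at any stage, which is what makes the 86% reduction figure cheap to realize.</p>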
| <!-- Focus Agent --> | |
| <div class="algo-card"> | |
| <div class="algo-card-header"> | |
| <span class="algo-card-title">Focus Agent Sawtooth</span> | |
| <span class="card-badge badge-amber">Active Pressure</span> | |
| </div> | |
| <p style="color:var(--text-dim);">Source: arXiv 2601.07190</p> | |
| <p><strong>Algorithm:</strong> Sawtooth memory pattern. Every 10-15 tool calls: inject start_focus (save state) → complete_focus (compress and continue). External pressure, not passive.</p> | |
| <p><strong>Critical Finding:</strong> Passive self-compression yields only 6% reduction. Aggressive external prompting required.</p> | |
| <div class="applicability"><strong>Applicability:</strong> Validates external checkpoint injection via proc.Send(). We must FORCE checkpoints, not hope for them. <span class="tag">preemptive handoff</span></div> | |
| </div> | |
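<p>The external pressure pattern reduces to a counter on the supervisor side. A sketch (the interval and the prompt text are invented; the paper's range is every 10-15 tool calls):</p>

```go
package main

import "fmt"

// checkpointer decides when to force a checkpoint by counting tool calls,
// implementing the sawtooth externally: the station never volunteers this.
type checkpointer struct {
	calls, interval int
}

// onToolCall returns true when the caller should inject a checkpoint prompt
// into the station via proc.Send().
func (c *checkpointer) onToolCall() bool {
	c.calls++
	return c.calls%c.interval == 0
}

func main() {
	cp := &checkpointer{interval: 12}
	forced := 0
	for i := 0; i < 30; i++ {
		if cp.onToolCall() {
			forced++ // here: proc.Send("start_focus: save state"), then complete_focus
		}
	}
	fmt.Println(forced) // prints "2": checkpoints forced at calls 12 and 24
}
```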
| <!-- Claude Code 3-Layer --> | |
| <div class="algo-card"> | |
| <div class="algo-card-header"> | |
| <span class="algo-card-title">Claude Code 3-Layer</span> | |
| <span class="card-badge badge-amber">Internal</span> | |
| </div> | |
| <p><strong>Layers:</strong> (1) Microcompaction: hot tail / cold storage for tool outputs. (2) Auto-compaction at ~95% capacity. (3) Manual: /compact with focus parameter.</p> | |
| <p><strong>Post-compaction:</strong> Boundary marker + compressed state + re-read 5 most recent files + todo list + continuation instruction.</p> | |
| <div class="applicability"><strong>Applicability:</strong> We could trigger /compact via proc.Send(), but Gemini 3 Pro analysis recommends AGAINST it — opaque state, unverifiable quality, context rot accumulates. <span class="tag tag-red">skip /compact</span></div> | |
| </div> | |
| <!-- Amp --> | |
| <div class="algo-card"> | |
| <div class="algo-card-header"> | |
| <span class="algo-card-title">Amp (Sourcegraph) Handoff</span> | |
| <span class="card-badge badge-green">Replacement</span> | |
| </div> | |
| <p><strong>Pattern:</strong> /handoff spawns new agent with structured task summary instead of compressing. Retired compaction entirely.</p> | |
| <p><strong>Rationale:</strong> "Recursive summaries distorted earlier reasoning." Reframes exhaustion as a coordination problem, not a compression problem.</p> | |
| <div class="applicability"><strong>Applicability:</strong> Directly validates our replacement architecture. Key insight: don't compress, REPLACE. <span class="tag">core philosophy</span></div> | |
| </div> | |
| <!-- Cline --> | |
| <div class="algo-card"> | |
| <div class="algo-card-header"> | |
| <span class="algo-card-title">Cline new_task Tool</span> | |
| <span class="card-badge badge-green">Structured</span> | |
| </div> | |
| <p><strong>Pattern:</strong> Structured handoff block: Completed Work, Current State, Next Steps, References, Actionable Start. Automatable via .clinerules at configurable context %.</p> | |
| <p><strong>Key Features:</strong> "Failed Approaches" section prevents repeating dead ends. Configurable threshold. Agent writes its own handoff while context is still good.</p> | |
| <div class="applicability"><strong>Applicability:</strong> "Failed Approaches" section critical for our continuation prompt. Auto-trigger at threshold. <span class="tag">continuation prompt schema</span></div> | |
| </div> | |
| <!-- OPENDEV --> | |
| <div class="algo-card"> | |
| <div class="algo-card-header"> | |
| <span class="algo-card-title">OPENDEV 5-Stage Progressive</span> | |
| <span class="card-badge badge-magenta">5-Stage</span> | |
| </div> | |
| <p style="color:var(--text-dim);">Source: arXiv 2603.05344 (Mar 2026) · anomalyco/opencode (TypeScript, Apache 2.0)</p> | |
| <p><strong>Algorithm:</strong> 5-stage progressive compaction: (0) Detect token pressure, (1) Per-type tool summarization (file reads → path only, commands → exit+key lines, search → matched lines), (2) Offload large outputs to temp files, keep references, (3) Agent-aware truncation hints, (4) LLM merge of old observations into summaries. Independent tool output pruning: protect last 40K tokens, prune if >20K recoverable.</p> | |
| <p><strong>Key Innovation:</strong> Event-driven system reminders counter instruction fade-out — targeted guidance injected at iteration milestones, token pressure events, safety violations. Guardrail counters escalate reminder intensity. Addresses context rot at application level.</p> | |
| <p><strong>Also:</strong> Dual-agent (plan/execute with schema-level tool filtering), 5 model roles with fallback chains, priority-ordered prompt composition (5 tiers, 10-95), lazy MCP tool discovery, doom-loop detection.</p> | |
| <div class="applicability"><strong>Applicability:</strong> Per-type tool pruning (stages 1-3, free) directly applicable to PCW before LLM call. Event-driven reminders map to our notify plugin for supervisor anti-drift. Two-phase reduction (cheap prune THEN expensive summarize) improves our PCW efficiency. <span class="tag">per-type pruning</span> <span class="tag">instruction fade-out</span> <span class="tag">supervisor reminders</span></div> | |
| </div> | |
| <h3>Research Consensus</h3> | |
| <table> | |
| <tr> | |
| <th>Finding</th> | |
| <th>Source</th> | |
| <th>Implication for Foundry</th> | |
| </tr> | |
| <tr> | |
| <td>All 18 frontier models degrade with context length</td> | |
| <td><span class="ref">Context Rot</span> <span class="ref">Chroma Research</span></td> | |
| <td>Fresh replacement > degraded compacted context</td> | |
| </tr> | |
| <tr> | |
| <td>Focused ~300 token prompts >> 113K full context dumps</td> | |
| <td><span class="ref">Context Rot</span></td> | |
| <td>Continuation prompt should be ~3-5K tokens of focused state, not a full dump</td> | |
| </tr> | |
| <tr> | |
| <td>Passive self-compression yields only 6%</td> | |
| <td><span class="ref">Focus Agent</span> <span class="ref">arXiv 2601.07190</span></td> | |
| <td>Must FORCE checkpoints externally via proc.Send()</td> | |
| </tr> | |
| <tr> | |
| <td>Can't rely on exhausted agent for its own handoff</td> | |
| <td><span class="ref">Handoff Paradox</span> <span class="ref">zen insight</span></td> | |
| <td>Build snapshot from external signals (buffer + git), not station self-report</td> | |
| </tr> | |
| <tr> | |
| <td>2K focused context agents > 128K monolithic</td> | |
| <td><span class="ref">Graph of Agents</span> <span class="ref">arXiv 2509.21848</span></td> | |
| <td>Validates Foundry's multi-station architecture</td> | |
| </tr> | |
| <tr> | |
| <td>Recursive summaries distort earlier reasoning</td> | |
| <td><span class="ref">Amp</span> <span class="ref">Sourcegraph</span></td> | |
| <td>Skip compaction (/compact). Replace instead.</td> | |
| </tr> | |
| <tr> | |
| <td>Models prematurely shortcut when aware of limits</td> | |
| <td><span class="ref">Devin</span> <span class="ref">"Context Anxiety"</span></td> | |
| <td>Station should NOT know about replacement. Transparent.</td> | |
| </tr> | |
| <tr> | |
| <td>Instruction following degrades in long contexts (instruction fade-out)</td> | |
| <td><span class="ref">OPENDEV</span> <span class="ref">arXiv 2603.05344</span> <span class="ref">Gemini CLI #6474 (P0)</span></td> | |
| <td>Supervisor needs event-driven reminders via notify plugin at N-turn intervals</td> | |
| </tr> | |
| <tr> | |
| <td>Cheap tool pruning before expensive LLM summarization saves tokens</td> | |
| <td><span class="ref">OPENDEV</span> <span class="ref">5-stage progressive</span> <span class="ref">OpenCode</span> <span class="ref">40K protect/20K prune</span></td> | |
| <td>PCW should prune ToolRecord.Output per-type FIRST, then LLM-summarize remainder</td> | |
| </tr> | |
| </table> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <!-- SECTION 04: SOLUTION POOL --> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <h2><span class="section-num">04</span>Solution Pool — 8 Architectures</h2> | |
| <div class="boundary-box"> | |
| ALL RECOVERY SOLUTIONS: SUPERVISOR-SIDE (Go) — STATION IS UNMODIFIED BLACK BOX | |
| </div> | |
| <!-- ─── SOLUTION 0 ─── --> | |
| <div class="solution"> | |
| <div class="solution-header"> | |
| <div class="solution-num">00</div> | |
| <div> | |
| <div class="solution-title">Task Scoping — Prevent Exhaustion at the Source</div> | |
| <div class="solution-subtitle">Teach the supervisor to dispatch smaller, well-scoped tasks that fit within 200K</div> | |
| </div> | |
| <span class="card-badge badge-green">ZERO CODE</span> | |
| </div> | |
| <div class="solution-body"> | |
| <div class="algo-flow"><pre> | |
| Instead of: "Implement the entire user auth feature" | |
| Dispatch: "Implement the JWT token parser in auth/token.go with tests" | |
| Then: "Wire the token parser into the middleware" | |
| Then: "Add integration tests for the auth middleware" | |
| GAP: The supervisor does NOT do this today. | |
| Current decomposition is by FUNCTION (which station), not by SIZE. | |
| The system prompt says "decompose tasks into station assignments" | |
| — meaning draft→inspect→build→review routing, not task sizing. | |
| Station tools accept { task: string } with no token budget metadata. | |
| The supervisor has zero awareness of the 200K station context limit.</pre></div> | |
| <div class="pros-cons"> | |
| <div class="pros"> | |
| <h4>Advantages</h4> | |
| <ul> | |
| <li><strong>Zero Go code</strong> — prompt engineering only (supervisor system prompt changes)</li> | |
| <li><strong>Zero runtime cost</strong> — no LLM calls, no buffers, no persistence</li> | |
| <li><strong>Addresses root cause</strong> — exhaustion is a symptom of poor task scoping (when tasks are scope-reducible)</li> | |
| <li><strong>Relatively fast</strong> — can be deployed by editing coder.md.tpl, but requires non-trivial prompt design</li> | |
| <li><strong>Better quality</strong> — smaller tasks produce more focused, testable results</li> | |
| </ul> | |
| </div> | |
| <div class="cons"> | |
| <h4>Disadvantages</h4> | |
| <ul> | |
| <li><strong>New capability, not existing</strong> — the supervisor currently decomposes by function (station routing), not by size. Teaching size-aware decomposition is new prompt engineering, not leveraging an existing behavior.</li> | |
| <li><strong>Not always possible</strong> — some tasks are inherently large (refactoring, multi-file features)</li> | |
| <li><strong>Supervisor cooperation</strong> — depends on LLM following scoping instructions reliably</li> | |
| <li><strong>No feedback loop</strong> — supervisor has no visibility into how much context a station consumed. Without usage data flowing back in results, the supervisor can't learn or adapt sizing.</li> | |
| <li><strong>No safety net</strong> — if scoping fails, station exhausts with no recovery</li> | |
| <li><strong>More supervisor turns</strong> — smaller tasks = more round trips</li> | |
| <li><strong>Doesn't handle external factors</strong> — station may exhaust from unexpected complexity</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <p style="font-size:0.78rem"><strong>Research backing:</strong> <span class="ref">Graph of Agents</span> <span class="ref">Gemini 3 Pro review — "task scoping"</span></p> | |
| <p style="font-size:0.78rem; margin-top:0.5rem; color:var(--text-dim)"><strong>Recommendation:</strong> Try this FIRST regardless of which recovery solution you choose. Good task scoping reduces the need for recovery machinery.</p> | |
| </div> | |
| </div> | |
| <!-- ─── SOLUTION 1 ─── --> | |
| <div class="solution"> | |
| <div class="solution-header"> | |
| <div class="solution-num">01</div> | |
| <div> | |
| <div class="solution-title">External Buffer + Transparent Replacement</div> | |
| <div class="solution-subtitle">processManager builds context buffer from handler, replaces at threshold</div> | |
| </div> | |
| <span class="card-badge badge-cyan">REACTIVE + PREEMPTIVE</span> | |
| </div> | |
| <div class="solution-body"> | |
| <div class="algo-flow"><pre> | |
| handler callback (every message) processManager | |
| +------------------------------+ +----------------------+ | |
| | MessageText + Tool -> |-------->| ContextBuffer | | |
| | tool call record | | +-- []TurnRecord | | |
| | MessageResult -> | | +-- generation count| | |
| | usage snapshot | | +-- anchor snapshot | | |
| | MessageError -> | | | | |
| | error record | | shouldReplace() | | |
| +------------------------------+ | fuel > 85% -> YES | | |
| | StopMaxTokens -> YES| | |
| RunTurn() error? | mid-turn death -> YES | |
| +-- YES -> buildSnapshot() | | | |
| | stop old process | buildSnapshot(): | | |
| | clear resume ID | ContextBuffer -> | | |
| | getOrStart(fresh) | git diff --stat -> | | |
| | RunTurn(continuation) | &lt;state_snapshot&gt; | |
| +-- NO, fuel > 85% -> replace +----------------------+</pre></div> | |
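<p>The decision logic on the right side of the diagram fits in a few lines. Thresholds and field names follow the diagram; everything else is a sketch:</p>

```go
package main

import "fmt"

// TurnRecord is one buffered station event (illustrative shape).
type TurnRecord struct{ Kind, Summary string }

// ContextBuffer is the supervisor-side mirror built by the handler callback.
type ContextBuffer struct {
	Turns      []TurnRecord
	UsedTokens int // from the last MessageResult's ContextUsedTokens
	Limit      int // 200_000 for a Claude Code station
}

// shouldReplace implements the diagram's triggers: fuel over 85%,
// StopMaxTokens, or a mid-turn process death.
func (b *ContextBuffer) shouldReplace(stopMaxTokens, midTurnDeath bool) bool {
	if stopMaxTokens || midTurnDeath {
		return true
	}
	return float64(b.UsedTokens) > 0.85*float64(b.Limit)
}

func main() {
	b := &ContextBuffer{UsedTokens: 180_000, Limit: 200_000}
	fmt.Println(b.shouldReplace(false, false)) // prints "true": 90% exceeds the 85% threshold
}
```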
| <div class="pros-cons"> | |
| <div class="pros"> | |
| <h4>Advantages</h4> | |
| <ul> | |
| <li><strong>Zero overhead</strong> — buffers data we already process in handler</li> | |
| <li><strong>Works for mid-turn death</strong> (buffer built as messages arrive)</li> | |
| <li><strong>Transparent to supervisor</strong> — same tool, same result contract</li> | |
| <li><strong>No dependency on station cooperation</strong></li> | |
| <li><strong>No file I/O until replacement happens</strong></li> | |
| <li><strong>~200 lines of Go</strong>, all in agentrun.go</li> | |
| </ul> | |
| </div> | |
| <div class="cons"> | |
| <h4>Disadvantages</h4> | |
| <ul> | |
| <li><strong>In-memory only</strong> — lost on app restart</li> | |
| <li><strong>Buffer captures thinking</strong> but earlier-turn context (what model "remembers") is lost</li> | |
| <li><strong>No LLM involvement in summarization</strong> — deterministic only</li> | |
| <li><strong>Git state capture adds latency</strong> at replacement time</li> | |
| <li><strong>No cross-station sharing</strong> of checkpoint data</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <p style="font-size:0.78rem"><strong>Research backing:</strong> <span class="ref">CMV</span> <span class="ref">Focus Agent</span> <span class="ref">Handoff Paradox</span> <span class="ref">Context Rot</span></p> | |
| </div> | |
| </div> | |
| <!-- ─── SOLUTION 2 ─── --> | |
| <div class="solution"> | |
| <div class="solution-header"> | |
| <div class="solution-num">02</div> | |
| <div> | |
| <div class="solution-title">ADK Artifact Checkpoints + Replacement</div> | |
| <div class="solution-subtitle">Persist structured checkpoints as ADK artifacts after each turn</div> | |
| </div> | |
| <span class="card-badge badge-amber">PERSISTENT</span> | |
| </div> | |
| <div class="solution-body"> | |
| <div class="algo-flow"><pre> | |
| After each successful RunTurn: | |
| +-------------------------------------------------------------+ | |
| | 1. Build checkpoint from ContextBuffer + git state | | |
| | 2. artifacts.Save("station-build-checkpoint", checkpoint) | | |
| | 3. Checkpoint persists in SQLite (foundry-adk.db) | | |
| +-------------------------------------------------------------+ | |
| On replacement trigger: | |
| +-------------------------------------------------------------+ | |
| | 1. artifacts.Load("station-build-checkpoint") | | |
| | 2. Build continuation prompt from artifact + fresh git | | |
| | 3. Stop old process, start fresh | | |
| | 4. RunTurn(fresh, continuation) | | |
| +-------------------------------------------------------------+ | |
| Cross-station: inspect can Load("station-build-checkpoint") | |
| to understand what build did before reviewing.</pre></div> | |
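<p>The checkpoint itself is a small JSON document. The sketch below uses an in-memory map where the real code would call the ADK artifact service; the Save/Load signatures here are invented, not the ADK API:</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Checkpoint is the structured state persisted after each turn (fields
// chosen to match the flow above: buffer summary plus git state).
type Checkpoint struct {
	Station     string   `json:"station"`
	GitDiffStat string   `json:"git_diff_stat"`
	RecentTools []string `json:"recent_tools"`
	Version     int      `json:"version"`
}

// artifactStore stands in for the SQLite-backed ADK artifact service.
type artifactStore struct{ m map[string][]byte }

func (s *artifactStore) Save(key string, c Checkpoint) error {
	var prev Checkpoint
	_ = json.Unmarshal(s.m[key], &prev) // zero value on first save
	c.Version = prev.Version + 1        // each save increments the version
	b, err := json.Marshal(c)
	if err != nil {
		return err
	}
	s.m[key] = b
	return nil
}

func (s *artifactStore) Load(key string) (Checkpoint, bool) {
	b, ok := s.m[key]
	if !ok {
		return Checkpoint{}, false
	}
	var c Checkpoint
	_ = json.Unmarshal(b, &c)
	return c, true
}

func main() {
	s := &artifactStore{m: map[string][]byte{}}
	s.Save("station-build-checkpoint", Checkpoint{Station: "build", GitDiffStat: "3 files changed"})
	c, _ := s.Load("station-build-checkpoint") // e.g. by inspect, cross-station
	fmt.Println(c.Station, c.Version)          // prints "build 1"
}
```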
| <div class="pros-cons"> | |
| <div class="pros"> | |
| <h4>Advantages</h4> | |
| <ul> | |
| <li><strong>Survives app restarts</strong> — SQLite-backed persistence</li> | |
| <li><strong>Cross-station sharing</strong> — inspect reads build's checkpoint</li> | |
| <li><strong>Already wired</strong> — artifactService in agent.go</li> | |
| <li><strong>Versioned</strong> — each save increments version counter</li> | |
| <li><strong>Auditable</strong> — can inspect checkpoint history</li> | |
| </ul> | |
| </div> | |
| <div class="cons"> | |
| <h4>Disadvantages</h4> | |
| <ul> | |
| <li><strong>Write I/O after every turn</strong> (SQLite, but still overhead)</li> | |
| <li><strong>Artifact size management</strong> — need cleanup policy</li> | |
| <li><strong>Adds coupling</strong> to artifact service availability</li> | |
| <li><strong>Stale artifact risk</strong> if app crashes between turn and save</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <p style="font-size:0.78rem"><strong>Research backing:</strong> <span class="ref">MAPLE</span> <span class="ref">Codified Context</span> <span class="ref">GitHub Copilot Memory</span></p> | |
| </div> | |
| </div> | |
| <!-- ─── SOLUTION 3 ─── --> | |
| <div class="solution"> | |
| <div class="solution-header"> | |
| <div class="solution-num">03</div> | |
| <div> | |
| <div class="solution-title">Station-Internal Compaction via /compact</div> | |
| <div class="solution-subtitle">Trigger Claude Code's built-in compaction to extend station lifetime</div> | |
| </div> | |
| <span class="card-badge badge-red">RISKY</span> | |
| </div> | |
| <div class="solution-body"> | |
| <div class="algo-flow"><pre> | |
| At 50-60% capacity: | |
| +-------------------------------------+ | |
| | proc.Send("/compact") | ← uses FormatInput() stdin JSON | |
| | Wait for MessageResult | | |
| | Check ContextUsedTokens decreased | | |
| | If decreased -> continue normally | | |
| | If NOT -> fall back to replacement | | |
| +-------------------------------------+ | |
| Risk: /compact is opaque. We don't know: | |
| * What was evicted vs kept | |
| * Whether critical context survived | |
| * Whether the station will hallucinate post-compact | |
| * Whether /compact even works via stdin JSON</pre></div> | |
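<p>Even though this path is not recommended, the verify-then-fallback guard is simple to state. Here send is a stand-in for proc.Send plus waiting for the next MessageResult:</p>

```go
package main

import "fmt"

// tryCompact sends /compact and checks whether reported usage actually
// dropped; if not, the caller must fall back to full replacement.
func tryCompact(beforeTokens int, send func(prompt string) (afterTokens int)) (fallbackToReplace bool) {
	after := send("/compact")
	return after >= beforeTokens // no verified reduction -> replace instead
}

func main() {
	// Simulate a compaction that worked (120K -> 40K)...
	ok := tryCompact(120_000, func(string) int { return 40_000 })
	// ...and one that silently did nothing.
	bad := tryCompact(120_000, func(string) int { return 120_000 })
	fmt.Println(ok, bad) // prints "false true"
}
```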
| <div class="pros-cons"> | |
| <div class="pros"> | |
| <h4>Advantages</h4> | |
| <ul> | |
| <li><strong>Extends station lifetime</strong> without process restart</li> | |
| <li><strong>Claude Code's compaction</strong> may preserve more internal state than our external snapshot</li> | |
| <li><strong>Zero implementation complexity</strong> (one proc.Send call)</li> | |
| <li><strong>Preserves station's accumulated reasoning</strong></li> | |
| </ul> | |
| </div> | |
| <div class="cons"> | |
| <h4>Disadvantages</h4> | |
| <ul> | |
| <li><strong>Opaque state</strong> — can't verify what was kept/dropped</li> | |
| <li><strong>Context rot</strong> — each compaction degrades quality</li> | |
| <li><strong>Amp retired this</strong> — "recursive summaries distorted reasoning"</li> | |
| <li><strong>No verification</strong> — can't parse stdout to confirm success</li> | |
| <li><strong>May not work</strong> — /compact via stdin JSON is untested</li> | |
| <li><strong>Gemini 3 Pro recommends AGAINST</strong> this approach</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <div class="validation-box"> | |
| <span class="review-badge review-badge-gemini">GEMINI 3 PRO</span> | |
| <p><strong>Validation Result: SKIP /compact</strong></p> | |
| <p>"Deterministic state materialization combined with a clean process restart provides a much more predictable state machine than relying on the CLI's internal compaction heuristics. /compact introduces opaque state, verification overhead, and unpredictable eviction."</p> | |
| </div> | |
| <p style="font-size:0.78rem"><strong>Research backing:</strong> <span class="ref">Amp (negative)</span> <span class="ref">Context Rot (negative)</span> <span class="ref">Claude Code 3-Layer (source)</span></p> | |
| </div> | |
| </div> | |
| <!-- ─── SOLUTION 4 ─── --> | |
| <div class="solution"> | |
| <div class="solution-header"> | |
| <div class="solution-num">04</div> | |
| <div> | |
| <div class="solution-title">ADK Compaction for Supervisor (Use Community Lib)</div> | |
| <div class="solution-subtitle">Drop-in ADK plugin for supervisor's 1M context — no custom code needed</div> | |
| </div> | |
| <span class="card-badge badge-amber">COMPLEMENTARY</span> | |
| </div> | |
| <div class="solution-body"> | |
| <p style="color:var(--amber); font-size:0.78rem;">Update (Mar 2026): No need to port from Python — a production-ready community lib exists.</p> | |
| <div class="lib-card" style="border-color:var(--green);"> | |
| <div class="lib-card-header"> | |
| <span class="lib-card-title">achetronic/adk-utils-go</span> | |
| <span class="card-badge badge-green">RECOMMENDED</span> | |
| </div> | |
| <p><strong>Status:</strong> 31★, Apache-2.0, works with ADK v0.5.0+, pure ADK plugin (BeforeModel + AfterModel)</p> | |
| <p><strong>Strategies:</strong> ThresholdStrategy (token-based, calibrated heuristic with real PromptTokenCount correction) + SlidingWindowStrategy (turn-count, 30% recent keep, up to 3 retry passes)</p> | |
| <p><strong>Quality:</strong> 6,270 lines of tests (multiturn, singleshot, unit). Structured 4-section summarization prompt. Fallback on LLM failure. Todo preservation. Tool pair safety (safeSplitIndex with bidirectional walk). Dynamic word limits.</p> | |
| <p><strong>Crush integration:</strong> Ships with CrushRegistry using catwalk's embedded model DB — Foundry is a Crush fork.</p> | |
| </div> | |
| <div class="lib-card" style="border-color:var(--amber);"> | |
| <div class="lib-card-header"> | |
| <span class="lib-card-title">PR #300 (google/adk-go)</span> | |
| <span class="card-badge badge-red">NOT READY</span> | |
| </div> | |
| <p><strong>Status:</strong> Open, CONFLICTING merge state, critical Gemini code-assist review (race condition in compactor init, incomplete event filtering integration)</p> | |
| <p><strong>Architecture:</strong> Modifies ADK core — adds EventCompaction to EventActions, CompactionConfig to runner.Config. 2,546 additions across 17 files.</p> | |
| <p><strong>Google plan:</strong> Collaborator @dpasiukevich confirmed "early March" for official support (comment). Not in v0.6.0 (Mar 6). May come in next release — would supersede both alternatives.</p> | |
| </div> | |
| <div class="algo-flow"><pre> | |
| Using the community lib (drop-in, ~10 lines to wire): | |
| +---------------------------------------------------------------+ | |
| | guard := contextguard.New(registry) | | |
| | guard.Add("foundry", supervisorLLM, | | |
| | contextguard.WithSlidingWindow(30)) | | |
| | | | |
| | runnr, _ := runner.New(runner.Config{ | | |
| | Agent: myAgent, | | |
| | PluginConfig: guard.PluginConfig(), // done. | | |
| | }) | | |
| | | | |
| | NOTE: This manages the SUPERVISOR's context, NOT stations. | | |
| | Stations are external processes -- their context is opaque. | | |
| +---------------------------------------------------------------+</pre></div> | |
| <div class="pros-cons"> | |
| <div class="pros"> | |
| <h4>Advantages</h4> | |
| <ul> | |
| <li><strong>Works TODAY</strong> — pure plugin, no ADK core changes needed</li> | |
| <li><strong>Proven algorithms</strong> — sliding window + token threshold, 6K+ lines of tests</li> | |
| <li><strong>Future-proofs supervisor</strong> for very long sessions (100+ turns)</li> | |
| <li><strong>Calibrated token estimation</strong> — real PromptTokenCount correction factor</li> | |
| <li><strong>Crush model registry included</strong> — knows our model hierarchy</li> | |
| <li><strong>~10 lines to wire</strong>, zero custom compaction code</li> | |
| <li><strong>Google official support imminent</strong> — can swap to native when it lands</li> | |
| </ul> | |
| </div> | |
| <div class="cons"> | |
| <h4>Disadvantages</h4> | |
| <ul> | |
| <li><strong>Wrong bottleneck</strong> — supervisor has 1M, stations have 200K</li> | |
| <li><strong>Does NOT solve station exhaustion</strong> (that's Solution 6)</li> | |
| <li><strong>External dependency</strong> (31★, single maintainer)</li> | |
| <li><strong>CrushRegistry uses catwalk</strong> — Foundry removed catwalk; need custom ModelRegistry impl (~20 lines)</li> | |
| <li><strong>May be superseded</strong> by Google's official compaction in next ADK release</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <p style="font-size:0.78rem"><strong>Research backing:</strong> <span class="ref">ADK Python Compaction</span> <span class="ref">Gemini CLI (similar pattern)</span> <span class="ref">achetronic/adk-utils-go</span> <span class="ref">google/adk-go#298</span></p> | |
| </div> | |
| </div> | |
| <!-- ─── SOLUTION 5 ─── --> | |
| <div class="solution"> | |
| <div class="solution-header"> | |
| <div class="solution-num">05</div> | |
| <div> | |
| <div class="solution-title">Supervisor-Orchestrated Handoff</div> | |
| <div class="solution-subtitle">Station returns "NEEDS_CONTINUATION" — supervisor explicitly re-dispatches</div> | |
| </div> | |
| <span class="card-badge badge-blue">EXPLICIT</span> | |
| </div> | |
| <div class="solution-body"> | |
| <div class="algo-flow"><pre> | |
| Station tool returns to supervisor: | |
| +-----------------------------------------------------+ | |
| | stationOutput{ | | |
| | Result: "NEEDS_CONTINUATION", | | |
| | Metadata: { | | |
| | reason: "context_exhausted", | | |
| | completed: ["parsed input", "wrote tests"], | | |
| | remaining: ["validate output", "wire main"], | | |
| | files_modified: ["parser.go", "parse_test.go"], | | |
| | failed_approaches: ["regex parser too slow"], | | |
| | } | | |
| | } | | |
| +-----------------------------------------------------+ | |
| Supervisor (prompted to handle this): | |
| +-----------------------------------------------------+ | |
| | "Station build exhausted context. It completed | | |
| | parsing and tests. Remaining: validation + wiring. | | |
| | Re-dispatch build with focused continuation." | | |
| | | | |
| | -> station("Continue: validate output and wire into | | |
| | main pipeline. Previous work: parser.go and | | |
| | parse_test.go complete. Don't retry regex | | |
| | approach -- it's too slow.") | | |
| +-----------------------------------------------------+</pre></div> | |
| <div class="pros-cons"> | |
| <div class="pros"> | |
| <h4>Advantages</h4> | |
| <ul> | |
| <li><strong>Supervisor retains full control and visibility</strong></li> | |
| <li><strong>Can adjust strategy</strong> (smaller tasks, different station)</li> | |
| <li><strong>Explicit</strong> — no hidden magic, easy to debug</li> | |
| <li><strong>Supervisor can ask operator</strong> before re-dispatching</li> | |
| <li><strong>Works with any backend</strong>, not Claude-specific</li> | |
| </ul> | |
| </div> | |
| <div class="cons"> | |
| <h4>Disadvantages</h4> | |
| <ul> | |
| <li><strong>Adds turns</strong> — supervisor must reason about handoff</li> | |
| <li><strong>Depends on LLM following handoff protocol</strong> correctly</li> | |
| <li><strong>Supervisor prompt complexity increases</strong></li> | |
| <li><strong>Supervisor context grows</strong> with each handoff attempt</li> | |
| <li><strong>Can't handle mid-turn death</strong> (no structured output)</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <p style="font-size:0.78rem"><strong>Research backing:</strong> <span class="ref">Amp /handoff</span> <span class="ref">Cline new_task</span> <span class="ref">MAPLE</span></p> | |
| </div> | |
| </div> | |
| <!-- ─── SOLUTION 6 ─── --> | |
| <div class="solution"> | |
| <div class="solution-header"> | |
| <div class="solution-num">06</div> | |
| <div> | |
| <div class="solution-title">Pseudo Context Window + ADK Sliding Window Compaction</div> | |
| <div class="solution-subtitle">Live compacted mirror of station conversation — always ready as continuation prompt</div> | |
| </div> | |
| <span class="card-badge badge-magenta">MOST SOPHISTICATED</span> | |
| </div> | |
| <div class="solution-body"> | |
| <p style="font-size:0.72rem; color:var(--text-dim); margin-bottom:1rem;">All code below runs in the Go supervisor process (<code>internal/agent/agentrun.go</code>), inside the processManager that owns the station. The station CLI is unaware of the PCW's existence. The handler callback captures data as the station streams stdout JSON. <code>runOneShot()</code> (a SEPARATE fresh LLM call, not the station) performs the summarization. The exhausted station is never asked to synthesize its own state.</p> | |
| <p>The handler callback in <code>newStationTool()</code> sees every message — tool calls, thinking, results, errors. Instead of building a snapshot at replacement time, we maintain a <strong>pseudo context window</strong>: a live, compacted mirror of the station's conversation on the supervisor side. ADK Python's sliding window compaction algorithm is adapted to keep the window within token budget. When replacement triggers, the window IS the continuation prompt — no construction step needed.</p> | |
| <div class="algo-flow"><pre> | |
| handler callback (continuous) PseudoContextWindow (SUPERVISOR-SIDE) | |
| +------------------------------+ +----------------------------------------+ | |
| | MessageText + Tool -> |-------->| RetentionWindow: []TurnRecord | | |
| | ToolRecord (name, I/O) | | +-- Turn N-4: task + tools + result | | |
| | MessageThinking -> | | +-- Turn N-3: task + tools + result | | |
| | thinking content | | +-- Turn N-2: task + tools + result | | |
| | MessageResult -> | | +-- Turn N-1: task + tools + result | | |
| | usage, result text | | +-- Turn N: [in progress...] | | |
| | MessageError -> | | | | |
| | error record | | CompactedSummary: string | | |
| +------------------------------+ | (LLM-generated rolling summary of | | |
| | turns before retention window) | | |
| afterTurn() -- sliding window: | | | |
| +------------------------------+ | Compaction (ADK Python style): | | |
| | TurnsSinceCompact++ | | sliding window: every N turns -> | | |
| | | | summarize old turns into | | |
| | if TurnsSinceCompact >= | | CompactedSummary (LLM call) | | |
| | CompactionInterval: | | token threshold: if TokenEstimate | | |
| | -> summarize old turns | | > budget -> force compaction | | |
| | -> prepend to Compacted | | | | |
| | -> trim RetentionWindow | | On replacement -> window IS the prompt: | | |
| | -> reset counter | | "Previous work: {CompactedSummary} | | |
| | | | Recent: {RetentionWindow} | | |
| | if TokenEstimate > budget: | | Continue: {original task}" | | |
| | -> force compaction | +----------------------------------------+ | |
| +------------------------------+</pre></div> | |
| <p style="font-size:0.72rem; color:var(--text-dim); margin:1rem 0 0.5rem;"><strong>KEY DIFFERENCES FROM SOLUTION 1:</strong></p> | |
| <table class="diff-table"> | |
| <tr><th>Aspect</th><th>S1: External Buffer</th><th>S6: Pseudo Context Window</th></tr> | |
| <tr><td>Trimming</td><td>CMV-style deterministic only</td><td>ADK sliding window + LLM summarization</td></tr> | |
| <tr><td>Summary quality</td><td>Deterministic — no LLM involvement</td><td>LLM-generated rolling summary (via runOneShot)</td></tr> | |
| <tr><td>Snapshot construction</td><td>Built at replacement time</td><td>Always ready — window IS the prompt</td></tr> | |
| <tr><td>Data captured</td><td>Tool calls + results + errors</td><td>Tool calls + thinking + results + errors</td></tr> | |
| <tr><td>Compaction trigger</td><td>None (fixed-size buffer)</td><td>Dual: turn-count interval + token threshold</td></tr> | |
| <tr><td>Failed approaches</td><td>In snapshot, may be trimmed</td><td>Never compacted — protected across all compactions</td></tr> | |
| <tr><td>LLM cost</td><td>Zero</td><td>One runOneShot call per compaction cycle</td></tr> | |
| </table> | |
| <div class="pros-cons"> | |
| <div class="pros"> | |
| <h4>Advantages</h4> | |
| <ul> | |
| <li><strong>Always ready</strong> — no snapshot construction step on replacement</li> | |
| <li><strong>LLM summarization via existing runOneShot()</strong> — high-quality rolling summaries</li> | |
| <li><strong>Captures thinking</strong> (MessageThinking visible via agentrun) — reasoning is preserved</li> | |
| <li><strong>ADK-proven compaction algorithm</strong> — sliding window + token threshold</li> | |
| <li><strong>Failed approaches are NEVER compacted</strong> — protected from eviction</li> | |
| <li><strong>Dual compaction triggers</strong> prevent unbounded growth</li> | |
| <li><strong>Transparent to supervisor</strong> — same tool, same result contract</li> | |
| <li><strong>Maps cleanly to existing processManager lifecycle</strong></li> | |
| <li><strong>Can persist to ADK artifacts</strong> for restart survival</li> | |
| </ul> | |
| </div> | |
| <div class="cons"> | |
| <h4>Disadvantages</h4> | |
| <ul> | |
| <li><strong>LLM cost</strong> — one runOneShot() call per compaction cycle (~every 5 turns)</li> | |
| <li><strong>Compaction latency</strong> — LLM summarization adds ~2-5s per cycle (non-blocking)</li> | |
| <li><strong>Token estimation is approximate</strong> — genai/tokenizer available but adds complexity</li> | |
| <li><strong>Rolling summaries can distort early context</strong> (mitigated by overlap + protected fields)</li> | |
| <li><strong>In-memory by default</strong> — needs artifact persistence for restart survival</li> | |
| <li><strong>More complex than S1</strong> — sliding window + LLM integration (~350 LOC vs ~200 LOC)</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <p style="font-size:0.78rem"><strong>Research backing:</strong> <span class="ref">ADK Python Compaction</span> <span class="ref">CMV</span> <span class="ref">Context Rot</span> <span class="ref">Cline new_task</span> <span class="ref">Amp /handoff</span> <span class="ref">Gemini CLI state_snapshot</span></p> | |
| </div> | |
| </div> | |
| <!-- ─── SOLUTION 7 ─── --> | |
| <div class="solution"> | |
| <div class="solution-header"> | |
| <div class="solution-num">07</div> | |
| <div> | |
| <div class="solution-title">Graceful Handoff — Warn, Wrap, Replace</div> | |
| <div class="solution-subtitle">At threshold, tell station to wrap up current work before replacing</div> | |
| </div> | |
| <span class="card-badge badge-green">COOPERATIVE</span> | |
| </div> | |
| <div class="solution-body"> | |
| <div class="algo-flow"><pre> | |
| At 80% capacity: | |
| +-----------------------------------------------------------------+ | |
| | proc.Send("IMPORTANT: You are approaching your context limit. | | |
| | Please finish your current subtask, commit or save your work, | | |
| | and write a brief handoff summary of what's done and what | | |
| | remains. Do NOT start new subtasks.") | | |
| | | | |
| | Wait for MessageResult (station wraps up) | | |
| | | | |
| | Extract: result text (station's own handoff summary) | | |
| | + git diff --stat (ground truth) | | |
| | | | |
| | Stop old process, start fresh with: | | |
| | "Continue: {station's handoff} + {git state}" | | |
| +-----------------------------------------------------------------+ | |
| Key difference from S1-S6: the station is ASKED to wrap up cleanly | |
| instead of being killed mid-work. The station's own summary is used | |
| as input (while it still has good context), cross-checked with git.</pre></div> | |
| <div class="pros-cons"> | |
| <div class="pros"> | |
| <h4>Advantages</h4> | |
| <ul> | |
| <li><strong>Station writes its own handoff</strong> — while context is still good (80%, not 100%)</li> | |
| <li><strong>Clean state</strong> — station commits/saves before replacement, no half-finished edits</li> | |
| <li><strong>Simple</strong> — one proc.Send() + wait + replace. ~100 LOC.</li> | |
| <li><strong>Cross-validated</strong> — station's summary checked against git diff (ground truth)</li> | |
| <li><strong>No LLM summarization cost</strong> — station does the summarization as part of its work</li> | |
| <li><strong>Station's own reasoning preserved</strong> — it knows what it was thinking better than our external mirror</li> | |
| </ul> | |
| </div> | |
| <div class="cons"> | |
| <h4>Disadvantages</h4> | |
| <ul> | |
| <li><strong>Depends on station cooperation</strong> — station may ignore the instruction or produce poor handoff</li> | |
| <li><strong>Timing risk</strong> — station may exhaust DURING the wrap-up phase (80% + wrap-up = 100%)</li> | |
| <li><strong>Handoff Paradox applies partially</strong> — station at 80% is degraded, may miss things</li> | |
| <li><strong>Can't handle mid-turn death</strong> — needs a running station to cooperate</li> | |
| <li><strong>Adds latency</strong> — wrap-up phase can take significant time</li> | |
| <li><strong>Context anxiety risk</strong> — telling the station about its limits may cause premature shortcuts</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <p style="font-size:0.78rem"><strong>Research backing:</strong> <span class="ref">Cline new_task (agent writes handoff)</span> <span class="ref">Focus Agent (external pressure)</span> <span class="ref">Gemini 3 Pro review — "graceful handoff"</span></p> | |
| <p style="font-size:0.78rem; margin-top:0.5rem; color:var(--text-dim);"><strong>Note:</strong> Can be combined with S1 or S6 as a fallback — if graceful handoff fails (mid-turn death), fall back to buffer-based replacement.</p> | |
| </div> | |
| </div> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <!-- SECTION 05: COMPARISON MATRIX --> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <h2><span class="section-num">05</span>Comparison Matrix</h2> | |
| <p style="font-size:0.78rem; margin-bottom:1rem;">S0 (Task Scoping) is not in this matrix — it's a prerequisite that should be applied regardless of which recovery solution you choose. S2 (ADK Artifacts) and S4 (ADK Lib) are complementary layers, not primary recovery solutions.</p> | |
| <table> | |
| <tr> | |
| <th>Criterion</th> | |
| <th>S1: Deterministic Buffer</th> | |
| <th>S3: /compact</th> | |
| <th>S5: Supervisor Handoff</th> | |
| <th>S6: Pseudo Context Window</th> | |
| <th>S7: Graceful Handoff</th> | |
| </tr> | |
| <tr> | |
| <td>Solves station exhaustion</td> | |
| <td class="cell-good">✓ Kill + replace</td> | |
| <td class="cell-warn">◔ Delays only</td> | |
| <td class="cell-good">✓ Re-dispatch</td> | |
| <td class="cell-good">✓ Kill + replace</td> | |
| <td class="cell-good">✓ Warn + replace</td> | |
| </tr> | |
| <tr> | |
| <td>Mid-turn death recovery</td> | |
| <td class="cell-good">✓ Buffer built live</td> | |
| <td class="cell-bad">✗ No path</td> | |
| <td class="cell-bad">✗ No output</td> | |
| <td class="cell-good">✓ Window built live</td> | |
| <td class="cell-bad">✗ Needs running station</td> | |
| </tr> | |
| <tr> | |
| <td>Implementation complexity</td> | |
| <td>~200 LOC</td> | |
| <td>~20 LOC</td> | |
| <td>~150 LOC + prompt</td> | |
| <td>~350 LOC</td> | |
| <td>~100 LOC</td> | |
| </tr> | |
| <tr> | |
| <td>Runtime cost</td> | |
| <td class="cell-good">$0 — no LLM calls</td> | |
| <td class="cell-good">$0 — one proc.Send</td> | |
| <td class="cell-good">$0 — supervisor reasons</td> | |
| <td class="cell-warn">~$0.01-0.05 per compaction</td> | |
| <td class="cell-good">$0 — station does wrap-up</td> | |
| </tr> | |
| <tr> | |
| <td>Time to ship</td> | |
| <td>Days</td> | |
| <td>Hours</td> | |
| <td>Days + prompt tuning</td> | |
| <td>Week+</td> | |
| <td>Days</td> | |
| </tr> | |
| <tr> | |
| <td>Continuation quality</td> | |
| <td>Predictable (deterministic)</td> | |
| <td class="cell-bad">Opaque (unverifiable)</td> | |
| <td>Supervisor reasons about it</td> | |
| <td>LLM-enhanced summaries</td> | |
| <td>Station's own summary</td> | |
| </tr> | |
| <tr> | |
| <td>Failure mode severity</td> | |
| <td class="cell-good">Low — deterministic, no surprises</td> | |
| <td class="cell-bad">High — opaque rot, hallucination</td> | |
| <td class="cell-warn">Medium — LLM may not follow protocol</td> | |
| <td class="cell-warn">Medium — LLM summary can distort (falls back to S1)</td> | |
| <td class="cell-warn">Medium — station may ignore warning</td> | |
| </tr> | |
| <tr> | |
| <td>Prevents "context anxiety"</td> | |
| <td class="cell-good">✓ Station never knows</td> | |
| <td class="cell-warn">◔ May signal</td> | |
| <td class="cell-good">✓ Station doesn't know</td> | |
| <td class="cell-good">✓ Station never knows</td> | |
| <td class="cell-bad">✗ Explicitly warns station</td> | |
| </tr> | |
| <tr> | |
| <td>Clean state on replacement</td> | |
| <td class="cell-warn">◔ May have half-finished edits</td> | |
| <td>— N/A</td> | |
| <td class="cell-warn">◔ Depends on when it triggers</td> | |
| <td class="cell-warn">◔ May have half-finished edits</td> | |
| <td class="cell-good">✓ Station saves/commits first</td> | |
| </tr> | |
| <tr> | |
| <td>Composability</td> | |
| <td>Base layer for S6, S7</td> | |
| <td>Standalone only</td> | |
| <td>Overlay on any solution</td> | |
| <td>Enhances S1 with LLM</td> | |
| <td>Combines with S1 or S6 as fallback</td> | |
| </tr> | |
| <tr> | |
| <td style="color:var(--amber);"><strong>BEST FOR</strong></td> | |
| <td><strong>Start here</strong> — simple, reliable, zero cost</td> | |
| <td><strong>Avoid</strong> — opaque, unverifiable</td> | |
| <td><strong>Explicit control</strong> — when supervisor needs to adjust strategy</td> | |
| <td><strong>High-quality continuation</strong> — when data shows S1 isn't enough</td> | |
| <td><strong>Clean handoff</strong> — when half-finished state is a problem</td> | |
| </tr> | |
| </table> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <!-- SECTION 06: DEEP DIVE — PCW (S6) --> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <h2><span class="section-num">06</span>Deep Dive — Pseudo Context Window (S6)</h2> | |
| <div class="boundary-box" style="border-color:var(--magenta); color:var(--magenta);"> | |
| ENTIRE ARCHITECTURE RUNS SUPERVISOR-SIDE — STATION IS UNMODIFIED BLACK BOX | |
| </div> | |
| <p style="font-size:0.78rem;"><strong>Why a deep dive for S6?</strong> This is the most complex solution (~350 LOC) with the most moving parts. It is detailed here so readers can evaluate whether the additional complexity is justified for their use case. S6 is an enhancement of S1 (deterministic buffer) — it adds LLM-powered rolling summarization via <code>runOneShot()</code> and ADK-style sliding window compaction. If S1 proves sufficient in practice, S6 may not be needed. If continuation quality is poor with S1, S6 is the upgrade path.</p> | |
| <p style="font-size:0.78rem; margin-top:0.75rem;"><strong>Key tradeoff:</strong> S6 produces higher-quality continuation prompts than S1, but introduces LLM cost per compaction cycle (~$0.01-0.05/cycle), rolling summary drift risk (mitigated by overlap + protected fields), and additional complexity. S1 is the deterministic fallback if S6's LLM summarization fails.</p> | |
| <p style="font-size:0.78rem; margin-top:0.75rem;"><strong>The S3/S6 distinction:</strong> This document advises against S3 (/compact) because "recursive summaries distort reasoning" (Amp). S6 also uses LLM summarization — but with key differences: (1) S6 summarization runs in the supervisor process (verifiable, controllable), not inside the opaque station; (2) we control what gets summarized and what's protected (FailedApproaches never compacted); (3) if summarization quality degrades, we can fall back to S1-style deterministic buffer. S3's /compact offers none of these controls.</p> | |
| <h3 style="color:var(--magenta);">Data Structures</h3> | |
| <div class="data-struct"> | |
| <div class="data-struct-title">DATA STRUCTURES</div> | |
| <pre style="color:var(--text); font-size:0.72rem;"> | |
| <span style="color:var(--text-dim);">// PseudoContextWindow — live compacted mirror of station conversation</span> | |
| type PseudoContextWindow struct { | |
| CompactedSummary string <span style="color:var(--text-dim);">// rolling LLM-generated summary of old turns</span> | |
| RetentionWindow []TurnRecord <span style="color:var(--text-dim);">// recent N turns with full detail</span> | |
| Generation int <span style="color:var(--text-dim);">// replacement count</span> | |
| TokenEstimate int <span style="color:var(--text-dim);">// estimated tokens in window</span> | |
| CompactionInterval int <span style="color:var(--text-dim);">// turns between compaction (e.g., 5)</span> | |
| OverlapSize int <span style="color:var(--text-dim);">// turns to keep as overlap (e.g., 2)</span> | |
| TokenThreshold int <span style="color:var(--text-dim);">// max tokens before forced compaction (e.g., 15000)</span> | |
| TurnsSinceCompact int <span style="color:var(--text-dim);">// counter</span> | |
| FailedApproaches []string <span style="color:var(--text-dim);">// accumulated across turns (never compacted)</span> | |
| } | |
| type TurnRecord struct { | |
| Task string <span style="color:var(--text-dim);">// what was asked</span> | |
| Thinking string <span style="color:var(--text-dim);">// from MessageThinking (visible via agentrun)</span> | |
| ToolCalls []ToolRecord <span style="color:var(--text-dim);">// name + input + output</span> | |
| Result string <span style="color:var(--text-dim);">// station's response</span> | |
| Error string <span style="color:var(--text-dim);">// if turn failed</span> | |
| Usage UsageSnapshot <span style="color:var(--text-dim);">// ContextUsedTokens at end of turn</span> | |
| } | |
| type ToolRecord struct { | |
| Name string <span style="color:var(--text-dim);">// "Read", "Edit", "Bash", etc.</span> | |
| Input string <span style="color:var(--text-dim);">// file path, command, etc. (trimmed for older turns)</span> | |
| Output string <span style="color:var(--text-dim);">// head/tail preview (full for recent, masked for old)</span> | |
| } | |
| <span style="color:var(--text-dim);">// afterTurn — called after each RunTurn completion</span> | |
| func (w *PseudoContextWindow) afterTurn(turn TurnRecord, summarizer func(string) string) { | |
| w.RetentionWindow = append(w.RetentionWindow, turn) | |
| w.TurnsSinceCompact++ | |
| w.TokenEstimate += estimateTokens(turn) | |
| <span style="color:var(--text-dim);">// Sliding window compaction (ADK Python algorithm)</span> | |
| if w.TurnsSinceCompact >= w.CompactionInterval || w.TokenEstimate > w.TokenThreshold { | |
| cutoff := len(w.RetentionWindow) - w.OverlapSize | |
| if cutoff > 0 { | |
| old := w.RetentionWindow[:cutoff] | |
| seed := w.CompactedSummary <span style="color:var(--text-dim);">// rolling: previous summary as seed</span> | |
| w.CompactedSummary = summarizer(seed + renderTurns(old)) | |
| w.RetentionWindow = w.RetentionWindow[cutoff:] | |
| w.TurnsSinceCompact = 0 | |
| w.TokenEstimate = estimateWindowTokens(w) | |
| } | |
| } | |
| } | |
| <span style="color:var(--text-dim);">// buildPrompt — the window IS the continuation prompt</span> | |
| func (w *PseudoContextWindow) buildPrompt(originalTask string, gitState string) string { | |
| <span style="color:var(--text-dim);">// Always ready — no snapshot construction needed</span> | |
| return fmt.Sprintf(`You are continuing work that a previous agent started. | |
| &lt;previous_work_summary&gt;%s&lt;/previous_work_summary&gt; | |
| &lt;recent_activity&gt;%s&lt;/recent_activity&gt; | |
| &lt;failed_approaches&gt;%s&lt;/failed_approaches&gt; | |
| &lt;git_state&gt;%s&lt;/git_state&gt; | |
| Continue: %s | |
| Verify that files cited above still match before proceeding.`, | |
| w.CompactedSummary, | |
| renderTurns(w.RetentionWindow), | |
| strings.Join(w.FailedApproaches, "\n"), | |
| gitState, | |
| originalTask) | |
| } | |
| </pre> | |
| </div> | |
| <h3 style="color:var(--magenta);">Integration Point — newStationTool() Closure</h3> | |
| <div class="code-block"> | |
| <pre style="color:var(--text);"> | |
| <span style="color:var(--text-dim);">// In newStationTool closure — after RunTurn completes:</span> | |
| <span style="color:var(--text-dim);">// 1. Record turn in pseudo context window (continuous)</span> | |
| pm.window.afterTurn(turn, func(text string) string { | |
| <span style="color:var(--text-dim);">// LLM summarization via existing runOneShot()</span> | |
| result, _ := a.runOneShot(ctx, sessionID, "Summarize this conversation:\n"+text) | |
| return result.Text | |
| }) | |
| <span style="color:var(--text-dim);">// 2. Check if replacement needed</span> | |
| if pm.shouldReplace(sessionID) { | |
| <span style="color:var(--text-dim);">// Window is ALREADY the prompt — no construction step</span> | |
| prompt := pm.window.buildPrompt(input.Task, captureGitState(pm.cwd)) | |
| <span style="color:var(--text-dim);">// Persist window to ADK artifact for restart survival</span> | |
| pm.persistWindow(tctx, sessionID) | |
| <span style="color:var(--text-dim);">// Replace: stop → clear → fresh start</span> | |
| pm.stop(ctx, sessionID) | |
| clearResumeID(tctx, pm.station) | |
| freshProc, _, _ := pm.getOrStart(ctx, sessionID, "", "") | |
| <span style="color:var(--text-dim);">// Continue on fresh process</span> | |
| pm.window.Generation++ | |
| err = agentrun.RunTurn(ctx, freshProc, prompt, handler) | |
| <span style="color:var(--text-dim);">// Notify supervisor (awareness, not control)</span> | |
| a.notifier.Send(sessionID, fmt.Sprintf( | |
| "Station %s replaced (gen %d) — window had %d turns compacted", | |
| pm.station, pm.window.Generation, pm.window.TurnsSinceCompact)) | |
| } | |
| </pre> | |
| </div> | |
| <h3 style="color:var(--magenta);">Continuation Prompt Output Example</h3> | |
| <div class="code-block"> | |
| <pre style="color:var(--text);"> | |
| <span style="color:var(--text-dim);">&lt;!-- Generated by window.buildPrompt() — always ready --&gt;</span> | |
| You are continuing work that a previous agent started but couldn't finish | |
| (context window exhausted, generation 2). | |
| <span style="color:var(--cyan);">&lt;previous_work_summary&gt;</span> | |
| <span style="color:var(--text-dim);">&lt;!-- CompactedSummary: LLM-generated rolling summary --&gt;</span> | |
| The previous agent implemented a recursive descent parser in internal/parser.go, | |
| wrote comprehensive unit tests in parse_test.go (all passing), and began work on | |
| output validation. Key decisions: chose recursive descent over regex after finding | |
| regex too slow for nested structures. | |
| <span style="color:var(--cyan);">&lt;/previous_work_summary&gt;</span> | |
| <span style="color:var(--cyan);">&lt;recent_activity&gt;</span> | |
| <span style="color:var(--text-dim);">&lt;!-- RetentionWindow: last N turns with full detail --&gt;</span> | |
| Turn 7: Read schema.go, analyzed validation requirements | |
| Tools: Read(schema.go), Read(types.go) | |
| Thinking: "Need to validate nested fields recursively..." | |
| Turn 8: Started validator implementation [INTERRUPTED at 85% context] | |
| Tools: Read(parser.go), Edit(validator.go +23 lines), Bash(go build — pass) | |
| Result: validator.go partially written, compiles but incomplete | |
| <span style="color:var(--cyan);">&lt;/recent_activity&gt;</span> | |
| <span style="color:var(--cyan);">&lt;failed_approaches&gt;</span> | |
| <span style="color:var(--text-dim);">&lt;!-- NEVER compacted — protected across all generations --&gt;</span> | |
| - Regex parser: too slow for nested structures, abandoned (gen 1) | |
| - Table-driven validation: too rigid, switched to recursive (gen 2) | |
| <span style="color:var(--cyan);">&lt;/failed_approaches&gt;</span> | |
| <span style="color:var(--cyan);">&lt;git_state&gt;</span> | |
| Modified: internal/parser.go (+45, -12) | |
| Modified: internal/parser_test.go (+80) | |
| Modified: internal/validator.go (+23) &lt;- may be incomplete | |
| <span style="color:var(--cyan);">&lt;/git_state&gt;</span> | |
| Continue: Implement output validation and wire into main pipeline. | |
| Verify that files cited above still match before proceeding. | |
| </pre> | |
| </div> | |
| <h3 style="color:var(--magenta);">Fuel Gauge Thresholds</h3> | |
| <div class="fuel-bar"> | |
| <div class="fuel-segment" style="width:50%; background:var(--green);">NORMAL — window accumulates</div> | |
| <div class="fuel-segment" style="width:20%; background:var(--cyan);">COMPACTION active</div> | |
| <div class="fuel-segment" style="width:15%; background:var(--amber);">PERSIST window</div> | |
| <div class="fuel-segment" style="width:15%; background:var(--red);">REPLACE</div> | |
| </div> | |
| <div style="display:flex; justify-content:space-between; font-size:0.6rem; color:var(--text-dim); margin-top:0.25rem;"> | |
| <span>50% — sliding window kicks in</span> | |
| <span>70% — persist to artifact</span> | |
| <span>85% — PREEMPTIVE REPLACE</span> | |
| <span>100% — MID-TURN DEATH</span> | |
| </div> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <!-- SECTION 07: MULTI-MODEL VALIDATION --> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <h2><span class="section-num">07</span>Multi-Model Validation</h2> | |
| <!-- Internal Reviews --> | |
| <div class="review-block"> | |
| <div class="review-header"> | |
| <span class="review-badge review-badge-gemini">GEMINI 3 PRO</span> | |
| <span class="review-title">Architecture Validation (thinking mode: max)</span> | |
| </div> | |
| <p><strong>On /compact:</strong> "Skip or deprioritize. Relying on it in an automated orchestration pipeline introduces significant risk. We do not control the summarization/eviction logic. Deterministic state materialization combined with a clean process restart provides a much more predictable state machine."</p> | |
| <p><strong>On preemptive vs reactive:</strong> "Waiting for a hard context limit error is too late. If the context window is completely full, the agent may not have enough output tokens remaining to generate a high-quality handoff document. The supervisor must predict exhaustion and trigger state materialization BEFORE hitting the ceiling."</p> | |
| <p><strong>On filesystem as memory:</strong> "Claude Code is highly optimized for reading local files. Use the filesystem as the definitive state store. Do not build external vector stores or complex memory databases."</p> | |
| <p><strong>On 'Failed Approaches':</strong> "The biggest risk is that the agent forgets HOW it failed previously and repeats the same mistakes. Ensure the handoff explicitly documents failed approaches so the fresh instance doesn't retry dead ends."</p> | |
| <p><strong>Recommended threshold:</strong> 85% capacity for preemptive replacement. Track proxy metrics (interaction turn count, accumulated stdout byte size, execution duration) as backup heuristics when <code>ContextUsedTokens</code> is unavailable.</p> | |
| </div> | |
| <div class="review-block"> | |
| <div class="review-header"> | |
| <span class="review-badge review-badge-claude">CLAUDE OPUS 4.6</span> | |
| <span class="review-title">Internal Analysis Synthesis</span> | |
| </div> | |
| <p><strong>Key insight:</strong> The handler callback in <code>newStationTool()</code> already processes every message. The ContextBuffer is not new infrastructure — it's structured logging of data we already handle. Cost: near zero.</p> | |
| <p><strong>ADK artifacts vs session state:</strong> Artifacts are the right choice over session state because: (1) cross-station visibility (inspect reads build's checkpoint), (2) versioned (each save increments), (3) auditable history, (4) already wired in agent.go line ~287. Session state is per-station, unversioned.</p> | |
| <p><strong>On the layering:</strong> Each layer works independently. Deploy Layer 1 (buffer) first — it's practically free. Layer 2 (artifacts) adds persistence. Layer 3 (replacement) adds recovery. This is NOT over-engineered because each layer is small (~100 LOC each) and provides value alone.</p> | |
| <p><strong>On Solution 4 (supervisor compaction):</strong> Independent and complementary. The supervisor's 1M context won't exhaust in normal use, but for very long discovery sessions (100+ turns), compaction prevents degradation. Update: Community lib <code>achetronic/adk-utils-go</code> (31★, 6K+ test lines) works today as drop-in ADK plugin. Google official support imminent (#298). No custom code needed — defer to post-station-continuity.</p> | |
| <p><strong>On OPENDEV (arXiv 2603.05344):</strong> Three actionable insights: (1) Per-type tool pruning is FREE — do it before any LLM call. File reads → path, commands → exit code + key lines, edits → diff summary. (2) Instruction fade-out is a real P0 problem (confirmed by Gemini CLI #6474). Event-driven reminders via our notify plugin (already built in adk-go-extras) directly counter this. (3) Two-phase reduction (cheap prune THEN expensive summarize) should be the PCW default — no reason to LLM-summarize a 500-line bash output when "exit 0, 47 tests passed" suffices.</p> | |
| </div> | |
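The per-type pruning rules above (file reads to path, commands to exit code plus key lines, edits to a diff summary) are deterministic enough to sketch directly. `ToolEvent` and its fields are illustrative stand-ins for whatever the handler callback already records:

```go
package main

import (
	"fmt"
	"strings"
)

// ToolEvent is a minimal stand-in for a recorded tool call.
type ToolEvent struct {
	Kind   string // "read", "command", "edit"
	Path   string
	Exit   int
	Output string
}

// prune applies per-type reduction: deterministic, no LLM call,
// run before any summarization pass.
func prune(e ToolEvent) string {
	switch e.Kind {
	case "read":
		// File contents are recoverable from disk; keep only the path.
		return "read " + e.Path
	case "command":
		// Keep exit status plus the last few lines, which usually
		// carry the error or the test summary.
		lines := strings.Split(strings.TrimRight(e.Output, "\n"), "\n")
		if len(lines) > 3 {
			lines = lines[len(lines)-3:]
		}
		return fmt.Sprintf("exit %d: %s", e.Exit, strings.Join(lines, " / "))
	case "edit":
		return "edited " + e.Path
	default:
		return e.Output
	}
}

func main() {
	fmt.Println(prune(ToolEvent{Kind: "read", Path: "internal/validator.go"})) // prints: read internal/validator.go
	fmt.Println(prune(ToolEvent{
		Kind: "command", Exit: 0,
		Output: "...many lines of go test output...\nok  internal/parser 0.41s\nPASS",
	}))
}
```

This is exactly the two-phase shape the OPENDEV point argues for: a 500-line test log collapses to one line before any LLM summarizer ever sees it.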
| <h3>External Reviews (Document Sent for Independent Analysis)</h3> | |
| <p style="font-size:0.78rem; margin-bottom:1rem;">The document was sent to Gemini 3 Pro and Claude Opus 4.6 via the prompt: "Read it, tell us which solution you'd pick and why, what you'd try first, what risks you see, what would change your mind." Both were given the single-build-station constraint. Their recommendations converged significantly.</p> | |
| <div class="review-block" style="border-color:var(--blue);"> | |
| <div class="review-header"> | |
| <span class="review-badge review-badge-gemini">GEMINI 3 PRO</span> | |
| <span class="review-title">External Independent Review</span> | |
| </div> | |
| <p><strong>Recommendation:</strong> Hybrid S7 + S1 + Gemini CLI <code>&lt;state_snapshot&gt;</code> XML schema. S7 as proactive telemetry trigger, S1 as infallible fallback buffer, structured XML snapshot (not freeform S6 summary) as the continuation payload. Explicitly rejects S6 — "catastrophic context rot from recursive summaries."</p> | |
| <p><strong>Try first:</strong> S1 + S0 exclusively. "Master the IPC piping and OS signal management before adding telemetry or XML snapshots."</p> | |
| <p><strong>Critical finding — CLI vulnerabilities:</strong> Identified two active bugs that threaten ALL handler-based solutions: (1) JSON stdout truncation at fixed char boundaries (4K/6K/8K/16K) — <a href="https://github.com/anthropics/claude-code/issues/2904">#2904</a>, (2) CLI hangs indefinitely after final result, creating zombie processes — <a href="https://github.com/anthropics/claude-code/issues/25629">#25629</a>. "Any viable recovery system must implement heuristic JSON repair algorithms."</p> | |
| <p><strong>Key innovation:</strong> Advocates Gemini CLI's rigid <code>&lt;state_snapshot&gt;</code> XML schema (overall_goal, active_constraints, artifact_trail, task_state) instead of freeform LLM summaries. "Structured XML prevents semantic drift. Functions as an artificial hippocampus."</p> | |
| <p><strong>What would change their mind:</strong> (1) Streaming token telemetry would eliminate S7 blind spot. (2) LLM inference cost dropping 10x makes S6 viable. (3) Station context expanding to 1M makes all recovery obsolete.</p> | |
| </div> | |
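Both CLI bugs mean the handler may receive a stdout stream cut off mid-object. One cheap mitigation is to stream-decode the concatenated JSON objects and keep the last one that parses cleanly. This is an illustrative sketch of partial extraction only, not the fuller "heuristic JSON repair" the review calls for (which would also try to salvage the truncated tail):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// lastCompleteObject scans a possibly-truncated stream of concatenated
// JSON objects (as a CLI emits them) and returns the last one that
// decodes without error. Truncation simply ends the scan.
func lastCompleteObject(raw []byte) (map[string]any, bool) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	var last map[string]any
	ok := false
	for {
		var m map[string]any
		if err := dec.Decode(&m); err != nil {
			break // truncated object or EOF
		}
		last, ok = m, true
	}
	return last, ok
}

func main() {
	// Simulated stdout cut off mid-object at a fixed byte boundary.
	raw := []byte(`{"type":"tool_use","name":"bash"}{"type":"result","ok":tr`)
	if m, ok := lastCompleteObject(raw); ok {
		fmt.Println(m["type"]) // prints: tool_use
	}
}
```

Under truncation at arbitrary byte boundaries (#2904), everything up to the last complete object survives; only the in-flight object is lost, which bounds the damage to a buffer-based recovery path.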
| <div class="review-block" style="border-color:var(--magenta);"> | |
| <div class="review-header"> | |
| <span class="review-badge review-badge-claude">CLAUDE OPUS 4.6</span> | |
| <span class="review-title">External Independent Review (with single-station constraint)</span> | |
| </div> | |
| <p><strong>Recommendation:</strong> S0 + S7 (primary) + S1 (fallback). Trigger S7 at 75-78% (not 85%) because with one station, you need more buffer for wrap-up. S1 as fallback when S7 fails. Skip S6 entirely — "don't add complexity speculatively."</p> | |
| <p><strong>Try first:</strong> "S0, literally today. Edit coder.md.tpl. Then instrument and measure for a week."</p> | |
| <p><strong>Critical insight — supervisor IS the recovery layer:</strong> "With one build station, the supervisor already sits in the sequential dispatch loop. Return structured NEEDS_CONTINUATION to supervisor instead of doing transparent replacement inside the tool closure. The whole 'transparent replacement' design becomes unnecessary."</p> | |
| <p><strong>On half-finished edits:</strong> "#1 practical failure mode. Files may contain half-written functions. git diff shows changes but cannot indicate completeness. May need pre-turn file snapshots or git stash."</p> | |
| <p><strong>On the document:</strong> "Handler callback data ≠ station's full internal state. <code>runOneShot()</code> is calibrated for 40 tokens (titles), not 500-2000 (summarization). ProcessActivity is capped for UI, not archival. The code is further from any solution than the document implies."</p> | |
| <p><strong>What would change their mind:</strong> (1) If exhaustion >30% of sessions, S6 justified immediately. (2) If git diff + task alone succeeds >90%, entire solution space is over-engineered. (3) If Claude Code ships official continuation API, architecture changes fundamentally.</p> | |
| </div> | |
| <h3>Cross-Review Consensus</h3> | |
| <table> | |
| <tr> | |
| <th>Decision</th> | |
| <th>Gemini 3 Pro</th> | |
| <th>Claude Opus 4.6</th> | |
| <th>Consensus</th> | |
| </tr> | |
| <tr> | |
| <td>Skip /compact</td> | |
| <td class="cell-good">✓ Opaque, risky</td> | |
| <td class="cell-good">✓ Unpredictable</td> | |
| <td style="color:var(--green);">AGREED</td> | |
| </tr> | |
| <tr> | |
| <td>Preemptive replacement at 85%</td> | |
| <td class="cell-good">✓ Before ceiling</td> | |
| <td class="cell-good">✓ Based on research</td> | |
| <td style="color:var(--green);">AGREED</td> | |
| </tr> | |
| <tr> | |
| <td>Failed Approaches section</td> | |
| <td class="cell-good">✓ Critical</td> | |
| <td class="cell-good">✓ From Cline</td> | |
| <td style="color:var(--green);">AGREED</td> | |
| </tr> | |
| <tr> | |
| <td>Per-type tool pruning before LLM summarization</td> | |
| <td class="cell-good">✓ Free, reduces LLM load</td> | |
| <td class="cell-good">✓ OPENDEV validates</td> | |
| <td style="color:var(--green);">AGREED</td> | |
| </tr> | |
| </table> | |
| <h3>Critiques and Open Disagreements</h3> | |
| <p style="font-size:0.78rem; margin-bottom:0.75rem;">These critiques were raised during validation and are NOT fully resolved. They represent genuine tradeoff decisions that the reader should weigh.</p> | |
| <table> | |
| <tr> | |
| <th>Critique</th> | |
| <th>Source</th> | |
| <th>Implication</th> | |
| <th>Status</th> | |
| </tr> | |
| <tr> | |
| <td>Task scoping may eliminate the problem</td> | |
| <td><span class="ref">Gemini 3 Pro review</span></td> | |
| <td>If the supervisor dispatches well-scoped tasks, stations may rarely exhaust. S0 may be sufficient.</td> | |
| <td style="color:var(--amber);">UNRESOLVED — needs empirical data</td> | |
| </tr> | |
| <tr> | |
| <td>S3/S6 summarization contradiction</td> | |
| <td><span class="ref">Gemini 3 Pro review</span></td> | |
| <td>Document rejects S3 for "recursive summaries distort reasoning" but S6 uses LLM rolling summaries. Distinction: S6 is supervisor-controlled and verifiable, but the drift risk is real.</td> | |
| <td style="color:var(--cyan);">ACKNOWLEDGED — addressed in S6 deep dive intro</td> | |
| </tr> | |
| <tr> | |
| <td>Git state may be sufficient for continuation</td> | |
| <td><span class="ref">Gemini 3 Pro review</span></td> | |
| <td>"The codebase IS the state." A fresh station with git diff + original task may be enough. Complex PCW machinery may be over-engineering.</td> | |
| <td style="color:var(--amber);">UNRESOLVED — needs A/B testing</td> | |
| </tr> | |
| <tr> | |
| <td>Graceful handoff may be simpler and better</td> | |
| <td><span class="ref">Gemini 3 Pro review</span></td> | |
| <td>Warn at 80%, let station finish cleanly, use station's own summary. Preserves station reasoning, no LLM cost. Now added as S7.</td> | |
| <td style="color:var(--green);">INCORPORATED — added as S7</td> | |
| </tr> | |
| <tr> | |
| <td>File read pruning trap</td> | |
| <td><span class="ref">Gemini 3 Pro review</span></td> | |
| <td>If per-type pruning reduces file reads to path-only, the fresh station won't have file contents and must re-read them. This is acceptable (station will re-read), but adds latency.</td> | |
| <td style="color:var(--cyan);">ACCEPTABLE — station re-reads are cheap</td> | |
| </tr> | |
| <tr> | |
| <td>Prefer deterministic-first, add LLM later</td> | |
| <td><span class="ref">Gemini 3 Pro review</span></td> | |
| <td>Ship S1 (deterministic, zero cost), measure continuation quality, add S6 (LLM) only if data shows it's needed. Avoids premature complexity.</td> | |
| <td style="color:var(--green);">VALID — reflected in decision framework (Section 01)</td> | |
| </tr> | |
| <tr> | |
| <td>Persistence mechanism disagreement</td> | |
| <td><span class="ref">Gemini 3 Pro vs Claude Opus 4.6</span></td> | |
| <td>Gemini prefers filesystem (.foundry/ files); Claude prefers ADK artifacts (cross-station, versioned). Both valid — may do both.</td> | |
| <td style="color:var(--amber);">UNRESOLVED — implementation decision</td> | |
| </tr> | |
| <tr> | |
| <td>CLI JSON truncation breaks all handler solutions</td> | |
| <td><span class="ref">Gemini external review</span> <span class="ref">claude-code#2904</span></td> | |
| <td>Stdout truncated at fixed char boundaries. Standard JSON parsers panic. All buffer-based solutions (S1, S6, S7) depend on clean JSON parsing. Need heuristic JSON repair or partial extraction.</td> | |
| <td style="color:var(--red);">CRITICAL — unresolved upstream dependency</td> | |
| </tr> | |
| <tr> | |
| <td>Half-finished file edits are the #1 practical risk</td> | |
| <td><span class="ref">Claude external review</span> <span class="ref">Gemini cross-validation</span></td> | |
| <td>Mid-turn death leaves files with half-written functions. Fresh station can't tell modified files are incomplete. May need pre-turn git stash or atomic rollback.</td> | |
| <td style="color:var(--red);">CRITICAL — not solved by any current solution</td> | |
| </tr> | |
| <tr> | |
| <td>Supervisor should be the recovery layer, not processManager</td> | |
| <td><span class="ref">Claude external review (single-station constraint)</span></td> | |
| <td>With one build station, supervisor already sits in sequential dispatch loop. Return structured NEEDS_CONTINUATION instead of transparent in-tool replacement. Simplifies architecture significantly.</td> | |
| <td style="color:var(--amber);">STRONG — changes architectural approach</td> | |
| </tr> | |
| <tr> | |
| <td>Both external reviewers reject S6</td> | |
| <td><span class="ref">Gemini + Claude external reviews</span></td> | |
| <td>Gemini: "catastrophic context rot." Claude: "don't add complexity speculatively." Both independently recommend S1 (deterministic) until data proves otherwise. Cross-model consensus against S6 as initial approach.</td> | |
| <td style="color:var(--amber);">STRONG — S6 deferred further</td> | |
| </tr> | |
| <tr> | |
| <td>Use structured XML schema, not freeform summaries</td> | |
| <td><span class="ref">Gemini external review</span></td> | |
| <td>Gemini CLI's <code>&lt;state_snapshot&gt;</code> XML schema (overall_goal, active_constraints, artifact_trail, task_state) prevents semantic drift that freeform summaries suffer from. If LLM summarization is ever used, enforce rigid schema.</td> | |
| <td style="color:var(--green);">INCORPORATED — applies to S6 if built</td> | |
| </tr> | |
| </table> | |
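A concrete sketch of such a snapshot, populated from the handoff example earlier in this document: the four top-level fields are the ones named in the review, while the child element names and attributes are illustrative guesses, not Gemini CLI's actual schema.

```xml
<state_snapshot>
  <overall_goal>Implement output validation and wire into main pipeline</overall_goal>
  <active_constraints>
    <constraint>Verify cited files still match before proceeding</constraint>
  </active_constraints>
  <artifact_trail>
    <file path="internal/parser_test.go" change="+80"/>
    <file path="internal/validator.go" change="+23" note="may be incomplete"/>
  </artifact_trail>
  <task_state>Parser tests added; validator changes possibly incomplete</task_state>
</state_snapshot>
```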
| <!-- Implementation roadmap section intentionally removed — this is an analysis document, not an implementation plan. | |
| An implementation roadmap should be created AFTER a solution is selected. --> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <!-- SECTION 08: FALSIFIABILITY --> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <h2><span class="section-num">08</span>Falsifiability — What Would Prove This Wrong</h2> | |
| <table> | |
| <tr> | |
| <th>Assumption</th> | |
| <th>Confidence</th> | |
| <th>What Would Invalidate It</th> | |
| </tr> | |
| <tr> | |
| <td>Recovery machinery is needed at all</td> | |
| <td class="cell-warn">MEDIUM — no frequency data yet</td> | |
| <td>If task scoping (S0) reduces exhaustion to &lt;5% of sessions, recovery machinery is premature optimization. Need baseline measurement before building S1+.</td> | |
| </tr> | |
| <tr> | |
| <td>Git state + original task is insufficient for continuation</td> | |
| <td class="cell-warn">MEDIUM — intuitively yes for complex work, but untested</td> | |
| <td>If a fresh station with only git diff + task description completes work >80% of the time, the entire PCW is unnecessary. A/B test against simple re-dispatch.</td> | |
| </tr> | |
| <tr> | |
| <td>Handler callback captures enough context for continuation</td> | |
| <td class="cell-good">HIGH — tool calls + thinking are the primary work record</td> | |
| <td>If the station's accumulated internal context (earlier turns it "remembers") is critical beyond what tool calls + thinking capture, and the fresh station consistently fails to complete work.</td> | |
| </tr> | |
| <tr> | |
| <td>LLM summarization via runOneShot() is good enough</td> | |
| <td class="cell-good">HIGH — same infra already used for title generation; small model is fast</td> | |
| <td>If summarization quality is poor (too lossy) and fresh stations consistently miss context. If latency of runOneShot is too high (>5s) and blocks station progress.</td> | |
| </tr> | |
| <tr> | |
| <td>85% is the right replacement threshold</td> | |
| <td class="cell-warn">MEDIUM — research-backed but not empirically validated</td> | |
| <td>If stations consistently exhaust between 85-100% in a single tool call (no chance to check threshold). May need lower threshold for tool-heavy sessions.</td> | |
| </tr> | |
| <tr> | |
| <td>Fresh context is better than compacted context</td> | |
| <td class="cell-good">HIGH — Context Rot paper, Amp experience, Graph of Agents</td> | |
| <td>If Claude Code's /compact produces measurably better continuation quality than our pseudo window. Would need A/B testing.</td> | |
| </tr> | |
| <tr> | |
| <td>Rolling summaries don't distort earlier reasoning</td> | |
| <td class="cell-warn">MEDIUM — Amp retired compaction for this reason, but our overlap mitigates</td> | |
| <td>If multi-generation rolling summaries (summary of summary of summary...) lose critical early decisions. Mitigated by: overlap preservation, failed approaches never compacted, generation count limiting.</td> | |
| </tr> | |
| <tr> | |
| <td>ADK artifacts are the right persistence layer</td> | |
| <td class="cell-good">HIGH — already wired, versioned, cross-station</td> | |
| <td>If artifact writes cause measurable latency or SQLite contention. If serialized window exceeds artifact size limits.</td> | |
| </tr> | |
| <tr> | |
| <td>Supervisor doesn't need explicit control of replacement</td> | |
| <td class="cell-warn">MEDIUM — transparency may hide important failures</td> | |
| <td>If the supervisor needs to adjust strategy after replacement (e.g., break work into smaller chunks) and can't because it doesn't know replacement happened. Mitigated by notify plugin.</td> | |
| </tr> | |
| <tr> | |
| <td>Per-type tool pruning preserves enough context for continuation</td> | |
| <td class="cell-good">HIGH — OPENDEV production-validated, tool outputs are verbose by nature</td> | |
| <td>If pruned tool outputs (path-only for reads, exit+key lines for commands) miss critical details that the LLM summarizer would have captured. Mitigated by: retention window keeps recent N turns at full detail.</td> | |
| </tr> | |
| <tr> | |
| <td>Supervisor instruction fade-out is addressable with periodic reminders</td> | |
| <td class="cell-warn">MEDIUM — Gemini CLI #6474 (P0) fixed with PRs, but our use case differs</td> | |
| <td>If Gemini 1M context is resistant to instruction fade-out at supervisor-level turn counts (~20-50). May not need reminders until 100+ turns. Mitigated by: reminders are cheap (notify plugin, no LLM call).</td> | |
| </tr> | |
| </table> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <!-- FOOTER --> | |
| <!-- ═══════════════════════════════════════════════════════════ --> | |
| <footer> | |
| <strong>FOUNDRY ENGINEERING</strong> · DOCUMENT FND-042-ANALYSIS · CONTEXT EXHAUSTION RECOVERY<br><br> | |
| <strong>Research:</strong> #40 (60+ papers) · #41 (13 tools, 10 frameworks) · #42 (this analysis)<br> | |
| <strong>Validated:</strong> Gemini 3 Pro (thinking: max) · Claude Opus 4.6 (extended thinking) · Bias review · External independent reviews (Gemini 3 Pro + Claude Opus 4.6)<br> | |
| <strong>CLI vulnerabilities:</strong> <a href="https://github.com/anthropics/claude-code/issues/2904">claude-code#2904</a> (JSON truncation) · <a href="https://github.com/anthropics/claude-code/issues/25629">claude-code#25629</a> (zombie processes) · Context Forge · Gas Town<br> | |
| <strong>Cross-referenced:</strong> Context Rot · Focus Agent · CMV · Amp · Cline · Gemini CLI · ADK Python · OPENDEV · Graph of Agents<br> | |
| <strong>Solutions:</strong> S0 (task scoping) · S1 (deterministic buffer) · S2 (ADK artifacts) · S3 (/compact) · S4 (ADK lib) · S5 (supervisor handoff) · S6 (PCW) · S7 (graceful handoff)<br> | |
| <strong>OPENDEV:</strong> arXiv 2603.05344 (5-stage progressive, instruction fade-out, per-type pruning) · anomalyco/opencode<br> | |
| <strong>ADK Go:</strong> google/adk-go#298 (ADR-010) · PR#300 (conflicting) · achetronic/adk-utils-go (31★) | |
| </footer> | |
| </body> | |
| </html> |