
@stvhay
Created March 19, 2026 00:55
CAST --- GHIDRA Failure

Root Cause Analysis (CAST): PR #135 — Incomplete Ghidra Integration

Method

CAST (Causal Analysis based on Systems Theory) — identifies why controls failed to prevent a loss by analyzing the control structure, not assigning blame.


1. Loss

| ID | Description | Type |
|----|-------------|------|
| L-1 | Development time spent on a PR that cannot be merged (1846 lines, 13 commits, 2 review cycles) | mission |
| L-2 | PR delivers integration scaffolding but not analytical substance — core capability (decompilation + code comparison) was never built | mission |

2. Hazards

| ID | Description | Losses |
|----|-------------|--------|
| H-1 | Implementation diverges from requirements without detection until review | L-1, L-2 |
| H-2 | Plan contains correct requirements but implementation omits core capabilities | L-1, L-2 |
| H-3 | Verification processes confirm passing tests without validating analytical correctness | L-1, L-2 |

3. Safety Constraints Violated

| ID | Constraint | Hazards |
|----|------------|---------|
| SC-1 | Implementation must satisfy acceptance criteria before declaring work complete | H-1, H-2 |
| SC-2 | Testing must validate that the pipeline produces meaningful analytical output, not just that code runs | H-3 |
| SC-3 | Each implementation phase must be verified against the plan before proceeding to the next | H-2 |

4. Event Sequence

  1. Issue #128 created with clear acceptance criteria including "modification report clearly shows what vendor changed in GPL code"
  2. Implementation plan posted — correctly specifies decompilation, code comparison, ternary classification, Kconfig reconstruction
  3. Plan explicitly notes: "For each decompiled function: diff decompiled output vs upstream source" and "Classify: unmodified, modified, vendor-added"
  4. Implementation begins — Nix flake, Jython scripts, orchestrator, CPython integration built sequentially
  5. Jython scripts extract function names, symbols, strings — but never call DecompInterface
  6. Cross-reference script implements name-matching only — upstream_match or vendor_added, no modified category
  7. modified_count field declared in dataclass but never populated
  8. -noanalysis flag included in orchestrator (skipping even basic Ghidra analysis)
  9. 20 unit tests written — all using mocked JSON, none testing against real firmware
  10. Pipeline never run end-to-end against actual firmware
  11. PR submitted claiming "Closes #128" with all tests passing
  12. First code review catches 4 blocking issues — surface-level bugs
  13. Fix commits address surface issues: regex escaping, -noanalysis removal, modified_count removal, .issue cleanup
  14. Second code review reveals fundamental gap: no decompilation, no code comparison, name-matching only
  15. PR left open, marked "do not merge — redesign required"
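The gap at steps 5–7 can be made concrete. Below is a minimal sketch of the ternary classification the plan specified but the PR never implemented, using stdlib `difflib` as a stand-in for real decompiled-output comparison; the function names, report shape, and similarity threshold are illustrative assumptions, not the project's code:

```python
import difflib

def classify(name, decompiled_c, upstream_funcs, threshold=0.9):
    """Ternary classification per the plan: diff decompiled output
    against upstream source, not merely match function names."""
    if name not in upstream_funcs:
        return "vendor_added"
    ratio = difflib.SequenceMatcher(
        None, decompiled_c, upstream_funcs[name]).ratio()
    return "unmodified" if ratio >= threshold else "modified"

upstream = {"clk_enable": "return c->ops->enable(c);"}

# Same name, same body: unmodified
assert classify("clk_enable", "return c->ops->enable(c);", upstream) == "unmodified"
# Same name, different body: the "modified" category the PR never produced
assert classify("clk_enable", "secure_smc(0x82000001); return 0;", upstream) == "modified"
# Name absent upstream: vendor_added
assert classify("rk_otp_read", "...", upstream) == "vendor_added"
```

Name-matching alone collapses the first two cases into a single "upstream_match" bucket, which is exactly the defect the second review found.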

5. Control Structure

Developer (human)
    │
    ├── Brainstorming Plugin ──► Planning Plugin
    │                                │
    │                    Implementation Agent (Claude Code)
    │                                │
    │              ┌─────────────────┼─────────────────┐
    │          TDD Plugin    Verification Plugin    CI Pipeline
    │                                │
    │                    Code Review Plugin ◄─── (caught the gap)
    │
    └── Merge Decision (correctly blocked)

6. Component Analysis

6.1 Implementation Agent (Claude Code)

Contribution: Built Jython scripts that extract metadata (function names, symbols, strings) but never invoked DecompInterface. Built cross-reference with binary name-matching instead of the ternary code-comparison specified in the plan. Included -noanalysis flag. Submitted PR claiming to close #128.
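For contrast, a minimal sketch of the call the scripts never made. This only runs inside Ghidra's Jython/GhidraScript environment (where `currentProgram` and `monitor` are provided globals) and is illustrative, not the project's code:

```python
# Ghidra script context: run via analyzeHeadless or the Script Manager.
from ghidra.app.decompiler import DecompInterface

decomp = DecompInterface()
decomp.openProgram(currentProgram)  # GhidraScript-provided global

for func in currentProgram.getFunctionManager().getFunctions(True):
    results = decomp.decompileFunction(func, 60, monitor)  # 60s timeout
    if results.decompileCompleted():
        pseudo_c = results.getDecompiledFunction().getC()
        # pseudo_c is the decompiled C output -- the input that the
        # plan's diff-against-upstream step needed and never received
```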

Why it seemed correct: The plan was large (18 phases, ~12 files). The agent focused on structural completeness — producing all files the plan called for. Each file individually follows project patterns correctly. All 20 tests passed. CI was green. Observable quality signals were positive. The plan's decompilation requirement was embedded in prose descriptions rather than as explicit checkpoints.

Mental model flaws:

| Belief | Reality |
|--------|---------|
| Extracting function names/symbols = "Ghidra analysis" | Core value is decompilation — pseudo-C output enabling code comparison |
| Function name found in upstream = "upstream match" | A function can share a name but have a completely different implementation |
| 20 passing tests + CI green = implementation meets requirements | Tests validated plumbing, not substance — mocked JSON assumed the hard part was solved |
| Producing all planned files = plan was executed | Several files were structurally present but functionally hollow |

6.2 Planning Plugin

Contribution: Produced a plan that correctly identified all required capabilities but embedded them as descriptive prose within phase narratives rather than as discrete, verifiable steps. Decompilation was mentioned in Phase 4's cross-reference description, not in Phase 2 where the Jython scripts were specified.

Mental model flaws:

| Belief | Reality |
|--------|---------|
| Detailed prose plan with correct content will be executed correctly | Critical capabilities can be lost in translation from prose to code |
| Describing what a script should do is sufficient | Without "this function must call X API" steps, implementers build structurally similar but functionally different code |

6.3 TDD / Verification Plugins

Contribution: 20 tests written, all using mocked JSON. Mocked data encoded assumptions about Ghidra output that were never validated. Verification confirmed "893 tests pass, zero lint errors" — true, but irrelevant to whether the pipeline answers its core question.

Mental model flaws:

| Belief | Reality |
|--------|---------|
| High test count + CI green = correct | Tests validated plumbing, not substance |
| Mocked data matching expected schema validates pipeline | Mocked data assumed the hard part (decompilation) was solved |
| Verification = "tests pass and lint clean" | Should include "does the output answer the question the issue asks?" |

6.4 Code Review Plugin

Contribution: First review caught surface bugs. Second review identified the fundamental gap. This is the control that worked — it prevented merging a hollow pipeline.

6.5 Developer (Human)

Contribution: Wrote clear acceptance criteria. Performed thorough reviews. Made the correct decision not to merge. Requested RCA. Controls worked at this level.


7. Systemic Factors

7.1 Plan-to-Implementation Gap

The plan correctly specified decompilation and code comparison. The implementation omitted them. No checkpoint between "plan approved" and "PR submitted" verified that each plan phase was actually implemented as specified. The plan was treated as a starting signal, not as a living checklist.

7.2 Mocked-Test Blindness

100% of new tests used mocked data. The mocked data encoded assumptions about what Ghidra would produce — assumptions never validated. The test suite provided a strong "green" signal that masked the absence of the core capability. Analogous to testing a calculator by mocking the arithmetic engine — the UI works, but 2+2 might equal 5.
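A toy illustration of this failure mode, using stdlib `unittest.mock` (all names here are hypothetical, not the project's code): a fully mocked suite stays green around a component that was never built.

```python
from unittest.mock import MagicMock

def summarize(decompiler, binary):
    """Pipeline plumbing: count functions in the decompiler's report."""
    return len(decompiler.decompile(binary)["functions"])

class HollowDecompiler:
    """The real component: the hard part was never wired up."""
    def decompile(self, binary):
        raise NotImplementedError("DecompInterface never invoked")

# Mocked test: encodes the assumed schema and passes -- a green signal
# that says nothing about whether decompilation actually works.
fake = MagicMock()
fake.decompile.return_value = {"functions": ["clk_enable", "clk_disable"]}
assert summarize(fake, "fw.bin") == 2

# The same call against the real component fails immediately.
try:
    summarize(HollowDecompiler(), "fw.bin")
    reached = True
except NotImplementedError:
    reached = False
assert not reached
```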

7.3 Structural Completeness Bias

The development process optimizes for observable quality signals: file count, test count, lint cleanliness, pattern adherence, CI status. All excellent in PR #135. But they measure structural completeness (are all pieces present and well-formed?) rather than analytical correctness (does the pipeline answer the question?). 16 files, 1846 lines, 20 tests, zero lint errors — none of that addressed the core requirement.

7.4 Missing End-to-End Validation

The project methodology requires "scripts are the source of truth" and "anyone running our scripts must arrive at identical conclusions." The implementation was never run against real firmware. The -noanalysis flag's presence is strong evidence: code was written to be structurally correct, not functionally tested.
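A sketch of the kind of minimal gate that was missing (the report schema and field names are assumptions for illustration): before PR submission, one real firmware run must produce output that answers the issue's core question.

```python
def smoke_check(report):
    """R-1-style gate: a real-firmware run must yield a report that
    actually classifies functions, not just parse without error."""
    required = {"unmodified", "modified", "vendor_added"}
    missing = required - report.keys()
    assert not missing, "report lacks ternary classification: %s" % missing
    total = sum(len(report[k]) for k in required)
    assert total > 0, "empty report -- pipeline likely never ran end-to-end"

# Passes only when the pipeline produced substantive classifications.
smoke_check({"unmodified": ["f1"], "modified": ["f2"], "vendor_added": []})
```

A check like this would have failed loudly on PR #135, where the `modified` category could never be populated.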

7.5 Breadth Over Depth

Issue #128 is ambitious — Ghidra integration, upstream cross-referencing, Kconfig reconstruction, U-Boot config, header generation. Under pressure to deliver full scope, the implementation went wide (all files, all patterns) rather than deep (one capability done correctly end-to-end). A narrower initial scope — "decompile one kernel module and compare 5 functions against upstream" — would have forced confrontation with the hard problems before building infrastructure.


8. Recommendations

| ID | Recommendation | Addresses | Priority |
|----|----------------|-----------|----------|
| R-1 | Require real-data smoke test before PR submission for analysis pipelines. At minimum: one real binary through the pipeline, output inspected. | End-to-end validation gap | High |
| R-2 | Acceptance criteria as explicit checklist — plans should transform issue acceptance criteria into numbered verification steps, each with a concrete "done when" condition. | Plan-to-implementation gap | High |
| R-3 | Phase gates in plans — each phase must have verification criteria confirmed before proceeding. "Phase N done when: [observable output]" | Plan-to-implementation gap | High |
| R-4 | Scope-to-depth: prove the hard part first — for complex features, implement one vertical slice end-to-end (including real data) before going wide. Prove decompilation works before building infrastructure around it. | Breadth over depth | High |
| R-5 | First review includes requirements traceability — code review begins with "walk the acceptance criteria and point to the code that satisfies each one" before examining code quality. | Structural completeness bias | Medium |
| R-6 | Flag mocked-only test suites — when all tests for a new feature use mocked data, verification should flag this as a risk and require justification or an integration test plan. | Mocked-test blindness | Medium |
| R-7 | Distinguish scaffolding from substance in plans. Mark which steps are infrastructure vs. core analytical capability. Prioritize substance. | Structural completeness bias | Medium |
| R-8 | Post-brainstorming risk register — brainstorming outputs "what could go wrong" alongside the design. "Risk: we build plumbing without substance" would have been catchable. | Brainstorming gap | Low |

9. Key Takeaway

The plan was correct. The implementation was structurally faithful but analytically hollow.

Every observable quality signal was green: clear issue, thorough plan, well-structured code, passing tests, clean CI. But the core capability — decompiling binary code and comparing it against upstream source — was never built.

The systemic cause is a gap between plan approval and implementation verification. No checkpoint between "plan approved" and "PR submitted" verified that the specified capabilities were actually implemented. Verification confirmed structural quality but never asked "does this answer the question?"

The code review process caught the gap — this control worked. But it caught it late, after 1846 lines and two review cycles.

For analysis pipelines: prove the hard part works first, then build infrastructure around it. Structural completeness is not analytical correctness.
