CAST (Causal Analysis based on Systems Theory) — identifies why controls failed to prevent a loss by analyzing the control structure, not assigning blame.
| ID | Description | Type |
|---|---|---|
| L-1 | Development time spent on a PR that cannot be merged (1846 lines, 13 commits, 2 review cycles) | mission |
| L-2 | PR delivers integration scaffolding but not analytical substance — core capability (decompilation + code comparison) was never built | mission |
| ID | Description | Losses |
|---|---|---|
| H-1 | Implementation diverges from requirements without detection until review | L-1, L-2 |
| H-2 | Plan contains correct requirements but implementation omits core capabilities | L-1, L-2 |
| H-3 | Verification processes confirm passing tests without validating analytical correctness | L-1, L-2 |
| ID | Constraint | Hazards |
|---|---|---|
| SC-1 | Implementation must satisfy acceptance criteria before declaring work complete | H-1, H-2 |
| SC-2 | Testing must validate that the pipeline produces meaningful analytical output, not just that code runs | H-3 |
| SC-3 | Each implementation phase must be verified against the plan before proceeding to the next | H-2 |
- Issue #128 created with clear acceptance criteria including "modification report clearly shows what vendor changed in GPL code"
- Implementation plan posted — correctly specifies decompilation, code comparison, ternary classification, Kconfig reconstruction
- Plan explicitly notes: "For each decompiled function: diff decompiled output vs upstream source" and "Classify: unmodified, modified, vendor-added"
- Implementation begins — Nix flake, Jython scripts, orchestrator, CPython integration built sequentially
- Jython scripts extract function names, symbols, strings — but never call `DecompInterface`
- Cross-reference script implements name-matching only — `upstream_match` or `vendor_added`, no `modified` category
- `modified_count` field declared in dataclass but never populated
- `-noanalysis` flag included in orchestrator (skipping even basic Ghidra analysis)
- 20 unit tests written — all using mocked JSON, none testing against real firmware
- Pipeline never run end-to-end against actual firmware
- PR submitted claiming "Closes #128" with all tests passing
- First code review catches 4 blocking issues — surface-level bugs
- Fix commits address surface issues: regex escaping, `-noanalysis` removal, `modified_count` removal, `.issue` cleanup
- Second code review reveals fundamental gap: no decompilation, no code comparison, name-matching only
- PR left open, marked "do not merge — redesign required"
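The ternary classification the plan specified (unmodified / modified / vendor-added) can be sketched in a few lines. Everything below is hypothetical: the function names, the token-level similarity heuristic, and the 0.95 threshold are illustrative, assuming decompiled pseudo-C and upstream source are available as plain text.

```python
import difflib

def classify_function(name, decompiled, upstream_sources):
    """Classify a decompiled function against upstream source.

    Returns one of: "unmodified", "modified", "vendor_added".
    """
    if name not in upstream_sources:
        return "vendor_added"  # no upstream counterpart at all
    upstream = upstream_sources[name]
    # Compare token streams rather than raw text to tolerate whitespace
    # and decompiler formatting differences (a real pipeline would
    # normalize much more aggressively).
    ratio = difflib.SequenceMatcher(
        None, decompiled.split(), upstream.split()
    ).ratio()
    return "unmodified" if ratio > 0.95 else "modified"

upstream = {"probe": "int probe(void) { return init(); }",
            "reset": "void reset(void) { clear(); }"}

# Name-matching alone would call both probe and reset "upstream_match":
print(classify_function("probe", "int probe(void) { return init(); }", upstream))            # unmodified
print(classify_function("reset", "void reset(void) { vendor_hook(); clear(); }", upstream))  # modified
print(classify_function("vendor_dbg", "void vendor_dbg(void) {}", upstream))                 # vendor_added
```

The point is not the heuristic itself but the structure: classification requires comparing function bodies, which in turn requires decompiled output to exist.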
```
Developer (human)
│
├── Brainstorming Plugin ──► Planning Plugin
│                                   │
│                    Implementation Agent (Claude Code)
│                                   │
│          ┌────────────────────────┼────────────────────────┐
│      TDD Plugin          Verification Plugin          CI Pipeline
│                                   │
│               Code Review Plugin ◄─── (caught the gap)
│
└── Merge Decision (correctly blocked)
```
Contribution: Built Jython scripts that extract metadata (function names, symbols, strings) but never invoked `DecompInterface`. Built the cross-reference step with binary name-matching (match/no-match) instead of the ternary code comparison specified in the plan. Included the `-noanalysis` flag. Submitted a PR claiming to close #128.
Why it seemed correct: The plan was large (18 phases, ~12 files). The agent focused on structural completeness — producing all files the plan called for. Each file individually follows project patterns correctly. All 20 tests passed. CI was green. Observable quality signals were positive. The plan's decompilation requirement was embedded in prose descriptions rather than as explicit checkpoints.
Mental model flaws:
| Belief | Reality |
|---|---|
| Extracting function names/symbols = "Ghidra analysis" | Core value is decompilation — pseudo-C output enabling code comparison |
| Function name found in upstream = "upstream match" | A function can share a name but have completely different implementation |
| 20 passing tests + CI green = implementation meets requirements | Tests validated plumbing, not substance — mocked JSON assumed the hard part was solved |
| Producing all planned files = plan was executed | Several files were structurally present but functionally hollow |
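The second flaw in the table can be made concrete: a function can share its name with upstream while carrying a vendor change. A minimal sketch, with a hypothetical function name and bodies:

```python
upstream = {"uart_init": "void uart_init(void) { setup_clock(); }"}
vendor   = {"uart_init": "void uart_init(void) { setup_clock(); vendor_patch(); }"}

def classify_by_name(name):
    # What the implementation did: presence in upstream is "a match".
    return "upstream_match" if name in upstream else "vendor_added"

def classify_by_content(name, body):
    # What the plan specified: compare the actual function bodies.
    if name not in upstream:
        return "vendor_added"
    return "unmodified" if upstream[name] == body else "modified"

name, body = "uart_init", vendor["uart_init"]
print(classify_by_name(name))           # upstream_match -- looks fine, hides the change
print(classify_by_content(name, body))  # modified -- the vendor change is visible
```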
Contribution: Produced a plan that correctly identified all required capabilities but embedded them as descriptive prose within phase narratives rather than as discrete, verifiable steps. Decompilation was mentioned in Phase 4's cross-reference description, not in Phase 2 where the Jython scripts were specified.
Mental model flaws:
| Belief | Reality |
|---|---|
| Detailed prose plan with correct content will be executed correctly | Critical capabilities can be lost in translation from prose to code |
| Describing what a script should do is sufficient | Without "this function must call X API" steps, implementers build structurally similar but functionally different code |
Contribution: 20 tests written, all using mocked JSON. Mocked data encoded assumptions about Ghidra output that were never validated. Verification confirmed "893 tests pass, zero lint errors" — true, but irrelevant to whether the pipeline answers its core question.
Mental model flaws:
| Belief | Reality |
|---|---|
| High test count + CI green = correct | Tests validated plumbing, not substance |
| Mocked data matching expected schema validates pipeline | Mocked data assumed the hard part (decompilation) was solved |
| Verification = "tests pass and lint clean" | Should include "does the output answer the question the issue asks?" |
Contribution: First review caught surface bugs. Second review identified the fundamental gap. This is the control that worked — it prevented merging a hollow pipeline.
Contribution: Wrote clear acceptance criteria. Performed thorough reviews. Made the correct decision not to merge. Requested RCA. Controls worked at this level.
The plan correctly specified decompilation and code comparison. The implementation omitted them. No checkpoint between "plan approved" and "PR submitted" verified that each plan phase was actually implemented as specified. The plan was treated as a starting signal, not as a living checklist.
100% of new tests used mocked data. The mocked data encoded assumptions about what Ghidra would produce — assumptions never validated. The test suite provided a strong "green" signal that masked the absence of the core capability. Analogous to testing a calculator by mocking the arithmetic engine — the UI works, but 2+2 might equal 5.
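The calculator analogy can be made runnable: a test that mocks the arithmetic engine stays green even though the real engine is broken. Class and method names below are hypothetical, for illustration only.

```python
from unittest import mock

class Engine:
    def add(self, a, b):
        return a + b + 1  # broken "real" engine

class Calculator:
    def __init__(self, engine):
        self.engine = engine
    def compute_sum(self, a, b):
        return self.engine.add(a, b)

# Mocked-engine test: passes, proving only that the wiring works.
fake = mock.Mock()
fake.add.return_value = 4
assert Calculator(fake).compute_sum(2, 2) == 4  # green

# End-to-end run against the real engine: exposes the bug the mock hid.
real_result = Calculator(Engine()).compute_sum(2, 2)
print(real_result)  # 5, not 4 -- the mocked suite never saw this
```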
The development process optimizes for observable quality signals: file count, test count, lint cleanliness, pattern adherence, CI status. All excellent in PR #135. But they measure structural completeness (are all pieces present and well-formed?) rather than analytical correctness (does the pipeline answer the question?). 16 files, 1846 lines, 20 tests, zero lint errors — none of that addressed the core requirement.
The project methodology requires "scripts are the source of truth" and "anyone running our scripts must arrive at identical conclusions." The implementation was never run against real firmware. The `-noanalysis` flag's presence is strong evidence: code was written to be structurally correct, not functionally tested.
Issue #128 is ambitious — Ghidra integration, upstream cross-referencing, Kconfig reconstruction, U-Boot config, header generation. Under pressure to deliver full scope, the implementation went wide (all files, all patterns) rather than deep (one capability done correctly end-to-end). A narrower initial scope — "decompile one kernel module and compare 5 functions against upstream" — would have forced confrontation with the hard problems before building infrastructure.
| ID | Recommendation | Addresses | Priority |
|---|---|---|---|
| R-1 | Require real-data smoke test before PR submission for analysis pipelines. At minimum: one real binary through the pipeline, output inspected. | End-to-end validation gap | High |
| R-2 | Acceptance criteria as explicit checklist — plans should transform issue acceptance criteria into numbered verification steps, each with a concrete "done when" condition. | Plan-to-implementation gap | High |
| R-3 | Phase gates in plans — each phase must have verification criteria confirmed before proceeding. "Phase N done when: [observable output]" | Plan-to-implementation gap | High |
| R-4 | Scope-to-depth: prove the hard part first — for complex features, implement one vertical slice end-to-end (including real data) before going wide. Prove decompilation works before building infrastructure around it. | Breadth over depth | High |
| R-5 | First review includes requirements traceability — code review begins with "walk the acceptance criteria and point to the code that satisfies each one" before examining code quality. | Structural completeness bias | Medium |
| R-6 | Flag mocked-only test suites — when all tests for a new feature use mocked data, verification should flag this as a risk and require justification or an integration test plan. | Mocked-test blindness | Medium |
| R-7 | Distinguish scaffolding from substance in plans. Mark which steps are infrastructure vs. core analytical capability. Prioritize substance. | Structural completeness bias | Medium |
| R-8 | Post-brainstorming risk register — brainstorming outputs "what could go wrong" alongside the design. "Risk: we build plumbing without substance" would have been catchable. | Brainstorming gap | Low |
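A hedged sketch of what R-2/R-3 phase gates might look like inside a plan; the format and field names below are hypothetical, illustrating only that each "done when" condition is an observable output rather than prose:

```yaml
# Hypothetical phase-gate fragment for an implementation plan.
phases:
  - id: 2
    name: Jython extraction scripts
    done_when:
      - "decompile script calls DecompInterface and emits pseudo-C for at least one real function"
  - id: 4
    name: Upstream cross-reference
    done_when:
      - "report classifies each function as unmodified | modified | vendor_added"
      - "at least one known vendor patch appears as 'modified' in the report"
```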
The plan was correct. The implementation was structurally faithful but analytically hollow.
Every observable quality signal was green: clear issue, thorough plan, well-structured code, passing tests, clean CI. But the core capability — decompiling binary code and comparing it against upstream source — was never built.
The systemic cause is a gap between plan approval and implementation verification. No checkpoint between "plan approved" and "PR submitted" verified that the specified capabilities were actually implemented. Verification confirmed structural quality but never asked "does this answer the question?"
The code review process caught the gap — this control worked. But it caught it late, after 1846 lines and two review cycles.
For analysis pipelines: prove the hard part works first, then build infrastructure around it. Structural completeness is not analytical correctness.