@etscrivner
Last active April 5, 2026 15:53
Specification of a statistical quality control approach to software dark factory orchestration.

Autonomous Product Factory — System Specification

Overview

A system for autonomously building software products using AI agents. Agents operate in structured loops, progressing through a specification pipeline from idea to validated implementation — with minimal or no human intervention.

Influences

  • ralph-loop — simple iterative agent loops with completion detection
  • ai-prd-workflow — structured pipeline from idea to tested code
  • Attractor (StrongDM) — NLSpec-driven agent implementation, own-your-stack philosophy
  • StrongDM Factory — non-interactive development, scenario-based validation, satisfaction metrics
  • Cleanroom Software Engineering / Statistical Usage Testing — usage-weighted probabilistic validation, principled stopping criteria (lightweight adaptation — see Layer 4 and References)

Architecture

Layer 1: Orchestrator

A phase-aware, state-persistent orchestration engine.

Responsibilities:

  • Manage progression through specification phases
  • Persist state between agent invocations (file-based)
  • Spawn and coordinate parallel agents (one per RFC)
  • Detect completion via structured signals (not string matching)
  • Handle failures with retry, escalation, or human-in-the-loop breakpoints

Parameters:

| Parameter | Description | Default |
|---|---|---|
| Idea | Product description, the starting point | required |
| Phase | Which phase(s) to run | all |
| MaxLoops | Per-phase iteration cap | 10 |
| Parallel | Enable parallel RFC implementation | false |
| ThrottleLimit | Max concurrent parallel agents | 3 |
| Checkpoint | Pause for human approval between phases | false |
| ValidationSessions | Generated sessions per validation run | 15 |
| MaxValidationPasses | Max validation loop iterations | 3 |
| DesignedThreshold | Hard pass/fail gate for hand-crafted scenarios | 0.70 |
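These parameters could be captured in a small config object. A sketch in Python; the class and field names are illustrative, not part of the spec:

```python
from dataclasses import dataclass

@dataclass
class OrchestratorParams:
    idea: str                          # product description; the only required field
    phase: str = "all"                 # which phase(s) to run
    max_loops: int = 10                # per-phase iteration cap
    parallel: bool = False             # enable parallel RFC implementation
    throttle_limit: int = 3            # max concurrent parallel agents
    checkpoint: bool = False           # pause for human approval between phases
    validation_sessions: int = 15      # generated sessions per validation run
    max_validation_passes: int = 3     # max validation loop iterations
    designed_threshold: float = 0.70   # hard gate for hand-crafted scenarios
```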

Layer 2: Specification Pipeline

Each phase produces artifacts that feed the next. All artifacts are files in a workspace directory.

workspace/
├── idea.md                 # Raw input — the product idea
├── prd.md                  # Product Requirements Document
├── prd-verification.md     # Gap analysis of the PRD
├── features.md             # Extracted features (MoSCoW prioritization)
├── rules.md                # Technical constraints and standards
├── rfcs/
│   ├── rfc-001.md          # Scoped, implementable work units
│   └── ...
├── scenarios/
│   ├── scenario-001.md     # Hand-crafted validation scenarios (weights embedded)
│   └── ...
├── src/                    # Generated source code
├── reviews/                # Agent-generated code review reports
├── logs/                   # Per-phase agent output logs
├── validation/             # Scoring results, confidence, fix audit trail
└── status.json             # Orchestrator state and phase tracking

Phase Progression:

| Phase | Input | Output | Description |
|---|---|---|---|
| 1. PRD | idea.md | prd.md, prd-verification.md | Requirements elicitation + verify-revise loop |
| 2. Features | prd.md | features.md | MoSCoW-prioritized feature list |
| 3. Rules | prd.md, features.md | rules.md | Technical constraints, standards |
| 4. RFCs | features.md, rules.md | rfcs/*.md, scenarios/*.md | Scoped implementation units + validation scenarios |
| 5. Implementation | rfc-NNN.md, rules.md | src/* | Code generation per RFC (parallelizable) |
| 6. Review | src/*, rfc-NNN.md | reviews/*.md | Spec-conformance review with fix loops |
| 7. Validation | src/, scenarios/*.md, generated/*.md | confidence.json | Satisfaction scoring + statistical certification |

PRD creation and verification are a single phase with an internal verify-revise loop. Validation scenarios are authored during the RFC phase (phase 4), before implementation begins, to prevent the implementing agent from optimizing for them.
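The phase progression above is just ordered data, which an orchestrator could encode directly. A minimal sketch; the tuples mirror the table and the helper is illustrative:

```python
# Each entry: (phase name, input artifacts, output artifacts).
PHASES = [
    ("prd",            ["idea.md"],                 ["prd.md", "prd-verification.md"]),
    ("features",       ["prd.md"],                  ["features.md"]),
    ("rules",          ["prd.md", "features.md"],   ["rules.md"]),
    ("rfcs",           ["features.md", "rules.md"], ["rfcs/", "scenarios/"]),
    ("implementation", ["rfcs/", "rules.md"],       ["src/"]),
    ("review",         ["src/", "rfcs/"],           ["reviews/"]),
    ("validation",     ["src/", "scenarios/"],      ["validation/confidence.json"]),
]

def next_phase(current: str):
    """Return the phase after `current`, or None if it is the last."""
    names = [name for name, _, _ in PHASES]
    i = names.index(current)
    return names[i + 1] if i + 1 < len(names) else None
```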

Layer 3: Agent Loops

Each phase runs one or more agent loops: invoke an LLM, check structured output, persist results, decide whether to continue.

Agent types:

  • Spec agents (phases 1–4) — produce documentation artifacts. Single-threaded, sequential phases.
  • Implement agents (phase 5) — one per RFC, can run in parallel via isolated branches/worktrees. Write code to src/.
  • Review agents (phase 6) — validate code against specs with fix loops. Independent from implement agents.
  • Scorer agents (phase 7) — execute scenarios against built code, score satisfaction across three dimensions.

Loop mechanics:

for each iteration:
  1. Read current state from workspace files
  2. Invoke LLM with phase-specific prompt + relevant artifacts
  3. Parse structured output (JSON signals, not string matching)
  4. Write artifacts to workspace
  5. Evaluate completion condition
  6. If not complete and under max loops: continue
  7. If complete or max loops: advance to next phase or report

Structured signals:

{
  "status": "complete|in_progress|blocked|failed",
  "phase": "implementation",
  "artifact": "src/auth.py",
  "summary": "Implemented JWT auth per RFC-001",
  "remaining": []
}
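The loop mechanics and structured signals above can be sketched as a single driver function. This assumes the agent call returns the JSON signal shown; `invoke_agent` is a stand-in for the real LLM invocation:

```python
import json

MAX_LOOPS = 10  # per-phase iteration cap (the MaxLoops parameter)

def run_phase_loop(invoke_agent, workspace: str) -> dict:
    """Run one phase's agent loop until a terminal signal or the loop cap.

    `invoke_agent(workspace)` must return the structured JSON signal
    as a string; completion is detected from it, not by string matching.
    """
    for iteration in range(1, MAX_LOOPS + 1):
        raw = invoke_agent(workspace)      # steps 1-2: read state, invoke LLM
        signal = json.loads(raw)           # step 3: parse structured output
        if signal["status"] in ("complete", "blocked", "failed"):
            return signal                  # steps 5-7: terminal, advance or report
        # status == "in_progress": artifacts already written; keep looping
    return {"status": "failed", "summary": f"loop cap ({MAX_LOOPS}) reached"}
```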

Layer 4: Validation

Draws from Cleanroom Software Engineering's Statistical Usage Testing (usage-weighted probabilistic validation). The full Cleanroom approach requires modeling software as an explicit Markov-chain state machine — we adapt the key benefits using lightweight techniques suited to non-deterministic AI agent systems.

The key insight: the agent that writes code must not evaluate its own output. Scenarios, session generation, and scoring are each performed by separate agent invocations with different prompts.

4a. Scenario Library (Primary Validation Layer)

Hand-crafted, semantically rich scenarios that encode domain knowledge and known critical paths.

Authoring rules:

  • Written during phase 4 (RFC breakdown), NOT during implementation
  • Describe end-to-end user journeys, not unit-level assertions
  • Stored separately from code so implement agents cannot optimize for them
  • Example: "User adds an expense, sets a budget, exceeds it, receives alert"

Weighted scenario pool (adapted from Musa's operational profiles):

Rather than treating all scenarios equally, each scenario carries a probability weight reflecting how likely that usage pattern is in the real world. Weights are embedded directly in each scenario/session markdown file (e.g., **Weight**: 0.20), not in a separate index file. The bootstrap resamples proportionally — common paths influence the CI more than edge cases.

Weight sources (in order of preference):

  1. Production usage logs (when available)
  2. Domain expert estimates (for hand-crafted scenarios)
  3. LLM-generated (for generated sessions — the LLM assigns weights based on implicit usage model)
  4. Uniform distribution (the neutral starting point — refine as data arrives)
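Reading the embedded weights and falling back to the uniform distribution could look like this. A sketch only; the `**Weight**: 0.20` marker is the one shown above, and the helper names are illustrative:

```python
import re

WEIGHT_RE = re.compile(r"\*\*Weight\*\*:\s*([0-9.]+)")

def read_weight(markdown: str, default: float = 0.0) -> float:
    """Extract the embedded probability weight from a scenario/session file."""
    m = WEIGHT_RE.search(markdown)
    return float(m.group(1)) if m else default

def normalize(weights: list) -> list:
    """Renormalize so weights sum to 1.0; uniform fallback if none were found."""
    total = sum(weights)
    if total == 0:
        return [1.0 / len(weights)] * len(weights)  # the neutral starting point
    return [w / total for w in weights]
```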

4b. LLM-Generated Sessions (Secondary Generative Layer)

The scenario library is curated but finite. To cover the combinatorial long tail — paths nobody thought to write — an LLM generates additional plausible user journeys.

How it works:

  • A dedicated session generator agent receives the PRD, feature list, scenario library, and source code (to ground sessions in actual capabilities)
  • It generates N additional end-to-end sessions with varied personas (novice, power user, confused user, adversarial user)
  • Sessions include realistic mistakes, backtracking, and edge cases
  • Sessions are weighted toward common paths but include rare ones — weights are embedded directly in each session file
  • Sessions are generated once at the start of validation and reused across passes — the population is stable, only the scores change between passes

Prompt template (simplified):

Read prd.md, features.md, all scenario-*.md files, and the source code in src/.
Generate {N} realistic end-to-end user sessions. Only test features that
exist in the source code. Vary personas (novice, power user, confused,
adversarial). Include realistic mistakes, backtracking, and edge cases.
Weight toward common usage paths but include some rare critical ones.
Do NOT duplicate the existing scenario library. Each session must include
a probability weight; all weights across sessions must sum to 1.0.

This replaces the Markov chain's random-walk test generation. The LLM implicitly encodes a "usage model" from training data — it knows how people actually use software — without requiring an explicit state graph. Reading the source code grounds sessions in actual capabilities, preventing hallucinated features. The trade-off: less mathematically precise than formal path probabilities, but for a system whose outputs are themselves non-deterministic, that precision would be false anyway.

4c. Satisfaction Scoring

Not boolean pass/fail — a continuous metric measuring how well the software serves the user across a session.

Scoring dimensions:

  • Functional completeness (weight 0.4): Did the user achieve their goal? (0.0–1.0)
  • Behavioral correctness (weight 0.4): Did the system respond correctly at each step? (0.0–1.0)
  • Error handling (weight 0.2): Were errors caught and communicated gracefully? (0.0–1.0)
  • Composite satisfaction: Weighted average = (functional × 0.4) + (behavioral × 0.4) + (errorHandling × 0.2)

A score of 0.85 means "mostly works, edge cases remain." A score of 0.60 means "core flow works but significant gaps." The dimensions help diagnose where to focus iteration.
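The composite is a plain weighted average of the three dimensions, using the weights stated above:

```python
def composite_satisfaction(functional: float, behavioral: float,
                           error_handling: float) -> float:
    """Weighted average across the three scoring dimensions (weights from spec)."""
    for score in (functional, behavioral, error_handling):
        assert 0.0 <= score <= 1.0, "dimension scores are in [0, 1]"
    return functional * 0.4 + behavioral * 0.4 + error_handling * 0.2
```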

4d. Reliability Certification

Instead of full Sequential Probability Ratio Testing with Markov-derived path probabilities, we use BCa (bias-corrected and accelerated) bootstrap confidence interval narrowing on satisfaction scores. BCa corrects for bias and skewness in the bootstrap distribution, which matters when scores cluster near the ceiling (0.95+). This gives principled stopping criteria without the mathematical overhead. Minimum sample size: 20.

The process:

  1. Score all sessions (hand-crafted scenarios + LLM-generated sessions)
  2. Collect per-dimension satisfaction scores (functional, behavioral, error handling, composite)
  3. Compute weighted BCa bootstrap confidence intervals on mean satisfaction
  4. Evaluate against thresholds
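Step 3 can be sketched with a weighted resampling loop. This is a percentile-interval simplification: it resamples proportionally to scenario weights but omits the BCa bias and skewness corrections (a production version might use `scipy.stats.bootstrap` with `method="BCa"`):

```python
import numpy as np

def weighted_bootstrap_ci(scores, weights, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI on the weighted mean satisfaction (sketch).

    Common (high-weight) scenarios are resampled more often, so they
    influence the interval more than edge cases.
    """
    scores = np.asarray(scores, dtype=float)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                                   # normalize to a distribution
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(scores), size=(n_boot, len(scores)), p=p)
    means = scores[idx].mean(axis=1)                  # one resample mean per replicate
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```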

Stopping rules:

CI width threshold:    0.10   (stop testing when CI is this narrow)
Ship threshold:        0.85   (lower bound of CI must exceed this)
Fail threshold:        0.85   (upper bound of CI below this → stop and fix)
Min samples:           20     (below this, always keep-testing)

Pass 1:  35 sessions scored, mean 0.93, 95% CI [0.87, 0.97]  → width 0.10, borderline
Pass 2:  same 35 sessions re-scored, mean 0.95, 95% CI [0.90, 0.98]  → SHIP

Decision logic:

  • CI lower bound > ship threshold AND CI width < threshold on ALL dimensions → ship
  • CI upper bound < ship threshold on ANY dimension → fix (apply code change, clear all results)
  • Otherwise → keep testing (targeted fixes on weak scenarios, clear all results, re-score)
  • Max validation passes reached → fail with full diagnostics (per-dimension CIs, fix history, weak dimensions)
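The decision rules above reduce to a small function over per-dimension intervals. An illustrative sketch, assuming each dimension maps to a `(lower, upper)` CI tuple:

```python
def certification_decision(ci_by_dim: dict, ship: float = 0.85,
                           max_width: float = 0.10, min_samples: int = 20,
                           n: int = 0) -> str:
    """Map per-dimension CIs to 'ship' / 'fix' / 'keep_testing'."""
    if n < min_samples:
        return "keep_testing"          # below min samples, always keep testing
    if any(hi < ship for lo, hi in ci_by_dim.values()):
        return "fix"                   # conclusive failure on some dimension
    if all(lo > ship and (hi - lo) < max_width for lo, hi in ci_by_dim.values()):
        return "ship"                  # all dimensions pass, all CIs narrow
    return "keep_testing"
```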

Keep-testing with targeted fixes:

When the CI is too wide to decide, keep-testing identifies scenarios scoring below 0.75 and applies targeted code fixes before re-scoring. This is distinct from the fix decision (which is triggered by conclusive failure — CI upper bound below threshold or designed test gate failure).

The flow:

  1. Identify scenarios with composite < 0.75
  2. If any exist: invoke a fix agent with the weak scenarios as context, commit the change
  3. Log the fix to fix-log.jsonl (tagged keep-testing-fix to distinguish from full fix decisions)
  4. Clear ALL cached results — the fix may have changed behavior for any scenario, not just the targeted ones
  5. Re-score the full stable session population on the next pass

If no scenarios are below 0.75 (the CI is wide due to scorer noise rather than clear failures), results are still cleared for a fresh re-score. The LLM scorer is stochastic, so independent measurements on the same population reduce noise and narrow the CI.
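One keep-testing pass, then, looks roughly like this. The fix-agent and cache-clear calls are stand-ins; only the below-0.75 selection and the full cache clear come from the flow above:

```python
def weak_scenarios(results: dict, floor: float = 0.75) -> list:
    """Scenarios whose composite score falls below the targeted-fix floor."""
    return sorted(name for name, score in results.items() if score < floor)

def keep_testing_pass(results, apply_fix, clear_results):
    """One keep-testing pass: targeted fixes (if any), then a full cache clear."""
    weak = weak_scenarios(results)
    if weak:
        apply_fix(weak)        # fix agent receives the weak scenarios as context
    clear_results()            # a fix may change behavior for ANY scenario
    return weak
```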

If keep-testing does not converge within MaxValidationPasses, validation fails with full diagnostics — per-dimension CIs, the complete fix-log history, and the weakest dimensions — so a human knows exactly where the system couldn't self-correct.

Validation Workspace Structure

workspace/
├── scenarios/
│   ├── scenario-001.md           # Hand-crafted scenarios (weights embedded in file)
│   └── ...
├── validation/
│   ├── generated/                # LLM-generated sessions (stable across passes)
│   │   ├── session-001.md
│   │   └── ...
│   ├── results/
│   │   ├── scenario-001-result.json   # Per-scenario satisfaction scores
│   │   ├── session-001-result.json    # Per-session satisfaction scores
│   │   └── ...
│   ├── confidence.json           # Final certification decision + per-dimension CIs
│   └── fix-log.jsonl             # Structured audit trail of fix cycles

Why This Works (Trade-offs vs. Full SUT)

| Full Cleanroom SUT | This System |
|---|---|
| Markov chain usage model (explicit state machine) | Weighted scenario pool + LLM-as-usage-model |
| Mathematically exact path probabilities | Approximate, implicit in weights + LLM knowledge |
| Random walks through state graph | LLM-generated sessions (semantically meaningful) |
| MTTF reliability certification via SPRT | BCa bootstrap CIs on satisfaction scores with CI-width stopping |
| Requires software modeled as state machine | No structural requirements on the software |
| Expensive to build, cheap to run | Cheap to build, moderate to run (LLM token cost) |
| Deterministic test generation | Non-deterministic (appropriate for non-deterministic agents) |
| Proven at IBM/NASA for deterministic systems | Adapted for AI agent systems where outputs are inherently stochastic |

State Management

status.json tracks orchestrator state:

{
  "idea": "A CLI expense tracker with transaction CRUD and reports",
  "currentPhase": "implementation",
  "phases": {
    "prd": { "status": "complete", "loops": 7 },
    "features": { "status": "complete", "loops": 1 },
    "rules": { "status": "complete", "loops": 1 },
    "rfcs": { "status": "complete", "loops": 1 },
    "implementation": {
      "status": "in_progress",
      "rfcs": {
        "rfc-001": { "status": "complete", "loops": 1 },
        "rfc-002": { "status": "in_progress", "loops": 3 },
        "rfc-003": { "status": "pending" }
      }
    },
    "review": { "status": "pending" },
    "validation": {
      "status": "pending",
      "certification": {
        "runs": 0,
        "meanSatisfaction": null,
        "ciLower": null,
        "ciUpper": null,
        "decision": "pending"
      }
    }
  }
}
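Because all state is file-based, orchestrator queries are simple JSON reads. An illustrative helper for finding RFCs that still need work, matching the structure above:

```python
import json

def load_pending_rfcs(status_path: str) -> list:
    """RFCs not yet complete, read from the orchestrator's status.json."""
    with open(status_path) as f:
        status = json.load(f)
    rfcs = status["phases"]["implementation"]["rfcs"]
    return [rfc for rfc, state in rfcs.items() if state["status"] != "complete"]
```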

Design Decisions

Stateless agents, stateful filesystem

Each agent invocation is stateless. All memory lives in workspace files. This is simpler, more debuggable, and avoids context window limits.

Specs as the source of truth

Code is generated FROM specs. If code and spec disagree, the spec wins and code is regenerated. This prevents drift and keeps agents grounded.

Separation of concerns in validation

The agent that writes code must NOT write its own validation scenarios. Scenarios are authored during the RFC phase (phase 4) by a different agent invocation with a different prompt — before implementation begins. LLM-generated sessions are produced by yet another agent. The session generator does read source code to ground sessions in actual capabilities (avoiding hallucinated features), but it operates under a different prompt with no access to the implementation agent's reasoning or test-awareness.

Usage-weighted probabilistic validation

Adapted from Cleanroom/SUT: test what users actually do, not just what's possible. Weight scenarios by expected usage frequency. Make probabilistic reliability claims (confidence intervals on satisfaction), not boolean pass/fail. Use principled stopping criteria (CI narrowing), not "run all tests and hope."

Parallel where possible, sequential where necessary

Phases 1–4 are sequential (each depends on prior output). Phase 5 (implementation) parallelizes across RFCs via isolated branches. Phases 6–7 run sequentially after implementation completes.


Future Work

Drift Detection

When the system is deployed, monitor whether real usage patterns match the assumptions baked into scenario weights and LLM-generated sessions.

Planned lightweight approach (replaces formal KL-divergence between Markov chains):

  • Distribution comparison: Kolmogorov-Smirnov test between current satisfaction score distribution and the baseline established at certification time. Significant divergence triggers re-validation.
  • Percentile tracking: Monitor the 10th percentile satisfaction score over time. If it degrades past a threshold, real-world usage is hitting paths that validation missed.
  • Weight drift: Compare actual feature/path usage frequencies against scenario weights. When they diverge significantly, update weights and re-certify.

Digital Twin Universe

  • Behavioral clones of external services (auth providers, payment APIs, etc.)
  • Enables realistic scenario execution without live service dependencies
  • Currently uses mocks and stubs

Open Questions

  1. Human checkpoints: Which phases benefit most from human review before proceeding?
  2. Change management: When specs change mid-build, how do we propagate updates to affected RFCs and code?
  3. Agent model selection: Should different phases use different models (e.g., stronger model for PRD, faster model for implementation)?
  4. Cost controls: Token budget per phase? Per agent loop? Per validation run?
  5. Scenario weight calibration: How quickly can we move from LLM-generated weights to data-driven weights? What's the minimum production data needed?
  6. Scorer independence: Should satisfaction scoring be done by a different model than session generation? (Different is safer — avoids self-evaluation bias.)

References

Cleanroom Software Engineering

The validation approach in Layer 4 is a lightweight adaptation of Cleanroom Software Engineering's Statistical Usage Testing (SUT). These are the foundational works:

  • Mills, H.D., Dyer, M., & Linger, R.C. (1987). "Cleanroom Software Engineering." IEEE Software, 4(5), 19–25. https://doi.org/10.1109/MS.1987.231413 — The original paper introducing Cleanroom: box-structure specification, incremental development under statistical quality control, and the argument that testing should certify reliability rather than find bugs.

  • Linger, R.C. (1994). "Cleanroom Process Model." IEEE Software, 11(2), 50–58. https://doi.org/10.1109/52.268956 — Refined process model with detailed treatment of statistical testing and certification. Good overview of how usage models drive test case generation.

  • Whittaker, J.A. & Thomason, M.G. (1994). "A Markov Chain Model for Statistical Software Testing." IEEE Transactions on Software Engineering, 20(10), 812–824. https://doi.org/10.1109/32.328991 — The formal basis for Markov-chain usage models in SUT. Defines how to derive test cases from state-transition probabilities. Our LLM-as-usage-model approach is a pragmatic substitute for this when explicit state machines are impractical.

  • Currit, P.A., Dyer, M., & Mills, H.D. (1986). "Certifying the Reliability of Software." IEEE Transactions on Software Engineering, SE-12(1), 3–11. https://doi.org/10.1109/TSE.1986.6312915 — Early paper on using statistical methods to make quantified reliability claims about software. Introduces the idea that testing can certify mean time to failure (MTTF) rather than just find defects.

  • Prowell, S.J., Trammell, C.J., Linger, R.C., & Poore, J.H. (1999). Cleanroom Software Engineering: Technology and Process. Addison-Wesley. — The comprehensive textbook. Chapters 7–9 cover statistical testing in detail: usage model construction, test case generation, and reliability certification via SPRT.

Bootstrap Methods

  • Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC. — The standard reference. Chapter 14 covers BCa (bias-corrected and accelerated) confidence intervals, which we use instead of basic percentile intervals to handle skewed score distributions near the ceiling.

  • DiCiccio, T.J. & Efron, B. (1996). "Bootstrap Confidence Intervals." Statistical Science, 11(3), 189–228. https://doi.org/10.1214/ss/1032280214 — Detailed treatment of BCa intervals with theoretical justification for why they have better coverage properties than percentile intervals, especially for skewed distributions.
