A system for autonomously building software products using AI agents. Agents operate in structured loops, progressing through a specification pipeline from idea to validated implementation — with minimal or no human intervention.
- ralph-loop — simple iterative agent loops with completion detection
- ai-prd-workflow — structured pipeline from idea to tested code
- Attractor (StrongDM) — NLSpec-driven agent implementation, own-your-stack philosophy
- StrongDM Factory — non-interactive development, scenario-based validation, satisfaction metrics
- Cleanroom Software Engineering / Statistical Usage Testing — usage-weighted probabilistic validation, principled stopping criteria (lightweight adaptation — see Layer 4 and References)
A phase-aware, state-persistent orchestration engine.
Responsibilities:
- Manage progression through specification phases
- Persist state between agent invocations (file-based)
- Spawn and coordinate parallel agents (one per RFC)
- Detect completion via structured signals (not string matching)
- Handle failures with retry, escalation, or human-in-the-loop breakpoints
Parameters:
| Parameter | Description | Default |
|---|---|---|
| Idea | Product description — the starting point | required |
| Phase | Which phase(s) to run | all |
| MaxLoops | Per-phase iteration cap | 10 |
| Parallel | Enable parallel RFC implementation | false |
| ThrottleLimit | Max concurrent parallel agents | 3 |
| Checkpoint | Pause for human approval between phases | false |
| ValidationSessions | Generated sessions per validation run | 15 |
| MaxValidationPasses | Max validation loop iterations | 3 |
| DesignedThreshold | Hard pass/fail gate for hand-crafted scenarios | 0.70 |
Each phase produces artifacts that feed the next. All artifacts are files in a workspace directory.
workspace/
├── idea.md # Raw input — the product idea
├── prd.md # Product Requirements Document
├── prd-verification.md # Gap analysis of the PRD
├── features.md # Extracted features (MoSCoW prioritization)
├── rules.md # Technical constraints and standards
├── rfcs/
│ ├── rfc-001.md # Scoped, implementable work units
│ └── ...
├── scenarios/
│ ├── scenario-001.md # Hand-crafted validation scenarios (weights embedded)
│ └── ...
├── src/ # Generated source code
├── reviews/ # Agent-generated code review reports
├── logs/ # Per-phase agent output logs
├── validation/ # Scoring results, confidence, fix audit trail
└── status.json # Orchestrator state and phase tracking

Phase Progression:
| Phase | Input | Output | Description |
|---|---|---|---|
| 1. PRD | idea.md | prd.md, prd-verification.md | Requirements elicitation + verify-revise loop |
| 2. Features | prd.md | features.md | MoSCoW-prioritized feature list |
| 3. Rules | prd.md, features.md | rules.md | Technical constraints, standards |
| 4. RFCs | features.md, rules.md | rfcs/*.md, scenarios/*.md | Scoped implementation units + validation scenarios |
| 5. Implementation | rfc-NNN.md, rules.md | src/* | Code generation per RFC (parallelizable) |
| 6. Review | src/*, rfc-NNN.md | reviews/*.md | Spec-conformance review with fix loops |
| 7. Validation | src/*, scenarios/*.md, generated/*.md | confidence.json | Satisfaction scoring + statistical certification |
PRD creation and verification are a single phase with an internal verify-revise loop. Validation scenarios are authored during the RFC phase (phase 4), before implementation begins, to prevent the implementing agent from optimizing for them.
Each phase runs one or more agent loops: invoke an LLM, check structured output, persist results, decide whether to continue.
Agent types:
- Spec agents (phases 1–4) — produce documentation artifacts. Single-threaded, sequential phases.
- Implement agents (phase 5) — one per RFC, can run in parallel via isolated branches/worktrees. Write code to src/.
- Review agents (phase 6) — validate code against specs with fix loops. Independent from implement agents.
- Scorer agents (phase 7) — execute scenarios against built code, score satisfaction across three dimensions.
Loop mechanics:
for each iteration:
1. Read current state from workspace files
2. Invoke LLM with phase-specific prompt + relevant artifacts
3. Parse structured output (JSON signals, not string matching)
4. Write artifacts to workspace
5. Evaluate completion condition
6. If not complete and under max loops: continue
7. If complete or max loops: advance to next phase or report

Structured signals:
{
"status": "complete|in_progress|blocked|failed",
"phase": "implementation",
"artifact": "src/auth.py",
"summary": "Implemented JWT auth per RFC-001",
"remaining": []
}
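The loop mechanics and signal schema above can be condensed into a short Python sketch (invoke_llm and persist_artifacts are hypothetical caller-supplied callables, not part of the system):

```python
import json

def run_phase_loop(phase, workspace, invoke_llm, persist_artifacts, max_loops=10):
    """Minimal sketch of one agent loop (steps 1-7 above).

    invoke_llm and persist_artifacts are caller-supplied hypothetical callables;
    the parsed JSON fields follow the structured-signal schema shown above.
    """
    for loop in range(1, max_loops + 1):
        raw = invoke_llm(phase=phase, workspace=workspace)       # phase-specific prompt + artifacts
        signal = json.loads(raw)                                 # structured output, not string matching
        persist_artifacts(workspace, signal)                     # write artifacts + logs to workspace
        if signal["status"] == "complete" and not signal.get("remaining"):
            return {"status": "complete", "loops": loop}         # advance to the next phase
        if signal["status"] in ("blocked", "failed"):
            return {"status": signal["status"], "loops": loop}   # retry, escalate, or human breakpoint
    return {"status": "max_loops", "loops": max_loops}           # cap reached: report and stop
```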
Draws from Cleanroom Software Engineering's Statistical Usage Testing (usage-weighted probabilistic validation). The full Cleanroom approach requires modeling software as an explicit Markov-chain state machine — we adapt the key benefits using lightweight techniques suited to non-deterministic AI agent systems.

The key insight: the agent that writes code must not evaluate its own output. Scenarios, session generation, and scoring are each performed by separate agent invocations with different prompts.
Hand-crafted, semantically rich scenarios that encode domain knowledge and known critical paths.
Authoring rules:
- Written during phase 4 (RFC breakdown), NOT during implementation
- Describe end-to-end user journeys, not unit-level assertions
- Stored separately from code so implement agents cannot optimize for them
- Example: "User adds an expense, sets a budget, exceeds it, receives alert"
Weighted scenario pool (adapted from Musa's operational profiles):
Rather than treating all scenarios equally, each scenario carries a probability weight reflecting how likely that usage pattern is in the real world. Weights are embedded directly in each scenario/session markdown file (e.g., **Weight**: 0.20), not in a separate index file. The bootstrap resamples proportionally — common paths influence the CI more than edge cases.
Weight sources (in order of preference):
- Production usage logs (when available)
- Domain expert estimates (for hand-crafted scenarios)
- LLM-generated (for generated sessions — the LLM assigns weights based on implicit usage model)
- Uniform distribution (the neutral starting point — refine as data arrives)
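As a sketch of how embedded weights might be collected and normalized before resampling (the **Weight** line format follows the convention above; the regex, file layout, and uniform fallback are assumptions):

```python
import re
from pathlib import Path

WEIGHT_RE = re.compile(r"\*\*Weight\*\*:\s*([0-9]*\.?[0-9]+)")

def load_weighted_pool(*dirs):
    """Collect (path, weight) pairs from scenario/session markdown files.

    Files without an embedded **Weight** line fall back to a uniform share;
    weights are renormalized so the pool sums to 1.0 before bootstrap resampling.
    """
    entries = []
    for d in dirs:
        for path in sorted(Path(d).glob("*.md")):
            m = WEIGHT_RE.search(path.read_text())
            entries.append([path, float(m.group(1)) if m else None])
    if not entries:
        return []
    default = 1.0 / len(entries)
    for e in entries:
        if e[1] is None:
            e[1] = default                           # uniform fallback (neutral starting point)
    total = sum(w for _, w in entries)
    return [(p, w / total) for p, w in entries]      # normalized usage weights

# pool = load_weighted_pool("workspace/scenarios", "workspace/validation/generated")
```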
The scenario library is curated but finite. To cover the combinatorial long tail — paths nobody thought to write — an LLM generates additional plausible user journeys.
How it works:
- A dedicated session generator agent receives the PRD, feature list, scenario library, and source code (to ground sessions in actual capabilities)
- It generates N additional end-to-end sessions with varied personas (novice, power user, confused user, adversarial user)
- Sessions include realistic mistakes, backtracking, and edge cases
- Sessions are weighted toward common paths but include rare ones — weights are embedded directly in each session file
- Sessions are generated once at the start of validation and reused across passes — the population is stable, only the scores change between passes
Prompt template (simplified):
Read prd.md, features.md, all scenario-*.md files, and the source code in src/.
Generate {N} realistic end-to-end user sessions. Only test features that
exist in the source code. Vary personas (novice, power user, confused,
adversarial). Include realistic mistakes, backtracking, and edge cases.
Weight toward common usage paths but include some rare critical ones.
Do NOT duplicate the existing scenario library. Each session must include
a probability weight — all weights across sessions must sum to 1.0.

This replaces the Markov chain's random-walk test generation. The LLM implicitly encodes a "usage model" from training data — it knows how people actually use software — without requiring an explicit state graph. Reading the source code grounds sessions in actual capabilities, preventing hallucinated features. The trade-off: less mathematically precise than formal path probabilities, but for a system whose outputs are themselves non-deterministic, that precision would be false anyway.
Not boolean pass/fail — a continuous metric measuring how well the software serves the user across a session.
Scoring dimensions:
- Functional completeness (weight 0.4): Did the user achieve their goal? (0.0–1.0)
- Behavioral correctness (weight 0.4): Did the system respond correctly at each step? (0.0–1.0)
- Error handling (weight 0.2): Were errors caught and communicated gracefully? (0.0–1.0)
- Composite satisfaction: Weighted average = (functional × 0.4) + (behavioral × 0.4) + (errorHandling × 0.2)
A score of 0.85 means "mostly works, edge cases remain." A score of 0.60 means "core flow works but significant gaps." The dimensions help diagnose where to focus iteration.
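A minimal sketch of the composite calculation, using the dimension weights listed above:

```python
DIMENSION_WEIGHTS = {"functional": 0.4, "behavioral": 0.4, "error_handling": 0.2}

def composite_satisfaction(scores):
    """Weighted average over the three scoring dimensions (each in 0.0-1.0)."""
    return sum(scores[dim] * w for dim, w in DIMENSION_WEIGHTS.items())

# composite_satisfaction({"functional": 0.9, "behavioral": 0.85, "error_handling": 0.7})
# -> 0.9*0.4 + 0.85*0.4 + 0.7*0.2 = 0.84
```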
Instead of full Sequential Probability Ratio Testing with Markov-derived path probabilities, we use BCa (bias-corrected and accelerated) bootstrap confidence interval narrowing on satisfaction scores. BCa corrects for bias and skewness in the bootstrap distribution, which matters when scores cluster near the ceiling (0.95+). This gives principled stopping criteria without the mathematical overhead. Minimum sample size: 20.
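A simplified sketch of the interval computation: usage-weighted resampling with a percentile interval. The full BCa bias and skewness correction (available in scipy.stats.bootstrap for the unweighted case) is omitted here for brevity.

```python
import numpy as np

def weighted_bootstrap_ci(scores, weights, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI on the weighted mean satisfaction score.

    Simplified sketch: sessions are resampled in proportion to their usage
    weights and a percentile interval is reported. A full BCa correction would
    additionally adjust the percentiles for bias and acceleration.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                        # normalize embedded weights
    rng = np.random.default_rng(seed)
    n = len(scores)
    idx = rng.choice(n, size=(n_resamples, n), p=weights)    # usage-weighted resampling
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi), float(hi - lo)              # (lower, upper, CI width)
```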
The process:
- Score all sessions (hand-crafted scenarios + LLM-generated sessions)
- Collect per-dimension satisfaction scores (functional, behavioral, error handling, composite)
- Compute weighted BCa bootstrap confidence intervals on mean satisfaction
- Evaluate against thresholds
Stopping rules:
CI width threshold: 0.10 (stop testing when CI is this narrow)
Ship threshold: 0.85 (lower bound of CI must exceed this)
Fail threshold: 0.85 (upper bound of CI below this → stop and fix)
Min samples: 20 (below this, always keep-testing)
Pass 1: 35 sessions scored, mean 0.93, 95% CI [0.87, 0.97] → width 0.10, borderline
Pass 2: same 35 sessions re-scored, mean 0.95, 95% CI [0.90, 0.98] → SHIP

Decision logic (see the sketch after this list):
- CI lower bound > ship threshold AND CI width < threshold on ALL dimensions → ship
- CI upper bound < ship threshold on ANY dimension → fix (apply code change, clear all results)
- Otherwise → keep testing (targeted fixes on weak scenarios, clear all results, re-score)
- Max validation passes reached → fail with full diagnostics (per-dimension CIs, fix history, weak dimensions)
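A minimal sketch of this decision logic, assuming per-dimension CIs from the bootstrap step above (names are illustrative):

```python
def certification_decision(dim_cis, n_samples,
                           ship_threshold=0.85, ci_width_threshold=0.10, min_samples=20):
    """dim_cis maps dimension name -> (ci_lower, ci_upper) for that dimension."""
    if n_samples < min_samples:
        return "keep-testing"                              # too few samples to decide
    if any(hi < ship_threshold for lo, hi in dim_cis.values()):
        return "fix"                                       # conclusive failure on some dimension
    if all(lo > ship_threshold and (hi - lo) < ci_width_threshold
           for lo, hi in dim_cis.values()):
        return "ship"                                      # confident pass on every dimension
    return "keep-testing"                                  # targeted fixes, clear results, re-score
```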
Keep-testing with targeted fixes:
When the CI is too wide to decide, keep-testing identifies scenarios scoring below 0.75 and applies targeted code fixes before re-scoring. This is distinct from the fix decision, which is triggered by conclusive failure: a CI upper bound below the ship threshold, or hand-crafted scenarios falling below the DesignedThreshold gate.
The flow:
- Identify scenarios with composite < 0.75
- If any exist: invoke a fix agent with the weak scenarios as context, commit the change
- Log the fix to `fix-log.jsonl` (tagged `keep-testing-fix` to distinguish from full `fix` decisions)
- Clear ALL cached results — the fix may have changed behavior for any scenario, not just the targeted ones
- Re-score the full stable session population on the next pass
If no scenarios are below 0.75 (the CI is wide due to scorer noise rather than clear failures), results are still cleared for a fresh re-score. The LLM scorer is stochastic, so independent measurements on the same population reduce noise and narrow the CI.
If keep-testing does not converge within MaxValidationPasses, validation fails with full diagnostics — per-dimension CIs, the complete fix-log history, and the weakest dimensions — so a human knows exactly where the system couldn't self-correct.
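As a sketch of the bookkeeping for a keep-testing pass: identify weak scenarios and append a tagged entry to fix-log.jsonl (field names beyond the tag are illustrative):

```python
import json
import time
from pathlib import Path

def record_keep_testing_fix(results, fix_log="workspace/validation/fix-log.jsonl",
                            weak_threshold=0.75):
    """Identify weak scenarios and log a tagged fix-log entry.

    results: list of dicts like {"id": "scenario-001", "composite": 0.62}.
    """
    weak = [r for r in results if r["composite"] < weak_threshold]
    if weak:
        entry = {
            "type": "keep-testing-fix",      # tag distinguishes targeted fixes from full "fix" decisions
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "weak_scenarios": {r["id"]: r["composite"] for r in weak},
        }
        with Path(fix_log).open("a") as f:
            f.write(json.dumps(entry) + "\n")
    return weak   # caller runs the fix agent on these, then clears ALL cached results either way
```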
workspace/
├── scenarios/
│ ├── scenario-001.md # Hand-crafted scenarios (weights embedded in file)
│ └── ...
├── validation/
│ ├── generated/ # LLM-generated sessions (stable across passes)
│ │ ├── session-001.md
│ │ └── ...
│ ├── results/
│ │ ├── scenario-001-result.json # Per-scenario satisfaction scores
│ │ ├── session-001-result.json # Per-session satisfaction scores
│ │ └── ...
│ ├── confidence.json # Final certification decision + per-dimension CIs
│   └── fix-log.jsonl # Structured audit trail of fix cycles

| Full Cleanroom SUT | This System |
|---|---|
| Markov chain usage model (explicit state machine) | Weighted scenario pool + LLM-as-usage-model |
| Mathematically exact path probabilities | Approximate, implicit in weights + LLM knowledge |
| Random walks through state graph | LLM-generated sessions (semantically meaningful) |
| MTTF reliability certification via SPRT | BCa bootstrap CIs on satisfaction scores with CI-width stopping |
| Requires software modeled as state machine | No structural requirements on the software |
| Expensive to build, cheap to run | Cheap to build, moderate to run (LLM token cost) |
| Deterministic test generation | Non-deterministic (appropriate for non-deterministic agents) |
| Proven at IBM/NASA for deterministic systems | Adapted for AI agent systems where outputs are inherently stochastic |
status.json tracks orchestrator state:
{
"idea": "A CLI expense tracker with transaction CRUD and reports",
"currentPhase": "implementation",
"phases": {
"prd": { "status": "complete", "loops": 7 },
"features": { "status": "complete", "loops": 1 },
"rules": { "status": "complete", "loops": 1 },
"rfcs": { "status": "complete", "loops": 1 },
"implementation": {
"status": "in_progress",
"rfcs": {
"rfc-001": { "status": "complete", "loops": 1 },
"rfc-002": { "status": "in_progress", "loops": 3 },
"rfc-003": { "status": "pending" }
}
},
"review": { "status": "pending" },
"validation": {
"status": "pending",
"certification": {
"runs": 0,
"meanSatisfaction": null,
"ciLower": null,
"ciUpper": null,
"decision": "pending"
}
}
}
}

Each agent invocation is stateless. All memory lives in workspace files. This is simpler, more debuggable, and avoids context window limits.
Code is generated FROM specs. If code and spec disagree, the spec wins and code is regenerated. This prevents drift and keeps agents grounded.
The agent that writes code must NOT write its own validation scenarios. Scenarios are authored during the RFC phase (phase 4) by a different agent invocation with a different prompt — before implementation begins. LLM-generated sessions are produced by yet another agent. The session generator does read source code to ground sessions in actual capabilities (avoiding hallucinated features), but it operates under a different prompt with no access to the implementation agent's reasoning or test-awareness.
Adapted from Cleanroom/SUT: test what users actually do, not just what's possible. Weight scenarios by expected usage frequency. Make probabilistic reliability claims (confidence intervals on satisfaction), not boolean pass/fail. Use principled stopping criteria (CI narrowing), not "run all tests and hope."
Phases 1–4 are sequential (each depends on prior output). Phase 5 (implementation) parallelizes across RFCs via isolated branches. Phases 6–7 run sequentially after implementation completes.
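A minimal sketch of the phase 5 fan-out under the ThrottleLimit cap (run_implement_agent is a hypothetical callable that runs one RFC's agent loop on an isolated branch or worktree):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_implementation_phase(rfcs, run_implement_agent, throttle_limit=3):
    """Run one implement agent per RFC in parallel, capped at throttle_limit workers."""
    results = {}
    with ThreadPoolExecutor(max_workers=throttle_limit) as pool:
        futures = {pool.submit(run_implement_agent, rfc): rfc for rfc in rfcs}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()   # structured signal per RFC
    return results
```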
When the system is deployed, monitor whether real usage patterns match the assumptions baked into scenario weights and LLM-generated sessions.
Planned lightweight approach (replaces formal KL-divergence between Markov chains):
- Distribution comparison: Kolmogorov-Smirnov test between the current satisfaction score distribution and the baseline established at certification time. Significant divergence triggers re-validation (see the sketch after this list).
- Percentile tracking: Monitor the 10th percentile satisfaction score over time. If it degrades past a threshold, real-world usage is hitting paths that validation missed.
- Weight drift: Compare actual feature/path usage frequencies against scenario weights. When they diverge significantly, update weights and re-certify.
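A sketch of the distribution-comparison check using a two-sample Kolmogorov-Smirnov test (the significance level and the caller's re-validation hook are assumptions):

```python
from scipy.stats import ks_2samp

def usage_drift_detected(baseline_scores, live_scores, alpha=0.01):
    """Compare production satisfaction scores against the certification baseline.

    baseline_scores: composite scores recorded at certification time.
    live_scores: composite scores sampled from production monitoring.
    Returns True when the distributions diverge significantly, i.e. re-validation is warranted.
    """
    result = ks_2samp(baseline_scores, live_scores)
    return result.pvalue < alpha
```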
- Behavioral clones of external services (auth providers, payment APIs, etc.)
- Enables realistic scenario execution without live service dependencies
- Currently uses mocks and stubs
- Human checkpoints: Which phases benefit most from human review before proceeding?
- Change management: When specs change mid-build, how do we propagate updates to affected RFCs and code?
- Agent model selection: Should different phases use different models (e.g., stronger model for PRD, faster model for implementation)?
- Cost controls: Token budget per phase? Per agent loop? Per validation run?
- Scenario weight calibration: How quickly can we move from LLM-generated weights to data-driven weights? What's the minimum production data needed?
- Scorer independence: Should satisfaction scoring be done by a different model than session generation? (Different is safer — avoids self-evaluation bias.)
The validation approach in Layer 4 is a lightweight adaptation of Cleanroom Software Engineering's Statistical Usage Testing (SUT). These are the foundational works:
- Mills, H.D., Dyer, M., & Linger, R.C. (1987). "Cleanroom Software Engineering." IEEE Software, 4(5), 19–25. https://doi.org/10.1109/MS.1987.231413 — The original paper introducing Cleanroom: box-structure specification, incremental development under statistical quality control, and the argument that testing should certify reliability rather than find bugs.
- Linger, R.C. (1994). "Cleanroom Process Model." IEEE Software, 11(2), 50–58. https://doi.org/10.1109/52.268956 — Refined process model with detailed treatment of statistical testing and certification. Good overview of how usage models drive test case generation.
- Whittaker, J.A. & Thomason, M.G. (1994). "A Markov Chain Model for Statistical Software Testing." IEEE Transactions on Software Engineering, 20(10), 812–824. https://doi.org/10.1109/32.328991 — The formal basis for Markov-chain usage models in SUT. Defines how to derive test cases from state-transition probabilities. Our LLM-as-usage-model approach is a pragmatic substitute for this when explicit state machines are impractical.
- Currit, P.A., Dyer, M., & Mills, H.D. (1986). "Certifying the Reliability of Software." IEEE Transactions on Software Engineering, SE-12(1), 3–11. https://doi.org/10.1109/TSE.1986.6312915 — Early paper on using statistical methods to make quantified reliability claims about software. Introduces the idea that testing can certify mean time to failure (MTTF) rather than just find defects.
- Prowell, S.J., Trammell, C.J., Linger, R.C., & Poore, J.H. (1999). Cleanroom Software Engineering: Technology and Process. Addison-Wesley. — The comprehensive textbook. Chapters 7–9 cover statistical testing in detail: usage model construction, test case generation, and reliability certification via SPRT.
- Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC. — The standard reference. Chapter 14 covers BCa (bias-corrected and accelerated) confidence intervals, which we use instead of basic percentile intervals to handle skewed score distributions near the ceiling.
- DiCiccio, T.J. & Efron, B. (1996). "Bootstrap Confidence Intervals." Statistical Science, 11(3), 189–228. https://doi.org/10.1214/ss/1032280214 — Detailed treatment of BCa intervals with theoretical justification for why they have better coverage properties than percentile intervals, especially for skewed distributions.