# Agent Maturity Assessment
A diagnostic for engineering organization health in the AI-agentic coding era. The question this assessment answers: **is this org capable of shipping safely with humans and agents working in parallel, on a codebase that doesn't degrade with every iteration?**
This skill owns the criteria, the scoring rubric, and the audit output format. It runs against either a whole organization or a specific scope (team, product line, repo).
## When to use
- **One-shot audit**: assess an organization's current state during onboarding, or a specific team / repo / acquired company.
- **Recurring**: re-run quarterly against the same org to track movement, or against new sub-teams as they form or get acquired.
- **Spot-check**: a single repo or service can be scored against just the items that apply (note which items were skipped and why).
The artifact is the deliverable. Always produce the written audit using the template at the bottom of this file — never just give a verbal summary.
## The 12 criteria
Each item scores **1.0** (pass), **0.5** (partial), or **0.0** (fail). Be conservative: if it's not visibly true, it's 0.5. If there's no evidence at all, it's 0. Categories carry weights applied at the end.
Each item has:
- **Score levels** — what 1.0 / 0.5 / 0.0 looks like
- **Repo check or diagnostic** — concrete way to gather evidence
- **Why it matters** — one-line rationale
### Category A — Engineering basics (weight 1.0×)
Non-negotiable foundations. Failure here multiplies risk on everything else.
**1. Reproducible dev environments**
- 1.0 — Clone-to-green-build in <30 min via devcontainer, Nix, or a single setup script. Same path works for an agent.
- 0.5 — README exists but bootstrap takes >2 hours or has known broken steps.
- 0.0 — "Ask Bob, he knows the trick."
- *Repo check:* `.devcontainer/`, `flake.nix`, `setup.sh`, or equivalent. Run it from a clean machine.
- *Diagnostic commands:*
  - `ls .devcontainer/ flake.nix setup.sh scripts/bootstrap* 2>/dev/null` — bootstrap surface
  - `time bash <bootstrap-script>` on a clean machine to verify the <30 min claim (see the container sketch below)
  - `gh repo view <org>/<bootstrap-deps-repo> 2>/dev/null` for any external bootstrap repo identified during scope mapping
- *Why it matters:* Onboarding latency is the first multiplier on team velocity, and agents need bootstrappable environments too. If a human can't get green in 30 minutes, an agent definitely can't.
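
A minimal sketch of the clean-machine timing check, assuming Docker is available and the repo bootstraps via a single script; the image and `./setup.sh` entry point are placeholders for whatever the surface scan above found:

```bash
# Hypothetical clean-machine check: run the bootstrap inside a throwaway container so no
# host state leaks in. ubuntu:24.04 and ./setup.sh are placeholders; use the repo's own.
docker run --rm -v "$PWD":/workspace -w /workspace ubuntu:24.04 \
  bash -c "apt-get update -qq && apt-get install -y -qq git curl >/dev/null && time ./setup.sh"
```

If the bootstrap needs services (a database, a broker), the repo's own devcontainer image is usually a fairer baseline than a bare OS image.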
**2. Sub-day integration cadence with measured outcomes**
- 1.0 — Code integrates to mainline at least daily. PRs are small and merge sub-day. All four DORA metrics (deployment frequency, lead time, change-fail rate, MTTR) are tracked and visible. Branching model can be trunk-based, GitHub flow, or short-lived Git flow — what matters is the absence of long-lived branches and the presence of measured integration discipline.
- 0.5 — Some metrics tracked, but cadence is weekly, PRs sit for days, or feature branches routinely outlive a sprint.
- 0.0 — Long-lived feature branches as the norm, release trains measured in months, no metrics.
- *Repo check:* age distribution of merged PRs over the last 90 days; presence of any DORA dashboard.
- *Diagnostic commands:*
  - `gh pr list --state merged --limit 200 --search "merged:>$(date -d '90 days ago' +%Y-%m-%d)" --json mergedAt,createdAt,additions,deletions,reviews,author` — cadence + lead time + PR size + review counts in one call
  - `gh api "repos/{owner}/{repo}/branches?per_page=100" --paginate --jq '.[] | {name, last_commit_sha: .commit.sha}'` then resolve commit dates → branch staleness distribution (see the staleness sketch after this item)
  - `gh run list --workflow=deploy*.yml --limit 100 --json conclusion,createdAt,name --branch <default>` — deployment frequency proxy and change-fail rate (failed conclusions / total)
  - For monorepos with deploys in adjacent infra/CD repos: rerun the `gh run list` against `<org>/<cd-repo>`
- *Why it matters:* Integration cadence is the leading indicator of engineering performance. With agents in the loop the case is stronger — agents work fastest when changes validate against current main immediately, and long-lived branches accumulate integration debt humans have to resolve later.
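
A sketch of the staleness step referenced above, assuming Tier 1 access; `<owner>/<repo>` is a placeholder and GNU `date` is assumed (swap in `gdate` on macOS):

```bash
# Resolve each branch's last commit date and print an age distribution (stalest at the bottom).
gh api "repos/<owner>/<repo>/branches?per_page=100" --paginate \
  --jq '.[] | [.name, .commit.sha] | @tsv' |
while IFS=$'\t' read -r branch sha; do
  pushed=$(gh api "repos/<owner>/<repo>/commits/$sha" --jq '.commit.committer.date')
  age_days=$(( ( $(date +%s) - $(date -d "$pushed" +%s) ) / 86400 ))
  printf '%s\t%s\n' "$age_days" "$branch"
done | sort -n
```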
**3. Testability and the agent inner loop**
- 1.0 — The application is *built* to be tested: real seams (DI, ports/adapters, deep modules with clean interfaces) so behaviors can be verified at module boundaries without spinning up the world. Unit tests are sub-second; the full suite runs in minutes; flaky tests are treated as bugs and fixed within a sprint. A single command runs the suite headlessly with machine-parseable output. TDD-style inner loops — write the test, make it pass, refactor — are the *default* mode of working with AI.
- 0.5 — Tests exist and mostly run, but the application has known untestable areas, the suite is slow enough to break flow, flaky tests get re-run rather than fixed, or TDD with agents is occasional rather than default.
- 0.0 — Manual QA, flaky-and-ignored test suite, or no seams in the application — agents can technically run `npm test` but the signal is garbage.
- *Repo check:* run the suite, time it, check failure rate over the last 50 CI runs; sample a recent feature PR and look at whether tests were written before or after the implementation.
- *Diagnostic commands:*
  - `time <test-command>` (e.g. `time pnpm test`, `time dotnet test`) — full suite duration
  - `find . -name "*.test.*" -o -name "*.spec.*" -o -name "*Tests.cs" 2>/dev/null | wc -l` — test file count as a sanity floor
  - `gh run list --workflow=ci.yml --limit 50 --json conclusion --jq '[.[] | .conclusion] | group_by(.) | map({status: .[0], count: length})'` — flake/fail rate
  - `grep -rE "\\|\\|\\s*true|continue-on-error:\\s*true" .github/workflows/ 2>/dev/null` — CI swallowing failures (any hit = item probably 0.0 regardless of test count)
  - For QA in adjacent repo (e.g. `<org>/qa-e2e`): `gh repo view <org>/<qa-repo>` and inspect its CI run history the same way
- *Why it matters:* Humans can reason around bad tests ("yeah, that test is garbage, but I know the code works"). Agents can't — they follow the signal. The test suite is the rate limit on agent throughput; agents without fast, trustworthy feedback outrun their headlights and produce thrash.
**4. Observability before features**
- 1.0 — Structured logs, distributed traces, error budgets defined, on-call with runbooks. New features ship instrumented.
- 0.5 — Logs and metrics exist but tracing is partial; runbooks stale.
- 0.0 — "We grep CloudWatch when something breaks."
- *Repo check:* OTel libraries in deps, dashboards exist, error budget docs, recency of last runbook update.
- *Diagnostic commands:*
  - `grep -rEh "OpenTelemetry|opentelemetry|Microsoft\\.ApplicationInsights|datadog|prometheus|grafana|loki|tempo|sentry|honeycomb|newrelic|splunk" --include="*.csproj" --include="package.json" --include="go.mod" --include="requirements*.txt" --include="Cargo.toml" --include="pom.xml" --include="build.gradle*" 2>/dev/null` — instrumentation / agent libs (Grafana itself is viz; this catches the Grafana Cloud agent, faro SDK, Loki/Tempo clients that feed it)
  - `find . \( -path "*/grafana/*.json" -o -path "*/dashboards/*.json" -o -name "*.libsonnet" -o -path "*/prometheus/*.yml" -o -path "*/alerts/*.yml" \) -not -path "*/node_modules/*" 2>/dev/null` — committed Grafana dashboards, Jsonnet, Prometheus alert rules
  - `find . -ipath "*runbook*" -o -ipath "*incident*" -o -ipath "*sli*" -o -ipath "*slo*" 2>/dev/null` — runbook / SLO presence
  - `git log --since="180 days ago" --oneline -- docs/runbooks/ docs/ops/ 2>/dev/null | wc -l` — recency of operational docs
  - For dashboards/alerts in an adjacent repo (e.g. `<org>/observability`, `<org>/grafana-dashboards`): rerun the dashboard-file `find` there — score across both
- *Why it matters:* You can't fix what you can't see. AI accelerates ship rate, which accelerates incident rate — observability is the safety net that makes acceleration survivable.
### Category B — Knowledge & context (weight 1.5×)
This is what's gotten *more* important with LLMs, not less. Agents perform at the level of context the org provides them, and codebase shape determines whether agents can navigate it at all. Weighted highest because this category compounds — a team that gets B right tends to fix everything else.
**5. Design discipline as a first-class practice**
- 1.0 — ADRs are current and dated. ARCHITECTURE.md exists per active repo. A **ubiquitous language glossary** is checked in, referenced in agent context, and the team enforces its terms in code, docs, and conversation. Design happens *before* code generation: agents are pointed at planning skills (e.g., "interview-me-until-shared-understanding" patterns) that force a shared design concept before any code is written. ADR/glossary commits are visible in the last 90 days — design is an ongoing investment, not a one-time write.
- 0.5 — Some design artifacts exist but are stale; ubiquitous language is implicit (people just know the terms); planning happens informally before some agent work but not consistently.
- 0.0 — Tribal knowledge. Architecture lives in one staff engineer's head. Agents are turned loose without a shared design concept and produce confidently wrong code.
- *Repo check:* `docs/adr/`, `ARCHITECTURE.md`, glossary or ubiquitous-language file; check git log on those paths for recency; sample an agent-driven PR for evidence of upfront design vs. straight-to-code.
- *Diagnostic commands:*
  - `find . -ipath "*adr*" -name "*.md" 2>/dev/null | head; find . -iname "ARCHITECTURE.md" -o -iname "GLOSSARY.md" -o -iname "*ubiquitous*" 2>/dev/null` — design surface
  - `git log --since="90 days ago" --oneline -- docs/adr/ ARCHITECTURE.md 2>/dev/null | wc -l` — ongoing investment vs. one-time write
  - For ADRs in a central docs repo: `gh api "repos/<org>/<docs-repo>/contents/adr" --jq '.[].name'`
- *Why it matters:* Specs-to-code without design discipline produces software entropy — each iteration makes the codebase worse. Investing in design daily is what keeps tactical AI execution aligned with strategic intent. The ubiquitous language is the bridge between domain experts, engineers, and agents — without it, every translation step introduces drift.
**6. Codebase composed of deep modules**
- 1.0 — The codebase is structured as **deep modules**: few large modules, each with substantial functionality hidden behind a simple, stable interface. Public interfaces are small and intentional; implementations can be sizeable but encapsulated. When agents add code, they add it inside an existing deep module's boundary or create a new module with a clear interface — they don't sprinkle helpers across the codebase.
- 0.5 — Some areas well-modularized; others are shallow / sprinkly. Agents tend to add code in surface-level helpers rather than respecting boundaries. A handful of god-classes exist but are known and bounded.
- 0.0 — Sprawling shallow modules with leaky interfaces; 4000-line god files alongside 30-line helper files with no clear pattern. Agents can't navigate the module map and produce code that crosses arbitrary boundaries.
- *Repo check:* file size distribution, public API surface per module, sample two random modules and see whether you can summarize each one's purpose in a sentence; drop one into an LLM and ask it to explain (see the size-distribution sketch below).
- *Why it matters:* AI excels at filling in implementation when given a clean interface; it produces sprawl when given no constraints. Deep modules give agents the right *shape* of problem to solve. Shallow codebases compound entropy with every agent-driven change.
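
There is no single command for module depth; a rough sketch of the size-distribution proxy, assuming a GNU userland (the extensions are examples, adjust to the repo's languages):

```bash
# File-size distribution as a crude depth proxy: a long tail of tiny files plus a few god
# files is the "shallow and sprinkly" smell; pair it with the manual module sampling above.
find . -type f \( -name "*.ts" -o -name "*.cs" -o -name "*.py" -o -name "*.go" \) \
  -not -path "*/node_modules/*" -not -path "*/.git/*" -print0 |
  xargs -0 wc -l | grep -v ' total$' | awk '{print $1}' | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      i90 = int(NR * 0.9); if (i90 < 1) i90 = 1
      printf "files=%d  p50=%d  p90=%d  max=%d lines\n", NR, v[int((NR + 1) / 2)], v[i90], v[NR]
    }'
# Helper sprawl: directories carrying many tiny sibling files (adjust the extension).
find . -type f -name "*.ts" -not -path "*/node_modules/*" | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head
```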
**7. Repo-local agent context**
- 1.0 — `CLAUDE.md` / `AGENTS.md` / skill files checked into the repo. Team-level prompt and skill libraries are versioned. Agents joining the team get the same onboarding humans get. Agent context references the ubiquitous language and the module map (items 5 + 6).
- 0.5 — Some individuals have personal CLAUDE.md files; nothing shared at the repo level.
- 0.0 — No agent context anywhere; people copy-paste instructions into chat each time.
- *Repo check:* `CLAUDE.md`, `AGENTS.md`, `.claude/`, `.cursor/rules/`, `.skills/`, or equivalent. Read one — does it teach the agent something the engineer wouldn't have to be told?
- *Diagnostic commands:*
  - `find . -maxdepth 4 \( -iname "CLAUDE.md" -o -iname "AGENTS.md" -o -name ".claude" -o -name ".cursor" -o -name ".skills" -o -name "memory-bank" \) -not -path "./node_modules/*" -not -path "./.git/*" 2>/dev/null` — agent-context surface
  - For each found file/dir: `wc -l` and `git log -1 --format="%ar" -- <path>` to gauge depth and recency
  - For shared agent context in adjacent repo (e.g. `<org>/claude-skills`, `<org>/.github`): `gh repo view <org>/<repo>` and check whether this repo references it
- *Why it matters:* Agents perform at the level of context the repo provides them. Ad-hoc personal prompts mean each engineer's agent operates at a different standard; checked-in context means everyone (and every agent) gets the same baseline.
### Category C — AI governance & quality (weight 1.25×)
The new control plane.
**8. Sanctioned, governed AI tooling**
- 1.0 — Approved model list, ZDR posture documented, secrets scanning on agent outputs, clear policy on what can / can't be sent to third parties, paid seats budgeted.
- 0.5 — Tooling is paid for but governance is loose; or governance is tight but everyone uses personal accounts anyway.
- 0.0 — Shadow AI. People paste prod data into free-tier chatbots.
- *Diagnostic:* ask any IC "what model are you using and is the company paying for it?" Corroborate with the repo-visible checks sketched below.
- *Why it matters:* Shadow AI is shadow IT with worse confidentiality and IP risk. Governance now is cheaper than recovering from a leak later.
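
The IC interview is the primary evidence here; repo-side traces only corroborate it. A sketch of what to look for (tool and file names are illustrative, not an approved list):

```bash
# Secrets scanning wired into CI (gitleaks / trufflehog / detect-secrets are common choices).
grep -rilE "gitleaks|trufflehog|detect-secrets" .github/workflows/ .pre-commit-config.yaml 2>/dev/null
# A written AI policy or approved-model list committed somewhere findable.
find . -maxdepth 3 \( -iname "*ai-polic*" -o -iname "*acceptable-use*" -o -iname "*approved-model*" \) 2>/dev/null
# Sanctioned tool configuration checked in rather than living on personal machines.
ls -d .github/copilot-instructions.md .cursor CLAUDE.md 2>/dev/null
```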
**9. Human review on every PR regardless of authorship**
- 1.0 — AI-generated code is reviewed by a human who understands it well enough to defend it in a postmortem. "The agent wrote it" is not a shield.
- 0.5 — Reviews happen but are cursory; AI-authored PRs get rubber-stamped.
- 0.0 — Auto-merge on agent PRs, or no review process at all.
- *Repo check:* PR review settings, review depth on a sample of recent AI-tagged PRs.
- *Diagnostic commands:*
  - `find . -name "CODEOWNERS" 2>/dev/null` — review enforcement file
  - `gh api "repos/{owner}/{repo}/branches/<default>/protection" 2>/dev/null` — branch protection rules (auth scope permitting)
  - `gh pr list --state merged --limit 50 --json number,reviews,author,additions,deletions --jq '[.[] | {pr: .number, author: .author.login, reviewers: [.reviews[].author.login] | unique, lines: (.additions + .deletions)}]'` — review depth and non-author reviewer presence per PR
  - For org-level review policy in `<org>/.github`: `gh api "repos/<org>/.github/contents/" --jq '.[].name'`
- *Why it matters:* AI-authored code that no human can defend is technical debt with no owner. Review discipline is what keeps the org accountable for what it ships.
**10. Evals for AI-touched code paths**
- 1.0 — If LLMs are in the product → offline eval suite + prod telemetry. If LLMs are in the dev loop → adoption, throughput, and defect rate measured honestly (not just "everyone loves it").
- 0.5 — Vibes-based confidence; some metrics but no rigor.
- 0.0 — No evals, no measurement, no idea if the AI helps or hurts.
- *Repo check:* `evals/`, `benchmarks/`, internal AI tooling dashboards (see the sketch below).
- *Why it matters:* Without evals, you can't tell whether AI is helping or hurting — you're managing on vibes. Evals are also the only way to catch silent regressions in AI-driven product features.
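
A sketch of the eval-surface check; the framework names are examples of what might show up in manifests, not an exhaustive or endorsed list:

```bash
# Directory heuristics: an evals/ or benchmarks/ tree that actually receives commits.
find . -maxdepth 3 -type d \( -iname "evals" -o -iname "eval" -o -iname "benchmarks" \) -not -path "*/node_modules/*" 2>/dev/null
git log --since="90 days ago" --oneline -- evals/ benchmarks/ 2>/dev/null | wc -l   # maintained or abandoned?
# Dependency heuristics: common eval tooling showing up in manifests.
grep -riE "promptfoo|deepeval|ragas|braintrust" package.json requirements*.txt pyproject.toml 2>/dev/null
```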
**11. Blast-radius controls for agent actions**
- 1.0 — Scoped credentials per agent, dry-run modes, audit logs of every agent-triggered write, documented rollback paths. The "agent shipped a migration to prod at 2am" scenario has been red-teamed.
- 0.5 — Some controls exist but are inconsistent; audit logs partial.
- 0.0 — Agents have prod write access via human-equivalent creds; no audit trail.
- *Diagnostic:* "what's the dumbest possible agent action that could break prod, and would we know within 5 minutes?"
- *Diagnostic commands:*
  - `grep -rEh "azure/login@|aws-actions/configure-aws-credentials@|google-github-actions/auth@" .github/workflows/ 2>/dev/null` — OIDC adoption (presence of `with: client-id:` rather than `secrets.AWS_ACCESS_KEY_ID` is the green flag)
  - `gh api "repos/{owner}/{repo}/environments" --jq '.environments[] | {name: .name, has_protection: (.protection_rules | length > 0)}' 2>/dev/null` — env-scoped deploys with reviewers
  - `find infra/ terraform/ -name "*.tf" 2>/dev/null | xargs grep -lE "service_account|workload_identity|managed_identity|user_assigned_identity" 2>/dev/null` — scoped per-workload identities
  - `grep -rEh "azurerm_role_assignment|google_project_iam|aws_iam_role" infra/ terraform/ 2>/dev/null | wc -l` — IAM blast-radius posture
  - For Terraform/IAM in adjacent infra repo (e.g. `<org>/infra`): clone shallow and rerun the same greps there
- *Why it matters:* Autonomous agents will eventually do something stupid. The question is whether the blast radius is bounded by design or by luck.
### Category D — Hiring (weight 1.0×)
**12. Interviews assess judgment under AI augmentation**
- 1.0 — Candidates use AI in interviews and are evaluated on critique, decomposition, recognizing wrong answers, and shipping correct work. The bar is "great judgment with AI", not "no AI allowed".
- 0.5 — AI is allowed but interviewers don't know how to assess its use; or it's banned for "purity" reasons.
- 0.0 — Old-style whiteboard-only interviews; or no real technical bar at all.
- *Diagnostic:* shadow an interview loop or read the rubric.
- *Why it matters:* Hiring is a forward-looking bet. The skill that matters in the AI-agentic era isn't "can write code without AI" — it's "can use AI well." Interviews that don't measure that are betting on the wrong skill.
## Scoring
**Raw score**: sum of all 12 item scores. Max 12.
**Weighted score** (recommended primary metric):
```
A_total  = sum(items 1–4)  × 1.00   // max 4.00
B_total  = sum(items 5–7)  × 1.50   // max 4.50
C_total  = sum(items 8–11) × 1.25   // max 5.00
D_total  = sum(item 12)    × 1.00   // max 1.00
──────────
weighted = A_total + B_total + C_total + D_total
max      = 14.50
score%   = (weighted / 14.50) × 100
```
If any item is scored `n/a`, drop it from both numerator and max for that audit and note it in the Summary.
**Bands**:

| Band | Score % | Interpretation |
|------|---------|----------------|
| Excellent | 90%+ | Genuinely rare. Confirm with a second pass — first audits often score too generously. |
| Healthy | 75–89% | Targeted fixes will compound. |
| Functional but slow | 60–74% | Real risk of being out-shipped by AI-native competitors. Where most orgs actually live. |
| Significant dysfunction | 40–59% | Treat as a turnaround. |
| Triage | <40% | Stop new feature work until basics are in. |

The bar: **<11/12 raw and <80% weighted means there's leverage to capture.**
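
For convenience, a throwaway sketch that ties the formula, the `n/a` rule, and the bands together (pass the 12 item scores in order; purely illustrative, not part of the skill):

```bash
#!/usr/bin/env bash
# score.sh: ./score.sh 1 0.5 1 0 1 1 0.5 n/a 1 0.5 0 1   (items 1-12 in order, n/a allowed)
awk -v scores="$*" 'BEGIN {
  n = split(scores, s, " ")
  for (i = 1; i <= n && i <= 12; i++) {
    w = (i <= 4) ? 1.00 : (i <= 7) ? 1.50 : (i <= 11) ? 1.25 : 1.00   # category weights
    if (s[i] == "n/a") continue                                        # drop from numerator and max
    raw += s[i]; weighted += s[i] * w; max += w; scored++
  }
  pct = (max > 0) ? weighted / max * 100 : 0
  if      (pct >= 90) band = "Excellent"
  else if (pct >= 75) band = "Healthy"
  else if (pct >= 60) band = "Functional but slow"
  else if (pct >= 40) band = "Significant dysfunction"
  else                band = "Triage"
  printf "raw %.1f/%d   weighted %.2f/%.2f   %.1f%%   band: %s\n", raw, scored, weighted, max, pct, band
}'
```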
## How to run an audit
1. **Decide scope.** Whole org, one product line, one repo, or one team. Score the appropriate level — don't average across heterogeneous teams (a 14-person backend team and a 3-person ML team should be scored separately).
2. **Environment preflight** (see *Environment preflight* below). Probe for `gh` CLI / GitHub MCP / git access and select an evidence-fidelity tier before running any diagnostics. **Always announce the tier you're running at** so the audit is reproducible.
3. **Map adjacent repos** (see *Handling multi-repo scope* below). CI templates, Terraform modules, QA suites, runbooks, and shared agent context often live in sibling repos. Capture the list before scoring.
4. **Gather evidence per item.** Don't take anyone's word for it. For each item, do at least one of: read the repo (and its adjacents), run the diagnostic commands listed under that criterion at the highest fidelity tier available, ask a non-leadership IC the diagnostic question, or check the relevant dashboard/settings page.
5. **Score conservatively.** When in doubt, 0.5. Revise up next quarter if evidence appears.
6. **Write the audit** using the template below. The artifact is the deliverable. **Each "Why this score" cell is one sentence, ≤ 25 words** — pick the single most decisive piece of evidence, save the rest for Top 3 fixes / Strengths / Notes.
7. **Decide on distribution.** First audit at a new role is usually best kept internal until the calibration has been validated. Re-run in 90 days.
### Environment preflight
**First, read `docs/audits/CONFIG.md` if it exists.** That file is scaffolded by the `setup-agent-maturity-assessment` skill and declares the GitHub auth method, the canonical org/repo/branch, the pre-approved list of adjacent repos in scope, and the audit cadence. When it's present, use its declared values as the source of truth — skip the runtime probes below for the parts CONFIG.md already answers, and treat the runtime probes as drift-detection only.
If CONFIG.md is **missing** or its declared auth method fails the probe (e.g. CONFIG says "gh" but `gh auth status` errors), fall back to the full preflight below and surface the gap in **Notes for re-audit** so the user can re-run `setup-agent-maturity-assessment` later.
The diagnostic commands assume `gh` CLI is in `$PATH` and authenticated. In a sandboxed runtime (e.g. Cowork) this is often not true even if `gh` is installed on the host. Run this preflight before scoring and select the tier:
```bash
# Tier 1 — gh CLI authenticated → highest fidelity (full GitHub API access)
command -v gh >/dev/null 2>&1 && gh auth status >/dev/null 2>&1 && echo "tier=1 gh"
# Tier 2 — GitHub MCP server connected → equivalent fidelity via MCP tools
# (Detect via host capabilities; in Claude Code, look for tools named like
# list_pull_requests, get_workflow_runs, get_branch_protection.)
# Tier 3 — git + filesystem only → reduced fidelity
git -C . rev-parse --is-inside-work-tree >/dev/null 2>&1 && echo "tier=3 git-only"
```
**Tier behavior:**

| Tier | Available | Use for |
|------|-----------|---------|
| 1. `gh` authenticated | All `gh pr list`, `gh api`, `gh run list` commands | Default. Highest-fidelity audits. |
| 2. GitHub MCP | Equivalent MCP-routed tools | Use when running in a sandbox where `gh` isn't on the host but a GitHub MCP is connected. |
| 3. git + filesystem only | `git log`, `find`, `grep` | Fallback. Items 2, 3, 9, 11 score against approximations (merge commits as PR proxies, no branch-protection visibility, no review-depth metrics). |

**At Tier 3, the audit MUST:**
- State "Tier 3 (git-only) audit — limited GitHub-side evidence" in the **Summary's One-line take**.
- Add an entry to **Notes for re-audit** listing which items were scored against fallback evidence and what to re-verify when running at Tier 1.
- Never auto-promote a Tier 3 score to 1.0 on items 2, 3, 9, or 11 — the missing GitHub-side data could pull them down. Cap those at 0.5 unless filesystem evidence alone is sufficient.
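
As an example of what Tier 3 fallback evidence looks like for item 2, using merge commits as PR proxies as noted in the table (`<default>` is a placeholder; squash merges are invisible to `--merges`, which is part of why the cap above exists):

```bash
# Cadence proxy: merge commits per day on the default branch over the last 90 days.
git log origin/<default> --merges --since="90 days ago" --format="%cs" | sort | uniq -c
# Staleness proxy: remote branches ordered oldest-first by last commit.
git for-each-ref --sort=committerdate \
  --format="%(committerdate:short) %(refname:short)" refs/remotes/origin | head -20
```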
**To upgrade Tier 3 → Tier 2 in Cowork (or any sandbox):** add a GitHub MCP server. Cowork's curated MCP registry doesn't currently bundle one, so add it as a custom MCP via Settings → MCP Servers, pointing at GitHub's official `github/github-mcp-server` (remote-hostable) or Anthropic's reference implementation. Auth flows through your GitHub OAuth/PAT scoped to the orgs you want to audit — no creds touch the sandbox.
**Optional — host-side probe script.** When the sandbox is stuck at Tier 3 but the user has `gh` on their host, ask them to run this and paste the output back. The audit can incorporate the results without any creds entering the sandbox.
```bash
#!/usr/bin/env bash
# audit-gh-probe.sh — run on host, paste output to Claude
set -euo pipefail
REPO="${1:?usage: audit-gh-probe.sh <owner/repo>}"
SINCE="$(date -d '90 days ago' +%Y-%m-%d 2>/dev/null || date -v-90d +%Y-%m-%d)"
echo "### gh-pr-list (cadence + lead time + review depth) ###"
gh pr list --repo "$REPO" --state merged --limit 200 \
  --search "merged:>$SINCE" \
  --json number,mergedAt,createdAt,additions,deletions,reviews,author
echo "### gh-branch-protection ###"
gh api "repos/$REPO/branches/$(gh repo view "$REPO" --json defaultBranchRef --jq .defaultBranchRef.name)/protection" 2>&1 || true
echo "### gh-environments ###"
gh api "repos/$REPO/environments" --jq '.environments[] | {name, has_protection: (.protection_rules | length > 0)}' 2>&1 || true
echo "### gh-deploy-runs ###"
gh run list --repo "$REPO" --workflow=deploy --limit 100 \
  --json conclusion,createdAt,name 2>&1 || true
echo "### gh-ci-runs (flake/fail rate) ###"
gh run list --repo "$REPO" --workflow=ci.yml --limit 50 \
  --json conclusion 2>&1 || true
```
### Handling multi-repo scope
A real engineering org doesn't fit in one repo. CI workflow templates, Terraform/OpenTofu modules, QA / E2E suites, runbooks and dashboards, and shared agent-context skill libraries frequently live in adjacent repos. Auditing only the primary repo under-scores items that depend on those external sources.
**If `docs/audits/CONFIG.md` exists, use its `## Adjacent repos` table as the seed list** — those repos are already approved to be in scope. Re-run the detection commands below only as **drift detection** to catch new adjacent repos that have been added since the last setup. Surface any new findings in the audit's *Adjacent repos consulted* section and recommend a re-run of `setup-agent-maturity-assessment` if the list has grown.
If CONFIG.md is missing, run the full detection from scratch:
**Detection — run these from the primary repo before scoring:**
```bash
# 1. External GitHub Actions referenced from this repo's workflows
grep -rhE "uses:\s*[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+" .github/workflows/ 2>/dev/null \
  | grep -oE "[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+(@[a-zA-Z0-9_.-]+)?" | sort -u
# 2. Terraform / OpenTofu modules sourced from external Git
grep -rhE "source\s*=\s*\".*\"" infra/ terraform/ 2>/dev/null \
  | grep -E "git::|github\.com/" | sort -u
# 3. Submodules
git submodule status 2>/dev/null
# 4. Generic cross-repo references in docs and scripts
grep -rEh "github\.com/[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+" \
  docs/ scripts/ .github/ README.md 2>/dev/null \
  | grep -oE "github\.com/[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+" | sort -u
```
**For each adjacent repo discovered:**
- Score the relevant criterion *across both repos*. Examples: if reusable workflows live in `<org>/ci-templates`, item #2 (cadence) and item #9 (review) evidence comes from both. If Terraform modules live in `<org>/infra-modules`, item #11 (blast-radius) needs both.
- Use `gh repo view <org>/<repo>` and targeted `gh api`/`gh search` calls to inspect — don't clone unless necessary.
- If access is blocked (private repo, no permission), score against what's visible and flag in **Notes for re-audit**.
- List every adjacent repo consulted in the audit's **Adjacent repos consulted** section so a re-auditor can reproduce.
**Org-level criteria (#8 governance, #12 hiring) are inherently outside any one repo.** Look for them in the `<org>/.github` policy repo, internal handbook, IT/security docs. If you can't reach those, mark `n/a` with the reason.
## Audit output template
Always produce this exact structure. The per-criterion tables ARE the report — they should be readable in one pass, especially when comparing audits across multiple repos.
**Rules for filling out the score tables:**
- Fill in every row. Use `n/a` with a one-line reason if an item genuinely doesn't apply to the scope (then exclude that item from both numerator and max in the score math).
- The **Why this score** column is **one sentence, ≤ 25 words**. State the single most decisive piece of evidence — the thing that pushed the score up or down. No bullet lists, no multi-clause sentences stitched with semicolons, no "but also" hedging.
- If you have more to say, save it for **Top 3 fixes**, **Strengths to preserve**, or **Notes for re-audit**. The table is for the verdict, not the working.
- Score in the column as `0`, `0.5`, `1`, or `n/a` — nothing else.
```markdown
# Agent Maturity Assessment — <scope> — <YYYY-MM-DD>
## Summary
- Raw score: X / 12
- Weighted score: XX.X%
- Band: **<band name>** (<band % range>)
- Evidence tier: **<1: gh / 2: GitHub MCP / 3: git-only>** (see *Environment preflight*)
- One-line take: <single sentence>
### Maturity scale (where this audit lands)
| Band | % range | This audit |
|------|---------|:----------:|
| Excellent | 90%+ | |
| Healthy | 75–89% | |
| Functional but slow | 60–74% | |
| Significant dysfunction | 40–59% | |
| Triage | <40% | |
Mark the row this audit falls in with `◉` in the right column; leave the others blank. This makes relative position visible at a glance and survives copy-paste to Slack / a doc / a slide.
## Scores
### A. Engineering basics (weight 1.0×)
| # | Item | Score | Why this score |
|---|------|-------|----------------|
| 1 | Reproducible dev environments | 0/0.5/1 | <one sentence, ≤ 25 words> |
| 2 | Sub-day integration cadence with measured outcomes | 0/0.5/1 | <one sentence, ≤ 25 words> |
| 3 | Testability and agent inner loop | 0/0.5/1 | <one sentence, ≤ 25 words> |
| 4 | Observability before features | 0/0.5/1 | <one sentence, ≤ 25 words> |
Subtotal: X.X × 1.00 = X.X / 4.00
### B. Knowledge & context (weight 1.5×)
| # | Item | Score | Why this score |
|---|------|-------|----------------|
| 5 | Design discipline as a practice | 0/0.5/1 | <one sentence, ≤ 25 words> |
| 6 | Codebase composed of deep modules | 0/0.5/1 | <one sentence, ≤ 25 words> |
| 7 | Repo-local agent context | 0/0.5/1 | <one sentence, ≤ 25 words> |
Subtotal: X.X × 1.50 = X.X / 4.50
### C. AI governance & quality (weight 1.25×)
| # | Item | Score | Why this score |
|---|------|-------|----------------|
| 8 | Sanctioned, governed AI tooling | 0/0.5/1 | <one sentence, ≤ 25 words> |
| 9 | Human review on every PR | 0/0.5/1 | <one sentence, ≤ 25 words> |
| 10 | Evals for AI-touched code paths | 0/0.5/1 | <one sentence, ≤ 25 words> |
| 11 | Blast-radius controls for agents | 0/0.5/1 | <one sentence, ≤ 25 words> |
Subtotal: X.X × 1.25 = X.X / 5.00
### D. Hiring (weight 1.0×)
| # | Item | Score | Why this score |
|---|------|-------|----------------|
| 12 | Judgment under AI augmentation | 0/0.5/1 | <one sentence, ≤ 25 words> |
Subtotal: X.X × 1.00 = X.X / 1.00
## Top 3 fixes (highest leverage)
1. **<item>** — why this one, what good looks like, suggested owner.
2. **<item>** — …
3. **<item>** — …
## Strengths to preserve
- <thing the team is doing right that shouldn't get broken during change>
- <ditto>
## Adjacent repos consulted
- `<org>/<repo>` — <one-line: why it was relevant, e.g., "Reusable workflow `org/ci-templates/.github/workflows/deploy.yml` referenced by this repo's deploy.yml">
- `<org>/<repo>` — …
(If none: write "None — all evidence within scope repo.")
## Notes for re-audit
- <calibration notes, things to recheck next quarter>
```
**Worked example of a "Why this score" cell** (do not include this in actual audits):

| Quality | Cell content |
|---------|--------------|
| Too long | `pnpm -r test resolves to nothing — no package implements test. ci.yml line 80: dotnet test \|\| true with comment 'no real tests yet'. Zero test files anywhere. Architecture is testable in principle but the inner loop runs nothing.` |
| Too vague | `No tests exist.` |
| Right size | `CI runs dotnet test \|\| true, no test files exist anywhere, and the architecture's seams sit unused.` |
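
If a mechanical check helps, a throwaway lint for the ≤ 25-word rule over a finished audit file might look like this (assumes the table layout above; `audit.md` is a placeholder):

```bash
# Flag "Why this score" cells over 25 words: match rows whose first cell is an item number,
# take the last non-empty column, and count whitespace-separated words.
awk -F'|' '/^\| *[0-9]+ *\|/ {
  cell = $(NF - 1)
  gsub(/^[ \t]+|[ \t]+$/, "", cell)
  n = split(cell, w, /[ \t]+/)
  if (n > 25) printf "item %s: %d words (trim it)\n", $2, n
}' audit.md
```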
## Operating principles
- **Score conservatively.** Better to score 0.5 and revise up than to over-score on day one and have to explain why everything got "worse".
- **Evidence beats assertions.** A team that says they have ADRs but the last one was committed two years ago scores 0.5, not 1.0.
- **Don't average heterogeneous teams.** Score them separately and report side-by-side.
- **Use it as a conversation tool, not a club.** The point is to find leverage, not to grade people.
- **Re-score quarterly.** Movement matters more than absolute level.
- **Calibrate against itself, not against other companies.** The first audit is the baseline; trends are the signal.
## Adapting the assessment
As organizations mature and the AI tooling landscape shifts, expect items to be added, dropped, or re-weighted. Track changes to the assessment itself (not just individual audits) in an `audits/CHANGELOG.md` so historical scores remain interpretable.
**Rationale:** https://claude.ai/public/artifacts/f0d23a7a-fac1-4468-964f-127e36eb4107
**Skill description:**
Run the Agent Maturity Assessment — a 12-criterion diagnostic for engineering organization readiness in the AI-agentic coding era. Items score 0/0.5/1 across four weighted categories (engineering basics 1.0×, knowledge & context 1.5×, AI governance & quality 1.25×, hiring 1.0×), producing a weighted percentage and a raw /12. Use whenever the user wants to audit, diagnose, or score an engineering organization, team, repo, or recently acquired company for AI readiness. Trigger on phrases like "agent maturity", "agent readiness", "AI maturity", "engineering org health", "engineering maturity", "audit the team", "score this repo", "diagnose dev experience", "is this team ready for AI", "is this team modern", "how healthy is this org", or any onboarding-era assessment — even when the user doesn't say "skill". This owns the criteria, weights, scoring bands, and audit output format. Produces a scored audit with item-level evidence, category subtotals, weighted overall score, top fixes, and strengths to preserve.