@lissahyacinth
Created March 2, 2026 14:28
Audit Skill for Claude Code
---
name: audit
description: Multi-phase codebase consistency audit. Scans for signals (TODOs, ignored tests, rot), builds shadow documentation of Claude's understanding per-file, then cross-references everything to surface inconsistencies and questions for the human.
argument-hint: "[scan|shadow|crossref|all|reset|fix|answer N]"
allowed-tools: Read, Grep, Glob, Bash, Agent, Write, Edit, AskUserQuestion
---

Codebase Consistency Audit

You maintain a set of audit documents in .claude/audit/ that represent YOUR understanding of this codebase. These documents are yours — the human can read them, but you own them. Never edit the user's source files during an audit.

Run the phase specified by $ARGUMENTS. Default to all if no argument given.

  • scan — Phase 1 only
  • shadow — Phase 2 only (runs Phase 1 first if signals.md doesn't exist)
  • crossref — Phase 3 only (runs Phase 2 first if shadow docs don't exist)
  • all — All three phases sequentially
  • answer — Present the top 1-3 unanswered questions interactively
  • answer N — Bring up question Q-{N} specifically
  • fix — Re-apply any unapplied auto-fixes from Tier 1
  • reset — Delete all files in .claude/audit/ except acknowledged.md (that's the human's), then recreate empty directories. Confirms to the user what was removed.

Directory structure

.claude/audit/
  signals.md              # Phase 1: Raw signals found in codebase
  shadow/                 # Phase 2: One doc per source file
    <flattened-path>.md   # e.g. src--main.rs.md, src--lib--config.rs.md
  domains/                # Domain sub-skills: understanding of external systems
    _index.md             # Maps detection patterns → domain files
    <domain-name>.md      # Per-domain understanding (DB schema, API contracts, etc.)
  findings.md             # Phase 3: Cross-reference results
  questions.md            # Phase 3: Questions needing human domain knowledge
  acknowledged.md         # Human-maintained — never modify this file

Create .claude/audit/ and subdirectories if they don't exist.
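A minimal bootstrap for the tree above (paths exactly as listed; `-p` makes it safe to repeat):

```shell
# Create the audit directory tree; idempotent thanks to -p.
mkdir -p .claude/audit/shadow .claude/audit/domains
```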


Phase 1: Scan

Fast parallel sweep for signals — markers in the code that indicate known debt, deferred work, or potential rot. This phase uses Grep extensively and should be quick.

Signal categories to scan for

Deferred work:

  • TODO, FIXME, HACK, XXX, WORKAROUND, TEMPORARY, TECH_DEBT
  • Language-specific: #[ignore], #[allow(...)] (Rust), @Ignore/@Disabled (Java/Kotlin), .skip/xit/xdescribe (JS), @pytest.mark.skip (Python), #if false/#if 0 (C/C++)
  • Commented-out code blocks (3+ consecutive commented lines that look like code, not prose)
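The deferred-work sweep can be a single Grep pass. A sketch — the marker list is abbreviated, and the sample file exists only to make the snippet self-contained (in a real scan the path would be the project root):

```shell
# Sample input standing in for a real source tree.
mkdir -p demo-src
cat > demo-src/foo.rs <<'EOF'
// TODO: handle timeout case
#[ignore]
fn slow_test() {}
EOF

# One recursive pass; -n yields the line numbers the signals table needs.
grep -rnE 'TODO|FIXME|HACK|XXX|WORKAROUND|#\[ignore\]' demo-src
```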

Staleness indicators:

  • References to file paths in comments or docs — note these for Phase 3 verification
  • Version numbers in comments (e.g. "as of v2.3", "since 1.0")
  • Date references in comments (e.g. "added 2024-01-01", "temporary until Q2")
  • Dead links in markdown files

External system boundaries:

  • Detect imports/usage of database clients, HTTP clients, message queues, cloud SDKs, API clients, ORMs, schema definitions
  • For each detected boundary, record the domain (e.g. "postgresql", "redis", "stripe-api") and which files touch it
  • Check .claude/audit/domains/_index.md — if this domain already has a sub-skill, note the association. If not, this is a new domain that needs a stub created.

Convention breaks:

  • Files without any tests when sibling files have tests
  • Mixed conventions (e.g. some files use one error pattern, others use another — just note the signal, don't judge)

Output

Write .claude/audit/signals.md. Format:

# Signals

Last scan: {date}
Files scanned: {count}

## Deferred Work
| Signal | File | Line | Text |
|--------|------|------|------|
| TODO   | src/foo.rs | 42 | // TODO: handle timeout case |

## Staleness Indicators
| Type | File | Line | Text |
|------|------|------|------|
| path-ref | README.md | 15 | See `src/old_module.rs` for details |

## External System Boundaries
| Domain | Domain File | Files | Status |
|--------|-------------|-------|--------|
| postgresql | postgresql.md | src/db/users.py, src/db/orders.py | known |
| stripe-api | (none) | src/billing/charge.py | NEW — stub needed |

## Convention Observations
Brief notes on patterns observed — what the "usual" convention seems to be and where it deviates. Keep this short. This is not a style guide — it's noting what you see so Phase 3 can use it.

Domain stub creation

For each NEW domain detected, create a stub at .claude/audit/domains/<domain-name>.md:

# Domain: {domain-name}

Status: STUB — needs human input
Detected in: {list of files}
Detected via: {import patterns or API usage that triggered detection}

## What we think this is
{Brief description based on what the code appears to do with this system — e.g. "The code creates and deletes Kubernetes Node objects via the kube client library"}

## What we don't know
{List specific questions — not generic ones. Based on what the code assumes:}
- "The code assumes field `spec.nodeID` is a DNS-1123 subdomain — is this enforced by the external system or only by our validation?"
- "The code retries on 409 Conflict — is this the correct behavior for this API?"

## Schema / Contract
(To be filled in by human or by reading external documentation)

Also update .claude/audit/domains/_index.md to include the new domain mapping.

The _index.md format:

# Domain Index

Maps detection patterns to domain sub-skill files.

| Pattern | Domain File | Description |
|---------|-------------|-------------|
| `psycopg`, `sqlalchemy`, `SELECT`, `INSERT` | postgresql.md | PostgreSQL database interactions |
| `stripe.`, `stripe_client` | stripe-api.md | Stripe payment API |

Phase 2: Shadow Documentation

Build or update Claude's understanding of each significant source file. Use subagents to analyze files in parallel.

Selecting files to shadow

  1. Glob for all source files (detect language from project config files)
  2. Exclude: generated files, lock files, vendored dependencies, build output
  3. Prioritize: files referenced in signals.md, files with doc comments, files that are entry points or define public APIs
  4. Skip files under ~20 lines unless they contain signals

Shadow doc format

For each file, create .claude/audit/shadow/<flattened-path>.md where the path uses -- as separator (e.g. src/controller/mod.rs becomes src--controller--mod.rs.md).
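The flattening is a mechanical substitution; a sketch (the helper name is illustrative):

```shell
# Sketch: map a source path to its shadow-doc filename.
flatten() {
  printf '%s.md\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

flatten src/controller/mod.rs   # src--controller--mod.rs.md
```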

The unit of analysis is the block, not the file. A block is a function, method, class, type definition, test case, or any named code unit. The file is just the namespace that contains blocks.

Each shadow doc is produced by a subagent that reads the source file.

The subagent should write:

# Shadow: {original path}

Last updated: {date}
Git checkpoint: {short commit hash}

## Imports
What this file imports from other files in this project (not external packages). List the specific names.

## Blocks

### `function_name(args)` — {type} (lines N-M)
**Comments say:** {exact quotes from doc comments or inline comments on this block}
**Code does:** {brief factual description of what the code actually does — inputs, outputs, branches}
**Internal deps:** {other blocks in this project this block calls or references}
**Signals:** {any TODOs, ignores, or other signals within this block}

### `ClassName` — {type} (lines N-M)
...

### `test_case_name` — test (lines N-M)
**Tests:** {what expressions/inputs are tested and what outputs are asserted}
**Coverage:** {what the test exercises — and notably, what it doesn't}
...

Key principles:

  • "Comments say" vs "Code does" are always separate. The whole point is finding where they disagree.
  • Test blocks record what they actually test, not what they claim to test. Quote the actual assertions.
  • Skip trivial blocks (simple re-exports, empty __init__, one-line accessors) unless they contain signals.

Domain-aware shadowing

Before spawning subagents, check .claude/audit/domains/_index.md and signals.md (External System Boundaries table). For each file being shadowed:

  1. Check if it touches any known domain (via the index patterns)
  2. If yes, read the domain sub-skill file and pass its content to the subagent as context
  3. For blocks that interact with an external system, add a Domain field noting what the block assumes about the system and whether it matches the domain doc

This is how domain knowledge propagates: the human seeds a domain doc once, and every block that touches that domain gets checked against it.

Example of a domain-aware block entry:

### `create_node(offering, config)` — function (lines 45-80)
**Comments say:** "Provisions a node via the Hetzner API and waits for it to join"
**Code does:** POSTs to /servers, polls status until "running", then waits for K8s node to appear
**Domain (hetzner-api):**
  Assumes: POST /servers returns a server object with `server.id`
  Assumes: polling the server's status eventually reports "running"
  Matches domain doc: Yes / No / Partially / Domain doc is a stub

Subagent prompt template

When spawning subagents for shadow docs, use this prompt structure:

Read {file_path} completely. You are creating a shadow document — Claude's own understanding of this file, stored separately.

The unit of analysis is the BLOCK (function, class, type, test case), not the file. For each non-trivial block, extract:

  1. What the comments/docs SAY it does (quote them exactly)
  2. What the code ACTUALLY does (brief factual description)
  3. What other blocks in this project it calls or depends on
  4. Any signals: TODOs, ignored tests, commented-out code
  5. If it interacts with an external system, what does it assume about that system?

Separate "comments say" from "code does" — they are recorded independently because we are looking for disagreements between them.

For test blocks specifically: record what inputs are tested and what assertions are made. Note what IS tested and what is NOT (e.g. "tests Cond with literal values only — no tests with variable references in branches").

Write the shadow doc to {shadow_path}.

If domain context is available, append to the prompt:

This file interacts with the following external system(s). Here is our current understanding — check whether each block's assumptions match: {domain doc content}

Incremental updates via git

Shadow docs track the git commit hash they were built from. On each run:

  1. Run git log -1 --format=%H to get the current HEAD commit
  2. For each existing shadow doc, compare its recorded commit against HEAD
  3. Run git diff {shadow_commit}..HEAD -- {file_path} to check if the source file changed
  4. Skip files with no diff. Regenerate shadow docs for changed files.
  5. New shadow docs record the current HEAD as their checkpoint.
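Steps 2–4 reduce to a small predicate; a sketch, assuming a git work tree (the function name is illustrative):

```shell
# Sketch: decide whether a shadow doc needs regenerating.
# $1 = checkpoint commit recorded in the shadow doc, $2 = source file path.
needs_regen() {
  # True (exit 0) when the file changed between the checkpoint and HEAD.
  ! git diff --quiet "$1"..HEAD -- "$2"
}
```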

Shadow doc header becomes:

# Shadow: {original path}

Last updated: {date}
Git checkpoint: {commit hash (short)}

This is faster than hashing file contents and gives you a meaningful reference point — you can see exactly when Claude last looked at a file relative to the project history. It also means git log {checkpoint}..HEAD -- {path} shows you every change Claude hasn't processed yet.


Phase 3: Cross-reference

Compare everything against everything. This is where inconsistencies surface.

Checks to perform

Shadow vs Shadow:

  • File A's dependency list says it imports X from File B. Does File B actually export X?
  • File A's claim says "must run after init_pool()". Does any caller actually ensure this ordering?
  • Do two files that implement similar patterns (found via Convention Observations) actually agree on the pattern?

Shadow vs User Docs:

  • Read all user-maintained documentation (README, CLAUDE.md, ADRs, RFCs, etc. — discover these dynamically via Glob for *.md at the project root and common doc directories)
  • For each claim in user docs, check if any shadow doc contradicts it
  • For each architectural description, check if the shadow docs collectively tell a different story

Shadow vs Domains:

  • For each shadow doc whose blocks carry Domain entries, compare their assumptions against the domain sub-skill
  • If the domain doc says the API returns field X but the code expects field Y, that's a Contradiction
  • If the domain doc is a STUB, generate a Question asking the human to verify the code's assumptions
  • If multiple files interact with the same domain but assume different things about it (e.g. one file expects a 201, another expects a 200), that's a shadow-vs-shadow finding mediated by domain knowledge

Signals vs Shadow:

  • For each TODO/FIXME: does the shadow doc's "Code does" description suggest the TODO is still relevant, or has the surrounding code changed enough that the TODO might be stale?
  • For each ignored/skipped test: does the shadow doc explain why? If not, flag it.
  • For each file path reference found in Phase 1: does the referenced file actually exist?

ACKNOWLEDGED.md filter: Read .claude/audit/acknowledged.md (if it exists). Any finding ID listed there is suppressed — do not include it in findings.md. Do mention the count of suppressed findings in the summary.
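The filter itself is a one-line membership check; a sketch (the helper name is illustrative):

```shell
# Sketch: true when a finding ID appears in the human's acknowledged list.
# -w keeps F-003 from matching F-0031.
is_suppressed() {
  grep -qw "$1" .claude/audit/acknowledged.md 2>/dev/null
}
```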

Output — Resolution tiers

Every finding must be classified into one of three resolution tiers. This determines how it's presented and what happens next.

Tier 1: Auto-fix

The correct resolution is unambiguous — there's only one reasonable fix and it's mechanical.

Criteria:

  • One artifact is clearly stale/wrong, the other is clearly current
  • The fix is a simple text replacement, not a behavioral change
  • No domain knowledge is needed to determine which side is correct

Examples:

  • Doc comment references a file path that was renamed/moved
  • README says the config key is db_host → code reads database_url
  • Comment says "returns a list" → function returns a generator
  • Stale comment about conditional compilation when the module is always compiled

The finding includes the proposed diff — the exact before/after text. This is both the documentation of the finding and the spec for applying it.

### [F-{NNN}] {Short title}
**Resolution:** Auto-fix
**Source:** {file:line} says: "{quoted text}"
**Target:** {file:line} says/does: "{quoted text}"
**Proposed fix:**
In `{file}:{line}`:
- {old text}
+ {new text}

Auto-fixes are proposed, not applied during Phase 3. The summary presents them as a batch. The human can:

  • Approve all → Claude applies the batch
  • Reject individual fixes → those get promoted to Suggest
  • Reject all → nothing changes, findings stay in findings.md for later

/audit fix re-presents any unapplied auto-fix proposals.

Tier 2: Suggest

The contradiction is clear, but there are 2+ valid resolutions. Claude presents the options; the human picks one.

Criteria:

  • Both artifacts could be "the right one" depending on intent
  • The fix requires choosing between alternatives
  • A reasonable developer could go either way

Examples:

  • Comment says "4 pods" but code creates 3 → update the comment, or add a 4th pod?
  • delete error maps to CreationFailed → rename the variant, or add a DeletionFailed variant?
  • active_count is defined but never written → remove it, or implement the writes?

### [F-{NNN}] {Short title}
**Resolution:** Suggest
**Source:** {file:line} says: "{quoted text}"
**Target:** {file:line} says/does: "{quoted text}"
**Options:**
  A. {first resolution — describe the change}
  B. {second resolution — describe the change}

Present Suggest findings via AskUserQuestion with the options. Record the human's choice.

Tier 3: Raise

Genuinely needs domain knowledge. Can't determine the correct resolution from the codebase alone.

Criteria:

  • The inconsistency might be intentional (a design decision)
  • Resolution depends on planned features, external system behavior, or business logic
  • Claude cannot determine which side is "correct" without context the codebase doesn't contain

Examples:

  • Is this enum variant intentionally absent from the API schema, or was it missed?
  • Is this type planned-but-unimplemented, or dead code?
  • Does this config struct reflect intended future work or abandoned design?

### [F-{NNN}] {Short title}
**Resolution:** Raise
**Source:** {file:line} says: "{quoted text}"
**Target:** {file:line} says/does: "{quoted text}"
**Context:** {Why this can't be resolved from the codebase alone}
**Question:** {Specific question requiring domain knowledge}

Present Raise findings via AskUserQuestion. Record the human's answer.

Tier assignment rules

  • If in doubt between Auto-fix and Suggest, choose Suggest. Don't auto-fix anything where the "correct" side is ambiguous.
  • If in doubt between Suggest and Raise, choose Suggest and include "keep both as-is (this is intentional)" as an option.
  • Never auto-fix behavioral code changes. Auto-fix is for comments, docs, and references only.

findings.md format:

# Audit Findings

Last audit: {date}
Total: {n} findings ({a} auto-fix, {s} suggest, {r} raise, {k} suppressed)

## Auto-fix
{list of Tier 1 findings}

## Suggest
{list of Tier 2 findings}

## Raise
{list of Tier 3 findings}

questions.md:

Only Tier 3 (Raise) findings go here. Preserve any that already have an answer recorded.

### [Q-{NNN}] {Question title}
**Found in:** {file:line}
**Context:** {What you found and why it's ambiguous — quote both sides}
**Question:** {Specific question framed to elicit domain knowledge}
**Status:** Open | Answered
**Answer:** {recorded when human responds via `/audit answer`}

Numbering

Continue numbering from the highest existing F-NNN or Q-NNN in the audit files. Don't restart from 1.
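One way to sketch the continuation — note the base-10 guard, since a zero-padded `012` would otherwise parse as octal in shell arithmetic:

```shell
# Sketch: next finding ID, continuing from the highest existing F-NNN.
next_finding_id() {
  last=$(grep -hoE 'F-[0-9]+' .claude/audit/findings.md 2>/dev/null \
    | sort -t- -k2 -n | tail -1 | cut -d- -f2)
  printf 'F-%03d\n' $(( 10#${last:-0} + 1 ))
}
```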


Human interaction flow

After Phase 3 (end of crossref or all)

Present findings by tier, then act on them:

1. Auto-fix batch. List all Tier 1 fixes as a summary table, then apply them all. The user's normal tool-approval flow handles consent — each edit goes through their permission settings. If a fix is rejected, move it to Suggest tier.

2. Suggest choices. For each Tier 2 finding, use AskUserQuestion to present the options. Apply the chosen resolution immediately. Record the choice in findings.md.

3. Raise questions. For the top 1-3 Tier 3 findings, use AskUserQuestion to present them. Record answers in questions.md. Remaining Raise findings stay open for /audit answer.

/audit answer (no number)

Read questions.md, find all questions with Status: Open. Present the top 1-3 via AskUserQuestion.

/audit answer N

Read questions.md, find Q-{N} specifically. Present it with full context via AskUserQuestion. If Q-{N} is already answered, show the prior answer and ask if they want to revise it.

/audit fix

Re-run just the auto-fix tier — find any unfixed Tier 1 findings and apply them. Useful after a reset or when new auto-fixes were found by a fresh scan.


Summary

Phases 1 and 2 are internal work. Keep their output minimal:

  • Scan: "Scan complete. {N} files, {M} signals, {K} domains detected."
  • Shadow: "Shadow docs updated. {N} files documented."

Phase 3 flow:

  1. Print a one-line summary: "{A} auto-fixes, {S} suggestions, {R} questions raised"
  2. Apply auto-fixes (user approves via normal tool permissions)
  3. Present suggestions as choices
  4. Ask the top questions
  5. Record everything back to findings.md and questions.md

The human should never need to manually edit any audit file.


Rules

  • During the audit phases, write only to .claude/audit/ — never edit user source files while analyzing. Approved fixes from the interaction flow are the only edits that touch user files.
  • Surface inconsistencies with evidence and propose resolutions through the tiers; the human decides what's correct.
  • Never modify acknowledged.md. That file belongs to the human.

The Two-Artifact Rule

Every finding and every question MUST cite two specific artifacts that disagree. An "artifact" is a concrete, quotable thing: a line of code, a doc comment, a README section, a test case, a config entry.

  • Valid finding: "Line 30 of interp.py passes env to interp(). Line 47 does not. These are both recursive calls in the same function."
  • Valid question: "The comment on line 52 says 'we might not ever use the value in conditionals.' The Cond branch on line 47 evaluates both arms before selecting — is the comment referring to this behavior?"
  • Invalid finding: "Lambdas don't capture closures." (Nothing in the codebase claims they should. This is projecting expectations from other languages.)
  • Invalid question: "Is lazy evaluation planned?" (Speculative. The audit surfaces what IS, not what COULD BE.)

If you cannot point to two artifacts that disagree, you do not have a finding. Your general knowledge of how languages/frameworks/databases "usually" work is NOT an artifact. Only things written in THIS codebase count.

Evidence, not assertion

When a finding references behavior (e.g. "the test suite only passes because tests use literals"), you must quote the evidence. Show the actual test cases. Show the actual lines. If you can't produce the quote, downgrade the claim or drop it.

Contradictions vs Questions

A finding is a Contradiction when both sides are unambiguous and they factually conflict. "Every other branch passes env, this one doesn't" is a Contradiction — there's no interpretation where both are correct.

A finding is a Question only when both sides could be intentionally different and you need domain knowledge to know which. Frame questions to help the human document their intent, not to interrogate their decisions. "Is this the behavior the comment on line 52 refers to?" is good. "Should you fix this?" is not your job.

What is NOT a finding

  • Expectations from other languages, frameworks, or "best practices" that nothing in this codebase claims to follow
  • Speculation about future plans or intended features
  • Style preferences or "this could be better" observations
  • Things a linter would catch (unused imports, empty files, placeholder text) — note these in signals.md for completeness but do NOT elevate them to findings or questions