@connectwithprakash · Last active April 3, 2026 06:41
*Skill eval benchmark results for labs-skills PR #16 (summarizing-specs + gathering-coding-context)*

# Skill Eval Benchmark Results

- **Date:** 2026-04-02
- **PR:** https://github.com/fetch-rewards/labs-skills/pull/16
- **Skills:** summarizing-specs, gathering-coding-context

## Summary

| Skill | Pass Rate | Baseline | Lift | Token Delta |
|---|---|---|---|---|
| summarizing-specs | 100% | 62.5% | +37.5% | +4% |
| gathering-coding-context | 100% | 70.9% | +29.1% | -8% (fewer) |

Independently verified by 3 LLM-as-judge grader agents cross-referencing outputs against source specs.
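The Lift and Token Delta columns reduce to simple arithmetic over the paired runs. A minimal sketch of that reduction (the `SkillResult` structure and token counts are illustrative, not taken from the eval harness):

```python
# Sketch of how the Summary columns can be derived from paired runs.
# The dataclass and its fields are illustrative; the actual eval
# harness may structure its results differently.
from dataclasses import dataclass


@dataclass
class SkillResult:
    passed: int   # assertions passed in this condition
    total: int    # assertions run in this condition
    tokens: int   # tokens consumed in this condition


def pass_rate(r: SkillResult) -> float:
    return r.passed / r.total


def lift(with_skill: SkillResult, baseline: SkillResult) -> float:
    """Absolute lift in pass rate, in percentage points."""
    return (pass_rate(with_skill) - pass_rate(baseline)) * 100


def token_delta(with_skill: SkillResult, baseline: SkillResult) -> float:
    """Relative token change vs. baseline, as a percentage."""
    return (with_skill.tokens - baseline.tokens) / baseline.tokens * 100


# summarizing-specs totals from the per-eval table: 16/16 with-skill,
# 10/16 baseline. Token counts here are made up to show the formula.
with_s = SkillResult(passed=16, total=16, tokens=10_400)
base_s = SkillResult(passed=10, total=16, tokens=10_000)

print(lift(with_s, base_s))         # 37.5 (matches the table)
print(token_delta(with_s, base_s))  # 4.0
```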

## summarizing-specs

### Per-Eval

| Test | With Skill | Without Skill | Lift |
|---|---|---|---|
| Full spec (constitution, 265 lines) | 8/8 (100%) | 5/8 (62.5%) | +37.5% |
| Jira ticket (hook-enforcement, 400 lines) | 4/4 (100%) | 3/4 (75%) | +25% |
| Minimal input (3 lines) | 4/4 (100%) | 2/4 (50%) | +50% |

### Quality Assertions (26 total, 5 dimensions)

| Dimension | Count | What It Tests |
|---|---|---|
| Structural | 6 | Required sections exist |
| Content depth | 6 | Rationale present, functional-requirement (FR) coverage, open questions surfaced |
| Faithfulness | 6 | Claims trace to source per section, no invented terminology |
| Conciseness | 4 | Summary is 5-15% of source length, no verbatim copying |
| Completeness | 4 | Acceptance-criteria (AC) consolidation noted, MUST requirements covered, no omitted questions |
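The conciseness dimension asserts that the summary lands in a 5-15% band relative to the source. A check like that can be sketched as follows (measuring length in words is an assumption; the actual graders may count tokens, characters, or lines instead):

```python
# Sketch of a conciseness assertion: the summary should be 5-15% of
# the source's length. The word-based ratio is an assumption; the
# real graders may measure tokens, characters, or lines instead.
def conciseness_ok(source: str, summary: str,
                   lo: float = 0.05, hi: float = 0.15) -> bool:
    ratio = len(summary.split()) / max(len(source.split()), 1)
    return lo <= ratio <= hi


source = "word " * 265   # stand-in for a 265-line spec
summary = "word " * 26   # roughly 10% of the source

print(conciseness_ok(source, summary))  # True
```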

### With-skill wins on

- Section references `[S#]` for traceability (baseline never does this)
- "Not specified -- clarify" markers for sparse input, with specific clarifying questions (baseline invents content)
- Checkbox format for acceptance criteria (baseline uses prose)
- Risks section always present (baseline often omits it)
- All open questions surfaced (baseline selectively omits them)

## gathering-coding-context

### Per-Eval

| Test | With Skill | Without Skill | Lift |
|---|---|---|---|
| Add testing skill (task-specific) | 6/6 (100%) | 4/6 (66.7%) | +33.3% |
| Onboarding (vague prompt) | 4/4 (100%) | 3/4 (75%) | +25% |

### Quality Assertions (20 total, 5 dimensions)

| Dimension | Count | What It Tests |
|---|---|---|
| Process | 6 | Checklist steps followed, with contents visibly reported |
| Depth | 4 | PR relevance, discrepancies, actual file contents inspected |
| Accuracy | 4 | Claims cross-referenced against actual repo state |
| Actionability | 4 | Next steps given, conflicts flagged, target files identified |
| Anti-patterns | 4 | No answers from memory, no single-turn replies, no unverified claims |

### With-skill wins on

- Always checks open PRs (baseline *never* does this)
- Step-by-step visible output (baseline gives a one-shot summary)
- Catches stale docs (e.g., skill count mismatch across files)
- Maps change dependencies and blast radius
- Uses fewer tokens (-8%) by following the structured checklist
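The stale-docs catch boils down to cross-referencing the same fact across several files and flagging disagreement. A toy sketch of that idea, with invented file contents (the real check reads the repository's actual docs):

```python
# Toy sketch of the stale-docs catch: extract a "N skills" claim from
# each doc and flag files that disagree. The file contents below are
# invented for illustration; the real check reads the repo's docs.
import re

docs = {
    "README.md": "This repo currently ships 12 skills.",
    "CONTRIBUTING.md": "All 11 skills follow the same eval workflow.",
}

counts = {
    name: int(m.group(1))
    for name, text in docs.items()
    if (m := re.search(r"(\d+)\s+skills", text))
}

if len(set(counts.values())) > 1:
    print(f"Stale skill counts: {counts}")
```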

## Methodology

- Skill-creator eval workflow: with-skill vs. without-skill baseline
- 5 test scenarios (3 summarization + 2 context)
- 10 total runs (5 with-skill + 5 baseline)
- 46 quality assertions across 10 dimensions
- Independent LLM-as-judge grading (3 grader agents)
- Cross-model verified: Claude Opus 4.6 + Codex GPT-5.4
- Reproducibility: pass@3 verified for both skills
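The pass@3 claim above presumably means each skill passed in all three independent reruns. For reference, pass@k is conventionally estimated with the unbiased formula pass@k = 1 - C(n-c, k)/C(n, k), where n is the number of samples and c the number that passed; a sketch (this is the standard estimator, not necessarily the harness's exact code):

```python
# Unbiased pass@k estimator: the probability that at least one of k
# samples, drawn without replacement from n total samples of which c
# passed, is a passing sample.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Fewer than k failures exist, so any draw of k must
        # include at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=3, c=3, k=3))   # 1.0 -- all 3 reruns passed
print(pass_at_k(n=10, c=7, k=3))  # ~0.9917
```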