# Skill Eval Benchmark Results

- Date: 2026-04-02
- PR: https://github.com/fetch-rewards/labs-skills/pull/16
- Skills: summarizing-specs, gathering-coding-context
## Summary

| Skill | Pass Rate | Baseline | Lift | Token Delta |
| --- | --- | --- | --- | --- |
| summarizing-specs | 100% | 62.5% | +37.5% | +4% |
| gathering-coding-context | 100% | 70.9% | +29.1% | -8% (fewer tokens) |
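The Lift and Token Delta columns are simple derived quantities: lift is the absolute pass-rate difference in percentage points, and token delta is the relative change in token usage versus the baseline run. A minimal sketch of that arithmetic (function names are illustrative, not from the eval harness):

```python
def lift(with_rate: float, baseline_rate: float) -> float:
    """Absolute lift in percentage points: with-skill pass rate minus baseline."""
    return with_rate - baseline_rate


def token_delta(with_tokens: int, baseline_tokens: int) -> float:
    """Relative token-usage change vs. baseline, as a percentage.

    Negative means the with-skill run used fewer tokens.
    """
    return 100.0 * (with_tokens - baseline_tokens) / baseline_tokens
```

For example, `lift(100.0, 62.5)` reproduces the +37.5 point lift reported for summarizing-specs.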
Results were independently verified by three LLM-as-judge grader agents cross-referencing outputs against the source specs.
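The report does not state how the three graders' verdicts are combined per assertion; a common scheme is majority vote, sketched here purely as an assumption (the `judge_verdict` name is hypothetical, not from the harness):

```python
def judge_verdict(grader_verdicts: list[bool]) -> bool:
    # Assumed aggregation: an assertion counts as verified when a
    # majority of the three independent grader agents agree it holds.
    if len(grader_verdicts) != 3:
        raise ValueError("expected exactly three grader verdicts")
    return sum(grader_verdicts) >= 2
```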
## summarizing-specs

| Test | With Skill | Without Skill | Lift |
| --- | --- | --- | --- |
| Full spec (constitution, 265 lines) | 8/8 (100%) | 5/8 (62.5%) | +37.5% |
| Jira ticket (hook-enforcement, 400 lines) | 4/4 (100%) | 3/4 (75%) | +25% |
| Minimal input (3 lines) | 4/4 (100%) | 2/4 (50%) | +50% |
### Quality Assertions (26 total, 5 dimensions)

| Dimension | Count | What It Tests |
| --- | --- | --- |
| Structural | 6 | Required sections exist |
| Content depth | 6 | Rationale present, FR coverage, open questions surfaced |
| Faithfulness | 6 | Claims trace to source per section; no invented terminology |
| Conciseness | 4 | 5-15% length target; no verbatim copying |
| Completeness | 4 | AC consolidation noted, MUST requirements covered, no omitted questions |
Skill-specific behaviors not seen in baseline runs:

- Section references [S#] for traceability (the baseline never does this)
- "Not specified -- clarify" markers on sparse input, with specific clarifying questions (the baseline invents content instead)
- Checkbox format for acceptance criteria (the baseline uses prose)
- Risks section always present (the baseline often omits it)
- All open questions surfaced (the baseline selectively omits them)
## gathering-coding-context

| Test | With Skill | Without Skill | Lift |
| --- | --- | --- | --- |
| Add testing skill (task-specific) | 6/6 (100%) | 4/6 (66.7%) | +33.3% |
| Onboarding (vague prompt) | 4/4 (100%) | 3/4 (75%) | +25% |
### Quality Assertions (20 total, 5 dimensions)

| Dimension | Count | What It Tests |
| --- | --- | --- |
| Process | 6 | Checklist steps executed, with contents visibly reported |
| Depth | 4 | PR relevance, discrepancies, actual contents |
| Accuracy | 4 | Claims cross-referenced against actual repo state |
| Actionability | 4 | Next steps, flagged conflicts, target files |
| Anti-patterns | 4 | No answers from memory, no single-turn summaries, no unverified claims |
Skill-specific behaviors not seen in baseline runs:

- Always checks open PRs (the baseline never does this)
- Step-by-step visible output (the baseline gives a one-shot summary)
- Catches stale docs (e.g., a skill-count mismatch across files)
- Maps change dependencies and blast radius
- Uses fewer tokens (-8%) by following the structured checklist
## Methodology

- Skill-creator eval workflow: with-skill vs. without-skill baseline
- 5 test scenarios (3 summarization + 2 context-gathering)
- 10 total runs (5 with-skill + 5 baseline)
- 46 quality assertions across 10 dimensions
- Independent LLM-as-judge grading (3 grader agents)
- Cross-model verified: Claude Opus 4.6 + Codex GPT-5.4
- Reproducibility: pass@3 verified for both skills
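Under the strict reading of pass@3 assumed here, a test counts as reproducible only if it passes in all three independent runs, and a scenario's pass rate is the fraction of its tests clearing that bar. A sketch of both checks (names are illustrative, not from the harness):

```python
def passes_at_3(run_results: list[bool]) -> bool:
    # Strict pass@3 (assumed interpretation): the same test must pass
    # in all three independent runs.
    if len(run_results) != 3:
        raise ValueError("expected exactly three runs")
    return all(run_results)


def pass_rate(test_outcomes: list[bool]) -> float:
    # Fraction of tests passing, as a percentage (e.g., 8 of 8 -> 100.0).
    return 100.0 * sum(test_outcomes) / len(test_outcomes)
```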