# Skill Eval Benchmark Results

- Date: 2026-04-02
- PR: https://github.com/fetch-rewards/labs-skills/pull/16
- Skills: summarizing-specs, gathering-coding-context
## Summary

| Skill | Pass Rate | Baseline | Lift | Token Delta |
| --- | --- | --- | --- | --- |
| summarizing-specs | 100% | 62.5% | +37.5% | +4% |
| gathering-coding-context | 100% | 70.9% | +29.1% | -8% (fewer tokens) |
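The Lift and Token Delta columns are simple derived quantities: lift is the absolute pass-rate difference in percentage points, and token delta is the relative change in token usage versus the baseline run. A minimal sketch of that arithmetic (function names are illustrative, not from the eval harness):

```python
def lift(with_rate: float, baseline_rate: float) -> float:
    """Absolute lift in percentage points: with-skill pass rate minus baseline."""
    return with_rate - baseline_rate


def token_delta(with_tokens: int, baseline_tokens: int) -> float:
    """Relative token-usage change vs. baseline, as a percentage.

    Negative means the with-skill run used fewer tokens.
    """
    return 100.0 * (with_tokens - baseline_tokens) / baseline_tokens
```

For example, `lift(100.0, 62.5)` reproduces the +37.5 point lift reported for summarizing-specs.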
Results were independently verified by three LLM-as-judge grader agents cross-referencing outputs against the source specs.
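The report does not state how the three graders' verdicts are combined per assertion; a common scheme is majority vote, sketched here purely as an assumption (the `judge_verdict` name is hypothetical, not from the harness):

```python
def judge_verdict(grader_verdicts: list[bool]) -> bool:
    # Assumed aggregation: an assertion counts as verified when a
    # majority of the three independent grader agents agree it holds.
    if len(grader_verdicts) != 3:
        raise ValueError("expected exactly three grader verdicts")
    return sum(grader_verdicts) >= 2
```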
## summarizing-specs

| Test | With Skill | Without Skill | Lift |
| --- | --- | --- | --- |
| Full spec (constitution, 265 lines) | 8/8 (100%) | 5/8 (62.5%) | +37.5% |
| Jira ticket (hook-enforcement, 400 lines) | 4/4 (100%) | 3/4 (75%) | +25% |
| Minimal input (3 lines) | 4/4 (100%) | 2/4 (50%) | +50% |
### Quality Assertions (26 total, 5 dimensions)

| Dimension | Count | What It Tests |
| --- | --- | --- |
| Structural | 6 | Required sections exist |
| Content depth | 6 | Rationale present, FR coverage, open questions surfaced |
| Faithfulness | 6 | Claims trace to source per section; no invented terminology |
| Conciseness | 4 | 5-15% length target; no verbatim copying |
| Completeness | 4 | AC consolidation noted, MUST requirements covered, no omitted questions |
Skill-specific behaviors not seen in baseline runs:

- Section references [S#] for traceability (the baseline never does this)
- "Not specified -- clarify" markers on sparse input, with specific clarifying questions (the baseline invents content instead)
- Checkbox format for acceptance criteria (the baseline uses prose)
- Risks section always present (the baseline often omits it)
- All open questions surfaced (the baseline selectively omits them)
## gathering-coding-context

| Test | With Skill | Without Skill | Lift |
| --- | --- | --- | --- |
| Add testing skill (task-specific) | 6/6 (100%) | 4/6 (66.7%) | +33.3% |
| Onboarding (vague prompt) | 4/4 (100%) | 3/4 (75%) | +25% |
### Quality Assertions (20 total, 5 dimensions)

| Dimension | Count | What It Tests |
| --- | --- | --- |
| Process | 6 | Checklist steps executed, with contents visibly reported |
| Depth | 4 | PR relevance, discrepancies, actual contents |
| Accuracy | 4 | Claims cross-referenced against actual repo state |
| Actionability | 4 | Next steps, flagged conflicts, target files |
| Anti-patterns | 4 | No answers from memory, no single-turn summaries, no unverified claims |
Skill-specific behaviors not seen in baseline runs:

- Always checks open PRs (the baseline never does this)
- Step-by-step visible output (the baseline gives a one-shot summary)
- Catches stale docs (e.g., a skill-count mismatch across files)
- Maps change dependencies and blast radius
- Uses fewer tokens (-8%) by following the structured checklist
## Methodology

- Skill-creator eval workflow: with-skill vs. without-skill baseline
- 5 test scenarios (3 summarization + 2 context-gathering)
- 10 total runs (5 with-skill + 5 baseline)
- 46 quality assertions across 10 dimensions
- Independent LLM-as-judge grading (3 grader agents)
- Cross-model verified: Claude Opus 4.6 + Codex GPT-5.4
- Reproducibility: pass@3 verified for both skills
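Under the strict reading of pass@3 assumed here, a test counts as reproducible only if it passes in all three independent runs, and a scenario's pass rate is the fraction of its tests clearing that bar. A sketch of both checks (names are illustrative, not from the harness):

```python
def passes_at_3(run_results: list[bool]) -> bool:
    # Strict pass@3 (assumed interpretation): the same test must pass
    # in all three independent runs.
    if len(run_results) != 3:
        raise ValueError("expected exactly three runs")
    return all(run_results)


def pass_rate(test_outcomes: list[bool]) -> float:
    # Fraction of tests passing, as a percentage (e.g., 8 of 8 -> 100.0).
    return 100.0 * sum(test_outcomes) / len(test_outcomes)
```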