
Caveman Token Cost Analysis

Date: 2026-04-18

Analysis of caveman — a skill/plugin instructing AI coding agents to respond in compressed prose, dropping articles, filler, and hedging while preserving technical accuracy.

Summary

Caveman claims: "~65-75% fewer tokens," measured via an eval harness that counts visible output tokens.

This analysis finds:

  • 50% fewer visible output tokens (not 65-75%), vs "Answer concisely." baseline
  • Eval does not account for 896 input tokens/turn instruction payload
  • Thinking tokens (hidden reasoning) are not measured and may dwarf visible output; whether caveman affects them is untested, so visible output savings may be a small fraction of total token consumption
  • Per-prompt savings range from 0% to 88%; average of ~102 tokens/turn breaks even at ~9 turns
  • The real value may be cognitive, not economic: less text to read, and intensity levels the user can choose to match their own domain familiarity

Methods

Token types used in this analysis

Every LLM API call involves three categories of tokens (the sketch after this list shows where each appears in an API response):

  • Input tokens: everything sent to the model — your prompt, conversation history, and system instructions.
  • Output tokens (visible): the response text you see. Output token pricing varies by provider — Claude Opus 4 charges 5× more per output token than input (Anthropic pricing), GPT-5.4 charges 6× (OpenAI pricing).
  • Output tokens (thinking): hidden reasoning the model generates before producing the visible response. Priced the same as visible output tokens (Anthropic extended thinking docs: "You're charged for the full thinking tokens generated by the original request"). Not shown to the user.
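A minimal sketch of where these counts surface, using the Anthropic Python SDK's Messages API. The usage fields are per Anthropic's docs; with extended thinking enabled, thinking tokens are billed inside output_tokens rather than reported separately. The model name is taken from the eval run described below.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

msg = client.messages.create(
    model="claude-opus-4-6",  # model name from the eval run below
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain database connection pooling."}],
)

print(msg.usage.input_tokens)   # input: prompt + history + system instructions
print(msg.usage.output_tokens)  # output: visible text, plus thinking if enabled
```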

Source of the numbers

This analysis draws on three activities:

  1. Reading caveman's eval code and results: caveman's eval harness (evals/llm_run.py) sends 10 questions to Claude with varied system prompts, recording visible responses to evals/snapshots/results.json. Token counts derived by tokenizing those responses with tiktoken (o200k_base encoding).

  2. Measuring input token costs: input token counts measured with tiktoken (o200k_base); a sketch follows this list. Approximation: Claude uses a different tokenizer. Ratios between configurations should hold regardless.

  3. Thinking tokens: not directly measured. Would require API response metadata.
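The tiktoken measurement used throughout reduces to a few lines. A minimal sketch; o200k_base is an OpenAI encoding, so counts for Claude text are approximate, as noted:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def count_tokens(text: str) -> int:
    """Approximate token count. Claude uses a different tokenizer,
    but ratios between configurations should hold."""
    return len(enc.encode(text))

count_tokens("Answer concisely.")  # 3, matching the terse prompt's measured size
```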


Analysis

The caveman claim

Caveman's README: "~65-75% fewer tokens."

What caveman's evaluation measures

Caveman includes a built-in evaluation harness that sends 10 questions to Claude with three different system prompts:

| Configuration | System prompt |
| --- | --- |
| baseline | (empty; no instructions) |
| terse | Answer concisely. (3 tokens) |
| caveman | Answer concisely. + full caveman SKILL.md (896 tokens, tiktoken measurement) |

Caveman's eval captures visible text output, tokenized with tiktoken. The honest comparison is caveman vs terse — comparing against baseline would conflate caveman's compression with generic brevity. Input tokens and thinking/reasoning tokens are NOT captured.
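The arm structure is simple to picture. An illustrative sketch, not the harness's actual code: ask_claude and QUESTIONS are hypothetical stand-ins for the plumbing in evals/llm_run.py, count_tokens is from the Methods sketch above, and the SKILL.md path is assumed.

```python
from pathlib import Path

SKILL_MD = Path("SKILL.md").read_text()  # hypothetical path to the caveman rules

CONFIGS = {
    "baseline": "",                                 # no instructions
    "terse": "Answer concisely.",                   # 3 tokens
    "caveman": "Answer concisely.\n\n" + SKILL_MD,  # 896 tokens of rules
}

# One visible-output token count per (configuration, question) pair
results = {
    name: [count_tokens(ask_claude(system, q)) for q in QUESTIONS]
    for name, system in CONFIGS.items()
}
```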

The numbers

From caveman's eval harness (claude-opus-4-6, 10 prompts, 2026-04-08):

| Configuration | Visible output tokens | vs terse |
| --- | --- | --- |
| baseline (no instructions) | 1,948 | -5% |
| terse (Answer concisely.) | 2,045 | |
| caveman | 1,026 | -50% |
| caveman-cn (Classical Chinese) | 951 | -53% |
| caveman-es (Spanish) | 1,218 | -40% |
| compress | 1,624 | -21% |

Adding "Answer concisely." increased output by 97 tokens vs no instructions — counterintuitive. Likely artifact of small sample (10 prompts). The honest comparison is caveman vs terse, not vs baseline.

The "-50%" for caveman is the aggregate (1,026 ÷ 2,045 = 50%). Per-prompt savings vary from 0% on "How do I fix a memory leak in a long-running Node.js process?" to 88% on "Explain database connection pooling." The median of the 10 per-prompt savings is 50.5%, coincidentally close to the aggregate.

The "~65-75%" claim

Caveman's eval shows 50% aggregate reduction vs terse. The 65-75% figure is unsupported by the eval data. Source undetermined.

Hidden costs

1. Input token overhead (every turn)

Caveman's compression rules are injected as a system prompt with every message. Token counts from tiktoken measurement:

| Delivery method | Input tokens/turn (tiktoken) |
| --- | --- |
| Core caveman rules only | 896 |
| All four caveman skill files (full install) | 3,409 |
| Minimal activation rule only | 163 |

Caveman's eval does not subtract this overhead from the reported savings.

2. Thinking/reasoning tokens

Caveman only compresses visible output. Extended thinking generates hidden reasoning tokens (billed as output) before the visible response. Caveman's eval has zero visibility into these. Whether the compression rules affect thinking output is untested. If they do, the economics shift in caveman's favor.

3. Full token picture per response

| Component | Tokens | Source |
| --- | --- | --- |
| Thinking (if enabled, same for both) | Unknown | Not measured |
| Terse visible output | ~205 out | Eval data (2,045 ÷ 10) |
| Caveman visible output | ~103 out | Eval data (1,026 ÷ 10) |
| Visible output saved | ~102/turn | Difference |
| Caveman input overhead | 896 in | tiktoken measurement |

Break-even analysis

The break-even calculation uses average output savings from caveman's eval (~102 tokens/turn) against the fixed input cost per turn. Per-prompt savings range from 0% to 88%.

Measured data only (visible output):

| Delivery method | Output tokens saved/turn (eval avg) | Input tokens paid/turn (tiktoken) | Turns to break even (visible output only) |
| --- | --- | --- | --- |
| Core caveman rules (896 tokens) | ~102 | 896 | ~9 turns |
| All four skill files (3,409 tokens) | ~102 | 3,409 | ~33 turns |
| Minimal rule (163 tokens) | ~102 | 163 | ~2 turns |

With thinking tokens (speculative): uncompressible thinking output would increase break-even, but magnitude depends on unmeasured thinking token counts. Not quantified.

Average masks the range: 0% savings still pays 896 input tokens; 88% savings breaks even in ~2 turns. Session break-even depends on question mix.
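The break-even column is just the ratio of per-turn input overhead to average visible output saved. A sketch; token counts only, ignoring thinking tokens and the input/output price asymmetry noted earlier (output tokens cost more per token, which would shift dollar break-even in caveman's favor):

```python
def break_even_turns(input_overhead: int, avg_saved: float = 102.0) -> float:
    """Turns until cumulative visible-output savings offset per-turn input cost."""
    return input_overhead / avg_saved

for name, overhead in [("core rules", 896), ("full install", 3409), ("minimal rule", 163)]:
    print(f"{name}: ~{break_even_turns(overhead):.1f} turns")
# core rules: ~8.8 / full install: ~33.4 / minimal rule: ~1.6
```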

Cognitive benefits

This analysis is economic. Caveman may also deliver cognitive benefits.

Hypotheses

  • Reading time: average English reading speed is ~238 wpm (Brysbaert 2019). Fewer tokens should mean less reading (back-of-envelope: ~102 tokens/turn saved is roughly 75 words at a common ~0.75 words/token heuristic, about 19 seconds at 238 wpm), but tokens ≠ words and compressed text may require re-reading. Unmeasured.
  • Signal-to-noise: caveman strips filler while preserving technical content (code, identifiers, commands), potentially increasing information density. Unmeasured.
  • Time-to-comprehension: the metric that matters. Controlled study needed: two groups, same question, standard vs compressed response, measure time to correct action.
  • Ambiguity tradeoff: compressed text may require more effort per word. For experienced developers, likely net positive — they know the domain and don't need hedging.

Open questions

  1. Thinking token compression: could caveman's rules make the model think tersely? Or would compressed output require thinking in natural language first, then compressing, increasing thinking tokens? Same for compressed prior responses in conversation history: the model may need to expand them before reasoning. Untested. Only answerable by comparing thinking token counts between terse and caveman arms; a minimal sketch follows this list.
  2. Non-Claude models: caveman's eval only tests Claude Opus. Output pricing, thinking behavior, and compression effectiveness may differ across Gemini, GPT, etc.
  3. Origin of the 65-75% claim: undetermined. May come from a different eval run, model, or measurement method.
  4. Time-to-comprehension study: cognitive benefit plausible but unmeasured. Controlled study (terse vs caveman at each intensity level, segmented by developer experience) would settle this.
  5. Signal-to-noise measurement: untested. Measuring technical content ratio in caveman's eval responses would be straightforward.
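For question 1, a minimal measurement sketch, assuming the extended-thinking request shape from Anthropic's docs (verify field names against current API documentation). Usage reports thinking and visible output combined, so the split below is approximated by tokenizing the visible text with tiktoken; the SKILL.md path is again a hypothetical stand-in.

```python
import anthropic
import tiktoken
from pathlib import Path

client = anthropic.Anthropic()
enc = tiktoken.get_encoding("o200k_base")
SKILL_MD = Path("SKILL.md").read_text()  # hypothetical path to the caveman rules

def output_split(system: str, question: str) -> tuple[int, int]:
    """Return (approx. thinking tokens, approx. visible tokens) for one call."""
    msg = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8000,  # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": 4000},
        system=system,
        messages=[{"role": "user", "content": question}],
    )
    visible = "".join(b.text for b in msg.content if b.type == "text")
    visible_tokens = len(enc.encode(visible))  # tiktoken approximation
    # usage.output_tokens bills thinking and visible text together
    return msg.usage.output_tokens - visible_tokens, visible_tokens

q = "Explain database connection pooling."
print(output_split("Answer concisely.", q))                 # terse arm
print(output_split("Answer concisely.\n\n" + SKILL_MD, q))  # caveman arm
```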

Further reading

Research on chain-of-thought (reasoning step) compression — adjacent to the thinking token question but not directly testing system prompt effects on production models:

  • Lee, Che & Peng (2025) — each problem has a minimum thinking token count ("token complexity") below which accuracy degrades. arxiv:2503.01141
  • Han et al. (2025) — proposes a token-budget-aware prompting framework; LLM reasoning is "unnecessarily lengthy" and compressible via prompt instructions. arxiv:2412.18547
  • Cao et al. (2026, Draft-Thinking) — training-based approach: teaches models a concise draft-style reasoning structure via curriculum learning, achieving 82.6% reasoning budget reduction on MATH500 at 2.6% accuracy cost. arxiv:2603.00578
  • Huang et al. (2026, SAT) — step-level difficulty-aware pruning of reasoning tokens using a process reward model; 40% reduction while maintaining accuracy. arxiv:2604.07922
  • Yan et al. (2026) — CoT compression can degrade trustworthiness (safety, hallucination resistance) even when accuracy is preserved. arxiv:2604.04120