@abhishekmishragithub
Last active November 14, 2025 12:16
kimi-k2-thinking-eval-benchmark

Kimi-K2-Thinking – Local Evaluation (vLLM + LM Evaluation Harness)

1. Setup

  • Model: kimi-k2-thinking (Moonshot Kimi-K2 Thinking)

  • Serving backend: vLLM

  • Serve command (summary):

    • tensor-parallel-size=8
    • distributed-executor-backend=mp
    • max-model-len=131072
    • max-num-batched-tokens=98304
    • max-num-seqs=256
    • dtype=bfloat16
    • enable-chunked-prefill, enable-auto-tool-choice
  • Endpoints (OpenAI-compatible):

    • Chat: POST http://127.0.0.1:8000/v1/chat/completions
    • Completions: POST http://127.0.0.1:8000/v1/completions
  • Eval harness: EleutherAI lm-evaluation-harness

    • For chat-style tasks (GSM8K CoT, IFEval, etc.):
      • --model local-chat-completions
      • --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/chat/completions,..."
      • --apply_chat_template --fewshot_as_multiturn
    • For loglikelihood / multiple-choice tasks (HellaSwag):
      • --model local-completions
      • --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=True,tokenizer=/home/jovyan/models/kimi-k2-thinking, ..."
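Before kicking off a harness run, it is worth hitting the chat endpoint directly. A minimal Python sketch using only the standard library; the URL and model name come from the setup above, and the server must already be running for `chat()` to succeed (the payload builder itself works offline):

```python
import json
import urllib.request

# Endpoint and model name as served by vLLM (see the setup above).
BASE_URL = "http://127.0.0.1:8000/v1/chat/completions"
MODEL = "kimi-k2-thinking"

def build_chat_request(prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-compatible chat-completions payload (greedy decoding,
    matching the temperature=0 setting used in the eval runs)."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """Send one request to the local vLLM server and return the reply text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

If `chat("What is 2 + 2?")` returns a sensible answer, the harness's `local-chat-completions` backend should work against the same endpoint.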

2. Benchmarks Run (Smoke Tests)

These were quick sanity checks using --limit 20. They confirm the integration and basic model behavior, but the numbers should not be treated as final metrics.

2.1 GSM8K (CoT) – Math Reasoning

  • Task: gsm8k_cot (grade-school math word problems with chain-of-thought prompting)
  • Backend: local-chat-completions
  • Settings:
    • 8-shot (--num_fewshot 8)
    • temperature = 0 (greedy)
    • max_gen_toks = 512
    • batch_size = 1
    • limit = 20 examples

Results (20-sample preview):

| Task      | Filter           | n-shot | Metric      | Value | StdErr |
|-----------|------------------|--------|-------------|-------|--------|
| gsm8k_cot | flexible-extract | 8      | exact_match | 0.75  | 0.0993 |
| gsm8k_cot | strict-match     | 8      | exact_match | 0.70  | 0.1051 |

Interpretation:

  • flexible-extract:
    • Looser answer extraction (tolerates “The answer is 47.”).
    • ~75% of sampled problems answered correctly (≈15/20).
  • strict-match:
    • Stricter matching against the gold answer string.
    • ~70% exact-match accuracy (≈14/20).
  • High stderr (~0.10) because the sample size is tiny; this is just a sanity check.

2.2 HellaSwag – Commonsense Multiple Choice

  • Task: hellaswag
    • Standard commonsense reasoning benchmark with 4-way multiple choice.
  • Backend: local-completions
  • Settings (smoke run):
    • --model local-completions
    • --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=8,max_retries=3,timeout=3000,tokenized_requests=True,tokenizer=/home/jovyan/models/kimi-k2-thinking"
    • temperature = 0
    • max_gen_toks = 64
    • batch_size = 1
    • limit = 20 examples

Results (20-sample preview):

  • Metrics are written to:
    • results/kimi-think-hellaswag-smoke-hellaswag/results.json
  • This run is only a small slice of the full HellaSwag validation set (~10k examples), so numbers are for smoke-testing only.

✅ Once the full run is completed (no --limit), update this section with:

  • Accuracy (full validation set)
    • accuracy = X.XX (from results/kimi-think-hellaswag-full/results.json)
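For context on how the local-completions backend scores HellaSwag: each of the 4 endings is sent to /v1/completions and scored by log-likelihood, and the highest-scoring ending is the prediction. A toy sketch of the selection step; the real harness also reports a length-normalized variant (acc_norm), approximated here by a per-token average:

```python
def pick_choice(logprobs_per_option: list[list[float]]) -> tuple[int, int]:
    """Given per-token logprobs for each candidate ending, return the
    index chosen by total logprob ('acc') and by per-token average
    (a rough proxy for the length-normalized 'acc_norm')."""
    totals = [sum(lp) for lp in logprobs_per_option]
    normed = [sum(lp) / len(lp) for lp in logprobs_per_option]
    acc_idx = max(range(len(totals)), key=totals.__getitem__)
    norm_idx = max(range(len(normed)), key=normed.__getitem__)
    return acc_idx, norm_idx
```

Normalization matters because longer endings accumulate more negative log-probability mass, so the two indices can disagree, which is why the harness reports both acc and acc_norm.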

3. Planned / Full Evaluations (WIP)

For metrics we can actually compare and track, we plan to run:

3.1 GSM8K (CoT) – Full

Command:

lm_eval \
  --model local-chat-completions \
  --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=8,max_retries=3,tokenized_requests=False" \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks gsm8k_cot \
  --batch_size 1 \
  --num_fewshot 8 \
  --gen_kwargs "temperature=0,max_gen_toks=512" \
  --output_path results/kimi-think-gsm8k-full \
  --log_samples
  • Same setup as the smoke test, but no --limit.
  • Produces results/kimi-think-gsm8k-full/results.json with final exact-match scores.
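Once a run finishes, the final numbers can be pulled out of results.json programmatically. A small helper, assuming the "results" layout written by recent lm-eval versions (a dict keyed by task name, with metric keys that include the filter, e.g. "exact_match,strict-match"):

```python
import json
from pathlib import Path

def read_metrics(results_path: str, task: str) -> dict:
    """Return the metric dict for one task from a harness results.json.
    Assumes the top-level 'results' layout of recent lm-eval versions."""
    data = json.loads(Path(results_path).read_text())
    return data["results"][task]
```

For example, `read_metrics("results/kimi-think-gsm8k-full/results.json", "gsm8k_cot")` would return the exact-match scores to paste into this document.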

3.2 HellaSwag – Full

Command:

lm_eval \
  --model local-completions \
  --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=8,max_retries=3,timeout=3000,tokenized_requests=True,tokenizer=/home/jovyan/models/kimi-k2-thinking" \
  --tasks hellaswag \
  --batch_size 1 \
  --gen_kwargs "temperature=0,max_gen_toks=64" \
  --output_path results/kimi-think-hellaswag-full \
  --log_samples
  • Runs on the full HellaSwag validation set.
  • Produces results/kimi-think-hellaswag-full/results.json with accuracy and stderr.

3.3 Optional Next Steps

  • BBH (BIG-Bench Hard) via --tasks bbh using local-chat-completions (reasoning diversity).
  • IFEval via --tasks ifeval to explicitly measure instruction-following behavior.

4. Summary

  • Integration between vLLM (local Kimi-K2-Thinking) and lm-evaluation-harness is working for both:

    • Chat-style CoT tasks (gsm8k_cot via local-chat-completions), and
    • Loglikelihood multiple-choice tasks (hellaswag via local-completions).
  • Initial smoke runs show:

    • Reasonable math reasoning on GSM8K-CoT (≈70–75% on a 20-sample slice).
    • HellaSwag pipeline working end-to-end (20-sample smoke run).
  • Next, we will:

    • Run full GSM8K-CoT and full HellaSwag without --limit,
    • Log outputs and metrics under results/kimi-think-gsm8k-full/ and results/kimi-think-hellaswag-full/,
    • Add those final numbers to this document.

==========================

Short descriptions of the evals

GSM8K / gsm8k_cot

  • Full name: Grade School Math 8K

  • What it is: A dataset of ~8.5k math word problems written at grade-school level.

  • Eval variant (gsm8k_cot): Uses chain-of-thought prompting — the model sees a few worked examples and is expected to reason step-by-step and give the final numeric answer.

  • What it measures:

    • Multi-step arithmetic and algebra reasoning
    • Ability to maintain a chain of reasoning and land on the correct final answer
  • Metric: exact_match (did the model’s final answer match the gold answer).


HellaSwag / hellaswag

  • What it is: A commonsense reasoning benchmark built from multiple choice sentence completion. The model sees a short context and must choose the most plausible continuation among 4 options.

  • Eval style: Multiple-choice evaluated via log-likelihood (for each option) using a completions / logprob-style API.

  • What it measures:

    • Commonsense and world knowledge
    • Plausibility judgments (which continuation “feels” natural vs absurd)
  • Metric: accuracy over the 4-way choices.


BBH / bbh (Big-Bench Hard)

  • Full name: BIG-Bench Hard (subset of BIG-Bench)

  • What it is: A curated collection of the hardest tasks from BIG-Bench: logical reasoning, math puzzles, tracking objects, word games, etc. It’s split into many subtasks like boolean_expressions, tracking_shuffled_objects, salient_translation_error_detection, web_of_lies, etc.

  • Eval variant in lm-eval: bbh uses CoT-style few-shot prompts (typically 3-shot) per subtask (you saw names like bbh_cot_fewshot_boolean_expressions in the logs).

  • What it measures:

    • General-purpose “hard” reasoning
    • Ability to follow tricky instructions and maintain multi-step logic
  • Metric: Mostly accuracy per subtask, with an overall average across subtasks.
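When we report a single BBH number later, it will be an unweighted (macro) average over subtask accuracies, sketched below; the subtask names in the example are just the ones mentioned above:

```python
def bbh_macro_average(subtask_acc: dict[str, float]) -> float:
    """Unweighted mean accuracy across BBH subtasks - a common way to
    collapse the per-subtask scores into one headline number."""
    return sum(subtask_acc.values()) / len(subtask_acc)
```

For example, `bbh_macro_average({"boolean_expressions": 0.9, "web_of_lies": 0.7})` gives 0.8, regardless of how many examples each subtask contains.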


IFEval / ifeval

  • Full name: Instruction-Following Evaluation

  • What it is: A benchmark where each example is a natural-language instruction (or a set of constraints) and a reference rubric for what counts as “following the instruction”.

  • How it’s evaluated: The model’s output is checked against a set of automatically computed constraints (e.g., did it use the requested style, length, format, include/exclude certain words).

  • What it measures:

    • How reliably the model follows instructions
    • Adherence to formatting / phrasing constraints (e.g., “answer with only YES or NO”, “use exactly 3 bullet points”, etc.).
  • Metric: Percentage of constraints satisfied (various sub-metrics; lm-eval aggregates them).
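To make the "automatically computed constraints" idea concrete, here are two illustrative checks in the spirit of IFEval (these are not IFEval's actual verifier functions, just minimal examples of the pattern):

```python
def only_yes_or_no(response: str) -> bool:
    """Constraint: answer with only YES or NO (whitespace ignored)."""
    return response.strip() in {"YES", "NO"}

def exactly_n_bullets(response: str, n: int) -> bool:
    """Constraint: use exactly n bullet points ('- ' or '* ' lines)."""
    bullets = [line for line in response.splitlines()
               if line.lstrip().startswith(("- ", "* "))]
    return len(bullets) == n

def constraint_accuracy(results: list[bool]) -> float:
    """Fraction of constraints satisfied across a set of checks."""
    return sum(results) / len(results)
```

Each model output is run through the checks attached to its prompt, and the benchmark aggregates the pass/fail results into the reported percentages.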
