@abhishekmishragithub
Last active November 14, 2025 12:16
kimi-k2-thinking-eval-benchmark

Kimi-K2-Thinking – Local Evaluation (vLLM + LM Evaluation Harness)

1. Setup

  • Model: kimi-k2-thinking (Moonshot Kimi-K2 Thinking)

  • Serving backend: vLLM

  • Serve command (summary):

    • tensor-parallel-size=8
    • distributed-executor-backend=mp
    • max-model-len=131072
    • max-num-batched-tokens=98304
    • max-num-seqs=256
    • dtype=bfloat16
    • enable-chunked-prefill, enable-auto-tool-choice
  • Endpoints (OpenAI-compatible):

    • Chat: POST http://127.0.0.1:8000/v1/chat/completions
    • Completions: POST http://127.0.0.1:8000/v1/completions
  • Eval harness: EleutherAI lm-evaluation-harness

    • For chat-style tasks (GSM8K CoT, IFEval, etc.):
      • --model local-chat-completions
      • --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/chat/completions,..."
      • --apply_chat_template --fewshot_as_multiturn
    • For loglikelihood / multiple-choice tasks (HellaSwag):
      • --model local-completions
      • --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=True,tokenizer=/home/jovyan/models/kimi-k2-thinking, ..."
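Before kicking off a harness run, it is worth hitting the chat endpoint directly. A minimal Python sketch using only the standard library; the URL and model name come from the setup above, and the server must already be running for `chat()` to succeed (the payload builder itself works offline):

```python
import json
import urllib.request

# Endpoint and model name as served by vLLM (see the setup above).
BASE_URL = "http://127.0.0.1:8000/v1/chat/completions"
MODEL = "kimi-k2-thinking"

def build_chat_request(prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-compatible chat-completions payload (greedy decoding,
    matching the temperature=0 setting used in the eval runs)."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """Send one request to the local vLLM server and return the reply text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

If `chat("What is 2 + 2?")` returns a sensible answer, the harness's `local-chat-completions` backend should work against the same endpoint.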

2. Benchmarks Run (Smoke Tests)

These were quick sanity checks using --limit 20. They confirm the integration and basic model behavior, but the numbers should not be treated as final metrics.

2.1 GSM8K (CoT) – Math Reasoning

  • Task: gsm8k_cot (grade-school math word problems with chain-of-thought prompting)
  • Backend: local-chat-completions
  • Settings:
    • 8-shot (--num_fewshot 8)
    • temperature = 0 (greedy)
    • max_gen_toks = 512
    • batch_size = 1
    • limit = 20 examples

Results (20-sample preview):

| Task      | Filter           | n-shot | Metric      | Value | StdErr |
|-----------|------------------|--------|-------------|-------|--------|
| gsm8k_cot | flexible-extract | 8      | exact_match | 0.75  | 0.0993 |
| gsm8k_cot | strict-match     | 8      | exact_match | 0.70  | 0.1051 |

Interpretation:

  • flexible-extract:
    • Looser answer extraction (tolerates “The answer is 47.”).
    • ~75% of sampled problems answered correctly (≈15/20).
  • strict-match:
    • Stricter matching against the gold answer string.
    • ~70% exact-match accuracy (≈14/20).
  • High stderr (~0.10) because the sample size is tiny; this is just a sanity check.

2.2 HellaSwag – Commonsense Multiple Choice

  • Task: hellaswag
    • Standard commonsense reasoning benchmark with 4-way multiple choice.
  • Backend: local-completions
  • Settings (smoke run):
    • --model local-completions
    • --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=8,max_retries=3,timeout=3000,tokenized_requests=True,tokenizer=/home/jovyan/models/kimi-k2-thinking"
    • temperature = 0
    • max_gen_toks = 64
    • batch_size = 1
    • limit = 20 examples

Results (20-sample preview):

  • Metrics are written to:
    • results/kimi-think-hellaswag-smoke-hellaswag/results.json
  • This run is only a small slice of the full HellaSwag validation set (~10k examples), so numbers are for smoke-testing only.

✅ Once the full run is completed (no --limit), update this section with:

  • Accuracy (full validation set)
    • accuracy = X.XX (from results/kimi-think-hellaswag-full/results.json)
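For context on how the local-completions backend scores HellaSwag: each of the 4 endings is sent to /v1/completions and scored by log-likelihood, and the highest-scoring ending is the prediction. A toy sketch of the selection step; the real harness also reports a length-normalized variant (acc_norm), approximated here by a per-token average:

```python
def pick_choice(logprobs_per_option: list[list[float]]) -> tuple[int, int]:
    """Given per-token logprobs for each candidate ending, return the
    index chosen by total logprob ('acc') and by per-token average
    (a rough proxy for the length-normalized 'acc_norm')."""
    totals = [sum(lp) for lp in logprobs_per_option]
    normed = [sum(lp) / len(lp) for lp in logprobs_per_option]
    acc_idx = max(range(len(totals)), key=totals.__getitem__)
    norm_idx = max(range(len(normed)), key=normed.__getitem__)
    return acc_idx, norm_idx
```

Normalization matters because longer endings accumulate more negative log-probability mass, so the two indices can disagree, which is why the harness reports both acc and acc_norm.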

3. Planned / Full Evaluations (WIP)

For metrics we can actually compare and track, we plan to run:

3.1 GSM8K (CoT) – Full

Command:

lm_eval \
  --model local-chat-completions \
  --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=8,max_retries=3,tokenized_requests=False" \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks gsm8k_cot \
  --batch_size 1 \
  --num_fewshot 8 \
  --gen_kwargs "temperature=0,max_gen_toks=512" \
  --output_path results/kimi-think-gsm8k-full \
  --log_samples
  • Same setup as the smoke test, but no --limit.
  • Produces results/kimi-think-gsm8k-full/results.json with final exact-match scores.
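Once a run finishes, the final numbers can be pulled out of results.json programmatically. A small helper, assuming the "results" layout written by recent lm-eval versions (a dict keyed by task name, with metric keys that include the filter, e.g. "exact_match,strict-match"):

```python
import json
from pathlib import Path

def read_metrics(results_path: str, task: str) -> dict:
    """Return the metric dict for one task from a harness results.json.
    Assumes the top-level 'results' layout of recent lm-eval versions."""
    data = json.loads(Path(results_path).read_text())
    return data["results"][task]
```

For example, `read_metrics("results/kimi-think-gsm8k-full/results.json", "gsm8k_cot")` would return the exact-match scores to paste into this document.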

3.2 HellaSwag – Full

Command:

lm_eval \
  --model local-completions \
  --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=8,max_retries=3,timeout=3000,tokenized_requests=True,tokenizer=/home/jovyan/models/kimi-k2-thinking" \
  --tasks hellaswag \
  --batch_size 1 \
  --gen_kwargs "temperature=0,max_gen_toks=64" \
  --output_path results/kimi-think-hellaswag-full \
  --log_samples
  • Runs on the full HellaSwag validation set.
  • Produces results/kimi-think-hellaswag-full/results.json with accuracy and stderr.

3.3 Optional Next Steps

  • BBH (BIG-Bench Hard) via --tasks bbh using local-chat-completions (reasoning diversity).
  • IFEval via --tasks ifeval to explicitly measure instruction-following behavior.

4. Summary

  • Integration between vLLM (local Kimi-K2-Thinking) and lm-evaluation-harness is working for both:

    • Chat-style CoT tasks (gsm8k_cot via local-chat-completions), and
    • Loglikelihood multiple-choice tasks (hellaswag via local-completions).
  • Initial smoke runs show:

    • Reasonable math reasoning on GSM8K-CoT (≈70–75% on a 20-sample slice).
    • HellaSwag pipeline working end-to-end (20-sample smoke run).
  • Next, we will:

    • Run full GSM8K-CoT and full HellaSwag without --limit,
    • Log outputs and metrics under results/kimi-think-gsm8k-full/ and results/kimi-think-hellaswag-full/,
    • Add those final numbers to this document.

==========================

Short descriptions of the evals

GSM8K / gsm8k_cot

  • Full name: Grade School Math 8K

  • What it is: A dataset of ~8.5k math word problems written at grade-school level.

  • Eval variant (gsm8k_cot): Uses chain-of-thought prompting — the model sees a few worked examples and is expected to reason step-by-step and give the final numeric answer.

  • What it measures:

    • Multi-step arithmetic and algebra reasoning
    • Ability to maintain a chain of reasoning and land on the correct final answer
  • Metric: exact_match (did the model’s final answer match the gold answer).


HellaSwag / hellaswag

  • What it is: A commonsense reasoning benchmark built from multiple choice sentence completion. The model sees a short context and must choose the most plausible continuation among 4 options.

  • Eval style: Multiple-choice evaluated via log-likelihood (for each option) using a completions / logprob-style API.

  • What it measures:

    • Commonsense and world knowledge
    • Plausibility judgments (which continuation “feels” natural vs absurd)
  • Metric: accuracy over the 4-way choices.


BBH / bbh (Big-Bench Hard)

  • Full name: BIG-Bench Hard (subset of BIG-Bench)

  • What it is: A curated collection of the hardest tasks from BIG-Bench: logical reasoning, math puzzles, tracking objects, word games, etc. It’s split into many subtasks like boolean_expressions, tracking_shuffled_objects, salient_translation_error_detection, web_of_lies, etc.

  • Eval variant in lm-eval: bbh uses CoT-style few-shot prompts (typically 3-shot) per subtask (you saw names like bbh_cot_fewshot_boolean_expressions in the logs).

  • What it measures:

    • General-purpose “hard” reasoning
    • Ability to follow tricky instructions and maintain multi-step logic
  • Metric: Mostly accuracy per subtask, with an overall average across subtasks.
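When we report a single BBH number later, it will be an unweighted (macro) average over subtask accuracies, sketched below; the subtask names in the example are just the ones mentioned above:

```python
def bbh_macro_average(subtask_acc: dict[str, float]) -> float:
    """Unweighted mean accuracy across BBH subtasks - a common way to
    collapse the per-subtask scores into one headline number."""
    return sum(subtask_acc.values()) / len(subtask_acc)
```

For example, `bbh_macro_average({"boolean_expressions": 0.9, "web_of_lies": 0.7})` gives 0.8, regardless of how many examples each subtask contains.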


IFEval / ifeval

  • Full name: Instruction-Following Evaluation

  • What it is: A benchmark where each example is a natural-language instruction (or a set of constraints) and a reference rubric for what counts as “following the instruction”.

  • How it’s evaluated: The model’s output is checked against a set of automatically computed constraints (e.g., did it use the requested style, length, format, include/exclude certain words).

  • What it measures:

    • How reliably the model follows instructions
    • Adherence to formatting / phrasing constraints (e.g., “answer with only YES or NO”, “use exactly 3 bullet points”, etc.).
  • Metric: Percentage of constraints satisfied (various sub-metrics; lm-eval aggregates them).
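To make the "automatically computed constraints" idea concrete, here are two illustrative checks in the spirit of IFEval (these are not IFEval's actual verifier functions, just minimal examples of the pattern):

```python
def only_yes_or_no(response: str) -> bool:
    """Constraint: answer with only YES or NO (whitespace ignored)."""
    return response.strip() in {"YES", "NO"}

def exactly_n_bullets(response: str, n: int) -> bool:
    """Constraint: use exactly n bullet points ('- ' or '* ' lines)."""
    bullets = [line for line in response.splitlines()
               if line.lstrip().startswith(("- ", "* "))]
    return len(bullets) == n

def constraint_accuracy(results: list[bool]) -> float:
    """Fraction of constraints satisfied across a set of checks."""
    return sum(results) / len(results)
```

Each model output is run through the checks attached to its prompt, and the benchmark aggregates the pass/fail results into the reported percentages.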
