- Model: kimi-k2-thinking (Moonshot Kimi-K2 Thinking)
- Serving backend: vLLM
- Serve command (summary):
  --tensor-parallel-size=8 --distributed-executor-backend=mp --max-model-len=131072 --max-num-batched-tokens=98304 --max-num-seqs=256 --dtype=bfloat16 --enable-chunked-prefill --enable-auto-tool-choice
- Endpoints (OpenAI-compatible):
  - Chat: POST http://127.0.0.1:8000/v1/chat/completions
  - Completions: POST http://127.0.0.1:8000/v1/completions
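The chat endpoint above speaks the OpenAI chat-completions schema. A minimal sketch of a request body, built but not sent (sending assumes the local vLLM server is running):

```python
import json

# Build an OpenAI-style chat-completions payload for the local endpoint.
# The model name and URL come from the setup above; the prompt is a stand-in.
BASE_URL = "http://127.0.0.1:8000/v1/chat/completions"

payload = {
    "model": "kimi-k2-thinking",
    "messages": [{"role": "user", "content": "What is 6 * 7?"}],
    "temperature": 0,  # greedy, matching the eval settings below
    "max_tokens": 512,
}

body = json.dumps(payload)
print(body)
```

POSTing `body` to `BASE_URL` with a `Content-Type: application/json` header is all the harness's `local-chat-completions` backend does under the hood.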
- Eval harness: EleutherAI lm-evaluation-harness
  - For chat-style tasks (GSM8K CoT, IFEval, etc.):
    --model local-chat-completions --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/chat/completions, ..." --apply_chat_template --fewshot_as_multiturn
  - For loglikelihood / multiple-choice tasks (HellaSwag):
    --model local-completions --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=True,tokenizer=/home/jovyan/models/kimi-k2-thinking, ..."
These were quick sanity checks using --limit 20. They confirm integration and basic behavior but should not be treated as final metrics.
- Task: gsm8k_cot (grade-school math word problems with chain-of-thought prompting)
- Backend: local-chat-completions
- Settings:
  - 8-shot (--num_fewshot 8), temperature = 0 (greedy), max_gen_toks = 512, batch_size = 1, limit = 20 examples
Results (20-sample preview):
| Task | Filter | n-shot | Metric | Value | StdErr |
|---|---|---|---|---|---|
| gsm8k_cot | flexible-extract | 8 | exact_match | 0.75 | 0.0993 |
| gsm8k_cot | strict-match | 8 | exact_match | 0.70 | 0.1051 |
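The StdErr column is consistent with a sample standard error over n = 20; a quick check, assuming the harness uses the n - 1 form of the binomial standard error:

```python
import math

# Sample standard error of a proportion p over n samples, using the
# n - 1 denominator; this reproduces the StdErr values in the table.
def sample_stderr(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / (n - 1))

print(round(sample_stderr(0.75, 20), 4))  # 0.0993 (flexible-extract)
print(round(sample_stderr(0.70, 20), 4))  # 0.1051 (strict-match)
```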
Interpretation:
- flexible-extract:
  - Looser answer extraction (tolerates “The answer is 47.”).
  - ~75% of sampled problems answered correctly (≈15/20).
- strict-match:
  - Stricter matching against the gold answer string.
  - ~70% exact-match accuracy (≈14/20).
- High stderr (~0.10) because the sample size is tiny; this is just a sanity check.
- Task: hellaswag (standard commonsense reasoning benchmark with 4-way multiple choice)
- Backend: local-completions
- Settings (smoke run):
  --model local-completions --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=8,max_retries=3,timeout=3000,tokenized_requests=True,tokenizer=/home/jovyan/models/kimi-k2-thinking"
  temperature = 0, max_gen_toks = 64, batch_size = 1, limit = 20 examples
Results (20-sample preview):
- Metrics are written to: results/kimi-think-hellaswag-smoke-hellaswag/results.json
- This run is only a small slice of the full HellaSwag validation set (~10k examples), so numbers are for smoke-testing only.

✅ Once the full run is completed (no --limit), update this section with:
- Accuracy (full validation set): accuracy = X.XX (from results/kimi-think-hellaswag-full/results.json)
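Pulling that number out of the results file can be sketched as below. The key layout ("results" -> task -> "acc,none") matches recent lm-eval output but should be verified against the actual file; an inline stand-in is used here instead of reading results/kimi-think-hellaswag-full/results.json:

```python
import json

# Stand-in for the JSON that lm-eval writes; the 0.62 is a placeholder,
# not a measured result.
results_json = json.loads("""
{
  "results": {
    "hellaswag": {"acc,none": 0.62, "acc_stderr,none": 0.11}
  }
}
""")

acc = results_json["results"]["hellaswag"]["acc,none"]
print(f"accuracy = {acc:.2f}")
```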
For metrics we can actually compare and track, we plan to run:
Command:

```shell
lm_eval \
  --model local-chat-completions \
  --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=8,max_retries=3,tokenized_requests=False" \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks gsm8k_cot \
  --batch_size 1 \
  --num_fewshot 8 \
  --gen_kwargs "temperature=0,max_gen_toks=512" \
  --output_path results/kimi-think-gsm8k-full \
  --log_samples
```

- Same setup as the smoke test, but no --limit.
- Produces results/kimi-think-gsm8k-full/results.json with final exact-match scores.
Command:

```shell
lm_eval \
  --model local-completions \
  --model_args "model=kimi-k2-thinking,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=8,max_retries=3,timeout=3000,tokenized_requests=True,tokenizer=/home/jovyan/models/kimi-k2-thinking" \
  --tasks hellaswag \
  --batch_size 1 \
  --gen_kwargs "temperature=0,max_gen_toks=64" \
  --output_path results/kimi-think-hellaswag-full \
  --log_samples
```

- Runs on the full HellaSwag validation set.
- Produces results/kimi-think-hellaswag-full/results.json with accuracy and stderr.
- BBH (BIG-Bench Hard) via --tasks bbh using local-chat-completions (reasoning diversity).
- IFEval via --tasks ifeval to explicitly measure instruction-following behavior.
- Integration between vLLM (local Kimi-K2-Thinking) and lm-evaluation-harness is working for both:
  - Chat-style CoT tasks (gsm8k_cot via local-chat-completions), and
  - Loglikelihood multiple-choice tasks (hellaswag via local-completions).
- Initial smoke runs show:
  - Reasonable math reasoning on GSM8K-CoT (≈70–75% on a 20-sample slice).
  - HellaSwag pipeline working end-to-end (20-sample smoke run).
- Next, we will:
  - Run full GSM8K-CoT and full HellaSwag without --limit,
  - Log outputs and metrics under results/kimi-think-gsm8k-full/ and results/kimi-think-hellaswag-full/,
  - Add those final numbers to this document.
==========================
- Full name: Grade School Math 8K
- What it is: A dataset of ~8.5k math word problems written at grade-school level.
- Eval variant (gsm8k_cot): Uses chain-of-thought prompting — the model sees a few worked examples and is expected to reason step-by-step and give the final numeric answer.
- What it measures:
  - Multi-step arithmetic and algebra reasoning
  - Ability to maintain a chain of reasoning and land on the correct final answer
- Metric: exact_match (did the model’s final answer match the gold answer).
- Full name: HellaSwag
- What it is: A commonsense reasoning benchmark built from multiple-choice sentence completion. The model sees a short context and must choose the most plausible continuation among 4 options.
- Eval style: Multiple choice evaluated via log-likelihood (for each option) using a completions / logprob-style API.
- What it measures:
  - Commonsense and world knowledge
  - Plausibility judgments (which continuation “feels” natural vs absurd)
- Metric: accuracy over the 4-way choices.
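The log-likelihood evaluation style above can be sketched as follows: each candidate ending is scored by the sum of the token logprobs the completions endpoint returns for it, and the highest-scoring option wins. The logprob values here are made-up stand-ins:

```python
# Multiple choice via log-likelihood: sum each option's token logprobs
# and pick the argmax. Each inner list is one option's token logprobs.
def pick_option(option_token_logprobs: list[list[float]]) -> int:
    scores = [sum(lp) for lp in option_token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

fake_logprobs = [
    [-2.1, -3.0, -1.2],  # option 0
    [-0.4, -0.9, -0.7],  # option 1 (most plausible)
    [-5.0, -2.2, -4.1],  # option 2
    [-3.3, -3.3, -3.3],  # option 3
]
print(pick_option(fake_logprobs))  # 1
```

This is why HellaSwag needs the /v1/completions endpoint with logprobs rather than the chat endpoint: no text is generated, only scored.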
- Full name: BIG-Bench Hard (subset of BIG-Bench)
- What it is: A curated collection of the hardest tasks from BIG-Bench: logical reasoning, math puzzles, tracking objects, word games, etc. It’s split into many subtasks such as boolean_expressions, tracking_shuffled_objects, salient_translation_error_detection, and web_of_lies.
- Eval variant in lm-eval: bbh uses CoT-style few-shot prompts (typically 3-shot) per subtask (you saw names like bbh_cot_fewshot_boolean_expressions in the logs).
- What it measures:
  - General-purpose “hard” reasoning
  - Ability to follow tricky instructions and maintain multi-step logic
- Metric: Mostly accuracy per subtask, with an overall average across subtasks.
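The overall average in the last bullet is just the plain mean over subtask accuracies; a sketch, with subtask names patterned after the logs and made-up placeholder values:

```python
# Per-subtask accuracies (placeholder values, not measured results).
subtask_acc = {
    "bbh_cot_fewshot_boolean_expressions": 0.9,
    "bbh_cot_fewshot_tracking_shuffled_objects": 0.6,
    "bbh_cot_fewshot_web_of_lies": 0.8,
}

# Unweighted macro-average across subtasks.
overall = sum(subtask_acc.values()) / len(subtask_acc)
print(round(overall, 4))  # 0.7667
```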
- Full name: Instruction-Following Evaluation
- What it is: A benchmark where each example is a natural-language instruction (or a set of constraints) and a reference rubric for what counts as “following the instruction”.
- How it’s evaluated: The model’s output is checked against a set of automatically computed constraints (e.g., did it use the requested style, length, format, include/exclude certain words).
- What it measures:
  - How reliably the model follows instructions
  - Adherence to formatting / phrasing constraints (e.g., “answer with only YES or NO”, “use exactly 3 bullet points”, etc.)
- Metric: Percentage of constraints satisfied (various sub-metrics; lm-eval aggregates them).
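Two of the example constraints above can be checked mechanically, in the spirit of IFEval's automatic verification. These checkers are simplified stand-ins, not IFEval's actual rules:

```python
# Simplified, illustrative constraint checkers (not IFEval's real ones).
def check_yes_no_only(text: str) -> bool:
    # "answer with only YES or NO"
    return text.strip() in {"YES", "NO"}

def check_exactly_three_bullets(text: str) -> bool:
    # "use exactly 3 bullet points"
    bullets = [ln for ln in text.splitlines() if ln.lstrip().startswith("- ")]
    return len(bullets) == 3

print(check_yes_no_only("YES"))                      # True
print(check_yes_no_only("Yes, definitely."))         # False
print(check_exactly_three_bullets("- a\n- b\n- c"))  # True
```

The final score is then the fraction of such constraints satisfied, aggregated over the benchmark.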