@abhishekmishragithub
Last active September 25, 2025 05:16
llm benchmarking terms / glossary

RPS (Requests per Second): how many requests the system completed per second. RPS = successful_requests / benchmark_duration_sec
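The RPS formula above can be sketched as a one-liner (function name and numbers are illustrative):

```python
# RPS: completed requests divided by wall-clock benchmark duration.
def rps(successful_requests: int, benchmark_duration_sec: float) -> float:
    return successful_requests / benchmark_duration_sec

print(rps(1200, 60.0))  # 1200 requests over 60 s -> 20.0 RPS
```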

TPS (Tokens per Second): an ambiguous term; prefer one of these more explicit variants:

  • Output Token Throughput (tok/s) = total_output_tokens / duration

  • Total Token Throughput (tok/s) = (input_tokens + output_tokens) / duration

  • Per-request TPS (for sequential runs) = tokens_returned / request_latency
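The three throughput variants above can be sketched as small helpers (names are illustrative, mirroring the formulas):

```python
# Three ways to report "tokens per second" -- always say which one you mean.

def output_token_throughput(total_output_tokens: int, duration_sec: float) -> float:
    """Generated (output) tokens per second across the whole benchmark."""
    return total_output_tokens / duration_sec

def total_token_throughput(input_tokens: int, output_tokens: int, duration_sec: float) -> float:
    """Input + output tokens processed per second."""
    return (input_tokens + output_tokens) / duration_sec

def per_request_tps(tokens_returned: int, request_latency_sec: float) -> float:
    """Tokens per second seen by a single sequential request."""
    return tokens_returned / request_latency_sec
```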

TTFT (Time To First Token): time from request send → first token received. A proxy for perceived snappiness. Lower is better.
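One way to measure TTFT against a streaming endpoint, sketched with a stand-in generator in place of a real client (`fake_stream` is hypothetical):

```python
import time

def measure_ttft(send_request):
    """Time from request send to first streamed token; returns (ttft_sec, tokens)."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in send_request():
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        tokens.append(tok)
    return ttft, tokens

# Stand-in for a real streaming client: first token arrives after ~50 ms.
def fake_stream():
    time.sleep(0.05)
    yield "Hello"
    yield " world"

ttft, toks = measure_ttft(fake_stream)
```

In a real benchmark, `send_request` would wrap the streaming API call of whichever serving stack is under test.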

TPOT (Time Per Output Token, excl. first token): cadence after streaming begins. TPOT ≈ 1 / tokens_per_second_per_stream

ITL (Inter-Token Latency): measured gap between consecutive tokens when streaming. Similar to TPOT; sometimes includes small transport overheads.

E2E (End-to-End latency): request send → last token received. Depends heavily on output length.
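Given per-token arrival timestamps from one streamed request, TTFT, TPOT, ITL, and E2E can all be derived in one pass (a sketch; the timestamps are illustrative):

```python
def streaming_metrics(send_time: float, token_times: list[float]) -> dict:
    """Derive TTFT, E2E, per-gap ITL, and TPOT from token arrival timestamps."""
    ttft = token_times[0] - send_time            # request send -> first token
    e2e = token_times[-1] - send_time            # request send -> last token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # gaps between tokens
    # TPOT excludes the first token: remaining decode time spread over remaining tokens.
    tpot = (e2e - ttft) / (len(token_times) - 1) if len(token_times) > 1 else None
    return {"ttft": ttft, "e2e": e2e, "itl": itl, "tpot": tpot}

m = streaming_metrics(0.0, [0.20, 0.25, 0.30, 0.35])
# ttft = 0.20, e2e = 0.35, tpot = 0.05, itl = three 0.05 gaps
```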

Concurrency (in-flight): how many requests are simultaneously being processed.

Burstiness factor: shapes the arrival process. With Poisson arrivals (the usual default), a factor of 1.0 gives classic Poisson behavior; factors > 1 make bursts spikier.
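Poisson arrivals can be generated by drawing exponential inter-arrival gaps; this sketch covers the factor-1.0 case (spikier variants would swap in a heavier-tailed gap distribution, not shown here):

```python
import random

def poisson_arrival_times(rate_rps: float, n: int, seed: int = 0) -> list[float]:
    """Poisson process: inter-arrival gaps are exponential with mean 1/rate."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate_rps)  # next gap, mean 1/rate_rps seconds
        times.append(t)
    return times

# At 10 RPS, the average gap between arrivals should be ~0.1 s.
arrivals = poisson_arrival_times(10.0, 1000)
```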

rule of thumb: TTFT = “how fast does it start talking?”, ITL/TPOT = “how fast does it keep talking?”, E2E = “when does it finish?”, tok/s = “how much work per second can it do?”
