RPS (Requests per Second): how many requests the system completed per second.
RPS = successful_requests / benchmark_duration_sec
TPS (Tokens per Second): ambiguous term, can be more explicit:
- Output Token Throughput (tok/s) = total_output_tokens / duration
- Total Token Throughput (tok/s) = (input_tokens + output_tokens) / duration
- Per-request TPS (your sequential runs) = tokens_returned / request_latency
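The throughput formulas above can be sketched in a few lines of Python. The `RequestResult` record and the function names are hypothetical, and counting tokens only from successful requests is an assumption rather than something every benchmark tool does.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    # Hypothetical per-request record; field names are illustrative only.
    success: bool
    input_tokens: int
    output_tokens: int
    latency_sec: float  # request send -> last token received


def throughput_summary(results, benchmark_duration_sec):
    """System-level RPS and token throughput over the whole benchmark run."""
    ok = [r for r in results if r.success]  # assumption: failed requests are excluded
    total_in = sum(r.input_tokens for r in ok)
    total_out = sum(r.output_tokens for r in ok)
    return {
        "rps": len(ok) / benchmark_duration_sec,
        "output_tok_per_sec": total_out / benchmark_duration_sec,
        "total_tok_per_sec": (total_in + total_out) / benchmark_duration_sec,
    }


def per_request_tps(result):
    """Per-request TPS: tokens returned divided by that request's own latency."""
    return result.output_tokens / result.latency_sec
```

Note that the system-level numbers divide by the benchmark's wall-clock duration, while per-request TPS divides by a single request's latency, so the two are not directly comparable once requests run concurrently.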
TTFT (Time To First Token): time from request send → first token received. Proxy for perceived snappiness. Lower is better.
TPOT (Time Per Output Token) (excl. first token): cadence after streaming begins.
TPOT ≈ 1 / (tokens_per_second_per_stream)
ITL (Inter-Token Latency): measured gap between consecutive tokens when streaming. Similar to TPOT; sometimes includes small transport overheads.
E2E (End-to-End latency): request send → last token received. Depends heavily on output length.
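As a rough sketch, all four latency metrics can be derived from one request's send time and the arrival timestamps of its streamed tokens. The function name and its inputs are hypothetical. It assumes one token per streamed chunk and uses the common definition TPOT = (E2E - TTFT) / (output_tokens - 1), which individual tools may compute slightly differently.

```python
def latency_metrics(send_time, token_times):
    """Derive TTFT, E2E, TPOT and ITLs from timestamps (all in seconds).

    send_time:   when the request was sent
    token_times: arrival time of each output token, in order
                 (assumption: one token per streamed chunk)
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")

    ttft = token_times[0] - send_time   # request send -> first token
    e2e = token_times[-1] - send_time   # request send -> last token

    # Measured gap between each pair of consecutive tokens.
    itls = [later - earlier for earlier, later in zip(token_times, token_times[1:])]

    # Average cadence after the first token; undefined for a single-token reply.
    n = len(token_times)
    tpot = (e2e - ttft) / (n - 1) if n > 1 else None

    return {"ttft": ttft, "e2e": e2e, "tpot": tpot, "itl": itls}
```

Under these assumptions TPOT equals the mean of the ITL values, while the individual ITLs also expose stalls that the average hides.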
Concurrency (in-flight): how many requests are simultaneously being processed.
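If per-request start and end times are logged, in-flight concurrency can be reconstructed after the fact. The sketch below is one way to do it; the function names and the `(start, end)` interval format are hypothetical, and it assumes every interval falls inside the benchmark window.

```python
def peak_in_flight(intervals):
    """Largest number of requests being processed at the same instant.

    intervals: list of (start_sec, end_sec) pairs, one per request.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))    # request enters the system
        events.append((end, -1))     # request leaves the system
    # At equal timestamps, process departures (-1) before arrivals (+1) so
    # back-to-back requests are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def mean_in_flight(intervals, duration_sec):
    """Time-averaged concurrency: total time spent in flight / wall-clock time."""
    return sum(end - start for start, end in intervals) / duration_sec
```

The time-averaged value is the same quantity Little's law gives as RPS × mean E2E latency, which makes it a handy cross-check against the reported throughput.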
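Burstiness factor: shapes the request arrival process. A factor of 1.0 (default) = classic Poisson arrivals. In tools that draw inter-arrival gaps from a gamma distribution with this factor as the shape parameter (vLLM's serving benchmark does this), <1 makes arrivals burstier and >1 makes them more uniform.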
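To see what the factor does, here is a minimal sketch that samples inter-arrival gaps. It assumes the factor is the shape parameter of a gamma distribution whose mean gap is fixed at 1 / request_rate, which is how vLLM's serving benchmark describes it; other tools may define burstiness differently. The function names are hypothetical.

```python
import random


def arrival_gaps(request_rate, burstiness, n, seed=0):
    """Sample n inter-arrival gaps (seconds) for a target request rate.

    Assumption: gaps follow a gamma distribution with shape = burstiness and
    scale = 1 / (request_rate * burstiness), so the mean gap is always
    1 / request_rate. With burstiness = 1.0 this reduces to exponential gaps,
    i.e. a Poisson arrival process.
    """
    rng = random.Random(seed)
    shape = burstiness
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(shape, scale) for _ in range(n)]


def coefficient_of_variation(xs):
    """Standard deviation divided by mean; higher means more irregular gaps."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return var ** 0.5 / mean
```

Under this assumption the mean arrival rate stays the same for every factor. Only the spread of the gaps changes: in theory the coefficient of variation is 1 / sqrt(burstiness), so smaller factors produce clumps of requests separated by long pauses and larger factors produce nearly evenly spaced requests.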
Rule of thumb: TTFT = “how fast does it start talking?”, ITL/TPOT = “how fast does it keep talking?”, E2E = “when does it finish?”, tok/s = “how much work per second can it do?”