
llm-d on a kind Cluster (Mac): Series Index

LLM inference at scale means finding the right combination of hardware, software, drivers, kernels, and routing.

vLLM is the inference engine. It takes a model and a GPU and turns them into a high-throughput HTTP API. Its core innovation is PagedAttention, which manages KV cache memory the way an OS manages virtual memory, enabling continuous batching and dramatically higher GPU utilization compared to naive serving. vLLM handles everything from kernel selection to memory management to the OpenAI-compatible endpoint your application talks to.
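
As a concrete sketch, here is what a client call against that endpoint looks like. It assumes a vLLM server reachable on localhost:8000 (e.g. via `kubectl port-forward`) and serving facebook/opt-125m, as in Guide 1; adjust the host and model to your setup.

```python
# Minimal sketch: query a vLLM server's OpenAI-compatible API.
# Assumes vLLM is reachable on localhost:8000 and serves facebook/opt-125m.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",
        "prompt": "Kubernetes is",
        "max_tokens": 32,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```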

But vLLM is a single server. When you need multiple replicas (for scale, redundancy, or cost) you need something above it that understands LLM-specific signals like KV cache state and queue depth. A standard load balancer doesn't; it just round-robins blindly.

That's where llm-d comes in. llm-d adds a scheduling layer, the EPP (Endpoint Picker), that sits between your gateway and your vLLM pods. The EPP routes each request to the pod best positioned to handle it, based on live signals: queue depth, KV cache utilization, and prefix cache hits. Sending a request to a pod that already has the prompt prefix cached avoids redundant prefill computation, reduces latency, and makes better use of GPU memory.
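
To make the routing idea concrete, here is a toy scorer in Python. It is illustrative only, not the actual EPP code; the field names and weights are assumptions chosen for the example, but the signals are the ones named above.

```python
# Illustrative only -- not the actual EPP implementation. A toy scorer
# showing the kind of decision the EPP makes: prefer pods with a prefix
# cache hit, then penalize queue depth and KV cache pressure.
from dataclasses import dataclass

@dataclass
class PodState:
    name: str
    queue_depth: int             # requests waiting on this vLLM pod
    kv_cache_utilization: float  # 0.0 - 1.0
    prefix_cache_hit: bool       # pod already holds this prompt's prefix?

def score(pod: PodState) -> float:
    s = 0.0
    if pod.prefix_cache_hit:
        s += 2.0                 # cached prefix skips redundant prefill work
    s -= 0.5 * pod.queue_depth
    s -= 1.0 * pod.kv_cache_utilization
    return s

pods = [
    PodState("vllm-1", queue_depth=4, kv_cache_utilization=0.9, prefix_cache_hit=False),
    PodState("vllm-2", queue_depth=1, kv_cache_utilization=0.4, prefix_cache_hit=True),
    PodState("vllm-3", queue_depth=0, kv_cache_utilization=0.2, prefix_cache_hit=False),
]
best = max(pods, key=score)
print(f"route to {best.name}")   # -> vllm-2: the prefix hit outweighs its small queue
```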

This series walks through deploying and exercising the full llm-d stack on a local Kubernetes cluster on macOS, no GPU required. Each guide builds on the previous one.

Architecture

```mermaid
flowchart LR
    classDef external  fill:#e2e8f0,stroke:#94a3b8,color:#1e293b
    classDef service   fill:#dbeafe,stroke:#3b82f6,color:#1e40af
    classDef container fill:#dcfce7,stroke:#16a34a,color:#166534
    classDef policy    fill:#fce7f3,stroke:#db2777,color:#831843

    Client(Client):::external -->|":8080"| GW[(Gateway\ninfra-sim)]:::service
    Metrics(Metrics Client):::external -->|":9090/metrics"| EPP
    GW -->|"ext-proc :9002"| EPP[["EPP\ngaie-sim"]]:::container
    IMR{{InferenceModelRewrite}}:::policy -->|"model aliasing"| EPP
    EPP -->|"InferencePool"| V1[["vLLM pod 1"]]:::container
    EPP -->|"InferencePool"| V2[["vLLM pod 2"]]:::container
    EPP -->|"InferencePool"| V3[["vLLM pod 3"]]:::container
    AGW[["agentgateway\ncontroller"]]:::container -.->|"programs"| GW

    subgraph agentgateway-system["agentgateway-system"]
        AGW
    end

    subgraph kind["kind cluster: vllm-hello"]
        GW
        EPP
        IMR
        V1
        V2
        V3
    end
```
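
To exercise the path the diagram shows, a client sends the same OpenAI-style request to the gateway and lets the EPP pick the backing pod. This sketch assumes the gateway is port-forwarded to localhost:8080; the model name is whatever the client-facing name is (with InferenceModelRewrite in place it may be an alias rather than the backend's facebook/opt-125m).

```python
# Minimal sketch: the same call as before, but through the gateway on
# :8080 so the EPP routes it to one of the three vLLM pods.
# Assumes the gateway is port-forwarded to localhost:8080.
import requests

resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={"model": "facebook/opt-125m", "prompt": "Hello,", "max_tokens": 16},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```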

Guides

| # | Guide | What it covers |
|---|-------|----------------|
| 1 | Run vLLM on a kind Cluster | Deploy vLLM in CPU mode with facebook/opt-125m on a local kind cluster |
| 2 | Run llm-d on a kind Cluster | Add the llm-d Gateway and EPP scheduling layer on top of vLLM |
| 3 | Load Distribution | Scale to three replicas and watch the EPP spread requests across pods |
| 4 | Model Aliasing | Use InferenceModelRewrite to decouple client model names from backend model names |
| 5 | Fault Tolerance | Delete a pod mid-traffic and watch the EPP route around it automatically |
| 6 | EPP Observability | Scrape the EPP's Prometheus metrics endpoint and watch pool size change in real time (see the sketch after this table) |
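
As a taste of Guide 6, a minimal scrape of the EPP metrics endpoint might look like the following. It assumes the metrics port is forwarded to localhost:9090 (the `:9090/metrics` edge in the diagram); exact metric names depend on the EPP version, so this just filters on substrings.

```python
# Minimal sketch: scrape the EPP's Prometheus endpoint and print
# pool/queue related metrics. Assumes localhost:9090 is forwarded to the
# EPP metrics port; metric names vary by version, hence the substring filter.
import requests

text = requests.get("http://localhost:9090/metrics", timeout=10).text
for line in text.splitlines():
    if line.startswith("#"):
        continue  # skip HELP/TYPE comment lines
    if "inference_pool" in line or "queue" in line:
        print(line)
```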

Prerequisites

Start with Guide 1 and follow in order. Each guide assumes the cluster and components from the previous one are still running.
