Part 3 of 6
This guide scales the vLLM deployment to three replicas and shows llm-d's EPP routing requests across all of them in real time.
Prerequisite: Complete the Run llm-d on a kind Cluster (Mac) guide first. The vllm-hello cluster must be running with the llm-d stack deployed and the gateway port-forwarded on :8080.
```mermaid
flowchart LR
classDef external fill:#e2e8f0,stroke:#94a3b8,color:#1e293b
classDef service fill:#dbeafe,stroke:#3b82f6,color:#1e40af
classDef container fill:#dcfce7,stroke:#16a34a,color:#166534
Client(Client):::external -->|":8080"| GW[(Gateway\ninfra-sim)]:::service
GW -->|"HTTPRoute\next-proc"| EPP[["EPP\ngaie-sim"]]:::container
EPP --> V1[["vLLM pod 1"]]:::container
EPP --> V2[["vLLM pod 2"]]:::container
EPP --> V3[["vLLM pod 3"]]:::container
AGW[["agentgateway\ncontroller"]]:::container -.->|"programs"| GW
subgraph agentgateway-system["agentgateway-system"]
AGW
end
subgraph kind["kind cluster: vllm-hello"]
GW
EPP
V1
V2
V3
end
```
```shell
kubectl scale deployment/vllm --replicas=3
```

Watch all three pods come up:
```shell
kubectl get pods -l app=vllm -w
```

Each new pod skips the image pull (the image is already cached) but still takes 2–3 minutes to load the model. Wait until all three show `1/1 Running`:

```
NAME                    READY   STATUS    RESTARTS   AGE
vllm-5d8d7cfcdd-7gv6b   1/1     Running   0          30m
vllm-5d8d7cfcdd-m2kj9   1/1     Running   0          3m
vllm-5d8d7cfcdd-p8xnr   1/1     Running   0          3m
```
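Watching the `READY` column by eye works, but it can also be checked mechanically. A minimal sketch, run here against sample output so it works offline; on the cluster you would pipe `kubectl get pods -l app=vllm` into the same `awk` (the script itself is an illustration, not part of the llm-d tooling):

```shell
# Check that every pod (every row after the header) is 1/1 Running.
# Sample `kubectl get pods` output stands in for the live listing.
printf '%s\n' \
  'NAME                    READY   STATUS    RESTARTS   AGE' \
  'vllm-5d8d7cfcdd-7gv6b   1/1     Running   0          30m' \
  'vllm-5d8d7cfcdd-m2kj9   1/1     Running   0          3m' \
  'vllm-5d8d7cfcdd-p8xnr   1/1     Running   0          3m' |
awk 'NR > 1 && ($2 != "1/1" || $3 != "Running") { bad = 1 }
     END { print (bad ? "still waiting" : "all ready") }'
# prints "all ready"
```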
In a separate terminal, stream request logs from every vLLM pod simultaneously. The `--prefix` flag prepends the pod name to each line; `grep` filters out startup noise:

```shell
kubectl logs --prefix -l app=vllm -f | grep "POST /v1"
```

Leave this running; it is the live view of which pod handles each request.
Back in the original terminal, fire nine requests in parallel:

```shell
for i in {1..9}; do
  curl -s http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"facebook/opt-125m\", \"prompt\": \"Prompt $i: The future of AI is\", \"max_tokens\": 20}" \
    -o /dev/null &
done
wait
echo "done"
```

In the log terminal you will see the three pod names interleaved, confirming that llm-d's EPP spread the load across all replicas:
```
[pod/vllm-f8bc8464c-jhr6m] INFO: 10.244.0.15:60830 - "POST /v1/completions HTTP/1.1" 200 OK
[pod/vllm-f8bc8464c-jhr6m] INFO: 10.244.0.15:60844 - "POST /v1/completions HTTP/1.1" 200 OK
[pod/vllm-f8bc8464c-jhr6m] INFO: 10.244.0.15:60856 - "POST /v1/completions HTTP/1.1" 200 OK
[pod/vllm-f8bc8464c-mr2zp] INFO: 10.244.0.15:37462 - "POST /v1/completions HTTP/1.1" 200 OK
[pod/vllm-f8bc8464c-mr2zp] INFO: 10.244.0.15:37470 - "POST /v1/completions HTTP/1.1" 200 OK
[pod/vllm-f8bc8464c-6q9f8] INFO: 10.244.0.15:43174 - "POST /v1/completions HTTP/1.1" 200 OK
[pod/vllm-f8bc8464c-6q9f8] INFO: 10.244.0.15:43180 - "POST /v1/completions HTTP/1.1" 200 OK
[pod/vllm-f8bc8464c-6q9f8] INFO: 10.244.0.15:43192 - "POST /v1/completions HTTP/1.1" 200 OK
[pod/vllm-f8bc8464c-6q9f8] INFO: 10.244.0.15:43202 - "POST /v1/completions HTTP/1.1" 200 OK
```
All three pod suffixes (jhr6m, mr2zp, 6q9f8) appear — the EPP's load-aware routing spread requests across the pool. Distribution is not strictly round-robin; pods receive varying counts based on observed queue depth.
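To turn the interleaved stream into a per-pod count, the `[pod/...]` prefix that `kubectl logs --prefix` adds can be parsed out and tallied. A minimal sketch, fed sample lines here so it runs offline; against the cluster you would pipe the captured `kubectl logs --prefix -l app=vllm` output through the same pipeline (the pipeline is an illustration, not part of the llm-d tooling):

```shell
# Count completion requests per pod, busiest pod first.
# The sed keeps only the pod/... name from the log prefix.
printf '%s\n' \
  '[pod/vllm-f8bc8464c-jhr6m] INFO: "POST /v1/completions HTTP/1.1" 200 OK' \
  '[pod/vllm-f8bc8464c-mr2zp] INFO: "POST /v1/completions HTTP/1.1" 200 OK' \
  '[pod/vllm-f8bc8464c-6q9f8] INFO: "POST /v1/completions HTTP/1.1" 200 OK' \
  '[pod/vllm-f8bc8464c-6q9f8] INFO: "POST /v1/completions HTTP/1.1" 200 OK' |
grep 'POST /v1' |
sed 's/^\[//; s/\].*//' |
sort | uniq -c | sort -rn
```

The tally makes the skew from load-aware routing easy to see at a glance: equal counts suggest idle pods, while uneven counts reflect differing queue depths at the time each request arrived.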
When you are finished, scale back down to a single replica:

```shell
kubectl scale deployment/vllm --replicas=1
```