Part 2 of 6
llm-d adds a scheduling and routing layer on top of vLLM: a Gateway that accepts incoming requests, and an EPP (Endpoint Picker) that performs KV-cache-aware, load-aware routing to vLLM pods.
This guide deploys the llm-d scheduling layer into the same vllm-hello cluster from the vLLM on kind guide and routes requests through it to the real vLLM pod.
```mermaid
flowchart LR
    classDef external fill:#e2e8f0,stroke:#94a3b8,color:#1e293b
    classDef service fill:#dbeafe,stroke:#3b82f6,color:#1e40af
    classDef container fill:#dcfce7,stroke:#16a34a,color:#166534

    Client(Client):::external -->|":8080"| GW[(Gateway\ninfra-sim)]:::service
    GW -->|"HTTPRoute\next-proc"| EPP[["EPP\ngaie-sim"]]:::container
    EPP -->|"InferencePool\ngaie-sim"| VLLM[["vLLM Container\nfacebook/opt-125m · CPU"]]:::container
    AGW[["agentgateway\ncontroller"]]:::container -.->|"programs"| GW

    subgraph agentgateway-system["agentgateway-system"]
        AGW
    end
    subgraph kind["kind cluster: vllm-hello"]
        GW
        EPP
        VLLM
    end
```
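The EPP's role in the diagram above, picking a backend pod by combining KV-cache affinity with current load, can be sketched roughly as follows. This is an illustrative toy, not llm-d's actual scorer; the endpoint fields, the 0.1 load weight, and the prefix-matching heuristic are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queue_depth: int            # in-flight requests (load signal)
    cached_prefixes: set[str]   # prompt prefixes believed resident in KV cache

def score(ep: Endpoint, prompt: str) -> float:
    # The longest cached prefix approximates how much KV cache can be reused.
    overlap = max((len(p) for p in ep.cached_prefixes if prompt.startswith(p)), default=0)
    cache_score = overlap / max(len(prompt), 1)
    load_penalty = 0.1 * ep.queue_depth   # assumed weight
    return cache_score - load_penalty

def pick(endpoints: list[Endpoint], prompt: str) -> Endpoint:
    # Route to the highest-scoring vLLM pod.
    return max(endpoints, key=lambda ep: score(ep, prompt))

pods = [
    Endpoint("vllm-a", queue_depth=4, cached_prefixes={"San Francisco"}),
    Endpoint("vllm-b", queue_depth=0, cached_prefixes=set()),
]
print(pick(pods, "San Francisco is known for").name)  # → vllm-a (cache reuse outweighs load)
```

A real EPP also learns cache state from vLLM telemetry and reacts to pod churn; the point here is only that routing combines a cache-affinity score with a load penalty rather than round-robining.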
Prerequisite: Complete the vLLM on kind guide first. The `vllm-hello` cluster must be running, with `kind`, `kubectl`, and `helm` installed.

Install the Kubernetes Gateway API CRDs:

```shell
kubectl apply --server-side -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.0/standard-install.yaml
```

Install the Agentgateway CRDs and controller:
```shell
helm upgrade -i agentgateway-crds \
  oci://cr.agentgateway.dev/charts/agentgateway-crds \
  --create-namespace --namespace agentgateway-system \
  --version v1.1.0
```

```shell
helm upgrade -i agentgateway \
  oci://cr.agentgateway.dev/charts/agentgateway \
  --namespace agentgateway-system \
  --version v1.1.0 \
  --set inferenceExtension.enabled=true \
  --wait
```

Install the Gateway API Inference Extension CRDs (provides InferencePool and InferenceModel):
```shell
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.4.0/manifests.yaml
```

Add the llm-d-infra Helm repository:

```shell
helm repo add llm-d-infra https://llm-d-incubation.github.io/llm-d-infra/
helm repo update
```

Confirm the vLLM deployment from the prerequisite guide is running:
```shell
kubectl get deployment vllm
```

The EPP's InferencePool selects pods by label. Patch the deployment to add the required labels:
```shell
kubectl patch deployment vllm --type=merge -p '
spec:
  template:
    metadata:
      labels:
        llm-d.ai/inference-serving: "true"
        llm-d.ai/guide: simulated-accelerators
        llm-d.ai/accelerator-variant: cpu
        llm-d.ai/model: random'
```

Verify the labels are on the pod:
```shell
kubectl get pods -l llm-d.ai/inference-serving=true
```

Install the llm-d infrastructure chart, which creates the Gateway:

```shell
helm install infra-sim llm-d-infra/llm-d-infra \
  --version v1.4.0 \
  --set gateway.provider=agentgateway
```

Install the InferencePool chart, which deploys the EPP:

```shell
helm install gaie-sim \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --version v1.4.0 \
  -f <(curl -s https://raw.githubusercontent.com/llm-d/llm-d/main/guides/recipes/scheduler/base.values.yaml) \
  -f <(curl -s https://raw.githubusercontent.com/llm-d/llm-d/main/guides/simulated-accelerators/gaie-sim/values.yaml) \
  --set inferenceExtension.monitoring.prometheus.enabled=false \
  --set inferenceExtension.monitoring.prometheus.auth.enabled=false
```

Apply the HTTPRoute that connects the Gateway to the InferencePool:

```shell
kubectl apply -f https://raw.githubusercontent.com/llm-d/llm-d/main/guides/simulated-accelerators/httproute.yaml
```

Check all pods are running:
```shell
kubectl get pods
```

Expect the vLLM pod plus EPP and gateway pods:

```
NAME          READY   STATUS    RESTARTS   AGE
epp-...       1/1     Running   0          1m
gateway-...   1/1     Running   0          1m
vllm-...      1/1     Running   0          20m
```
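It helps to know what the chart created before checking it. Below is a hedged sketch of the InferencePool, not the exact object: the field names follow the Gateway API Inference Extension, but the real resource comes from the gaie-sim chart and its values files, so the port and the EPP reference name are assumptions:

```yaml
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: gaie-sim
spec:
  selector:
    matchLabels:
      llm-d.ai/inference-serving: "true"   # matches the labels patched onto the vLLM pods
  targetPorts:
    - number: 8000                         # assumed vLLM serving port
  endpointPickerRef:
    name: gaie-sim-epp                     # assumed EPP Service name
```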
Check the InferencePool was created:
```shell
kubectl get inferencepool
```

Port-forward the gateway:
```shell
kubectl port-forward svc/infra-sim-inference-gateway 8080:80
```

Send a completion request through the llm-d scheduling layer:
```shell
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is known for",
    "max_tokens": 50
  }'
```

Sample response:
```json
{
  "id": "cmpl-56b28a36-357f-4790-884c-0e7b93de2aec",
  "object": "text_completion",
  "created": 1777417557,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " being the safest place to visit for everything from whales to scallops.",
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 50,
    "total_tokens": 56
  }
}
```

Remove the llm-d components (this leaves vLLM running):

```shell
helm uninstall gaie-sim && helm uninstall infra-sim
```

To delete the cluster entirely:
```shell
kind delete cluster --name vllm-hello
```