@arun-gupta
Last active April 29, 2026 01:52

llm-d EPP Observability on a kind Cluster (Mac)

Part 6 of 6 — Series Index

This guide explores the metrics exposed by llm-d's EPP (Endpoint Picker) — the data the scheduler uses to make routing decisions. These metrics give visibility into per-pod queue depth, KV cache utilization, and request rates without any additional monitoring infrastructure.

Prerequisite: Complete the llm-d on a kind Cluster (Mac) guide. The vllm-hello cluster must be running with the llm-d stack deployed.

If you installed gaie-sim before this guide was updated, the metrics endpoint may return Unauthorized. Upgrade with auth disabled:

helm upgrade gaie-sim \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --version v1.4.0 \
  -f <(curl -s https://raw.githubusercontent.com/llm-d/llm-d/main/guides/recipes/scheduler/base.values.yaml) \
  -f <(curl -s https://raw.githubusercontent.com/llm-d/llm-d/main/guides/simulated-accelerators/gaie-sim/values.yaml) \
  --set inferenceExtension.monitoring.prometheus.enabled=false \
  --set inferenceExtension.monitoring.prometheus.auth.enabled=false
kubectl rollout status deployment/gaie-sim-epp
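
After the rollout completes, a quick check distinguishes a healthy endpoint from one that still has auth enabled. A minimal sketch, using a sample response in place of a live scrape (on the cluster, capture the response with the port-forward and curl commands from the steps below):

```shell
# Real-cluster capture (with the port-forward from step 1 running):
#   response=$(curl -s http://localhost:9090/metrics)
# Sample stand-in for a healthy response:
response='inference_extension_info{build_ref="",commit=""} 1'
case "$response" in
  *Unauthorized*) echo "auth still enabled - re-run the helm upgrade" ;;
  *inference_*)   echo "metrics endpoint OK" ;;
esac
```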

Architecture

flowchart LR
    classDef external  fill:#e2e8f0,stroke:#94a3b8,color:#1e293b
    classDef service   fill:#dbeafe,stroke:#3b82f6,color:#1e40af
    classDef container fill:#dcfce7,stroke:#16a34a,color:#166534

    Client(Client):::external -->|":8080"| GW[(Gateway\ninfra-sim)]:::service
    GW -->|"ext-proc :9002"| EPP[["EPP\ngaie-sim"]]:::container
    EPP --> VLLM[["vLLM pods"]]:::container
    Metrics(Metrics\nClient):::external -->|":9090/metrics"| EPP

    subgraph kind["kind cluster: vllm-hello"]
        GW
        EPP
        VLLM
    end

The EPP exposes a Prometheus metrics endpoint on port 9090. These are the same signals the EPP uses internally to score and select pods for each request.

1. Port-Forward the Metrics Endpoint

In a separate terminal, forward the EPP metrics port:

kubectl port-forward svc/gaie-sim-epp 9090:9090

2. View All Metrics

Fetch the full metrics output:

curl -s http://localhost:9090/metrics

3. Key Metrics

Filter for the inference-specific metrics:

curl -s http://localhost:9090/metrics | grep "^inference_"

Sample output (at idle with three pods):

inference_extension_info{build_ref="",commit=""} 1
inference_extension_prefix_indexer_size 0
inference_pool_average_kv_cache_utilization{name="gaie-sim"} 0
inference_pool_average_queue_size{name="gaie-sim"} 0
inference_pool_per_pod_queue_size{model_server_pod="vllm-xxx-rank-0",name="gaie-sim"} 0
inference_pool_per_pod_queue_size{model_server_pod="vllm-yyy-rank-0",name="gaie-sim"} 0
inference_pool_per_pod_queue_size{model_server_pod="vllm-zzz-rank-0",name="gaie-sim"} 0
inference_pool_ready_pods{name="gaie-sim"} 3

The EPP exposes per-pod scheduling signals:

Metric Description
inference_pool_ready_pods Number of healthy pods currently in the pool
inference_pool_average_kv_cache_utilization Average KV cache fill across all pods
inference_pool_average_queue_size Average number of requests queued per pod
inference_pool_per_pod_queue_size Per-pod queue depth — the primary signal the EPP uses to route requests
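
The pool average is just the mean of the per-pod values. A small sketch with a sample scrape (hypothetical pod names and values; on a live cluster, pipe `curl -s http://localhost:9090/metrics` into the same awk instead):

```shell
# Sample per-pod queue-size lines (hypothetical pods with queue depths 4, 0, 2):
sample='inference_pool_per_pod_queue_size{model_server_pod="vllm-a-rank-0",name="gaie-sim"} 4
inference_pool_per_pod_queue_size{model_server_pod="vllm-b-rank-0",name="gaie-sim"} 0
inference_pool_per_pod_queue_size{model_server_pod="vllm-c-rank-0",name="gaie-sim"} 2'

# Sum the trailing values and divide by the pod count:
echo "$sample" | awk '/^inference_pool_per_pod_queue_size/ {sum += $NF; n++}
                      END {printf "average_queue_size %.1f over %d pods\n", sum/n, n}'
```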

Note on CPU mode: queue depth and KV cache utilization read 0 on CPU-mode vLLM — the model processes requests faster than the EPP polls. On a GPU cluster these metrics reflect real load. The meaningful observable metric on a local cluster is inference_pool_ready_pods.

4. Watch Pool Size Change

In a separate terminal, start polling the metrics endpoint every second:

while true; do curl -s http://localhost:9090/metrics | grep "^inference_pool_ready_pods"; sleep 1; done
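
To cut the noise, print only transitions. A sketch of a small dedupe filter, fed sample lines here so it is self-contained (on the cluster, pipe the polling loop above into `dedupe` instead):

```shell
# Print a line only when the trailing value differs from the previous one.
dedupe() { awk '$NF != last {print; last = $NF}'; }

# Sample stream; on the cluster:
#   while true; do curl -s http://localhost:9090/metrics \
#     | grep "^inference_pool_ready_pods"; sleep 1; done | dedupe
printf '%s\n' \
  'inference_pool_ready_pods{name="gaie-sim"} 1' \
  'inference_pool_ready_pods{name="gaie-sim"} 1' \
  'inference_pool_ready_pods{name="gaie-sim"} 3' | dedupe
```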

Back in the original terminal, scale to three replicas:

kubectl scale deployment/vllm --replicas=3

The polling terminal shows ready_pods jump to 3 once the EPP registers the new pods. There is a delay of up to a minute after the pods reach 1/1 Running — the EPP has its own health-check polling cycle before it adds them to the pool:

inference_pool_ready_pods{name="gaie-sim"} 1
inference_pool_ready_pods{name="gaie-sim"} 1
inference_pool_ready_pods{name="gaie-sim"} 1
inference_pool_ready_pods{name="gaie-sim"} 3
inference_pool_ready_pods{name="gaie-sim"} 3
inference_pool_ready_pods{name="gaie-sim"} 3

5. Scale Back Down

kubectl scale deployment/vllm --replicas=1

The polling terminal shows ready_pods drop back to 1 as the EPP detects the terminating pods:

inference_pool_ready_pods{name="gaie-sim"} 3
inference_pool_ready_pods{name="gaie-sim"} 3
inference_pool_ready_pods{name="gaie-sim"} 1
inference_pool_ready_pods{name="gaie-sim"} 1

ready_pods is the signal the EPP uses to decide which backends are eligible for routing — when it drops to 0, the EPP stops routing until pods recover.
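
That eligibility logic can be sketched on a single scraped sample. A hypothetical check, using a hardcoded metric line in place of a live scrape (replace it with the curl pipeline above on the cluster):

```shell
# Sample scraped line; on the cluster:
#   line=$(curl -s http://localhost:9090/metrics | grep "^inference_pool_ready_pods")
line='inference_pool_ready_pods{name="gaie-sim"} 0'
ready=${line##* }   # take the value after the last space
if [ "$ready" -eq 0 ]; then
  echo "no eligible backends - EPP will not route"
else
  echo "$ready backends eligible"
fi
```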
