llm-d is a Kubernetes-native distributed inference stack for large language models. It builds on vLLM and the Kubernetes Gateway API (plus the Gateway API Inference Extension) to provide model-aware routing, inference pools, and autoscaling primitives, so serving an LLM on Kubernetes feels closer to running a regular web service than to hand-rolling vLLM deployments. It's maintained by the llm-d community as an incubation project and is designed to plug into any conformant Gateway API implementation (kgateway, Istio, Envoy Gateway, etc.).
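To make the "regular web service" comparison concrete, here is a rough sketch of the kind of resources involved: an InferencePool that groups the vLLM pods behind an endpoint picker, and an HTTPRoute that sends traffic to that pool instead of a plain Service. This is illustrative only; the resource names are made up, and the exact fields and API version shown (v1alpha2 of the Gateway API Inference Extension) may differ from what your llm-d release installs.

```yaml
# Illustrative sketch only: names are hypothetical, and Inference Extension
# field names / API versions may differ between releases.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-3-8b-pool
spec:
  # Label selector for the vLLM replicas that belong to this pool
  selector:
    app: llama-3-8b-vllm
  targetPortNumber: 8000
  # Endpoint picker extension that makes the model-aware routing decision
  extensionRef:
    name: llama-3-8b-epp
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-3-8b-route
spec:
  parentRefs:
  - name: inference-gateway      # your Gateway (kgateway, Istio, Envoy Gateway, ...)
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool        # route to the pool, not to a regular Service
      name: llama-3-8b-pool
```

The endpoint picker referenced by `extensionRef` is where the "smart" part lives: it is the component that can take vLLM-level signals (load, cache state) into account when choosing which replica serves a request, rather than doing plain round-robin.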
This section was generated by AI to briefly introduce the project; the rest of the guide is based on my hands-on deployment notes.