LLM inference at scale is the problem of finding the right combination of hardware, software, drivers, kernels, and routing.
vLLM is the inference engine. It takes a model and a GPU and turns them into a high-throughput HTTP API. Its core innovation is PagedAttention, which manages KV cache memory the way an OS manages virtual memory, enabling continuous batching and dramatically higher GPU utilization compared to naive serving. vLLM handles everything from kernel selection to memory management to the OpenAI-compatible endpoint your application talks to.
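To make that concrete, here's a minimal sketch of talking to a vLLM server through that OpenAI-compatible endpoint. It assumes a server already started locally with `vllm serve`; the model name and port are placeholders for whatever your deployment actually uses.

```python
# Assumes a vLLM server is already running, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# Model name and port below are placeholders for your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",  # vLLM ignores the key unless started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

From your application's point of view this is just the OpenAI API; vLLM handles batching, KV cache paging, and kernel selection behind it.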
But vLLM is a single server. When you need multiple replicas (for scale, redundancy, or cost), you need something above it that understands LLM-specific signals like KV cache state and queue depth. A standard load balancer doesn't; it just round-robins blindly.
That's where llm-d comes in. llm-d adds a scheduling layer, the EPP (Endpoint Picker), that sits between your gateway and your vLLM replicas and routes each request using exactly those signals.
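For intuition, here's an illustrative sketch of the kind of scoring an LLM-aware scheduler can do that a round-robin balancer cannot. This is not llm-d's actual EPP code: the `Replica` fields, the weights, and the helper names are all made up for the example, though the signals are modeled loosely on what vLLM exposes via its metrics (queued requests, KV cache usage).

```python
# Illustrative sketch only, not llm-d's EPP implementation.
# Picks the replica with the best score given LLM-specific signals.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queue_depth: int        # requests waiting on this replica
    kv_cache_usage: float   # fraction of KV cache blocks in use (0.0-1.0)
    prefix_cache_hit: bool  # would this request reuse a cached prefix here?

def score(r: Replica) -> float:
    """Higher is better. Weights are invented for illustration."""
    s = 0.0
    s -= 1.0 * r.queue_depth      # penalize long queues
    s -= 2.0 * r.kv_cache_usage   # penalize replicas near KV cache exhaustion
    if r.prefix_cache_hit:
        s += 3.0                  # strongly prefer skipping prefill entirely
    return s

def pick(replicas: list[Replica]) -> Replica:
    return max(replicas, key=score)

replicas = [
    Replica("pod-a", queue_depth=4, kv_cache_usage=0.9, prefix_cache_hit=False),
    Replica("pod-b", queue_depth=2, kv_cache_usage=0.4, prefix_cache_hit=True),
    Replica("pod-c", queue_depth=1, kv_cache_usage=0.2, prefix_cache_hit=False),
]
print(pick(replicas).name)  # pod-b: the prefix-cache hit outweighs its longer queue
```

A plain round-robin balancer would send a third of the traffic to pod-a even while its KV cache is nearly full; scoring on these signals is what the scheduling layer adds.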
