A guide to understanding the core concepts behind Ray Serve — from clusters to deployments to autoscaling — so you can build and configure production applications with confidence.
- The Ray Cluster
- KubeRay: Running Ray on Kubernetes
- Ray Serve Concepts
- How Requests Flow
- Scaling: Two Layers
- Composing Multi-Stage Pipelines
- Deploying with RayService
- Resource Allocation
A Ray cluster is a pool of machines that Ray uses to schedule and run distributed workloads. It has one head node (the control plane) and any number of worker nodes (compute).
```mermaid
graph TB
subgraph RayCluster["Ray Cluster"]
Head["Head Node<br>────────────<br>• Cluster control plane<br>• Autoscaler<br>• Dashboard (:8265)"]
subgraph CPUWorkers["CPU Worker Group"]
W1["Worker Node"]
W2["Worker Node"]
end
subgraph GPUWorkers["GPU Worker Group"]
G1["GPU Worker Node"]
end
end
Head --- CPUWorkers
Head --- GPUWorkers
```
Worker nodes are organized into worker groups — sets of identically-shaped machines (same CPU, memory, GPU configuration). A cluster can have multiple worker groups with different resource profiles, which is useful when some deployments need GPUs and others don't.
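A quick way to see these resource profiles on a live cluster is to query Ray directly. A minimal sketch (it assumes you run it somewhere that can reach the cluster, such as the head pod; the output values are illustrative):

```python
import ray

# Attach to the running cluster (from the head node, a job, or a driver pod).
ray.init(address="auto")

# Total resources aggregated across the head node and every worker group.
print(ray.cluster_resources())   # e.g. {'CPU': 24.0, 'GPU': 1.0, ...}

# Per-node breakdown, handy for checking each worker group's shape.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"])
```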
KubeRay is a Kubernetes operator that manages Ray clusters for you. Instead of provisioning machines manually, you describe what you want in a Kubernetes manifest and KubeRay handles the rest.
Each Ray node runs as a Kubernetes Pod. KubeRay creates and deletes these pods to match the desired cluster state.
```mermaid
graph LR
subgraph YourManifest["Your Kubernetes Manifest"]
RS["RayService<br>(cluster config +<br>Serve app config)"]
end
subgraph KubernetesCluster["Kubernetes Cluster"]
KO["KubeRay Operator"]
HP["Head Pod"]
WP1["Worker Pod"]
WP2["Worker Pod"]
GP["GPU Worker Pod"]
SVC["Service :8000<br>(traffic to Serve)"]
end
RS -->|"watched by"| KO
KO -->|"creates & manages"| HP
KO -->|"creates & manages"| WP1
KO -->|"creates & manages"| WP2
KO -->|"creates & manages"| GP
HP --- SVC
```
The main resource you'll work with is the RayService CRD, which bundles together
your cluster definition and your Ray Serve application config into one manifest.
Ray Serve runs on top of a Ray cluster and provides the HTTP serving layer. Here are the core building blocks:
```mermaid
graph TB
subgraph ServeApplication["Serve Application"]
Ingress["Ingress Deployment<br>(receives HTTP traffic)"]
D1["Deployment A<br>(2 replicas)"]
D2["Deployment B<br>(4 replicas, GPU)"]
Ingress -->|"calls"| D1
Ingress -->|"calls"| D2
end
Client["Client"] -->|"HTTP"| Ingress
Ingress -->|"response"| Client
```
| Concept | What it is |
|---|---|
| Deployment | Your code (a Python class) wrapped with `@serve.deployment`. Defines how many replicas to run and what resources each needs. |
| Replica | A single running instance of a deployment. Adding replicas = horizontal scaling. |
| Application | One or more deployments wired together. Has exactly one ingress deployment that receives HTTP traffic. |
| Ingress Deployment | The front door of your application — receives HTTP requests and routes or processes them. |
A minimal deployment looks like this:

```python
from ray import serve
from fastapi import FastAPI

http_app = FastAPI()

@serve.deployment(
    num_replicas=3,
    ray_actor_options={"num_cpus": 2, "num_gpus": 1},
)
@serve.ingress(http_app)
class MyModel:
    def __init__(self):
        self.model = load_model()  # runs once per replica at startup

    @http_app.post("/predict")
    async def predict(self, payload: dict) -> dict:
        return {"result": self.model.run(payload["input"])}

app = MyModel.bind()
```

Key things to know:
- `__init__` runs once per replica when it starts up. Put expensive work (model loading, DB connections) here, not in the request handler.
- `num_replicas` sets a fixed count. Use `autoscaling_config` instead for dynamic scaling (covered below).
- `ray_actor_options` declares resource needs. Ray uses these to place replicas on nodes that have the capacity.
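For local development you can run the same app without any Kubernetes involved. A minimal sketch, assuming the `MyModel` example above is in scope:

```python
from ray import serve

# Start Serve (if it isn't running) and deploy the bound app under a name.
serve.run(app, name="my_app")

# The FastAPI route is now available at http://127.0.0.1:8000/predict
```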
When traffic arrives at your application:
```mermaid
sequenceDiagram
participant Client
participant Proxy as "HTTP Proxy<br>(on every node)"
participant Replica as "Your Replica"
Client->>Proxy: HTTP request
Proxy->>Proxy: route to correct deployment
Proxy->>Replica: forward request<br>(picks least-loaded replica)
Replica-->>Proxy: response
Proxy-->>Client: HTTP response
```
Ray Serve runs an HTTP proxy on each node in the cluster, so traffic can be served from any worker pod — a Kubernetes Service or load balancer in front of them distributes requests across nodes.
If all replicas are busy, requests wait in a queue. If the queue fills up
(`max_queued_requests`), Serve returns HTTP 503. You can tune per-replica concurrency
with `max_ongoing_requests` (default: 5).
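Both limits are configured per deployment. A sketch with illustrative numbers (not tuned recommendations):

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=16,   # concurrent requests each replica will accept
    max_queued_requests=100,   # past this, new requests are rejected with HTTP 503
)
class ThrottledModel:
    async def __call__(self, payload: dict) -> dict:
        ...
```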
This is the most important thing to understand when running Ray Serve on KubeRay. Scaling happens at two independent levels:
```mermaid
graph TB
subgraph AppLayer["Layer 1 · Serve Replica Autoscaling"]
SA["Serve Controller<br>watches: queue depth per replica<br>acts: add or remove replicas"]
end
subgraph ClusterLayer["Layer 2 · Ray Cluster Autoscaling"]
RA["Ray Autoscaler<br>watches: unschedulable replica requests<br>acts: add or remove worker pods"]
end
Traffic["Increasing Traffic"] --> SA
SA -->|"needs more replicas,<br>but no spare capacity"| RA
RA -->|"new worker pod joins cluster"| SA
SA -->|"replica scheduled<br>on new node"| Done["Replica serving traffic"]
style AppLayer fill:#dbeafe,stroke:#3b82f6
style ClusterLayer fill:#dcfce7,stroke:#22c55e
```
The Serve autoscaler watches how many requests are queued per replica. When the average
exceeds `target_ongoing_requests`, it adds replicas. When traffic drops and replicas are
idle, it removes them.
```yaml
# In your serveConfigV2 / deployment config
autoscaling_config:
  min_replicas: 1              # never go below this
  max_replicas: 20             # never exceed this
  target_ongoing_requests: 5   # scale up when avg queue depth > this
  upscale_delay_s: 30          # wait before adding replicas (avoids flapping)
  downscale_delay_s: 600       # wait before removing replicas (avoids churn)
```

When Serve tries to schedule a new replica but the cluster lacks resources, the Ray Autoscaler requests a new worker pod from KubeRay. When a worker pod becomes idle for long enough, it's removed.
```yaml
# In your RayCluster workerGroupSpecs
- groupName: gpu-workers
  minReplicas: 1           # always keep at least 1 GPU pod warm
  maxReplicas: 8           # hard cap on pods in this group

# At the top level of the RayCluster spec (not per worker group):
autoscalerOptions:
  idleTimeoutSeconds: 60   # remove idle worker pods after this long
```

|  | Serve Replica Autoscaler | Cluster Autoscaler |
|---|---|---|
| Reacts to | Request queue depth | Unschedulable resource requests |
| Speed | Seconds | Minutes (pod startup time) |
| Acts on | Replica count | Worker pod count |
Because node provisioning is slow (especially for GPU nodes), keep `minReplicas` > 0 in your
worker groups for latency-sensitive deployments so replicas can always be scheduled
immediately, without waiting for a new pod to start.
Real applications often involve multiple steps — preprocessing, embedding, inference, postprocessing. Ray Serve lets you split these into separate deployments that each scale independently.
```mermaid
graph LR
Client -->|"HTTP POST /search"| Pipeline
subgraph Application
Pipeline["Pipeline<br>(ingress)<br>3 replicas"]
Embedder["Embedder<br>(CPU)<br>2–8 replicas"]
Ranker["Ranker<br>(GPU)<br>1–2 replicas"]
Pipeline -->|"embed request"| Embedder
Pipeline -->|"rank request"| Ranker
end
Pipeline -->|"response"| Client
```
You wire deployments together using `bind()` and call between them using a
`DeploymentHandle`:
```python
from ray import serve
from ray.serve.handle import DeploymentHandle

# http_app is the FastAPI instance defined in the earlier example

@serve.deployment(autoscaling_config={"min_replicas": 2, "max_replicas": 8})
class Embedder:
    def __init__(self):
        # hypothetical loader; swap in your own embedding model
        self.model = load_embedding_model()

    async def embed(self, text: str) -> list[float]:
        return self.model.encode(text)

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class Ranker:
    def __init__(self):
        # hypothetical loader; swap in your own ranking model
        self.model = load_ranking_model()

    async def rank(self, query_emb, doc_embs) -> list[int]:
        return self.model.rank(query_emb, doc_embs)

@serve.deployment(num_replicas=3)
@serve.ingress(http_app)
class Pipeline:
    def __init__(self, embedder: DeploymentHandle, ranker: DeploymentHandle):
        self.embedder = embedder
        self.ranker = ranker

    @http_app.post("/search")
    async def search(self, query: str, docs: list[str]):
        query_emb = await self.embedder.embed.remote(query)
        # ... ranking logic
        return results

# Bind the DAG — Serve wires dependencies automatically
app = Pipeline.bind(Embedder.bind(), Ranker.bind())
```

Each deployment in the pipeline has its own replica count and autoscaling config, so your GPU-heavy ranker and your CPU-based embedder can scale independently based on their own load.
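Handle calls return immediately with awaitable responses, so a stage can also fan work out across replicas instead of awaiting each call serially. A sketch of what that could look like inside `Pipeline.search`, reusing the names from the example above:

```python
import asyncio

# One embed call per document; each .remote() may land on a different
# Embedder replica, so the embeds run in parallel across the deployment.
responses = [self.embedder.embed.remote(doc) for doc in docs]
doc_embs = await asyncio.gather(*responses)

# Then hand everything to the GPU-backed Ranker in a single call.
ranked = await self.ranker.rank.remote(query_emb, doc_embs)
```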
RayService is the Kubernetes resource that ties everything together in production.
It manages both the Ray cluster and the Serve application, and handles upgrades safely.
```mermaid
graph TB
subgraph RayService
ClusterConfig["rayClusterConfig<br>(node types, resources,<br>autoscaler settings)"]
ServeConfig["serveConfigV2<br>(deployments, replicas,<br>autoscaling, runtime env)"]
end
ClusterConfig -->|"KubeRay creates"| Cluster["Ray Cluster<br>(pods)"]
ServeConfig -->|"Serve deploys"| App["Running Application<br>(replicas)"]
```
Which upgrade path you get depends on which half of the manifest you change:

```mermaid
flowchart LR
Change["You update<br>the manifest"] --> Q{{"What changed?"}}
Q -->|"Only serveConfigV2<br>(replicas, code, autoscaling)"| A["In-place update<br>No restart, no downtime"]
Q -->|"rayClusterConfig<br>(image, resources, Ray version)"| B["Blue/green upgrade<br>New cluster spun up,<br>traffic switched over,<br>old cluster torn down"]
For day-to-day application changes (new model, tuned autoscaling config), only
`serveConfigV2` changes — Serve applies these in-place with no cluster restart and
no traffic interruption.
For infrastructure changes (Ray version bump, pod resource changes), KubeRay spins up a whole new cluster, confirms the app is healthy there, then shifts traffic over. This requires enough Kubernetes capacity to run both clusters simultaneously.
```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: my-app
spec:
  serveConfigV2: |
    proxy_location: EveryNode
    applications:
      - name: my_app
        import_path: mymodule:app
        runtime_env:
          pip: ["torch==2.3.0"]
        deployments:
          - name: MyModel
            autoscaling_config:
              min_replicas: 1
              max_replicas: 10
              target_ongoing_requests: 5
            ray_actor_options:
              num_gpus: 1
  rayClusterConfig:
    rayVersion: "2.40.0"
    autoscalerOptions:
      idleTimeoutSeconds: 60
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.40.0
              resources:
                requests: { cpu: "4", memory: "8Gi" }
    workerGroupSpecs:
      - groupName: gpu-workers
        minReplicas: 1
        maxReplicas: 4
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.40.0-gpu
                resources:
                  requests: { cpu: "8", memory: "24Gi", nvidia.com/gpu: "1" }
                  limits: { nvidia.com/gpu: "1" }  # extended resources like GPUs must be set in limits
```

When you configure a deployment, you declare what resources each replica needs. Ray uses these to decide which worker pod to place the replica on.
```mermaid
graph TB
subgraph WorkerPod["Worker Pod (8 CPU, 1 GPU, 24GB RAM)"]
R1["Replica: Embedder<br>2 CPU · 4GB RAM"]
R2["Replica: Embedder<br>2 CPU · 4GB RAM"]
R3["Replica: Embedder<br>2 CPU · 4GB RAM"]
R4["Replica: Ranker<br>1 GPU · 2 CPU · 8GB RAM"]
end
```
```python
@serve.deployment(
    ray_actor_options={
        "num_cpus": 2,           # CPU cores to reserve per replica
        "num_gpus": 1,           # GPUs to reserve (can be fractional: 0.5)
        "memory": 4 * 1024**3,   # RAM in bytes (optional hint)
    }
)
class MyDeployment:
    ...
```

If your model is small enough to share a GPU, you can pack multiple replicas onto one:
```python
@serve.deployment(
    num_replicas=4,
    ray_actor_options={"num_gpus": 0.25},  # 4 replicas share 1 GPU
)
class LightModel:
    ...
```

| Situation | Recommendation |
|---|---|
| Latency-sensitive | Keep `min_replicas` ≥ 1 and pre-warm worker groups (`minReplicas` > 0) to avoid cold-start delays |
| GPU workloads | Use a dedicated worker group with GPU pods; set `minReplicas` ≥ 1 so a GPU node is always available |
| Bursty traffic | Set a generous `max_replicas` and tune `upscale_delay_s` down; accept some over-provisioning |
| Cost-sensitive | Set `min_replicas: 0` on the Serve deployment and `minReplicas: 0` on the worker group; accept cold-start latency (see the sketch below) |
| Shared GPU models | Use fractional `num_gpus` (e.g. 0.25) to pack multiple replicas per GPU node |