A guide to understanding the core concepts behind Ray Serve — from clusters to deployments to autoscaling — so you can build and configure production applications with confidence.
- The Ray Cluster
- KubeRay: Running Ray on Kubernetes
- Ray Serve Concepts
- How Requests Flow
- Scaling: Two Layers
- Composing Multi-Stage Pipelines
- Deploying with RayService
- Resource Allocation
A Ray cluster is a pool of machines that Ray uses to schedule and run distributed workloads. It has one head node (the control plane) and any number of worker nodes (compute).
```mermaid
graph TB
subgraph RayCluster["Ray Cluster"]
Head["Head Node<br>────────────<br>• Cluster control plane<br>• Autoscaler<br>• Dashboard (:8265)"]
subgraph CPUWorkers["CPU Worker Group"]
W1["Worker Node"]
W2["Worker Node"]
end
subgraph GPUWorkers["GPU Worker Group"]
G1["GPU Worker Node"]
end
end
Head --- CPUWorkers
Head --- GPUWorkers
```
Worker nodes are organized into worker groups — sets of identically-shaped machines (same CPU, memory, GPU configuration). A cluster can have multiple worker groups with different resource profiles, which is useful when some deployments need GPUs and others don't.
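A quick way to see these resource profiles on a live cluster is to query Ray directly. A minimal sketch (it assumes you run it somewhere that can reach the cluster, such as the head pod; the output values are illustrative):

```python
import ray

# Attach to the running cluster (from the head node, a job, or a driver pod).
ray.init(address="auto")

# Total resources aggregated across the head node and every worker group.
print(ray.cluster_resources())   # e.g. {'CPU': 24.0, 'GPU': 1.0, ...}

# Per-node breakdown, handy for checking each worker group's shape.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"])
```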
KubeRay is a Kubernetes operator that manages Ray clusters for you. Instead of provisioning machines manually, you describe what you want in a Kubernetes manifest and KubeRay handles the rest.
Each Ray node runs as a Kubernetes Pod. KubeRay creates and deletes these pods to match the desired cluster state.
```mermaid
graph LR
subgraph YourManifest["Your Kubernetes Manifest"]
RS["RayService<br>(cluster config +<br>Serve app config)"]
end
subgraph KubernetesCluster["Kubernetes Cluster"]
KO["KubeRay Operator"]
HP["Head Pod"]
WP1["Worker Pod"]
WP2["Worker Pod"]
GP["GPU Worker Pod"]
SVC["Service :8000<br>(traffic to Serve)"]
end
RS -->|"watched by"| KO
KO -->|"creates & manages"| HP
KO -->|"creates & manages"| WP1
KO -->|"creates & manages"| WP2
KO -->|"creates & manages"| GP
HP --- SVC
```
The main resource you'll work with is the RayService CRD, which bundles together
your cluster definition and your Ray Serve application config into one manifest.
Ray Serve runs on top of a Ray cluster and provides the HTTP serving layer. Here are the core building blocks:
```mermaid
graph TB
subgraph ServeApplication["Serve Application"]
Ingress["Ingress Deployment<br>(receives HTTP traffic)"]
D1["Deployment A<br>(2 replicas)"]
D2["Deployment B<br>(4 replicas, GPU)"]
Ingress -->|"calls"| D1
Ingress -->|"calls"| D2
end
Client["Client"] -->|"HTTP"| Ingress
Ingress -->|"response"| Client
```
| Concept | What it is |
|---|---|
| Deployment | Your code (a Python class) wrapped with `@serve.deployment`. Defines how many replicas to run and what resources each needs. |
| Replica | A single running instance of a deployment. Adding replicas = horizontal scaling. |
| Application | One or more deployments wired together. Has exactly one ingress deployment that receives HTTP traffic. |
| Ingress Deployment | The front door of your application — receives HTTP requests and routes or processes them. |
A minimal deployment looks like this:

```python
from ray import serve
from fastapi import FastAPI

http_app = FastAPI()

@serve.deployment(
    num_replicas=3,
    ray_actor_options={"num_cpus": 2, "num_gpus": 1},
)
@serve.ingress(http_app)
class MyModel:
    def __init__(self):
        self.model = load_model()  # runs once per replica at startup

    @http_app.post("/predict")
    async def predict(self, payload: dict) -> dict:
        return {"result": self.model.run(payload["input"])}

app = MyModel.bind()
```

Key things to know:
- `__init__` runs once per replica when it starts up. Put expensive work (model loading, DB connections) here, not in the request handler.
- `num_replicas` sets a fixed count. Use `autoscaling_config` instead for dynamic scaling (covered below).
- `ray_actor_options` declares resource needs. Ray uses these to place replicas on nodes that have the capacity.
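For local development you can run the same app without any Kubernetes involved. A minimal sketch, assuming the `MyModel` example above is in scope:

```python
from ray import serve

# Start Serve (if it isn't running) and deploy the bound app under a name.
serve.run(app, name="my_app")

# The FastAPI route is now available at http://127.0.0.1:8000/predict
```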
When traffic arrives at your application:
```mermaid
sequenceDiagram
participant Client
participant Proxy as "HTTP Proxy<br>(on every node)"
participant Replica as "Your Replica"
Client->>Proxy: HTTP request
Proxy->>Proxy: route to correct deployment
Proxy->>Replica: forward request<br>(picks least-loaded replica)
Replica-->>Proxy: response
Proxy-->>Client: HTTP response
```
Ray Serve runs an HTTP proxy on each node in the cluster, so traffic can be served from any worker pod — a Kubernetes Service or load balancer in front of them distributes requests across nodes.
If all replicas are busy, requests wait in a queue. If the queue fills up
(`max_queued_requests`), Serve returns HTTP 503. You can tune per-replica concurrency
with `max_ongoing_requests` (default: 5).
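Both limits are configured per deployment. A sketch with illustrative numbers (not tuned recommendations):

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=16,   # concurrent requests each replica will accept
    max_queued_requests=100,   # past this, new requests are rejected with HTTP 503
)
class ThrottledModel:
    async def __call__(self, payload: dict) -> dict:
        ...
```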
This is the most important thing to understand when running Ray Serve on KubeRay. Scaling happens at two independent levels:
```mermaid
graph TB
subgraph AppLayer["Layer 1 · Serve Replica Autoscaling"]
SA["Serve Controller<br>watches: queue depth per replica<br>acts: add or remove replicas"]
end
subgraph ClusterLayer["Layer 2 · Ray Cluster Autoscaling"]
RA["Ray Autoscaler<br>watches: unschedulable replica requests<br>acts: add or remove worker pods"]
end
Traffic["Increasing Traffic"] --> SA
SA -->|"needs more replicas,<br>but no spare capacity"| RA
RA -->|"new worker pod joins cluster"| SA
SA -->|"replica scheduled<br>on new node"| Done["Replica serving traffic"]
style AppLayer fill:#dbeafe,stroke:#3b82f6
style ClusterLayer fill:#dcfce7,stroke:#22c55e
```
The Serve autoscaler watches how many requests are queued per replica. When the average
exceeds `target_ongoing_requests`, it adds replicas. When traffic drops and replicas are
idle, it removes them.
```yaml
# In your serveConfigV2 / deployment config
autoscaling_config:
  min_replicas: 1              # never go below this
  max_replicas: 20             # never exceed this
  target_ongoing_requests: 5   # scale up when avg queue depth > this
  upscale_delay_s: 30          # wait before adding replicas (avoids flapping)
  downscale_delay_s: 600       # wait before removing replicas (avoids churn)
```

When Serve tries to schedule a new replica but the cluster lacks resources, the Ray Autoscaler requests a new worker pod from KubeRay. When a worker pod becomes idle for long enough, it's removed.
```yaml
# In your RayCluster workerGroupSpecs
- groupName: gpu-workers
  minReplicas: 1           # always keep at least 1 GPU pod warm
  maxReplicas: 8           # hard cap on pods in this group

# At the top level of the RayCluster spec (not per worker group):
autoscalerOptions:
  idleTimeoutSeconds: 60   # remove idle worker pods after this long
```

|  | Serve Replica Autoscaler | Cluster Autoscaler |
|---|---|---|
| Reacts to | Request queue depth | Unschedulable resource requests |
| Speed | Seconds | Minutes (pod startup time) |
| Acts on | Replica count | Worker pod count |
Because node provisioning is slow (especially for GPU nodes), keep `minReplicas` > 0 in your
worker groups for latency-sensitive deployments so replicas can always be scheduled
immediately, without waiting for a new pod to start.
Real applications often involve multiple steps — preprocessing, embedding, inference, postprocessing. Ray Serve lets you split these into separate deployments that each scale independently.
```mermaid
graph LR
Client -->|"HTTP POST /search"| Pipeline
subgraph Application
Pipeline["Pipeline<br>(ingress)<br>3 replicas"]
Embedder["Embedder<br>(CPU)<br>2–8 replicas"]
Ranker["Ranker<br>(GPU)<br>1–2 replicas"]
Pipeline -->|"embed request"| Embedder
Pipeline -->|"rank request"| Ranker
end
Pipeline -->|"response"| Client
```
You wire deployments together using `bind()` and call between them using a
`DeploymentHandle`:
```python
from ray import serve
from ray.serve.handle import DeploymentHandle

# http_app is the FastAPI instance defined in the earlier example

@serve.deployment(autoscaling_config={"min_replicas": 2, "max_replicas": 8})
class Embedder:
    def __init__(self):
        # hypothetical loader; swap in your own embedding model
        self.model = load_embedding_model()

    async def embed(self, text: str) -> list[float]:
        return self.model.encode(text)

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class Ranker:
    def __init__(self):
        # hypothetical loader; swap in your own ranking model
        self.model = load_ranking_model()

    async def rank(self, query_emb, doc_embs) -> list[int]:
        return self.model.rank(query_emb, doc_embs)

@serve.deployment(num_replicas=3)
@serve.ingress(http_app)
class Pipeline:
    def __init__(self, embedder: DeploymentHandle, ranker: DeploymentHandle):
        self.embedder = embedder
        self.ranker = ranker

    @http_app.post("/search")
    async def search(self, query: str, docs: list[str]):
        query_emb = await self.embedder.embed.remote(query)
        # ... ranking logic
        return results

# Bind the DAG — Serve wires dependencies automatically
app = Pipeline.bind(Embedder.bind(), Ranker.bind())
```

Each deployment in the pipeline has its own replica count and autoscaling config, so your GPU-heavy ranker and your CPU-based embedder can scale independently based on their own load.
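Handle calls return immediately with awaitable responses, so a stage can also fan work out across replicas instead of awaiting each call serially. A sketch of what that could look like inside `Pipeline.search`, reusing the names from the example above:

```python
import asyncio

# One embed call per document; each .remote() may land on a different
# Embedder replica, so the embeds run in parallel across the deployment.
responses = [self.embedder.embed.remote(doc) for doc in docs]
doc_embs = await asyncio.gather(*responses)

# Then hand everything to the GPU-backed Ranker in a single call.
ranked = await self.ranker.rank.remote(query_emb, doc_embs)
```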
RayService is the Kubernetes resource that ties everything together in production.
It manages both the Ray cluster and the Serve application, and handles upgrades safely.
```mermaid
graph TB
subgraph RayService
ClusterConfig["rayClusterConfig<br>(node types, resources,<br>autoscaler settings)"]
ServeConfig["serveConfigV2<br>(deployments, replicas,<br>autoscaling, runtime env)"]
end
ClusterConfig -->|"KubeRay creates"| Cluster["Ray Cluster<br>(pods)"]
ServeConfig -->|"Serve deploys"| App["Running Application<br>(replicas)"]
```
Which upgrade path you get depends on which half of the manifest you change:

```mermaid
flowchart LR
Change["You update<br>the manifest"] --> Q{{"What changed?"}}
Q -->|"Only serveConfigV2<br>(replicas, code, autoscaling)"| A["In-place update<br>No restart, no downtime"]
Q -->|"rayClusterConfig<br>(image, resources, Ray version)"| B["Blue/green upgrade<br>New cluster spun up,<br>traffic switched over,<br>old cluster torn down"]
For day-to-day application changes (new model, tuned autoscaling config), only
`serveConfigV2` changes — Serve applies these in-place with no cluster restart and
no traffic interruption.
For infrastructure changes (Ray version bump, pod resource changes), KubeRay spins up a whole new cluster, confirms the app is healthy there, then shifts traffic over. This requires enough Kubernetes capacity to run both clusters simultaneously.
```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: my-app
spec:
  serveConfigV2: |
    proxy_location: EveryNode
    applications:
      - name: my_app
        import_path: mymodule:app
        runtime_env:
          pip: ["torch==2.3.0"]
        deployments:
          - name: MyModel
            autoscaling_config:
              min_replicas: 1
              max_replicas: 10
              target_ongoing_requests: 5
            ray_actor_options:
              num_gpus: 1
  rayClusterConfig:
    rayVersion: "2.40.0"
    autoscalerOptions:
      idleTimeoutSeconds: 60
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.40.0
              resources:
                requests: { cpu: "4", memory: "8Gi" }
    workerGroupSpecs:
      - groupName: gpu-workers
        minReplicas: 1
        maxReplicas: 4
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.40.0-gpu
                resources:
                  requests: { cpu: "8", memory: "24Gi", nvidia.com/gpu: "1" }
                  limits: { nvidia.com/gpu: "1" }  # extended resources like GPUs must be set in limits
```

When you configure a deployment, you declare what resources each replica needs. Ray uses these to decide which worker pod to place the replica on.
```mermaid
graph TB
subgraph WorkerPod["Worker Pod (8 CPU, 1 GPU, 24GB RAM)"]
R1["Replica: Embedder<br>2 CPU · 4GB RAM"]
R2["Replica: Embedder<br>2 CPU · 4GB RAM"]
R3["Replica: Embedder<br>2 CPU · 4GB RAM"]
R4["Replica: Ranker<br>1 GPU · 2 CPU · 8GB RAM"]
end
```
```python
@serve.deployment(
    ray_actor_options={
        "num_cpus": 2,           # CPU cores to reserve per replica
        "num_gpus": 1,           # GPUs to reserve (can be fractional: 0.5)
        "memory": 4 * 1024**3,   # RAM in bytes (optional hint)
    }
)
class MyDeployment:
    ...
```

If your model is small enough to share a GPU, you can pack multiple replicas onto one:
```python
@serve.deployment(
    num_replicas=4,
    ray_actor_options={"num_gpus": 0.25},  # 4 replicas share 1 GPU
)
class LightModel:
    ...
```

| Situation | Recommendation |
|---|---|
| Latency-sensitive | Keep `min_replicas` ≥ 1 and pre-warm worker groups (`minReplicas` > 0) to avoid cold-start delays |
| GPU workloads | Use a dedicated worker group with GPU pods; set `minReplicas` ≥ 1 so a GPU node is always available |
| Bursty traffic | Set a generous `max_replicas` and tune `upscale_delay_s` down; accept some over-provisioning |
| Cost-sensitive | Set `min_replicas: 0` on the Serve deployment and `minReplicas: 0` on the worker group; accept cold-start latency (see the sketch below) |
| Shared GPU models | Use fractional `num_gpus` (e.g. 0.25) to pack multiple replicas per GPU node |