This document covers how to build, deploy, and test the deepseek-ai/DeepSeek-V4-Pro model (1.6T total params, 49B active, FP4+FP8 mixed precision) using nm-vllm-ent (based on upstream vLLM v0.20.1rc0) on NVIDIA B200 GPUs.
| Field | Value |
|---|---|
| Model | deepseek-ai/DeepSeek-V4-Pro |
| Container Image | quay.io/vllm/automation-vllm:cuda-25070156955 |
| nm-vllm-ent Branch | doug/deepseek-v4-0day-v5 (upstream v0.20.1rc0 merged into doug/v0.20.0-release-validation) |
| nm-cicd Branch | doug/deepseek-v4-0day-v5 |
| GH Actions Run ID | 25070156955 |
| Target Device | CUDA |
| Python Version | 3.12 |
| CUDA Version | 13.0 |
| Jira | INFERENG-6294 |
gh workflow run build-whl-image.yml \
--repo neuralmagic/nm-cicd \
--ref doug/deepseek-v4-0day-v5 \
-f repo=neuralmagic/nm-vllm-ent \
-f branch=doug/deepseek-v4-0day-v5 \
-f build_label=k8s-a100-build-13-0 \
-f build_timeout=120 \
-f image_label=ibm-wdc-k8s-h100-dind \
-f python=3.12 \
-f release_image=false \
-f target_device=cuda- 8x NVIDIA B200 GPUs (tensor parallelism of 8)
- Model files downloaded locally
- Docker with NVIDIA runtime
- HuggingFace token with access to
deepseek-ai/DeepSeek-V4-Pro
chg reserve --gpu-ids 0,1,2,3,4,5,6,7 -d 4hHF_HOME=/home/$USER/.cache/huggingface \
HF_TOKEN=$HF_TOKEN \
uv tool run --from huggingface_hub hf download deepseek-ai/DeepSeek-V4-Prodocker pull quay.io/vllm/automation-vllm:cuda-25070156955docker run -d \
--name vllm-dsv4-pro \
--runtime=nvidia \
--user root \
-e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
--security-opt=label=disable \
--shm-size=10g \
--ipc=host \
-p 8000:8000 \
-v /raid/engine/hub_cache:/hf:Z \
-e HF_HUB_OFFLINE=1 \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e HF_HOME=/hf \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
quay.io/vllm/automation-vllm:cuda-25070156955 \
--model deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--enforce-eager \
--max-model-len 4096 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--host 0.0.0.0 --port 8000docker run -d \
--name vllm-dsv4-pro \
--runtime=nvidia \
--user root \
-e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
--security-opt=label=disable \
--shm-size=10g \
--ipc=host \
-p 8000:8000 \
-v /raid/engine/hub_cache:/hf:Z \
-e HF_HUB_OFFLINE=1 \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e HF_HOME=/hf \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
quay.io/vllm/automation-vllm:cuda-25070156955 \
--model deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--enforce-eager \
--max-model-len 4096 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--reasoning-parser deepseek_v4 \
--enable-auto-tool-choice \
--host 0.0.0.0 --port 8000docker logs -f vllm-dsv4-proWait for Application startup complete. — takes about 3.5 minutes on 8x B200 (model load ~69s, total init ~222s including profiling, KV cache setup, and warmup).
curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8000/health
# Expected: 200curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [{"role": "user", "content": "What is 17 times 19?"}],
"max_tokens": 64
}'curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [{"role": "user", "content": "A ski lift carries 4 people per chair, with chairs departing every 15 seconds. How many people can it carry up the mountain in 1 hour?"}],
"max_tokens": 1024,
"chat_template_kwargs": {"thinking": true, "reasoning_effort": "high"}
}'curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [{"role": "user", "content": "What is the weather like in Burlington, Vermont?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City and state"}
},
"required": ["location"]
}
}
}],
"max_tokens": 128
}'curl -s http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"prompt": "Hello, world! I am DeepSeek V4 and I",
"max_tokens": 64
}'docker stop vllm-dsv4-pro && docker rm vllm-dsv4-pro
chg release --gpu-ids 0,1,2,3,4,5,6,7| Metric | Value |
|---|---|
| Model load | 134.62 GiB, 68.9s |
| Total init time | 222.26s |
| KV cache available | 26.37 GiB |
| KV cache capacity | 15,284 tokens |
| MoE backend | FLASHINFER_TRTLLM_MXFP4_MXFP8 |
| TileLang kernels | mhc_pre_big_fuse_tilelang, mhc_post_tilelang |
| Basic chat latency | ~1-2s for short responses |
| Reasoning latency | ~160s for complex math (302 tokens) |
--user rootis currently required. FlashInfer writes compiled CUDA kernels (cubins) at runtime to/opt/vllm/lib64/python3.12/site-packages/flashinfer_cubin/cubins/. The directory is owned by root with 755 permissions, but the container's defaultvllmuser (uid=2000) can't write to it.FLASHINFER_CACHE_DIRdoes not redirect cubin writes. A proper fix is to chmod/chown the cubins directory in the Dockerfile.--enforce-eageris recommended for initial testing. CUDA graph capture hasn't been validated with V4's mixed FP4+FP8 expert architecture yet. Start eager, then try without once basic serving is confirmed.- Start with
--max-model-len 4096for smoke testing. V4 supports up to 1M context, but scale up gradually. With 8x B200 you have 15,284 tokens of KV capacity at fp8, enough for ~12x concurrent 4096-length requests. - Tool calling and reasoning require server-side flags. You must start the server with
--enable-auto-tool-choice --tool-call-parser deepseek_v4 --reasoning-parser deepseek_v4— these can't be set per-request. --ipc=hostis important. Without it, shared memory for NCCL across 8 GPUs may be insufficient.- Use
127.0.0.1, notlocalhostfor curl. Some hosts try IPv6 first and fail. - Don't use
--rmon the container if you need to debug crashes — you lose the logs. - HF_HUB_OFFLINE=1 prevents the container from trying to reach HuggingFace Hub at startup. The model must be fully downloaded before starting.
- Build required two patches beyond upstream v0.20.1rc0: (1)
python3.12-testpackage for_xxsubinterpretersmodule needed by tilelang, and (2)deep_gemmupgrade from 2.1.1 to 2.4.2 for V4's HC layer (tf32_hc_prenorm_gemm).
- vLLM DeepSeek V4 blog
- DeepSeek-V4-Pro recipe
- DeepSeek-V4-Flash recipe
- HuggingFace collection
- Upstream PR #40760 — DeepSeek V4 implementation