DeepSeek V4 Pro: Build, Run, and Smoke Test Guide (nm-vllm-ent v0.20.1rc0, 8x B200)

Overview

This document covers how to build, deploy, and test the deepseek-ai/DeepSeek-V4-Pro model (1.6T total params, 49B active, FP4+FP8 mixed precision) using nm-vllm-ent (based on upstream vLLM v0.20.1rc0) on NVIDIA B200 GPUs.

Build Information

Field               Value
Model               deepseek-ai/DeepSeek-V4-Pro
Container Image     quay.io/vllm/automation-vllm:cuda-25070156955
nm-vllm-ent Branch  doug/deepseek-v4-0day-v5 (upstream v0.20.1rc0 merged into doug/v0.20.0-release-validation)
nm-cicd Branch      doug/deepseek-v4-0day-v5
GH Actions Run ID   25070156955
Target Device       CUDA
Python Version      3.12
CUDA Version        13.0
Jira                INFERENG-6294

Build Commands

Full build (wheel + image)

gh workflow run build-whl-image.yml \
  --repo neuralmagic/nm-cicd \
  --ref doug/deepseek-v4-0day-v5 \
  -f repo=neuralmagic/nm-vllm-ent \
  -f branch=doug/deepseek-v4-0day-v5 \
  -f build_label=k8s-a100-build-13-0 \
  -f build_timeout=120 \
  -f image_label=ibm-wdc-k8s-h100-dind \
  -f python=3.12 \
  -f release_image=false \
  -f target_device=cuda
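
To follow the build, the stock gh CLI can list and tail runs (a sketch; the run ID below is the one recorded in Build Information, so substitute your own run's ID):

# List recent runs of the build workflow, then tail one until it finishes
gh run list --repo neuralmagic/nm-cicd --workflow build-whl-image.yml --limit 5
gh run watch 25070156955 --repo neuralmagic/nm-cicd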

Prerequisites

  • 8x NVIDIA B200 GPUs (tensor parallelism of 8)
  • Model files downloaded locally
  • Docker with NVIDIA runtime
  • HuggingFace token with access to deepseek-ai/DeepSeek-V4-Pro
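
Before reserving, it's worth confirming that all eight GPUs are visible. A quick check with plain nvidia-smi:

# Should print 8 rows, each an NVIDIA B200
nvidia-smi --query-gpu=index,name,memory.total --format=csv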

Deployment Steps

1. Reserve GPUs

chg reserve --gpu-ids 0,1,2,3,4,5,6,7 -d 4h

2. Download the model

HF_HOME=/home/$USER/.cache/huggingface \
HF_TOKEN=$HF_TOKEN \
uv tool run --from huggingface_hub hf download deepseek-ai/DeepSeek-V4-Pro
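
To verify the download finished, list the snapshot directory in the cache (this assumes the standard huggingface_hub cache layout under $HF_HOME/hub):

# All .safetensors shards plus config.json and tokenizer files should be present
ls /home/$USER/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V4-Pro/snapshots/*/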

3. Pull the image

docker pull quay.io/vllm/automation-vllm:cuda-25070156955

4. Start the server

docker run -d \
  --name vllm-dsv4-pro \
  --runtime=nvidia \
  --user root \
  -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  --security-opt=label=disable \
  --shm-size=10g \
  --ipc=host \
  -p 8000:8000 \
  -v /raid/engine/hub_cache:/hf:Z \
  -e HF_HUB_OFFLINE=1 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e HF_HOME=/hf \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  quay.io/vllm/automation-vllm:cuda-25070156955 \
    --model deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --enforce-eager \
    --max-model-len 4096 \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8 \
    --host 0.0.0.0 --port 8000

With tool calling and reasoning enabled

docker run -d \
  --name vllm-dsv4-pro \
  --runtime=nvidia \
  --user root \
  -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  --security-opt=label=disable \
  --shm-size=10g \
  --ipc=host \
  -p 8000:8000 \
  -v /raid/engine/hub_cache:/hf:Z \
  -e HF_HUB_OFFLINE=1 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e HF_HOME=/hf \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  quay.io/vllm/automation-vllm:cuda-25070156955 \
    --model deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --enforce-eager \
    --max-model-len 4096 \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8 \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 \
    --reasoning-parser deepseek_v4 \
    --enable-auto-tool-choice \
    --host 0.0.0.0 --port 8000

5. Watch logs for startup

docker logs -f vllm-dsv4-pro

Wait for the log line "Application startup complete." Startup takes about 3.5 minutes on 8x B200 (model load ~69s, total init ~222s including profiling, KV cache setup, and warmup).
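
Instead of watching the logs by hand, a small polling loop against the health endpoint also works (a sketch; adjust the timeout for your hardware):

# Poll /health every 10s for up to 10 minutes; exits as soon as the server is ready
for i in $(seq 1 60); do
  curl -sf http://127.0.0.1:8000/health >/dev/null && echo "server ready" && break
  sleep 10
done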

Smoke Tests

Health check

curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8000/health
# Expected: 200

Chat completion

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What is 17 times 19?"}],
    "max_tokens": 64
  }'
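
If jq is installed, the same request can be piped through it to isolate the model's answer:

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V4-Pro", "messages": [{"role": "user", "content": "What is 17 times 19?"}], "max_tokens": 64}' \
  | jq -r '.choices[0].message.content'
# The answer should contain 323 (17 * 19 = 323)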

Reasoning (requires reasoning parser)

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "A ski lift carries 4 people per chair, with chairs departing every 15 seconds. How many people can it carry up the mountain in 1 hour?"}],
    "max_tokens": 1024,
    "chat_template_kwargs": {"thinking": true, "reasoning_effort": "high"}
  }'
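
With the deepseek_v4 reasoning parser active, vLLM returns the chain of thought in message.reasoning_content, separate from the final message.content. Appending a jq filter to the request above shows both fields:

  | jq '{reasoning: .choices[0].message.reasoning_content, answer: .choices[0].message.content}'
# Sanity check on the answer: 3600s / 15s = 240 chairs per hour, * 4 people = 960 people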

Tool calling (requires tool-call-parser)

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What is the weather like in Burlington, Vermont?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City and state"}
          },
          "required": ["location"]
        }
      }
    }],
    "max_tokens": 128
  }'
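
When the parser recognizes a tool call, it lands in message.tool_calls rather than message.content. Appending jq to the request above shows the parsed call:

  | jq '.choices[0].message.tool_calls'
# Expect a get_weather call whose arguments include "Burlington, Vermont"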

Text completion

curl -s http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "prompt": "Hello, world! I am DeepSeek V4 and I",
    "max_tokens": 64
  }'
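
As with chat, a jq filter strips the response down to the generated text:

  | jq -r '.choices[0].text'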

Cleanup

docker stop vllm-dsv4-pro && docker rm vllm-dsv4-pro
chg release --gpu-ids 0,1,2,3,4,5,6,7

Performance (observed on 8x B200)

Metric              Value
Model load          134.62 GiB, 68.9s
Total init time     222.26s
KV cache available  26.37 GiB
KV cache capacity   15,284 tokens
MoE backend         FLASHINFER_TRTLLM_MXFP4_MXFP8
TileLang kernels    mhc_pre_big_fuse_tilelang, mhc_post_tilelang
Basic chat latency  ~1-2s for short responses
Reasoning latency   ~160s for complex math (302 tokens)
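
The chat latency figure can be reproduced with curl's built-in timing, no extra tooling assumed:

curl -s -o /dev/null -w 'time_total: %{time_total}s\n' \
  http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V4-Pro", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 16}'
# Expect roughly 1-2s for a short response like this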

Notes and Gotchas

  • --user root is currently required. FlashInfer writes compiled CUDA kernels (cubins) at runtime to /opt/vllm/lib64/python3.12/site-packages/flashinfer_cubin/cubins/. The directory is owned by root with 755 permissions, but the container's default vllm user (uid=2000) can't write to it. FLASHINFER_CACHE_DIR does not redirect cubin writes. A proper fix is to chmod/chown the cubins directory in the Dockerfile (see the sketch after this list).
  • --enforce-eager is recommended for initial testing. CUDA graph capture hasn't been validated with V4's mixed FP4+FP8 expert architecture yet. Start eager, then try without once basic serving is confirmed.
  • Start with --max-model-len 4096 for smoke testing. V4 supports up to 1M context, but scale up gradually. With 8x B200 you have 15,284 tokens of KV capacity at fp8, which fits only ~3 concurrent full-length 4096-token requests.
  • Tool calling and reasoning require server-side flags. You must start the server with --enable-auto-tool-choice --tool-call-parser deepseek_v4 --reasoning-parser deepseek_v4 — these can't be set per-request.
  • --ipc=host is important. Without it, shared memory for NCCL across 8 GPUs may be insufficient.
  • Use 127.0.0.1, not localhost, for curl. Some hosts resolve localhost to IPv6 first and the connection fails.
  • Don't use --rm on the container if you need to debug crashes — you lose the logs.
  • HF_HUB_OFFLINE=1 prevents the container from trying to reach HuggingFace Hub at startup. The model must be fully downloaded before starting.
  • Build required two patches beyond upstream v0.20.1rc0: (1) python3.12-test package for _xxsubinterpreters module needed by tilelang, and (2) deep_gemm upgrade from 2.1.1 to 2.4.2 for V4's HC layer (tf32_hc_prenorm_gemm).
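
For the FlashInfer cubins issue in the first bullet, a minimal Dockerfile sketch of the suggested chmod/chown fix (the site-packages path is the one observed at runtime; verify it against your image before relying on this):

FROM quay.io/vllm/automation-vllm:cuda-25070156955
# Hand the cubins directory to the default vllm user (uid 2000) so runtime
# kernel compilation can write there without --user root
RUN chown -R 2000:2000 /opt/vllm/lib64/python3.12/site-packages/flashinfer_cubin/cubins
USER 2000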
