DeepSeek V4 Pro: Build, Run, and Smoke Test Guide (nm-vllm-ent v0.20.1rc0, 8x B200)

Overview

This document covers how to build, deploy, and test the deepseek-ai/DeepSeek-V4-Pro model (1.6T total params, 49B active, FP4+FP8 mixed precision) using nm-vllm-ent (based on upstream vLLM v0.20.1rc0) on NVIDIA B200 GPUs.

Build Information

Field               Value
Model               deepseek-ai/DeepSeek-V4-Pro
Container Image     quay.io/vllm/automation-vllm:cuda-25070156955
nm-vllm-ent Branch  doug/deepseek-v4-0day-v5 (upstream v0.20.1rc0 merged into doug/v0.20.0-release-validation)
nm-cicd Branch      doug/deepseek-v4-0day-v5
GH Actions Run ID   25070156955
Target Device       CUDA
Python Version      3.12
CUDA Version        13.0
Jira                INFERENG-6294

Build Commands

Full build (wheel + image)

gh workflow run build-whl-image.yml \
  --repo neuralmagic/nm-cicd \
  --ref doug/deepseek-v4-0day-v5 \
  -f repo=neuralmagic/nm-vllm-ent \
  -f branch=doug/deepseek-v4-0day-v5 \
  -f build_label=k8s-a100-build-13-0 \
  -f build_timeout=120 \
  -f image_label=ibm-wdc-k8s-h100-dind \
  -f python=3.12 \
  -f release_image=false \
  -f target_device=cuda
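
To follow the build, the stock gh CLI can list and tail runs (a sketch; the run ID below is the one recorded in Build Information, so substitute your own run's ID):

# List recent runs of the build workflow, then tail one until it finishes
gh run list --repo neuralmagic/nm-cicd --workflow build-whl-image.yml --limit 5
gh run watch 25070156955 --repo neuralmagic/nm-cicd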

Prerequisites

  • 8x NVIDIA B200 GPUs (tensor parallelism of 8)
  • Model files downloaded locally
  • Docker with NVIDIA runtime
  • HuggingFace token with access to deepseek-ai/DeepSeek-V4-Pro
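
Before reserving, it's worth confirming that all eight GPUs are visible. A quick check with plain nvidia-smi:

# Should print 8 rows, each an NVIDIA B200
nvidia-smi --query-gpu=index,name,memory.total --format=csv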

Deployment Steps

1. Reserve GPUs

chg reserve --gpu-ids 0,1,2,3,4,5,6,7 -d 4h

2. Download the model

HF_HOME=/home/$USER/.cache/huggingface \
HF_TOKEN=$HF_TOKEN \
uv tool run --from huggingface_hub hf download deepseek-ai/DeepSeek-V4-Pro
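
To verify the download finished, list the snapshot directory in the cache (this assumes the standard huggingface_hub cache layout under $HF_HOME/hub):

# All .safetensors shards plus config.json and tokenizer files should be present
ls /home/$USER/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V4-Pro/snapshots/*/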

3. Pull the image

docker pull quay.io/vllm/automation-vllm:cuda-25070156955

4. Start the server

docker run -d \
  --name vllm-dsv4-pro \
  --runtime=nvidia \
  --user root \
  -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  --security-opt=label=disable \
  --shm-size=10g \
  --ipc=host \
  -p 8000:8000 \
  -v /raid/engine/hub_cache:/hf:Z \
  -e HF_HUB_OFFLINE=1 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e HF_HOME=/hf \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  quay.io/vllm/automation-vllm:cuda-25070156955 \
    --model deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --enforce-eager \
    --max-model-len 4096 \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8 \
    --host 0.0.0.0 --port 8000

With tool calling and reasoning enabled

docker run -d \
  --name vllm-dsv4-pro \
  --runtime=nvidia \
  --user root \
  -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  --security-opt=label=disable \
  --shm-size=10g \
  --ipc=host \
  -p 8000:8000 \
  -v /raid/engine/hub_cache:/hf:Z \
  -e HF_HUB_OFFLINE=1 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e HF_HOME=/hf \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  quay.io/vllm/automation-vllm:cuda-25070156955 \
    --model deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --enforce-eager \
    --max-model-len 4096 \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8 \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 \
    --reasoning-parser deepseek_v4 \
    --enable-auto-tool-choice \
    --host 0.0.0.0 --port 8000

5. Watch logs for startup

docker logs -f vllm-dsv4-pro

Wait for the log line "Application startup complete." Startup takes about 3.5 minutes on 8x B200 (model load ~69s, total init ~222s including profiling, KV cache setup, and warmup).
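
Instead of watching the logs by hand, a small polling loop against the health endpoint also works (a sketch; adjust the timeout for your hardware):

# Poll /health every 10s for up to 10 minutes; exits as soon as the server is ready
for i in $(seq 1 60); do
  curl -sf http://127.0.0.1:8000/health >/dev/null && echo "server ready" && break
  sleep 10
done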

Smoke Tests

Health check

curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8000/health
# Expected: 200

Chat completion

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What is 17 times 19?"}],
    "max_tokens": 64
  }'
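
If jq is installed, the same request can be piped through it to isolate the model's answer:

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V4-Pro", "messages": [{"role": "user", "content": "What is 17 times 19?"}], "max_tokens": 64}' \
  | jq -r '.choices[0].message.content'
# The answer should contain 323 (17 * 19 = 323)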

Reasoning (requires reasoning parser)

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "A ski lift carries 4 people per chair, with chairs departing every 15 seconds. How many people can it carry up the mountain in 1 hour?"}],
    "max_tokens": 1024,
    "chat_template_kwargs": {"thinking": true, "reasoning_effort": "high"}
  }'
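
With the deepseek_v4 reasoning parser active, vLLM returns the chain of thought in message.reasoning_content, separate from the final message.content. Appending a jq filter to the request above shows both fields:

  | jq '{reasoning: .choices[0].message.reasoning_content, answer: .choices[0].message.content}'
# Sanity check on the answer: 3600s / 15s = 240 chairs per hour, * 4 people = 960 people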

Tool calling (requires tool-call-parser)

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What is the weather like in Burlington, Vermont?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City and state"}
          },
          "required": ["location"]
        }
      }
    }],
    "max_tokens": 128
  }'
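
When the parser recognizes a tool call, it lands in message.tool_calls rather than message.content. Appending jq to the request above shows the parsed call:

  | jq '.choices[0].message.tool_calls'
# Expect a get_weather call whose arguments include "Burlington, Vermont"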

Text completion

curl -s http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "prompt": "Hello, world! I am DeepSeek V4 and I",
    "max_tokens": 64
  }'
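
As with chat, a jq filter strips the response down to the generated text:

  | jq -r '.choices[0].text'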

Cleanup

docker stop vllm-dsv4-pro && docker rm vllm-dsv4-pro
chg release --gpu-ids 0,1,2,3,4,5,6,7

Performance (observed on 8x B200)

Metric              Value
Model load          134.62 GiB, 68.9s
Total init time     222.26s
KV cache available  26.37 GiB
KV cache capacity   15,284 tokens
MoE backend         FLASHINFER_TRTLLM_MXFP4_MXFP8
TileLang kernels    mhc_pre_big_fuse_tilelang, mhc_post_tilelang
Basic chat latency  ~1-2s for short responses
Reasoning latency   ~160s for complex math (302 tokens)
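
The chat latency figure can be reproduced with curl's built-in timing, no extra tooling assumed:

curl -s -o /dev/null -w 'time_total: %{time_total}s\n' \
  http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V4-Pro", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 16}'
# Expect roughly 1-2s for a short response like this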

Notes and Gotchas

  • --user root is currently required. FlashInfer writes compiled CUDA kernels (cubins) at runtime to /opt/vllm/lib64/python3.12/site-packages/flashinfer_cubin/cubins/. The directory is owned by root with 755 permissions, but the container's default vllm user (uid=2000) can't write to it. FLASHINFER_CACHE_DIR does not redirect cubin writes. A proper fix is to chmod/chown the cubins directory in the Dockerfile (see the sketch after this list).
  • --enforce-eager is recommended for initial testing. CUDA graph capture hasn't been validated with V4's mixed FP4+FP8 expert architecture yet. Start eager, then try without once basic serving is confirmed.
  • Start with --max-model-len 4096 for smoke testing. V4 supports up to 1M context, but scale up gradually. With 8x B200 you have 15,284 tokens of KV capacity at fp8, which fits only ~3 concurrent full-length 4096-token requests.
  • Tool calling and reasoning require server-side flags. You must start the server with --enable-auto-tool-choice --tool-call-parser deepseek_v4 --reasoning-parser deepseek_v4 — these can't be set per-request.
  • --ipc=host is important. Without it, shared memory for NCCL across 8 GPUs may be insufficient.
  • Use 127.0.0.1, not localhost, for curl. Some hosts resolve localhost to IPv6 first and the connection fails.
  • Don't use --rm on the container if you need to debug crashes — you lose the logs.
  • HF_HUB_OFFLINE=1 prevents the container from trying to reach HuggingFace Hub at startup. The model must be fully downloaded before starting.
  • Build required two patches beyond upstream v0.20.1rc0: (1) python3.12-test package for _xxsubinterpreters module needed by tilelang, and (2) deep_gemm upgrade from 2.1.1 to 2.4.2 for V4's HC layer (tf32_hc_prenorm_gemm).
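
For the FlashInfer cubins issue in the first bullet, a minimal Dockerfile sketch of the suggested chmod/chown fix (the site-packages path is the one observed at runtime; verify it against your image before relying on this):

FROM quay.io/vllm/automation-vllm:cuda-25070156955
# Hand the cubins directory to the default vllm user (uid 2000) so runtime
# kernel compilation can write there without --user root
RUN chown -R 2000:2000 /opt/vllm/lib64/python3.12/site-packages/flashinfer_cubin/cubins
USER 2000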
