| Field | Value |
|---|---|
| Model | mistralai/Mistral-Small-4-119B-2603 |
| Image | quay.io/vllm/rhaiis-early-access:mistral-4-small |
| Build Run | 24369571413 |
| nm-cicd branch | doug/mistral-4-small |
| nm-vllm-ent branch | doug/mistral-4-small |
| Target device | cuda |
| Python | 3.12 |
Build wheel + image (from scratch):

```bash
gh workflow run build-whl-image.yml \
--repo neuralmagic/nm-cicd \
--ref doug/mistral-4-small \
-f repo=neuralmagic/nm-vllm-ent \
-f branch=doug/mistral-4-small \
-f build_label=k8s-a100-build-12-9 \
-f build_timeout=120 \
-f image_label=ibm-wdc-k8s-h100-dind \
-f python=3.12 \
-f release_image=false \
-f target_device=cuda
```
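
To follow the dispatched run, the standard gh CLI run commands work; `<RUN_ID>` below is a placeholder for whatever `gh run list` reports:

```bash
# List recent runs of this workflow to find the new run's ID.
gh run list --repo neuralmagic/nm-cicd --workflow build-whl-image.yml --limit 5

# Stream progress until the run completes; exits non-zero if the run fails.
gh run watch <RUN_ID> --repo neuralmagic/nm-cicd --exit-status
```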
Rebuild image only (reusing an existing wheel from a previous run):

```bash
gh workflow run build-image.yml \
--repo neuralmagic/nm-cicd \
--ref doug/mistral-4-small \
-f repo=neuralmagic/nm-vllm-ent \
-f branch=doug/mistral-4-small \
-f build_label=ibm-wdc-k8s-h100-dind \
-f release_image=false \
-f run_id=<PREVIOUS_RUN_ID> \
-f target_device=cuda
```
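
To find a `<PREVIOUS_RUN_ID>` whose wheel you can reuse, listing earlier successful builds should be enough (standard gh CLI; this assumes the wheel came from build-whl-image.yml):

```bash
# Successful wheel builds on this branch; use the run ID from the
# output as <PREVIOUS_RUN_ID>.
gh run list --repo neuralmagic/nm-cicd \
--workflow build-whl-image.yml \
--branch doug/mistral-4-small \
--status success
```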
Prerequisites:

- 2x H100 80GB GPUs (119B MoE model, needs TP=2)
- Model downloaded to local disk (NFS has rootless podman UID mapping issues)
- podman with NVIDIA CDI support
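
A quick way to check the GPU and CDI prerequisites before pulling anything, assuming the NVIDIA Container Toolkit is installed:

```bash
# Both H100s should be listed by the driver.
nvidia-smi -L

# The CDI spec must list nvidia.com/gpu devices; if the list is empty,
# regenerate the spec.
nvidia-ctk cdi list
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```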
Download the model to local disk:

```bash
HF_HOME=/home/$USER/.cache/huggingface \
HF_TOKEN=<your_token> \
uv tool run --from huggingface_hub hf download mistralai/Mistral-Small-4-119B-2603
```

Pull the image:

```bash
podman pull quay.io/vllm/rhaiis-early-access:mistral-4-small
```

Run the server:

```bash
podman run -d \
--name vllm-mistral-small-4 \
--device nvidia.com/gpu=0 \
--device nvidia.com/gpu=1 \
--security-opt=label=disable \
--shm-size=10g \
-p 8000:8000 \
-v /home/$USER/.cache/huggingface:/hf:Z \
-e HF_HUB_OFFLINE=1 \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e HF_HOME=/hf \
-e CUDA_VISIBLE_DEVICES=0,1 \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
quay.io/vllm/rhaiis-early-access:mistral-4-small \
--model mistralai/Mistral-Small-4-119B-2603 \
--tensor-parallel-size 2 \
--attention-backend FLASH_ATTN_MLA \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--reasoning-parser mistral \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--host 0.0.0.0 --port 8000
```

Note: The model card recommends `--max-model-len 262144`, but that OOMs on 2x H100 80GB. Use 8192 for smoke testing. For the full context length you'll need more GPUs or a larger TP size.
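
To confirm the container actually sees both GPUs once it's up (this assumes nvidia-smi is present inside the image, which I haven't verified):

```bash
# Should list both H100s from inside the container.
podman exec vllm-mistral-small-4 nvidia-smi -L
```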
Watch the logs:

```bash
podman logs -f vllm-mistral-small-4
```

Look for `Application startup complete.` -- model loading takes ~30 seconds.
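
For scripted smoke tests, a minimal readiness loop against the health endpoint saves eyeballing the logs (a sketch; 60 retries at 5 s is an arbitrary timeout):

```bash
# Block until /health returns 200, or give up after ~5 minutes.
for i in $(seq 1 60); do
  curl -sf http://127.0.0.1:8000/health >/dev/null && break
  sleep 5
done
```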
Health check:

```bash
curl http://127.0.0.1:8000/health
```

Completion request:

```bash
curl -s http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-Small-4-119B-2603",
"prompt": "Write a perl script that outputs an advertisement for noodles marketed to sysadmins",
"max_tokens": 256
}'
```

Chat completion request:

```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-Small-4-119B-2603",
"messages": [{"role": "user", "content": "Hello, what model are you?"}],
"max_tokens": 128
}'
```
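
Since the server runs with `--tool-call-parser mistral` and `--enable-auto-tool-choice`, a tool-calling smoke test is worth a try; the `get_weather` function below is a made-up example, not something from the model card:

```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "mistralai/Mistral-Small-4-119B-2603",
  "messages": [{"role": "user", "content": "What is the weather in Boston?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "max_tokens": 128
}'
```

A successful response should contain a `tool_calls` entry rather than plain text content.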
Cleanup:

```bash
podman stop vllm-mistral-small-4
podman rm vllm-mistral-small-4
```