Mistral-Small-4-119B: build, run, and smoke test howto

Build Info

Field               Value
------------------  ------------------------------------------------
Model               mistralai/Mistral-Small-4-119B-2603
Image               quay.io/vllm/rhaiis-early-access:mistral-4-small
Build Run           24369571413
nm-cicd branch      doug/mistral-4-small
nm-vllm-ent branch  doug/mistral-4-small
Target device       cuda
Python              3.12

Build Commands

Build wheel + image (from scratch):

gh workflow run build-whl-image.yml \
  --repo neuralmagic/nm-cicd \
  --ref doug/mistral-4-small \
  -f repo=neuralmagic/nm-vllm-ent \
  -f branch=doug/mistral-4-small \
  -f build_label=k8s-a100-build-12-9 \
  -f build_timeout=120 \
  -f image_label=ibm-wdc-k8s-h100-dind \
  -f python=3.12 \
  -f release_image=false \
  -f target_device=cuda

Rebuild image only (reusing an existing wheel from a previous run):

gh workflow run build-image.yml \
  --repo neuralmagic/nm-cicd \
  --ref doug/mistral-4-small \
  -f repo=neuralmagic/nm-vllm-ent \
  -f branch=doug/mistral-4-small \
  -f build_label=ibm-wdc-k8s-h100-dind \
  -f release_image=false \
  -f run_id=<PREVIOUS_RUN_ID> \
  -f target_device=cuda
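
To find a <PREVIOUS_RUN_ID> to reuse, one option is listing recent runs of the build workflow with gh (the ID column in the output is the run ID):

gh run list --repo neuralmagic/nm-cicd --workflow build-whl-image.yml --limit 5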

How to Run

Prerequisites

  • 2x H100 80GB GPUs (119B MoE model, needs TP=2)
  • Model downloaded to local disk (NFS has rootless podman UID mapping issues)
  • podman with NVIDIA CDI support
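
A quick sanity check of the GPU and CDI prerequisites before launching (assumes the NVIDIA Container Toolkit's nvidia-ctk is installed):

# List visible GPUs; expect two H100 80GB entries
nvidia-smi -L

# List the CDI device names podman can request (e.g. nvidia.com/gpu=0)
nvidia-ctk cdi list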

Download the model

HF_HOME=/home/$USER/.cache/huggingface \
HF_TOKEN=<your_token> \
uv tool run --from huggingface_hub hf download mistralai/Mistral-Small-4-119B-2603
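
To confirm the weights landed, check the standard hub cache directory (the models--org--name naming is the Hugging Face cache convention; expect on the order of 240 GB for bf16 weights, 119B params x 2 bytes):

du -sh /home/$USER/.cache/huggingface/hub/models--mistralai--Mistral-Small-4-119B-2603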

Pull the image

podman pull quay.io/vllm/rhaiis-early-access:mistral-4-small
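
Optionally confirm the image is present locally:

podman images quay.io/vllm/rhaiis-early-access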

Start the server

podman run -d \
  --name vllm-mistral-small-4 \
  --device nvidia.com/gpu=0 \
  --device nvidia.com/gpu=1 \
  --security-opt=label=disable \
  --shm-size=10g \
  -p 8000:8000 \
  -v /home/$USER/.cache/huggingface:/hf:Z \
  -e HF_HUB_OFFLINE=1 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e HF_HOME=/hf \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  quay.io/vllm/rhaiis-early-access:mistral-4-small \
    --model mistralai/Mistral-Small-4-119B-2603 \
    --tensor-parallel-size 2 \
    --attention-backend FLASH_ATTN_MLA \
    --tool-call-parser mistral \
    --enable-auto-tool-choice \
    --reasoning-parser mistral \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --host 0.0.0.0 --port 8000

Note: The model card recommends --max-model-len 262144, but that OOMs on 2x H100 80GB, so use 8192 for smoke testing. Reaching the full context length needs more GPUs at a larger tensor-parallel size.
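
For reference, here is an untested sketch of the same launch on a 4x H100 host; whether the full 262144-token context actually fits at TP=4 is an assumption, not something verified here. The --max-num-batched-tokens cap is dropped since it would need retuning for the longer context:

podman run -d \
  --name vllm-mistral-small-4 \
  --device nvidia.com/gpu=0 \
  --device nvidia.com/gpu=1 \
  --device nvidia.com/gpu=2 \
  --device nvidia.com/gpu=3 \
  --security-opt=label=disable \
  --shm-size=10g \
  -p 8000:8000 \
  -v /home/$USER/.cache/huggingface:/hf:Z \
  -e HF_HUB_OFFLINE=1 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e HF_HOME=/hf \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  quay.io/vllm/rhaiis-early-access:mistral-4-small \
    --model mistralai/Mistral-Small-4-119B-2603 \
    --tensor-parallel-size 4 \
    --attention-backend FLASH_ATTN_MLA \
    --tool-call-parser mistral \
    --enable-auto-tool-choice \
    --reasoning-parser mistral \
    --max-model-len 262144 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --host 0.0.0.0 --port 8000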

Watch startup logs

podman logs -f vllm-mistral-small-4

Look for Application startup complete. in the output; model loading takes about 30 seconds.
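
Rather than eyeballing the logs, a small wait loop can block until the server is ready (a sketch; tune the sleep interval to taste):

# Poll the health endpoint until it returns 200
until curl -sf http://127.0.0.1:8000/health > /dev/null; do
  echo "waiting for vllm..."
  sleep 5
done
echo "server is up"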

Smoke Test

Health check

curl http://127.0.0.1:8000/health
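
The server also exposes the OpenAI-compatible model list, which confirms the exact model ID to use in requests:

curl -s http://127.0.0.1:8000/v1/models | python3 -m json.tool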

Completion request

curl -s http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-4-119B-2603",
    "prompt": "Write a perl script that outputs an advertisement for noodles marketed to sysadmins",
    "max_tokens": 256
  }'
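
To pull just the generated text out of the response (standard OpenAI completions shape, choices[0].text):

curl -s http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-Small-4-119B-2603", "prompt": "Hello", "max_tokens": 32}' \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])'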

Chat request

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-4-119B-2603",
    "messages": [{"role": "user", "content": "Hello, what model are you?"}],
    "max_tokens": 128
  }'
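
Since the server launches with --enable-auto-tool-choice and the mistral tool-call parser, tool calling is worth smoke-testing too. The get_weather function below is a made-up example, not a real tool; look for a tool_calls entry in the response:

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-4-119B-2603",
    "messages": [{"role": "user", "content": "What is the weather in Boston right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "max_tokens": 128
  }'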

Cleanup

podman stop vllm-mistral-small-4
podman rm vllm-mistral-small-4
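
To also free the disk used by the pulled image:

podman rmi quay.io/vllm/rhaiis-early-access:mistral-4-small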