| Field | Value |
|---|---|
| Model | mistralai/Mistral-Small-4-119B-2603 |
| Image | quay.io/vllm/rhaiis-early-access:mistral-4-small |
| Build Run | 24369571413 |
| nm-cicd branch | doug/mistral-4-small |
| nm-vllm-ent branch | doug/mistral-4-small |
| Target device | cuda |
| Python | 3.12 |
Build wheel + image (from scratch):

```bash
gh workflow run build-whl-image.yml \
--repo neuralmagic/nm-cicd \
--ref doug/mistral-4-small \
-f repo=neuralmagic/nm-vllm-ent \
-f branch=doug/mistral-4-small \
-f build_label=k8s-a100-build-12-9 \
-f build_timeout=120 \
-f image_label=ibm-wdc-k8s-h100-dind \
-f python=3.12 \
-f release_image=false \
-f target_device=cuda
```
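
To follow the dispatched run, the standard gh CLI run commands work; `<RUN_ID>` below is a placeholder for whatever `gh run list` reports:

```bash
# List recent runs of this workflow to find the new run's ID.
gh run list --repo neuralmagic/nm-cicd --workflow build-whl-image.yml --limit 5

# Stream progress until the run completes; exits non-zero if the run fails.
gh run watch <RUN_ID> --repo neuralmagic/nm-cicd --exit-status
```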
Rebuild image only (reusing an existing wheel from a previous run):

```bash
gh workflow run build-image.yml \
--repo neuralmagic/nm-cicd \
--ref doug/mistral-4-small \
-f repo=neuralmagic/nm-vllm-ent \
-f branch=doug/mistral-4-small \
-f build_label=ibm-wdc-k8s-h100-dind \
-f release_image=false \
-f run_id=<PREVIOUS_RUN_ID> \
-f target_device=cuda
```
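
To find a `<PREVIOUS_RUN_ID>` whose wheel you can reuse, listing earlier successful builds should be enough (standard gh CLI; this assumes the wheel came from build-whl-image.yml):

```bash
# Successful wheel builds on this branch; use the run ID from the
# output as <PREVIOUS_RUN_ID>.
gh run list --repo neuralmagic/nm-cicd \
--workflow build-whl-image.yml \
--branch doug/mistral-4-small \
--status success
```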
Prerequisites:

- 2x H100 80GB GPUs (119B MoE model, needs TP=2)
- Model downloaded to local disk (NFS has rootless podman UID mapping issues)
- podman with NVIDIA CDI support
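
A quick way to check the GPU and CDI prerequisites before pulling anything, assuming the NVIDIA Container Toolkit is installed:

```bash
# Both H100s should be listed by the driver.
nvidia-smi -L

# The CDI spec must list nvidia.com/gpu devices; if the list is empty,
# regenerate the spec.
nvidia-ctk cdi list
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```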
Download the model to local disk:

```bash
HF_HOME=/home/$USER/.cache/huggingface \
HF_TOKEN=<your_token> \
uv tool run --from huggingface_hub hf download mistralai/Mistral-Small-4-119B-2603
```

Pull the image:

```bash
podman pull quay.io/vllm/rhaiis-early-access:mistral-4-small
```

Run the server:

```bash
podman run -d \
--name vllm-mistral-small-4 \
--device nvidia.com/gpu=0 \
--device nvidia.com/gpu=1 \
--security-opt=label=disable \
--shm-size=10g \
-p 8000:8000 \
-v /home/$USER/.cache/huggingface:/hf:Z \
-e HF_HUB_OFFLINE=1 \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e HF_HOME=/hf \
-e CUDA_VISIBLE_DEVICES=0,1 \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
quay.io/vllm/rhaiis-early-access:mistral-4-small \
--model mistralai/Mistral-Small-4-119B-2603 \
--tensor-parallel-size 2 \
--attention-backend FLASH_ATTN_MLA \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--reasoning-parser mistral \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--host 0.0.0.0 --port 8000
```

Note: The model card recommends `--max-model-len 262144`, but that OOMs on 2x H100 80GB. Use 8192 for smoke testing. For the full context length you'll need more GPUs or a larger TP size.
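
To confirm the container actually sees both GPUs once it's up (this assumes nvidia-smi is present inside the image, which I haven't verified):

```bash
# Should list both H100s from inside the container.
podman exec vllm-mistral-small-4 nvidia-smi -L
```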
Watch the logs:

```bash
podman logs -f vllm-mistral-small-4
```

Look for `Application startup complete.` -- model loading takes ~30 seconds.
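
For scripted smoke tests, a minimal readiness loop against the health endpoint saves eyeballing the logs (a sketch; 60 retries at 5 s is an arbitrary timeout):

```bash
# Block until /health returns 200, or give up after ~5 minutes.
for i in $(seq 1 60); do
  curl -sf http://127.0.0.1:8000/health >/dev/null && break
  sleep 5
done
```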
Health check:

```bash
curl http://127.0.0.1:8000/health
```

Completion request:

```bash
curl -s http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-Small-4-119B-2603",
"prompt": "Write a perl script that outputs an advertisement for noodles marketed to sysadmins",
"max_tokens": 256
}'
```

Chat completion request:

```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-Small-4-119B-2603",
"messages": [{"role": "user", "content": "Hello, what model are you?"}],
"max_tokens": 128
}'
```
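
Since the server runs with `--tool-call-parser mistral` and `--enable-auto-tool-choice`, a tool-calling smoke test is worth a try; the `get_weather` function below is a made-up example, not something from the model card:

```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "mistralai/Mistral-Small-4-119B-2603",
  "messages": [{"role": "user", "content": "What is the weather in Boston?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "max_tokens": 128
}'
```

A successful response should contain a `tool_calls` entry rather than plain text content.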
Cleanup:

```bash
podman stop vllm-mistral-small-4
podman rm vllm-mistral-small-4
```