Skip to content

Instantly share code, notes, and snippets.

@realugbun
Created March 22, 2026 04:34
Show Gist options
  • Select an option

  • Save realugbun/8add8e716e3d768c32f7ebaa687922a4 to your computer and use it in GitHub Desktop.

Select an option

Save realugbun/8add8e716e3d768c32f7ebaa687922a4 to your computer and use it in GitHub Desktop.
Known-Good Strix Halo ROCm + llama.cpp Stack
### Radeon 8060S / gfx1151 · Proxmox LXC passthrough · ROCm 7.2 · llama.cpp HIP build
## Results
Qwen3.5-35B-A3B Q4_K_M (21 GB GGUF): ~45 t/s single user, ~168 t/s aggregate at 16 parallel slots. Memory bandwidth limited at that concurrency.
```bash
llama-server \
-m /models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
--host 0.0.0.0 --port 8091 \
-ngl 999 \
-fa on \
-dio \
-c 131072 \
-np 16 \
-b 2048 \
-ub 512 \
--cache-prompt \
--jinja
```
---
## Host
- Proxmox VE, kernel `6.19.2-1-pve`
- In-tree `amdgpu` driver, no `amdgpu-dkms`
- gfx1151 needs a sufficiently new kernel for native support. In practice, 6.19.2-1-pve works well here. Older 6.8-era Proxmox kernels are too old and tend to push you toward AMD DKMS, which gets messy on Proxmox.
## ROCm
AMD APT repo, pinned priority 600:
```text
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/latest noble main
```
- ~55 packages, all 7.2.0
- `rocm-hip-sdk` as meta-package
- `rocwmma-dev` required for flash-attention build path (came in via `rocm-hip-sdk` for me, verify if doing minimal install)
## LXC passthrough
Unprivileged LXC, Proxmox-native device mappings:
- `/dev/dri/card0`
- `/dev/dri/renderD128`
- `/dev/kfd`
- Host render-group GID mapped correctly
If rocm-smi can't see your GPU from inside the LXC, fix that before anything else.
## llama.cpp build
Commit `4d99d45`:
```bash
HIPCXX="$(hipconfig -l)/clang" \
HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
-DGGML_HIP=ON \
-DGPU_TARGETS=gfx1151 \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DGGML_HIP_NO_VMM=ON \
-DGGML_HIP_MMQ_MFMA=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```
### Build flags
- `GPU_TARGETS=gfx1151` — mandatory, don't rely on defaults
- `GGML_HIP_ROCWMMA_FATTN=ON` — rocWMMA flash attention, big perf difference. Needs `rocwmma-dev`
- `GGML_HIP_NO_VMM=ON` — **the big one.** HIP VMM doesn't work right on this GPU. Without this you get misleading stability/loading failures that look like driver issues. Try this first if you're having unexplained problems.
- `GGML_HIP_MMQ_MFMA=ON` — helps matmul path
- `HIPCXX` / `HIP_PATH` — set explicitly, cmake wasn't finding the toolchain reliably without them
### Runtime flags
- `-ngl 999` — full GPU offload
- `-fa on` — actually use the rocWMMA flash attention you built with
- `-dio` — **required for models >~6 GB or they hang on load.** Not slow. Hang.
- `-np 16` — drives concurrency, GPU can handle it
- `-b 2048` / `-ub 512` — tuned batch sizes, defaults are conservative
- `--cache-prompt` — matters for real serving with repeated prefixes
---
## Reproduction checklist
1. New enough kernel for gfx1151 (6.19.2-1-pve works well here; 6.8-era Proxmox kernels are too old)
2. In-tree amdgpu, skip DKMS
3. ROCm 7.2.0 from AMD's repo
4. Verify `rocwmma-dev` installed
5. LXC: `/dev/kfd` + `/dev/dri/card0` + `/dev/dri/renderD128` + correct GID
6. `GPU_TARGETS=gfx1151`
7. `GGML_HIP_NO_VMM=ON`
8. `-dio` for anything over ~6 GB
If you've got gfx1151 numbers I'd like to compare, especially around NO_VMM and -dio.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment