Created
March 22, 2026 04:34
-
-
Save realugbun/8add8e716e3d768c32f7ebaa687922a4 to your computer and use it in GitHub Desktop.
Known-Good Strix Halo ROCm + llama.cpp Stack
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| ### Radeon 8060S / gfx1151 · Proxmox LXC passthrough · ROCm 7.2 · llama.cpp HIP build | |
| ## Results | |
| Qwen3.5-35B-A3B Q4_K_M (21 GB GGUF): ~45 t/s single user, ~168 t/s aggregate at 16 parallel slots. Memory bandwidth limited at that concurrency. | |
| ```bash | |
| llama-server \ | |
| -m /models/Qwen3.5-35B-A3B-Q4_K_M.gguf \ | |
| --host 0.0.0.0 --port 8091 \ | |
| -ngl 999 \ | |
| -fa on \ | |
| -dio \ | |
| -c 131072 \ | |
| -np 16 \ | |
| -b 2048 \ | |
| -ub 512 \ | |
| --cache-prompt \ | |
| --jinja | |
| ``` | |
| --- | |
| ## Host | |
| - Proxmox VE, kernel `6.19.2-1-pve` | |
| - In-tree `amdgpu` driver, no `amdgpu-dkms` | |
| - gfx1151 needs a sufficiently new kernel for native support. In practice, 6.19.2-1-pve works well here. Older 6.8-era Proxmox kernels are too old and tend to push you toward AMD DKMS, which gets messy on Proxmox. | |
| ## ROCm | |
| AMD APT repo, pinned priority 600: | |
| ```text | |
| deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/latest noble main | |
| ``` | |
| - ~55 packages, all 7.2.0 | |
| - `rocm-hip-sdk` as meta-package | |
| - `rocwmma-dev` required for flash-attention build path (came in via `rocm-hip-sdk` for me, verify if doing minimal install) | |
| ## LXC passthrough | |
| Unprivileged LXC, Proxmox-native device mappings: | |
| - `/dev/dri/card0` | |
| - `/dev/dri/renderD128` | |
| - `/dev/kfd` | |
| - Host render-group GID mapped correctly | |
| If rocm-smi can't see your GPU from inside the LXC, fix that before anything else. | |
| ## llama.cpp build | |
| Commit `4d99d45`: | |
| ```bash | |
| HIPCXX="$(hipconfig -l)/clang" \ | |
| HIP_PATH="$(hipconfig -R)" \ | |
| cmake -S . -B build \ | |
| -DGGML_HIP=ON \ | |
| -DGPU_TARGETS=gfx1151 \ | |
| -DGGML_HIP_ROCWMMA_FATTN=ON \ | |
| -DGGML_HIP_NO_VMM=ON \ | |
| -DGGML_HIP_MMQ_MFMA=ON \ | |
| -DCMAKE_BUILD_TYPE=Release | |
| cmake --build build --config Release -j"$(nproc)" | |
| ``` | |
| ### Build flags | |
| - `GPU_TARGETS=gfx1151` — mandatory, don't rely on defaults | |
| - `GGML_HIP_ROCWMMA_FATTN=ON` — rocWMMA flash attention, big perf difference. Needs `rocwmma-dev` | |
| - `GGML_HIP_NO_VMM=ON` — **the big one.** HIP VMM doesn't work right on this GPU. Without this you get misleading stability/loading failures that look like driver issues. Try this first if you're having unexplained problems. | |
| - `GGML_HIP_MMQ_MFMA=ON` — helps matmul path | |
| - `HIPCXX` / `HIP_PATH` — set explicitly, cmake wasn't finding the toolchain reliably without them | |
| ### Runtime flags | |
| - `-ngl 999` — full GPU offload | |
| - `-fa on` — actually use the rocWMMA flash attention you built with | |
| - `-dio` — **required for models >~6 GB or they hang on load.** Not slow. Hang. | |
| - `-np 16` — drives concurrency, GPU can handle it | |
| - `-b 2048` / `-ub 512` — tuned batch sizes, defaults are conservative | |
| - `--cache-prompt` — matters for real serving with repeated prefixes | |
| --- | |
| ## Reproduction checklist | |
| 1. New enough kernel for gfx1151 (6.19.2-1-pve works well here; 6.8-era Proxmox kernels are too old) | |
| 2. In-tree amdgpu, skip DKMS | |
| 3. ROCm 7.2.0 from AMD's repo | |
| 4. Verify `rocwmma-dev` installed | |
| 5. LXC: `/dev/kfd` + `/dev/dri/card0` + `/dev/dri/renderD128` + correct GID | |
| 6. `GPU_TARGETS=gfx1151` | |
| 7. `GGML_HIP_NO_VMM=ON` | |
| 8. `-dio` for anything over ~6 GB | |
| If you've got gfx1151 numbers I'd like to compare, especially around NO_VMM and -dio. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment