realugbun · March 22, 2026 04:34
 ### Radeon 8060S / gfx1151 · Proxmox LXC passthrough · ROCm 7.2 · llama.cpp HIP build

 ## Results

 Qwen3.5-35B-A3B Q4_K_M (21 GB GGUF): ~45 t/s single user, ~168 t/s aggregate at 16 parallel slots. Throughput is memory-bandwidth limited at that concurrency.

 ```bash
 llama-server \
  -m /models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8091 \
  -ngl 999 \
  -fa on \
  -dio \
  -c 131072 \
  -np 16 \
  -b 2048 \
  -ub 512 \
  --cache-prompt \
  --jinja
 ```
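For scale, the two throughput figures above imply a per-slot rate and a scaling factor. The sketch below is plain arithmetic on the reported numbers, not a new measurement; the function names are illustrative only.

```bash
# Per-slot decode rate from an aggregate figure: aggregate / slots.
per_slot_tps() { LC_ALL=C awk -v a="$1" -v n="$2" 'BEGIN { printf "%.1f\n", a / n }'; }

# Speed-up of the aggregate figure over the single-user figure.
scaling() { LC_ALL=C awk -v a="$1" -v s="$2" 'BEGIN { printf "%.1f\n", a / s }'; }

per_slot_tps 168 16   # prints 10.5 (t/s seen by each of the 16 slots)
scaling 168 45        # prints 3.7  (aggregate vs. single-user throughput)
```

So under full load each request decodes at roughly 10.5 t/s, and 16 slots buy about 3.7x the single-user throughput.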

 ---

 ## Host

 - Proxmox VE, kernel `6.19.2-1-pve`
 - In-tree `amdgpu` driver, no `amdgpu-dkms`
 - gfx1151 needs a recent kernel for native support; `6.19.2-1-pve` works well here. The 6.8-era Proxmox kernels are too old and tend to push you toward AMD's DKMS driver, which gets messy on Proxmox.
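
A quick way to check both host points is sketched below. It compares the running kernel against the known-good 6.19.2 (the write-up does not state a true minimum, so treat that threshold as a reference point, not a requirement) and guesses whether `amdgpu` is in-tree or DKMS from the module path. The `updates/dkms` path convention and the helper names are assumptions.

```bash
# True if version $1 >= version $2, using GNU sort's version ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Classify an amdgpu module path as DKMS or in-tree.
# Assumption: DKMS builds are installed under an updates/dkms directory.
amdgpu_flavor() {
  case "$1" in
    */updates/dkms/*) echo dkms ;;
    *) echo in-tree ;;
  esac
}

running="$(uname -r)"       # e.g. 6.19.2-1-pve
running="${running%%-*}"    # strip the -1-pve suffix
if version_ge "$running" 6.19.2; then
  echo "kernel $running: at or above the known-good 6.19.2"
else
  echo "kernel $running: older than the known-good 6.19.2"
fi
amdgpu_flavor "$(modinfo -n amdgpu 2>/dev/null)"
```

If `modinfo` is unavailable the last line falls back to `in-tree`, so read that result with care on stripped-down systems.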

 ## ROCm

 AMD APT repo, pinned priority 600:

 ```text
 deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/latest noble main
 ```

 - ~55 packages, all 7.2.0
 - `rocm-hip-sdk` as meta-package
 - `rocwmma-dev` required for flash-attention build path (came in via `rocm-hip-sdk` for me, verify if doing minimal install)
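
The pin and the `rocwmma-dev` check can be scripted roughly as follows. The pin stanza matches the priority-600 setup described above; the file name `rocm-pin-600` and the helper names are assumptions, so adapt them to your layout.

```bash
# Write an apt pin giving AMD's repo priority 600.
# $1 is the target file, e.g. /etc/apt/preferences.d/rocm-pin-600
# (that file name is an assumption, any name in preferences.d works).
write_rocm_pin() {
  printf 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600\n' > "$1"
}

# True if the named Debian package is fully installed.
pkg_installed() {
  dpkg-query -W -f='${Status}' "$1" 2>/dev/null | grep -q 'install ok installed'
}

# Example (as root, inside the build environment):
#   write_rocm_pin /etc/apt/preferences.d/rocm-pin-600
#   pkg_installed rocwmma-dev || apt-get install -y rocwmma-dev
```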

 ## LXC passthrough

 Unprivileged LXC, Proxmox-native device mappings:

 - `/dev/dri/card0`
 - `/dev/dri/renderD128`
 - `/dev/kfd`
 - Host render-group GID mapped correctly

 If `rocm-smi` can't see your GPU from inside the LXC, fix that before anything else.
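
A minimal sketch of that first check, run inside the container: it only tests that the three device nodes exist and are readable and writable by the current user, which is where a wrong GID mapping usually shows up. The helper name is hypothetical.

```bash
# Check that each device node given as an argument exists and is
# readable and writable by the current user. Returns non-zero on any failure.
check_nodes() {
  local rc=0 node
  for node in "$@"; do
    if [ -r "$node" ] && [ -w "$node" ]; then
      echo "ok       $node"
    else
      echo "MISSING  $node (absent, or wrong GID mapping?)"
      rc=1
    fi
  done
  return "$rc"
}

# Inside the container:
#   check_nodes /dev/kfd /dev/dri/card0 /dev/dri/renderD128
#   rocminfo | grep -m1 gfx1151    # the GPU agent should be listed by name
```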

 ## llama.cpp build

 Commit `4d99d45`:

 ```bash
 HIPCXX="$(hipconfig -l)/clang" \
 HIP_PATH="$(hipconfig -R)" \
 cmake -S . -B build \
  -DGGML_HIP=ON \
  -DGPU_TARGETS=gfx1151 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGGML_HIP_NO_VMM=ON \
  -DGGML_HIP_MMQ_MFMA=ON \
  -DCMAKE_BUILD_TYPE=Release

 cmake --build build --config Release -j"$(nproc)"
 ```

 ### Build flags

 - `GPU_TARGETS=gfx1151` — mandatory, don't rely on defaults
 - `GGML_HIP_ROCWMMA_FATTN=ON` — rocWMMA flash attention, big perf difference. Needs `rocwmma-dev`
 - `GGML_HIP_NO_VMM=ON` — **the big one.** HIP VMM doesn't work right on this GPU. Without this you get misleading stability/loading failures that look like driver issues. Try this first if you're having unexplained problems.
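 - `GGML_HIP_MMQ_MFMA=ON` — enabled in this build, intended to help the matmul path. MFMA is a CDNA (Instinct) instruction set, so how much it does on RDNA 3.5 is unverified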
 - `HIPCXX` / `HIP_PATH` — set explicitly, cmake wasn't finding the toolchain reliably without them
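
Since a silently dropped flag is hard to spot later, it can be worth confirming what CMake actually recorded. The sketch below reads values out of `build/CMakeCache.txt`, where entries are stored as `NAME:TYPE=VALUE`; the helper name is hypothetical.

```bash
# Print the cached value of a CMake variable.
# $1 = path to CMakeCache.txt, $2 = variable name.
cache_value() {
  sed -n "s/^$2:[A-Z]*=//p" "$1"
}

# Example, from the llama.cpp source root after configuring:
#   cache_value build/CMakeCache.txt GGML_HIP_NO_VMM   # expect ON
#   cache_value build/CMakeCache.txt GPU_TARGETS       # expect gfx1151
```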

 ### Runtime flags

 - `-ngl 999` — full GPU offload
 - `-fa on` — actually use the rocWMMA flash attention you built with
 - `-dio` — **required for models >~6 GB or they hang on load.** Not slow. Hang.
 - `-np 16` — drives concurrency, GPU can handle it
 - `-b 2048` / `-ub 512` — tuned batch sizes, defaults are conservative
 - `--cache-prompt` — matters for real serving with repeated prefixes

 ---

 ## Reproduction checklist

 1. New enough kernel for gfx1151 (6.19.2-1-pve works well here; 6.8-era Proxmox kernels are too old)
 2. In-tree amdgpu, skip DKMS
 3. ROCm 7.2.0 from AMD's repo
 4. Verify `rocwmma-dev` installed
 5. LXC: `/dev/kfd` + `/dev/dri/card0` + `/dev/dri/renderD128` + correct GID
 6. `GPU_TARGETS=gfx1151`
 7. `GGML_HIP_NO_VMM=ON`
 8. `-dio` for anything over ~6 GB

 If you've got gfx1151 numbers I'd like to compare, especially around NO_VMM and -dio.