AMD’s Strix Halo APU is one of the most interesting consumer options right now for running local AI workloads. It is relatively power efficient, and its unified memory architecture lets the GPU access large amounts of system RAM. This makes it possible to run larger models locally without a discrete GPU.
As the software is still maturing, running models with GPU acceleration is not exactly straightforward yet. So I wrote this guide after getting a setup working reliably.
Before diving into the details, here is a quick overview of the stack I ended up with:
- OS: Fedora Rawhide for a recent kernel and working ROCm stack
- Runtime: Lemonade for managing and serving models
- Backend: llama.cpp (ROCm-enabled, Strix Halo–optimized build)
For the OS, I needed a recent kernel, up-to-date AMD drivers, and a working ROCm stack. I chose Fedora Rawhide because it worked with minimal configuration and was recent enough for this setup.
For model serving, I used Lemonade, which is AMD’s local-first runtime and API layer for running AI workloads directly on a PC. It exposes an OpenAI-compatible API and handles hardware-specific details like CPU, GPU, and NPU backends automatically.
Strictly speaking, Lemonade is not required. I use it as a convenient interface to
download, manage, and quickly try models and their configurations. This is slightly faster
and easier than running llama-server (from llama.cpp) manually. I built
Lemonade from source to use the latest features and fixes. At the time of
writing, the version available for download still used the deprecated
lemonade-server command, so I decided to build and install the current main
branch instead.
As mentioned earlier, the inference backend is
llama.cpp, built with
ROCm support to enable GPU
acceleration on AMD hardware. Lemonade runs the llama-server binary with the
appropriate arguments under the hood. After some testing, I ended up using a
different build
of llama.cpp optimized for Strix Halo, since neither the upstream builds nor
the one bundled with Lemonade worked reliably for me.
The rest of this guide walks through the steps I used to get everything working.
On Strix Halo, the GPU uses system memory. My BIOS allowed allocating up to 96 GB of VRAM, but since the memory is shared, I did not want to reserve a large fixed portion up front. I wanted the flexibility to use more memory when needed, while keeping it available for other workloads. The driver can reserve additional system memory at runtime (as GTT), so I set the dedicated VRAM to 512 MB in the BIOS.
Then with the help of this article, I set the kernel boot parameters to max out unified memory:
sudo grubby --update-kernel=ALL --args="amd_iommu=pt ttm.pages_limit=33554432 ttm.page_pool_size=33554432"
Verify:
sudo grubby --info=ALL
Reboot and confirm:
cat /proc/cmdline
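For reference, ttm.pages_limit and ttm.page_pool_size are expressed in memory pages (4 KiB each on a typical x86_64 kernel), so 33554432 pages works out to 128 GiB; if your machine has a different amount of RAM, scale the value accordingly. A quick sanity check of the arithmetic:
echo $(( 33554432 * 4096 / 1024 / 1024 / 1024 ))  # prints 128 (GiB)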
Then I installed ROCm and the required packages for building Lemonade:
sudo dnf upgrade --refresh -y
sudo dnf install rocm -y
# Build dependencies
sudo dnf group install development-tools c-development rpm-development-tools -y
# AppImage
sudo dnf install fuse fuse-libs
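Before building anything, it is worth confirming that ROCm actually sees the integrated GPU. A minimal check (Strix Halo should show up as gfx1151):
rocminfo | grep -i gfx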
Clone the repository:
mkdir -p ~/Projects
cd ~/Projects
git clone https://github.com/lemonade-sdk/lemonade.git
cd lemonade
Prepare the build environment:
rm -rf build # optional cleanup
./setup.sh
Build the project:
cmake --build --preset default
Optionally build the Tauri app:
cmake --preset default -DBUILD_APPIMAGE=ON
# Very important, otherwise the build fails
export NO_STRIP=true
cmake --build --preset default --target appimage
Building an RPM makes it easier to install and manage the service cleanly. The following commands create and install the package:
cd build
cpack -G RPM
sudo dnf install -y lemonade-server-*.rpm
To start Lemonade at boot and run it right away:
sudo systemctl enable --now lemond
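At this point the OpenAI-compatible API should be reachable. As a quick check, listing the available models works with a plain curl call (assuming the server uses the same base URL and port, 13305, as the benchmark commands later in this guide; adjust if yours differs):
curl http://localhost:13305/v1/models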
llama.cpp changes quickly, and issues get fixed (or sometimes introduced 😉) in
every release. In my case, some issues with prompt template handling were resolved
by upgrading, so I recommend trying the most recent version if you are having
issues with Lemonade's default llama.cpp backend.
Lemonade supports a custom llama-server path, which makes it easy to swap
builds.
The official release could not load models on the GPU in my setup, and Lemonade's ROCm nightly build was also partially broken.
After some digging, I found another repository that worked. This may not be needed in the future, but the point is that things can break, and it helps to be able to switch builds quickly.
The trick to managing multiple versions/releases of llama.cpp is to store the
builds in /opt/var/lib/lemonade/backends and use a symlink to select the
active version:
ROCM_RELEASES_DIR=/opt/var/lib/lemonade/backends/rocm
sudo mkdir -p "$ROCM_RELEASES_DIR"
sudo mv /path/to/llama/release "$ROCM_RELEASES_DIR/"
sudo chown -R lemonade:lemonade "$ROCM_RELEASES_DIR"
sudo rm -f "$ROCM_RELEASES_DIR/llama-current"
# TODO: replace llama-release-dir-name with the actual directory name
sudo ln -s "$ROCM_RELEASES_DIR/llama-release-dir-name" "$ROCM_RELEASES_DIR/llama-current"
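Switching to another build later is then just a matter of repointing the symlink and restarting the service; the directory name below is a placeholder for whichever release you want to activate:
sudo ln -sfn "$ROCM_RELEASES_DIR/another-llama-release-dir" "$ROCM_RELEASES_DIR/llama-current"
sudo systemctl restart lemond
# Optionally confirm which build is active:
"$ROCM_RELEASES_DIR/llama-current/llama-server" --version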
Configure Lemonade to use the selected build:
lemonade config set llamacpp.rocm_bin=/opt/var/lib/lemonade/backends/rocm/llama-current/llama-server
I tested a few popular models using the highest-precision quantization that fit in VRAM and the maximum context size supported by each model. In practice, these settings depend on your use case, and you will likely want to balance quality, speed, and memory usage when choosing them. For reference, I included shell scripts to import the models with the same parameters as I used, along with API throughput results measured using llmapibenchmark. I ran it with the following arguments:
MODEL="model-name"
llmapibenchmark -u "http://localhost:13305/v1" -m "${MODEL}" -c 1
I also configured Lemonade to use the ROCm backend by default:
lemonade config set llamacpp.backend=rocm
Refs:
TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"
for reasoning_effort in "low" "medium" "high"; do
cat <<EOF > model-${reasoning_effort}.json
{
"checkpoints": {
"main": "ggml-org/gpt-oss-120b-GGUF:*"
},
"labels": [
"reasoning",
"tool-calling"
],
"model_name": "gpt-oss-120b-${reasoning_effort}",
"recipe": "llamacpp",
"recipe_options": {
"ctx_size": 131072,
"llamacpp_args": "--flash-attn on --ubatch-size 2048 --temp 1.0 --top-k 0 --top-p 1.0 --cache-type-k bf16 --cache-type-v bf16 --chat-template-kwargs '{\\"reasoning_effort\\": \\"${reasoning_effort}\\"}'"
}
}
EOF
lemonade import model-${reasoning_effort}.json
done
popd
rm -rf "${TEMP_DIR}"llmapibenchmark -u "http://localhost:13305/v1" -m "user.gpt-oss-120b-medium" -c 1Results:
Model: user.gpt-oss-120b-medium
Latency: 0.20 ms
Input: 90 tokens / Output: 512 tokens
| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|---|---|---|---|---|---|---|
| 1 | 49.59 | 751.25 | 0.12 | 0.12 | 100.00% | 10.32 |
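Once imported, each variant can be targeted through the OpenAI-compatible API by its model name; note that Lemonade prefixes user-imported models with user., as seen in the benchmark command above. A minimal chat request against the medium-effort variant, assuming the same base URL (adjust the prompt and model name as needed):
curl http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "user.gpt-oss-120b-medium",
        "messages": [{"role": "user", "content": "Summarize what reasoning effort changes in one sentence."}]
      }'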
Refs:
TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"
for thinking in 0 1; do
if [ "$thinking" -eq 1 ]; then
thinking_suffix="-thinking"
reasoning_mode="on"
else
thinking_suffix=""
reasoning_mode="off"
fi
cat <<EOF > model${thinking_suffix}.json
{
"checkpoints": {
"main": "unsloth/gemma-4-31B-it-GGUF:BF16"
},
"labels": [
"vision",
"tool-calling"
],
"model_name": "gemma-4-31B-it${thinking_suffix}",
"recipe": "llamacpp",
"recipe_options": {
"ctx_size": 256000,
"llamacpp_args": "--flash-attn on --ubatch-size 2048 --temp 1.0 --top-k 64 --top-p 0.95 --cache-type-k bf16 --cache-type-v bf16 --reasoning ${reasoning_mode}"
}
}
EOF
lemonade import model${thinking_suffix}.json
done
popd
rm -rf "${TEMP_DIR}"Results:
Model: user.gemma-4-31B-it
Latency: 0.20 ms
Input: 39 tokens / Output: 512 tokens
| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|---|---|---|---|---|---|---|
| 1 | 3.32 | 49.38 | 0.79 | 0.79 | 100.00% | 154.22 |
Note
I lowered ctx_size to 64000 to make it fit
Refs:
TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"
cat <<EOF > model.json
{
"checkpoints": {
"main": "unsloth/MiniMax-M2.7-GGUF:UD-IQ4_NL"
},
"labels": [
"coding"
],
"model_name": "MiniMax-M2.7",
"recipe": "llamacpp",
"recipe_options": {
"ctx_size": 64000,
"llamacpp_args": "--flash-attn on --ubatch-size 2048 --temp 1.0 --top-k 40 --top-p 0.95 --cache-type-k q8_0 --cache-type-v q8_0"
}
}
EOF
lemonade import model.json
popd
rm -rf "${TEMP_DIR}"Results:
Model: user.MiniMax-M2.7
Latency: 0.20 ms
Input: 61 tokens / Output: 512 tokens
| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|---|---|---|---|---|---|---|
| 1 | 22.60 | 1020.07 | 0.06 | 0.06 | 100.00% | 22.66 |
Refs:
TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"
cat > qwen36-thinking-general.json <<'EOF'
{
"checkpoints": {
"main": "unsloth/Qwen3.6-35B-A3B-GGUF:BF16"
},
"labels": ["general", "thinking"],
"model_name": "Qwen3.6-35B-A3B-Thinking-General",
"recipe": "llamacpp",
"recipe_options": {
"ctx_size": 262144,
"llamacpp_args": "--reasoning on --cache-type-k bf16 --cache-type-v bf16 --temp 1.0 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0"
}
}
EOF
cat > qwen36-thinking-coding.json <<'EOF'
{
"checkpoints": {
"main": "unsloth/Qwen3.6-35B-A3B-GGUF:BF16"
},
"labels": ["coding", "thinking"],
"model_name": "Qwen3.6-35B-A3B-Thinking-Coding",
"recipe": "llamacpp",
"recipe_options": {
"ctx_size": 262144,
"llamacpp_args": "--reasoning on --cache-type-k bf16 --cache-type-v bf16 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0"
}
}
EOF
cat > qwen36-instruct-general.json <<'EOF'
{
"checkpoints": {
"main": "unsloth/Qwen3.6-35B-A3B-GGUF:BF16"
},
"labels": ["general", "instruct"],
"model_name": "Qwen3.6-35B-A3B-Instruct-General",
"recipe": "llamacpp",
"recipe_options": {
"ctx_size": 262144,
"llamacpp_args": "--reasoning off --cache-type-k bf16 --cache-type-v bf16 --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0"
}
}
EOF
cat > qwen36-instruct-reasoning.json <<'EOF'
{
"checkpoints": {
"main": "unsloth/Qwen3.6-35B-A3B-GGUF:BF16"
},
"labels": ["reasoning", "instruct"],
"model_name": "Qwen3.6-35B-A3B-Instruct-Reasoning",
"recipe": "llamacpp",
"recipe_options": {
"ctx_size": 262144,
"llamacpp_args": "--reasoning off --cache-type-k bf16 --cache-type-v bf16 --temp 1.0 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0"
}
}
EOF
lemonade import qwen36-thinking-general.json
lemonade import qwen36-thinking-coding.json
lemonade import qwen36-instruct-general.json
lemonade import qwen36-instruct-reasoning.json
popd
rm -rf "${TEMP_DIR}"Results:
Model: user.Qwen3.6-35B-A3B-Instruct-General
Latency: 0.20 ms
Input: 38 tokens / Output: 512 tokens
| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|---|---|---|---|---|---|---|
| 1 | 22.90 | 115.22 | 0.33 | 0.33 | 100.00% | 22.36 |
Refs:
TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"
COMMON_ARGS="--flash-attn auto --ubatch-size 2048 --cache-type-k bf16 --cache-type-v bf16 --reasoning on --top-k 40 --min-p 0.01 --repeat-penalty 1.0"
for variant in general tool_calling; do
if [ "$variant" = "general" ]; then
LLAMACPP_ARGS="${COMMON_ARGS} --temp 1.0 --top-p 0.95"
else
LLAMACPP_ARGS="${COMMON_ARGS} --temp 0.7 --top-p 1.0"
fi
cat <<EOF > model-${variant}.json
{
"checkpoints": {
"main": "unsloth/GLM-4.7-Flash-GGUF:BF16"
},
"labels": [
"${variant}"
],
"model_name": "GLM-4.7-Flash-${variant}",
"recipe": "llamacpp",
"recipe_options": {
"ctx_size": 202752,
"llamacpp_args": "${LLAMACPP_ARGS}"
}
}
EOF
lemonade import model-${variant}.json
done
popd
rm -rf "${TEMP_DIR}"Note
I lowered ctx_size to 64000 to make it fit
TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"
MODEL_CHECKPOINT="unsloth/GLM-4.5-Air-GGUF:Q8_0"
MODEL_NAME_BASE="GLM-4.5-Air"
COMMON_ARGS="--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --ubatch-size 2048"
for variant in general reasoning thinking-general thinking-coding; do
case "${variant}" in
general)
LLAMACPP_ARGS="${COMMON_ARGS} --reasoning off --temp 0.6 --top-p 0.95 --min-p 0.01 --presence-penalty 0.0 --repeat-penalty 1.0"
;;
reasoning)
LLAMACPP_ARGS="${COMMON_ARGS} --reasoning off --temp 1.0 --top-p 0.95 --min-p 0.01 --presence-penalty 0.0 --repeat-penalty 1.0"
;;
thinking-general)
LLAMACPP_ARGS="${COMMON_ARGS} --reasoning on --temp 1.0 --top-p 0.95 --min-p 0.01 --presence-penalty 0.0 --repeat-penalty 1.0"
;;
thinking-coding)
LLAMACPP_ARGS="${COMMON_ARGS} --reasoning on --temp 0.6 --top-p 0.95 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0"
;;
esac
cat <<EOF > model-${variant}.json
{
"checkpoints": {
"main": "${MODEL_CHECKPOINT}"
},
"labels": [
"glm-4.5-air",
"llamacpp",
"${variant}"
],
"model_name": "${MODEL_NAME_BASE}-${variant}",
"recipe": "llamacpp",
"recipe_options": {
"ctx_size": 64000,
"llamacpp_args": "${LLAMACPP_ARGS}"
}
}
EOF
lemonade import model-${variant}.json
done
popd
rm -rf "${TEMP_DIR}"Results:
Model: user.GLM-4.5-Air-thinking-coding
Latency: 0.40 ms
Input: 28 tokens / Output: 512 tokens
| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|---|---|---|---|---|---|---|
| 1 | 14.62 | 127.50 | 0.22 | 0.22 | 100.00% | 35.01 |