
Yet Another Guide to Getting Started with Local AI on Strix Halo

AMD’s Strix Halo APU is one of the most interesting consumer options right now for running local AI workloads. It is relatively power efficient, and its unified memory architecture lets the GPU access large amounts of system RAM. This makes it possible to run larger models locally without a discrete GPU.

As the software is still maturing, running models with GPU acceleration is not exactly straightforward yet. So I wrote this guide after getting a setup working reliably.

Software stack

Before diving into the details, here is a quick overview of the stack I ended up with:

  • OS: Fedora Rawhide for a recent kernel and working ROCm stack
  • Runtime: Lemonade for managing and serving models
  • Backend: llama.cpp (ROCm-enabled, Strix Halo–optimized build)

For the OS, I needed a recent kernel, up-to-date AMD drivers, and a working ROCm stack. I chose Fedora Rawhide because it worked with minimal configuration and was recent enough for this setup.

For model serving, I used Lemonade, which is AMD’s local-first runtime and API layer for running AI workloads directly on a PC. It exposes an OpenAI-compatible API and handles hardware-specific details like CPU, GPU, and NPU backends automatically.

Lemonade is not strictly required; I use it as a convenient interface to download, manage, and quickly try models and their configurations. This is slightly faster and easier than running llama-server (from llama.cpp) manually. I built Lemonade from source to get the latest features and fixes: at the time of writing, the version available for download still used the deprecated lemonade-server command, so I built and installed the current main branch instead.

As mentioned earlier, the inference backend is llama.cpp, built with ROCm support to enable GPU acceleration on AMD hardware. Lemonade runs the llama-server binary with the appropriate arguments under the hood. After some testing, I ended up using a different build of llama.cpp optimized for Strix Halo, since neither the upstream builds nor the one bundled with Lemonade worked reliably for me.

The rest of this guide walks through the steps I used to get everything working.

System configuration

BIOS

On Strix Halo, the GPU uses system memory. My BIOS allowed allocating up to 96 GB of VRAM, but since the memory is shared, I did not want to reserve a large fixed portion up front. I wanted the flexibility to use more memory when needed, while keeping it available for other workloads. The driver can reserve additional memory at runtime, so I set the VRAM to 512 MB in the BIOS.
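
To see what this looks like at runtime, the amdgpu driver exposes both pools through sysfs. A minimal sketch to print them (assuming a single amdgpu device; the card index may vary):

# Print the fixed VRAM carve-out and the GTT limit (driver-managed
# system memory) for each amdgpu device; non-amdgpu nodes are skipped.
for card in /sys/class/drm/card*/device; do
  [ -f "${card}/mem_info_gtt_total" ] || continue
  echo "${card}"
  echo "  VRAM carve-out: $(numfmt --to=iec < "${card}/mem_info_vram_total")"
  echo "  GTT limit:      $(numfmt --to=iec < "${card}/mem_info_gtt_total")"
done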

Kernel parameters

Then, with the help of this article, I set the kernel boot parameters to max out unified memory (ttm.pages_limit and ttm.page_pool_size are counted in 4 KiB pages, so 33554432 pages = 128 GiB):

sudo grubby --update-kernel=ALL --args="amd_iommu=pt ttm.pages_limit=33554432 ttm.page_pool_size=33554432"

Verify:

sudo grubby --info=ALL

Reboot and confirm:

cat /proc/cmdline

Install dependencies

Then I installed ROCm and the required packages for building Lemonade:

sudo dnf upgrade --refresh -y

sudo dnf install rocm -y

# Build dependencies
sudo dnf group install development-tools c-development rpm-development-tools -y

# AppImage runtime dependencies
sudo dnf install fuse fuse-libs -y

Build and install Lemonade

Clone the repository:

mkdir -p ~/Projects
cd ~/Projects
git clone https://github.com/lemonade-sdk/lemonade.git
cd lemonade

Prepare the build environment:

rm -rf build # optional cleanup
./setup.sh

Build the project:

cmake --build --preset default

Optionally build the Tauri app:

cmake --preset default -DBUILD_APPIMAGE=ON

# Very important, otherwise the build fails
export NO_STRIP=true

cmake --build --preset default --target appimage

Building an RPM makes it easier to install and manage the service cleanly. The following commands create and install the package:

cd build
cpack -G RPM
sudo dnf install -y lemonade-server-*.rpm

To start Lemonade at boot and run it right away:

sudo systemctl enable --now lemond
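
Once the service is running, a quick sanity check is to query the OpenAI-compatible API (this assumes the default port 13305 that I use later in this guide, and the standard /v1/models route):

curl -s http://localhost:13305/v1/models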

Using a recent llama.cpp build

llama.cpp changes quickly, and issues get fixed (or sometimes introduced 😉) in every release. In my case, some issues with prompt-template handling were resolved by upgrading, so I recommend trying the most recent version if you run into problems with Lemonade's default llama.cpp backend.

Lemonade supports a custom llama-server path, which makes it easy to swap builds.

The official release could not load models on the GPU in my setup, and Lemonade's ROCm nightly build was also partially broken.

After some digging, I found another repository that worked. This may not be needed in the future, but the point is that things can break, and it helps to be able to switch builds quickly.

Managing multiple llama.cpp builds

The trick to managing multiple versions/releases of llama.cpp is to store the builds in /opt/var/lib/lemonade/backends and use a symlink to select the active version:

ROCM_RELEASES_DIR=/opt/var/lib/lemonade/backends/rocm

sudo mkdir -p "$ROCM_RELEASES_DIR"

sudo mv /path/to/llama/release "$ROCM_RELEASES_DIR/"
sudo chown -R lemonade:lemonade "$ROCM_RELEASES_DIR"

sudo rm -f "$ROCM_RELEASES_DIR/llama-current"
# TODO: replace llama-release-dir-name with the actual directory name
sudo ln -s "$ROCM_RELEASES_DIR/llama-release-dir-name" "$ROCM_RELEASES_DIR/llama-current"

Configure Lemonade to use the selected build:

lemonade config set llamacpp.rocm_bin=/opt/var/lib/lemonade/backends/rocm/llama-current/llama-server
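
To make swapping builds less error-prone, here is a small helper sketch (the switch-llama name is hypothetical; it assumes the directory layout above and simply re-points the symlink):

switch-llama() {
  # Hypothetical helper: point llama-current at another build directory
  # under the backends tree set up above.
  local dir="/opt/var/lib/lemonade/backends/rocm/$1"
  if [ ! -x "${dir}/llama-server" ]; then
    echo "no llama-server in ${dir}" >&2
    return 1
  fi
  # ln -sfn replaces the existing symlink in a single step
  sudo ln -sfn "${dir}" /opt/var/lib/lemonade/backends/rocm/llama-current
}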

Testing some models

I tested a few popular models, using the highest-quality quantization that fit in memory and the maximum context size each model supports. In practice, these settings depend on your use case, and you will likely want to balance quality, speed, and memory usage when choosing them. For reference, I included shell scripts that import the models with the same parameters I used, along with API throughput results measured with llmapibenchmark. I ran it with the following arguments:

MODEL="model-name"
llmapibenchmark -u "http://localhost:13305/v1" -m "${MODEL}" -c 1

I also configured Lemonade to use the ROCm backend by default:

lemonade config set llamacpp.backend=rocm

gpt-oss-120b (MXFP4)

TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"

for reasoning_effort in "low" "medium" "high"; do

cat <<EOF > model-${reasoning_effort}.json
{
  "checkpoints": {
    "main": "ggml-org/gpt-oss-120b-GGUF:*"
  },
  "labels": [
    "reasoning",
    "tool-calling"
  ],
  "model_name": "gpt-oss-120b-${reasoning_effort}",
  "recipe": "llamacpp",
  "recipe_options": {
    "ctx_size": 131072,
    "llamacpp_args": "--flash-attn on --ubatch-size 2048 --temp 1.0 --top-k 0 --top-p 1.0 --cache-type-k bf16 --cache-type-v bf16 --chat-template-kwargs '{\\"reasoning_effort\\": \\"${reasoning_effort}\\"}'"
  }
}
EOF

lemonade import model-${reasoning_effort}.json

done

popd
rm -rf "${TEMP_DIR}"
llmapibenchmark -u "http://localhost:13305/v1" -m "user.gpt-oss-120b-medium" -c 1

Results:

Model: user.gpt-oss-120b-medium
Latency: 0.20 ms
Input: 90 tokens / Output: 512 tokens
Conc  Gen TPS  Prompt TPS  Min TTFT(s)  Max TTFT(s)  Success  Total(s)
1     49.59    751.25      0.12         0.12         100.00%  10.32
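
Beyond raw throughput, a minimal OpenAI-style chat completion with curl makes for a quick end-to-end check (assuming the same port and the user.gpt-oss-120b-medium model name from above):

curl -s http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "user.gpt-oss-120b-medium",
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "max_tokens": 32
  }'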

Gemma 4 31B (BF16)

TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"

for thinking in 0 1; do

if [ "$thinking" -eq 1 ]; then
  thinking_suffix="-thinking"
  reasoning_mode="on"
else
  thinking_suffix=""
  reasoning_mode="off"
fi

cat <<EOF > model${thinking_suffix}.json
{
  "checkpoints": {
    "main": "unsloth/gemma-4-31B-it-GGUF:BF16"
  },
  "labels": [
    "vision",
    "tool-calling"
  ],
  "model_name": "gemma-4-31B-it${thinking_suffix}",
  "recipe": "llamacpp",
  "recipe_options": {
    "ctx_size": 256000,
    "llamacpp_args": "--flash-attn on --ubatch-size 2048 --temp 1.0 --top-k 64 --top-p 0.95 --cache-type-k bf16 --cache-type-v bf16 --reasoning ${reasoning_mode}"
  }
}
EOF

lemonade import model${thinking_suffix}.json

done

popd
rm -rf "${TEMP_DIR}"

Results:

Model: user.gemma-4-31B-it
Latency: 0.20 ms
Input: 39 tokens / Output: 512 tokens
Conc  Gen TPS  Prompt TPS  Min TTFT(s)  Max TTFT(s)  Success  Total(s)
1     3.32     49.38       0.79         0.79         100.00%  154.22

MiniMax-M2.7 (UD-IQ4_NL)

Note

I lowered ctx_size to 64000 to make the model fit in memory.

TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"

cat <<EOF > model.json
{
  "checkpoints": {
    "main": "unsloth/MiniMax-M2.7-GGUF:UD-IQ4_NL"
  },
  "labels": [
    "coding"
  ],
  "model_name": "MiniMax-M2.7",
  "recipe": "llamacpp",
  "recipe_options": {
    "ctx_size": 64000,
    "llamacpp_args": "--flash-attn on --ubatch-size 2048 --temp 1.0 --top-k 40 --top-p 0.95 --cache-type-k q8_0 --cache-type-v q8_0"
  }
}
EOF

lemonade import model.json

popd
rm -rf "${TEMP_DIR}"

Results:

Model: user.MiniMax-M2.7
Latency: 0.20 ms
Input: 61 tokens / Output: 512 tokens
Conc  Gen TPS  Prompt TPS  Min TTFT(s)  Max TTFT(s)  Success  Total(s)
1     22.60    1020.07     0.06         0.06         100.00%  22.66

Qwen 3.6 (BF16)

TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"

cat > qwen36-thinking-general.json <<'EOF'
{
  "checkpoints": {
    "main": "unsloth/Qwen3.6-35B-A3B-GGUF:BF16"
  },
  "labels": ["general", "thinking"],
  "model_name": "Qwen3.6-35B-A3B-Thinking-General",
  "recipe": "llamacpp",
  "recipe_options": {
    "ctx_size": 262144,
    "llamacpp_args": "--reasoning on --cache-type-k bf16 --cache-type-v bf16 --temp 1.0 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0"
  }
}
EOF

cat > qwen36-thinking-coding.json <<'EOF'
{
  "checkpoints": {
    "main": "unsloth/Qwen3.6-35B-A3B-GGUF:BF16"
  },
  "labels": ["coding", "thinking"],
  "model_name": "Qwen3.6-35B-A3B-Thinking-Coding",
  "recipe": "llamacpp",
  "recipe_options": {
    "ctx_size": 262144,
    "llamacpp_args": "--reasoning on --cache-type-k bf16 --cache-type-v bf16 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0"
  }
}
EOF

cat > qwen36-instruct-general.json <<'EOF'
{
  "checkpoints": {
    "main": "unsloth/Qwen3.6-35B-A3B-GGUF:BF16"
  },
  "labels": ["general", "instruct"],
  "model_name": "Qwen3.6-35B-A3B-Instruct-General",
  "recipe": "llamacpp",
  "recipe_options": {
    "ctx_size": 262144,
    "llamacpp_args": "--reasoning off --cache-type-k bf16 --cache-type-v bf16 --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0"
  }
}
EOF

cat > qwen36-instruct-reasoning.json <<'EOF'
{
  "checkpoints": {
    "main": "unsloth/Qwen3.6-35B-A3B-GGUF:BF16"
  },
  "labels": ["reasoning", "instruct"],
  "model_name": "Qwen3.6-35B-A3B-Instruct-Reasoning",
  "recipe": "llamacpp",
  "recipe_options": {
    "ctx_size": 262144,
    "llamacpp_args": "--reasoning off --cache-type-k bf16 --cache-type-v bf16 --temp 1.0 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0"
  }
}
EOF

lemonade import qwen36-thinking-general.json
lemonade import qwen36-thinking-coding.json
lemonade import qwen36-instruct-general.json
lemonade import qwen36-instruct-reasoning.json

popd
rm -rf "${TEMP_DIR}"

Results:

Model: user.Qwen3.6-35B-A3B-Instruct-General
Latency: 0.20 ms
Input: 38 tokens / Output: 512 tokens
Conc  Gen TPS  Prompt TPS  Min TTFT(s)  Max TTFT(s)  Success  Total(s)
1     22.90    115.22      0.33         0.33         100.00%  22.36

GLM-4.7-Flash (BF16)

TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"

COMMON_ARGS="--flash-attn auto --ubatch-size 2048 --cache-type-k bf16 --cache-type-v bf16 --reasoning on --top-k 40 --min-p 0.01 --repeat-penalty 1.0"

for variant in general tool_calling; do

if [ "$variant" = "general" ]; then
  LLAMACPP_ARGS="${COMMON_ARGS} --temp 1.0 --top-p 0.95"
else
  LLAMACPP_ARGS="${COMMON_ARGS} --temp 0.7 --top-p 1.0"
fi

cat <<EOF > model-${variant}.json
{
  "checkpoints": {
    "main": "unsloth/GLM-4.7-Flash-GGUF:BF16"
  },
  "labels": [
    "${variant}"
  ],
  "model_name": "GLM-4.7-Flash-${variant}",
  "recipe": "llamacpp",
  "recipe_options": {
    "ctx_size": 202752,
    "llamacpp_args": "${LLAMACPP_ARGS}"
  }
}
EOF

lemonade import model-${variant}.json

done

popd
rm -rf "${TEMP_DIR}"

GLM-4.5-Air (Q8_0)

Note

I lowered ctx_size to 64000 to make the model fit in memory.

TEMP_DIR=$(mktemp -d)
pushd "${TEMP_DIR}"

MODEL_CHECKPOINT="unsloth/GLM-4.5-Air-GGUF:Q8_0"
MODEL_NAME_BASE="GLM-4.5-Air"
COMMON_ARGS="--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --ubatch-size 2048"

for variant in general reasoning thinking-general thinking-coding; do

case "${variant}" in
  general)
    LLAMACPP_ARGS="${COMMON_ARGS} --reasoning off --temp 0.6 --top-p 0.95 --min-p 0.01 --presence-penalty 0.0 --repeat-penalty 1.0"
    ;;
  reasoning)
    LLAMACPP_ARGS="${COMMON_ARGS} --reasoning off --temp 1.0 --top-p 0.95 --min-p 0.01 --presence-penalty 0.0 --repeat-penalty 1.0"
    ;;
  thinking-general)
    LLAMACPP_ARGS="${COMMON_ARGS} --reasoning on --temp 1.0 --top-p 0.95 --min-p 0.01 --presence-penalty 0.0 --repeat-penalty 1.0"
    ;;
  thinking-coding)
    LLAMACPP_ARGS="${COMMON_ARGS} --reasoning on --temp 0.6 --top-p 0.95 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0"
    ;;
esac

cat <<EOF > model-${variant}.json
{
  "checkpoints": {
    "main": "${MODEL_CHECKPOINT}"
  },
  "labels": [
    "glm-4.5-air",
    "llamacpp",
    "${variant}"
  ],
  "model_name": "${MODEL_NAME_BASE}-${variant}",
  "recipe": "llamacpp",
  "recipe_options": {
    "ctx_size": 64000,
    "llamacpp_args": "${LLAMACPP_ARGS}"
  }
}
EOF

lemonade import model-${variant}.json

done

popd
rm -rf "${TEMP_DIR}"

Results:

Model: user.GLM-4.5-Air-thinking-coding
Latency: 0.40 ms
Input: 28 tokens / Output: 512 tokens
Conc  Gen TPS  Prompt TPS  Min TTFT(s)  Max TTFT(s)  Success  Total(s)
1     14.62    127.50      0.22         0.22         100.00%  35.01