Sebastian noxgle

Running OpenCode with Gemma 4 26B on macOS (via llama.cpp)

As of April 2026, Gemma 4 tool calling is broken in Ollama v0.20.0 (ollama/ollama#15241): the tool call parser fails and streaming drops tool calls entirely. OpenCode also has issues with local OpenAI-compatible providers (anomalyco/opencode#20669, #20719).

This guide documents a working setup using:

llama.cpp (built from source with PR #21326 template fix + PR #21343 tokenizer fix) instead of Ollama
OpenCode built from source with PR #16531 tool-call compatibility layer

Tested on macOS Apple Silicon (M1 Max, 32GB) on April 2, 2026.

Moved to git repository: https://github.com/denji/nginx-tuning

NGINX Tuning For Best Performance

You can use any web server you like for this configuration; I chose nginx because it is the one I work with most.

Generally, a properly configured nginx can handle 400K to 500K requests per second (clustered). The most I have seen is 50K to 80K requests per second (non-clustered) at 30% CPU load. Of course, this was on 2 x Intel Xeon with HyperThreading enabled, but it can work without problems on slower machines.

Keep in mind that this config is used in a testing environment, not in production, so you will need to find the best way to implement most of these features on your own servers.

	Server: 2 sockets, 6 cores each, 2.4 GHz

	# Set ixgbe options
	# Limit RSS queues to the number of physical cores per CPU
	# Disable offload
	# After changing this, run the command below and reboot for it to take effect.
	echo "options ixgbe LRO=0,0 MQ=1,1 RSS=6,6 VMDQ=0,0 vxlan_rx=0,0" > /etc/modprobe.d/ixgbe.conf

	# Shut down HT cores
	for i in $(seq 1 2 23); do