Local LLMs on an M4 Max 64GB: What to Run, How to Use It, and the Uncensored Landscape

Date: 2026-04-24
Machine: MacBook Pro M4 Max, 16-core CPU (12P+4E), 64GB unified memory, ~546 GB/s memory bandwidth
Context: Research summary for picking and running local models on Apple Silicon, integrating them with Claude Code / Cline, and understanding the uncensored/abliterated model ecosystem.


TL;DR

  • Your machine fits 70B-class models at Q4 or 35B MoE models at 8-bit, with bandwidth that makes them genuinely interactive (8-50+ tok/s depending on arch).
  • The sweet spot in April 2026 is MoE models in the 30-35B total / 3B active range — Qwen3.6-35B-A3B and Qwen3-Coder-30B-A3B dominate the "best for 64GB Mac" recommendations.
  • Pull unsloth/*-UD-MLX-*bit variants when available — Unsloth Dynamic 2.0 quants beat vanilla quants of the same bit width.
  • Claude Code can point at LM Studio via ANTHROPIC_BASE_URL, but Cline/Continue.dev are purpose-built for local-LLM coding agents and give a better UX. The optimal setup is hybrid — real Claude for hard work, local for the rest.
  • For uncensored: huihui-ai/*-abliterated is the active publisher in 2026; MoE models abliterate cleanly with minimal quality loss.

1. Hardware context

The M4 Max with 64GB is in a specific sweet spot:

  • Memory ceiling: ~55GB usable for model weights (leaving OS + context + apps).
  • Bandwidth: ~546 GB/s — the number that actually determines tokens/sec, not raw RAM.
  • Rule of thumb: At Q4 quantization, budget ~0.6GB per billion parameters. A 70B model = ~42GB. A 35B model = ~18-22GB.
  • KV cache: Budget ~1GB per 32K tokens of context on top of model weights. This matters more than people expect at the 200K+ context windows Qwen3 supports.
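
A quick way to sanity-check whether a model fits before downloading, using the two rules of thumb above (the 0.6 GB-per-billion figure and the 1 GB-per-32K KV estimate are rough guides, not exact numbers):

params_b=35        # total parameters, in billions
ctx_tokens=131072  # planned context window, in tokens
awk -v p="$params_b" -v c="$ctx_tokens" 'BEGIN {
  weights = p * 0.6       # ~0.6 GB per billion params at Q4
  kv      = c / 32768     # ~1 GB per 32K tokens of context
  printf "weights ~%.0f GB + KV cache ~%.0f GB = ~%.0f GB total\n", weights, kv, weights + kv
}'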

The Ollama 0.19 release added a native MLX backend that auto-activates on 32GB+ Macs, so Apple Silicon performance "changed overnight" across the ecosystem when users upgraded frontends.


2. Model recommendations (~32GB budget on MLX)

Tier 1: Daily drivers

  • unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit (LM Studio search: Qwen3.6 35B; ~22GB). MoE: 35B total, 3B active. Flagship quality at 3B latency.
  • unsloth/Qwen3.6-27B-UD-MLX-6bit (LM Studio search: Qwen3.6 27B; ~22-24GB). Dense 27B. Near-lossless at 6-bit. "Flagship-level coding" per HN.
  • unsloth/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit (LM Studio search: Qwen3-Coder 30B; ~18-20GB). Coding-specialized MoE. Native tool-call support.

Tier 2: Quality over headroom

  • unsloth/Qwen3.6-27B-MLX-8bit (LM Studio search: Qwen3.6 27B 8bit; ~28-30GB)
  • unsloth/Qwen3.6-35B-A3B-MLX-8bit (LM Studio search: Qwen3.6 35B 8bit; ~32GB)

Over-budget (skip on 64GB)

  • Qwen3.5-122B-A10B-MLX-4bit (~65GB)
  • Llama 3.3 70B @ Q8 (~75GB)
  • DeepSeek V3/R1 full 670B (needs ~400GB even quantized)

Unsloth "UD" variants — why they matter

Unsloth Dynamic 2.0 quants calibrate against real-world instruction-following datasets and keep critical layers at higher precision while aggressively quantizing the rest. A UD-4bit often beats a vanilla 6-bit in benchmarks. When you see a -UD- tag, pick it over the non-UD version at the same or lower bit width.


3. MoE vs Dense on Apple Silicon

This is the load-bearing insight for your hardware class:

  • Dense models activate every parameter on every token. 70B dense at Q4 = 42GB of weights moving through the memory bus per token → ~8-12 tok/s on M4 Max.
  • MoE models (like Qwen3.6-35B-A3B) activate only ~3B params per token via a router. Same 22GB sits in RAM, but only ~2-3GB streams per token → 44-52 tok/s on M4 Max.

This is why every "best for Mac" list converges on MoE architectures. Apple Silicon's unified memory + bandwidth profile is exactly right for models that are big in total but narrow in activation.
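
A back-of-the-envelope ceiling makes the gap concrete: tokens/sec cannot exceed memory bandwidth divided by the bytes streamed per token. Real throughput lands well below that ceiling for MoE (attention over the KV cache, router overhead, compute), which is why the measured numbers above are ~44-52 tok/s rather than the hundreds the ceiling implies, but the ratio between the two cases is the point:

bw=546   # GB/s, M4 Max
awk -v bw="$bw" 'BEGIN {
  printf "dense 70B @ Q4 (~42 GB streamed/token):  ceiling ~%.0f tok/s\n", bw / 42
  printf "MoE ~3B active (~2.5 GB streamed/token): ceiling ~%.0f tok/s\n", bw / 2.5
}'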


4. Using local models with Claude / Claude Code

The landscape

LM Studio 0.4.1 (shipped earlier in 2026) added a native Anthropic-compatible /v1/messages endpoint. Claude Code is just an Anthropic-API client — it doesn't verify there's a real Claude on the other end. Flip two env vars and it talks to your local model.

Option 1: Claude Code → LM Studio

# Start LM Studio, load a model, click "Start Server"
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio
claude
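
Before launching claude, a one-off request against the endpoint confirms the loaded model actually answers. This assumes LM Studio's /v1/messages route accepts the standard Anthropic Messages request shape, and that the loaded model is identified as qwen/qwen3-coder-30b (substitute whatever id LM Studio shows for your model):

curl -s http://localhost:1234/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: lmstudio" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen/qwen3-coder-30b",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Reply with one short sentence."}]
  }'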

Gotcha: ANTHROPIC_BASE_URL is global to the shell. Don't export it in your .zshrc unless you want every Claude Code session pointed at the local model. A shell function is cleaner:

claude-local() {
  ANTHROPIC_BASE_URL=http://localhost:1234 \
  ANTHROPIC_AUTH_TOKEN=lmstudio \
  claude "$@"
}

Honest assessment: It works, but Qwen3-Coder ≠ Claude Opus 4.7. Multi-step agentic tasks (edit → test → debug → retry) expose the gap, and tool-calling on local models has improved in 2026 but remains more fragile. Good for offline, private, travel, and grunt work.

Option 2: Cline + LM Studio (the hot 2026 local coding stack)

The Cline team publicly recommends Cline + LM Studio + Qwen3-Coder-30B-A3B as "the local coding stack." Setup:

  1. Load unsloth/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit in LM Studio, set context to 262,144, click Start Server.
  2. Cline (VS Code extension) → Settings → Provider: LM Studio → Model: qwen/qwen3-coder-30b.
  3. Leave base URL blank (defaults to http://localhost:1234).
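
The same setup can be scripted headlessly with LM Studio's bundled lms CLI. The subcommands below exist in recent builds, but flag names may differ by version, and the model key is whatever lms ls reports for your download:

lms server start
lms load qwen3-coder-30b-a3b-instruct-mlx --context-length 262144   # key per `lms ls`; flag is an assumption
curl -s http://localhost:1234/v1/models    # confirm the exact model id Cline should use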

Why it beats Option 1 for coding:

  • Native tool-call format specifically for Qwen3-Coder — fewer malformed calls.
  • Diff-preview UX before applying changes.
  • Retry logic tuned for local models.

Option 3 (recommended): Hybrid

  • Architectural changes, planning, multi-file refactors → Claude Code (real Claude)
  • Boilerplate, small edits, docstrings, test scaffolds → Cline + local Qwen3-Coder
  • Chat / brainstorm / rubber-duck → LM Studio chat UI with Qwen3.6-35B-A3B
  • Offline / on a plane / confidential codebase → Claude Code → LM Studio (Option 1)

64GB is enough to keep LM Studio loaded with a coder model and use Claude Code normally. Switch based on task weight.


5. The uncensored / abliterated landscape

Three meanings of "uncensored"

  1. Abliteration — post-hoc weight surgery that zeros out the "refusal direction" in the model's activations. Cheap, fast; typically degrades reasoning/coding 2-8% because refusal circuits overlap with other capabilities.
  2. Uncensored fine-tunes (Dolphin, Venice) — actually retrained on unrestricted data. Preserve quality better but cost more to produce.
  3. Base models (non-instruct) — never had safety training, but also don't follow instructions well.

The active publishers in 2026

  • huihui-ai — publishes abliterated versions of nearly every major open-weight release within days. Dominates the category.
  • cognitivecomputations (Eric Hartford's Dolphin lineage) — older, more respected fine-tune approach.
  • Venice — 2026-era uncensored fine-tune family.

Models that fit your budget

  • huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated (LM Studio search: Huihui Qwen3.5 35B; ~22GB @ 4bit)
  • huihui-ai/Qwen3-30B-A3B-abliterated (LM Studio search: Huihui Qwen3 30B; ~18-20GB)
  • huihui_ai/qwen3-coder-abliterated:30b (LM Studio search: Qwen3 Coder abliterated; ~18-20GB)
  • huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated (vision + uncensored; ~20GB)

Why MoE abliterates cleanly

In dense models, refusal logic is distributed across every layer — abliteration causes more collateral damage. In MoE models like Qwen3-30B-A3B, refusal circuits concentrate in specific experts, so the abliteration "scalpel" does less harm. This is a structural advantage: MoE abliterated variants retain nearly all their coding/reasoning quality.

Caveats worth naming

  1. Quality hit is real but small (2-5% on reasoning benchmarks). Noticeable on hard agentic tasks, invisible on chat/creative writing.
  2. Abliteration isn't jailbreaking — it's lobotomizing. The model won't refuse, but may give worse answers on edge cases because refusal was entangled with "carefully check if this request makes sense."
  3. "Uncensored" ≠ "accurate." Still hallucinates; just less likely to refuse.
  4. MLX uploads lag GGUF. If the MLX variant isn't up yet, GGUF works in LM Studio — slightly slower but functional.
  5. The legitimate use case: most people want abliterated models to stop annoying false-positive refusals (villain dialogue, medication dosages, explaining SQL injection for security learning), not to do anything spicy.

Reversible setup

LM Studio loads models by folder. Keep "clean" and abliterated variants of the same base side-by-side — ~45GB of disk for both, swap via the model picker. No overwriting, no re-downloads.
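
To see what is actually on disk and how big each variant is, a quick check (the models directory path is an assumption: recent installs use ~/.lmstudio/models, older ones ~/.cache/lm-studio/models):

lms ls                                               # per-model listing via the LM Studio CLI
du -sh ~/.lmstudio/models/*/* 2>/dev/null | sort -h  # or measure the folders directly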


6. Practical starter pack

If I were setting this machine up today, I'd pull:

  1. unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit — general-purpose daily driver
  2. unsloth/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit — coding model for Cline
  3. huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated (MLX if available, GGUF otherwise) — uncensored variant for when you need it
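
One way to pull all three from the terminal, assuming the lms get subcommand is available in your LM Studio build (searching in the GUI's model browser works just as well):

lms get unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit
lms get unsloth/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit
lms get huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated   # fall back to the GGUF upload if no MLX variant yet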

Total disk: ~60GB. Total RAM used at inference time: ~22GB (one loaded at a time). Leaves 40GB+ for context windows, other apps, and Claude Code running alongside.

Frontend: LM Studio with MLX backend enabled (Settings → Use MLX backend on Apple Silicon).


