Date: 2026-04-24
Machine: MacBook Pro M4 Max, 16-core (12P+4E), 64GB unified memory, ~546 GB/s bandwidth
Context: Research summary for picking and running local models on Apple Silicon, integrating them with Claude Code / Cline, and understanding the uncensored/abliterated model ecosystem.
- Your machine fits 70B-class models at Q4 or 35B MoE models at 8-bit, with bandwidth that makes them genuinely interactive (8-50+ tok/s depending on arch).
- The sweet spot in April 2026 is MoE models in the 30-35B total / 3B active range — Qwen3.6-35B-A3B and Qwen3-Coder-30B-A3B dominate the "best for 64GB Mac" recommendations.
- Pull `unsloth/*-UD-MLX-*bit` variants when available — Unsloth Dynamic 2.0 quants beat vanilla quants of the same bit width.
- Claude Code can point at LM Studio via `ANTHROPIC_BASE_URL`, but Cline/Continue.dev are purpose-built for local-LLM coding agents and give a better UX. The optimal setup is hybrid — real Claude for hard work, local for the rest.
- For uncensored: `huihui-ai` (models tagged `*-abliterated`) is the active publisher in 2026; MoE models abliterate cleanly with minimal quality loss.
The M4 Max with 64GB is in a specific sweet spot:
- Memory ceiling: ~55GB usable for model weights (leaving OS + context + apps).
- Bandwidth: ~546 GB/s — the number that actually determines tokens/sec, not raw RAM.
- Rule of thumb: At Q4 quantization, budget ~0.6GB per billion parameters. A 70B model = ~42GB. A 35B model = ~18-22GB.
- KV cache: Budget ~1GB per 32K tokens of context on top of model weights. This matters more than people expect at the 200K+ context windows Qwen3 supports.
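Those two rules of thumb combine into a quick fit check. A minimal sketch (the 0.6 GB per billion parameters and 1 GB per 32K tokens figures are the estimates above, not exact numbers; `usable_gb` assumes the ~55GB ceiling):

```python
def fits(params_b: float, context_tokens: int, usable_gb: float = 55.0) -> bool:
    """Rule-of-thumb memory check for a Q4 model on a 64GB M4 Max."""
    weights_gb = 0.6 * params_b        # ~0.6GB per billion params at Q4
    kv_gb = context_tokens / 32_000    # ~1GB per 32K tokens of KV cache
    return weights_gb + kv_gb <= usable_gb

print(fits(70, 32_000))    # 70B dense, 32K context: 42 + 1 = 43GB -> True
print(fits(70, 200_000))   # 70B, 200K context: 42 + 6.25 = 48.25GB -> True
print(fits(122, 32_000))   # 122B: 73.2GB of weights alone -> False
```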
The Ollama 0.19 release added a native MLX backend that auto-activates on 32GB+ Macs, so Apple Silicon performance "changed overnight" across the ecosystem when users upgraded frontends.
| Model | LM Studio search | RAM | Notes |
|---|---|---|---|
| `unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit` | Qwen3.6 35B | ~22GB | MoE: 35B total, 3B active. Flagship quality at 3B latency. |
| `unsloth/Qwen3.6-27B-UD-MLX-6bit` | Qwen3.6 27B | ~22-24GB | Dense 27B. Near-lossless at 6-bit. "Flagship-level coding" per HN. |
| `unsloth/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit` | Qwen3-Coder 30B | ~18-20GB | Coding-specialized MoE. Native tool-call support. |
| Model | LM Studio search | RAM |
|---|---|---|
| `unsloth/Qwen3.6-27B-MLX-8bit` | Qwen3.6 27B 8bit | ~28-30GB |
| `unsloth/Qwen3.6-35B-A3B-MLX-8bit` | Qwen3.6 35B 8bit | ~32GB |
Too big for 64GB:
- Qwen3.5-122B-A10B-MLX-4bit (~65GB)
- Llama 3.3 70B @ Q8 (~75GB)
- DeepSeek V3/R1 full 670B (needs ~400GB even quantized)
Unsloth Dynamic 2.0 quants calibrate against real-world instruction-following datasets and keep critical layers at higher precision while aggressively quantizing the rest. A UD-4bit often beats a vanilla 6-bit in benchmarks. When you see a -UD- tag, pick it over the non-UD version at the same or lower bit width.
This is the load-bearing insight for your hardware class:
- Dense models activate every parameter on every token. 70B dense at Q4 = 42GB of weights moving through the memory bus per token → ~8-12 tok/s on M4 Max.
- MoE models (like Qwen3.6-35B-A3B) activate only ~3B params per token via a router. Same 22GB sits in RAM, but only ~2-3GB streams per token → 44-52 tok/s on M4 Max.
This is why every "best for Mac" list converges on MoE architectures. Apple Silicon's unified memory + bandwidth profile is exactly right for models that are big in total but narrow in activation.
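The gap falls straight out of a bandwidth-bound estimate: decode speed is capped by how fast the active weights can stream through memory, once per token. A sketch (this is a theoretical ceiling; real throughput lands below it due to compute, KV-cache reads, and overhead, which is why the observed numbers above are lower):

```python
BANDWIDTH_GBPS = 546  # M4 Max unified memory bandwidth

def tokps_ceiling(active_params_b: float, gb_per_b_params: float = 0.6) -> float:
    """Upper bound on decode tok/s: every active weight streams once per token."""
    gb_per_token = active_params_b * gb_per_b_params  # ~0.6GB per billion params at Q4
    return BANDWIDTH_GBPS / gb_per_token

print(round(tokps_ceiling(70)))  # dense 70B at Q4: ceiling ~13 tok/s
print(round(tokps_ceiling(3)))   # MoE with ~3B active: ceiling ~303 tok/s
```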
LM Studio 0.4.1 (shipped earlier in 2026) added a native Anthropic-compatible /v1/messages endpoint. Claude Code is just an Anthropic-API client — it doesn't verify there's a real Claude on the other end. Flip two env vars and it talks to your local model.
```shell
# Start LM Studio, load a model, click "Start Server"
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio
claude
```

Gotcha: `ANTHROPIC_BASE_URL` is global to the shell. Don't export it in `.zshrc` unless you want every Claude Code session on local. A shell function is cleaner:

```shell
claude-local() {
  ANTHROPIC_BASE_URL=http://localhost:1234 \
  ANTHROPIC_AUTH_TOKEN=lmstudio \
  claude "$@"
}
```

Honest assessment: works, but Qwen3-Coder ≠ Claude Opus 4.7. Multi-step agentic tasks (edit → test → debug → retry) expose the gap. Tool-calling on local models has improved in 2026 but remains more fragile. Good for offline/private/travel/grunt work.
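What travels over that redirected connection is ordinary Anthropic Messages API traffic. A sketch of the kind of request Claude Code effectively sends to the local endpoint (the model name and token values are assumptions: LM Studio serves whatever model is loaded, and per the arbitrary token above it does not appear to validate the key):

```python
import json

# Anthropic Messages API payload shape, aimed at LM Studio's /v1/messages.
# Model name is an assumption: use whatever model is loaded in LM Studio.
payload = {
    "model": "qwen/qwen3-coder-30b",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Write a haiku about unified memory."}],
}
headers = {
    "x-api-key": "lmstudio",            # arbitrary, matching ANTHROPIC_AUTH_TOKEN above
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}
# Send with e.g.:
#   requests.post("http://localhost:1234/v1/messages", headers=headers, json=payload)
print(json.dumps(payload, indent=2))
```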
The Cline team publicly recommends Cline + LM Studio + Qwen3-Coder-30B-A3B as "the local coding stack." Setup:
- Load `unsloth/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit` in LM Studio, set context to 262,144, click Start Server.
- Cline (VS Code extension) → Settings → Provider: LM Studio → Model: `qwen/qwen3-coder-30b`.
- Leave the base URL blank (it defaults to `http://localhost:1234`).
Why it beats Option 1 for coding:
- Native tool-call format specifically for Qwen3-Coder — fewer malformed calls.
- Diff-preview UX before applying changes.
- Retry logic tuned for local models.
| Task | Tool |
|---|---|
| Architectural changes, planning, multi-file refactors | Claude Code (real Claude) |
| Boilerplate, small edits, docstrings, test scaffolds | Cline + local Qwen3-Coder |
| Chat / brainstorm / rubber-duck | LM Studio chat UI with Qwen3.6-35B-A3B |
| Offline / on a plane / confidential codebase | Claude Code → LM Studio (Option 1) |
64GB is enough to keep LM Studio loaded with a coder model and use Claude Code normally. Switch based on task weight.
- Abliteration — post-hoc weight surgery that zeros out the "refusal direction" in the model's activations. Cheap, fast; typically degrades reasoning/coding 2-8% because refusal circuits overlap with other capabilities.
- Uncensored fine-tunes (Dolphin, Venice) — actually retrained on unrestricted data. Preserve quality better but cost more to produce.
- Base models (non-instruct) — never had safety training, but also don't follow instructions well.
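The "weight surgery" in abliteration is, at its core, an orthogonal projection: remove the component of each weight matrix's output that points along the estimated refusal direction. A toy sketch with NumPy (in real abliteration the direction is estimated from contrasting activations on harmful vs. harmless prompts; here it is random, purely for illustration):

```python
import numpy as np

def ablate(W: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Project the refusal direction out of a weight matrix's output space."""
    r = refusal_dir / np.linalg.norm(refusal_dir)
    return W - np.outer(r, r) @ W   # W' = (I - r r^T) W

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
r = rng.standard_normal(8)
W_abl = ablate(W, r)

# Every output of the ablated matrix is orthogonal to the refusal direction:
print(np.allclose((r / np.linalg.norm(r)) @ W_abl, 0.0))  # True
```

Because `(I - r r^T)` is a projector, refusal along `r` is removed exactly; the collateral damage described in these notes comes from whatever useful signal also lived along that direction.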
- `huihui-ai` — publishes abliterated versions of nearly every major open-weight release within days. Dominates the category.
- `cognitivecomputations` (Eric Hartford's Dolphin lineage) — older, more respected fine-tune approach.
- Venice — 2026-era uncensored fine-tune family.
| Model | LM Studio search | RAM |
|---|---|---|
| `huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated` | Huihui Qwen3.5 35B | ~22GB @ 4bit |
| `huihui-ai/Qwen3-30B-A3B-abliterated` | Huihui Qwen3 30B | ~18-20GB |
| `huihui_ai/qwen3-coder-abliterated:30b` | Qwen3 Coder abliterated | ~18-20GB |
| `huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated` | (vision + uncensored) | ~20GB |
In dense models, refusal logic is distributed across every layer — abliteration causes more collateral damage. In MoE models like Qwen3-30B-A3B, refusal circuits concentrate in specific experts, so the abliteration "scalpel" does less harm. This is a structural advantage: MoE abliterated variants retain nearly all their coding/reasoning quality.
- Quality hit is real but small (2-5% on reasoning benchmarks). Noticeable on hard agentic tasks, invisible on chat/creative writing.
- Abliteration isn't jailbreaking — it's lobotomizing. The model won't refuse, but may give worse answers on edge cases because refusal was entangled with "carefully check if this request makes sense."
- "Uncensored" ≠ "accurate." Still hallucinates; just less likely to refuse.
- MLX uploads lag GGUF. If the MLX variant isn't up yet, GGUF works in LM Studio — slightly slower but functional.
- The legitimate use case: most people want abliterated models to stop annoying false-positive refusals (villain dialogue, medication dosages, explaining SQL injection for security learning), not to do anything spicy.
LM Studio loads models by folder. Keep "clean" and abliterated variants of the same base side-by-side — ~45GB of disk for both, swap via the model picker. No overwriting, no re-downloads.
If I were setting this machine up today, I'd pull:
- `unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit` — general-purpose daily driver
- `unsloth/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit` — coding model for Cline
- `huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated` (MLX if available, GGUF otherwise) — uncensored variant for when you need it
Total disk: ~60GB. Total RAM used at inference time: ~22GB (one loaded at a time). Leaves 40GB+ for context windows, other apps, and Claude Code running alongside.
Frontend: LM Studio with MLX backend enabled (Settings → Use MLX backend on Apple Silicon).
- Unsloth: Qwen3.6 How to Run Locally
- Unsloth: How to Run Local LLMs with Claude Code
- HF: Qwen3.6-35B-A3B-UD-MLX-4bit
- HF: Qwen3-Coder-30B-A3B MLX 4bit
- HF: Huihui-Qwen3.5-35B-A3B-abliterated
- HF: Qwen3-30B-A3B-abliterated
- LM Studio: Claude Code integration
- LM Studio: Use your models in Claude Code
- Cline: Local coding stack with Qwen3-Coder-30B
- Best Local LLMs to Run on Every Apple Silicon Mac in 2026
- Best Local LLMs for MacBook Pro M4 Max 64GB (2026)
- What to Buy for Local LLMs (April 2026) — Julien Simon
- 20 New Uncensored LLMs Released in March 2026
- HN: Qwen3.6-27B Flagship-Level Coding in 27B Dense
- Qwen3-Coder-Next: 2026 Guide