High-quality neural TTS running locally via MLX. No API key, no cloud, no per-character cost. Uses mlx-community/Kokoro-82M-bf16.
This gist includes:
| File | Purpose |
|---|---|
| `speak.sh` | Wrapper script — sensible defaults, stdin support, MP3 output |
| `voice-blend.py` | Blend multiple stock voices into a custom voice |
| `SKILL.md` | Claude Code skill definition (optional — only if you use Claude Code) |
- Apple Silicon Mac (M1 or newer)
- Python 3.10+ with `pip3`
- `ffmpeg` (optional — only needed for MP3 output)
```bash
pip3 install mlx-audio "misaki[en]" num2words pathvalidate safetensors numpy
```

Optional, for MP3:

```bash
brew install ffmpeg
```

Fetch the wrapper and make it executable:

```bash
mkdir -p ~/bin
curl -L <gist-raw-url>/speak.sh -o ~/bin/speak.sh
chmod +x ~/bin/speak.sh
# make sure ~/bin is on PATH, or copy to /usr/local/bin
```

Then run it once:

```bash
speak.sh "Hello world" -v am_onyx
```

First invocation downloads mlx-community/Kokoro-82M-bf16 (~160 MB) and the voice pack from prince-canuma/Kokoro-82M into `~/.cache/huggingface/`. Subsequent runs are fully offline.
Note: `speak.sh` defaults to voice `riker`, which is a custom blend. If you haven't built it yet (see below), pass `-v am_onyx` or another stock voice on first runs.
```bash
speak.sh "Hello Captain"                    # plays audio
speak.sh "Save this" -o ~/out.mp3           # write MP3, no playback
speak.sh -f script.txt -o ~/narration.wav   # read from file
cat text.txt | speak.sh -o ~/out.wav        # from stdin
speak.sh --voices                           # list available voices
speak.sh -s 1.2 "Faster"                    # speed multiplier
```

Stdout is the output file path — nothing else unless `--verbose`. Safe to pipe.
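Since stdout carries only the path, it is easy to capture in scripts. For example, with macOS's built-in `afplay`:

```bash
# Generate silently, then play the file stdout points at
out=$(speak.sh "Build finished" -o ~/notify.mp3)
afplay "$out"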
| Flag | Description |
|---|---|
| `-v` | Voice preset (default: `riker`) |
| `-o` | Output file (`.wav` or `.mp3`) — skips playback unless `--play` |
| `-f` | Read text from file |
| `-s` | Speed multiplier (default: 1.0) |
| `--play` | Play audio even when `-o` is set |
| `--voices` | List all available voices |
| `--verbose` | Show engine output |
Voice names follow `{lang}{gender}_{name}` — `a` = American, `b` = British, `f` = female, `m` = male. Examples: `am_onyx`, `am_fenrir`, `bm_daniel`, `bm_george`, `bf_emma`, `af_heart`. Run `speak.sh --voices` after your first run to see everything cached locally.
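To audition several voices in one go, loop over the `-v` flag:

```bash
for v in am_onyx am_fenrir bm_daniel bf_emma; do
  speak.sh "This is voice $v" -v "$v"
done
```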
`voice-blend.py` linearly combines voice embeddings from Kokoro's voice pack; a sketch of the idea follows the usage example below.

Open `voice-blend.py` and update the `VOICES_DIR` constant at the top to match your HuggingFace cache — the path contains your username and the snapshot hash:

```
~/.cache/huggingface/hub/models--prince-canuma--Kokoro-82M/snapshots/<hash>/voices
```

After your first `speak.sh` run, you can find the exact path with:

```bash
ls ~/.cache/huggingface/hub/models--prince-canuma--Kokoro-82M/snapshots/*/voices
```

Then build a blend and test it:

```bash
chmod +x voice-blend.py
./voice-blend.py --output my_voice am_fenrir:0.5 bm_daniel:0.3 am_onyx:0.2
speak.sh "Testing the new voice" -v my_voice
```

Weights should sum to ~1.0 (the script normalizes them if not).
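For the curious, the blend itself is just a weighted sum of per-voice tensors. Here is a minimal sketch of that idea, assuming PyTorch is available and the voice pack stores each voice as a `.pt` tensor (as the upstream Kokoro repos do); the real `voice-blend.py` additionally handles CLI parsing and validation:

```python
# Sketch only: linear blend of Kokoro voice tensors.
# Assumes PyTorch and .pt voice files; VOICES_DIR is a placeholder.
from pathlib import Path
import torch

VOICES_DIR = Path("/path/to/snapshots/<hash>/voices")  # set to your HF cache dir

def blend(specs: dict[str, float], out_name: str) -> None:
    total = sum(specs.values())  # normalize so weights sum to 1.0
    mixed = sum(
        torch.load(VOICES_DIR / f"{name}.pt", weights_only=True) * (w / total)
        for name, w in specs.items()
    )
    torch.save(mixed, VOICES_DIR / f"{out_name}.pt")

blend({"am_fenrir": 0.5, "bm_daniel": 0.3, "am_onyx": 0.2}, "my_voice")
```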
If you use Claude Code, drop `SKILL.md` into a skill directory so agents can call `speak.sh` automatically:

```bash
mkdir -p ~/.claude/skills/kokoro-tts
cp SKILL.md ~/.claude/skills/kokoro-tts/SKILL.md
```

The skill triggers on phrases like "speak this", "text to speech", "generate audio".
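If you want to adapt the skill rather than copy it verbatim, a Claude Code skill is a markdown file with YAML frontmatter. A minimal illustrative shape (not the gist's actual `SKILL.md`):

```markdown
---
name: kokoro-tts
description: Local text-to-speech via speak.sh. Use when asked to "speak this",
  run "text to speech", or "generate audio".
---

Run `speak.sh "<text>" -v am_onyx -o <file>.mp3` and return the printed file path.
```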
- "Model not yet downloaded" — run `speak.sh "hello" -v am_onyx` once to trigger the download.
- MP3 output fails — install ffmpeg (`brew install ffmpeg`).
- `mlx_audio.tts.generate` not found — `pip3 install mlx-audio` didn't land it on your PATH. The script looks under `/Library/Frameworks/Python.framework/Versions/3.12/bin/` as a fallback; adjust if your Python lives elsewhere. (See the check below.)
- Voice not found after blending — confirm `VOICES_DIR` in `voice-blend.py` matches your actual HF cache snapshot dir.
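For the PATH issue, two quick checks show where (or whether) the entry point landed; the framework path below mirrors the script's fallback and is illustrative:

```bash
command -v mlx_audio.tts.generate || echo "not on PATH"
# Inspect the framework-Python fallback location (adjust the version)
ls /Library/Frameworks/Python.framework/Versions/3.12/bin/ | grep -i mlx
```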