💠🌐 Absolutely — here’s a PC/RTX 5070 version of the Qwen3-TTS setup guide, in the same “ready to hand to a local AI agent (or follow yourself)” format you loved for XTTS-v2.
We’ll cover:
- Environment
- Install & dependencies
- Running Qwen3-TTS
- Performance tips for RTX
- Extras (chunking, voice options)
- Quick troubleshooting
This guide gets you:
- Qwen3-TTS model running locally
- On a PC with an RTX 5070 GPU
- Using Python + PyTorch + CUDA
- Good quality English TTS
⚠️ Not real-time by default — output batches audio. It’s for offline generation or near-interactive narrative, not live call-and-response.
- ✔️ Windows (10/11) or Linux
- ✔️ NVIDIA RTX 5070 GPU
- ✔️ CUDA toolkit compatible with your driver (e.g., CUDA 12.x)
- ✔️ Python 3.10 or 3.11
- ✔️ 12+ GB VRAM helps, but 8 GB is workable
💡 Use a virtual environment so dependencies don’t conflict:
```shell
python3.11 -m venv qwen3tts
source qwen3tts/bin/activate   # Linux/macOS
qwen3tts\Scripts\activate      # Windows
pip install --upgrade pip setuptools wheel
```

Install PyTorch that matches your CUDA version from PyTorch's official channel. The `cu128` suffix below is an example; an RTX 5070 (Blackwell) generally needs a recent PyTorch build with CUDA 12.8 wheels, so pick the wheel index that matches your toolkit:

```shell
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

Check CUDA availability:
```shell
python - << 'EOF'
import torch
print("CUDA available:", torch.cuda.is_available())
print("CUDA device count:", torch.cuda.device_count())
EOF
```

Expect:

```
CUDA available: True
```
You’ll need:
- transformers
- accelerate
- soundfile
- numpy
- possibly other repo-specific dependencies
Install:

```shell
pip install transformers accelerate soundfile numpy scipy
```

Clone the repo (example name, adjust if different):

```shell
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
```

Install any internal requirements:

```shell
pip install -r requirements.txt
```

Below is a minimal script to generate audio.
Create generate_qwen3.py:
```python
import torch
import soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = "cuda" if torch.cuda.is_available() else "cpu"

# Model ID as used in this guide -- confirm the exact name on the model card
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-TTS")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Qwen/Qwen3-TTS")
model.to(device)
model.eval()

text = "This is Qwen three text to speech running on an RTX 5070."
inputs = processor(text=text, return_tensors="pt").to(device)

with torch.no_grad():
    generated_audio = model.generate(**inputs)

# Drop the batch dimension before writing; 24 kHz sample rate per this guide
audio_data = generated_audio.squeeze().cpu().numpy()
sf.write("qwen3_output.wav", audio_data, 24000)
```

Run it:

```shell
python generate_qwen3.py
```

Play the output:
```shell
# Windows
start qwen3_output.wav

# macOS
afplay qwen3_output.wav

# Linux
aplay qwen3_output.wav
```

Performance tips for RTX:
- Batch input shorter phrases
- Use mixed precision (FP16) if model supports it
Example for mixed precision:
```python
with torch.autocast(device_type="cuda", dtype=torch.float16):
    generated_audio = model.generate(**inputs)
```

Split long text into sentences for faster partial outputs:
```python
import re

def chunk_text(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text)

chunks = chunk_text(your_long_text)
for i, chunk in enumerate(chunks):
    # Generate audio per chunk, exactly as in the main script
    inputs = processor(text=chunk, return_tensors="pt").to(device)
    with torch.no_grad():
        chunk_audio = model.generate(**inputs)
```

This avoids huge memory spikes.
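Once you have per-chunk audio arrays, you still need to join them into one file. A minimal stitching sketch (the `gap_ms` pause length is a made-up knob, tune to taste):

```python
import numpy as np

def stitch(chunk_audio, sr=24000, gap_ms=120):
    """Join per-chunk waveforms with a short silence between sentences.

    gap_ms is an assumed pause length, not anything from the model.
    """
    gap = np.zeros(int(sr * gap_ms / 1000), dtype=np.float32)
    parts = []
    for a in chunk_audio:
        parts.append(np.asarray(a, dtype=np.float32).ravel())
        parts.append(gap)
    # Drop the trailing gap so the file doesn't end in silence
    return np.concatenate(parts[:-1]) if parts else np.zeros(0, dtype=np.float32)
```

Write the result with `sf.write("qwen3_output.wav", stitch(all_chunks), 24000)` as in the main script.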
| Setup | Behavior |
|---|---|
| RTX 5070 | Best open local TTS perf |
| CPU fallback | Very slow |
| Chunked text | Good trade-off |
| Full paragraph | Slower |
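The table above is qualitative. To compare chunked vs. full-paragraph generation on your own box, a tiny stdlib timing helper (wrap your `model.generate` call in a lambda; `time_call` is a hypothetical helper, not part of any library):

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Return (best wall-clock seconds over a few repeats, last result)."""
    best, result = float("inf"), None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    return best, result
```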
Qwen3-TTS isn’t designed for streaming, so even on RTX it’s batch-oriented now.
Right now Qwen3-TTS workflows often include options like:
- Speaker IDs
- Conditioning input
- Pitch control
Examples often show:

```python
inputs.update({"speaker": speaker_id})
```

Check the model doc / repo for exact syntax.
- Out of memory: shorten text chunks, lower the batch size, or try FP16 generation
- CUDA not available: make sure CUDA and driver versions match, then reinstall the correct PyTorch wheel
- Audio sounds wrong: likely a sample rate mismatch; try resampling to 24 kHz
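If you need to resample, here is a naive linear-interpolation sketch using only NumPy (fine for a sanity check; for production quality prefer a proper polyphase resampler such as `scipy.signal.resample_poly` or torchaudio):

```python
import numpy as np

def resample_to(audio, sr_in, sr_out=24000):
    """Resample a 1-D waveform by linear interpolation (rough sketch)."""
    n_out = int(round(len(audio) * sr_out / sr_in))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio).astype(np.float32)
```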
Since LM Studio doesn't natively handle TTS pipelines, your agent must:
- Generate text
- Save it to a `.txt` file or variable
- Run the Python TTS script
- Return `qwen3_output.wav` to the UI
- (Optional) Stream playback
Example flow:
```
User → Agent → text
        ↓
Agent runs qwen3 script → audio file
        ↓
Audio returned & played
```
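The flow above can be sketched as a small wrapper around the script. The `synthesize` helper and its argv convention are assumptions, not official tooling; adapt the argument handling to however your actual script reads its input:

```python
import pathlib
import subprocess
import sys
import tempfile

def synthesize(text, script="generate_qwen3.py", out_wav="qwen3_output.wav"):
    """Write the agent's text to a temp .txt file, run the TTS script,
    and return the path of the generated wav.

    Assumes the script accepts the text file path as its first argument.
    """
    txt = pathlib.Path(tempfile.gettempdir()) / "agent_text.txt"
    txt.write_text(text, encoding="utf-8")
    result = subprocess.run(
        [sys.executable, script, str(txt)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"TTS script failed: {result.stderr}")
    return pathlib.Path(out_wav)
```

The agent then hands `synthesize(reply_text)` back to the UI for playback.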
| Model | RTX Support | Real-Time | Ease | Best Use |
|---|---|---|---|---|
| Qwen3-TTS | 👍 | ❌ (batch) | 🟡 | High-quality offline |
| XTTS-v2 | 👍 | ✅ | | Live/interactive |
| PersonaPlex | 👍 | | | Expressive prosody experiments |
I can give you:
- ✅ A local HTTP API for Qwen3-TTS
- ✅ A combined XTTS + Qwen3 voice switcher
- ✅ A benchmark script vs ElevenLabs (latency + quality)
- ✅ A prompt-to-speech orchestrator for multimodal agents
Just pick one 😈