Decision needed: How should we deliver TTS audio to clients so that seeking (scrubbing forward/backward) works reliably on both iOS and Web?
When a user listens to generated audio, they expect standard playback controls: play, pause, and seek (scrub to any point in the audio). This is trivial with a normal audio file but breaks with streamed audio.
ElevenLabs' streaming endpoint returns audio via chunked transfer encoding — the response carries no `Content-Length` header and doesn't advertise HTTP range-request support (no `Accept-Ranges: bytes`). This causes platform-specific issues:
| Player | Can stream? | Can seek during stream? | Can seek after complete? |
|---|---|---|---|
| `AVAudioPlayer` | No — requires complete `Data` object | N/A | Yes, via `currentTime` |
| `AVPlayer` + `AVURLAsset` | Yes | Unreliable — no known duration or byte offsets, `seekToTime:` fails or jumps incorrectly | Only if audio has proper duration metadata |
Bottom line: iOS cannot reliably seek in a chunked HTTP audio stream. AVAudioPlayer is the most reliable path but requires the full file before playback starts.
| Approach | Can stream? | Can seek during stream? | Can seek after complete? |
|---|---|---|---|
| `<audio src="url">` | Yes (progressive) | No — browser can't calculate seek position without `Content-Length` / range support | Only if server supports range requests |
| `MediaSource` API | Yes — manual chunk appending | Partially — can seek within buffered range | Yes, once all chunks appended and duration set |
Bottom line: Web has a workaround via MediaSource API but it adds significant client complexity. A normal audio file with proper headers works out of the box.
```
Client requests TTS
  → Backend calls ElevenLabs (stream or standard endpoint)
  → Backend buffers entire response
  → Backend responds with complete audio file
      (Content-Length, Accept-Ranges: bytes)
  → Client plays with full seeking support
```
How it works: Backend calls ElevenLabs, collects the full audio response, then serves it to the client as a standard file download. The client gets a normal audio URL — seeking works everywhere with zero client-side complexity.
| Dimension | Assessment |
|---|---|
| Seeking | Full support on all platforms, no special client code |
| Time to first audio | User waits for full generation. ~1s for a paragraph, ~3-5s for a long summary, ~30s+ for a full document |
| Client complexity | Minimal — standard audio playback |
| Backend complexity | Low — buffer response, serve with proper headers |
| Best for | Short-to-medium text where 1-5s wait is acceptable |
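The buffer-then-serve step is small enough to sketch. This assumes a Node/TypeScript backend (the memo doesn't specify one); `seekableAudioHeaders`, `concatChunks`, and `bufferAll` are illustrative names, not a real framework API:

```typescript
// Approach A sketch: buffer the whole upstream audio response, then serve
// it as a normal file. The headers below are what make seeking work.

/** Headers that make a fully buffered MP3 response seekable for clients. */
function seekableAudioHeaders(byteLength: number): Record<string, string> {
  return {
    "Content-Type": "audio/mpeg",
    "Content-Length": String(byteLength), // lets players compute seek offsets
    "Accept-Ranges": "bytes",             // advertises range-request support
  };
}

/** Join buffered chunks into one contiguous byte array. */
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

/** Drain an entire upstream stream (e.g. the ElevenLabs response body). */
async function bufferAll(stream: AsyncIterable<Uint8Array>): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  for await (const chunk of stream) chunks.push(chunk);
  return concatChunks(chunks);
}
```

Once `bufferAll` resolves, the handler responds with the bytes plus `seekableAudioHeaders(bytes.length)` — from the client's perspective it's an ordinary static audio file.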
```
Client requests TTS
  → Backend proxies ElevenLabs stream directly to client
  → Client plays audio as chunks arrive
  → No seeking available (or unreliable)
```
How it works: Backend pipes the ElevenLabs stream straight through to the client. Audio starts playing almost immediately but seeking doesn't work.
| Dimension | Assessment |
|---|---|
| Seeking | Not supported (or unreliable on iOS) |
| Time to first audio | Near-instant (~75ms with Flash v2.5) |
| Client complexity | Low for playback, but users can't scrub |
| Backend complexity | Low — stream passthrough |
| Best for | Cases where seeking isn't needed (e.g., short snippets, notifications) |
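The passthrough itself is little more than a loop. A minimal sketch under the same assumed Node/TypeScript backend; `proxyStream` and its `send` callback are illustrative stand-ins for writing to the HTTP response:

```typescript
// Approach B sketch: relay upstream chunks to the client as they arrive.
// No buffering, no Content-Length, hence no reliable seeking.

async function proxyStream(
  upstream: AsyncIterable<Uint8Array>,
  send: (chunk: Uint8Array) => void,
): Promise<number> {
  let forwarded = 0;
  for await (const chunk of upstream) {
    send(chunk);                // client can start playback on the first chunk
    forwarded += chunk.length;
  }
  return forwarded;             // total bytes relayed, useful for logging
}
```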
```
Client requests TTS
  → Backend calls ElevenLabs stream
  → Backend receives chunks, assembles them into a growing audio file
  → After enough chunks for ~N seconds of audio:
      Backend sends partial file to client (as a proper file with Content-Length)
      Client starts playback — can seek within this chunk
  → Backend continues assembling chunks
  → Periodically (or on completion), client fetches updated/complete file
  → Seeking range grows as more audio arrives
```
How it works: Instead of streaming raw bytes to the client, the backend collects chunks into valid audio files and serves them progressively. The client always has a proper audio file — it just grows over time. This avoids the "dual source handoff" problem because there's only ever one file, it just gets longer.
Platform implementation:
- iOS: `AVAudioPlayer` with `Data` — reload the player with updated data as more chunks arrive. Seeking works within the received range via `currentTime`. Need to manage playback position across reloads.
- Web: `MediaSource` API — append audio buffers as they arrive. Seeking works within the buffered range. The browser handles this natively once set up.
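One practical detail of the `MediaSource` path: a seek must land inside a buffered range, or playback stalls. A tiny illustrative helper (the `isSeekable` name and `[start, end]` pairs are assumptions, mirroring the shape of the DOM `TimeRanges` object):

```typescript
// Before setting audio.currentTime = target on a MediaSource-backed player,
// confirm the target falls inside some buffered range (seconds).

function isSeekable(ranges: Array<[number, number]>, target: number): boolean {
  return ranges.some(([start, end]) => target >= start && target <= end);
}
```

In the browser, the `ranges` argument would be built by walking `audio.buffered.start(i)` / `audio.buffered.end(i)`.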
| Dimension | Assessment |
|---|---|
| Seeking | Yes — within received audio from the start, full range once complete |
| Time to first audio | Fast — playback starts after first chunk (~1-2s of audio buffered) |
| Client complexity | Medium — must handle growing audio source, track playback position across updates |
| Backend complexity | Medium — must assemble valid audio files from chunks, manage chunk boundaries |
| Best for | When you want both fast start and seeking without the complexity of a full dual-source hybrid |
Key challenge: Audio codecs have specific framing requirements. MP3 frames are self-contained (~26ms each), so you can concatenate chunks and get a valid file. Other formats (AAC, Opus) may need container headers rewritten. Sticking with MP3 keeps this approach viable.
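The frame-alignment concern can be made concrete. An MP3 frame header starts with an 11-bit sync word (`0xFF` followed by a byte whose top three bits are set). A deliberately simplified sketch that cuts a growing buffer at the last sync word so only whole frames are served; a production version would parse each frame header to compute exact frame lengths instead of just locating sync bytes:

```typescript
// Find the byte offset of the last MP3 frame-sync word in the buffer,
// or -1 if none has arrived yet.
function lastFrameSyncIndex(buf: Uint8Array): number {
  for (let i = buf.length - 2; i >= 0; i--) {
    if (buf[i] === 0xff && (buf[i + 1] & 0xe0) === 0xe0) return i;
  }
  return -1;
}

// Serve everything before the last sync word, dropping the possibly
// incomplete trailing frame. Returns an empty slice until at least one
// whole frame is available.
function servableSlice(buf: Uint8Array): Uint8Array {
  const cut = lastFrameSyncIndex(buf);
  return cut <= 0 ? buf.subarray(0, 0) : buf.subarray(0, cut);
}
```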
```
Client requests TTS
  → Backend calls ElevenLabs stream
  → Backend simultaneously:
      1. Forwards raw chunks to client for immediate playback
      2. Buffers chunks to build complete file
  → Once complete, backend signals client that full file is available
  → Client switches from stream to seekable file source
```
How it works: Client starts playing immediately from the raw stream (no seeking). In the background, the backend assembles the full file. Once ready, the client seamlessly switches to the complete file. Seeking becomes available after the full audio has been received.
| Dimension | Assessment |
|---|---|
| Seeking | Available after full audio loads (delayed) |
| Time to first audio | Near-instant |
| Client complexity | High — must handle stream-to-file transition, dual playback sources, seamless handoff without audio glitch |
| Backend complexity | Medium — must track buffer state, notify client of completion, serve both stream and file |
| Best for | Long content where both instant playback and eventual seeking are important |
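One piece of the handoff can at least be pinned down as a pure function: where to resume on the file source. An illustrative sketch (`handoffSeekPosition` is a made-up name); the genuinely hard part, swapping sources without an audible glitch, is deliberately left out:

```typescript
// When switching from the raw stream to the complete file, resume at the
// elapsed stream position, clamped so we never seek past the file's end.
function handoffSeekPosition(elapsed: number, fileDuration: number): number {
  if (!Number.isFinite(elapsed) || elapsed < 0) return 0;
  return Math.min(elapsed, fileDuration);
}
```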
| Factor | A: Full File | B: Pure Stream | C: Progressive Chunks | D: Dual-Source Hybrid |
|---|---|---|---|---|
| Backend implementation | Simple | Simple | Medium | Medium |
| iOS client work | None (standard `AVAudioPlayer`) | Medium (stream handling) | Medium (`AVAudioPlayer` reload) | High (dual source, handoff) |
| Web client work | None (standard `<audio>`) | Low (no seeking UX) | Medium (`MediaSource` setup) | High (`MediaSource` + transition) |
| Seeking works? | Yes, always (full range) | No | Yes, within received range | Yes, after full load |
| Time to first audio | Delayed (1-30s) | Instant (~75ms) | Fast (~1-2s) | Instant (~75ms) |
| Edge cases | Timeout on very long text | Users frustrated by no scrubbing | Playback position tracking across chunk updates; MP3 frame alignment | Audio glitch during handoff, race conditions |
| Total effort estimate | Small | Small | Medium (~1.5x of A) | Large (2-3x of A) |
- How long is the typical text we're converting? If it's mostly short content (a few paragraphs), Approach A's wait time is negligible and the simplicity wins. If we're reading entire documents aloud, the wait becomes painful.
- Is seeking a hard requirement for V1? If users primarily listen start-to-finish (like a podcast), Approach B works and is the simplest. If they need to scrub (like reviewing a specific section), seeking is essential.
- Is "fast start + seeking" worth the extra complexity? Approach C (progressive chunks) gives both with moderate complexity. It's significantly simpler than D because there's no dual-source handoff — just one growing file.
- Are there UX patterns we can use to mask the wait? For Approach A, a progress indicator ("Generating audio...") or pre-generating audio when the user opens a document could reduce perceived latency.
Start with Approach A (Full Audio File) for V1, with a clear upgrade path to Approach C (Progressive Chunks) if latency becomes an issue.
Why A first:
- Seeking works perfectly on both iOS and Web with zero client-side complexity
- The backend is straightforward — buffer and serve with proper headers
- For typical text lengths (paragraphs to summaries), the generation wait is 1-5 seconds — acceptable with a loading indicator
- It's the cleanest foundation to build on
Why C is the natural next step (not D):
- Joey's insight is right — assembling chunks into a growing seekable file is cleaner than the dual-source handoff in D
- C avoids the hardest problem in D (seamless audio source switching without glitches)
- C gives seeking from the moment playback starts, not just after full load
- MP3's self-contained frame structure makes progressive assembly straightforward
- Effort is ~1.5x of A, vs 2-3x of A for D
For long documents in V1, we can mitigate wait time by splitting text into sections and generating each as a separate audio file — the user starts listening to section 1 while sections 2+ generate in the background.
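That sectioning mitigation is mostly a text-splitting problem. A sketch assuming paragraph-delimited input; `splitIntoSections` is a hypothetical helper and the `maxChars` budget is illustrative, not an ElevenLabs limit:

```typescript
// Split long text at paragraph boundaries into sections of at most
// maxChars characters, so each section can become its own TTS request.
function splitIntoSections(text: string, maxChars = 1500): string[] {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const sections: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (candidate.length > maxChars && current) {
      sections.push(current); // start a new section rather than overflow
      current = p;
    } else {
      current = candidate;
    }
  }
  if (current) sections.push(current);
  return sections;
}
```

The client plays section 1 as soon as it's ready while the backend generates the rest in order; a single paragraph longer than `maxChars` still becomes its own (oversized) section in this sketch.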