Decision needed: How should we deliver TTS audio to clients so that seeking (scrubbing forward/backward) works reliably on both iOS and Web?
When a user listens to generated audio, they expect standard playback controls: play, pause, and seek (scrub to any point in the audio). This is trivial with a normal audio file but breaks with streamed audio.
ElevenLabs' streaming endpoint returns audio via chunked transfer encoding — the response carries no `Content-Length` header and doesn't advertise HTTP range-request support (no `Accept-Ranges: bytes`). This causes platform-specific issues:
| Player | Can stream? | Can seek during stream? | Can seek after complete? |
|---|---|---|---|
| `AVAudioPlayer` | No — requires complete `Data` object | N/A | Yes, via `currentTime` |
| `AVPlayer` + `AVURLAsset` | Yes | Unreliable — no known duration or byte offsets, `seekToTime:` fails or jumps incorrectly | Only if audio has proper duration metadata |
Bottom line: iOS cannot reliably seek in a chunked HTTP audio stream. AVAudioPlayer is the most reliable path but requires the full file before playback starts.
| Approach | Can stream? | Can seek during stream? | Can seek after complete? |
|---|---|---|---|
| `<audio src="url">` | Yes (progressive) | No — browser can't calculate seek position without `Content-Length` / range support | Only if server supports range requests |
| `MediaSource` API | Yes — manual chunk appending | Partially — can seek within buffered range | Yes, once all chunks appended and duration set |
Bottom line: Web has a workaround via MediaSource API but it adds significant client complexity. A normal audio file with proper headers works out of the box.
```
Client requests TTS
  → Backend calls ElevenLabs (stream or standard endpoint)
  → Backend buffers entire response
  → Backend responds with complete audio file
      (Content-Length, Accept-Ranges: bytes)
  → Client plays with full seeking support
```
How it works: Backend calls ElevenLabs, collects the full audio response, then serves it to the client as a standard file download. The client gets a normal audio URL — seeking works everywhere with zero client-side complexity.
| Dimension | Assessment |
|---|---|
| Seeking | Full support on all platforms, no special client code |
| Time to first audio | User waits for full generation. ~1s for a paragraph, ~3-5s for a long summary, ~30s+ for a full document |
| Client complexity | Minimal — standard audio playback |
| Backend complexity | Low — buffer response, serve with proper headers |
| Best for | Short-to-medium text where 1-5s wait is acceptable |
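The buffer-then-serve step is small enough to sketch. This assumes a Node/TypeScript backend (the memo doesn't specify one); `seekableAudioHeaders`, `concatChunks`, and `bufferAll` are illustrative names, not a real framework API:

```typescript
// Approach A sketch: buffer the whole upstream audio response, then serve
// it as a normal file. The headers below are what make seeking work.

/** Headers that make a fully buffered MP3 response seekable for clients. */
function seekableAudioHeaders(byteLength: number): Record<string, string> {
  return {
    "Content-Type": "audio/mpeg",
    "Content-Length": String(byteLength), // lets players compute seek offsets
    "Accept-Ranges": "bytes",             // advertises range-request support
  };
}

/** Join buffered chunks into one contiguous byte array. */
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

/** Drain an entire upstream stream (e.g. the ElevenLabs response body). */
async function bufferAll(stream: AsyncIterable<Uint8Array>): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  for await (const chunk of stream) chunks.push(chunk);
  return concatChunks(chunks);
}
```

Once `bufferAll` resolves, the handler responds with the bytes plus `seekableAudioHeaders(bytes.length)` — from the client's perspective it's an ordinary static audio file.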
```
Client requests TTS
  → Backend proxies ElevenLabs stream directly to client
  → Client plays audio as chunks arrive
  → No seeking available (or unreliable)
```
How it works: Backend pipes the ElevenLabs stream straight through to the client. Audio starts playing almost immediately but seeking doesn't work.
| Dimension | Assessment |
|---|---|
| Seeking | Not supported (or unreliable on iOS) |
| Time to first audio | Near-instant (~75ms with Flash v2.5) |
| Client complexity | Low for playback, but users can't scrub |
| Backend complexity | Low — stream passthrough |
| Best for | Cases where seeking isn't needed (e.g., short snippets, notifications) |
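The passthrough itself is little more than a loop. A minimal sketch under the same assumed Node/TypeScript backend; `proxyStream` and its `send` callback are illustrative stand-ins for writing to the HTTP response:

```typescript
// Approach B sketch: relay upstream chunks to the client as they arrive.
// No buffering, no Content-Length, hence no reliable seeking.

async function proxyStream(
  upstream: AsyncIterable<Uint8Array>,
  send: (chunk: Uint8Array) => void,
): Promise<number> {
  let forwarded = 0;
  for await (const chunk of upstream) {
    send(chunk);                // client can start playback on the first chunk
    forwarded += chunk.length;
  }
  return forwarded;             // total bytes relayed, useful for logging
}
```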
```
Client requests TTS
  → Backend calls ElevenLabs stream
  → Backend receives chunks, assembles them into a growing audio file
  → After enough chunks for ~N seconds of audio:
      Backend sends partial file to client (as a proper file with Content-Length)
      Client starts playback — can seek within this chunk
  → Backend continues assembling chunks
  → Periodically (or on completion), client fetches updated/complete file
  → Seeking range grows as more audio arrives
```
How it works: Instead of streaming raw bytes to the client, the backend collects chunks into valid audio files and serves them progressively. The client always has a proper audio file — it just grows over time. This avoids the "dual source handoff" problem because there's only ever one file, it just gets longer.
Platform implementation:
- iOS: `AVAudioPlayer` with `Data` — reload the player with updated data as more chunks arrive. Seeking works within the received range via `currentTime`. Need to manage playback position across reloads.
- Web: `MediaSource` API — append audio buffers as they arrive. Seeking works within the buffered range. The browser handles this natively once set up.
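One practical detail of the `MediaSource` path: a seek must land inside a buffered range, or playback stalls. A tiny illustrative helper (the `isSeekable` name and `[start, end]` pairs are assumptions, mirroring the shape of the DOM `TimeRanges` object):

```typescript
// Before setting audio.currentTime = target on a MediaSource-backed player,
// confirm the target falls inside some buffered range (seconds).

function isSeekable(ranges: Array<[number, number]>, target: number): boolean {
  return ranges.some(([start, end]) => target >= start && target <= end);
}
```

In the browser, the `ranges` argument would be built by walking `audio.buffered.start(i)` / `audio.buffered.end(i)`.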
| Dimension | Assessment |
|---|---|
| Seeking | Yes — within received audio from the start, full range once complete |
| Time to first audio | Fast — playback starts after first chunk (~1-2s of audio buffered) |
| Client complexity | Medium — must handle growing audio source, track playback position across updates |
| Backend complexity | Medium — must assemble valid audio files from chunks, manage chunk boundaries |
| Best for | When you want both fast start and seeking without the complexity of a full dual-source hybrid |
Key challenge: Audio codecs have specific framing requirements. MP3 frames are self-contained (~26ms each), so you can concatenate chunks and get a valid file. Other formats (AAC, Opus) may need container headers rewritten. Sticking with MP3 keeps this approach viable.
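The frame-alignment concern can be made concrete. An MP3 frame header starts with an 11-bit sync word (`0xFF` followed by a byte whose top three bits are set). A deliberately simplified sketch that cuts a growing buffer at the last sync word so only whole frames are served; a production version would parse each frame header to compute exact frame lengths instead of just locating sync bytes:

```typescript
// Find the byte offset of the last MP3 frame-sync word in the buffer,
// or -1 if none has arrived yet.
function lastFrameSyncIndex(buf: Uint8Array): number {
  for (let i = buf.length - 2; i >= 0; i--) {
    if (buf[i] === 0xff && (buf[i + 1] & 0xe0) === 0xe0) return i;
  }
  return -1;
}

// Serve everything before the last sync word, dropping the possibly
// incomplete trailing frame. Returns an empty slice until at least one
// whole frame is available.
function servableSlice(buf: Uint8Array): Uint8Array {
  const cut = lastFrameSyncIndex(buf);
  return cut <= 0 ? buf.subarray(0, 0) : buf.subarray(0, cut);
}
```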
```
Client requests TTS
  → Backend calls ElevenLabs stream
  → Backend simultaneously:
      1. Forwards raw chunks to client for immediate playback
      2. Buffers chunks to build complete file
  → Once complete, backend signals client that full file is available
  → Client switches from stream to seekable file source
```
How it works: Client starts playing immediately from the raw stream (no seeking). In the background, the backend assembles the full file. Once ready, the client seamlessly switches to the complete file. Seeking becomes available after the full audio has been received.
| Dimension | Assessment |
|---|---|
| Seeking | Available after full audio loads (delayed) |
| Time to first audio | Near-instant |
| Client complexity | High — must handle stream-to-file transition, dual playback sources, seamless handoff without audio glitch |
| Backend complexity | Medium — must track buffer state, notify client of completion, serve both stream and file |
| Best for | Long content where both instant playback and eventual seeking are important |
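One piece of the handoff can at least be pinned down as a pure function: where to resume on the file source. An illustrative sketch (`handoffSeekPosition` is a made-up name); the genuinely hard part, swapping sources without an audible glitch, is deliberately left out:

```typescript
// When switching from the raw stream to the complete file, resume at the
// elapsed stream position, clamped so we never seek past the file's end.
function handoffSeekPosition(elapsed: number, fileDuration: number): number {
  if (!Number.isFinite(elapsed) || elapsed < 0) return 0;
  return Math.min(elapsed, fileDuration);
}
```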
| Factor | A: Full File | B: Pure Stream | C: Progressive Chunks | D: Dual-Source Hybrid |
|---|---|---|---|---|
| Backend implementation | Simple | Simple | Medium | Medium |
| iOS client work | None (standard `AVAudioPlayer`) | Medium (stream handling) | Medium (`AVAudioPlayer` reload) | High (dual source, handoff) |
| Web client work | None (standard `<audio>`) | Low (no seeking UX) | Medium (`MediaSource` setup) | High (`MediaSource` + transition) |
| Seeking works? | Yes, always (full range) | No | Yes, within received range | Yes, after full load |
| Time to first audio | Delayed (1-30s) | Instant (~75ms) | Fast (~1-2s) | Instant (~75ms) |
| Edge cases | Timeout on very long text | Users frustrated by no scrubbing | Playback position tracking across chunk updates; MP3 frame alignment | Audio glitch during handoff, race conditions |
| Total effort estimate | Small | Small | Medium (~1.5x of A) | Large (2-3x of A) |
- How long is the typical text we're converting? If it's mostly short content (a few paragraphs), Approach A's wait time is negligible and the simplicity wins. If we're reading entire documents aloud, the wait becomes painful.
- Is seeking a hard requirement for V1? If users primarily listen start-to-finish (like a podcast), Approach B works and is the simplest. If they need to scrub (like reviewing a specific section), seeking is essential.
- Is "fast start + seeking" worth the extra complexity? Approach C (progressive chunks) gives both with moderate complexity. It's significantly simpler than D because there's no dual-source handoff — just one growing file.
- Are there UX patterns we can use to mask the wait? For Approach A, a progress indicator ("Generating audio...") or pre-generating audio when the user opens a document could reduce perceived latency.
Start with Approach A (Full Audio File) for V1, with a clear upgrade path to Approach C (Progressive Chunks) if latency becomes an issue.
Why A first:
- Seeking works perfectly on both iOS and Web with zero client-side complexity
- The backend is straightforward — buffer and serve with proper headers
- For typical text lengths (paragraphs to summaries), the generation wait is 1-5 seconds — acceptable with a loading indicator
- It's the cleanest foundation to build on
Why C is the natural next step (not D):
- Joey's insight is right — assembling chunks into a growing seekable file is cleaner than the dual-source handoff in D
- C avoids the hardest problem in D (seamless audio source switching without glitches)
- C gives seeking from the moment playback starts, not just after full load
- MP3's self-contained frame structure makes progressive assembly straightforward
- Effort is ~1.5x of A, vs 2-3x of A for D
For long documents in V1, we can mitigate wait time by splitting text into sections and generating each as a separate audio file — the user starts listening to section 1 while sections 2+ generate in the background.
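That sectioning mitigation is mostly a text-splitting problem. A sketch assuming paragraph-delimited input; `splitIntoSections` is a hypothetical helper and the `maxChars` budget is illustrative, not an ElevenLabs limit:

```typescript
// Split long text at paragraph boundaries into sections of at most
// maxChars characters, so each section can become its own TTS request.
function splitIntoSections(text: string, maxChars = 1500): string[] {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const sections: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (candidate.length > maxChars && current) {
      sections.push(current); // start a new section rather than overflow
      current = p;
    } else {
      current = candidate;
    }
  }
  if (current) sections.push(current);
  return sections;
}
```

The client plays section 1 as soon as it's ready while the backend generates the rest in order; a single paragraph longer than `maxChars` still becomes its own (oversized) section in this sketch.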