@TosinAF
Last active March 31, 2026 20:30
TTS Audio Delivery: Seeking Support on iOS & Web

Decision doc for ElevenLabs integration

Decision needed: How should we deliver TTS audio to clients so that seeking (scrubbing forward/backward) works reliably on both iOS and Web?


The Problem

When a user listens to generated audio, they expect standard playback controls: play, pause, and seek (scrub to any point in the audio). This is trivial with a normal audio file but breaks with streamed audio.

ElevenLabs' streaming endpoint returns audio via chunked transfer encoding — the response has no Content-Length header and doesn't support HTTP range requests (no Accept-Ranges: bytes header). This causes platform-specific issues:

iOS

| Player | Can stream? | Can seek during stream? | Can seek after complete? |
| --- | --- | --- | --- |
| AVAudioPlayer | No — requires complete Data object | N/A | Yes, via currentTime |
| AVPlayer + AVURLAsset | Yes | Unreliable — no known duration or byte offsets; seekToTime: fails or jumps incorrectly | Only if audio has proper duration metadata |

Bottom line: iOS cannot reliably seek in a chunked HTTP audio stream. AVAudioPlayer is the most reliable path but requires the full file before playback starts.

Web

| Approach | Can stream? | Can seek during stream? | Can seek after complete? |
| --- | --- | --- | --- |
| <audio src="url"> | Yes (progressive) | No — browser can't calculate seek position without Content-Length / range support | Only if server supports range requests |
| MediaSource API | Yes — manual chunk appending | Partially — can seek within buffered range | Yes, once all chunks appended and duration set |

Bottom line: Web has a workaround via the MediaSource API, but it adds significant client complexity. A normal audio file with proper headers works out of the box.
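To make the MediaSource caveat concrete: the browser exposes the buffered audio as a list of time ranges, and a seek is only safe when the target falls inside one of them. A minimal sketch of that check, with an illustrative (not library-specific) usage comment:

```typescript
// Decide whether a seek target lies inside the buffered time ranges
// reported by a SourceBuffer (modeled here as [start, end] pairs in seconds).
function canSeekTo(target: number, buffered: Array<[number, number]>): boolean {
  return buffered.some(([start, end]) => target >= start && target <= end);
}

// Browser-side usage sketch (illustrative; runs only in a browser):
//
// const mediaSource = new MediaSource();
// audioEl.src = URL.createObjectURL(mediaSource);
// mediaSource.addEventListener("sourceopen", () => {
//   const sb = mediaSource.addSourceBuffer("audio/mpeg");
//   // append chunks as they arrive: sb.appendBuffer(chunk);
// });
```

The client must gate its scrub UI on a check like this until the full duration is known — exactly the extra complexity the table above refers to.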


The Four Approaches

Approach A: Full Audio File (Wait, Then Play)

Client requests TTS
    → Backend calls ElevenLabs (stream or standard endpoint)
    → Backend buffers entire response
    → Backend responds with complete audio file
        (Content-Length, Accept-Ranges: bytes)
    → Client plays with full seeking support

How it works: Backend calls ElevenLabs, collects the full audio response, then serves it to the client as a standard file download. The client gets a normal audio URL — seeking works everywhere with zero client-side complexity.
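The backend step is just "collect, then serve with the right headers." A framework-agnostic sketch (the function name and shape are illustrative; wire the result into whatever HTTP handler you use):

```typescript
// Approach A: buffer every chunk from the upstream TTS stream into one
// byte array, then serve it as a normal file with a known length.
function assembleAudioResponse(chunks: Uint8Array[]): {
  body: Uint8Array;
  headers: Record<string, string>;
} {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const body = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    body.set(c, offset);
    offset += c.length;
  }
  return {
    body,
    headers: {
      "Content-Type": "audio/mpeg",
      "Content-Length": String(total),
      // Advertise range support so <audio> and AVPlayer can seek freely.
      "Accept-Ranges": "bytes",
    },
  };
}
```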

| Dimension | Assessment |
| --- | --- |
| Seeking | Full support on all platforms, no special client code |
| Time to first audio | User waits for full generation. ~1s for a paragraph, ~3-5s for a long summary, ~30s+ for a full document |
| Client complexity | Minimal — standard audio playback |
| Backend complexity | Low — buffer response, serve with proper headers |
| Best for | Short-to-medium text where a 1-5s wait is acceptable |

Approach B: Pure Streaming (Play Immediately, No Seeking)

Client requests TTS
    → Backend proxies ElevenLabs stream directly to client
    → Client plays audio as chunks arrive
    → No seeking available (or unreliable)

How it works: Backend pipes the ElevenLabs stream straight through to the client. Audio starts playing almost immediately but seeking doesn't work.
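The passthrough can be sketched as a one-line pump over the upstream chunks (here modeled as any async iterable of bytes; in practice it would be the ElevenLabs response body):

```typescript
// Approach B in miniature: forward upstream chunks to the client as they
// arrive, without buffering. Because the total size is never known up
// front, the client receives a chunked response with no Content-Length.
async function* proxyStream(
  upstream: AsyncIterable<Uint8Array>,
): AsyncGenerator<Uint8Array> {
  for await (const chunk of upstream) {
    yield chunk;
  }
}
```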

| Dimension | Assessment |
| --- | --- |
| Seeking | Not supported (or unreliable on iOS) |
| Time to first audio | Near-instant (~75ms with Flash v2.5) |
| Client complexity | Low for playback, but users can't scrub |
| Backend complexity | Low — stream passthrough |
| Best for | Cases where seeking isn't needed (e.g., short snippets, notifications) |

Approach C: Progressive Chunks (Play Early, Seek Within Received Audio)

Client requests TTS
    → Backend calls ElevenLabs stream
    → Backend receives chunks, assembles them into a growing audio file
    → After enough chunks for ~N seconds of audio:
        Backend sends partial file to client (as a proper file with Content-Length)
        Client starts playback — can seek within this chunk
    → Backend continues assembling chunks
    → Periodically (or on completion), client fetches updated/complete file
    → Seeking range grows as more audio arrives

How it works: Instead of streaming raw bytes to the client, the backend collects chunks into valid audio files and serves them progressively. The client always has a proper audio file — it just grows over time. This avoids the "dual source handoff" problem because there's only ever one file, it just gets longer.

Platform implementation:

  • iOS: AVAudioPlayer with Data — reload with updated data as more chunks arrive. Seeking works within the received range via currentTime. Need to manage playback position across reloads.
  • Web: MediaSource API — append audio buffers as they arrive. Seeking works within the buffered range. Browser handles this natively once set up.
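The backend side of this approach reduces to maintaining one growing buffer and serving point-in-time snapshots of it as proper files. A sketch under illustrative names (not a real API):

```typescript
// Approach C: the backend keeps a single growing audio file. Each time the
// client re-fetches, it receives a longer file that is always valid on its
// own (true for MP3, whose frames are self-contained).
class ProgressiveAudioFile {
  private chunks: Uint8Array[] = [];
  private bytes = 0;
  complete = false;

  append(chunk: Uint8Array): void {
    this.chunks.push(chunk);
    this.bytes += chunk.length;
  }

  markComplete(): void {
    this.complete = true;
  }

  // Serve the current state as a proper file with a known Content-Length.
  snapshot(): { body: Uint8Array; contentLength: number; complete: boolean } {
    const body = new Uint8Array(this.bytes);
    let offset = 0;
    for (const c of this.chunks) {
      body.set(c, offset);
      offset += c.length;
    }
    return { body, contentLength: this.bytes, complete: this.complete };
  }
}
```

The `complete` flag is what tells the client it can stop polling for a longer file.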
| Dimension | Assessment |
| --- | --- |
| Seeking | Yes — within received audio from the start, full range once complete |
| Time to first audio | Fast — playback starts after first chunk (~1-2s of audio buffered) |
| Client complexity | Medium — must handle growing audio source, track playback position across updates |
| Backend complexity | Medium — must assemble valid audio files from chunks, manage chunk boundaries |
| Best for | When you want both fast start and seeking without the complexity of a full dual-source hybrid |

Key challenge: Audio codecs have specific framing requirements. MP3 frames are self-contained (~26ms each), so you can concatenate chunks and get a valid file. Other formats (AAC, Opus) may need container headers rewritten. Sticking with MP3 keeps this approach viable.
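Finding MP3 frame boundaries is mechanical: for MPEG-1 Layer III, the frame length in bytes is floor(144 × bitrate / sampleRate) + padding, with all three values read from the 4-byte frame header. A simplified parser (MPEG-1 Layer III only, which is what matters if we standardize on MP3):

```typescript
// Frame length of an MPEG-1 Layer III frame from its 4-byte header, or
// null if the bytes are not a valid frame start. Concatenation is only
// safe on these boundaries; cutting mid-frame yields an invalid file.
const BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320];
const SAMPLE_RATES = [44100, 48000, 32000];

function mp3FrameLength(header: Uint8Array): number | null {
  // 11-bit frame sync, then MPEG-1 (0b11) and Layer III (0b01); the final
  // bit of byte 1 is the CRC flag, which we accept either way.
  if (header[0] !== 0xff || (header[1] & 0xfe) !== 0xfa) return null;
  const bitrateIndex = (header[2] >> 4) & 0x0f;
  const sampleRateIndex = (header[2] >> 2) & 0x03;
  const padding = (header[2] >> 1) & 0x01;
  if (bitrateIndex === 0 || bitrateIndex === 15 || sampleRateIndex === 3) return null;
  const bitrate = BITRATES_KBPS[bitrateIndex] * 1000;
  return Math.floor((144 * bitrate) / SAMPLE_RATES[sampleRateIndex]) + padding;
}
```

For example, a 128 kbps, 44.1 kHz frame works out to 417 bytes without padding.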

Approach D: Hybrid Dual-Source (Stream + Switch to File)

Client requests TTS
    → Backend calls ElevenLabs stream
    → Backend simultaneously:
        1. Forwards raw chunks to client for immediate playback
        2. Buffers chunks to build complete file
    → Once complete, backend signals client that full file is available
    → Client switches from stream to seekable file source

How it works: Client starts playing immediately from the raw stream (no seeking). In the background, the backend assembles the full file. Once ready, the client seamlessly switches to the complete file. Seeking becomes available after the full audio has been received.
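The hard part of the handoff is bookkeeping: when the complete file arrives, the client must swap sources and resume at the same position without an audible gap. A sketch of that logic with illustrative names (the small rewind is one assumed technique for masking a boundary glitch, not a documented API):

```typescript
type PlaybackSource = "stream" | "file";

interface HandoffState {
  source: PlaybackSource;
  resumeAt: number; // seconds into the audio
}

// Given the stream's current playback position, compute the state after
// switching to the complete file. Rewinding slightly (default 0.2s) can
// mask a gap at the switch boundary at the cost of a short replay.
function switchToFile(streamPosition: number, rewind = 0.2): HandoffState {
  return { source: "file", resumeAt: Math.max(0, streamPosition - rewind) };
}
```

Even with this in place, the client still has to coordinate pausing the stream, loading the file, and seeking before resuming — the race conditions noted in the table above.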

| Dimension | Assessment |
| --- | --- |
| Seeking | Available after full audio loads (delayed) |
| Time to first audio | Near-instant |
| Client complexity | High — must handle stream-to-file transition, dual playback sources, seamless handoff without audio glitch |
| Backend complexity | Medium — must track buffer state, notify client of completion, serve both stream and file |
| Best for | Long content where both instant playback and eventual seeking are important |

Complexity Comparison

| Factor | A: Full File | B: Pure Stream | C: Progressive Chunks | D: Dual-Source Hybrid |
| --- | --- | --- | --- | --- |
| Backend implementation | Simple | Simple | Medium | Medium |
| iOS client work | None (standard AVAudioPlayer) | Medium (stream handling) | Medium (AVAudioPlayer reload) | High (dual source, handoff) |
| Web client work | None (standard <audio>) | Low (no seeking UX) | Medium (MediaSource setup) | High (MediaSource + transition) |
| Seeking works? | Yes, always (full range) | No | Yes, within received range | Yes, after full load |
| Time to first audio | Delayed (1-30s) | Instant (~75ms) | Fast (~1-2s) | Instant (~75ms) |
| Edge cases | Timeout on very long text | Users frustrated by no scrubbing | Playback position tracking across chunk updates; MP3 frame alignment | Audio glitch during handoff; race conditions |
| Total effort estimate | Small | Small | Medium (~1.5x of A) | Large (2-3x of A) |

Questions to Help Decide

  1. How long is the typical text we're converting? If it's mostly short content (a few paragraphs), Approach A's wait time is negligible and the simplicity wins. If we're reading entire documents aloud, the wait becomes painful.

  2. Is seeking a hard requirement for V1? If users primarily listen start-to-finish (like a podcast), Approach B works and is the simplest. If they need to scrub (like reviewing a specific section), seeking is essential.

  3. Is "fast start + seeking" worth the extra complexity? Approach C (progressive chunks) gives both with moderate complexity. It's significantly simpler than D because there's no dual-source handoff — just one growing file.

  4. Are there UX patterns we can use to mask the wait? For Approach A, a progress indicator ("Generating audio...") or pre-generating audio when the user opens a document could reduce perceived latency.


Recommendation

Start with Approach A (Full Audio File) for V1, with a clear upgrade path to Approach C (Progressive Chunks) if latency becomes an issue.

Why A first:

  • Seeking works perfectly on both iOS and Web with zero client-side complexity
  • The backend is straightforward — buffer and serve with proper headers
  • For typical text lengths (paragraphs to summaries), the generation wait is 1-5 seconds — acceptable with a loading indicator
  • It's the cleanest foundation to build on

Why C is the natural next step (not D):

  • Joey's insight is right — assembling chunks into a growing seekable file is cleaner than the dual-source handoff in D
  • C avoids the hardest problem in D (seamless audio source switching without glitches)
  • C gives seeking from the moment playback starts, not just after full load
  • MP3's self-contained frame structure makes progressive assembly straightforward
  • Effort is ~1.5x of A vs ~3x for D

For long documents in V1, we can mitigate wait time by splitting text into sections and generating each as a separate audio file — the user starts listening to section 1 while sections 2+ generate in the background.
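That mitigation needs only a simple text splitter on the backend. A sketch, assuming paragraph breaks as split points and an illustrative size cap (a single paragraph longer than the cap stays as one section here):

```typescript
// Split text on blank-line paragraph breaks into sections of at most
// maxChars, so each section can be sent to TTS as a separate request and
// playback of section 1 can start while later sections generate.
function splitIntoSections(text: string, maxChars = 2000): string[] {
  const sections: string[] = [];
  let current = "";
  for (const para of text.split(/\n\s*\n/)) {
    const candidate = current ? current + "\n\n" + para : para;
    if (candidate.length > maxChars && current) {
      sections.push(current);
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current) sections.push(current);
  return sections;
}
```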
