
Realtime Text-to-Speech API

A dedicated WebSocket endpoint for real-time text-to-speech synthesis. Send text incrementally — from an LLM token stream or user input — and receive synthesised audio as base64-encoded chunks with minimal latency.

Connection

wss://api.palabra.ai/v1/text-to-speech/stream?token={JWT}

Authentication uses a LiveKit-compatible JWT passed as a query parameter. The connection is rate-limited to 20 new connections per minute per token.
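Since the JWT travels as a query parameter, it is worth URL-encoding it when building the connection URL. A minimal sketch in Python (the helper name is ours, not part of the API):

```python
from urllib.parse import urlencode

def build_tts_url(token: str) -> str:
    """Build the streaming endpoint URL with the JWT as a query parameter."""
    base = "wss://api.palabra.ai/v1/text-to-speech/stream"
    return f"{base}?{urlencode({'token': token})}"

print(build_tts_url("YOUR_JWT"))
```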

Session lifecycle

  1. Connect to the WebSocket endpoint
  2. Send an init message to establish voice and output settings
  3. Send text messages, each up to 256 characters; mark the last chunk of each sentence with is_eos: true
  4. Receive audio_chunk messages with base64-encoded audio as chunks arrive
  5. Send cancel at any time to stop the current synthesis

The session persists until you disconnect. Settings from init apply to the entire session and can be overridden per text message.


Client → Server Messages

init

Must be sent once, immediately after connecting.

{
  "type": "init",
  "language": "en",
  "model": "auto",
  "voice_options": {
    "voice_id": "default_low",
    "speed": 0.5,
    "deaccent_strength": 0.0
  },
  "output": {
    "format": "pcm",
    "sample_rate": 24000
  }
}
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| language | string | | BCP-47 language code, e.g. en, ru, de, fr. |
| model | string | auto | TTS model ID. Use auto to select the best model for the language. |
| voice_options | object | | Voice configuration. See fields below. |
| output | object | | Output audio settings. See fields below. |

voice_options

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| voice_id | string | | Voice identifier. Use default_low or default_high for the language default. |
| speed | float | 0.5 | Speed multiplier; 0.5 is normal speed. Range: 0.0 to 1.0. |
| deaccent_strength | float | 0.0 | Reduces foreign accent. Range: 0.0 to 1.0. |

output

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| format | string | mp3 | Output audio format. One of pcm, mp3, wav. |
| sample_rate | integer | 24000 | Output sample rate in Hz. Range: 8000 to 48000. |
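The documented constraints can be checked client-side before sending init. A sketch (the helper name and the ValueError behaviour are our own, not the API's):

```python
VALID_FORMATS = {"pcm", "mp3", "wav"}

def make_output(format: str = "mp3", sample_rate: int = 24000) -> dict:
    """Build the init.output object, enforcing the documented ranges."""
    if format not in VALID_FORMATS:
        raise ValueError(f"format must be one of {sorted(VALID_FORMATS)}")
    if not 8000 <= sample_rate <= 48000:
        raise ValueError("sample_rate must be between 8000 and 48000 Hz")
    return {"format": format, "sample_rate": sample_rate}
```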

text

Send a chunk of text to synthesise. Each message must be 256 characters or fewer. Send multiple messages to stream longer text — mark the last chunk of each sentence with is_eos: true.

Voice options can be overridden per message — only the fields you supply are changed.

{
  "type": "text",
  "text": "Hello, how can I help you today?",
  "is_eos": true
}
{
  "type": "text",
  "text": "This part is spoken faster.",
  "is_eos": false,
  "voice_options": {
    "speed": 0.9
  }
}
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| text | string | | Text to synthesise. Max 256 characters per message. |
| is_eos | boolean | false | End of sentence. When true, finalises synthesis for the provided sentence. |
| voice_options | object | | Overrides the session voice options for this and subsequent messages. |
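Longer sentences have to be split client-side to respect the 256-character cap. A minimal sketch that breaks on whitespace and marks the final chunk with is_eos (the splitting strategy is an assumption, not part of the API; a single word longer than the limit would still overflow):

```python
def split_sentence(sentence: str, limit: int = 256) -> list[dict]:
    """Split one sentence into text messages of at most `limit` characters,
    breaking on whitespace and marking the final chunk with is_eos."""
    chunks, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > limit and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [
        {"type": "text", "text": chunk, "is_eos": i == len(chunks) - 1}
        for i, chunk in enumerate(chunks)
    ]
```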

cancel

Stop the current synthesis immediately. The session stays open — you can send new text messages right after.

{
  "type": "cancel"
}

Server → Client Messages

audio_chunk

Sent as audio is generated. Each message contains a base64-encoded audio chunk in the format specified in init.output.format.

{
  "message_type": "audio_chunk",
  "data": {
    "audio": "Y3VyaW91cyBtaW5kcyB0aGluayBhbGlrZQ==",
    "size": 9600
  }
}
| Field | Type | Description |
| --- | --- | --- |
| data.audio | string | Base64-encoded audio data. |
| data.size | integer | Size of the decoded audio in bytes. |

error

Sent on session-level or synthesis errors.

{
  "message_type": "error",
  "data": {
    "code": "SERVICE_UNAVAILABLE",
    "desc": "Speech synthesis timed out. Please try again."
  }
}
| Code | Description |
| --- | --- |
| SERVICE_UNAVAILABLE | Synthesis service issue. Wait and retry. |
| SERVER_ERROR | Unexpected internal error. |
| VALIDATION_ERROR | Invalid message field. See desc for details. |
| CONFLICT | Session already initialised. Reconnect to change settings. |
| SESSION_NOT_FOUND | Send init before sending text. |
| UNAUTHORIZED | JWT token missing, invalid, or expired. |
| RATE_LIMIT_EXCEEDED | Rate limit exceeded. Connections are limited to 20 per minute; text messages to 50 per second. Wait briefly and retry. |

Output formats

| Format | Description |
| --- | --- |
| pcm | Raw 16-bit signed PCM, little-endian. No container. Best for real-time playback; schedule each chunk immediately as it arrives. |
| mp3 | MPEG Layer 3 compressed audio. Collect all chunks and decode together. |
| wav | PCM with RIFF/WAVE container. Collect all chunks and decode together. |

Real-time PCM playback

Decode each data.audio from base64 to a Uint8Array, reinterpret it as an Int16Array (little-endian), convert to a Float32Array, then create an AudioBuffer and schedule it with AudioBufferSourceNode.start(nextTime), accumulating nextTime += audioBuffer.duration after each chunk for gapless playback.
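The sample conversion at the heart of that pipeline can be mirrored in Python: raw bytes are interpreted as little-endian signed 16-bit samples and normalised to floats in [-1.0, 1.0]. A sketch (the helper name is ours):

```python
import struct

def pcm16le_to_floats(chunk: bytes) -> list[float]:
    """Convert a raw PCM chunk (16-bit signed, little-endian) to floats in [-1, 1]."""
    count = len(chunk) // 2  # two bytes per sample; a trailing odd byte is dropped
    samples = struct.unpack(f"<{count}h", chunk[: count * 2])
    return [s / 32768.0 for s in samples]
```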


Example — streaming from an LLM

This example shows how sentences split into chunks can be streamed to the TTS API. Voice speed is increased mid-stream to demonstrate per-message overrides.

import asyncio, json, base64, websockets

SENTENCES = [
    (
        "The sun was setting over the mountains,",
        "casting long golden shadows across the valley below.",
    ),
    (
        "Birds were returning to their nests,",
        "filling the air with their evening songs.",
    ),
    (
        "A gentle breeze moved through the tall grass,",
        "creating waves that rippled toward the horizon.",
    ),
]

async def main():
    url = "wss://api.palabra.ai/v1/text-to-speech/stream?token=YOUR_JWT"
    async with websockets.connect(url) as ws:
        # 1. Initialise the session
        await ws.send(json.dumps({
            "type": "init",
            "language": "en",
            "model": "auto",
            "voice_options": {
                "voice_id": "default_low",
                "speed": 0.5,
                "deaccent_strength": 0.0,
            },
            "output": {
                "format": "pcm",
                "sample_rate": 24000,
            },
        }))

        # 2. Send each sentence chunk by chunk, is_eos=True on the last chunk of each sentence
        for i, sentence in enumerate(SENTENCES):
            for k, chunk in enumerate(sentence):
                msg = {
                    "type": "text",
                    "text": chunk,
                    "is_eos": k == len(sentence) - 1,
                }
                if k == 0:  # override speed only at the start of each sentence
                    msg["voice_options"] = {"speed": min(0.5 + i / 10, 1.0)}
                await ws.send(json.dumps(msg))

        # 3. Collect audio; the session stays open, so stop after 5 s without a message
        audio = bytearray()
        try:
            while True:
                raw = await asyncio.wait_for(ws.recv(), timeout=5)
                data = json.loads(raw)
                if data.get("message_type") == "audio_chunk":
                    audio.extend(base64.b64decode(data["data"]["audio"]))
                    print(f"Chunk: {data['data']['size']} bytes, total: {len(audio)}")
                elif data.get("message_type") == "error":
                    print(f"Error: {data['data']['code']}: {data['data']['desc']}")
                    break
        except asyncio.TimeoutError:
            pass

    print(f"Total: {len(audio)} bytes")
    with open("output.pcm", "wb") as f:
        f.write(audio)
    # Play: ffplay -f s16le -ar 24000 -ac 1 output.pcm

asyncio.run(main())