Realtime Text-to-Speech API
A dedicated WebSocket endpoint for real-time text-to-speech synthesis. Send text incrementally — from an LLM token stream or user input — and receive synthesised audio as base64-encoded chunks with minimal latency.
Connection
wss://api.palabra.ai/v1/text-to-speech/stream?token={JWT}
Authentication uses a LiveKit-compatible JWT passed as a query parameter. The connection is rate-limited to 20 new connections per minute per token.
Session lifecycle
- Connect to the WebSocket endpoint
- Send an init message to establish voice and output settings
- Send text messages, each up to 256 characters; mark the last chunk of each sentence with is_eos: true
- Receive audio_chunk messages with base64-encoded audio as chunks arrive
- Send cancel at any time to stop the current synthesis
The session persists until you disconnect. Settings from init apply to the entire session and can be overridden per text message.
Client → Server Messages
init
Must be sent once, immediately after connecting.
{
"type": "init",
"language": "en",
"model": "auto",
"voice_options": {
"voice_id": "default_low",
"speed": 0.5,
"deaccent_strength": 0.0
},
"output": {
"format": "pcm",
"sample_rate": 24000
}
}
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| language | string | ✓ | — | BCP-47 language code, e.g. en, ru, de, fr. |
| model | string | — | auto | TTS model ID. Use auto to select the best model for the language. |
| voice_options | object | ✓ | — | Voice configuration. See fields below. |
| output | object | — | — | Output audio settings. See fields below. |
voice_options
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| voice_id | string | ✓ | — | Voice identifier. Use default_low or default_high for the language default. |
| speed | float | — | 0.5 | Speed multiplier. 0.5 is normal speed. Range: 0.0–1.0. |
| deaccent_strength | float | — | 0.0 | Reduces foreign accent. Range: 0.0–1.0. |
output
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| format | string | — | mp3 | Output audio format. One of pcm, mp3, wav. |
| sample_rate | integer | — | 24000 | Output sample rate in Hz. Range: 8000–48000. |
text
Send a chunk of text to synthesise. Each message must be 256 characters or fewer. Send multiple messages to stream longer text — mark the last chunk of each sentence with is_eos: true.
Voice options can be overridden per message — only the fields you supply are changed.
{
"type": "text",
"text": "Hello, how can I help you today?",
"is_eos": true
}
{
"type": "text",
"text": "This part is spoken faster.",
"is_eos": false,
"voice_options": {
"speed": 0.9
}
}
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | ✓ | — | Text to synthesise. Max 256 characters per message. |
| is_eos | boolean | — | false | End of sentence. When true, finalises synthesis for the current sentence. |
| voice_options | object | — | — | Overrides the session voice options for this and subsequent messages. |
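An LLM rarely emits text in 256-character pieces, so the client must split each sentence itself before sending. A minimal splitting sketch (the split_for_tts helper is illustrative, not part of the API):

```python
def split_for_tts(sentence: str, limit: int = 256) -> list[dict]:
    """Split one sentence into text messages of at most `limit` characters,
    breaking on whitespace and flagging only the final chunk with is_eos."""
    chunks, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > limit and current:
            chunks.append(current)  # flush the full chunk
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [
        {"type": "text", "text": c, "is_eos": i == len(chunks) - 1}
        for i, c in enumerate(chunks)
    ]
```

Each returned dict can be passed straight to json.dumps and sent over the socket.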
cancel
Stop the current synthesis immediately. The session stays open — you can send new text messages right after.
{
"type": "cancel"
}
Server → Client Messages
audio_chunk
Sent as audio is generated. Each message contains a base64-encoded audio chunk in the format specified in init.output.format.
{
"message_type": "audio_chunk",
"data": {
"audio": "Y3VyaW91cyBtaW5kcyB0aGluayBhbGlrZQ==",
"size": 9600
}
}
| Field | Type | Description |
|---|---|---|
| data.audio | string | Base64-encoded audio data. |
| data.size | integer | Size of the decoded audio in bytes. |
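For pcm output, data.size maps directly to playback time: 16-bit mono audio is 2 bytes per sample, so a chunk lasts size / (sample_rate * 2) seconds. A quick check against the 9600-byte chunk above at 24000 Hz:

```python
def pcm_chunk_duration(size_bytes: int, sample_rate: int) -> float:
    """Duration in seconds of a 16-bit mono PCM chunk."""
    bytes_per_second = sample_rate * 2  # 2 bytes per 16-bit sample
    return size_bytes / bytes_per_second

print(pcm_chunk_duration(9600, 24000))  # 0.2 seconds
```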
error
Sent on session-level or synthesis errors.
{
"message_type": "error",
"data": {
"code": "SERVICE_UNAVAILABLE",
"desc": "Speech synthesis timed out. Please try again."
}
}
| Code | Retryable | Description |
|---|---|---|
| SERVICE_UNAVAILABLE | ✓ | Synthesis service issue. Wait and retry. |
| SERVER_ERROR | ✗ | Unexpected internal error. |
| VALIDATION_ERROR | ✗ | Invalid message field. See desc for details. |
| CONFLICT | ✗ | Session already initialised. Reconnect to change settings. |
| SESSION_NOT_FOUND | ✗ | Send init before sending text. |
| UNAUTHORIZED | ✗ | JWT token missing, invalid, or expired. |
| RATE_LIMIT_EXCEEDED | ✓ | Rate limit exceeded. Connections are limited to 20 per minute. Text messages are limited to 50 per second. Wait briefly and retry. |
Output formats
| Format | Description |
|---|---|
| pcm | Raw 16-bit signed PCM, little-endian. No container. Best for real-time playback — schedule each chunk immediately as it arrives. |
| mp3 | MPEG Layer 3 compressed audio. Collect all chunks and decode together. |
| wav | PCM with RIFF/WAVE container. Collect all chunks and decode together. |
For browser playback, decode each data.audio from base64 to a Uint8Array, reinterpret it as a little-endian Int16Array, convert to a Float32Array, wrap that in an AudioBuffer, and schedule it with AudioBufferSourceNode.start(nextTime), advancing nextTime by audioBuffer.duration after each chunk for gapless playback.
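The same sample conversion, sketched in Python with the standard library only (useful for offline processing of a saved pcm stream; the helper name is illustrative):

```python
import array
import base64
import sys

def decode_pcm_chunk(b64_audio: str) -> list[float]:
    """Decode one data.audio payload: base64 -> 16-bit LE ints -> floats in [-1.0, 1.0)."""
    raw = base64.b64decode(b64_audio)
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":  # the PCM stream is little-endian
        samples.byteswap()
    return [s / 32768.0 for s in samples]
```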
Example — streaming from an LLM
This example shows how sentences split into chunks can be streamed to the TTS API. The voice speed is increased mid-stream to demonstrate per-message overrides.
import asyncio, json, base64, websockets

SENTENCES = [
    (
        "The sun was setting over the mountains,",
        "casting long golden shadows across the valley below.",
    ),
    (
        "Birds were returning to their nests,",
        "filling the air with their evening songs.",
    ),
    (
        "A gentle breeze moved through the tall grass,",
        "creating waves that rippled toward the horizon.",
    ),
]

async def main():
    url = "wss://api.palabra.ai/v1/text-to-speech/stream?token=YOUR_JWT"
    async with websockets.connect(url) as ws:
        # 1. Initialise the session
        await ws.send(json.dumps({
            "type": "init",
            "language": "en",
            "model": "auto",
            "voice_options": {
                "voice_id": "default_low",
                "speed": 0.5,
                "deaccent_strength": 0.0,
            },
            "output": {
                "format": "pcm",
                "sample_rate": 24000,
            },
        }))

        # 2. Send each sentence chunk by chunk, is_eos=True on the last chunk of each sentence
        for i, sentence in enumerate(SENTENCES):
            for k, chunk in enumerate(sentence):
                msg = {
                    "type": "text",
                    "text": chunk,
                    "is_eos": k == len(sentence) - 1,
                }
                if k == 0:  # override speed only at the start of each sentence
                    msg["voice_options"] = {"speed": min(0.5 + i / 10, 1.0)}
                await ws.send(json.dumps(msg))

        # 3. Collect audio. The API sends no end-of-stream message, so stop
        #    once the stream has been quiet for 2 seconds.
        audio = bytearray()
        while True:
            try:
                raw = await asyncio.wait_for(ws.recv(), timeout=2.0)
            except asyncio.TimeoutError:
                break
            data = json.loads(raw)
            if data.get("message_type") == "audio_chunk":
                audio.extend(base64.b64decode(data["data"]["audio"]))
                print(f"Chunk: {data['data']['size']} bytes, total: {len(audio)}")
            elif data.get("message_type") == "error":
                print(f"Error: {data['data']['code']} — {data['data']['desc']}")
                break

        print(f"Total: {len(audio)} bytes")
        with open("output.pcm", "wb") as f:
            f.write(audio)
        # Play: ffplay -f s16le -ar 24000 -ac 1 output.pcm

asyncio.run(main())