Translation Settings Breakdown
A complete reference for all configuration options of the set_task message.
What are translation settings?
Translation settings control how your audio is processed, transcribed, translated, and synthesized. They are included in the data field of a set_task message sent via the WebRTC data channel or the WebSocket connection.
You will use these settings when:
- Starting a new translation: send your initial set_task message to begin real-time translation.
- Updating an active translation: modify settings during an ongoing translation without interrupting the stream.
- Fine-tuning performance: adjust parameters to optimize quality, latency, or specific use cases.
For ready-to-use values, see Recommended settings.
Message envelope
{
"message_type": "set_task",
"data": {
"input_stream": { /* ... */ },
"output_stream": { /* ... */ },
"pipeline": { /* ... */ }
}
}
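Here is a sketch of how a client might assemble the envelope. The Python below wraps the three data sections in a set_task message and serializes it to JSON. It assumes that transcription and translations sit inside pipeline, as the layout of this page suggests. The field values are illustrative, and this reference does not confirm that such a minimal payload is enough for the server (see Recommended settings for complete values).

```python
import json


def build_set_task(input_stream: dict, output_stream: dict, pipeline: dict) -> str:
    """Wrap the three data sections in a set_task envelope and serialize it."""
    message = {
        "message_type": "set_task",
        "data": {
            "input_stream": input_stream,
            "output_stream": output_stream,
            "pipeline": pipeline,
        },
    }
    return json.dumps(message)


# Hypothetical minimal payload: WebRTC session, Spanish speech to English.
payload = build_set_task(
    input_stream={"content_type": "audio", "source": {"type": "webrtc"}},
    output_stream={"content_type": "audio", "target": {"type": "webrtc"}},
    pipeline={
        "transcription": {"source_language": "es"},
        "translations": [{"target_language": "en-us"}],
    },
)
```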
message_type
- Type: string
- Description: Identifies the type of command being sent:
  - "set_task": create a new task or update an existing one.
  - "get_task": return the current task.
  - "pause_task": stop processing new audio but keep the task alive. Use set_task to resume.
  - "flush_task": cancel processing of the current (ongoing) phrase.
  - "end_task": end the task (you will be disconnected automatically).
  - "tts_task": generate text-to-speech from text.
  - "input_audio_data": WebSocket transport only; contains an audio data chunk.
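The command names above form a closed set, so a client can catch typos before sending anything. The sketch below assumes that a command without settings, such as pause_task, is sent with an empty data object. This reference does not specify that shape, so treat it as an assumption.

```python
import json
from typing import Optional

# The command names listed above.
MESSAGE_TYPES = {
    "set_task",
    "get_task",
    "pause_task",
    "flush_task",
    "end_task",
    "tts_task",
    "input_audio_data",
}


def build_command(message_type: str, data: Optional[dict] = None) -> str:
    """Serialize a command envelope after checking its message_type.

    Assumption: a command with no settings carries an empty "data" object.
    """
    if message_type not in MESSAGE_TYPES:
        raise ValueError(f"unknown message_type: {message_type!r}")
    return json.dumps(
        {"message_type": message_type, "data": data if data is not None else {}}
    )


# Pause the task; resume later by sending set_task again.
pause_message = build_command("pause_task")
```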
data
- Type: object
- Description: The main task configuration. Contains the input_stream, output_stream, and pipeline sections shown in the envelope above:
input_stream
Configures the input stream.
| Field | Type | Description |
|---|---|---|
content_type | string | Content type of the input stream. Currently only audio is supported. |
source | object | How and from where the input audio is sourced. See below. |
source
| Field | Type | Description |
|---|---|---|
type | string | Input audio transport: webrtc or ws. Must match the target transport in output_stream. |
format | string | WebSocket transport only. Input audio format: opus, pcm_s16le, or wav. |
sample_rate | integer | WebSocket transport only. Input sample rate. Allowed range: 16000–48000 Hz. |
channels | integer | WebSocket transport only. Number of input channels: 1 or 2. |
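A client can check the constraints in the source table before sending set_task. This hypothetical validate_source helper checks the WebSocket-only fields only when they are present, because the table does not say whether they are required.

```python
def validate_source(source: dict) -> list[str]:
    """Return the problems found in an input_stream.source object."""
    problems = []
    transport = source.get("type")
    if transport not in ("webrtc", "ws"):
        problems.append(f"type must be 'webrtc' or 'ws', got {transport!r}")
    if transport != "ws":
        return problems
    # format, sample_rate and channels apply to WebSocket transport only;
    # this sketch checks them only when they are present.
    if "format" in source and source["format"] not in ("opus", "pcm_s16le", "wav"):
        problems.append("format must be opus, pcm_s16le, or wav")
    if "sample_rate" in source and not 16000 <= source["sample_rate"] <= 48000:
        problems.append("sample_rate must be within 16000-48000 Hz")
    if "channels" in source and source["channels"] not in (1, 2):
        problems.append("channels must be 1 or 2")
    return problems


# A hypothetical WebSocket source streaming 16 kHz mono PCM.
ws_source = {"type": "ws", "format": "pcm_s16le", "sample_rate": 16000, "channels": 1}
```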
output_stream
Configures the output stream.
| Field | Type | Description |
|---|---|---|
content_type | string | Content type of the output. Currently only audio is supported. |
target | object | Destination of the output. See below. |
target
| Field | Type | Description |
|---|---|---|
type | string | Output audio transport: webrtc or ws. Must match the source transport in input_stream. |
format | string | WebSocket transport only. Output audio format: pcm_s16le or zlib_pcm_s16le (zlib-compressed PCM). |
sample_rate | integer | WebSocket transport only. Output sample rate. Fixed at 24000 Hz. |
channels | integer | WebSocket transport only. Output channels. Fixed at 1 (mono). |
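For WebSocket output, the target table fixes the audio at 24000 Hz mono, as signed 16-bit little-endian PCM that is optionally zlib-compressed. The sketch below decodes one received chunk into samples and checks that the two transports match. It assumes each zlib_pcm_s16le chunk is a complete, independently compressed zlib stream. The helper names are hypothetical.

```python
import struct
import zlib

OUTPUT_SAMPLE_RATE = 24000  # fixed for WebSocket output; always mono


def transports_match(input_stream: dict, output_stream: dict) -> bool:
    """Source and target transports must be identical ("webrtc" or "ws")."""
    return input_stream["source"]["type"] == output_stream["target"]["type"]


def decode_output_chunk(chunk: bytes, fmt: str) -> list[int]:
    """Convert one WebSocket output chunk into signed 16-bit mono samples.

    Assumption: every zlib_pcm_s16le chunk is a complete zlib stream.
    """
    if fmt == "zlib_pcm_s16le":
        chunk = zlib.decompress(chunk)
    elif fmt != "pcm_s16le":
        raise ValueError(f"unsupported output format: {fmt!r}")
    count = len(chunk) // 2  # 2 bytes per sample; a stray trailing byte is ignored
    return list(struct.unpack(f"<{count}h", chunk[: count * 2]))


def duration_seconds(samples: list[int]) -> float:
    """Playback length of decoded mono samples at the fixed output rate."""
    return len(samples) / OUTPUT_SAMPLE_RATE
```

At 24000 Hz mono, 48000 bytes of decoded PCM correspond to one second of audio.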
pipeline
Holds the configuration for all processing steps: preprocessing, transcription, and translation.
transcription
Settings for Automatic Speech Recognition (ASR).
source_language
- Type: string
- Description: Language code of the input audio (e.g., "en", "es", "fr"). Set to "auto" to enable automatic language detection (optionally restricted by detectable_languages). See Supported languages.
detectable_languages
- Type: array of strings
- Description: When source_language is "auto", only languages from this list will be detected. Leave empty to allow any supported language.
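A small, hypothetical builder makes the link between source_language and detectable_languages explicit. The list is sent only with "auto", and an empty list keeps every supported language detectable. Rejecting a list alongside a fixed language is a choice made by this sketch, not a documented server rule.

```python
from typing import Iterable, Optional


def transcription_config(
    source_language: str, detectable_languages: Optional[Iterable[str]] = None
) -> dict:
    """Build a transcription section for a fixed or auto-detected language."""
    config = {"source_language": source_language}
    if source_language == "auto":
        # An empty list allows any supported language to be detected.
        config["detectable_languages"] = list(detectable_languages or [])
    elif detectable_languages:
        # detectable_languages only applies to "auto"; this helper treats
        # passing it with a fixed language as a caller mistake.
        raise ValueError('detectable_languages requires source_language="auto"')
    return config


# Detect only English or Spanish.
bilingual = transcription_config("auto", ["en", "es"])
```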
segment_confirmation_silence_threshold
- Type: float
- Description: Seconds of silence needed to confirm the end of a segment (allowed range 0.3–2.0, default 0.7). Recommended range: 0.5–0.9 s, depending on the speaker's tempo and pauses. Increase it if the speaker frequently pauses between words; setting it too low can cause unwanted sentence splitting.
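This hypothetical helper makes the ranges concrete. It rejects values outside the allowed 0.3–2.0 s range and labels values outside the recommended 0.5–0.9 s band. The labels paraphrase the guidance above and are not server output.

```python
def check_silence_threshold(seconds: float) -> str:
    """Classify a segment_confirmation_silence_threshold value."""
    if not 0.3 <= seconds <= 2.0:
        raise ValueError("allowed range is 0.3-2.0 seconds")
    if seconds < 0.5:
        return "low: short pauses may split sentences"
    if seconds > 0.9:
        return "high: waits longer before confirming a segment"
    return "recommended"
```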
speakers_total
- Type: integer or null
- Description: Expected number of speakers (1–1000). Helps with speaker handling when the number of speakers is known in advance.
only_confirm_by_silence
- Type: bool
- Description: When true, segments are confirmed only by silence detection.
sentence_splitter
- Type: object
- Description: Controls how longer sentences are split into smaller parts (sometimes with slight rephrasing, without losing the meaning) to speed up processing.
| Field | Type | Description |
|---|---|---|
enabled | bool | Whether to enable automatic sentence splitting. |
verification
- Type: object
- Description: Transcription verification settings.
| Field | Type | Description |
|---|---|---|
auto_transcription_correction | bool | Work in progress. Enables automatic transcription verification using an LLM. |
transcription_correction_style | string or null | Style of the LLM correction. |
translations
- Type: array of objects
- Description: An array of translation targets. Each object defines the translation settings for one target language; add one object per target language.
target_language
- Type: string
- Description: Language to translate into (e.g., "en-us", "es", "fr"). See Supported languages.
allowed_source_languages
- Type: array of strings
- Description: Restricts this translation target to specific source languages. Used for conditional multi-language translation together with "source_language": "auto".
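The config below shows conditional multi-language translation for a hypothetical bilingual English/Spanish session. Spanish speech goes to English, and English speech goes to Spanish. The targets_for helper mirrors the filtering on the client. It assumes that a target without allowed_source_languages accepts any source language. The language codes are illustrative.

```python
# Hypothetical bilingual session; source language is detected automatically.
pipeline = {
    "transcription": {"source_language": "auto", "detectable_languages": ["en", "es"]},
    "translations": [
        {"target_language": "en-us", "allowed_source_languages": ["es"]},
        {"target_language": "es", "allowed_source_languages": ["en"]},
    ],
}


def targets_for(detected: str, translations: list[dict]) -> list[str]:
    """Client-side mirror of the filter: targets that apply to a detected language.

    Assumption: a target without allowed_source_languages accepts any source.
    """
    return [
        target["target_language"]
        for target in translations
        if not target.get("allowed_source_languages")
        or detected in target["allowed_source_languages"]
    ]
```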
translate_partial_transcriptions
- Type: bool
- Description: Enables translation of partial (unconfirmed) transcriptions.
speech_generation
- Type: object
- Description: Text-to-speech (TTS) settings for this target language.
| Field | Type | Description |
|---|---|---|
voice_cloning | bool | Experimental. Mimics the original speaker's voice. It usually takes 10–20 seconds of speech before the voice changes are applied. |
voice_id | string or null | A specific voice ID (voice cloning must be disabled). "default_low" or "default_high" automatically picks the best default voice for the language. Manage voices in the Palabra web portal. |
voice_timbre_detection | object | Automatically detects voice timbre and assigns voice IDs accordingly. See below. |
voice_timbre_detection
| Field | Type | Description |
|---|---|---|
enabled | bool | Enables voice timbre detection (voice cloning must be disabled). |
high_timbre_voices | array of strings | Voice ID to use for high-timbre voices. Currently only one ID is supported; "default_high" can be used. |
low_timbre_voices | array of strings | Voice ID to use for low-timbre voices. Currently only one ID is supported; "default_low" can be used. |
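The voice settings depend on one another. Both voice_id and voice timbre detection require voice cloning to be off, and each timbre list currently accepts a single ID. The hypothetical check_speech_generation helper below encodes those rules. It is followed by a timbre-detection config that uses the built-in default voices.

```python
def check_speech_generation(sg: dict) -> list[str]:
    """Check the voice-related constraints from the tables above."""
    problems = []
    cloning = sg.get("voice_cloning", False)
    if cloning and sg.get("voice_id") is not None:
        problems.append("voice_id requires voice_cloning to be disabled")
    timbre = sg.get("voice_timbre_detection", {})
    if timbre.get("enabled"):
        if cloning:
            problems.append("voice_timbre_detection requires voice_cloning to be disabled")
        for key in ("high_timbre_voices", "low_timbre_voices"):
            if len(timbre.get(key, [])) > 1:
                problems.append(f"{key}: only one voice ID is currently supported")
    return problems


# Timbre-based voice selection with the built-in default voices.
timbre_config = {
    "voice_cloning": False,
    "voice_timbre_detection": {
        "enabled": True,
        "high_timbre_voices": ["default_high"],
        "low_timbre_voices": ["default_low"],
    },
}
```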
translation_queue_configs
Configures the queues (buffers) of synthesized TTS audio that has not been spoken yet.
The global key holds the default settings. You can add language-specific overrides by using the language code as a key (for example, "es" for Spanish).
| Field | Type | Description |
|---|---|---|
desired_queue_level_ms | integer | Desired average TTS buffer size in milliseconds (2000–163840). With auto_tempo enabled, the system tries to keep the buffer at this level. Recommended: 5000–10000 ms. |
max_queue_level_ms | integer | Maximum TTS queue size in milliseconds (3000–163840). If the queue grows beyond this limit, it is reduced to desired_queue_level_ms by dropping older queued audio. Must be greater than desired_queue_level_ms; should be at least 2–3× larger. |
auto_tempo | bool | Automatically corrects speech tempo based on the queue state. Recommended to keep on. |
auto_tempo_max_delay_ms | integer | Maximum buffer delay in milliseconds for auto tempo (60–10000, default 250). |
min_tempo | float | Minimum allowed speech speed (1.0–2.0, default 1.0). |
max_tempo | float | Maximum allowed speech speed (1.0–2.0, default 1.35). Must be ≥ min_tempo. |
If you don't provide translation_queue_configs, the server applies a default global config: desired_queue_level_ms: 5000, max_queue_level_ms: 20000, min_tempo: 1.15, max_tempo: 1.45, auto_tempo: true.
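A client can check the limits in the table before sending. This sketch encodes the documented ranges and treats the "2–3× larger" guidance as a lower bound of 2×. It then builds a global config from the server defaults plus a hypothetical Spanish override. The override repeats every field because this reference does not say whether overrides are merged with global.

```python
DEFAULT_QUEUE_CONFIG = {
    "desired_queue_level_ms": 5000,
    "max_queue_level_ms": 20000,
    "auto_tempo": True,
    "min_tempo": 1.15,
    "max_tempo": 1.45,
}


def check_queue_config(cfg: dict) -> list[str]:
    """Check one translation_queue_configs entry against the documented limits.

    This sketch assumes both queue levels are set explicitly.
    """
    problems = []
    desired = cfg["desired_queue_level_ms"]
    maximum = cfg["max_queue_level_ms"]
    if not 2000 <= desired <= 163840:
        problems.append("desired_queue_level_ms out of range 2000-163840")
    if not 3000 <= maximum <= 163840:
        problems.append("max_queue_level_ms out of range 3000-163840")
    if maximum <= desired:
        problems.append("max_queue_level_ms must be greater than desired_queue_level_ms")
    elif maximum < 2 * desired:
        problems.append("max_queue_level_ms should be 2-3x desired_queue_level_ms")
    delay = cfg.get("auto_tempo_max_delay_ms", 250)
    if not 60 <= delay <= 10000:
        problems.append("auto_tempo_max_delay_ms out of range 60-10000")
    low = cfg.get("min_tempo", 1.0)
    high = cfg.get("max_tempo", 1.35)
    if not (1.0 <= low <= 2.0 and 1.0 <= high <= 2.0):
        problems.append("min_tempo and max_tempo must be within 1.0-2.0")
    if high < low:
        problems.append("max_tempo must be >= min_tempo")
    return problems


# Server defaults as "global", plus a hypothetical Spanish override with a
# larger buffer. The override repeats every field on purpose.
queue_configs = {
    "global": DEFAULT_QUEUE_CONFIG,
    "es": {**DEFAULT_QUEUE_CONFIG, "desired_queue_level_ms": 8000, "max_queue_level_ms": 24000},
}
```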
allowed_message_types
- Type: array of strings
- Default: ["translated_transcription", "partial_transcription", "validated_transcription"]
- Description: Specifies which message types you will receive over the WebSocket. The same messages are also sent in the WebRTC data channel.
  - "partial_transcription": emitted for partial transcription segments as they are recognized.
  - "partial_translated_transcription": emitted for partial translated transcriptions if translate_partial_transcriptions is enabled.
  - "validated_transcription": emitted when a transcription segment is fully confirmed.
  - "translated_transcription": emitted when a transcription segment has been translated.
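On the receiving side, it helps to confirm that every requested message type can actually arrive. The hypothetical helpers below flag two problems: types not listed above, and a request for partial_translated_transcription when no target enables translate_partial_transcriptions. They then route incoming messages by type. The routing assumes that each incoming message is a JSON object with a message_type field.

```python
import json

# Only the four types documented above.
KNOWN_TYPES = {
    "partial_transcription",
    "partial_translated_transcription",
    "validated_transcription",
    "translated_transcription",
}
DEFAULT_ALLOWED = ["translated_transcription", "partial_transcription", "validated_transcription"]


def check_allowed_types(allowed: list[str], translations: list[dict]) -> list[str]:
    """Flag unlisted types and partial translations that can never arrive."""
    problems = [f"type not listed in this reference: {t}" for t in allowed if t not in KNOWN_TYPES]
    wants_partial = "partial_translated_transcription" in allowed
    if wants_partial and not any(
        target.get("translate_partial_transcriptions") for target in translations
    ):
        problems.append(
            "partial_translated_transcription needs translate_partial_transcriptions "
            "enabled on at least one target"
        )
    return problems


def dispatch(raw: str, handlers: dict) -> None:
    """Route an incoming message to the handler registered for its type.

    Assumption: incoming messages are JSON objects with a "message_type" field.
    """
    message = json.loads(raw)
    handler = handlers.get(message.get("message_type"))
    if handler is not None:
        handler(message)
```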
See also: Translation management API · Publishing and receiving audio