Translation Settings Breakdown
A comprehensive reference for all configuration options available in the `set_task` message.
What are translation settings?
Translation settings are configuration parameters that you send to Palabra's streaming API to control how your audio is processed, transcribed, translated, and synthesized. These settings are included in the `data` field of a `set_task` message that you send via a WebRTC data channel or a WebSocket connection.
When to use these settings
You'll configure these settings in several scenarios:
- Starting a new translation session - Send your initial `set_task` message to begin real-time translation
- Updating an active session - Modify translation settings during an ongoing translation without interrupting the stream
- Fine-tuning performance - Adjust parameters to optimize quality, latency, or specific use cases
Task object structure
1. message_type
- Type: `string`
- Description: Identifies the type of command being sent:
  - `"set_task"` - Create a new task or update an existing one.
  - `"get_task"` - Return the current task.
  - `"end_task"` - End the task (you will be disconnected automatically).
  - `"pause_task"` - Do not process new audio data, but keep the task alive. Use `set_task` to resume.
  - `"tts_task"` - Generates text-to-speech from text.
  - `"input_audio_data"` - WebSockets transport only. Contains an audio data chunk.
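The envelope above can be sketched in a few lines. This is a hypothetical Python helper, not official client code; the only fields assumed are `message_type` and `data`, exactly as documented here.

```python
import json

def build_message(message_type: str, data: dict) -> str:
    """Serialize a task control message for the WebSocket or WebRTC data channel."""
    return json.dumps({"message_type": message_type, "data": data})

# Pause a running task, then resume it by sending set_task again:
pause_frame = build_message("pause_task", {})
resume_frame = build_message("set_task", {"pipeline": {}})  # full config elided
```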
2. data
- Type: `object`
- Description: Contains the main configuration details for this task.
Within `data`, there are five major sections:
- `input_stream`
- `output_stream`
- `pipeline`
- `translation_queue_configs`
- `allowed_message_types`
2.1. input_stream
- Type: `object`
- Description: Configures the input stream settings.
2.1.1. content_type
- Type: `string`
- Description: Describes the content type in the input stream. Currently, only `audio` is supported.
2.1.2. source
- Type: `object`
- Description: Indicates how and from where the input audio stream is sourced.
2.1.2.1. type
- Type: `string`
- Description: Specifies the input audio transport. `webrtc` and `ws` are supported. Must match the `target` transport.
2.1.2.2. format
- Type: `string`
- Description: Required for the WebSockets transport only. Input audio format; supported: `opus`, `pcm_s16le`, and `wav`.
2.1.2.3. sample_rate
- Type: `string`
- Description: Required for the WebSockets transport only. Input audio sample rate. The allowed range is from 16 kHz to 48 kHz.
2.1.2.4. channels
- Type: `string`
- Description: Required for the WebSockets transport only. Input channels. One or two channels are supported.
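Putting the fields above together, a WebSocket `input_stream` section might look like the following sketch. The values are illustrative; whether `sample_rate` and `channels` are sent as numbers or strings should follow the API schema — numbers are an assumption here.

```python
input_stream = {
    "content_type": "audio",        # only audio is supported
    "source": {
        "type": "ws",               # must match the output target transport
        "format": "pcm_s16le",      # opus, pcm_s16le, or wav
        "sample_rate": 24000,       # allowed range: 16 kHz - 48 kHz
        "channels": 1,              # one or two channels
    },
}
```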
2.2. output_stream
- Type: `object`
- Description: Defines the output settings.
2.2.1. content_type
- Type: `string`
- Description: Describes the content type in the output. Currently, only `audio` is supported.
2.2.2. target
- Type: `object`
- Description: Indicates the destination of the output.
2.2.2.1. type
- Type: `string`
- Description: Specifies the output audio transport. `webrtc` and `ws` are supported. Must match the `source` transport.
2.2.2.2. format
- Type: `string`
- Description: Required for the WebSockets transport only. Output audio format; supported: `pcm_s16le`, `zlib_pcm_s16le` (zlib-compressed PCM).
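A matching `output_stream` for the same WebSocket session could be sketched like this (values are illustrative; choosing `zlib_pcm_s16le` would imply a zlib decompression step on each received chunk):

```python
output_stream = {
    "content_type": "audio",      # only audio is supported
    "target": {
        "type": "ws",             # must match the input source transport
        "format": "pcm_s16le",    # or zlib_pcm_s16le (zlib-compressed PCM)
    },
}
```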
2.3. pipeline
- Type: `object`
- Description: Holds the configuration for all processing steps, including preprocessing, transcription, and translation.
2.3.2. transcription
- Type: `object`
- Description: Settings for Automatic Speech Recognition (ASR).
2.3.2.1. source_language
- Type: `string`
- Description: Language code representing the input audio language (e.g., `"en"`, `"es"`, `"fr"`). Can be set to `"auto"` for automatic language detection, which lets you set `detectable_languages`.
2.3.2.2. detectable_languages
- Type: `array of strings`
- Description: Only languages from the list will be detected if `source_language` is set to `"auto"`.
2.3.2.3. segment_confirmation_silence_threshold
- Type: `float`
- Description: The time in seconds of silence needed to confirm the end of a segment. The recommended value is between 0.5 s and 0.9 s, depending on the average speech tempo and pauses. Increase this value if a speaker frequently pauses between words. If it is set too low, it can lead to unwanted sentence splitting.
2.3.2.4. sentence_splitter
- Type: `object`
- Description: Controls how longer sentences are split into smaller parts (sometimes with slight rephrasing, but without losing the meaning) to speed up processing.
2.3.2.4.1. enabled
- Type: `bool`
- Description: Whether to enable automatic sentence splitting.
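As a sketch, a `transcription` block combining the fields above could look like this (values are illustrative):

```python
transcription = {
    "source_language": "auto",                      # or a fixed code like "en"
    "detectable_languages": ["en", "es", "fr"],     # only used with "auto"
    "segment_confirmation_silence_threshold": 0.7,  # seconds; 0.5-0.9 recommended
    "sentence_splitter": {"enabled": True},
}
```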
2.3.2.5. verification
- Type: `object`
- Description: Controls transcription verification settings.
2.3.2.5.1. auto_transcription_correction
- Type: `bool`
- Description: Work in progress. Allows automatic transcription verification using an LLM.
2.3.2.5.2. transcription_correction_style
- Type: `string` or `null`
- Description: Style of the LLM correction.
2.3.3. translations
- Type: `array of objects`
- Description: An array of translation targets. Each object defines translation settings for a specific target language.
Note: `translations` is an array of objects, each representing a required language. Below is an example of an object for a single language.
2.3.3.1.1. target_language
- Type: `string`
- Description: The language into which the text should be translated (e.g., `"en-us"`, `"es"`, `"fr"`).
2.3.3.1.2. translate_partial_transcriptions
- Type: `bool`
- Description: Allows translating partial transcriptions.
2.3.3.1.3. speech_generation
- Type: `object`
- Description: Configures text-to-speech (TTS) settings.
2.3.3.1.3.1. voice_cloning
- Type: `bool`
- Description: Experimental. Enables voice cloning to mimic the original speaker's voice. It usually takes 10-20 seconds of speech before the voice changes are applied.
2.3.3.1.3.2. voice_id
- Type: `null` or `string`
- Description: A particular voice ID can be specified. Voice cloning must be disabled. If set to `"default_low"` or `"default_high"`, the best default voice for the selected language will be used automatically. You can create or manage voices in the Palabra web portal.
2.3.3.1.3.3. voice_timbre_detection
- Type: `object`
- Description: Allows automatically detecting and assigning voice IDs to different voice timbres.
2.3.3.1.3.3.1. enabled
- Type: `bool`
- Description: Enables voice timbre detection. Voice cloning must be disabled.
2.3.3.1.3.3.2. high_timbre_voices
- Type: `array of strings`
- Description: Specifies which voice ID to use for high-timbre voices. (Currently, only one ID is supported, or use `"default_high"`.)
2.3.3.1.3.3.3. low_timbre_voices
- Type: `array of strings`
- Description: Specifies which voice ID to use for low-timbre voices. (Currently, only one ID is supported, or use `"default_low"`.)
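For reference, here is a sketch of a `translations` entry for one target language, using only the fields described above (values are illustrative; per the notes above, both `voice_id` and timbre detection require `voice_cloning` to be disabled):

```python
translations = [
    {
        "target_language": "es",
        "translate_partial_transcriptions": False,
        "speech_generation": {
            "voice_cloning": False,       # must be off to pick a voice_id
            "voice_id": "default_low",    # or a specific voice ID from the portal
            "voice_timbre_detection": {
                "enabled": False,
                "high_timbre_voices": ["default_high"],
                "low_timbre_voices": ["default_low"],
            },
        },
    },
]
```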
2.4. translation_queue_configs
- Type: `object`
- Description: Configures the behavior of unspoken TTS buffers.
2.4.1. global
- Type: `object`
- Description: Global/default settings for TTS queue behavior. You can add language-specific overrides by using the language code as a key (for example, `"es"` for Spanish).
2.4.1.1. desired_queue_level_ms
- Type: `number`
- Description: Desired average TTS buffer size in milliseconds. If `auto_tempo` is enabled, the system will try to keep the buffer at this level. A recommended value is between 5000 and 8000 milliseconds (5-8 seconds).
2.4.1.2. max_queue_level_ms
- Type: `number`
- Description: The maximum TTS queue size in milliseconds. If the queue grows beyond this limit, it will be reduced to `desired_queue_level_ms` by dropping older queued audio. It should be at least two or three times larger than `desired_queue_level_ms`.
2.4.1.3. auto_tempo
- Type: `bool`
- Description: Automatically corrects the speech tempo based on the queue state. It is recommended to keep this enabled.
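A `translation_queue_configs` sketch with a Spanish override. It is an assumption here that language-keyed overrides sit alongside `global` rather than inside it; the values are illustrative and follow the recommendation that `max_queue_level_ms` be two to three times `desired_queue_level_ms`.

```python
translation_queue_configs = {
    "global": {
        "desired_queue_level_ms": 6000,   # recommended: 5000-8000 ms
        "max_queue_level_ms": 15000,      # 2-3x desired_queue_level_ms
        "auto_tempo": True,               # keep enabled
    },
    "es": {                               # per-language override for Spanish
        "desired_queue_level_ms": 8000,
        "max_queue_level_ms": 20000,
        "auto_tempo": True,
    },
}
```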
2.5. allowed_message_types
- Type: `array of strings`
- Description: Specifies the types of messages you will receive back via WebSocket. The same messages are also sent in the WebRTC data channel.
  - `"partial_transcription"` - Emitted for partial transcription segments as they are recognized.
  - `"partial_translated_transcription"` - Emitted for partial translated transcriptions if `translate_partial_transcriptions` is enabled.
  - `"validated_transcription"` - Emitted when a transcription segment is fully confirmed.
  - `"translated_transcription"` - Emitted when a transcription segment has been translated.
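On the receiving side, a dispatcher for these message types might look like this sketch. It assumes incoming frames are JSON objects carrying a `message_type` field, mirroring the outgoing envelope; the exact payload shape is not specified in this reference.

```python
import json

ALLOWED_MESSAGE_TYPES = {
    "partial_transcription",
    "partial_translated_transcription",
    "validated_transcription",
    "translated_transcription",
}

def dispatch(frame: str):
    """Return the message type if it is one we subscribed to, else None."""
    message = json.loads(frame)
    kind = message.get("message_type")
    return kind if kind in ALLOWED_MESSAGE_TYPES else None

result = dispatch('{"message_type": "validated_transcription", "data": {}}')
```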