Translation Settings Breakdown
A comprehensive reference for all configuration options available in the set_task message.
What are translation settings?
Translation settings are configuration parameters that you send to Palabra's streaming API to control how your audio is processed, transcribed, translated, and synthesized.
These settings are included in the `data` field of a `set_task` message that you send via WebRTC data channel or WebSocket connection.
When to use these settings
You'll configure these settings in several scenarios:
- Starting a new translation session - Send your initial `set_task` message to begin real-time translation
- Updating an active session - Modify translation settings during an ongoing translation without interrupting the stream
- Fine-tuning performance - Adjust parameters to optimize quality, latency, or specific use cases
Task object structure
1. message_type
- Type: `string`
- Description: Identifies the type of command being sent:
  - `"set_task"` - Create a new task or update an existing one.
  - `"get_task"` - Return the current task.
  - `"end_task"` - End the task (you will be disconnected automatically).
  - `"pause_task"` - Do not process new audio data, but keep the task alive. Use `"set_task"` to resume.
  - `"tts_task"` - Generates text-to-speech from text.
  - `"input_audio_data"` - WebSockets transport only. Contains an audio data chunk.
2. data
- Type: `object`
- Description: Contains the main configuration details for this task.
Within `data`, there are five major sections:
- `input_stream`
- `output_stream`
- `pipeline`
- `translation_queue_configs`
- `allowed_message_types`
2.1. input_stream
- Type: `object`
- Description: Configures the input stream settings.
2.1.1. content_type
- Type: `string`
- Description: Describes the content type in the input stream. Currently, only `audio` is supported.
2.1.2. source
- Type: `object`
- Description: Indicates how and from where the input audio stream is sourced.
2.1.2.1. type
- Type: `string`
- Description: Specifies the input audio transport. `webrtc` and `ws` are supported. Must match the `target` transport.
2.1.2.2. format
- Type: `string`
- Description: Required for the WebSockets transport only. Input audio format; supported values are `opus`, `pcm_s16le`, and `wav`.
2.1.2.3. sample_rate
- Type: `string`
- Description: Required for the WebSockets transport only. Input audio sample rate. The allowed range is 16 kHz to 48 kHz.
2.1.2.4. channels
- Type: `string`
- Description: Required for the WebSockets transport only. Number of input channels. One or two channels are supported.
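Putting the fields above together, a WebSocket `input_stream` section might look like this (an illustrative sketch; the values simply respect the constraints stated above):

```python
# Illustrative input_stream configuration for the WebSocket transport.
# format, sample_rate, and channels are required for ws only.
input_stream = {
    "content_type": "audio",     # only "audio" is currently supported
    "source": {
        "type": "ws",            # must match the output target transport
        "format": "pcm_s16le",   # opus, pcm_s16le, or wav
        "sample_rate": 24000,    # allowed range: 16 kHz to 48 kHz
        "channels": 1,           # one or two channels
    },
}
```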
2.2. output_stream
- Type: `object`
- Description: Defines the output settings.
2.2.1. content_type
- Type: `string`
- Description: Describes the content type in the output. Currently, only `audio` is supported.
2.2.2. target
- Type: `object`
- Description: Indicates the destination of the output.
2.2.2.1. type
- Type: `string`
- Description: Specifies the output audio transport. `webrtc` and `ws` are supported. Must match the `source` transport.
2.2.2.2. format
- Type: `string`
- Description: Required for the WebSockets transport only. Output audio format; supported values are `pcm_s16le` and `zlib_pcm_s16le` (zlib-compressed PCM).
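A matching `output_stream` section might look like this (a sketch under the constraints above; the transport type must match the one chosen for the input source):

```python
# Illustrative output_stream configuration for the WebSocket transport.
output_stream = {
    "content_type": "audio",
    "target": {
        "type": "ws",            # must match the input source transport
        "format": "pcm_s16le",   # or "zlib_pcm_s16le" for zlib-compressed PCM
    },
}
```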
2.3. pipeline
- Type: `object`
- Description: Holds the configuration for all processing steps, including preprocessing, transcription, and translation.
2.3.2. transcription
- Type: `object`
- Description: Settings for Automatic Speech Recognition (ASR).
2.3.2.1. source_language
- Type: `string`
- Description: Language code representing the input audio language (e.g., `"en"`, `"es"`, `"fr"`). Can be set to `"auto"` for automatic language detection, which also allows setting `detectable_languages`.
2.3.2.2. detectable_languages
- Type: `array of strings`
- Description: Only languages from this list will be detected when `source_language` is set to `"auto"`.
2.3.2.3. segment_confirmation_silence_threshold
- Type: `float`
- Description: The time in seconds of silence needed to confirm the end of a segment. The recommended value is between 0.5 s and 0.9 s, depending on the average speech tempo and pauses. Increase this value if a speaker frequently pauses between words. If it is set too low, it can lead to unwanted sentence splitting.
2.3.2.4. sentence_splitter
- Type: `object`
- Description: Controls how longer sentences are split into smaller parts (sometimes with slight rephrasing, but without losing the meaning) to speed up processing.
2.3.2.4.1. enabled
- Type: `bool`
- Description: Whether to enable automatic sentence splitting.
2.3.2.5. verification
- Type: `object`
- Description: Controls transcription verification settings.
2.3.2.5.1. auto_transcription_correction
- Type: `bool`
- Description: Work in progress. Enables automatic transcription correction using an LLM.
2.3.2.5.2. transcription_correction_style
- Type: `string` or `null`
- Description: Style of LLM correction.
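The transcription settings above might be combined as follows (illustrative values chosen within the recommended ranges):

```python
# Illustrative ASR (transcription) section of the pipeline.
transcription = {
    "source_language": "auto",             # or a fixed code such as "en"
    "detectable_languages": ["en", "es"],  # only used when source_language is "auto"
    "segment_confirmation_silence_threshold": 0.7,  # 0.5-0.9 s recommended
    "sentence_splitter": {"enabled": True},
    "verification": {
        "auto_transcription_correction": False,  # work-in-progress feature
        "transcription_correction_style": None,
    },
}
```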
2.3.3. translations
- Type: `array of objects`
- Description: An array of translation targets. Each object defines translation settings for a specific target language.
Note: `translations` is an array of objects, each representing a required language. Below is an example of an object for a single language.
2.3.3.1.1. target_language
- Type: `string`
- Description: The language into which the text should be translated (e.g., `"en-us"`, `"es"`, `"fr"`).
2.3.3.1.2. translate_partial_transcriptions
- Type: `bool`
- Description: Allows translating partial transcriptions.
2.3.3.1.3. speech_generation
- Type: `object`
- Description: Configures text-to-speech (TTS) settings.
2.3.3.1.3.1. voice_cloning
- Type: `bool`
- Description: Experimental. Enables voice cloning to mimic the original speaker's voice. It usually takes 10-20 seconds of speech before the voice changes are applied.
2.3.3.1.3.2. voice_id
- Type: `null` or `string`
- Description: A particular voice ID can be specified. Voice cloning must be disabled. If set to `"default_low"` or `"default_high"`, the best default voice for the selected language will be used automatically. You can create or manage voices in the Palabra web portal.
2.3.3.1.3.3. voice_timbre_detection
- Type: `object`
- Description: Allows automatically detecting and assigning voice IDs to different voice timbres.
2.3.3.1.3.3.1. enabled
- Type: `bool`
- Description: Enables voice timbre detection. Voice cloning must be disabled.
2.3.3.1.3.3.2. high_timbre_voices
- Type: `array of strings`
- Description: Specifies which voice ID to use for high timbre voices. (Currently, only one ID is supported, or use `"default_high"`.)
2.3.3.1.3.3.3. low_timbre_voices
- Type: `array of strings`
- Description: Specifies which voice ID to use for low timbre voices. (Currently, only one ID is supported, or use `"default_low"`.)
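A `translations` array with a single Spanish target might look like this (a sketch; note that both `voice_id` and `voice_timbre_detection` require voice cloning to be disabled, as stated above):

```python
# Illustrative translations array with one target language.
translations = [
    {
        "target_language": "es",
        "translate_partial_transcriptions": False,
        "speech_generation": {
            "voice_cloning": False,      # must stay off to use voice_id below
            "voice_id": "default_high",
            "voice_timbre_detection": {
                "enabled": False,
                "high_timbre_voices": ["default_high"],
                "low_timbre_voices": ["default_low"],
            },
        },
    },
]
```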
2.4. translation_queue_configs
- Type: `object`
- Description: Configures the behavior of unspoken TTS buffers.
2.4.1. global
- Type: `object`
- Description: Global/default settings for TTS queue behavior. You can add language-specific overrides by using the language code as a key (for example, `"es"` for Spanish).
2.4.1.1. desired_queue_level_ms
- Type: `number`
- Description: Desired average TTS buffer size in milliseconds. If `auto_tempo` is enabled, it will try to keep the buffer at this level. A recommended value is between 5000 and 10000 milliseconds (5-10 seconds).
2.4.1.2. max_queue_level_ms
- Type: `number`
- Description: The maximum TTS queue size in milliseconds. If the queue grows beyond this limit, it will be reduced to `desired_queue_level_ms` by dropping older queued audio. It should be at least two or three times larger than `desired_queue_level_ms`.
2.4.1.3. auto_tempo
- Type: `bool`
- Description: Automatically corrects the speech tempo based on the queue state. It is recommended to keep this enabled.
2.4.1.4. min_tempo
- Type: `number`
- Description: Minimum allowed speech speed. Defaults to 1.0; must be between 1.0 and 2.0.
2.4.1.5. max_tempo
- Type: `number`
- Description: Maximum allowed speech speed. Defaults to 1.2; must be between 1.0 and 2.0.
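The queue settings above can be sketched with the recommended relationships checked explicitly (illustrative values; the override key is an assumption based on the language-code convention described for `global`):

```python
# Illustrative translation_queue_configs with a per-language override.
translation_queue_configs = {
    "global": {
        "desired_queue_level_ms": 8000,   # 5000-10000 ms recommended
        "max_queue_level_ms": 24000,      # 2-3x desired_queue_level_ms
        "auto_tempo": True,               # recommended to keep enabled
        "min_tempo": 1.0,                 # 1.0-2.0
        "max_tempo": 1.2,                 # 1.0-2.0
    },
    "es": {"desired_queue_level_ms": 6000},  # override keyed by language code
}
g = translation_queue_configs["global"]
assert 2 * g["desired_queue_level_ms"] <= g["max_queue_level_ms"]
assert 1.0 <= g["min_tempo"] <= g["max_tempo"] <= 2.0
```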
2.5. allowed_message_types
- Type: `array of strings`
- Description: Specifies the types of messages you will receive back via WebSocket. The same messages are also sent in the WebRTC data channel.
  - `"partial_transcription"` - Emitted for partial transcription segments as they are recognized.
  - `"partial_translated_transcription"` - Emitted for partial translated transcriptions if `translate_partial_transcriptions` is enabled.
  - `"validated_transcription"` - Emitted when a transcription segment is fully confirmed.
  - `"translated_transcription"` - Emitted when a transcription segment has been translated.
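Tying the five sections together, a complete `set_task` message might look like this (values are illustrative; see each section above for the allowed ranges and constraints):

```python
import json

# Illustrative, complete set_task message assembling the sections above.
set_task = {
    "message_type": "set_task",
    "data": {
        "input_stream": {
            "content_type": "audio",
            "source": {"type": "ws", "format": "pcm_s16le",
                       "sample_rate": 24000, "channels": 1},
        },
        "output_stream": {
            "content_type": "audio",
            "target": {"type": "ws", "format": "pcm_s16le"},
        },
        "pipeline": {
            "transcription": {
                "source_language": "en",
                "segment_confirmation_silence_threshold": 0.7,
            },
            "translations": [
                {"target_language": "es",
                 "speech_generation": {"voice_cloning": False,
                                       "voice_id": "default_high"}},
            ],
        },
        "translation_queue_configs": {
            "global": {"desired_queue_level_ms": 8000,
                       "max_queue_level_ms": 24000,
                       "auto_tempo": True},
        },
        "allowed_message_types": [
            "partial_transcription",
            "validated_transcription",
            "translated_transcription",
        ],
    },
}
payload = json.dumps(set_task)  # send over the WebSocket or data channel
```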