Translation Settings Breakdown

A comprehensive reference for all configuration options available in the set_task message.

What are translation settings?

Translation settings are configuration parameters that you send to Palabra's streaming API to control how your audio is processed, transcribed, translated, and synthesized. These settings are included in the data field of a set_task message that you send via WebRTC data channel or WebSocket connection.

When to use these settings

You'll configure these settings in several scenarios:

Starting a new translation session - Send your initial set_task message to begin real-time translation
Updating an active session - Modify translation settings during an ongoing translation without interrupting the stream
Fine-tuning performance - Adjust parameters to optimize quality, latency, or specific use cases

Task object structure

1. `message_type`

Type: string
Description: Identifies the type of command being sent:
- "set_task" - Create a new task or update an existing one.
- "get_task" - Return the current task.
- "end_task" - End the task (you will be disconnected automatically).
- "pause_task" - Do not process new audio data, but keep the task alive. Use set_task to resume.
- "tts_task"" - Generates text-to-speech from a text.
- "input_audio_data"" - Websockets transport only. Contains audio data chunk.

2. `data`

Type: object
Description: Contains the main configuration details for this task.

Within data, there are five major sections:

input_stream
output_stream
pipeline
translation_queue_configs
allowed_message_types

2.1. `input_stream`

Type: object
Description: Configures the input stream settings.

2.1.1. `content_type`

Type: string
Description: Describes the content type in the input stream. Currently, only audio is supported.

2.1.2. `source`

Type: object
Description: Indicates how and from where the input audio stream is sourced.

2.1.2.1. `type`

Type: string
Description: Specifies input audio transport. webrtc and ws are supported. Must match target transport.

2.1.2.2. `format`

Type: string
Description: Required for websockets transport only. Input audio format, supported: opus, pcm_s16le and wav

2.1.2.3. `sample_rate`

Type: string
Description: Required for websockets transport only. Input audio sample rate. Allowed range is from 16khz to 48khz.

2.1.2.4. `channels`

Type: string
Description: Required for websockets transport only. Input channels. One or two channels is supported.

2.2. `output_stream`

Type: object
Description: Defines the output settings.

2.2.1. `content_type`

Type: string
Description: Describes the content type in the output. Currently, only audio is supported.

2.2.2. `target`

Type: object
Description: Indicates the destination of the output.

2.2.2.1. `type`

Type: string
Description: Specifies output audio transport. webrtc and ws are supported. Must match source transport.

2.1.2.2. `format`

Type: string
Description: Required for websockets transport only. Output audio format, supported: pcm_s16le, zlib_pcm_s16le (zlib-compressed pcm)

2.3. `pipeline`

Type: object
Description: Holds the configuration for all processing steps, including preprocessing, transcription, and translation.

2.3.2. `transcription`

Type: object
Description: Settings for Automatic Speech Recognition (ASR).

2.3.2.1. `source_language`

Type: string
Description: Language code representing the input audio language (e.g., "en", "es", "fr"). Can be set to "auto" for automatic language detection and allow to set detectable_languages.

2.3.2.2. `detectable_languages`

Type: array of strings
Description: Only languages from the list will be detected if source_language is set to auto.

2.3.2.3. `segment_confirmation_silence_threshold`

Type: float
Description: The time in seconds of silence needed to confirm the end of a segment. The recommended value is between 0.5s and 0.9s, depending on the average speech tempo and pauses. Increase this value if a speaker frequently pauses between words. If it is set too low, it can lead to unwanted sentence splitting.

2.3.2.4. `sentence_splitter`

Type: object
Description: Controls how longer sentences are split into smaller parts (sometimes with slight rephrasing, but without losing the meaning) to speed up processing.

2.3.2.4.1. `enabled`

Type: bool
Description: Whether to enable automatic sentence splitting.

2.3.2.5. `verification`

Type: object
Description: Controls transcription verification settings.

2.3.2.5.1. `auto_transcription_correction`

Type: bool
Description: WIP. Allows automatic transcription verification using LLM model.

2.3.2.5.2. `transcription_correction_style`

Type: string or null
Description: Style of LLM correction.

2.3.3. `translations`

Type: array of objects
Description: An array of translation targets. Each object defines translation settings for a specific target language.

Note: translations is an array of objects, each representing a required language. Below is an example of an object for a single language.

2.3.3.1.1. `target_language`

Type: string
Description: The language into which the text should be translated (e.g., "en-us", "es", "fr").

2.3.3.1.2. `translate_partial_transcriptions`

Type: bool
Description: Allows translating partial transcriptions.

2.3.3.1.3. `speech_generation`

Type: object
Description: Configures text-to-speech (TTS) settings.

2.3.3.1.3.1. `voice_cloning`

Type: bool
Description: Experimental. Enables voice cloning to mimic the original speaker's voice. It usually takes 10-20 seconds of speech before the voice changes are applied.

2.3.3.1.3.2. `voice_id`

Type: null or string
Description: A particular voice ID can be specified. Voice cloning must be disabled. If set to "default_low" or "default_high", the best default voice for the selected language will be used automatically. You can create or manage voices in the Palabra web portal.

2.3.3.1.3.3. `voice_timbre_detection`

Type: object
Description: Allows automatically detecting and assigning voice IDs to different voice timbres.

2.3.3.1.3.3.1. `enabled`

Type: bool
Description: Enables voice timbre detection. Voice cloning must be disabled.

2.3.3.1.3.3.2. `high_timbre_voices`

Type: array of strings
Description: Specifies which voice ID to use for high timbre voices. (Currently, only one ID is supported, or use "default_high".)

2.3.3.1.3.3.3. `low_timbre_voices`

Type: array of strings
Description: Specifies which voice ID to use for low timbre voices. (Currently, only one ID is supported, or use "default_low".)

2.4. `translation_queue_configs`

Type: object
Description: Configures the behavior of unspoken TTS buffers.

2.4.1. `global`

Type: object
Description: Global/default settings for TTS queue behavior. You can add language-specific overrides by using the language code as a key (for example, "es" for Spanish).

2.4.1.1. `desired_queue_level_ms`

Type: number
Description: Desired average TTS buffer size in milliseconds. If auto_tempo is enabled, it will try to keep the buffer at this level. A recommended value is between 5000 and 10000 milliseconds (5–10 seconds).

2.4.1.2. `max_queue_level_ms`

Type: number
Description: The maximum TTS queue size in milliseconds. If the queue grows beyond this limit, it will be reduced to desired_queue_level_ms by dropping older queued audio. It should be at least two or three times larger than desired_queue_level_ms.

2.4.1.3. `auto_tempo`

Type: bool
Description: Auto correct speech tempo based on the queue state. It is recommended to keep it on.

2.4.1.4. `min_tempo`

Type: bool
Description: Minimal allowed speech speed. Defaults to 1.0 must be between 1.0 and 2.0.

2.4.1.5. `max_tempo`

Type: bool
Description: Maximum allowed speech speed. Defaults to 1.2 must be between 1.0 and 2.0.

2.5. `allowed_message_types`

Type: array of strings
Description: Specifies the types of messages you will receive back via WebSocket. The same messages are also sent in the WebRTC data channel.
- "partial_transcription" - Emitted for partial transcription segments as they are recognized.
- "partial_translated_transcription" - Emitted for partial translated transcriptions if translate_partial_transcriptions is enabled.
- "validated_transcription" - Emitted when a transcription segment is fully confirmed.
- "translated_transcription" - Emitted when a transcription segment has been translated.

What are translation settings?​

When to use these settings​

Task object structure

1. message_type​

2. data​

2.1. input_stream​

2.1.1. content_type​

2.1.2. source​

2.1.2.1. type​

2.1.2.2. format​

2.1.2.3. sample_rate​

2.1.2.4. channels​

2.2. output_stream​

2.2.1. content_type​

2.2.2. target​

2.2.2.1. type​

2.1.2.2. format​

2.3. pipeline​

2.3.2. transcription​

2.3.2.1. source_language​

2.3.2.2. detectable_languages​

2.3.2.3. segment_confirmation_silence_threshold​

2.3.2.4. sentence_splitter​

2.3.2.4.1. enabled​

2.3.2.5. verification​

2.3.2.5.1. auto_transcription_correction​

2.3.2.5.2. transcription_correction_style​

2.3.3. translations​

2.3.3.1.1. target_language​

2.3.3.1.2. translate_partial_transcriptions​

2.3.3.1.3. speech_generation​

2.3.3.1.3.1. voice_cloning​

2.3.3.1.3.2. voice_id​

2.3.3.1.3.3. voice_timbre_detection​

2.3.3.1.3.3.1. enabled​

2.3.3.1.3.3.2. high_timbre_voices​

2.3.3.1.3.3.3. low_timbre_voices​

2.4. translation_queue_configs​

2.4.1. global​

2.4.1.1. desired_queue_level_ms​

2.4.1.2. max_queue_level_ms​

2.4.1.3. auto_tempo​

2.4.1.4. min_tempo​

2.4.1.5. max_tempo​

2.5. allowed_message_types​

What are translation settings?

When to use these settings

1. `message_type`

2. `data`

2.1. `input_stream`

2.1.1. `content_type`

2.1.2. `source`

2.1.2.1. `type`

2.1.2.2. `format`

2.1.2.3. `sample_rate`

2.1.2.4. `channels`

2.2. `output_stream`

2.2.1. `content_type`

2.2.2. `target`

2.2.2.1. `type`

2.1.2.2. `format`

2.3. `pipeline`

2.3.2. `transcription`

2.3.2.1. `source_language`

2.3.2.2. `detectable_languages`

2.3.2.3. `segment_confirmation_silence_threshold`

2.3.2.4. `sentence_splitter`

2.3.2.4.1. `enabled`

2.3.2.5. `verification`

2.3.2.5.1. `auto_transcription_correction`

2.3.2.5.2. `transcription_correction_style`

2.3.3. `translations`

2.3.3.1.1. `target_language`

2.3.3.1.2. `translate_partial_transcriptions`

2.3.3.1.3. `speech_generation`

2.3.3.1.3.1. `voice_cloning`

2.3.3.1.3.2. `voice_id`

2.3.3.1.3.3. `voice_timbre_detection`

2.3.3.1.3.3.1. `enabled`

2.3.3.1.3.3.2. `high_timbre_voices`

2.3.3.1.3.3.3. `low_timbre_voices`

2.4. `translation_queue_configs`

2.4.1. `global`

2.4.1.1. `desired_queue_level_ms`

2.4.1.2. `max_queue_level_ms`

2.4.1.3. `auto_tempo`

2.4.1.4. `min_tempo`

2.4.1.5. `max_tempo`

2.5. `allowed_message_types`