Translation Settings Breakdown

A complete reference for all configuration options of the set_task message.

What are translation settings?

Translation settings control how your audio is processed, transcribed, translated, and synthesized. They are included in the data field of a set_task message sent via the WebRTC data channel or the WebSocket connection.

You will use these settings when:

  • Starting a new translation — send your initial set_task message to begin real-time translation.
  • Updating an active translation — modify settings during an ongoing translation without interrupting the stream.
  • Fine-tuning performance — adjust parameters to optimize quality, latency, or specific use cases.

For ready-to-use values, see Recommended settings.

Message envelope

{
  "message_type": "set_task",
  "data": {
    "input_stream": { /* ... */ },
    "output_stream": { /* ... */ },
    "pipeline": { /* ... */ }
  }
}
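
As a minimal sketch, the envelope above could be assembled in Python as follows. The helper name build_set_task is hypothetical, and the empty section dicts are placeholders to be filled from the fields documented in the rest of this reference.

```python
import json


def build_set_task(input_stream: dict, output_stream: dict, pipeline: dict) -> str:
    """Serialize a set_task message using the envelope shown above."""
    message = {
        "message_type": "set_task",
        "data": {
            "input_stream": input_stream,
            "output_stream": output_stream,
            "pipeline": pipeline,
        },
    }
    return json.dumps(message)


# Placeholder sections; real values come from the fields documented below.
payload = build_set_task(input_stream={}, output_stream={}, pipeline={})
```

The resulting string is what would be sent over the WebRTC data channel or the WebSocket connection.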

message_type

  • Type: string
  • Description: Identifies the type of command being sent:
    • "set_task" — create a new task or update an existing one.
    • "get_task" — return the current task.
    • "pause_task" — stop processing new audio but keep the task alive. Use set_task to resume.
    • "flush_task" — cancel processing of the current (ongoing) phrase.
    • "end_task" — end the task (you will be disconnected automatically).
    • "tts_task" — generate text-to-speech from text.
    • "input_audio_data" — WebSocket transport only; contains an audio data chunk.
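
A small client-side guard can catch typos in message_type before anything is sent. The sketch below is illustrative only: build_message is a hypothetical helper, and the empty data object used for commands without settings is an assumption, not something this reference specifies.

```python
import json
from typing import Optional

# Message types listed in this reference.
KNOWN_MESSAGE_TYPES = {
    "set_task", "get_task", "pause_task", "flush_task",
    "end_task", "tts_task", "input_audio_data",
}


def build_message(message_type: str, data: Optional[dict] = None) -> str:
    """Serialize a command, rejecting types not listed in this reference."""
    if message_type not in KNOWN_MESSAGE_TYPES:
        raise ValueError(f"unknown message_type: {message_type!r}")
    # Assumption: commands without settings carry an empty data object.
    return json.dumps({"message_type": message_type, "data": data or {}})


pause = build_message("pause_task")
```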

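data

  • Type: object
  • Description: Holds the task settings, grouped into the input_stream, output_stream, and pipeline sections described below.
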
input_stream

Configures the input stream.

| Field | Type | Description |
| --- | --- | --- |
| content_type | string | Content type of the input stream. Currently only audio is supported. |
| source | object | How and from where the input audio is sourced. See below. |

source

| Field | Type | Description |
| --- | --- | --- |
| type | string | Input audio transport: webrtc or ws. Must match the target transport in output_stream. |
| format | string | WebSocket transport only. Input audio format: opus, pcm_s16le, or wav. |
| sample_rate | integer | WebSocket transport only. Input sample rate. Allowed range: 16000–48000 Hz. |
| channels | integer | WebSocket transport only. Number of input channels: 1 or 2. |
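
The WebSocket-only constraints in this table can be checked before the message is sent. The following is a sketch under the assumption that client-side validation is wanted; build_ws_source is a hypothetical helper, not part of any SDK.

```python
ALLOWED_INPUT_FORMATS = {"opus", "pcm_s16le", "wav"}


def build_ws_source(fmt: str, sample_rate: int, channels: int) -> dict:
    """Build a WebSocket source object, enforcing the documented limits."""
    if fmt not in ALLOWED_INPUT_FORMATS:
        raise ValueError(f"unsupported input format: {fmt!r}")
    if not 16000 <= sample_rate <= 48000:
        raise ValueError("sample_rate must be within 16000-48000 Hz")
    if channels not in (1, 2):
        raise ValueError("channels must be 1 or 2")
    return {
        "type": "ws",
        "format": fmt,
        "sample_rate": sample_rate,
        "channels": channels,
    }


source = build_ws_source("pcm_s16le", 24000, 1)
```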

output_stream

Configures the output stream.

| Field | Type | Description |
| --- | --- | --- |
| content_type | string | Content type of the output. Currently only audio is supported. |
| target | object | Destination of the output. See below. |

target

| Field | Type | Description |
| --- | --- | --- |
| type | string | Output audio transport: webrtc or ws. Must match the source transport in input_stream. |
| format | string | WebSocket transport only. Output audio format: pcm_s16le or zlib_pcm_s16le (zlib-compressed PCM). |
| sample_rate | integer | WebSocket transport only. Output sample rate. Fixed at 24000 Hz. |
| channels | integer | WebSocket transport only. Output channels. Fixed at 1 (mono). |
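
Because the source and target transports must match, one way to avoid a mismatch is to derive both stream objects from a single transport value. This is a sketch only; build_streams is a hypothetical helper, and the input format, sample rate, and channel count are example choices within the documented ranges.

```python
def build_streams(transport: str) -> dict:
    """Return input_stream and output_stream that share one transport."""
    if transport not in ("webrtc", "ws"):
        raise ValueError(f"unsupported transport: {transport!r}")
    source = {"type": transport}
    target = {"type": transport}
    if transport == "ws":
        # Example input values chosen from the documented ranges.
        source.update(format="pcm_s16le", sample_rate=24000, channels=1)
        # Output sample rate and channel count are fixed by this reference.
        target.update(format="pcm_s16le", sample_rate=24000, channels=1)
    return {
        "input_stream": {"content_type": "audio", "source": source},
        "output_stream": {"content_type": "audio", "target": target},
    }


ws_streams = build_streams("ws")
webrtc_streams = build_streams("webrtc")
```

For the webrtc transport the WebSocket-only fields are simply left out.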

pipeline

Holds the configuration for all processing steps: preprocessing, transcription, and translation.

transcription

Settings for Automatic Speech Recognition (ASR).

source_language

  • Type: string
  • Description: Language code of the input audio (e.g., "en", "es", "fr"). Set to "auto" to enable automatic language detection (optionally restricted by detectable_languages). See Supported languages.

detectable_languages

  • Type: array of strings
  • Description: When source_language is "auto", only languages from this list will be detected. Leave empty to allow any supported language.
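
The relationship between source_language and detectable_languages can be sketched as below. The helper transcription_settings is hypothetical, and rejecting a language list when auto-detection is off is a deliberate strictness of this sketch, not documented server behavior.

```python
from typing import Optional, Sequence


def transcription_settings(
    source_language: str,
    detectable_languages: Optional[Sequence[str]] = None,
) -> dict:
    """Build the language-related part of the transcription settings."""
    settings = {"source_language": source_language}
    if detectable_languages:
        # The list only has an effect together with automatic detection.
        if source_language != "auto":
            raise ValueError('detectable_languages requires source_language "auto"')
        settings["detectable_languages"] = list(detectable_languages)
    return settings


fixed = transcription_settings("en")
restricted = transcription_settings("auto", ["en", "es"])
```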

segment_confirmation_silence_threshold

  • Type: float
  • Description: Seconds of silence needed to confirm the end of a segment (0.3–2.0, default 0.7). Recommended range: 0.5–0.9 s, depending on the speaker's tempo and pauses. Increase it if the speaker frequently pauses between words; setting it too low can cause unwanted sentence splitting.
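
If the threshold comes from user input, it can be clamped to the allowed range before being sent. A minimal sketch, with clamp_silence_threshold as a hypothetical helper name:

```python
def clamp_silence_threshold(seconds: float) -> float:
    """Clamp a value to the allowed 0.3-2.0 second range."""
    return min(2.0, max(0.3, seconds))


too_low = clamp_silence_threshold(0.1)
default = clamp_silence_threshold(0.7)
too_high = clamp_silence_threshold(5.0)
```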

speakers_total

  • Type: integer or null
  • Description: Expected number of speakers (1–1000). Set it when the number of speakers is known in advance to improve speaker handling.

only_confirm_by_silence

  • Type: bool
  • Description: When true, segments are confirmed only by silence detection.

sentence_splitter

  • Type: object
  • Description: Controls how longer sentences are split into smaller parts (sometimes with slight rephrasing, without losing the meaning) to speed up processing.

| Field | Type | Description |
| --- | --- | --- |
| enabled | bool | Whether to enable automatic sentence splitting. |

verification

  • Type: object
  • Description: Transcription verification settings.

| Field | Type | Description |
| --- | --- | --- |
| auto_transcription_correction | bool | WIP. Enables automatic transcription verification using an LLM. |
| transcription_correction_style | string or null | Style of the LLM correction. |

translations

  • Type: array of objects
  • Description: An array of translation targets. Each object defines translation settings for one target language; add one object per target language.

target_language

  • Type: string
  • Description: Language to translate into (e.g., "en-us", "es", "fr"). See Supported languages.

allowed_source_languages

  • Type: array of strings
  • Description: Restricts this translation target to specific source languages. Used for conditional multi-language translation together with "source_language": "auto".
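
As an illustration of conditional multi-language translation, the sketch below configures a two-way English and Spanish setup. The nesting of transcription and translations under pipeline is assumed from the section layout of this reference. The targets_for helper is hypothetical and only mirrors the routing on the client side for clarity; it also assumes that a target without allowed_source_languages accepts any source language.

```python
pipeline = {
    "transcription": {
        "source_language": "auto",
        "detectable_languages": ["en", "es"],
    },
    "translations": [
        # English speech is translated into Spanish only.
        {"target_language": "es", "allowed_source_languages": ["en"]},
        # Spanish speech is translated into English only.
        {"target_language": "en-us", "allowed_source_languages": ["es"]},
    ],
}


def targets_for(source_language: str, translations: list) -> list:
    """List the target languages that apply to a detected source language."""
    return [
        t["target_language"]
        for t in translations
        # Assumption: a missing or empty list means no restriction.
        if not t.get("allowed_source_languages")
        or source_language in t["allowed_source_languages"]
    ]


english_targets = targets_for("en", pipeline["translations"])
spanish_targets = targets_for("es", pipeline["translations"])
```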

translate_partial_transcriptions

  • Type: bool
  • Description: Enables translation of partial (unconfirmed) transcriptions.

speech_generation

  • Type: object
  • Description: Text-to-speech (TTS) settings for this target language.

| Field | Type | Description |
| --- | --- | --- |
| voice_cloning | bool | Experimental. Mimics the original speaker's voice. It usually takes 10–20 seconds of speech before the voice changes are applied. |
| voice_id | string or null | A specific voice ID (voice cloning must be disabled). "default_low" or "default_high" automatically picks the best default voice for the language. Manage voices in the Palabra web portal. |
| voice_timbre_detection | object | Automatically detects voice timbre and assigns voice IDs accordingly. See below. |

voice_timbre_detection

| Field | Type | Description |
| --- | --- | --- |
| enabled | bool | Enables voice timbre detection (voice cloning must be disabled). |
| high_timbre_voices | array of strings | Voice ID to use for high-timbre voices. Currently only one ID is supported; "default_high" can be used. |
| low_timbre_voices | array of strings | Voice ID to use for low-timbre voices. Currently only one ID is supported; "default_low" can be used. |
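
Since both voice_id and voice timbre detection require voice cloning to be disabled, a client can enforce that rule when assembling speech_generation. The sketch below is one possible approach; build_speech_generation is a hypothetical helper.

```python
from typing import Optional


def build_speech_generation(
    voice_cloning: bool = False,
    voice_id: Optional[str] = None,
    timbre_detection: bool = False,
) -> dict:
    """Build a speech_generation object, rejecting conflicting options."""
    if voice_cloning and (voice_id is not None or timbre_detection):
        raise ValueError(
            "voice_id and voice_timbre_detection require voice_cloning to be disabled"
        )
    return {
        "voice_cloning": voice_cloning,
        "voice_id": voice_id,
        "voice_timbre_detection": {
            "enabled": timbre_detection,
            # Only one ID per list is currently supported.
            "high_timbre_voices": ["default_high"],
            "low_timbre_voices": ["default_low"],
        },
    }


speech = build_speech_generation(voice_id="default_low", timbre_detection=True)
```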

translation_queue_configs

Configures the behavior of unspoken TTS buffers.

The global key holds the default settings. You can add language-specific overrides by using the language code as a key (for example, "es" for Spanish).

| Field | Type | Description |
| --- | --- | --- |
| desired_queue_level_ms | integer | Desired average TTS buffer size in milliseconds (2000–163840). With auto_tempo enabled, the system tries to keep the buffer at this level. Recommended: 5000–10000 ms. |
| max_queue_level_ms | integer | Maximum TTS queue size in milliseconds (3000–163840). If the queue grows beyond this limit, it is reduced to desired_queue_level_ms by dropping older queued audio. Must be greater than desired_queue_level_ms; should be at least 2–3× larger. |
| auto_tempo | bool | Automatically corrects speech tempo based on the queue state. Recommended to keep on. |
| auto_tempo_max_delay_ms | integer | Maximum buffer delay in milliseconds for auto tempo (60–10000, default 250). |
| min_tempo | float | Minimum allowed speech speed (1.0–2.0, default 1.0). |
| max_tempo | float | Maximum allowed speech speed (1.0–2.0, default 1.35). Must be ≥ min_tempo. |

If you don't provide translation_queue_configs, the server applies a default global config: desired_queue_level_ms: 5000, max_queue_level_ms: 20000, min_tempo: 1.15, max_tempo: 1.45, auto_tempo: true.
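
The constraints between the queue fields can be checked on the client before sending. This is a sketch rather than the server's validation logic: validate_queue_config is a hypothetical helper, the "es" override values are examples, and whether an override may omit fields present in global is not stated here, so the override repeats the level fields explicitly.

```python
def validate_queue_config(cfg: dict) -> None:
    """Raise ValueError if a queue config violates the documented limits."""
    desired = cfg["desired_queue_level_ms"]
    maximum = cfg["max_queue_level_ms"]
    if not 2000 <= desired <= 163840:
        raise ValueError("desired_queue_level_ms must be within 2000-163840")
    if not 3000 <= maximum <= 163840:
        raise ValueError("max_queue_level_ms must be within 3000-163840")
    if maximum <= desired:
        raise ValueError("max_queue_level_ms must exceed desired_queue_level_ms")
    # Fall back to the per-field defaults from the table when a tempo is omitted.
    min_tempo = cfg.get("min_tempo", 1.0)
    max_tempo = cfg.get("max_tempo", 1.35)
    if not 1.0 <= min_tempo <= max_tempo <= 2.0:
        raise ValueError("tempo values must satisfy 1.0 <= min_tempo <= max_tempo <= 2.0")


translation_queue_configs = {
    # Same values as the server-side default global config.
    "global": {
        "desired_queue_level_ms": 5000,
        "max_queue_level_ms": 20000,
        "auto_tempo": True,
        "min_tempo": 1.15,
        "max_tempo": 1.45,
    },
    # Example language-specific override for Spanish.
    "es": {
        "desired_queue_level_ms": 8000,
        "max_queue_level_ms": 24000,
        "auto_tempo": True,
    },
}

for config in translation_queue_configs.values():
    validate_queue_config(config)
```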


allowed_message_types

  • Type: array of strings
  • Default: ["translated_transcription", "partial_transcription", "validated_transcription"]
  • Description: Specifies which message types you will receive over the WebSocket. The same messages are also sent in the WebRTC data channel.
    • "partial_transcription" — emitted for partial transcription segments as they are recognized.
    • "partial_translated_transcription" — emitted for partial translated transcriptions if translate_partial_transcriptions is enabled.
    • "validated_transcription" — emitted when a transcription segment is fully confirmed.
    • "translated_transcription" — emitted when a transcription segment has been translated.
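
Because "partial_translated_transcription" is only emitted when translate_partial_transcriptions is enabled, the list can be derived from the translation targets. A minimal sketch, with message_types_for as a hypothetical helper:

```python
DEFAULT_MESSAGE_TYPES = [
    "translated_transcription",
    "partial_transcription",
    "validated_transcription",
]


def message_types_for(translations: list) -> list:
    """Start from the default list and add partial translations when enabled."""
    types = list(DEFAULT_MESSAGE_TYPES)
    if any(t.get("translate_partial_transcriptions") for t in translations):
        types.append("partial_translated_transcription")
    return types


with_partials = message_types_for(
    [{"target_language": "es", "translate_partial_transcriptions": True}]
)
without_partials = message_types_for([{"target_language": "es"}])
```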

See also: Translation management API · Publishing and receiving audio