Task object structure
1. message_type
- Type: string
- Description: Identifies the type of command being sent:
  - "set_task" - Create a new task or update an existing one.
  - "get_task" - Return the current task.
  - "end_task" - End the task (you will be disconnected automatically).
  - "pause_task" - Do not process new audio data, but keep the task alive. Use "set_task" to resume.
  - "tts_task" - Generates text-to-speech from text.
  - "input_audio_data" - WebSockets transport only. Contains an audio data chunk.
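As a minimal sketch of the envelope, assuming only the "message_type" and "data" field names described above, a command message could be built like this (the helper name is hypothetical):

```python
import json

def make_message(message_type, data=None):
    """Build a task command message as a JSON string (illustrative helper)."""
    return json.dumps({"message_type": message_type, "data": data or {}})

# A "get_task" command carries no configuration payload.
msg = make_message("get_task")
```

The "data" payload for "set_task" is the task object described in the sections that follow.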
2. data
- Type: object
- Description: Contains the main configuration details for this task. Within data, there are five major sections:
  - input_stream
  - output_stream
  - pipeline
  - translation_queue_configs
  - allowed_message_types
2.1. input_stream
- Type: object
- Description: Configures the input stream settings.
2.1.1. content_type
- Type: string
- Description: Describes the content type of the input stream. Currently, only "audio" is supported.
2.1.2. source
- Type: object
- Description: Indicates how and from where the input audio stream is sourced.
2.1.2.1. type
- Type: string
- Description: Specifies the input audio transport. "webrtc" and "ws" are supported. Must match the target transport.
2.1.2.2. format
- Type: string
- Description: Required for the WebSockets transport only. Input audio format; supported values are "opus", "pcm_s16le", and "wav".
2.1.2.3. sample_rate
- Type: string
- Description: Required for the WebSockets transport only. Input audio sample rate. The allowed range is from 16 kHz to 48 kHz.
2.1.2.4. channels
- Type: string
- Description: Required for the WebSockets transport only. Number of input channels. One or two channels are supported.
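Putting the fields above together, an input_stream section for the WebSockets transport might look like the following sketch. The values are illustrative, and the numeric sample rate and channel count are an assumption (the reference lists their types as string):

```python
# Illustrative input_stream section for the WebSockets ("ws") transport.
input_stream = {
    "content_type": "audio",        # only "audio" is currently supported
    "source": {
        "type": "ws",               # must match the output target transport
        "format": "pcm_s16le",      # one of: "opus", "pcm_s16le", "wav"
        "sample_rate": 24000,       # allowed range: 16 kHz to 48 kHz
        "channels": 1,              # one or two channels
    },
}
```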
2.2. output_stream
- Type: object
- Description: Defines the output settings.
2.2.1. content_type
- Type: string
- Description: Describes the content type of the output. Currently, only "audio" is supported.
2.2.2. target
- Type: object
- Description: Indicates the destination of the output.
2.2.2.1. type
- Type: string
- Description: Specifies the output audio transport. "webrtc" and "ws" are supported. Must match the source transport.
2.2.2.2. format
- Type: string
- Description: Required for the WebSockets transport only. Output audio format; supported values are "pcm_s16le" and "zlib_pcm_s16le" (zlib-compressed PCM).
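Mirroring the input example, a sketch of an output_stream section for the WebSockets transport (values are illustrative; the transport type must match the input source transport):

```python
# Illustrative output_stream section for the WebSockets ("ws") transport.
output_stream = {
    "content_type": "audio",        # only "audio" is currently supported
    "target": {
        "type": "ws",               # must match the input source transport
        "format": "pcm_s16le",      # or "zlib_pcm_s16le" for zlib-compressed PCM
    },
}
```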
2.3. pipeline
- Type: object
- Description: Holds the configuration for all processing steps, including preprocessing, transcription, and translation.
2.3.2. transcription
- Type: object
- Description: Settings for Automatic Speech Recognition (ASR).
2.3.2.1. source_language
- Type: string
- Description: Language code representing the input audio language (e.g., "en", "es", "fr"). Can be set to "auto" for automatic language detection, which allows setting detectable_languages.
2.3.2.2. detectable_languages
- Type: array of strings
- Description: Only languages from this list will be detected if source_language is set to "auto".
2.3.2.3. segment_confirmation_silence_threshold
- Type: float
- Description: The time in seconds of silence needed to confirm the end of a segment. The recommended value is between 0.5s and 0.9s, depending on the average speech tempo and pauses. Increase this value if a speaker frequently pauses between words. If it is set too low, it can lead to unwanted sentence splitting.
2.3.2.4. sentence_splitter
- Type: object
- Description: Controls how longer sentences are split into smaller parts (sometimes with slight rephrasing, but without losing the meaning) to speed up processing.
2.3.2.4.1. enabled
- Type: bool
- Description: Whether to enable automatic sentence splitting.
2.3.2.5. verification
- Type: object
- Description: Controls transcription verification settings.
2.3.2.5.1. auto_transcription_correction
- Type: bool
- Description: Work in progress. Enables automatic transcription correction using an LLM.
2.3.2.5.2. transcription_correction_style
- Type: string or null
- Description: Style of LLM correction.
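The transcription settings above can be sketched as a single section; the values shown are illustrative, not defaults from the reference:

```python
# Illustrative pipeline.transcription section with auto language detection.
transcription = {
    "source_language": "auto",              # or a fixed code such as "en"
    "detectable_languages": ["en", "es"],   # only applies when source_language is "auto"
    "segment_confirmation_silence_threshold": 0.7,  # recommended 0.5-0.9 s
    "sentence_splitter": {"enabled": True},
    "verification": {
        "auto_transcription_correction": False,     # work in progress
        "transcription_correction_style": None,
    },
}
```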
2.3.3. translations
- Type: array of objects
- Description: An array of translation targets. Each object defines translation settings for a specific target language.
Note: translations is an array of objects, one per required language. The fields below describe a single object.
2.3.3.1.1. target_language
- Type: string
- Description: The language into which the text should be translated (e.g., "en-us", "es", "fr").
2.3.3.1.2. translate_partial_transcriptions
- Type: bool
- Description: Allows translating partial transcriptions.
2.3.3.1.3. speech_generation
- Type: object
- Description: Configures text-to-speech (TTS) settings.
2.3.3.1.3.1. voice_cloning
- Type: bool
- Description: Experimental. Enables voice cloning to mimic the original speaker's voice. It usually takes 10-20 seconds of speech before the voice changes are applied.
2.3.3.1.3.2. voice_id
- Type: string or null
- Description: A particular voice ID can be specified. Voice cloning must be disabled. If set to "default_low" or "default_high", the best default voice for the selected language will be used automatically. You can create or manage voices in the Palabra web portal.
2.3.3.1.3.3. voice_timbre_detection
- Type: object
- Description: Allows automatically detecting and assigning voice IDs to different voice timbres.
2.3.3.1.3.3.1. enabled
- Type: bool
- Description: Enables voice timbre detection. Voice cloning must be disabled.
2.3.3.1.3.3.2. high_timbre_voices
- Type: array of strings
- Description: Specifies which voice ID to use for high timbre voices. (Currently, only one ID is supported, or use "default_high".)
2.3.3.1.3.3.3. low_timbre_voices
- Type: array of strings
- Description: Specifies which voice ID to use for low timbre voices. (Currently, only one ID is supported, or use "default_low".)
2.4. translation_queue_configs
- Type: object
- Description: Configures the behavior of unspoken TTS buffers.
2.4.1. global
- Type: object
- Description: Global/default settings for TTS queue behavior. You can add language-specific overrides by using the language code as a key (for example, "es" for Spanish).
2.4.1.1. desired_queue_level_ms
- Type: number
- Description: Desired average TTS buffer size in milliseconds. If auto_tempo is enabled, the pipeline tries to keep the buffer at this level. A recommended value is between 6000 and 8000 milliseconds (6-8 seconds).
2.4.1.2. max_queue_level_ms
- Type: number
- Description: The maximum TTS queue size in milliseconds. If the queue grows beyond this limit, it will be reduced to desired_queue_level_ms by dropping older queued audio. It should be at least two or three times larger than desired_queue_level_ms.
2.4.1.3. auto_tempo
- Type: bool
- Description: Automatically corrects the speech tempo based on the queue state.
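The queue settings above can be sketched as follows. The values are illustrative, and placing the per-language override as a sibling of "global" is an assumption based on the description of language-code keys:

```python
# Illustrative translation_queue_configs: global defaults plus an assumed
# per-language override keyed by language code.
translation_queue_configs = {
    "global": {
        "desired_queue_level_ms": 7000,   # recommended 6000-8000 ms
        "max_queue_level_ms": 21000,      # ~2-3x desired_queue_level_ms
        "auto_tempo": True,               # keep the buffer near the desired level
    },
    "es": {                               # assumed override shape for Spanish
        "desired_queue_level_ms": 6000,
    },
}
```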
2.5. allowed_message_types
- Type: array of strings
- Description: Specifies the types of messages you will receive back via WebSocket. The same messages are also sent in the WebRTC data channel.
  - "partial_transcription" - Emitted for partial transcription segments as they are recognized.
  - "partial_translated_transcription" - Emitted for partial translated transcriptions if translate_partial_transcriptions is enabled.
  - "validated_transcription" - Emitted when a transcription segment is fully confirmed.
  - "translated_transcription" - Emitted when a transcription segment has been translated.
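Putting all five sections together, a minimal, illustrative "set_task" message could be assembled like this. Every value is a placeholder; consult the sections above for each field:

```python
import json

# Illustrative complete "set_task" message over the WebSockets transport.
task = {
    "message_type": "set_task",
    "data": {
        "input_stream": {
            "content_type": "audio",
            "source": {"type": "ws", "format": "opus",
                       "sample_rate": 24000, "channels": 1},
        },
        "output_stream": {
            "content_type": "audio",
            "target": {"type": "ws", "format": "pcm_s16le"},
        },
        "pipeline": {
            "transcription": {
                "source_language": "en",
                "segment_confirmation_silence_threshold": 0.7,
            },
            "translations": [{
                "target_language": "es",
                "speech_generation": {"voice_cloning": False,
                                      "voice_id": "default_high"},
            }],
        },
        "translation_queue_configs": {
            "global": {"desired_queue_level_ms": 7000,
                       "max_queue_level_ms": 21000,
                       "auto_tempo": True},
        },
        "allowed_message_types": ["validated_transcription",
                                  "translated_transcription"],
    },
}
payload = json.dumps(task)  # send over the WebSocket connection
```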