Task object structure


1. message_type

  • Type: string
  • Description: Identifies the type of command being sent:
    • "set_task" - Create a new task or update an existing one.
    • "get_task" - Return the current task.
    • "end_task" - End the task (you will be disconnected automatically).
    • "pause_task" - Do not process new audio data, but keep the task alive. Use set_task to resume.
    • "tts_task"" - Generates text-to-speech from a text.
    • "input_audio_data"" - Websockets transport only. Contains audio data chunk.

2. data

  • Type: object
  • Description: Contains the main configuration details for this task.

Within data, there are five major sections:

  1. input_stream
  2. output_stream
  3. pipeline
  4. translation_queue_configs
  5. allowed_message_types
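
Sketched as a skeleton, a set_task message nests these five sections under data (section contents are elided here and covered in detail below):

```json
{
  "message_type": "set_task",
  "data": {
    "input_stream": { ... },
    "output_stream": { ... },
    "pipeline": { ... },
    "translation_queue_configs": { ... },
    "allowed_message_types": [ ... ]
  }
}
```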

2.1. input_stream

  • Type: object
  • Description: Configures the input stream settings.

2.1.1. content_type

  • Type: string
  • Description: Describes the content type in the input stream. Currently, only audio is supported.

2.1.2. source

  • Type: object
  • Description: Indicates how and from where the input audio stream is sourced.
2.1.2.1. type
  • Type: string
  • Description: Specifies the input audio transport. Both webrtc and ws are supported; it must match the target transport type.
2.1.2.2. format
  • Type: string
  • Description: Required for the WebSocket transport only. Input audio format; supported formats: opus, pcm_s16le, and wav.
2.1.2.3. sample_rate
  • Type: number
  • Description: Required for the WebSocket transport only. Input audio sample rate. The allowed range is 16 kHz to 48 kHz.
2.1.2.4. channels
  • Type: number
  • Description: Required for the WebSocket transport only. Number of input channels. One or two channels are supported.
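
Putting the fields above together, a minimal input_stream block for the WebSocket transport might look like this sketch (the format, sample rate, and channel values are illustrative):

```json
"input_stream": {
  "content_type": "audio",
  "source": {
    "type": "ws",
    "format": "pcm_s16le",
    "sample_rate": 24000,
    "channels": 1
  }
}
```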

2.2. output_stream

  • Type: object
  • Description: Defines the output settings.

2.2.1. content_type

  • Type: string
  • Description: Describes the content type in the output. Currently, only audio is supported.

2.2.2. target

  • Type: object
  • Description: Indicates the destination of the output.
2.2.2.1. type
  • Type: string
  • Description: Specifies the output audio transport. Both webrtc and ws are supported; it must match the source transport type.
2.2.2.2. format
  • Type: string
  • Description: Required for the WebSocket transport only. Output audio format; supported formats: pcm_s16le and zlib_pcm_s16le (zlib-compressed PCM).
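
A matching output_stream sketch for the WebSocket transport (the format value is illustrative):

```json
"output_stream": {
  "content_type": "audio",
  "target": {
    "type": "ws",
    "format": "pcm_s16le"
  }
}
```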

2.3. pipeline

  • Type: object
  • Description: Holds the configuration for all processing steps, including preprocessing, transcription, and translation.

2.3.2. transcription

  • Type: object
  • Description: Settings for Automatic Speech Recognition (ASR).
2.3.2.1. source_language
  • Type: string
  • Description: Language code of the input audio (e.g., "en", "es", "fr"). Can be set to "auto" for automatic language detection, which allows setting detectable_languages.
2.3.2.2. detectable_languages
  • Type: array of strings
  • Description: If source_language is set to "auto", only languages from this list will be detected.
2.3.2.3. segment_confirmation_silence_threshold
  • Type: float
  • Description: The time in seconds of silence needed to confirm the end of a segment. The recommended value is between 0.5s and 0.9s, depending on the average speech tempo and pauses. Increase this value if a speaker frequently pauses between words. If it is set too low, it can lead to unwanted sentence splitting.
2.3.2.4. sentence_splitter
  • Type: object
  • Description: Controls how longer sentences are split into smaller parts (sometimes with slight rephrasing, but without losing the meaning) to speed up processing.
2.3.2.4.1. enabled
  • Type: bool
  • Description: Whether to enable automatic sentence splitting.
2.3.2.5. verification
  • Type: object
  • Description: Controls transcription verification settings.
2.3.2.5.1. auto_transcription_correction
  • Type: bool
  • Description: Work in progress. Enables automatic transcription correction using an LLM.
2.3.2.5.2. transcription_correction_style
  • Type: string or null
  • Description: Style of LLM correction.
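
Combining the fields above, a transcription block might look like this sketch (all values are illustrative; detectable_languages is left empty because source_language is not "auto"):

```json
"transcription": {
  "source_language": "en",
  "detectable_languages": [],
  "segment_confirmation_silence_threshold": 0.7,
  "sentence_splitter": { "enabled": true },
  "verification": {
    "auto_transcription_correction": false,
    "transcription_correction_style": null
  }
}
```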

2.3.3. translations

  • Type: array of objects
  • Description: An array of translation targets. Each object defines translation settings for a specific target language.

Note: translations is an array of objects, each representing one requested target language. Below is an example of an object for a single language.

2.3.3.1.1. target_language
  • Type: string
  • Description: The language into which the text should be translated (e.g., "en-us", "es", "fr").
2.3.3.1.2. translate_partial_transcriptions
  • Type: bool
  • Description: Allows translating partial transcriptions.
2.3.3.1.3. speech_generation
  • Type: object
  • Description: Configures text-to-speech (TTS) settings.
2.3.3.1.3.1. voice_cloning
  • Type: bool
  • Description: Experimental. Enables voice cloning to mimic the original speaker's voice. It usually takes 10-20 seconds of speech before the voice changes are applied.
2.3.3.1.3.2. voice_id
  • Type: null or string
  • Description: A particular voice ID can be specified. Voice cloning must be disabled. If set to "default_low" or "default_high", the best default voice for the selected language will be used automatically. You can create or manage voices in the Palabra web portal.
2.3.3.1.3.3. voice_timbre_detection
  • Type: object
  • Description: Allows automatically detecting and assigning voice IDs to different voice timbres.
2.3.3.1.3.3.1. enabled
  • Type: bool
  • Description: Enables voice timbre detection. Voice cloning must be disabled.
2.3.3.1.3.3.2. high_timbre_voices
  • Type: array of strings
  • Description: Specifies which voice ID to use for high timbre voices. (Currently, only one ID is supported, or use "default_high".)
2.3.3.1.3.3.3. low_timbre_voices
  • Type: array of strings
  • Description: Specifies which voice ID to use for low timbre voices. (Currently, only one ID is supported, or use "default_low".)
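
For example, a translations array with a single Spanish target using a default voice might look like this sketch (values are illustrative):

```json
"translations": [
  {
    "target_language": "es",
    "translate_partial_transcriptions": false,
    "speech_generation": {
      "voice_cloning": false,
      "voice_id": "default_low",
      "voice_timbre_detection": {
        "enabled": false,
        "high_timbre_voices": ["default_high"],
        "low_timbre_voices": ["default_low"]
      }
    }
  }
]
```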

2.4. translation_queue_configs

  • Type: object
  • Description: Configures the behavior of queued TTS audio that has not yet been played back.

2.4.1. global

  • Type: object
  • Description: Global/default settings for TTS queue behavior. You can add language-specific overrides by using the language code as a key (for example, "es" for Spanish).
2.4.1.1. desired_queue_level_ms
  • Type: number
  • Description: Desired average TTS buffer size in milliseconds. If auto_tempo is enabled, the system will try to keep the buffer at this level. A recommended value is between 6000 and 8000 milliseconds (6-8 seconds).
2.4.1.2. max_queue_level_ms
  • Type: number
  • Description: The maximum TTS queue size in milliseconds. If the queue grows beyond this limit, it will be reduced to desired_queue_level_ms by dropping older queued audio. It should be at least two or three times larger than desired_queue_level_ms.
2.4.1.3. auto_tempo
  • Type: bool
  • Description: Automatically adjusts the speech tempo based on the queue state.
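
A sketch of a translation_queue_configs block with a Spanish override (placing the language key alongside global is an assumption based on the description above; values are illustrative):

```json
"translation_queue_configs": {
  "global": {
    "desired_queue_level_ms": 8000,
    "max_queue_level_ms": 24000,
    "auto_tempo": true
  },
  "es": {
    "desired_queue_level_ms": 6000,
    "max_queue_level_ms": 18000,
    "auto_tempo": true
  }
}
```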

2.5. allowed_message_types

  • Type: array of strings

  • Description: Specifies the types of messages you will receive back via WebSocket. The same messages are also sent in the WebRTC data channel.

    • "partial_transcription" - Emitted for partial transcription segments as they are recognized.
    • "partial_translated_transcription" - Emitted for partial translated transcriptions if translate_partial_transcriptions is enabled.
    • "validated_transcription" - Emitted when a transcription segment is fully confirmed.
    • "translated_transcription" - Emitted when a transcription segment has been translated.