Text to Speech
WebSockets
This API provides real-time text-to-speech conversion over WebSockets. You send text messages to the server and receive audio data back as it is generated.
Endpoint
The WebSocket endpoint for the Text-to-Speech API is:
Ideal Use Cases for the WebSockets Endpoint
- The input text is streamed or arrives in chunks.
- Word-to-audio alignment information is required.
Use our Streaming Endpoint instead when:
- The entire input text is available up front.
- You are prototyping or testing the API, or need to convert text to speech quickly.
Protocol
The WebSocket API uses a bidirectional protocol that encodes all messages as JSON objects.
Streaming Input Text
The client can send messages with text input to the server. The messages should contain the following fields:
text
: The input text to be converted to speech.
voice
: The voice to be used for the conversion (e.g., “sarah”).
similarity
: Similarity factor for voice settings.
expressiveness
: Expressiveness factor for voice settings.
pitch
: Pitch factor for voice settings.
speed
: Speed factor for voice settings.
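As a rough illustration of the input message, the sketch below serializes the six fields listed above into one JSON object. The helper name `build_tts_message` is hypothetical, and the numeric defaults are placeholders only: this section does not state the type or valid range of the voice-setting factors.

```python
import json


def build_tts_message(text, voice="sarah", similarity=0.5,
                      expressiveness=0.5, pitch=1.0, speed=1.0):
    """Serialize one input message as a JSON string.

    The field names come from the list above. The default numeric values
    are illustrative assumptions, not documented defaults.
    """
    return json.dumps({
        "text": text,
        "voice": voice,
        "similarity": similarity,
        "expressiveness": expressiveness,
        "pitch": pitch,
        "speed": speed,
    })


# One chunk of streamed input text, ready to send over the socket.
payload = build_tts_message("Hello, world.")
```

Each chunk of streamed text would be sent as its own message of this shape.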
Streaming Output Audio
The server responds with a message containing the audio data:
audio
: Base64-encoded audio data.
isFinal
: Indicates whether audio generation is complete.
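A response message could be handled along these lines. The helper name `parse_audio_message` is hypothetical. Treating a missing or empty `audio` field as zero bytes of audio is an assumption, since this section does not say whether every message carries audio.

```python
import base64
import json


def parse_audio_message(raw):
    """Return (audio_bytes, is_final) for one JSON message from the server."""
    message = json.loads(raw)
    # Assumption: a missing or null "audio" field means "no audio in this message".
    audio_bytes = base64.b64decode(message.get("audio") or "")
    is_final = bool(message.get("isFinal", False))
    return audio_bytes, is_final


# Simulated server message carrying two bytes of audio.
sample = json.dumps({
    "audio": base64.b64encode(b"\x00\x01").decode("ascii"),
    "isFinal": True,
})
chunk, done = parse_audio_message(sample)
```

The decoded chunks can be appended to a buffer or written to a player as they arrive, until `isFinal` is true.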
Example Script
The following example demonstrates how to use the WebSocket API to convert text to speech in real-time using Python:
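A minimal sketch of such a client follows, assuming the third-party `websockets` package. The URL is a placeholder to be replaced with the real endpoint, the function names and voice-setting values are illustrative, and authentication is omitted because this section does not describe it. For simplicity the sketch sends every chunk before reading any replies and stops at the first message with `isFinal` set. Whether the server sets `isFinal` once per connection or once per input message is not stated here, so the stop condition may need adjusting.

```python
import asyncio
import base64
import json

# Placeholder only: substitute the real WebSocket endpoint.
TTS_WS_URL = "wss://example.invalid/tts"

# Illustrative values; the valid ranges are not given in this section.
VOICE_SETTINGS = {"similarity": 0.5, "expressiveness": 0.5,
                  "pitch": 1.0, "speed": 1.0}


async def stream_tts(ws, chunks, voice="sarah"):
    """Send each text chunk, then collect audio until isFinal is seen."""
    for chunk in chunks:
        await ws.send(json.dumps({"text": chunk, "voice": voice,
                                  **VOICE_SETTINGS}))
    audio = bytearray()
    async for raw in ws:
        message = json.loads(raw)
        audio.extend(base64.b64decode(message.get("audio") or ""))
        if message.get("isFinal"):
            break
    return bytes(audio)


async def synthesize(chunks, url=TTS_WS_URL):
    """Open a connection and run stream_tts over it."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(url) as ws:
        return await stream_tts(ws, chunks)
```

With a real endpoint in place, `asyncio.run(synthesize(["Hello, ", "world."]))` would return the concatenated audio bytes. A lower-latency client would read replies concurrently with sending rather than after it.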