Text to speech.
Sound effects. Pay for what you generate.
Convert text to natural-sounding speech or generate sound effects from a description — through one API with no account setup. TTS bills by character count so cost scales with content length. Sound effects bill at a flat rate per call regardless of output length.
Audio generation for agents that need to speak or sound.
TTS and sound effects serve different needs — these four patterns cover the most common reasons to reach for each.
Speak LLM responses aloud.
A conversational agent generates a text reply and immediately converts it to speech before returning it to the user. Per-character TTS billing keeps short responses cheap and long ones proportional — a 20-word confirmation costs far less than a full explanation. No pre-recorded clips to maintain, no voice actor to commission.
Voice scripts and add effects in one flow.
An automated production pipeline takes a written script, converts each segment to speech via TTS, and stitches in generated sound effects — intro stings, scene transitions, ambient sound — through the same API. Character-count billing keeps voiceover cost proportional to script length across high-volume runs.
Generate dialogue and ambient audio from game state.
A game agent generates NPC dialogue dynamically via TTS — no pre-recorded line for every possible conversation branch. Sound effects for environmental cues — footsteps on gravel, distant thunder, door mechanisms — come from text descriptions of the game state. Flat-rate SFX billing makes per-event audio cost predictable regardless of clip length.
Speak alerts and status updates.
An agent monitoring a process generates spoken alerts when conditions change — 'Build failed on main', 'Payment received from Acme'. Short text, low character count, low cost per notification. TTS handles the conversion; the agent decides when to speak based on event severity, not on a fixed script.
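As a rough illustration of "the agent decides when to speak", the sketch below filters events by severity before any text is sent to TTS. The severity names, the `Event` class, and the default threshold are hypothetical; the page only states that the decision is based on event severity.

```python
from dataclasses import dataclass

# Hypothetical severity scale, lowest to highest.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class Event:
    message: str
    severity: str


def alerts_to_speak(events, min_severity="warning"):
    """Return the messages of events at or above the given severity."""
    threshold = SEVERITY_ORDER[min_severity]
    return [e.message for e in events if SEVERITY_ORDER[e.severity] >= threshold]


events = [
    Event("Build failed on main", "critical"),
    Event("Payment received from Acme", "info"),
]

spoken = alerts_to_speak(events)
# Only the spoken messages contribute characters to the TTS bill.
total_chars = sum(len(message) for message in spoken)
print(spoken, total_chars)
```

Lowering `min_severity` to `"info"` would speak both alerts; the filter only changes which messages are submitted, not how each one is billed.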
Audio generation without the studio setup.
Two operations through one endpoint. For speech: submit text and pick a voice — billing runs per character so cost scales with what you actually say. For sound effects: describe what you need in plain text — 'thunder crack fading to rain', '8-bit coin pickup', 'crowd noise in a small stadium' — and get a generated audio clip back at a flat rate per call.
- TTS — per character
- Sound effects — flat rate
- No account setup
- One API, two operations
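The two operations might be expressed as two request bodies sent to the same endpoint. This is a minimal sketch only: the field names `operation`, `text`, `voice`, and `description`, and the voice identifier `example-voice`, are assumptions for illustration and are not taken from the API reference.

```python
import json


def tts_request(text, voice):
    """Build a hypothetical TTS body: the words to speak plus a voice identifier."""
    if not text:
        raise ValueError("text must not be empty")
    return {"operation": "tts", "text": text, "voice": voice}


def sfx_request(description):
    """Build a hypothetical sound-effect body: a plain-text description of the sound."""
    if not description:
        raise ValueError("description must not be empty")
    return {"operation": "sound_effect", "description": description}


speech = tts_request("Build failed on main", voice="example-voice")
effect = sfx_request("thunder crack fading to rain")

# Serialize the bodies as JSON, ready to send with any HTTP client.
print(json.dumps(speech))
print(json.dumps(effect))
```

The speech body carries a script and a voice, while the effect body carries only a description of the sound.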
The honest answers.
If something below doesn't cover your case, ping us — we answer directly, no SDR funnel.
What's the difference between TTS and sound effects?
TTS converts written text to spoken audio — you supply the words, choose a voice, and get back a speech clip. Sound effects generate audio from a description of a sound — you describe what you want to hear, not a script. Same endpoint, different operation type, different billing model.
How does character-count billing work for TTS?
You're billed for the number of characters in the text you submit. A 50-character sentence costs half what a 100-character sentence costs. Whitespace and punctuation count. There's no minimum per call, so short alerts are proportionally cheap.
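The arithmetic can be sketched in a few lines. The rate below is a placeholder and not a published price, and `len()` counts Unicode code points, which may differ from how the service counts non-ASCII text.

```python
from decimal import Decimal

# Placeholder rate for illustration only, not a real price.
RATE_PER_CHAR = Decimal("0.00002")


def tts_cost(text, rate_per_char=RATE_PER_CHAR):
    """Estimate TTS cost: every character counts, including spaces and punctuation."""
    return len(text) * rate_per_char


short_text = "a" * 50
long_text = "a" * 100

print(tts_cost(short_text), tts_cost(long_text))
# Whitespace and punctuation are billed like any other character.
print(tts_cost("a b."))
```

With no per-call minimum, the estimate for an empty or very short string is simply its length times the rate.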
What does the flat rate cover for sound effects?
One flat rate per call, regardless of how long the generated audio turns out to be. A two-second clip and a thirty-second clip cost the same. This makes per-event audio budgeting straightforward — each game event, each UI cue, each notification sound is one predictable charge.
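Budgeting under a flat rate reduces to counting calls. In the sketch below the rate is a placeholder, not a published price, and the event descriptions are examples only.

```python
from decimal import Decimal

# Placeholder flat rate for illustration only, not a real price.
SFX_FLAT_RATE = Decimal("0.05")


def sfx_budget(num_calls, flat_rate=SFX_FLAT_RATE):
    """Estimate sound-effect spend: one charge per call, independent of clip length."""
    if num_calls < 0:
        raise ValueError("num_calls must be non-negative")
    return num_calls * flat_rate


game_events = ["footsteps on gravel", "distant thunder", "door mechanism"]
budget = sfx_budget(len(game_events))
print(budget)
```

Clip duration never appears in the calculation, which is what makes the per-event charge predictable.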
What voices are available for TTS?
Multiple voices are available, differing in tone, gender, and pace. Pass a voice identifier in the request body. The available voices are listed in the API reference.
What audio format does the API return?
MP3. Both TTS and sound effect responses return an audio/mp3 payload you can stream, save, or pipe directly into an audio playback pipeline.
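Saving a returned payload could look like the sketch below. It assumes the response body is already in hand as raw bytes; the helper names are hypothetical, and the MP3 check is a loose heuristic (an ID3 tag or an MPEG frame sync at the start), not a full validator.

```python
from pathlib import Path


def looks_like_mp3(payload: bytes) -> bool:
    """Heuristic check: the payload starts with an ID3 tag or an MPEG frame sync."""
    if payload[:3] == b"ID3":
        return True
    # A frame sync is eleven set bits: 0xFF, then the top three bits of the next byte.
    return len(payload) >= 2 and payload[0] == 0xFF and (payload[1] & 0xE0) == 0xE0


def save_clip(payload: bytes, path) -> Path:
    """Write the audio bytes to disk after a basic sanity check."""
    if not looks_like_mp3(payload):
        raise ValueError("payload does not look like MP3 audio")
    out = Path(path)
    out.write_bytes(payload)
    return out
```

For example, `save_clip(response_bytes, "alert.mp3")` would write the clip to the working directory, where `response_bytes` stands for whatever body your HTTP client returned.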
Can I generate TTS and sound effects in the same request?
No. Each call handles one operation: submit a TTS request for the speech and a separate sound effects request for each effect. Combining the resulting clips happens on your side, either in the agent or in a downstream audio pipeline.
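One way to organize this on the client side is to turn a mixed script into an ordered list of single-operation calls, then hand the returned clips to your own mixing step. The segment format and the field names below are hypothetical.

```python
def plan_calls(segments):
    """Map (kind, content) segments to one request body each, preserving order."""
    calls = []
    for kind, content in segments:
        if kind == "speech":
            calls.append({"operation": "tts", "text": content})
        elif kind == "effect":
            calls.append({"operation": "sound_effect", "description": content})
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return calls


script = [
    ("effect", "short intro sting"),
    ("speech", "Welcome back to the show."),
    ("effect", "crowd noise in a small stadium"),
]

calls = plan_calls(script)
for call in calls:
    print(call)
```

Three segments produce three separate calls; the order of the list is what a downstream pipeline would use to stitch the clips together.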