OpenClaw · TTS + sound effects · Per-character billing

Text to speech
inside every OpenClaw agent, pay per character.

Register one tool in OpenClaw and your agents can convert text to natural-sounding speech or generate sound effects from a description — no account setup, no subscription. TTS bills per character so audio cost stays proportional to what the agent says. Sound effects bill at a flat rate per call.

  • Single OpenClaw tool — one POST
  • TTS — billed per character
  • Sound effects — flat rate per call
  • Per-run budget caps honored
TTS billing: per character. OpenClaw agents pay only for the characters they convert. Short spoken confirmations cost a fraction of full explanations, with no fixed audio credits to exhaust.
Sound effects billing: flat rate per call. One predictable charge per sound effect call, regardless of output length. OpenClaw budget caps can account for each audio event at a fixed cost.
Time to wire it in: ~5 min. Register one tool in OpenClaw, declare the per-character TTS rate and flat SFX rate, ship. No audio vendor account to create.
What OpenClaw builders ship

OpenClaw agents that speak and sound.

Four patterns where audio generation fits naturally into an OpenClaw agent — each relying on a different combination of TTS and sound effects.

OpenClaw voice response agent

Speak the agent's output to the user.

An OpenClaw conversational agent generates a text reply and voices it before returning — the TTS tool call sits at the end of the chain, converting final output to speech. Per-character billing keeps audio cost proportional to response length: a short acknowledgement costs almost nothing; a longer explanation costs more, and the user hears why.

... agent generates reply → POST /audio/tts { text: reply, voice: 'nova' } → return audio to user
OpenClaw content production agent

Script, voice, and add effects in one run.

An OpenClaw production agent writes a script, converts each segment to speech via TTS, and generates sound effects for transitions — all in a single run. Budget caps bound total TTS character spend across the script; flat-rate SFX means each transition has a known, fixed cost before the agent runs.

generate script → POST /audio/tts { text: segment } × N → POST /audio/sfx { description: 'transition sting' } → assemble
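To make the cost arithmetic for a production run concrete, here is a minimal sketch that totals TTS characters across script segments and adds a flat charge per sound effect. The two rates are placeholder values chosen for illustration (the page does not state prices), and counting characters with `len()` is an assumption about how characters are billed.

```python
from decimal import Decimal

# Placeholder rates for illustration only; actual prices are not stated here.
TTS_RATE_PER_CHAR = Decimal("0.00002")
SFX_FLAT_RATE = Decimal("0.05")


def plan_run_cost(segments: list[str], sfx_count: int) -> Decimal:
    """Estimate the audio cost of one run before any call is made.

    TTS cost scales with the total character count of all segments;
    each sound effect adds one flat charge regardless of clip length.
    """
    tts_chars = sum(len(segment) for segment in segments)
    return tts_chars * TTS_RATE_PER_CHAR + sfx_count * SFX_FLAT_RATE


if __name__ == "__main__":
    script = ["Welcome back to the show.", "Here is today's headline."]
    print(plan_run_cost(script, sfx_count=1))
```

Because the SFX term depends only on the number of calls, the only variable part of the estimate is the script length.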
OpenClaw alert agent

Generate spoken alerts when conditions change.

An OpenClaw monitoring agent watches a condition and fires a spoken alert when it triggers — 'Latency spiked above threshold', 'New order received'. Short text, low character count, predictable cost per alert. The agent decides what to say based on the event; TTS handles the conversion.

condition triggers → POST /audio/tts { text: 'Latency spike on eu-west-1.', voice: 'alloy' } → send audio
OpenClaw interactive narrative agent

Voice characters and generate scene audio together.

An OpenClaw interactive fiction agent generates NPC dialogue and voices it through TTS, then adds scene-appropriate sound effects — ambient environment, interaction cues — through the same tool. Both audio types use the same tool registration; budget caps apply to the combined audio spend for the scene.

generate dialogue → POST /audio/tts { text: dialogue } → POST /audio/sfx { description: 'tavern ambient noise, low' } → play both
OpenClaw-ready in about five minutes

One tool. Speech and sound inside every agent.

Two operations through one tool. For speech: the agent submits text and picks a voice — billing runs per character so audio cost stays proportional to what the agent says. For sound effects: describe the sound in plain text — 'door closing firmly', 'rain on glass', 'error chime' — and get a generated audio clip back at a flat rate. OpenClaw budget caps apply across both.

  • Single OpenClaw tool
  • TTS — per character
  • Sound effects — flat rate
  • Budget caps honored
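As a rough illustration of the two operations, the sketch below builds the request for each without sending it. The `/audio/tts` and `/audio/sfx` paths and the `text`, `voice`, and `description` fields come from the flows shown earlier; the host name and the JSON content type are assumptions.

```python
import json
import urllib.request

# Hypothetical host used for illustration; substitute the real tool URL.
BASE_URL = "https://audio-tool.example"


def _post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the audio tool."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


def build_tts_request(text: str, voice: str = "nova") -> urllib.request.Request:
    """Speech: the agent submits text and picks a voice."""
    return _post("/audio/tts", {"text": text, "voice": voice})


def build_sfx_request(description: str) -> urllib.request.Request:
    """Sound effect: the agent describes the sound in plain text."""
    return _post("/audio/sfx", {"description": description})


if __name__ == "__main__":
    req = build_sfx_request("door closing firmly")
    print(req.get_method(), req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.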
FAQ

OpenClaw-specific questions.

If something below doesn't cover your case, ping us — we work directly with OpenClaw builders, no SDR funnel.

How does this register as an OpenClaw tool?

It's a POST endpoint that accepts either a TTS or sound effects request body. Register it in OpenClaw as an HTTP tool with two declared costs: a per-character rate for TTS and a flat rate for sound effects. OpenClaw uses both to enforce budget caps and to show the user what each audio call will cost before the agent runs.

How do OpenClaw budget caps interact with per-character TTS billing?

The character count is known from the text the agent plans to submit, before the tool call fires. OpenClaw multiplies it by the declared per-character rate to get the expected cost and checks that against the remaining run budget. If the call would exceed the cap, OpenClaw stops the agent before the call is made.
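The check described here can be sketched as a small budget object that prices a TTS call from its character count and refuses it when the remaining budget is too small. This is an illustration of the idea only, not OpenClaw's actual enforcement code; `RunBudget`, `BudgetExceeded`, and the rate used are hypothetical.

```python
from decimal import Decimal


class BudgetExceeded(Exception):
    """Raised when a planned call would cost more than the remaining budget."""


class RunBudget:
    """Tracks the remaining spend for one agent run (hypothetical helper)."""

    def __init__(self, cap: Decimal) -> None:
        self.remaining = cap

    def reserve_tts(self, text: str, rate_per_char: Decimal) -> Decimal:
        """Price a TTS call up front and deduct it, or refuse the call."""
        cost = len(text) * rate_per_char
        if cost > self.remaining:
            # The call is rejected before it is made; nothing is deducted.
            raise BudgetExceeded(f"needs {cost}, only {self.remaining} left")
        self.remaining -= cost
        return cost


if __name__ == "__main__":
    budget = RunBudget(Decimal("0.001"))
    print(budget.reserve_tts("Latency spike on eu-west-1.", Decimal("0.00002")))
    print(budget.remaining)
```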

Can an OpenClaw agent use both TTS and sound effects in the same run?

Yes. Make separate tool calls for each — one for TTS, one for the sound effect. Both go through the same tool registration. Budget caps apply to the combined cost of all audio calls in the run.

What happens if the TTS text is very long?

The call returns a single audio clip for the full text. For very long scripts, the agent may want to split the text into segments and make multiple TTS calls — this also allows voice or pacing changes between sections.
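One way an agent might split a long script is at sentence boundaries, packing sentences into segments under a size limit. The sketch below assumes a 500-character limit purely for illustration; no maximum length is stated here.

```python
import re


def split_into_segments(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own
    oversized segment rather than being cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    for segment in split_into_segments("One. Two. Three.", max_chars=9):
        print(segment)
```

Each returned segment can then go out as its own TTS call, which is also the point where a different voice could be chosen per section.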

Does the tool return audio inline or as a URL?

Audio comes back as an MP3 payload in the response body. The agent can pass it downstream, save it to a file, or return it directly to the user.
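For the save-to-file case, a minimal sketch: it takes the response body as bytes, applies a light sanity check, and writes it to disk. The helper names are hypothetical, and the check is only a heuristic based on how MP3 data usually begins (an ID3 tag or an MPEG frame sync), not a full validation.

```python
from pathlib import Path


def looks_like_mp3(payload: bytes) -> bool:
    """Heuristic: MP3 data usually starts with an ID3 tag or a frame sync."""
    if payload[:3] == b"ID3":
        return True
    # An MPEG audio frame begins with 11 set bits: 0xFF then top 3 bits set.
    return len(payload) >= 2 and payload[0] == 0xFF and (payload[1] & 0xE0) == 0xE0


def save_mp3(payload: bytes, path: str) -> Path:
    """Write the response body to a file and return the path written."""
    target = Path(path)
    target.write_bytes(payload)
    return target


if __name__ == "__main__":
    fake_body = b"ID3" + b"\x00" * 10  # stand-in for a real response body
    if looks_like_mp3(fake_body):
        print(save_mp3(fake_body, "reply.mp3"))
```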

Do OpenClaw agents need an audio vendor account?

No. The tool handles all vendor relationships. Each call is charged to a wallet you connect, so there is no separate audio account to create or manage.