Voice & TTS

Pawz supports text-to-speech so agents can speak their responses aloud.

Setup

Go to Settings → Voice to configure TTS.

Providers

Google Cloud TTS

No API key needed — uses the free web endpoint.

Voices:
  • Chirp 3 HD: Puck, Charon, Kore, Fenrir, Leda, Orus, Zephyr, Aoede, Callirhoe, Autonoe
  • Neural2: en-US-Neural2-A through en-US-Neural2-F
  • Journey: en-US-Journey-D, en-US-Journey-F, en-US-Journey-O

OpenAI TTS

Requires an OpenAI API key. Voices: alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer

ElevenLabs

Requires an ELEVENLABS_API_KEY.

Voices: Sarah, Charlie, George, Callum, Liam, Charlotte, Alice, Matilda, Will, Jessica, Eric, Chris, Brian, Daniel, Lily, Bill

Models:

| Model | Best for |
| --- | --- |
| eleven_multilingual_v2 | Multi-language, highest quality |
| eleven_turbo_v2_5 | Low latency, English-focused |
| eleven_monolingual_v1 | English only, legacy |

Extra settings:
  • Stability (0–1, default 0.5) — higher = more consistent
  • Similarity boost (0–1, default 0.75) — higher = closer to reference voice
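
The stability and similarity-boost settings travel with each synthesis request. A minimal sketch of the payload shape, following ElevenLabs' public v1 text-to-speech API (the helper name is illustrative, not Pawz's internal code):

```typescript
// Shape of an ElevenLabs synthesis payload using the settings above.
// Field names follow the public ElevenLabs v1 API.
interface VoiceSettings {
  stability: number;        // 0–1, default 0.5 — higher = more consistent
  similarity_boost: number; // 0–1, default 0.75 — higher = closer to reference voice
}

function buildTtsPayload(text: string, modelId: string, settings: VoiceSettings) {
  return {
    text,
    model_id: modelId,
    voice_settings: settings,
  };
}

const payload = buildTtsPayload("Hello from Pawz!", "eleven_multilingual_v2", {
  stability: 0.5,
  similarity_boost: 0.75,
});
```

The payload is POSTed to a per-voice endpoint, so the voice itself is selected by URL rather than by a field in the body.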

Settings

| Setting | Default | Description |
| --- | --- | --- |
| Provider | — | Google / OpenAI / ElevenLabs |
| Voice | — | Voice name from the selected provider |
| Speed | 1.0 | Playback speed multiplier |
| Language | — | Language code (13 supported) |
| Auto-speak | Off | Automatically speak every response |

Speech-to-text (STT)

Pawz uses OpenAI Whisper for speech-to-text transcription:

| Backend | Setup | Latency | Cost |
| --- | --- | --- | --- |
| Whisper API | OpenAI API key (from Models settings) | ~1–2 s | $0.006/min |
| Whisper Local | Install whisper binary | ~3–5 s | Free |

STT is used in Talk Mode and any voice input feature. Audio is captured as WebM/Opus (or OGG fallback), base64-encoded, and sent to the Whisper endpoint for transcription.
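
For the API backend, the transcription call can be sketched as a request builder. This follows the shape of OpenAI's public `/v1/audio/transcriptions` endpoint (multipart upload with the `whisper-1` model); Pawz's internal wiring, which base64-encodes the audio, may differ in detail:

```typescript
// Sketch of a Whisper API transcription request (OpenAI public endpoint).
// The helper name is illustrative, not Pawz's actual code.
function buildTranscriptionRequest(audio: Blob, apiKey: string) {
  const form = new FormData();
  form.append("file", audio, "chunk.webm"); // WebM/Opus chunk from the recorder
  form.append("model", "whisper-1");
  return {
    url: "https://api.openai.com/v1/audio/transcriptions",
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    },
  };
}

// Usage: const { url, init } = buildTranscriptionRequest(chunk, apiKey);
//        const { text } = await (await fetch(url, init)).json();
```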

Audio capture settings

The microphone input uses these Web Audio constraints:

| Setting | Value |
| --- | --- |
| Echo cancellation | Enabled |
| Noise suppression | Enabled |
| Sample rate | 16 kHz |
| Format | audio/webm;codecs=opus (preferred) |
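
A sketch of the capture setup these settings describe, assuming standard `getUserMedia` / `MediaRecorder` usage (this mirrors the table, not Pawz's exact source):

```typescript
// Constraints matching the table above.
const audioConstraints = {
  echoCancellation: true,
  noiseSuppression: true,
  sampleRate: 16_000, // 16 kHz
};

// Prefer WebM/Opus; fall back to OGG when the browser doesn't support it.
// The support check is injected so the choice is testable outside a browser.
function pickMimeType(isSupported: (type: string) => boolean): string {
  const preferred = "audio/webm;codecs=opus";
  return isSupported(preferred) ? preferred : "audio/ogg;codecs=opus";
}

// In the browser:
// const stream = await navigator.mediaDevices.getUserMedia({ audio: audioConstraints });
// const recorder = new MediaRecorder(stream, {
//   mimeType: pickMimeType(MediaRecorder.isTypeSupported),
// });
```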

Voice activity detection (VAD)

Talk Mode includes built-in voice activity detection to avoid sending silence to the transcription API:

| Parameter | Value | Description |
| --- | --- | --- |
| Recording window | 8 seconds | Records in 8-second chunks |
| Minimum audio size | 8 KB | Chunks under 8 KB are treated as silence and skipped |
| Inter-cycle delay | 500 ms | Brief pause between recording cycles after errors |
| Empty transcript | Skipped | If Whisper returns blank text, the cycle restarts |

:::info
VAD works by checking the size of each recording chunk. Very short or silent recordings produce small files that are automatically discarded before being sent to Whisper.
:::
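
The size-based check reduces to a single comparison. A minimal sketch (the function name is illustrative):

```typescript
// Chunks under 8 KB are treated as silence and skipped (see table above).
const MIN_CHUNK_BYTES = 8 * 1024;

function isSilence(chunk: { size: number }): boolean {
  return chunk.size < MIN_CHUNK_BYTES;
}
```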

Talk mode

Click the microphone icon in the chat header to enter talk mode. Your speech is transcribed and sent to the agent, and the response is spoken back. Requires either:
  • Whisper API skill (OpenAI API key)
  • Whisper Local skill (install whisper binary)

How talk mode works

  1. Listen — microphone captures audio in 8-second windows
  2. Transcribe — audio is sent to Whisper STT → text
  3. Process — transcribed text is sent to your agent as a chat message
  4. Speak — agent’s response is synthesized via your configured TTS provider
  5. Repeat — next recording cycle starts after playback finishes
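
The five steps above can be sketched as one cycle with the I/O injected as callbacks. The interface and function names are illustrative, not Pawz's actual API:

```typescript
// One talk-mode cycle; the caller loops while talk mode is active (step 5).
interface TalkIO {
  record(windowMs: number): Promise<Blob>;  // 1. Listen
  transcribe(audio: Blob): Promise<string>; // 2. Transcribe (Whisper STT)
  chat(text: string): Promise<string>;      // 3. Process (send to agent)
  speak(text: string): Promise<void>;       // 4. Speak (TTS playback)
}

async function talkCycle(io: TalkIO): Promise<boolean> {
  const audio = await io.record(8_000);      // 8-second window
  if (audio.size < 8 * 1024) return false;   // VAD: skip silent chunks
  const text = (await io.transcribe(audio)).trim();
  if (!text) return false;                   // empty transcript: restart cycle
  const reply = await io.chat(text);
  await io.speak(reply);
  return true;
}
```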

Voice command mode vs dictation mode

| Mode | Behavior | Use case |
| --- | --- | --- |
| Voice command (default) | Each utterance is sent as a standalone chat message | Giving instructions, asking questions |
| Dictation | Utterances are accumulated into a text buffer | Composing long-form content, emails |

In voice command mode, every 8-second recording window is independently transcribed and sent to the agent. The agent responds and the reply is spoken aloud before the next cycle begins. To use dictation mode, start your utterance with “dictate” or “type” — the agent will accumulate your speech into a document rather than responding conversationally.
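
The routing between the two modes comes down to checking the first word of each utterance. A hypothetical helper (the trigger words come from the docs; the function name is illustrative):

```typescript
// Route an utterance: dictation if it opens with a trigger word,
// otherwise treat it as a voice command.
function isDictation(utterance: string): boolean {
  const firstWord = utterance.trim().toLowerCase().split(/\s+/)[0];
  return firstWord === "dictate" || firstWord === "type";
}
```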