openclaw - ✅(Solved) Fix TTS auto-reply generates MP3 only — WhatsApp cannot play as voice note (needs OGG/Opus) [1 pull requests]

When TTS is enabled with messages.tts.auto: "always", the gateway generates speech audio in MP3 format regardless of the target channel. This works fine on Telegram, but WhatsApp cannot play MP3 files as voice notes — it requires OGG/Opus format. As a result, WhatsApp users see "this audio is unavailable" when the auto-TTS reply arrives.

Root Cause

The Microsoft (Edge) speech provider ignored the voice-note target hint that the TTS dispatcher passes for WhatsApp and always used its MP3 default (audio-24khz-48kbitrate-mono-mp3). WhatsApp does not accept MP3 as a voice note, so auto-TTS replies could not be played as voice notes there, while Telegram was unaffected.

Fix Action

Workaround

Manually generating TTS, converting with ffmpeg, and sending:

openclaw infer tts convert --text "Hello" --voice es-ES-ElviraNeural --output /tmp/audio.mp3
ffmpeg -y -i /tmp/audio.mp3 -c:a libopus -b:a 48k -ar 48000 /tmp/audio.ogg
openclaw message send --channel whatsapp --target "+1234567890" --media /tmp/audio.ogg

This works perfectly — WhatsApp plays the OGG file as a native voice note. But it requires manual intervention and cannot be used with tts.auto: "always".

PR fix notes

PR #69528: fix(microsoft-tts): emit ogg/opus for voice-note targets so WhatsApp auto-replies play as native voice notes

Description (problem / solution / changelog)

Summary

WhatsApp TTS auto-replies from the Microsoft (Edge) speech provider were arriving as plain MP3 attachments instead of native voice notes. This PR makes the Microsoft provider honor the target: "voice-note" hint that the TTS dispatcher already passes for voice-note-capable channels (WhatsApp, Telegram, Feishu, Matrix, Discord) and produce ogg-48khz-16bit-mono-opus when no explicit override is configured.
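The channel-to-target selection mentioned in this summary can be pictured roughly as follows. This is a minimal sketch, not the code in extensions/speech-core/src/tts.ts: the function name pickTtsTarget and the lowercase channel identifiers are assumptions made for illustration. Only the list of voice-note-capable channels and the two target values come from the PR description.

```typescript
// Hypothetical sketch of channel-aware target selection.
type TtsTarget = "voice-note" | "audio-file";

// Channels the PR description lists as voice-note capable.
// The lowercase identifiers are assumed, not taken from the codebase.
const VOICE_NOTE_CHANNELS: ReadonlySet<string> = new Set([
  "whatsapp",
  "telegram",
  "feishu",
  "matrix",
  "discord",
]);

// Returns "voice-note" for listed channels and "audio-file" for everything else.
function pickTtsTarget(channel: string): TtsTarget {
  return VOICE_NOTE_CHANNELS.has(channel.trim().toLowerCase())
    ? "voice-note"
    : "audio-file";
}
```

Under these assumptions, pickTtsTarget("whatsapp") yields "voice-note", and a channel such as Slack falls through to "audio-file".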

Fixes #69435.

Root cause

extensions/speech-core/src/tts.ts already picks target: "voice-note" for WhatsApp and other native voice-note channels, and extensions/whatsapp/src/send.ts rewrites audio/ogg to audio/ogg; codecs=opus for PTT sends. Other providers (OpenAI, ElevenLabs) switch to Opus for that target. The Microsoft provider in extensions/microsoft/speech-provider.ts ignored req.target entirely and always used its MP3 default (audio-24khz-48kbitrate-mono-mp3). WhatsApp rejects MP3 as a voice note, so the audio was sent as a regular audio attachment even though the channel was ready to upgrade to PTT.
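The MIME rewrite for PTT sends described above amounts to a small normalization step. The helper below is a sketch of that idea only: the name toPttMimeType is hypothetical and the real logic in extensions/whatsapp/src/send.ts may differ.

```typescript
// Hypothetical helper: upgrade a bare OGG MIME type to the Opus-qualified
// form used for push-to-talk (PTT) sends; leave every other type untouched.
function toPttMimeType(mimeType: string): string {
  return mimeType.trim().toLowerCase() === "audio/ogg"
    ? "audio/ogg; codecs=opus"
    : mimeType;
}
```

An MP3 payload (audio/mpeg) passes through unchanged, which is why the provider has to emit OGG in the first place for the upgrade to apply.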

Fix

Add a narrow resolveMicrosoftOutputFormat helper that prefers:

  1. a request-level providerOverrides.outputFormat, then
  2. an explicit messages.tts.providers.microsoft.outputFormat from user config, then
  3. ogg-48khz-16bit-mono-opus when req.target === "voice-note", else
  4. the existing MP3 default for audio-file targets.
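The precedence list above can be sketched as a single pure function. This is an illustration under stated assumptions rather than the PR's actual helper: the function name, the FormatInputs shape, and its field names are hypothetical. The two format strings are the ones quoted in the PR description.

```typescript
// Hypothetical sketch of the four-step output-format precedence.
type TtsTarget = "voice-note" | "audio-file";

const DEFAULT_MP3_FORMAT = "audio-24khz-48kbitrate-mono-mp3";
const VOICE_NOTE_OPUS_FORMAT = "ogg-48khz-16bit-mono-opus";

// Assumed input shape; the real request and config objects are richer.
interface FormatInputs {
  overrideFormat?: string;   // request-level providerOverrides.outputFormat
  configuredFormat?: string; // messages.tts.providers.microsoft.outputFormat
  target?: TtsTarget;        // channel-derived target from the dispatcher
}

function resolveOutputFormatSketch(inputs: FormatInputs): string {
  // 1. A request-level override always wins.
  const override = inputs.overrideFormat?.trim();
  if (override) return override;

  // 2. An explicit operator setting applies to every target.
  const configured = inputs.configuredFormat?.trim();
  if (configured) return configured;

  // 3. Voice-note targets default to Opus; 4. everything else keeps MP3.
  return inputs.target === "voice-note"
    ? VOICE_NOTE_OPUS_FORMAT
    : DEFAULT_MP3_FORMAT;
}
```

Keeping the resolver free of I/O makes the two unit cases described in the Tests section easy to express: one call with only a voice-note target, and one with a configured MP3 format that must win.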

inferEdgeExtension maps the new format to .ogg, which isVoiceCompatibleAudio already treats as voice-compatible, so the dispatcher correctly emits audioAsVoice: true and WhatsApp sends a real voice note.
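The chain from output format to file extension to voice flag can be illustrated as below. The functions are hypothetical stand-ins for inferEdgeExtension and isVoiceCompatibleAudio, covering only the two formats named in this PR; the real helpers presumably handle more formats and extensions.

```typescript
// Hypothetical stand-in for inferEdgeExtension, limited to two format families.
function inferExtensionSketch(outputFormat: string): string | undefined {
  const format = outputFormat.trim().toLowerCase();
  if (format.startsWith("ogg-")) return ".ogg";
  if (format.endsWith("-mp3")) return ".mp3";
  return undefined; // other formats are out of scope for this sketch
}

// Hypothetical stand-in for isVoiceCompatibleAudio: only .ogg is assumed here.
function isVoiceCompatibleSketch(extension: string | undefined): boolean {
  return extension === ".ogg";
}

// Combines both steps into the shape the dispatcher is described as emitting.
function describeResult(outputFormat: string) {
  const fileExtension = inferExtensionSketch(outputFormat);
  return { fileExtension, audioAsVoice: isVoiceCompatibleSketch(fileExtension) };
}
```

With the Opus format the sketch reports audioAsVoice as true, while the MP3 default reports false, which mirrors the before and after behavior on WhatsApp.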

The MP3 fallback on synthesis error is preserved.

Why it is safe

  • No config-surface changes: DEFAULT_EDGE_OUTPUT_FORMAT is unchanged and getResolvedSpeechProviderConfig("microsoft") still resolves to the MP3 default. The contract test (src/plugins/contracts/tts.contract.test.ts → "resolveEdgeOutputFormat") still passes unchanged.
  • Explicit operator overrides win: users who set messages.tts.providers.microsoft.outputFormat keep that exact value for all targets, including voice-note.
  • Non voice-note targets (audio-file for Slack, Mattermost, Webhooks, SMS, etc.) keep the existing MP3 default.
  • The new Opus format is already documented by Microsoft Speech output formats and is accepted by the bundled node-edge-tts transport (it simply passes outputFormat through to the service).
  • Behavior matches what the OpenAI and ElevenLabs speech providers already do for the same target: "voice-note" hint.

Security / runtime controls unchanged

  • No changes to tool policy, sandbox, SSRF policy, gateway auth, plugin trust, or operator-trusted config paths.
  • No changes to prompt text, system prompt, or model-driven behavior. The new branch is a deterministic format selection based on the channel-derived target value computed by the TTS dispatcher before any model output, not on model-controlled text.
  • No new capabilities, no new network destinations, no new credentials.

Tests

Added focused unit cases in extensions/microsoft/speech-provider.test.ts:

  • voice-note target with no configured format → Edge is called with ogg-48khz-16bit-mono-opus, result reports .ogg extension and voiceCompatible: true.
  • voice-note target with an explicitly configured MP3 format → operator override wins, Edge is called with the configured MP3 format.

Exact tests run

  • pnpm test extensions/microsoft → 2 files / 15 tests pass
  • pnpm test extensions/microsoft/speech-provider.test.ts extensions/speech-core/src/tts.test.ts → passes
  • pnpm test src/plugins/contracts/tts.contract.test.ts → 41 tests pass
  • pnpm check:changed --staged on the touched files passes conflict-marker and lint for my files; the only failing lanes are the pre-existing extensions/qa-lab/src/providers/aimock/server.ts TypeScript and lint errors on main (missing @copilotkit/aimock types and derived no-redundant-type-constituents lints). They reproduce on clean main and are unrelated to this change.
  • pnpm exec oxlint extensions/microsoft/speech-provider.ts extensions/microsoft/speech-provider.test.ts → 0 warnings / 0 errors
  • pnpm exec oxfmt applied to the touched files

Testing scope

  • AI-assisted, lightly tested locally (unit + contract suites above; no live Edge network call in this sweep).

Docs

  • docs/tools/tts.md: updated the Microsoft output-format notes and the "Output formats (fixed)" section to describe the new channel-aware default.
  • CHANGELOG.md: added an entry under ## Unreleased → ### Fixes.

Made with Cursor

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • docs/tools/tts.md (modified, +5/-5)
  • extensions/microsoft/speech-provider.test.ts (modified, +67/-0)
  • extensions/microsoft/speech-provider.ts (modified, +26/-3)

Original issue

Summary

When TTS is enabled with messages.tts.auto: "always", the gateway generates speech audio in MP3 format regardless of the target channel. This works fine on Telegram, but WhatsApp cannot play MP3 files as voice notes — it requires OGG/Opus format. As a result, WhatsApp users see "this audio is unavailable" when the auto-TTS reply arrives.

Environment

  • OpenClaw: 2026.4.19-beta.2
  • OS: macOS (Darwin 25.3.0, arm64)
  • Node: v22.22.0
  • TTS Provider: Microsoft Azure (es-ES-ElviraNeural)
  • Channels: Telegram (working) + WhatsApp via Baileys (broken)
  • Config: messages.tts.auto: "always", messages.tts.provider: "microsoft"

Current Behavior

  1. User sends a message via WhatsApp.
  2. Gateway generates TTS audio and replies with an MP3 file.
  3. WhatsApp client shows "audio unavailable" or fails to play the voice note.

The outbound log confirms the media is always MP3:

"mediaUrl":"/Users/friday/.openclaw/media/outbound/<id>.mp3"
"mediaKind":"audio"

Even when requesting OGG output via the CLI (--output file.ogg), the file is still encoded as MP3:

$ openclaw infer tts convert --text "test" --voice es-ES-ElviraNeural --output /tmp/test.ogg
$ file /tmp/test.ogg
/tmp/test.ogg: MPEG ADTS, layer III, v2, 48 kbps, 24 kHz, Monaural

Setting messages.tts.providers.microsoft.outputFormat to ogg-48khz-16bit-mono-opus (a valid Azure TTS format) is accepted by config validation but has no effect — output remains MP3.
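The mismatch between file extension and actual encoding can also be checked programmatically by looking at the first bytes of the file, similar to what the file command does above. The sketch below is a simplified illustration with a hypothetical function name; it recognizes only the Ogg capture pattern, an ID3v2 tag, and an MPEG Layer III frame header, and is not a full format detector.

```typescript
type SniffedContainer = "ogg" | "mp3" | "unknown";

// Classifies audio data by its leading bytes rather than its file extension.
function sniffAudioContainer(bytes: Uint8Array): SniffedContainer {
  // Ogg pages start with the ASCII capture pattern "OggS".
  if (
    bytes.length >= 4 &&
    bytes[0] === 0x4f && bytes[1] === 0x67 &&
    bytes[2] === 0x67 && bytes[3] === 0x53
  ) {
    return "ogg";
  }
  // MP3 files frequently begin with an ID3v2 tag ("ID3").
  if (
    bytes.length >= 3 &&
    bytes[0] === 0x49 && bytes[1] === 0x44 && bytes[2] === 0x33
  ) {
    return "mp3";
  }
  // Otherwise look for an MPEG audio frame header: 11 sync bits set,
  // followed by layer bits 01, which denote Layer III.
  if (
    bytes.length >= 2 &&
    bytes[0] === 0xff &&
    (bytes[1] & 0xe0) === 0xe0 &&
    (bytes[1] & 0x06) === 0x02
  ) {
    return "mp3";
  }
  return "unknown";
}
```

Fed the first few bytes of /tmp/test.ogg from the report, a check like this would be expected to return "mp3" despite the .ogg extension.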

Expected Behavior

The gateway should detect the target channel and convert TTS audio to the appropriate format:

  • Telegram: MP3 is fine (already works).
  • WhatsApp: OGG/Opus (audio/ogg; codecs=opus) is required for voice note playback.

Workaround

Manually generating TTS, converting with ffmpeg, and sending:

openclaw infer tts convert --text "Hello" --voice es-ES-ElviraNeural --output /tmp/audio.mp3
ffmpeg -y -i /tmp/audio.mp3 -c:a libopus -b:a 48k -ar 48000 /tmp/audio.ogg
openclaw message send --channel whatsapp --target "+1234567890" --media /tmp/audio.ogg

This works perfectly — WhatsApp plays the OGG file as a native voice note. But it requires manual intervention and cannot be used with tts.auto: "always".
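If the manual steps were scripted, the ffmpeg invocation could be assembled as an argument list like the one below. This is only a sketch of the conversion step from the workaround: the function name is hypothetical, and the flags are copied from the command above without additions.

```typescript
// Builds the ffmpeg arguments used in the manual MP3 to OGG/Opus conversion.
function buildOpusTranscodeArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y",               // overwrite the output file if it already exists
    "-i", inputPath,    // source MP3 produced by the TTS step
    "-c:a", "libopus",  // encode audio with the Opus codec
    "-b:a", "48k",      // audio bitrate
    "-ar", "48000",     // output sample rate in Hz
    outputPath,         // destination .ogg file
  ];
}
```

In a Node script the resulting array could be handed to child_process.execFile with "ffmpeg" as the command, which avoids shell quoting problems with file paths.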

Suggested Fix

Options, in order of preference:

  1. Channel-aware format conversion: When TTS auto-reply targets WhatsApp, automatically convert the audio to OGG/Opus using ffmpeg (or a native Opus encoder) before sending.
  2. Configurable output format per channel: Add a messages.tts.outputFormat or per-channel messages.tts.formatByChannel setting so users can specify OGG for WhatsApp.
  3. Native OGG output from provider: Pass the outputFormat through to the Azure TTS API, which natively supports ogg-48khz-16bit-mono-opus. This would avoid the ffmpeg conversion step entirely.

Related Config

{
  messages: {
    tts: {
      auto: "always",
      provider: "microsoft",
      providers: {
        microsoft: {
          enabled: true,
          voice: "es-ES-ElviraNeural",
          lang: "es-ES",
          rate: "+15%",
          // outputFormat: "ogg-48khz-16bit-mono-opus" // accepted but ignored
        }
      }
    }
  }
}

Thank you for the great work on OpenClaw! 🦞

extent analysis

TL;DR

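PR #69528 addresses this by having the Microsoft provider request OGG/Opus (ogg-48khz-16bit-mono-opus) directly for voice-note targets such as WhatsApp, so no separate conversion step is needed. Channel-aware conversion with ffmpeg remains an alternative where native Opus output is not available.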

Guidance

  • Verify that the Azure TTS API supports output format specification and pass the desired format (ogg-48khz-16bit-mono-opus) through to the API.
  • If the API does not support format specification, consider using a native Opus encoder or ffmpeg for conversion, as demonstrated in the provided workaround.
  • To implement channel-aware format conversion, modify the gateway to detect the target channel and apply the necessary conversion before sending the audio.
  • Consider adding a configurable output format per channel, allowing users to specify the desired format for each channel.

Example

The provided workaround demonstrates the conversion process using ffmpeg:

ffmpeg -y -i /tmp/audio.mp3 -c:a libopus -b:a 48k -ar 48000 /tmp/audio.ogg

This command converts an MP3 file to OGG/Opus format, which can be used as a reference for implementing the conversion in the gateway.

Notes

At the time of the report, the outputFormat setting had no effect on the generated audio. According to the PR description, ogg-48khz-16bit-mono-opus is a documented Microsoft Speech output format, and the bundled node-edge-tts transport passes outputFormat through to the service.

Recommendation

Prefer a build that includes the PR's channel-aware default. Until then, the ffmpeg conversion shown in the example is a demonstrated workaround, although it requires manual sending and cannot be combined with tts.auto: "always".
