openclaw - ✅(Solved) Fix TTS auto-reply generates MP3 only — WhatsApp cannot play as voice note (needs OGG/Opus) [1 pull requests]

StepCodex · 2026-04-20T18:58:51Z

When TTS is enabled with messages.tts.auto: "always", the gateway generates speech audio in MP3 format regardless of the target channel. This works fine on Telegram, but WhatsApp cannot play MP3 files as voice notes — it requires OGG/Opus format. As a result, WhatsApp users see "this audio is unavailable" when the auto-TTS reply arrives.

PR fix notes

PR #69528: fix(microsoft-tts): emit ogg/opus for voice-note targets so WhatsApp auto-replies play as native voice notes

Repository: openclaw/openclaw
Author: neeravmakwana
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/69528

Description (problem / solution / changelog)

Summary

WhatsApp TTS auto-replies from the Microsoft (Edge) speech provider were arriving as plain MP3 attachments instead of native voice notes. This PR makes the Microsoft provider honor the target: "voice-note" hint that the TTS dispatcher already passes for voice-note-capable channels (WhatsApp, Telegram, Feishu, Matrix, Discord) and produce ogg-48khz-16bit-mono-opus when no explicit override is configured.

Fixes #69435.

Root cause

extensions/speech-core/src/tts.ts already picks target: "voice-note" for WhatsApp and other native voice-note channels, and extensions/whatsapp/src/send.ts rewrites audio/ogg to audio/ogg; codecs=opus for PTT sends. Other providers (OpenAI, ElevenLabs) switch to Opus for that target. The Microsoft provider in extensions/microsoft/speech-provider.ts ignored req.target entirely and always used its MP3 default (audio-24khz-48kbitrate-mono-mp3). WhatsApp rejects MP3 as a voice note, so the audio was sent as a regular audio attachment even though the channel was ready to upgrade to PTT.
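The MIME rewrite described in this paragraph can be pictured with a small sketch. The name toPttMimeType is hypothetical and this is not the code in extensions/whatsapp/src/send.ts; it only illustrates the described behavior, assuming the rewrite keys on the bare media type.

```typescript
// Hypothetical illustration of the audio/ogg -> audio/ogg; codecs=opus rewrite.
// Not the real helper from extensions/whatsapp/src/send.ts.
function toPttMimeType(mimeType: string): string {
  // Compare only the media type, ignoring parameters and letter case.
  const base = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
  if (base !== "audio/ogg") {
    // Anything else (for example audio/mpeg) is passed through unchanged,
    // which is why an MP3 reply can never be upgraded to a voice note here.
    return mimeType;
  }
  return "audio/ogg; codecs=opus";
}
```

Under this reading, an MP3 produced by the provider stays audio/mpeg all the way to the send step, so the channel has nothing to upgrade.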

Fix

Add a narrow resolveMicrosoftOutputFormat helper that prefers:

1. a request-level providerOverrides.outputFormat, then
2. an explicit messages.tts.providers.microsoft.outputFormat from user config, then
3. ogg-48khz-16bit-mono-opus when req.target === "voice-note", else
4. the existing MP3 default for audio-file targets.

inferEdgeExtension maps the new format to .ogg, which isVoiceCompatibleAudio already treats as voice-compatible, so the dispatcher correctly emits audioAsVoice: true and WhatsApp sends a real voice note.
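The precedence order and the extension mapping described here can be sketched roughly as follows. This is not the PR's actual code: the function and field names (resolveOutputFormat, inferExtension, overrideFormat, configuredFormat) are hypothetical stand-ins, and the extension check is a simplified assumption about how a format string maps to a file suffix. Only the two format strings and the two target values come from the PR description.

```typescript
type SpeechTarget = "voice-note" | "audio-file";

// Format strings quoted in the PR description.
const DEFAULT_MP3_FORMAT = "audio-24khz-48kbitrate-mono-mp3";
const VOICE_NOTE_OPUS_FORMAT = "ogg-48khz-16bit-mono-opus";

// Hypothetical input shape; the real request/config types are not shown in the PR text.
interface FormatInputs {
  overrideFormat?: string; // stands in for request-level providerOverrides.outputFormat
  configuredFormat?: string; // stands in for messages.tts.providers.microsoft.outputFormat
  target: SpeechTarget;
}

// Precedence: request override, then operator config, then target-based default.
function resolveOutputFormat(inputs: FormatInputs): string {
  if (inputs.overrideFormat) return inputs.overrideFormat;
  if (inputs.configuredFormat) return inputs.configuredFormat;
  return inputs.target === "voice-note" ? VOICE_NOTE_OPUS_FORMAT : DEFAULT_MP3_FORMAT;
}

// Simplified assumption: Ogg/Opus formats get .ogg, everything else gets .mp3.
function inferExtension(format: string): string {
  const lower = format.toLowerCase();
  return lower.startsWith("ogg-") || lower.includes("opus") ? ".ogg" : ".mp3";
}
```

Note that an explicitly configured MP3 format still wins for a voice-note target in this sketch, which mirrors the "operator override wins" test case listed in the PR.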

The MP3 fallback on synthesis error is preserved.

Why it is safe

No config-surface changes: DEFAULT_EDGE_OUTPUT_FORMAT is unchanged and getResolvedSpeechProviderConfig("microsoft") still resolves to the MP3 default. The contract test (src/plugins/contracts/tts.contract.test.ts → "resolveEdgeOutputFormat") still passes unchanged.
Explicit operator overrides win: users who set messages.tts.providers.microsoft.outputFormat keep that exact value for all targets, including voice-note.
Non voice-note targets (audio-file for Slack, Mattermost, Webhooks, SMS, etc.) keep the existing MP3 default.
The new Opus format is already documented by Microsoft Speech output formats and is accepted by the bundled node-edge-tts transport (it simply passes outputFormat through to the service).
Behavior matches what the OpenAI and ElevenLabs speech providers already do for the same target: "voice-note" hint.
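The split between voice-note and audio-file targets that these points rely on could look something like the sketch below. The channel identifiers and the pickSpeechTarget name are assumptions for illustration; the real selection lives in extensions/speech-core/src/tts.ts and is not reproduced in the PR text. Only the list of voice-note-capable channels is taken from the PR summary.

```typescript
// Channels the PR summary names as voice-note-capable.
// The lowercase ids are assumed; the real channel keys may differ.
const VOICE_NOTE_CHANNELS: ReadonlySet<string> = new Set([
  "whatsapp",
  "telegram",
  "feishu",
  "matrix",
  "discord",
]);

// Hypothetical stand-in for the dispatcher's target selection.
function pickSpeechTarget(channel: string): "voice-note" | "audio-file" {
  return VOICE_NOTE_CHANNELS.has(channel.trim().toLowerCase())
    ? "voice-note"
    : "audio-file";
}
```

Because the target is derived from the channel alone, the format choice stays deterministic and independent of any model output, which is the property the security notes below lean on.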

Security / runtime controls unchanged

No changes to tool policy, sandbox, SSRF policy, gateway auth, plugin trust, or operator-trusted config paths.
No changes to prompt text, system prompt, or model-driven behavior. The new branch is a deterministic format selection based on the channel-derived target value computed by the TTS dispatcher before any model output, not on model-controlled text.
No new capabilities, no new network destinations, no new credentials.

Tests

Added focused unit cases in extensions/microsoft/speech-provider.test.ts:

voice-note target with no configured format → Edge is called with ogg-48khz-16bit-mono-opus, result reports .ogg extension and voiceCompatible: true.
voice-note target with an explicitly configured MP3 format → operator override wins, Edge is called with the configured MP3 format.

Exact tests run

pnpm test extensions/microsoft → 2 files / 15 tests pass
pnpm test extensions/microsoft/speech-provider.test.ts extensions/speech-core/src/tts.test.ts → passes
pnpm test src/plugins/contracts/tts.contract.test.ts → 41 tests pass
pnpm check:changed --staged on the touched files passes conflict-marker and lint for my files; the only failing lanes are the pre-existing extensions/qa-lab/src/providers/aimock/server.ts TypeScript and lint errors on main (missing @copilotkit/aimock types and derived no-redundant-type-constituents lints). They reproduce on clean main and are unrelated to this change.
pnpm exec oxlint extensions/microsoft/speech-provider.ts extensions/microsoft/speech-provider.test.ts → 0 warnings / 0 errors
pnpm exec oxfmt on the touched files applied

Testing scope

AI-assisted, lightly tested locally (unit + contract suites above; no live Edge network call in this sweep).

Docs

docs/tools/tts.md: updated the Microsoft output-format notes and the "Output formats (fixed)" section to describe the new channel-aware default.
CHANGELOG.md: added an entry under ## Unreleased → ### Fixes.

Made with Cursor

Changed files

CHANGELOG.md (modified, +1/-0)
docs/tools/tts.md (modified, +5/-5)
extensions/microsoft/speech-provider.test.ts (modified, +67/-0)
extensions/microsoft/speech-provider.ts (modified, +26/-3)

Environment

OpenClaw: 2026.4.19-beta.2
OS: macOS (Darwin 25.3.0, arm64)
Node: v22.22.0
TTS Provider: Microsoft Azure (es-ES-ElviraNeural)
Channels: Telegram (working) + WhatsApp via Baileys (broken)
Config: messages.tts.auto: "always", messages.tts.provider: "microsoft"

Current Behavior

User sends a message via WhatsApp.
Gateway generates TTS audio and replies with an MP3 file.
WhatsApp client shows "audio unavailable" or fails to play the voice note.

The outbound log confirms the media is always MP3:

"mediaUrl":"/Users/friday/.openclaw/media/outbound/<id>.mp3"
"mediaKind":"audio"

Even when requesting OGG output via the CLI (--output file.ogg), the file is still encoded as MP3:

$ openclaw infer tts convert --text "test" --voice es-ES-ElviraNeural --output /tmp/test.ogg
$ file /tmp/test.ogg
/tmp/test.ogg: MPEG ADTS, layer III, v2, 48 kbps, 24 kHz, Monaural
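The mismatch between the .ogg suffix and the actual encoding can also be checked programmatically by looking at the first bytes of the file. The sketch below is a heuristic, not part of OpenClaw: sniffAudioContainer is a hypothetical name, and it only distinguishes an Ogg page header ("OggS") from an MP3 stream (an ID3 tag or an MPEG audio frame sync with a non-zero layer field).

```typescript
type SniffedAudio = "ogg" | "mp3" | "unknown";

// Heuristic container check based on leading magic bytes.
function sniffAudioContainer(bytes: Uint8Array): SniffedAudio {
  // Ogg pages start with the ASCII capture pattern "OggS".
  if (
    bytes.length >= 4 &&
    bytes[0] === 0x4f && bytes[1] === 0x67 && bytes[2] === 0x67 && bytes[3] === 0x53
  ) {
    return "ogg";
  }
  // MP3 files often begin with an ID3v2 tag ("ID3").
  if (bytes.length >= 3 && bytes[0] === 0x49 && bytes[1] === 0x44 && bytes[2] === 0x33) {
    return "mp3";
  }
  // Otherwise look for an MPEG audio frame header: 11 sync bits set,
  // and a layer field (bits 2-1 of the second byte) that is not 00.
  const second = bytes[1] ?? 0;
  if (bytes.length >= 2 && bytes[0] === 0xff && (second & 0xe0) === 0xe0 && (second & 0x06) !== 0) {
    return "mp3";
  }
  return "unknown";
}
```

Feeding the first few bytes of /tmp/test.ogg into such a check (for example after reading the file with Node's fs module) would be expected to report "mp3" for the output shown above, since the file utility identifies it as MPEG layer III.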

Setting messages.tts.providers.microsoft.outputFormat to ogg-48khz-16bit-mono-opus (a valid Azure TTS format) is accepted by config validation but has no effect — output remains MP3.

Expected Behavior

The gateway should detect the target channel and convert TTS audio to the appropriate format:

Telegram: MP3 is fine (already works).
WhatsApp: OGG/Opus (audio/ogg; codecs=opus) is required for voice note playback.

Workaround

Manually generating TTS, converting with ffmpeg, and sending:

openclaw infer tts convert --text "Hello" --voice es-ES-ElviraNeural --output /tmp/audio.mp3
ffmpeg -y -i /tmp/audio.mp3 -c:a libopus -b:a 48k -ar 48000 /tmp/audio.ogg
openclaw message send --channel whatsapp --target "+1234567890" --media /tmp/audio.ogg

This works perfectly — WhatsApp plays the OGG file as a native voice note. But it requires manual intervention and cannot be used with tts.auto: "always".
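For anyone scripting this workaround from Node, the ffmpeg step can be expressed as an argument list. This is a sketch only: buildOpusTranscodeArgs is a hypothetical helper, and it reuses exactly the flags from the command above without adding any others.

```typescript
// Builds the ffmpeg argument list used in the manual workaround:
// overwrite output, read the MP3, encode with libopus at 48 kbit/s, resample to 48 kHz.
function buildOpusTranscodeArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y",
    "-i", inputPath,
    "-c:a", "libopus",
    "-b:a", "48k",
    "-ar", "48000",
    outputPath,
  ];
}

// Example: the same conversion as the shell command in the workaround.
const args = buildOpusTranscodeArgs("/tmp/audio.mp3", "/tmp/audio.ogg");
console.log(["ffmpeg", ...args].join(" "));
```

The array form can be handed to execFile("ffmpeg", args) from node:child_process, which avoids shell quoting problems with paths; this still requires ffmpeg built with libopus to be installed on the host.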

Suggested Fix

Options, in order of preference:

Channel-aware format conversion: When TTS auto-reply targets WhatsApp, automatically convert the audio to OGG/Opus using ffmpeg (or a native Opus encoder) before sending.
Configurable output format per channel: Add a messages.tts.outputFormat or per-channel messages.tts.formatByChannel setting so users can specify OGG for WhatsApp.
Native OGG output from provider: Pass the outputFormat through to the Azure TTS API, which natively supports ogg-48khz-16bit-mono-opus. This would avoid the ffmpeg conversion step entirely.
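The second option, a per-channel setting, might resolve a format along these lines. Neither messages.tts.outputFormat nor messages.tts.formatByChannel exists in OpenClaw as far as this issue shows; both are proposals from the list above, and the types and the formatForChannel function are hypothetical.

```typescript
// Hypothetical config shape for the proposed settings; not an existing OpenClaw schema.
interface ProposedTtsFormatConfig {
  outputFormat?: string; // proposed global default
  formatByChannel?: Record<string, string>; // proposed per-channel override
}

// Per-channel value first, then the global value, then the provider's own default.
function formatForChannel(
  config: ProposedTtsFormatConfig,
  channel: string,
  providerDefault: string,
): string {
  return config.formatByChannel?.[channel] ?? config.outputFormat ?? providerDefault;
}

// Example: only WhatsApp is switched to Opus, everything else keeps the MP3 default.
const proposed: ProposedTtsFormatConfig = {
  formatByChannel: { whatsapp: "ogg-48khz-16bit-mono-opus" },
};
console.log(formatForChannel(proposed, "whatsapp", "audio-24khz-48kbitrate-mono-mp3"));
console.log(formatForChannel(proposed, "telegram", "audio-24khz-48kbitrate-mono-mp3"));
```

A lookup like this only helps if the chosen format is actually passed through to the provider, which is the gap the third option addresses.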

Related Config

{
  messages: {
    tts: {
      auto: "always",
      provider: "microsoft",
      providers: {
        microsoft: {
          enabled: true,
          voice: "es-ES-ElviraNeural",
          lang: "es-ES",
          rate: "+15%",
          // outputFormat: "ogg-48khz-16bit-mono-opus" // accepted but ignored
        }
      }
    }
  }
}

Thank you for the great work on OpenClaw! 🦞

extent analysis

TL;DR

The most likely fix is to implement channel-aware format conversion, automatically converting TTS audio to OGG/Opus for WhatsApp targets.

Guidance

Verify that the Azure TTS API supports output format specification and pass the desired format (ogg-48khz-16bit-mono-opus) through to the API.
If the API does not support format specification, consider using a native Opus encoder or ffmpeg for conversion, as demonstrated in the provided workaround.
To implement channel-aware format conversion, modify the gateway to detect the target channel and apply the necessary conversion before sending the audio.
Consider adding a configurable output format per channel, allowing users to specify the desired format for each channel.

Example

The provided workaround demonstrates the conversion process using ffmpeg:

ffmpeg -y -i /tmp/audio.mp3 -c:a libopus -b:a 48k -ar 48000 /tmp/audio.ogg

This command converts an MP3 file to OGG/Opus format, which can be used as a reference for implementing the conversion in the gateway.

Notes

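The reporter's configuration shows that the outputFormat setting was accepted by validation but ignored at synthesis time. According to the linked PR, ogg-48khz-16bit-mono-opus is a documented Microsoft Speech output format and the bundled node-edge-tts transport passes outputFormat through to the service, so native Opus output does not depend on an ffmpeg step.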

Recommendation

Apply a workaround by implementing channel-aware format conversion using ffmpeg or a native Opus encoder, as this approach is already demonstrated to work.
