openclaw - ✅(Solved) Fix [Bug] Telegram voice messages: media understanding audio transcription pipeline never triggered [3 pull requests, 1 participants]

stanleeyY · 2026-03-26T08:47:05Z



openclaw/openclaw#55052•Fetched 2026-04-08 01:33:11


Telegram voice messages are received and downloaded successfully, but the applyMediaUnderstanding audio transcription pipeline is never invoked. The agent receives <media:audio> with the raw .ogg file attached but no automatic transcription occurs.

This affects all channels (confirmed on both Telegram forum topics and WhatsApp groups), not just Telegram.


Fix Action

Workaround

Agent manually calls OpenAI transcription API via curl for each voice message.

PR fix notes

PR #55323: fix: add audio capability to openai-codex media understanding provide

Repository: openclaw/openclaw
Author: pxnt
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/55323

Description (problem / solution / changelog)

The openai-codex provider (OAuth) was missing the audio capability and transcribeAudio handler, so Pro plan users could not use audio transcription. Add both to match the regular openai provider, reusing the same Whisper API function.

Closes #55237 Related #55052

Summary

Problem: openai-codex media understanding provider only registered ["image"] capability, missing "audio" and transcribeAudio handler
Why it matters: OpenAI Pro plan (OAuth) users cannot transcribe audio and are forced to use the API key provider or skip transcription entirely
What changed: Added "audio" to capabilities and transcribeAudio: transcribeOpenAiAudio to the codex provider; updated test assertion
What did NOT change: No new functions — reuses the existing transcribeOpenAiAudio that calls /v1/audio/transcriptions
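
As a rough illustration of the change summarized above, the sketch below shows a provider object gaining the "audio" capability and a transcribeAudio handler. Only the identifiers openaiCodexMediaUnderstandingProvider, transcribeOpenAiAudio, capabilities, and transcribeAudio come from the PR text. The type shapes, the handler signature, the stub body, and the canTranscribe helper are assumptions made for illustration and may not match the real plugin SDK.

```typescript
// Hypothetical shapes: the real OpenClaw plugin SDK types may differ.
type Capability = "image" | "audio";

interface AudioRequest {
  fileName: string;
  bytes: Uint8Array;
}

interface MediaUnderstandingProvider {
  id: string;
  capabilities: Capability[];
  transcribeAudio?: (req: AudioRequest) => Promise<{ text: string }>;
}

// Stub standing in for the existing helper that, per the PR, calls /v1/audio/transcriptions.
async function transcribeOpenAiAudio(req: AudioRequest): Promise<{ text: string }> {
  return { text: `transcript of ${req.fileName} (${req.bytes.length} bytes)` };
}

// Before the fix: image only, no audio handler.
const before: MediaUnderstandingProvider = {
  id: "openai-codex",
  capabilities: ["image"],
};

// After the fix: audio is advertised and wired to the shared helper.
const openaiCodexMediaUnderstandingProvider: MediaUnderstandingProvider = {
  ...before,
  capabilities: ["image", "audio"],
  transcribeAudio: transcribeOpenAiAudio,
};

// Hypothetical selection check: a provider is usable for audio only if it
// both advertises the capability and supplies a handler.
function canTranscribe(p: MediaUnderstandingProvider): boolean {
  return p.capabilities.indexOf("audio") !== -1 && typeof p.transcribeAudio === "function";
}
```

Under these assumptions, canTranscribe is false for the image-only definition and true once both fields are present, which is the condition the updated test assertion is described as locking in.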

Change Type (select all)

Bug fix

Scope (select all touched areas)

Integrations

Linked Issue/PR

Closes #55237
This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

Root cause: openaiCodexMediaUnderstandingProvider was defined with capabilities: ["image"] only and no transcribeAudio handler
Missing detection / guardrail: No test asserted audio capability for codex provider
Prior context: The codex media provider was likely added as image-only initially, and audio was never wired up
Why this regressed now: Not a regression — audio was never supported for openai-codex

Regression Test Plan (if applicable)

Existing coverage already sufficient
Target test or file: extensions/openai/index.test.ts
Scenario the test should lock in: Codex media provider registers ["image", "audio"] capabilities
Existing test that already covers this: Updated existing assertion at line 196

User-visible / Behavior Changes

openai-codex provider now supports audio transcription via Whisper API

Diagram (if applicable)

N/A

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No — same Whisper endpoint, just now reachable via codex provider
Command/tool execution surface changed? No
Data access scope changed? No

Repro + Verification

Environment

OS: macOS 15.x (arm64)
Runtime/container: Node 24
Model/provider: openai-codex

Steps

Configure openai-codex as provider with audio enabled
Send audio attachment through any channel

Expected

Audio is transcribed via Whisper API

Actual (before fix)

Audio transcription skipped — codex provider had no audio capability

Evidence

Failing test/log before + passing after

Human Verification (required)

Verified scenarios: Unit tests pass, format check passes
Edge cases checked: Provider normalization preserves openai-codex ID correctly
What you did not verify: Live audio transcription with real OAuth credentials

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Risks and Mitigations

None

Changed files

.agents/skills/openclaw-parallels-smoke/SKILL.md (modified, +6/-0)
.github/workflows/ci-bun.yml (modified, +3/-1)
.github/workflows/docker-release.yml (modified, +0/-2)
AGENTS.md (modified, +8/-4)
CHANGELOG.md (modified, +39/-0)
apps/android/app/build.gradle.kts (modified, +2/-2)
apps/ios/Config/Version.xcconfig (modified, +3/-3)
apps/ios/README.md (modified, +3/-3)
apps/macos/Sources/OpenClaw/Resources/Info.plist (modified, +2/-2)
apps/macos/Sources/OpenClaw/TalkModeRuntime.swift (modified, +30/-10)
apps/macos/Tests/OpenClawIPCTests/TalkModeRuntimeSpeechTests.swift (modified, +9/-0)
apps/shared/OpenClawKit/Sources/OpenClawKit/TalkSystemSpeechSynthesizer.swift (modified, +34/-3)
apps/shared/OpenClawKit/Tests/OpenClawKitTests/TalkSystemSpeechSynthesizerTests.swift (added, +44/-0)
docs/.generated/config-baseline.json (modified, +470/-1471)
docs/.generated/config-baseline.jsonl (modified, +71/-184)
docs/.generated/plugin-sdk-api-baseline.json (modified, +330/-141)
docs/.generated/plugin-sdk-api-baseline.jsonl (modified, +136/-115)
docs/channels/bluebubbles.md (modified, +2/-1)
docs/channels/googlechat.md (modified, +1/-0)
docs/channels/msteams.md (modified, +2/-1)
docs/cli/gateway.md (modified, +2/-1)
docs/cli/index.md (modified, +9/-1)
docs/cli/models.md (modified, +9/-0)
docs/concepts/memory.md (modified, +7/-3)
docs/concepts/oauth.md (modified, +32/-3)
docs/gateway/authentication.md (modified, +20/-0)
docs/gateway/cli-backends.md (modified, +32/-7)
docs/help/faq.md (modified, +8/-3)
docs/help/testing.md (modified, +34/-9)
docs/install/development-channels.md (modified, +3/-3)
docs/plugins/architecture.md (modified, +9/-8)
docs/plugins/building-plugins.md (modified, +14/-13)
docs/plugins/manifest.md (modified, +39/-2)
docs/plugins/sdk-overview.md (modified, +31/-0)
docs/providers/anthropic.md (modified, +119/-3)
docs/reference/memory-config.md (modified, +14/-7)
docs/reference/secretref-credential-surface.md (modified, +3/-6)
docs/reference/secretref-user-supplied-credentials-matrix.json (modified, +6/-27)
docs/reference/test.md (modified, +1/-1)
docs/reference/wizard.md (modified, +1/-1)
docs/start/wizard-cli-reference.md (modified, +4/-1)
docs/tools/acp-agents.md (modified, +29/-5)
docs/tools/apply-patch.md (modified, +3/-2)
docs/tools/browser.md (modified, +38/-0)
docs/tools/exec.md (modified, +6/-4)
docs/tools/plugin.md (modified, +1/-0)
docs/tools/tts.md (modified, +56/-52)
docs/tts.md (modified, +56/-52)
extensions/acpx/openclaw.plugin.json (modified, +7/-3)
extensions/acpx/package.json (modified, +1/-1)
extensions/acpx/skills/acp-router/SKILL.md (modified, +38/-13)
extensions/acpx/src/config.test.ts (modified, +8/-0)
extensions/acpx/src/config.ts (modified, +64/-208)
extensions/acpx/src/runtime-internals/events.ts (modified, +18/-26)
extensions/acpx/src/runtime-internals/mcp-agent-command.test.ts (modified, +42/-0)
extensions/acpx/src/runtime-internals/mcp-agent-command.ts (modified, +16/-6)
extensions/acpx/src/runtime.test.ts (modified, +25/-0)
extensions/acpx/src/runtime.ts (modified, +3/-0)
extensions/amazon-bedrock/package.json (modified, +1/-1)
extensions/anthropic/cli-backend.ts (added, +59/-0)
extensions/anthropic/cli-migration.test.ts (added, +82/-0)
extensions/anthropic/cli-migration.ts (added, +131/-0)
extensions/anthropic/cli-shared.ts (added, +84/-0)
extensions/anthropic/index.ts (modified, +85/-2)
extensions/anthropic/openclaw.plugin.json (modified, +16/-2)
extensions/anthropic/package.json (modified, +1/-1)
extensions/bluebubbles/channel-config-api.ts (added, +1/-0)
extensions/bluebubbles/package.json (modified, +3/-6)
extensions/bluebubbles/src/actions.test.ts (modified, +33/-1)
extensions/bluebubbles/src/actions.ts (modified, +12/-6)
extensions/bluebubbles/src/channel-shared.ts (modified, +2/-3)
extensions/bluebubbles/src/config-schema.ts (modified, +7/-1)
extensions/bluebubbles/src/config-ui-hints.ts (added, +12/-0)
extensions/bluebubbles/src/monitor-processing.ts (modified, +13/-0)
extensions/bluebubbles/src/monitor.test.ts (modified, +47/-0)
extensions/bluebubbles/src/send.test.ts (modified, +4/-23)
extensions/bluebubbles/src/setup-core.ts (modified, +15/-12)
extensions/bluebubbles/src/test-harness.ts (modified, +21/-13)
extensions/brave/openclaw.plugin.json (modified, +3/-0)
extensions/brave/package.json (modified, +1/-1)
extensions/brave/web-search-provider.ts (added, +1/-0)
extensions/browser/index.test.ts (added, +90/-0)
extensions/browser/index.ts (added, +28/-0)
extensions/browser/openclaw.plugin.json (added, +9/-0)
extensions/browser/package.json (added, +12/-0)
extensions/browser/runtime-api.ts (added, +10/-0)
extensions/browser/src/browser-runtime.ts (added, +87/-0)
extensions/browser/src/browser-tool.actions.ts (renamed, +14/-8)
extensions/browser/src/browser-tool.schema.ts (renamed, +1/-1)
extensions/browser/src/browser-tool.test.ts (renamed, +15/-11)
extensions/browser/src/browser-tool.ts (renamed, +27/-27)
extensions/browser/src/browser/bridge-auth-registry.ts (renamed, +0/-0)
extensions/browser/src/browser/bridge-server.auth.test.ts (renamed, +0/-0)
extensions/browser/src/browser/bridge-server.ts (renamed, +0/-0)
extensions/browser/src/browser/browser-utils.test.ts (renamed, +0/-0)
extensions/browser/src/browser/cdp-proxy-bypass.test.ts (renamed, +0/-0)
extensions/browser/src/browser/cdp-proxy-bypass.ts (renamed, +0/-0)
extensions/browser/src/browser/cdp-timeouts.test.ts (renamed, +0/-0)
extensions/browser/src/browser/cdp-timeouts.ts (renamed, +0/-0)
extensions/browser/src/browser/cdp.helpers.ts (renamed, +0/-0)

PR #55788: Fix/OpenAI codex audio media understanding

Repository: openclaw/openclaw
Author: pxnt
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/55788

Description (problem / solution / changelog)

Closes #55237 Related #55052

The remainder of this description, from Summary through Risks and Mitigations, matches the body of PR #55323 above word for word.

Changed files

extensions/openai/index.test.ts (modified, +2/-1)
extensions/openai/media-understanding-provider.ts (modified, +9/-1)
src/media-understanding/defaults.ts (modified, +1/-0)

PR #61143: fix(signal): resolve contentType for voice notes when signal-cli omits it

Repository: openclaw/openclaw
Author: mindfury
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/61143

Description (problem / solution / changelog)

Summary

Problem: Signal voice notes are saved to disk but the transcription pipeline never runs. signal-cli on Linux omits contentType on voice note attachments, leaving saveMediaBuffer unable to classify the audio (fileTypeFromBuffer fails on ADTS AAC, no filePath fallback). The file is saved without extension, MediaTypes falls back to application/octet-stream, isAudioAttachment() returns false, and selectAttachments exits with outcome: no-attachment — silently.
Why it matters: Completely blocks voice memo transcription on Signal for Linux deployments. tools.media.audio is configured but never triggers.
What changed: In fetchAttachment, run detectMime({ buffer, filePath: filename }) before calling saveMediaBuffer so the extension-based MIME lookup resolves audio/aac from the attachment filename. Also forward attachment.filename as originalFilename so saved files preserve the original extension on disk.
What did NOT change: No core media pipeline changes. No other channel adapters touched. No config schema changes.
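
The fallback described above can be sketched as follows. This is a simplified stand-in, not the code from the PR: the names getFileExtension, MIME_BY_EXT, and isAudioAttachment appear in the PR text, but their bodies here, the table contents, the Attachment shape, and the resolveContentType helper are assumptions made for illustration.

```typescript
// Hypothetical extension table; the PR mentions a MIME_BY_EXT lookup but not its contents.
const MIME_BY_EXT: Record<string, string> = {
  ".aac": "audio/aac",
  ".ogg": "audio/ogg",
};

// Returns the lowercase extension of a bare filename or path, or "" if there is none.
function getFileExtension(fileName: string): string {
  const base = fileName.slice(fileName.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

interface Attachment {
  contentType?: string | null;
  filename?: string | null;
}

// Prefer the declared content type; otherwise fall back to the filename extension.
function resolveContentType(att: Attachment): string | undefined {
  if (att.contentType) return att.contentType;
  if (att.filename) return MIME_BY_EXT[getFileExtension(att.filename)];
  return undefined;
}

// With no resolved type the attachment is treated as application/octet-stream,
// which is the case the PR describes as silently skipping transcription.
function isAudioAttachment(mime: string | undefined): boolean {
  return (mime ?? "application/octet-stream").startsWith("audio/");
}
```

In this sketch an attachment with no contentType but a filename of voice.aac resolves to audio/aac and is classified as audio, while an attachment missing both fields still falls through to the unclassified case.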

AI-assisted: This fix was developed with Claude Code (Opus 4.6). Fully tested — see evidence below.

Dependency: This fix is necessary but not sufficient for end-to-end Signal voice transcription. OpenAI rejects .aac file extensions (returns 400 "Unsupported file format aac"), which is addressed by #61094. Both PRs must land for transcription to work. We verified the full chain locally with both fixes applied — see evidence.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes #48614
Related #61094 (.aac → .m4a remap — required for OpenAI provider acceptance)
Related #60421 (transcription errors silently swallowed at default log level)
Related #56010, #55052 (similar symptoms on Telegram)
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: fetchAttachment in extensions/signal/src/monitor.ts passes attachment.contentType ?? undefined to saveMediaBuffer. When signal-cli omits contentType (observed on Linux with signal-cli 0.14.1), saveMediaBuffer calls detectMime({ buffer, headerMime: undefined }) — no filePath, so the extension-based MIME fallback is impossible. fileTypeFromBuffer cannot detect ADTS-format AAC. Result: mime = undefined, file saved as bare UUID without extension, ctx.MediaTypes = ["application/octet-stream"], isAudioAttachment() returns false.
Missing detection / guardrail: No fallback to attachment.filename for MIME resolution. Matrix got this fix in v2026.3.28 (#55692 — forwarding originalFilename to saveMediaBuffer), but Signal was missed.
Contributing context: signal-cli on macOS consistently provides contentType: "audio/aac" on voice note attachments; Linux signal-cli 0.14.1 sometimes omits it. The SignalAttachment type already includes filename?: string but it was never used.

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: extensions/signal/src/monitor/event-handler.inbound-context.test.ts
Scenario the test should lock in:
1. Event handler threads resolved audio/aac contentType into MsgContext when fetchAttachment returns it (integration wiring test)
2. detectMime({ buffer, filePath: "voice.aac" }) returns "audio/aac" when buffer sniffing fails (core mechanism test — proves bare filename is sufficient for getFileExtension/MIME_BY_EXT lookup)
Why this is the smallest reliable guardrail: fetchAttachment is private, so we test the two halves separately — the event handler's contentType threading (existing test pattern) and the detectMime filename-based resolution (new direct test).
Existing test that already covers this (if any): The existing "forwards all fetched attachments via MediaPaths/MediaTypes" test at line 255 covers the happy path but explicitly expects "application/octet-stream" for attachments without contentType — confirming the bug was baked into test expectations.

Evidence

Unit tests:

pnpm test:extension signal — 19 files, 169/169 passed
pnpm check — clean (lint, format, typecheck)
pnpm build — clean

Live end-to-end verification (with #61094 applied on test branch):

Built v2026.4.2 + this fix + #61094's .aac → .m4a remap
Started gateway from fork against production workspace (~/.openclaw/)
Sent Signal voice note from phone
OpenClaw transcribed: "Signal audio transcription test with Ben Z's fix. Time is 10:15 p.m."
echoTranscript delivered transcript back to Signal chat

curl verification of the .aac rejection (why #61094 is needed):

$ curl https://api.openai.com/v1/audio/transcriptions -F file="@voice.aac" -F model="gpt-4o-mini-transcribe"
→ 400: "Unsupported file format aac"

$ curl https://api.openai.com/v1/audio/transcriptions -F "file=@voice.aac;filename=voice.m4a" -F model="gpt-4o-mini-transcribe"
→ 200: {"text": "Signal audio transcription test..."}
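
The two curl calls differ only in the filename sent with the upload, so the remap can be pictured as a small rename step applied before the request is built. The function below is a guess at that step, written only from the ".aac → .m4a remap" wording above. The name remapUploadFileName is hypothetical and the actual change in #61094 is not shown in this page.

```typescript
// Assumption: only a trailing ".aac" needs remapping; every other name passes through unchanged.
// The audio bytes are untouched; only the name presented to the API changes.
function remapUploadFileName(fileName: string): string {
  return fileName.replace(/\.aac$/i, ".m4a");
}
```

For example, remapUploadFileName("voice.aac") yields "voice.m4a", and a name such as "voice.ogg" is returned as-is.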

Human Verification (required)

Verified scenarios: Full Signal voice note → transcription → echo reply chain on Linux (Ubuntu, signal-cli 0.14.1, OpenAI gpt-4o-mini-transcribe). Sent 3 voice notes across test runs.
Edge cases checked: Attachment with contentType: undefined + filename: "voice.aac" (primary fix path). Verified detectMime resolves correctly with bare filename (no full path).
What I did not verify: Voice notes where both contentType AND filename are missing (falls back to pre-fix behavior — application/octet-stream). Other channels (Telegram, WhatsApp, Discord).

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Risks and Mitigations

Risk: detectMime called with bare filename instead of full path — getFileExtension uses path.extname which handles bare filenames correctly, verified by unit test.
- Mitigation: Direct unit test for detectMime({ buffer, filePath: "voice.aac" }) → "audio/aac".
Risk: When both contentType and filename are missing, behavior is unchanged (falls through to undefined). No regression, but also no improvement for that edge case.
- Mitigation: Documented in PR; would require adding voiceNote?: boolean to SignalAttachment type as a future enhancement.

Changed files

extensions/signal/src/monitor.ts (modified, +12/-2)
extensions/signal/src/monitor/event-handler.inbound-context.test.ts (modified, +46/-0)


Summary

This affects all channels (confirmed on both Telegram forum topics and WhatsApp groups), not just Telegram.

Environment

OpenClaw: 2026.3.24 (also confirmed on 2026.3.23-2)
OS: macOS 15.3 (arm64), Mac mini
Install: pnpm (global)
Node: v24.14.0
Telegram: forum supergroup (topic 1 / General)

Config (correct per docs)

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "models": [
          {
            "provider": "openai",
            "model": "gpt-4o-mini-transcribe",
            "language": "yue"
          }
        ]
      }
    }
  }
}

OpenAI API key is available to the gateway service (confirmed in launchd plist + openclaw models status)
OpenAI provider has audio capability registered (verified in dist bundle)
Manual curl to OpenAI transcription API with the same key + same .ogg file succeeds perfectly

Steps to Reproduce

Send a voice message to a Telegram forum topic (or WhatsApp group)
Voice file downloads successfully to ~/.openclaw/media/inbound/
Agent receives <media:audio> as body text
No transcription occurs — no {{Transcript}} set, no [Audio] block replacement

Expected Behavior

applyMediaUnderstandingIfNeeded() should detect MediaPath is set, invoke runCapability("audio"), and transcribe using the configured OpenAI model.

Actual Behavior

Gateway log shows zero audio/transcription/media-understanding entries around voice message receipt
Even with --verbose flag and OPENCLAW_DEBUG_TELEGRAM_INGRESS=1, no Telegram inbound voice processing log appears
WhatsApp shows inbound audio log ([whatsapp] Inbound message ... audio/ogg; codecs=opus) but also no transcription log
The applyMediaUnderstanding function is simply never called

Code Path Analysis

Traced through the minified dist bundle:

Telegram handler: resolveMediaFileRef(msg) correctly includes msg.voice → file downloads OK
Context building: buildTelegramInboundContextPayload sets MediaPath, MediaType, MediaPaths from allMedia
Dispatch: dispatchTelegramMessage → dispatchReplyWithBufferedBlockDispatcher → should reach getReplyFromConfig
Media understanding gate: getReplyFromConfig calls applyMediaUnderstandingIfNeeded() which checks hasInboundMedia(ctx) — this should return true since MediaPath is set
But: the audio transcription pipeline never executes. No log output at all, even in verbose mode.

Possibly Related

2026.3.22 changelog: "Agents/inbound: lazy-load media and link understanding for plain-text turns" — this optimization may incorrectly classify voice-only messages (no text body) as "plain-text turns" and skip media understanding
GitHub issue #7899 (similar report from 2026-02-03)
GitHub issue #14374 (feature request, 2026-02-12)
Discord reports from 2026-02-15 and 2026-03-05
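
To make the first hypothesis above concrete, the sketch below contrasts a check that decides "plain-text turn" from the body alone with one that also looks at attachments. This is speculation about the shape of the suspected bug, not code taken from OpenClaw. The InboundContext shape and both isPlainTextTurn functions are hypothetical; only the field names Body-style placeholder text, MediaPath, and MediaPaths echo the report.

```typescript
// Hypothetical context shape, loosely based on the fields named in the report.
interface InboundContext {
  Body?: string;
  MediaPath?: string;
  MediaPaths?: string[];
}

function hasInboundMedia(ctx: InboundContext): boolean {
  return Boolean(ctx.MediaPath) || (ctx.MediaPaths?.length ?? 0) > 0;
}

// Suspected failure mode: any turn with a string body counts as plain text,
// so a voice-only message with a placeholder body skips media understanding.
function isPlainTextTurnNaive(ctx: InboundContext): boolean {
  return typeof ctx.Body === "string";
}

// Safer variant: a turn is plain text only when no media is attached.
function isPlainTextTurn(ctx: InboundContext): boolean {
  return !hasInboundMedia(ctx);
}
```

With a context such as { Body: "<media:audio>", MediaPath: "/tmp/voice.ogg" }, the naive check reports a plain-text turn while the attachment-aware check does not, which would match the symptom of the pipeline never being reached.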

Workaround

Agent manually calls OpenAI transcription API via curl for each voice message.

extent analysis

Fix Plan

To fix the issue with the applyMediaUnderstanding audio transcription pipeline not being invoked, we need to modify the applyMediaUnderstandingIfNeeded function to correctly handle voice messages without text bodies.

Here are the steps:

Update the hasInboundMedia function to check for MediaPath and MediaType in the context payload.
Modify the applyMediaUnderstandingIfNeeded function to call runCapability("audio") when hasInboundMedia returns true.
Add logging to verify that the audio transcription pipeline is being invoked.

Code Changes

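// Sketch only: the real signatures in OpenClaw may differ.
// MediaType is a MIME string such as "audio/ogg; codecs=opus", so match on the
// "audio/" prefix instead of comparing against the literal 'audio'.
function hasInboundMedia(ctx) {
  return Boolean(ctx.MediaPath) && String(ctx.MediaType ?? '').startsWith('audio/');
}

// Run transcription whenever inbound audio is present, even if the text body is empty.
// Assumption: runCapability accepts the capability name and the message context.
async function applyMediaUnderstandingIfNeeded(ctx) {
  if (!hasInboundMedia(ctx)) return;
  console.log('Invoking audio transcription pipeline...');
  await runCapability('audio', ctx);
}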

Verification

To verify that the fix worked, send a voice message to a Telegram forum topic or WhatsApp group and check the gateway log for audio/transcription/media-understanding entries. The log should now show that the audio transcription pipeline is being invoked.

Extra Tips

Make sure to update the openclaw version to the latest release to ensure that the fix is included.
If issues persist, try enabling verbose logging to get more detailed output.
Consider adding additional logging to the applyMediaUnderstandingIfNeeded function to verify that it is being called correctly.
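
One low-risk way to add the logging suggested above is to wrap the function under suspicion instead of editing its body. The helper below is a generic sketch. withCallLog and the "[media-understanding]" log prefix are names invented here, and how it would be wired into the gateway is left open.

```typescript
// Wraps an async function so each call and its outcome are logged.
function withCallLog<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => Promise<R>,
  log: (line: string) => void = console.log,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    // Logged synchronously on entry, so a missing line means the call never happened.
    log(`[media-understanding] ${name} called`);
    try {
      const result = await fn(...args);
      log(`[media-understanding] ${name} finished`);
      return result;
    } catch (err) {
      log(`[media-understanding] ${name} failed: ${String(err)}`);
      throw err;
    }
  };
}
```

If the "called" line never shows up after a voice message arrives, the gate is being skipped upstream; if it shows up followed by "failed", the problem lies in the provider call instead.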
