openclaw - ✅(Solved) Fix [Bug]: Audio understanding: OpenAI provider returns no transcript and no surfaced error on valid OGG/Opus [1 pull request, 1 participant]

openclaw/openclaw#65076 · Fetched 2026-04-12 13:25:43
Comments: 0 · Participants: 1 · Timeline events: 3 (labeled ×2, cross-referenced ×1) · Reactions: 0

With tools.media.audio.enabled: true and an OpenAI provider entry whose API key verifiably works against /v1/audio/transcriptions, both the Telegram voice-note path and the openclaw infer audio transcribe CLI silently fail to produce any transcript for valid OGG/Opus files, with no error or warning surfaced in the default gateway log.

Error Message

The CLI reports only a generic error, and nothing is logged at error or warn level:

    Error: No transcript returned for audio: /home/maoa/.openclaw/media/inbound/file_0.ogg

On the Telegram path, no [Audio] block replaces Body, {{Transcript}} is not populated, and no error or warn line appears in journalctl --user -u openclaw-gateway.service or in the gateway log file under /tmp/openclaw/openclaw-*.log.

Root Cause

Reporter's code-path observation (not speculation): dispatch-acp-rBcmOCzP.js:858 in the bundled dist/ wraps applyMediaUnderstanding in try/catch where the error branch only calls logVerbose(...); without --verbose the swallowed error is invisible in the default gateway log. The reporter presents this as an observation about the default diagnostic path, not a claim about the root cause, and marks the exact swallowed error as NOT_ENOUGH_INFO: capturing --verbose on a live systemd --user gateway would require restarting the unit with different arguments and interrupting the active Telegram channel.

PR #65096 identifies the underlying cause: the audio runner already recorded provider exceptions as failed attempts, but when no output was produced it still returned a top-level skipped decision, which erased the actual OpenAI failure reason.
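To make the try/catch observation concrete, here is a minimal TypeScript sketch of the pattern. It is not the OpenClaw source: the `Logger` shape, both wrapper functions, and the sink helper are hypothetical names introduced for illustration, and the step is synchronous only to keep the example short.

```typescript
// Hypothetical logger shape; OpenClaw's real logging API may differ.
type Logger = {
  warn: (msg: string) => void;
  verbose: (msg: string) => void;
};

// Pattern described in the report: the catch branch logs only at verbose
// level, so nothing appears in a default (non-verbose) log.
function runSwallowingFailure(step: () => void, log: Logger): void {
  try {
    step();
  } catch (err) {
    log.verbose(`media understanding failed: ${String(err)}`);
  }
}

// Alternative sketch: report the same failure at warn level, including the
// underlying reason, so it is visible without --verbose.
function runSurfacingFailure(step: () => void, log: Logger): void {
  try {
    step();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log.warn(`media understanding failed: ${reason}`);
  }
}

// Simulates a default log sink: warn lines are kept, verbose lines dropped.
function makeDefaultSink(): { log: Logger; lines: string[] } {
  const lines: string[] = [];
  return {
    lines,
    log: {
      warn: (msg) => lines.push(`[warn] ${msg}`),
      verbose: () => {},
    },
  };
}

// Made-up failure standing in for whatever the provider call throws.
const failingStep = (): void => {
  throw new Error("example provider error");
};

const swallowed = makeDefaultSink();
runSwallowingFailure(failingStep, swallowed.log);

const surfaced = makeDefaultSink();
runSurfacingFailure(failingStep, surfaced.log);
```

With the first wrapper the default sink stays empty, which matches the silent behavior in the report; with the second, one warn line carries the reason.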

Fix / Workaround

- This bug is isolated to the `tools.media.audio` transcription pipeline. All other OpenClaw subsystems exercised in this deployment are functional:
    - PDF/XML/ZIP invoice ingestion via a workspace skill
    - `memory-lancedb` with OpenAI embeddings (`text-embedding-3-small`)
    - `plugins.entries.brave.config.webSearch` (Brave Search API)
    - `browser` plugin with Google Chrome 147.0.7727.55 headless (JavaScript-heavy sites scraped successfully)
    - Telegram channel polling and `dmPolicy: pairing` enforcement
- NOT_ENOUGH_INFO on first known bad version or last known good version: this is the first OpenClaw install in this environment and audio was never observed working. This is therefore **not** asserted as a regression.
- Workaround: `tools.media.audio.enabled: false` and text-only user input. Functional, non-disruptive to the rest of the stack.

PR fix notes

PR #65096: fix(media): surface OpenAI audio transcription failures

Description (problem / solution / changelog)

What changed

  • keep all-provider audio transcription failures as a top-level failed media-understanding decision instead of downgrading them to skipped
  • propagate the underlying failure from file-based audio transcription so the CLI reports the real provider error
  • surface failed media reasons in status output and cover the regression at the runner, runtime, status, and CLI layers

Why

The audio runner already recorded provider exceptions as failed attempts, but if no output was produced it still returned a top-level skipped decision. That erased the actual OpenAI failure reason, which is why a valid OGG/Opus transcription failure could degrade into the generic No transcript returned for audio message with no normal-severity media failure signal.
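The change described above can be pictured with a small TypeScript sketch. This is an illustration under assumptions, not the code from the PR: the `Attempt` and `Decision` shapes and the `summarizeAttempts` function are hypothetical, and the rule is simplified to "no success plus at least one failure means failed".

```typescript
// Hypothetical shapes; the real types in src/media-understanding/types.ts
// may differ.
type Outcome = "success" | "skipped" | "failed";

interface Attempt {
  provider: string;
  outcome: Outcome;
  reason?: string;
}

interface Decision {
  outcome: Outcome;
  reason?: string;
}

// Collapse per-provider attempts into one top-level decision.
// A failure is kept as "failed" with its reasons attached, instead of being
// downgraded to "skipped" just because no output was produced.
function summarizeAttempts(attempts: Attempt[]): Decision {
  if (attempts.some((a) => a.outcome === "success")) {
    return { outcome: "success" };
  }
  const failures = attempts.filter((a) => a.outcome === "failed");
  if (failures.length > 0) {
    const reason = failures
      .map((a) => `${a.provider}: ${a.reason ?? "unknown error"}`)
      .join("; ");
    return { outcome: "failed", reason };
  }
  return { outcome: "skipped" };
}

// The reason string here is made up for the example.
const decision = summarizeAttempts([
  { provider: "openai", outcome: "failed", reason: "example provider error" },
]);
const emptyDecision = summarizeAttempts([]);
```

Keeping the reason on the top-level decision is what lets later layers (status output, CLI) report something more specific than a generic "no transcript" message.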

Validation

  • pnpm vitest run src/cli/capability-cli.test.ts src/media-understanding/runner.skip-tiny-audio.test.ts src/media-understanding/runtime.test.ts src/auto-reply/status.test.ts
  • pnpm lint src/cli/capability-cli.test.ts src/media-understanding/types.ts src/media-understanding/runner.entries.ts src/media-understanding/runner.ts src/media-understanding/runtime.ts src/media-understanding/runner.skip-tiny-audio.test.ts src/media-understanding/runtime.test.ts src/auto-reply/status.ts src/auto-reply/status.test.ts
  • local commit hook suite (pnpm check, full lint, policy checks)

Fixes #65076

Changed files

  • CHANGELOG.md (modified, +2/-0)
  • src/auto-reply/status.test.ts (modified, +37/-0)
  • src/auto-reply/status.ts (modified, +8/-4)
  • src/cli/capability-cli.test.ts (modified, +16/-0)
  • src/media-understanding/runner.entries.ts (modified, +43/-9)
  • src/media-understanding/runner.skip-tiny-audio.test.ts (modified, +26/-0)
  • src/media-understanding/runner.ts (modified, +16/-2)
  • src/media-understanding/runtime.test.ts (modified, +39/-0)
  • src/media-understanding/runtime.ts (modified, +13/-0)
  • src/media-understanding/types.ts (modified, +1/-0)


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

With tools.media.audio.enabled: true and an OpenAI provider entry whose API key verifiably works against /v1/audio/transcriptions, both the Telegram voice-note path and the openclaw infer audio transcribe CLI silently fail to produce any transcript for valid OGG/Opus files, with no error or warning surfaced in the default gateway log.

Steps to reproduce

  1. Set the following in ~/.openclaw/openclaw.json:
    • tools.media.audio.enabled: true
    • tools.media.audio.models[0]: { "provider": "openai", "model": "gpt-4o-mini-transcribe" }
    • models.providers.openai: { "baseUrl": "https://api.openai.com/v1", "models": [], "apiKey": "sk-proj-...", "auth": "api-key" }
  2. openclaw config validate reports Config valid: ~/.openclaw/openclaw.json.
  3. systemctl --user restart openclaw-gateway.service; gateway log reports ready (6 plugins, ~2s).
  4. Run: openclaw infer audio transcribe --file /home/maoa/.openclaw/media/inbound/file_0.ogg --model openai/gpt-4o-mini-transcribe --language es (Source file is a valid Telegram voice note, audio/ogg; codecs=opus, 120 KB, Spanish speech, previously downloaded by the gateway Telegram channel into media/inbound/.)
  5. Observe the CLI output below in Actual behavior.

Actual behavior

CLI path. The command

openclaw infer audio transcribe --file /home/maoa/.openclaw/media/inbound/file_0.ogg --model openai/gpt-4o-mini-transcribe --language es

returns (redacted real path):

Error: No transcript returned for audio: /home/maoa/.openclaw/media/inbound/file_0.ogg

The string No transcript returned for audio: originates in the bundled capability-cli-C8QLmK_t.js:656, which throws when the resolved transcript text is empty or missing.
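A hedged TypeScript sketch of how such a check could carry the underlying failure along with the generic message. The `TranscriptionResult` shape, the `failureReason` field, and the parenthesized message format are assumptions for illustration; only the message prefix comes from the report.

```typescript
// Hypothetical result shape for a file-based transcription call.
interface TranscriptionResult {
  text?: string;
  failureReason?: string;
}

// Return the transcript, or throw an error that includes the underlying
// failure reason when one is known.
function requireTranscript(result: TranscriptionResult, filePath: string): string {
  const text = result.text?.trim();
  if (text) {
    return text;
  }
  const base = `No transcript returned for audio: ${filePath}`;
  throw new Error(result.failureReason ? `${base} (${result.failureReason})` : base);
}

// Example: a result with no text and a made-up failure reason.
let message = "";
try {
  requireTranscript({ failureReason: "example provider error" }, "/tmp/example.ogg");
} catch (err) {
  message = err instanceof Error ? err.message : String(err);
}
```

Without a `failureReason`, the sketch falls back to the same generic message the report shows, which is the case where the caller has nothing more specific to say.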

ACP / Telegram path. On Telegram voice-note delivery, the gateway downloads the file to ~/.openclaw/media/inbound/file_X---<uuid>.ogg, then delivers the user message to the agent session with the raw template intact:

[media attached: /home/maoa/.openclaw/media/inbound/file_0---<uuid>.ogg (audio/ogg; codecs=opus) | .../file_0---<uuid>.ogg]
...
<media:audio>

No [Audio] block replaces Body, {{Transcript}} is not populated, and no error or warn line appears in journalctl --user -u openclaw-gateway.service or in the gateway log file under /tmp/openclaw/openclaw-*.log.

Counter-evidence that the file and key are valid. The exact same file and the exact same API key return HTTP 200 with the correct transcript when posted directly to OpenAI, outside OpenClaw:

$ curl -s -o /tmp/oai.json -w "%{http_code}\n" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F "file=@/home/maoa/.openclaw/media/inbound/file_0.ogg" \
    -F "model=gpt-4o-mini-transcribe" \
    https://api.openai.com/v1/audio/transcriptions
200
$ jq -r .text /tmp/oai.json
Es decir que me entiendes si escribo un mensaje así de voz? Confírmame si me entiendes.

The same request with type=audio/ogg, type=audio/ogg; codecs=opus, and type=application/octet-stream all return HTTP 200 with the same transcript; MIME type is not the differentiator.
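For comparison with the curl check, the same direct request can be sketched in TypeScript using the global `fetch`, `FormData`, and `Blob` available in Node 18 and later. The endpoint and the `file` and `model` fields come from the curl command above; the function names are hypothetical, and this is not how OpenClaw itself issues the request.

```typescript
// Build the multipart body that the curl command sends: a "file" part and a
// "model" part. The file name is passed explicitly, as curl derives it
// from the path.
function buildTranscriptionForm(file: Blob, fileName: string, model: string): FormData {
  const form = new FormData();
  form.append("file", file, fileName);
  form.append("model", model);
  return form;
}

// POST the audio to the transcription endpoint and return the "text" field.
// A non-2xx response is thrown with its status and body instead of being
// turned into an empty transcript.
async function transcribeDirect(
  file: Blob,
  fileName: string,
  apiKey: string,
  model: string,
): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: buildTranscriptionForm(file, fileName, model),
  });
  if (!res.ok) {
    throw new Error(`transcription failed: HTTP ${res.status} ${await res.text()}`);
  }
  const data = (await res.json()) as { text?: string };
  return data.text ?? "";
}
```

To run it against a real file, one option in Node 20+ is to obtain the `Blob` with `fs.openAsBlob(path)` and pass the key from `process.env.OPENAI_API_KEY`. Throwing on a non-OK status is a deliberate choice here: it keeps a provider error distinguishable from an empty transcript.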

OpenClaw version

2026.4.9 (0512059)

Operating system

Ubuntu 24.04.4 LTS under WSL2 (kernel 6.6.87.2-microsoft-standard-WSL2)

Install method

npm global — installed under ~/.npm-global/lib/node_modules/openclaw ; gateway runs as systemd --user service (openclaw-gateway.service) on loopback bind

Model

openai/gpt-4o-mini-transcribe (audio transcription; model under test)

Provider / routing chain

openclaw -> api.openai.com/v1/audio/transcriptions (direct, no proxy, no gateway-in-front)

Additional provider/model setup details

  • Agent loop model (not under test for this bug, included for completeness): openai-codex/gpt-5.4 via ChatGPT OAuth, configured under agents.defaults.model.primary. This is a different auth path than the transcription call and is functioning correctly for the text-only agent loop.
  • Transcription auth path used for this bug: models.providers.openai.apiKey = direct OpenAI API key (sk-proj-...), independent from the Codex OAuth profile. The key has Audio + Embeddings scopes in the OpenAI platform dashboard.
  • The same API key is also duplicated under skills.entries.openai-whisper-api.apiKey (where openclaw skills check openai-whisper-api reports ✓ Ready, Environment: ✓ OPENAI_API_KEY) and under plugins.entries.memory-lancedb.config.embedding.apiKey (where the memory-lancedb plugin loads and successfully produces text-embedding-3-small embeddings — confirming the same key works for other OpenAI endpoints from within OpenClaw). Only the tools.media.audio transcription path fails.
  • openclaw infer audio providers reports for openai: {"available": true, "configured": true, "selected": false, "id": "openai", "capabilities": ["image", "audio"]}. NOT_ENOUGH_INFO on whether selected: false is the idle state or the symptom; the maintainer can answer this directly.
  • Telegram channel config: dmPolicy: pairing, groupPolicy: allowlist, single-user setup with 2 paired senders. Voice notes arrive via the standard bot polling provider.

Logs, screenshots, and evidence

**Gateway restart log (redacted, the relevant lines only):**

    Apr 11 11:09:29 [gateway] loading configuration…
    Apr 11 11:09:29 [gateway] resolving authentication…
    Apr 11 11:09:31 [gateway] agent model: openai-codex/gpt-5.4
    Apr 11 11:09:31 [gateway] ready (6 plugins, 2.2s)
    Apr 11 11:09:31 [gateway] starting channels and sidecars...
    Apr 11 11:09:31 [telegram] [default] starting provider (@<redacted>_bot)

No `error` or `warn` line contains `audio`, `transcri`, `ogg`, `whisper`, or `media` between gateway start and the failed CLI invocation. Verified with:

    grep -iE 'audio|transcri|ogg|whisper|media' /tmp/openclaw/openclaw-2026-04-11.log

which returns no matches relevant to the failure path, only the earlier unrelated `[warn] memory: sqlite-vec unavailable` line from the `memory-core` subsystem, which is not related to `tools.media.audio`.
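The same check can be narrowed to warn and error lines only, which avoids reading through unrelated matches. The TypeScript sketch below reuses the keyword list from the grep command; the `[warn]` / `[error]` marker format is an assumption based on the single `[warn]` line quoted above, and `findMediaProblems` is a hypothetical helper.

```typescript
// Keywords from the grep command in the report.
const MEDIA_PATTERN = /audio|transcri|ogg|whisper|media/i;
// Assumed severity markers; the real log format may differ.
const SEVERITY_PATTERN = /\[(warn|error)\]/i;

// Return only the lines that are both warn/error severity and media-related.
function findMediaProblems(logText: string): string[] {
  return logText
    .split(/\r?\n/)
    .filter((line) => SEVERITY_PATTERN.test(line) && MEDIA_PATTERN.test(line));
}

// Lines taken from the gateway restart log in the report.
const observedLog = [
  "Apr 11 11:09:31 [gateway] agent model: openai-codex/gpt-5.4",
  "Apr 11 11:09:31 [gateway] ready (6 plugins, 2.2s)",
  "Apr 11 11:09:31 [gateway] starting channels and sidecars...",
].join("\n");

// A made-up line showing what a surfaced failure might look like.
const hypotheticalLog =
  observedLog + "\nApr 11 11:10:02 [warn] media: audio transcription failed";

const observedHits = findMediaProblems(observedLog);
const hypotheticalHits = findMediaProblems(hypotheticalLog);
```

On the observed lines the helper returns nothing, consistent with the report; on the hypothetical log it returns the one made-up warn line.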

**Session JSONL excerpt showing the unreplaced body on the ACP path (redacted):**

    {"type":"message","message":{"role":"user","content":[{"type":"text","text":"[media attached: /home/maoa/.openclaw/media/inbound/file_0---<uuid>.ogg (audio/ogg; codecs=opus) | .../file_0---<uuid>.ogg]\n...\n<media:audio>"}]}}

No downstream message in the same session contains `[Audio]` or a `Transcript` field derived from this attachment.

**Only code-path observation (not speculation):** `dispatch-acp-rBcmOCzP.js:858` in the bundled `dist/` wraps `applyMediaUnderstanding` in `try/catch` where the error branch only calls `logVerbose(...)`; without `--verbose` the swallowed error is invisible in the default gateway log. This is an observation about the default diagnostic path, not a claim about the root cause. NOT_ENOUGH_INFO on the exact error being swallowed, because capturing `--verbose` on a live `systemd --user` gateway would require restarting the unit with different arguments and interrupting the active Telegram channel; happy to capture it if the maintainer can point me at a scoped verbose flag for `tools.media.audio` alone.

Impact and severity

  • Affected: single-user personal-assistant deployment, Telegram channel with voice-note ingestion; the CLI openclaw infer audio transcribe path is also affected.
  • Severity: blocks workflow for the audio understanding modality; text-only agent workflow is unaffected.
  • Frequency: always. 2/2 distinct voice notes (120 KB Spanish speech, 50 KB short silence) fail via ACP; 2/2 invocations of openclaw infer audio transcribe fail via CLI. 0/4 successes on the OpenClaw path; 2/2 successes against the same files through direct curl to /v1/audio/transcriptions.
  • Consequence: voice notes from users are silently dropped into the agent session as raw media markers without transcripts, so the agent cannot act on voice content; the text-only workflow continues to function.

Additional information

  • This bug is isolated to the tools.media.audio transcription pipeline. All other OpenClaw subsystems exercised in this deployment are functional:
    • PDF/XML/ZIP invoice ingestion via a workspace skill
    • memory-lancedb with OpenAI embeddings (text-embedding-3-small)
    • plugins.entries.brave.config.webSearch (Brave Search API)
    • browser plugin with Google Chrome 147.0.7727.55 headless (JavaScript-heavy sites scraped successfully)
    • Telegram channel polling and dmPolicy: pairing enforcement
  • NOT_ENOUGH_INFO on first known bad version or last known good version: this is the first OpenClaw install in this environment and audio was never observed working. This is therefore not asserted as a regression.
  • Workaround: tools.media.audio.enabled: false and text-only user input. Functional, non-disruptive to the rest of the stack.

Report prepared by Claude (Anthropic) via Claude Code, working as Maelo's technical operator inside his live WSL environment. All evidence in this issue is grounded in direct observation of logs, configs, and CLI output from that environment; nothing has been paraphrased from third parties or inferred from other setups. Where evidence is missing, the literal marker NOT_ENOUGH_INFO appears in place of speculation.

extent analysis

TL;DR

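The transcription failure was never surfaced: the audio runner downgraded an all-provider failure to a skipped decision, so the CLI fell back to the generic No transcript returned for audio message and the gateway logged nothing at default severity. PR #65096 keeps the failure as a top-level failed decision and propagates the provider error to the CLI and status output. The report does not identify the underlying provider error, and it leaves open whether the selected: false status of the OpenAI provider is related.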

Guidance

  • Verify the selected status of the OpenAI provider by checking the output of openclaw infer audio providers and investigate why it is set to false.
  • Check the gateway log with the --verbose flag to capture the error being swallowed in the dispatch-acp-rBcmOCzP.js code path, which may provide more information about the issue.
  • Test the transcription pipeline with a different audio file or provider to isolate the issue.
  • Review the OpenAI API key configuration and scopes to ensure they are correct and sufficient for the transcription task.

Example

No code snippet is provided as the issue is related to a specific configuration and setup, and modifying code without understanding the root cause may not be effective.

Notes

The issue seems to be isolated to the tools.media.audio transcription pipeline, and all other OpenClaw subsystems are functional. The NOT_ENOUGH_INFO markers indicate areas where more information is needed to fully understand the issue.

Recommendation

Apply a workaround by setting tools.media.audio.enabled: false and using text-only user input until the issue is resolved. This will allow the rest of the stack to function while the transcription pipeline is investigated and fixed.
