openclaw - ✅(Solved) Fix v2026.4.8: Telegram voice notes reach agent as raw .ogg, no transcript echo [1 pull requests, 3 comments, 3 participants]

verter58bot · 2026-04-08T20:13:09Z



GitHub stats

openclaw/openclaw#63349 • Fetched 2026-04-09 07:54:59

Timeline (top): commented ×3, referenced ×2, cross-referenced ×1, subscribed ×1

On OpenClaw v2026.4.8, Telegram voice notes are reaching the agent as raw .ogg attachments instead of being transcribed upstream, even though tools.media.audio.echoTranscript is enabled and the gateway was restarted.

Root Cause

Auto-selected inbound audio transcription kept the same provider as the active chat model but forwarded the chat model id (e.g. gpt-5.4) to the /audio/transcriptions endpoint. That endpoint expects a speech-to-text model, so transcription could fail, and the voice note then reached the agent as a raw .ogg attachment with no transcript echo.

Fix Action

Fix proposed in PR #63472: fix(media): use default STT model for auto audio transcription (#63349) (https://github.com/openclaw/openclaw/pull/63472). At fetch time the PR was open and not yet merged.

PR fix notes

PR #63472: fix(media): use default STT model for auto audio transcription (#63349)

Repository: openclaw/openclaw
Author: neeravmakwana
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/63472

Description (problem / solution / changelog)

Summary

Problem: Auto-selected inbound audio transcription reused the active chat model id (e.g. gpt-5.4) as the /audio/transcriptions model. That endpoint expects speech-to-text models, so transcription could fail and voice notes surfaced as raw attachments with no transcript echo (tools.media.audio.echoTranscript).
Why it matters: Telegram and other channels rely on successful STT before the agent turn; failed STT matches reports in #63349.
What changed: resolveActiveModelEntry no longer passes the chat model for the audio capability; runProviderEntry continues to resolve the provider default (e.g. OpenAI gpt-4o-transcribe) when no explicit tools.media.audio.models entry applies.
What did NOT change: Explicit tools.media.audio.models, image/video active-model behavior, and non-auto entry paths are unchanged.
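
The change described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the types, the `DEFAULT_STT_MODEL` table, and the function names `resolveActiveEntry` and `resolveTranscriptionModel` are assumptions made for this example, and only the OpenAI default (`gpt-4o-transcribe`) is taken from the PR description.

```typescript
// Hypothetical, simplified types: the real OpenClaw types are not shown in the PR notes.
type Capability = "audio" | "image" | "video";

interface ModelEntry {
  provider: string;
  model?: string; // undefined means "let the provider pick its default"
}

// Assumed defaults table; only the OpenAI value is named in the PR description.
const DEFAULT_STT_MODEL: Record<string, string> = {
  openai: "gpt-4o-transcribe",
};

// Sketch of the post-fix behavior: keep the active provider, but never
// forward the chat model id for the audio capability.
function resolveActiveEntry(capability: Capability, active: ModelEntry): ModelEntry {
  if (capability === "audio") {
    return { provider: active.provider };
  }
  // Image/video keep using the active model, as the PR says they are unchanged.
  return { ...active };
}

// Sketch of the provider step: fill in the default STT model when none is set.
function resolveTranscriptionModel(entry: ModelEntry): string {
  const model = entry.model ?? DEFAULT_STT_MODEL[entry.provider];
  if (model === undefined) {
    throw new Error(`no default STT model known for provider "${entry.provider}"`);
  }
  return model;
}
```

With an active chat model of `{ provider: "openai", model: "gpt-5.4" }`, the audio path in this sketch resolves to `gpt-4o-transcribe`, while an explicitly configured model on the entry would still be used as given.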

Change Type

Bug fix

Scope

Gateway / orchestration

Linked Issue/PR

Closes #63349
This PR fixes a bug or regression

Root Cause

Root cause: Active provider selection for auto audio correctly picked the same provider as chat but incorrectly forwarded the conversation model id into the transcription API.
Missing detection / guardrail: Unit coverage now asserts auto audio does not pass a non-STT chat model when activeModel is set.

Regression Test Plan

Coverage level: Unit test
Target test: src/media-understanding/runner.auto-audio.test.ts
Scenario: activeModel: { provider: "openai", model: "gpt-5.4" } → transcription request uses gpt-4o-transcribe.
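
A regression check of the kind described in this test plan might look like the sketch below. It is not the content of `runner.auto-audio.test.ts`: `runAutoAudio`, the `Transcribe` type, and the recording fake are hypothetical stand-ins, used only to show how a test can assert on the model id that reaches the transcription call.

```typescript
// Hypothetical shapes for illustration; not the real OpenClaw test helpers.
interface ActiveModel {
  provider: string;
  model: string;
}

type Transcribe = (req: { provider: string; model: string }) => string;

// Simplified auto-audio runner: uses the provider default and ignores the chat model id.
function runAutoAudio(
  active: ActiveModel,
  sttDefaults: Record<string, string>,
  transcribe: Transcribe,
): string {
  const model = sttDefaults[active.provider];
  if (model === undefined) {
    throw new Error(`no STT default for provider "${active.provider}"`);
  }
  return transcribe({ provider: active.provider, model });
}

// Fake endpoint that records every model id it is asked to use.
const requestedModels: string[] = [];
const fakeTranscribe: Transcribe = (req) => {
  requestedModels.push(req.model);
  return "hello from a voice note";
};

const transcript = runAutoAudio(
  { provider: "openai", model: "gpt-5.4" },
  { openai: "gpt-4o-transcribe" },
  fakeTranscribe,
);
```

Recording the requested model in a fake keeps the check independent of any real provider and targets the exact symptom from the root cause: a chat model id reaching the transcription API.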

User-visible / Behavior Changes

Inbound voice/audio transcription when using auto model selection with the same provider as the active chat model now uses that provider’s default STT model instead of the chat model.

Security Impact

New permissions/capabilities? No
New network endpoints or data exfiltration risk? No

Changelog

Entry added under Unreleased → Fixes in CHANGELOG.md.

Changed files

CHANGELOG.md (modified, +1/-0)
src/media-understanding/runner.auto-audio.test.ts (modified, +30/-0)
src/media-understanding/runner.ts (modified, +67/-1)

Environment

OpenClaw: 2026.4.8
Channel: Telegram direct chat
Host OS: Linux 6.8.0-107-generic x64
Node: 22.22.1
Gateway: local loopback, systemd service running

Relevant config

{
  "tools": {
    "media": {
      "audio": {
        "echoTranscript": true
      }
    }
  },
  "plugins": {
    "entries": {
      "telegram": { "enabled": true },
      "openai": { "enabled": true }
    }
  }
}

Notes:

- no plugins.allow
- no explicit tools.media.audio.models
- no tools.media.audio.enabled: false

So this should be the normal auto-detect path described in the docs.

Expected behavior

For Telegram voice notes:

1. OpenClaw transcribes audio upstream via normal inbound audio handling.
2. Because tools.media.audio.echoTranscript=true, the transcript is echoed into chat before agent processing.
3. The agent sees transcript/body replacement rather than only a raw .ogg attachment.
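
The three steps can be pictured with the sketch below, which also shows how a failed transcription would leave the raw attachment in place. Every name here (`InboundMessage`, `AudioConfig`, `handleVoiceNote`, the `transcribe` and `echo` callbacks) is hypothetical and chosen for illustration; the real gateway code is not shown in the issue, and the real call is presumably asynchronous.

```typescript
// All names are hypothetical; they only mirror the three expected-behavior steps.
interface InboundMessage {
  body: string;
  attachments: string[]; // file paths or names, e.g. "voice.ogg"
}

interface AudioConfig {
  echoTranscript: boolean;
}

// Returns the transcript, or null to model a failed STT call (kept synchronous for brevity).
type TranscribeFn = (audioPath: string) => string | null;

function handleVoiceNote(
  msg: InboundMessage,
  config: AudioConfig,
  transcribe: TranscribeFn,
  echo: (text: string) => void,
): InboundMessage {
  const audio = msg.attachments.find((a) => a.endsWith(".ogg"));
  if (audio === undefined) {
    return msg; // nothing to transcribe
  }

  // Step 1: transcribe upstream, before the agent turn.
  const transcript = transcribe(audio);
  if (transcript === null) {
    // Failure path: the agent only sees the raw .ogg, which matches the reported symptom.
    return msg;
  }

  // Step 2: echo the transcript into chat when echoTranscript is enabled.
  if (config.echoTranscript) {
    echo(transcript);
  }

  // Step 3: replace the message body with the transcript.
  return { ...msg, body: transcript };
}
```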

Actual behavior

The agent receives the raw .ogg attachment directly.

Observed symptoms from live tests:

- no transcript echo appears in chat
- no upstream transcript seems to be injected before the agent turn
- after gateway restart, behavior is unchanged

We tested multiple short Telegram voice notes in direct chat after restart, with the same result.

Repro steps

1. Use OpenClaw v2026.4.8 with Telegram enabled.
2. Configure only:
   - tools.media.audio.echoTranscript = true
3. Do not set explicit tools.media.audio.models.
4. Send a short Telegram voice note to the bot.
5. Observe that the agent receives raw .ogg and no transcript echo is shown.

Additional context

This setup previously worked in practice on this same deployment a few days earlier, with built-in upstream transcription and visible transcript echo.
#62205 looks related but may not be the same root cause, because this config does not use plugins.allow.
Older issues like #17101 / #33784 seem adjacent but not identical.

Question

Is there a known v2026.4.8 regression or an undocumented condition where auto-detected inbound audio transcription is skipped for Telegram voice notes, causing the raw file to reach the agent without transcript echo?

extent analysis

TL;DR

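Per PR #63472, the failure comes from auto-selection forwarding the active chat model id to the transcription endpoint, so explicitly setting tools.media.audio.enabled to true is unlikely to help on its own. Until the fix ships, an explicit tools.media.audio.models entry that names a speech-to-text model may avoid the faulty auto-selection path, since the PR states that explicit entries are unchanged.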

Guidance

Verify that the tools.media.audio module is properly loaded and initialized by checking the startup logs of the OpenClaw gateway for any related errors or warnings.
Check if there are any updates or patches available for OpenClaw v2026.4.8 that may address issues with audio transcription, especially for Telegram voice notes.
Consider setting tools.media.audio.models explicitly to a speech-to-text model (for OpenAI, the PR names gpt-4o-transcribe as the provider default), because auto-selection in v2026.4.8 can pass the chat model instead.
Test the transcription feature with a different audio source or channel to isolate if the issue is specific to Telegram voice notes.

Example

The issue contains no code references, so the only concrete adjustment that can be shown is explicitly enabling audio handling. This is untested and does not address the model-selection root cause described in PR #63472:

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "echoTranscript": true
      }
    }
  }
}

Notes

The reporter's config follows the documented auto-detect path, which points to a bug or regression in OpenClaw v2026.4.8 rather than a misconfiguration. PR #63472 attributes it to the chat model id being sent to the transcription endpoint. The suggestions above are workarounds and may not fully resolve the issue until that fix is released.

Recommendation

Until PR #63472 is merged and released, consider specifying an explicit speech-to-text model under tools.media.audio.models, which the PR describes as unaffected by the bug. Explicitly enabling tools.media.audio is unlikely, on its own, to change which model auto-selection sends to the transcription endpoint.
