openclaw - ✅ (Solved) Fix v2026.4.8: Telegram voice notes reach agent as raw .ogg, no transcript echo [1 pull request, 3 comments, 3 participants]

GitHub stats
openclaw/openclaw#63349 (fetched 2026-04-09 07:54:59)
Comments: 3
Participants: 3
Timeline: 7
Reactions: 0
Timeline (top): commented ×3, referenced ×2, cross-referenced ×1, subscribed ×1

On OpenClaw v2026.4.8, Telegram voice notes are reaching the agent as raw .ogg attachments instead of being transcribed upstream, even though tools.media.audio.echoTranscript is enabled and the gateway was restarted.

Root Cause

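PR #63472 traces the failure to auto audio transcription sending the active chat model id (e.g. gpt-5.4) to the /audio/transcriptions endpoint, which expects a speech-to-text model. As a result, the expected flow for Telegram voice notes did not happen: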

  1. OpenClaw transcribes audio upstream via normal inbound audio handling.
  2. Because tools.media.audio.echoTranscript=true, the transcript is echoed into chat before agent processing.
  3. The agent sees transcript/body replacement rather than only a raw .ogg attachment.

Fix Action

Fixed by PR #63472.

PR fix notes

PR #63472: fix(media): use default STT model for auto audio transcription (#63349)

Description (problem / solution / changelog)

Summary

  • Problem: Auto-selected inbound audio transcription reused the active chat model id (e.g. gpt-5.4) as the /audio/transcriptions model. That endpoint expects speech-to-text models, so transcription could fail and voice notes surfaced as raw attachments with no transcript echo (tools.media.audio.echoTranscript).
  • Why it matters: Telegram and other channels rely on successful STT before the agent turn; failed STT matches reports in #63349.
  • What changed: resolveActiveModelEntry no longer passes the chat model for the audio capability; runProviderEntry continues to resolve the provider default (e.g. OpenAI gpt-4o-transcribe) when no explicit tools.media.audio.models entry applies.
  • What did NOT change: Explicit tools.media.audio.models, image/video active-model behavior, and non-auto entry paths are unchanged.
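The selection rule described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in runner.ts: the names resolveAutoModel, ModelRef and DEFAULT_STT_MODELS are hypothetical, and the shape of explicit audio model entries is assumed. Only the OpenAI default (gpt-4o-transcribe) is taken from the PR notes.

```typescript
// Hypothetical types and names for illustration; not OpenClaw's actual API.
type Capability = "audio" | "image" | "video";

interface ModelRef {
  provider: string;
  model: string;
}

// Assumed lookup of provider default speech-to-text models.
// The OpenAI value is the one named in the PR notes.
const DEFAULT_STT_MODELS: Record<string, string> = {
  openai: "gpt-4o-transcribe",
};

function resolveAutoModel(
  capability: Capability,
  activeModel: ModelRef | undefined,
  explicitAudioModels: ModelRef[] = [],
): ModelRef | undefined {
  // Explicit audio entries are used as configured.
  if (capability === "audio" && explicitAudioModels.length > 0) {
    return explicitAudioModels[0];
  }
  if (!activeModel) return undefined;

  if (capability === "audio") {
    // Keep the active provider, but never forward the chat model id:
    // fall back to that provider's default STT model instead.
    const stt = DEFAULT_STT_MODELS[activeModel.provider];
    return stt ? { provider: activeModel.provider, model: stt } : undefined;
  }

  // Image/video keep following the active model in this sketch.
  return activeModel;
}
```

With an active chat model of { provider: "openai", model: "gpt-5.4" }, the audio capability resolves to gpt-4o-transcribe in this sketch, while image and video still receive the chat model.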

Change Type

  • Bug fix

Scope

  • Gateway / orchestration

Linked Issue/PR

  • Closes #63349
  • This PR fixes a bug or regression

Root Cause

  • Root cause: Active provider selection for auto audio correctly picked the same provider as chat but incorrectly forwarded the conversation model id into the transcription API.
  • Missing detection / guardrail: Unit coverage now asserts auto audio does not pass a non-STT chat model when activeModel is set.

Regression Test Plan

  • Coverage level: Unit test
  • Target test: src/media-understanding/runner.auto-audio.test.ts
  • Scenario: activeModel: { provider: "openai", model: "gpt-5.4" } → transcription request uses gpt-4o-transcribe.
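The scenario above could be checked along these lines. This is only a sketch of the idea, not the contents of runner.auto-audio.test.ts: runAutoAudio, ASSUMED_STT_DEFAULTS and the injected transcriber are hypothetical stand-ins, and a real test would presumably use the project's own test framework rather than a boolean helper.

```typescript
// Hypothetical stand-ins for illustration; not the project's real test code.
interface ActiveModel {
  provider: string;
  model: string;
}

// Assumed provider defaults; the OpenAI value is the one named in the PR notes.
const ASSUMED_STT_DEFAULTS: Record<string, string> = {
  openai: "gpt-4o-transcribe",
};

// Runs auto audio transcription with an injected transcriber so the
// requested model id can be observed without any network call.
function runAutoAudio(
  active: ActiveModel,
  transcribe: (model: string) => string,
): string {
  const sttModel = ASSUMED_STT_DEFAULTS[active.provider];
  if (sttModel === undefined) {
    throw new Error(`no default STT model known for ${active.provider}`);
  }
  return transcribe(sttModel);
}

// Mirrors the scenario: the active chat model is gpt-5.4, and the
// transcription request must carry the STT model instead.
function autoAudioUsesSttModel(): boolean {
  const requestedModels: string[] = [];
  runAutoAudio({ provider: "openai", model: "gpt-5.4" }, (model) => {
    requestedModels.push(model);
    return "stub transcript";
  });
  return requestedModels.length === 1 && requestedModels[0] === "gpt-4o-transcribe";
}
```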

User-visible / Behavior Changes

  • Inbound voice/audio transcription when using auto model selection with the same provider as the active chat model now uses that provider’s default STT model instead of the chat model.

Security Impact

  • New permissions/capabilities? No
  • New network endpoints or data exfiltration risk? No

Changelog

  • Entry added under Unreleased → Fixes in CHANGELOG.md.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/media-understanding/runner.auto-audio.test.ts (modified, +30/-0)
  • src/media-understanding/runner.ts (modified, +67/-1)


Summary

On OpenClaw v2026.4.8, Telegram voice notes are reaching the agent as raw .ogg attachments instead of being transcribed upstream, even though tools.media.audio.echoTranscript is enabled and the gateway was restarted.

Environment

  • OpenClaw: 2026.4.8
  • Channel: Telegram direct chat
  • Host OS: Linux 6.8.0-107-generic x64
  • Node: 22.22.1
  • Gateway: local loopback, systemd service running

Relevant config

{
  "tools": {
    "media": {
      "audio": {
        "echoTranscript": true
      }
    }
  },
  "plugins": {
    "entries": {
      "telegram": { "enabled": true },
      "openai": { "enabled": true }
    }
  }
}

Notes:

  • no plugins.allow
  • no explicit tools.media.audio.models
  • no tools.media.audio.enabled: false

So this should be the normal auto-detect path described in the docs.
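To make the reasoning in the notes concrete, here is a rough sketch of how the three conditions map to a selection path. The config keys (enabled, echoTranscript, models) come from the issue; the function audioSelectionPath, the AudioPath labels and the decision order are assumptions for illustration, and the shape of models entries is not shown in the issue, so it is left as unknown.

```typescript
// Subset of tools.media.audio as referenced in the issue.
interface AudioConfig {
  enabled?: boolean;
  echoTranscript?: boolean;
  models?: unknown[]; // entry shape is not shown in the issue
}

// Hypothetical labels for the three situations the notes rule in or out.
type AudioPath = "disabled" | "explicit-models" | "auto";

function audioSelectionPath(audio: AudioConfig | undefined): AudioPath {
  // Audio handling is only off when explicitly disabled.
  if (audio?.enabled === false) return "disabled";
  // Explicit model entries take the non-auto path.
  if (audio?.models && audio.models.length > 0) return "explicit-models";
  // Everything else falls through to auto-detection.
  return "auto";
}

// The config reported in the issue sets only echoTranscript.
const reportedAudioConfig: AudioConfig = { echoTranscript: true };
const reportedPath = audioSelectionPath(reportedAudioConfig); // "auto"
```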

Expected behavior

For Telegram voice notes:

  1. OpenClaw transcribes audio upstream via normal inbound audio handling.
  2. Because tools.media.audio.echoTranscript=true, the transcript is echoed into chat before agent processing.
  3. The agent sees transcript/body replacement rather than only a raw .ogg attachment.
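The three expected steps can be sketched as a small pre-processing function. This is a simplified, synchronous illustration under assumed names (prepareInboundAudio, Attachment, InboundMessage); OpenClaw's real inbound handling is not shown in the issue and is likely asynchronous.

```typescript
// Hypothetical message shapes for illustration only.
interface Attachment {
  fileName: string;
  mimeType: string;
}

interface InboundMessage {
  body: string;
  attachments: Attachment[];
}

function prepareInboundAudio(
  message: InboundMessage,
  transcribe: (audio: Attachment) => string | undefined,
  echoToChat: (text: string) => void,
  echoTranscript: boolean,
): InboundMessage {
  const audio = message.attachments.find((a) => a.mimeType.startsWith("audio/"));
  if (!audio) return message;

  // Step 1: transcribe upstream, before the agent turn.
  const transcript = transcribe(audio);

  // If transcription fails, the message passes through untouched.
  // This is the reported symptom: raw .ogg, no echo.
  if (transcript === undefined) return message;

  // Step 2: echo the transcript into chat when enabled.
  if (echoTranscript) echoToChat(transcript);

  // Step 3: the agent sees the transcript as the message body.
  return { ...message, body: transcript };
}
```

In this sketch a failed transcription returns the original message unchanged, which matches the actual behavior described in the issue: the agent receives only the attachment and nothing is echoed.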

Actual behavior

The agent receives the raw .ogg attachment directly.

Observed symptoms from live tests:

  • no transcript echo appears in chat
  • no upstream transcript seems to be injected before the agent turn
  • after gateway restart, behavior is unchanged

We tested multiple short Telegram voice notes in direct chat after restart, with the same result.

Repro steps

  1. Use OpenClaw v2026.4.8 with Telegram enabled.
  2. Configure only:
    • tools.media.audio.echoTranscript = true
  3. Do not set explicit tools.media.audio.models.
  4. Send a short Telegram voice note to the bot.
  5. Observe that the agent receives raw .ogg and no transcript echo is shown.

Additional context

  • This setup previously worked in practice on this same deployment a few days earlier, with built-in upstream transcription and visible transcript echo.
  • #62205 looks related but may not be the same root cause, because this config does not use plugins.allow.
  • Older issues like #17101 / #33784 seem adjacent but not identical.

Question

Is there a known v2026.4.8 regression or an undocumented condition where auto-detected inbound audio transcription is skipped for Telegram voice notes, causing the raw file to reach the agent without transcript echo?

extent analysis

TL;DR

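Explicitly setting tools.media.audio.enabled to true is a low-cost thing to try, but the reported config already takes the auto-detect path, and PR #63472 traces the failure to the chat model id being sent to the transcription endpoint. Running a build that includes that PR is the more reliable fix.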

Guidance

  • Verify that the tools.media.audio module is properly loaded and initialized by checking the startup logs of the OpenClaw gateway for any related errors or warnings.
  • Check if there are any updates or patches available for OpenClaw v2026.4.8 that may address issues with audio transcription, especially for Telegram voice notes.
  • Consider adding an explicit speech-to-text model entry under tools.media.audio.models; per the PR notes, explicit entries appear to bypass the auto-selection path that carried the bug.
  • Test the transcription feature with a different audio source or channel to isolate if the issue is specific to Telegram voice notes.

Example

The issue contains no direct code references, so no code example applies. A configuration adjustment might look like this:

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "echoTranscript": true
      }
    }
  }
}

Notes

The provided information suggests a potential regression or undocumented condition in OpenClaw v2026.4.8, but without further details or logs, it's challenging to pinpoint the exact cause. The suggestions provided are based on the information given and may not fully resolve the issue.

Recommendation

Upgrade to a build that includes PR #63472. If that is not yet possible, explicitly enabling tools.media.audio and specifying a speech-to-text model under tools.media.audio.models might bypass the auto-selection bug in OpenClaw v2026.4.8. The reported config itself was not at fault, so treat this as a workaround rather than a correction.
