openclaw - 💡(How to fix) Fix Media preflight: carry transcript state into downstream media-understanding to avoid duplicate audio STT [1 participants]

GitHub stats
openclaw/openclaw#70580 · Fetched 2026-04-24 05:56:07
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Summary

Channel-level audio preflight helpers can transcribe an inbound voice note before the normal reply pipeline runs, but that transcript state is not carried into downstream applyMediaUnderstanding / attachment selection. As a result, the same inbound audio can be transcribed again on the normal media-understanding path.

Why this matters

  • duplicates STT cost and latency for one inbound voice note
  • can produce different transcript text between command parsing / preflight body rewrite and the later prompt media pass
  • affects multiple channels that use the same preflight pattern, not just WhatsApp

Current structural limitation

Today the preflight call uses a temporary context or temporary attachment state, while the downstream media-understanding path recreates fresh attachments from string fields on MsgContext.

That means attachment-local flags such as alreadyTranscribed do not survive into the later pass.
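
The loss of state described above can be reproduced in a small sketch. Only MsgContext and alreadyTranscribed are names from the issue. The field names (mediaPath, mediaType), the helper functions, and the synchronous STT stand-in are assumptions made for illustration and may not match the real SDK.

```typescript
// Hypothetical shapes: the issue does not list the fields of MsgContext.
interface MsgContext {
  body: string;
  mediaPath?: string; // assumed string field describing the inbound media
  mediaType?: string; // assumed MIME type string
}

interface Attachment {
  path: string;
  mime: string;
  alreadyTranscribed?: boolean; // attachment-local flag named in the issue
}

type Stt = (path: string) => string; // synchronous stand-in for a real STT call

// Attachments are rebuilt from string fields, so no flag can be inherited.
function buildAttachments(ctx: MsgContext): Attachment[] {
  if (!ctx.mediaPath) return [];
  return [{ path: ctx.mediaPath, mime: ctx.mediaType ?? "application/octet-stream" }];
}

// Channel preflight: transcribes on temporary attachments and rewrites the body.
function preflight(ctx: MsgContext, stt: Stt): MsgContext {
  const temp = buildAttachments(ctx);
  let body = ctx.body;
  for (const att of temp) {
    if (att.mime.startsWith("audio/")) {
      body = stt(att.path);
      att.alreadyTranscribed = true; // set on a temporary object only
    }
  }
  return { ...ctx, body };
}

// Downstream pass: fresh attachments, so the flag is never set here.
function mediaUnderstanding(ctx: MsgContext, stt: Stt): string[] {
  return buildAttachments(ctx)
    .filter((att) => att.mime.startsWith("audio/") && !att.alreadyTranscribed)
    .map((att) => stt(att.path));
}

let sttCalls = 0;
const fakeStt: Stt = (path) => {
  sttCalls += 1;
  return `transcript of ${path}`;
};

const inbound: MsgContext = { body: "", mediaPath: "/tmp/voice.ogg", mediaType: "audio/ogg" };
const afterPreflight = preflight(inbound, fakeStt);
const secondPass = mediaUnderstanding(afterPreflight, fakeStt);
// sttCalls is now 2: the same voice note was transcribed twice.
```

The flag is written to an object that is discarded when preflight returns, which is why the filter in the downstream pass never sees it.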

Scope

This appears to be shared across channels that do preflight audio transcription before reply dispatch, including:

  • WhatsApp
  • Telegram
  • Discord

Possible fixes

Any of the following would give channels a real way to express "audio already transcribed":

  1. let applyMediaUnderstanding accept pre-supplied attachment state
  2. persist an explicit already-transcribed marker on MsgContext and honor it during attachment normalization / selection
  3. let downstream media-understanding skip audio STT when a trusted preflight transcript is already present on context
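
As a rough illustration of option 1, the sketch below gives applyMediaUnderstanding an optional parameter that carries attachment state from preflight. The options object, the Attachment shape, the preflightTranscribe helper, and the synchronous STT stand-in are all hypothetical. The real signature of applyMediaUnderstanding is not given in the issue.

```typescript
// Hypothetical types: only applyMediaUnderstanding, MsgContext and
// alreadyTranscribed are names from the issue; the rest is assumed.
interface MsgContext {
  body: string;
  mediaPath?: string;
  mediaType?: string;
}

interface Attachment {
  path: string;
  mime: string;
  alreadyTranscribed?: boolean;
  transcript?: string;
}

type Stt = (path: string) => string; // synchronous stand-in for a real STT call

interface MediaUnderstandingOptions {
  attachments?: Attachment[]; // assumed new parameter: state supplied by preflight
}

function buildAttachments(ctx: MsgContext): Attachment[] {
  if (!ctx.mediaPath) return [];
  return [{ path: ctx.mediaPath, mime: ctx.mediaType ?? "application/octet-stream" }];
}

// Transcribes any audio attachment that is not already marked as transcribed.
function transcribeMissing(attachments: Attachment[], stt: Stt): Attachment[] {
  return attachments.map((att) => {
    const needsStt = att.mime.startsWith("audio/") && !att.alreadyTranscribed;
    return needsStt ? { ...att, transcript: stt(att.path), alreadyTranscribed: true } : att;
  });
}

// Channel preflight: returns the attachment state instead of discarding it.
function preflightTranscribe(ctx: MsgContext, stt: Stt): Attachment[] {
  return transcribeMissing(buildAttachments(ctx), stt);
}

// Downstream pass: prefers pre-supplied state over rebuilding from strings.
function applyMediaUnderstanding(
  ctx: MsgContext,
  stt: Stt,
  options: MediaUnderstandingOptions = {},
): Attachment[] {
  return transcribeMissing(options.attachments ?? buildAttachments(ctx), stt);
}

let sttCalls = 0;
const fakeStt: Stt = (path) => {
  sttCalls += 1;
  return `transcript of ${path}`;
};

const inbound: MsgContext = { body: "", mediaPath: "/tmp/voice.ogg", mediaType: "audio/ogg" };
const preflightState = preflightTranscribe(inbound, fakeStt);
const downstream = applyMediaUnderstanding(inbound, fakeStt, { attachments: preflightState });
// sttCalls stays at 1 because the downstream pass reuses the preflight state.
```

Callers that pass no options keep the current behavior, so a change of this shape could be adopted one channel at a time.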

Notes

This issue is intentionally about the shared SDK / media-understanding seam, not about any one channel-specific preflight implementation.

Extent analysis

TL;DR

To avoid transcribing the same voice note twice, either let applyMediaUnderstanding accept pre-supplied attachment state or persist an explicit "already-transcribed" marker on MsgContext and honor it downstream.

Guidance

  • Investigate letting applyMediaUnderstanding accept pre-supplied attachment state, so that channels can pass the transcript state produced by their preflight helpers.
  • Consider adding an explicit "already-transcribed" marker on MsgContext that records whether the inbound audio has already been transcribed, and honor this marker during attachment normalization and selection.
  • Evaluate whether downstream media-understanding can skip audio STT when a trusted preflight transcript is already present on context.
  • Review the preflight transcription path and the media-understanding path to find each point where attachments are rebuilt from string fields on MsgContext, since that is where attachment-local flags are lost.
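
The second and third bullets can be sketched together. The field names audioAlreadyTranscribed and preflightTranscript, the source allowlist used to decide what counts as "trusted", and the synchronous STT stand-in are all assumptions. The issue does not define any of them.

```typescript
// Hypothetical fields on MsgContext; the issue only says a marker or a
// trusted transcript could live on the context.
interface PreflightTranscript {
  text: string;
  source: string; // assumed: identifies which component produced the text
}

interface MsgContext {
  body: string;
  mediaPath?: string;
  mediaType?: string;
  audioAlreadyTranscribed?: boolean; // assumed explicit marker
  preflightTranscript?: PreflightTranscript; // assumed transcript carried on context
}

interface Attachment {
  path: string;
  mime: string;
  alreadyTranscribed: boolean;
}

type Stt = (path: string) => string; // synchronous stand-in for a real STT call

// Assumed trust rule: only transcripts from known preflight helpers are reused.
const TRUSTED_SOURCES = new Set(["channel-preflight"]);

// Normalization honors the context marker when it rebuilds attachments.
function normalizeAttachments(ctx: MsgContext): Attachment[] {
  if (!ctx.mediaPath) return [];
  const mime = ctx.mediaType ?? "application/octet-stream";
  const isAudio = mime.startsWith("audio/");
  return [{ path: ctx.mediaPath, mime, alreadyTranscribed: isAudio && ctx.audioAlreadyTranscribed === true }];
}

function trustedTranscript(ctx: MsgContext): string | undefined {
  const t = ctx.preflightTranscript;
  if (t && t.text.trim() !== "" && TRUSTED_SOURCES.has(t.source)) return t.text;
  return undefined;
}

// Reuses a trusted transcript if present, otherwise transcribes unmarked audio.
function transcribeAudio(ctx: MsgContext, stt: Stt): string[] {
  const reused = trustedTranscript(ctx);
  if (reused !== undefined) return [reused];
  return normalizeAttachments(ctx)
    .filter((att) => att.mime.startsWith("audio/") && !att.alreadyTranscribed)
    .map((att) => stt(att.path));
}

let sttCalls = 0;
const fakeStt: Stt = (path) => {
  sttCalls += 1;
  return `transcript of ${path}`;
};

const voice: MsgContext = { body: "", mediaPath: "/tmp/voice.ogg", mediaType: "audio/ogg" };
const marked = transcribeAudio({ ...voice, audioAlreadyTranscribed: true }, fakeStt);
const reusedText = transcribeAudio(
  { ...voice, preflightTranscript: { text: "hello", source: "channel-preflight" } },
  fakeStt,
);
const fresh = transcribeAudio(voice, fakeStt);
// Only the last call reaches STT, so sttCalls ends at 1.
```

Reusing the preflight text also addresses the second concern in the issue, since command parsing and the later prompt pass would then see the same transcript.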

Example

The issue itself contains no code and does not give implementation details such as the fields of MsgContext or the signature of applyMediaUnderstanding.

Notes

The proposed fixes change the shared SDK / media-understanding seam. Each channel that runs a preflight transcription (WhatsApp, Telegram, Discord) would then need to pass its transcript state through that seam, and the details may differ per channel.

Recommendation

Start with option 1: let applyMediaUnderstanding accept pre-supplied attachment state. Of the three options it appears to be the most direct, because attachment-local flags such as alreadyTranscribed would then survive from preflight into the later pass without significant changes to the existing architecture.
