openclaw - 💡(How to fix) Fix Media preflight: carry transcript state into downstream media-understanding to avoid duplicate audio STT [1 participants]

openclaw • 2026-04-23 11:12:13

openclaw/openclaw#70580•Fetched 2026-04-24 05:56:07

Author

rogerdigital

Participants

rogerdigital

Channel-level audio preflight helpers can transcribe an inbound voice note before the normal reply pipeline runs, but that transcript state is not carried into downstream applyMediaUnderstanding / attachment selection. As a result, the same inbound audio can be transcribed again on the normal media-understanding path.

Summary

Why this matters

duplicates STT cost and latency for one inbound voice note
can produce different transcript text between command parsing / preflight body rewrite and the later prompt media pass
affects multiple channels that use the same preflight pattern, not just WhatsApp

Current structural limitation

Today the preflight call uses a temporary context or temporary attachment state, while the downstream media-understanding path recreates fresh attachments from string fields on MsgContext.

That means attachment-local flags such as alreadyTranscribed do not survive into the later pass.
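The limitation can be reproduced in miniature. The sketch below is an assumption-laden illustration, not the project's code: only `MsgContext` and `alreadyTranscribed` are names from the issue, while the field names (`mediaPath`, `mediaType`), the `Attachment` shape, and every function are hypothetical. STT is modelled as a synchronous callback to keep the example short.

```typescript
// Hypothetical shapes for illustration; the real types may differ.
interface MsgContext {
  body: string;
  mediaPath?: string; // assumed string field describing the inbound media
  mediaType?: string; // assumed MIME type field
}

interface Attachment {
  path: string;
  mime: string;
  alreadyTranscribed?: boolean; // attachment-local flag named in the issue
}

// Stand-in for an STT provider call (synchronous here for brevity).
type Transcriber = (path: string) => string;

// Builds fresh attachment objects from the string fields on the context.
function attachmentsFromContext(ctx: MsgContext): Attachment[] {
  if (!ctx.mediaPath) return [];
  return [{ path: ctx.mediaPath, mime: ctx.mediaType ?? "application/octet-stream" }];
}

// Channel preflight: transcribes audio and marks its own temporary attachment.
function preflightTranscribe(ctx: MsgContext, stt: Transcriber): string | undefined {
  const [audio] = attachmentsFromContext(ctx).filter((a) => a.mime.startsWith("audio/"));
  if (!audio) return undefined;
  const text = stt(audio.path);
  audio.alreadyTranscribed = true; // lost: this object is discarded after preflight
  return text;
}

// Downstream pass: rebuilds attachments, so the flag is never set on them.
function downstreamMediaPass(ctx: MsgContext, stt: Transcriber): string[] {
  return attachmentsFromContext(ctx)
    .filter((a) => a.mime.startsWith("audio/") && !a.alreadyTranscribed)
    .map((a) => stt(a.path));
}

let sttCalls = 0;
const countingStt: Transcriber = (path) => {
  sttCalls += 1;
  return `transcript of ${path}`;
};

const ctx: MsgContext = { body: "", mediaPath: "/tmp/voice.ogg", mediaType: "audio/ogg" };
preflightTranscribe(ctx, countingStt);
downstreamMediaPass(ctx, countingStt);
console.log(sttCalls); // 2: the same voice note was transcribed twice
```

Because the flag lives on an object that only the preflight call ever sees, the downstream filter has nothing to check against.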

Scope

This appears to be shared across channels that do preflight audio transcription before reply dispatch, including:

WhatsApp
Telegram
Discord

Possible fixes

Any of the following would give channels a real way to express "audio already transcribed":

let applyMediaUnderstanding accept pre-supplied attachment state
persist an explicit already-transcribed marker on MsgContext and honor it during attachment normalization / selection
let downstream media-understanding skip audio STT when a trusted preflight transcript is already present on context
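As one possible reading of the second option, the marker could live on the context itself so that it survives attachment recreation. This is a minimal sketch under stated assumptions: `transcribedMediaPaths`, `normalizeAttachments`, `selectAudioForStt`, and `markTranscribed` are hypothetical names invented here, and the real `MsgContext` fields may look different.

```typescript
// Hypothetical context shape; only the name MsgContext comes from the issue.
interface MsgContext {
  body: string;
  mediaPath?: string;
  mediaType?: string;
  // Hypothetical marker written by preflight: media paths it already transcribed.
  transcribedMediaPaths?: string[];
}

interface Attachment {
  path: string;
  mime: string;
  alreadyTranscribed: boolean;
}

// Normalization reads the context-level marker when rebuilding attachments.
function normalizeAttachments(ctx: MsgContext): Attachment[] {
  if (!ctx.mediaPath) return [];
  const done = new Set(ctx.transcribedMediaPaths ?? []);
  return [
    {
      path: ctx.mediaPath,
      mime: ctx.mediaType ?? "application/octet-stream",
      alreadyTranscribed: done.has(ctx.mediaPath),
    },
  ];
}

// Selection honors the flag, so marked audio is not sent to STT again.
function selectAudioForStt(attachments: Attachment[]): Attachment[] {
  return attachments.filter((a) => a.mime.startsWith("audio/") && !a.alreadyTranscribed);
}

// Preflight records what it transcribed; returns a new context rather than mutating.
function markTranscribed(ctx: MsgContext, path: string): MsgContext {
  return { ...ctx, transcribedMediaPaths: [...(ctx.transcribedMediaPaths ?? []), path] };
}

const inbound: MsgContext = { body: "", mediaPath: "/tmp/voice.ogg", mediaType: "audio/ogg" };
const afterPreflight = markTranscribed(inbound, "/tmp/voice.ogg");

console.log(selectAudioForStt(normalizeAttachments(inbound)).length); // 1
console.log(selectAudioForStt(normalizeAttachments(afterPreflight)).length); // 0
```

Keying the marker by media path is a design choice of this sketch; an attachment index or content hash could serve the same purpose.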

Notes

This issue is intentionally about the shared SDK / media-understanding seam, not about any one channel-specific preflight implementation.

extent analysis

TL;DR

To fix the duplicate transcription issue, modify the applyMediaUnderstanding function to accept pre-supplied attachment state or persist an explicit "already-transcribed" marker on MsgContext.

Guidance

Investigate modifying the applyMediaUnderstanding function to accept pre-supplied attachment state, allowing channels to pass the transcript state from preflight helpers.
Consider adding an explicit "already-transcribed" marker on MsgContext to track whether an audio has been transcribed, and honor this marker during attachment normalization and selection.
Evaluate the feasibility of skipping audio STT in downstream media-understanding when a trusted preflight transcript is already present on context.
Review the current implementation of preflight audio transcription and media-understanding paths to identify potential areas for optimization.

Example

No code example is provided due to the lack of specific implementation details in the issue.
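Purely as an illustration of the first option in the guidance, a pre-supplied attachment list might be threaded through as follows. The issue does not show the real signature of `applyMediaUnderstanding`, so the options object, the `transcript` field, the `rebuildAttachments` helper, and the injected `transcribe` callback are all assumptions made for this sketch.

```typescript
// Hypothetical shapes; the real applyMediaUnderstanding signature is not shown in the issue.
interface MsgContext {
  body: string;
  mediaPath?: string;
  mediaType?: string;
}

interface Attachment {
  path: string;
  mime: string;
  alreadyTranscribed?: boolean;
  transcript?: string; // assumed: preflight keeps its text alongside the flag
}

type Transcriber = (path: string) => string;

interface MediaUnderstandingOptions {
  attachments?: Attachment[]; // assumed: state handed over by channel preflight
  transcribe: Transcriber; // injected STT stand-in (synchronous for brevity)
}

// Fallback used when no state was supplied: rebuild from context string fields.
function rebuildAttachments(ctx: MsgContext): Attachment[] {
  if (!ctx.mediaPath) return [];
  return [{ path: ctx.mediaPath, mime: ctx.mediaType ?? "application/octet-stream" }];
}

// Returns one transcript per audio attachment, reusing preflight text when present.
function applyMediaUnderstanding(ctx: MsgContext, opts: MediaUnderstandingOptions): string[] {
  const attachments = opts.attachments ?? rebuildAttachments(ctx);
  return attachments
    .filter((a) => a.mime.startsWith("audio/"))
    .map((a) =>
      a.alreadyTranscribed && a.transcript !== undefined ? a.transcript : opts.transcribe(a.path),
    );
}

let calls = 0;
const stt: Transcriber = (path) => {
  calls += 1;
  return `stt:${path}`;
};

const message: MsgContext = { body: "", mediaPath: "/tmp/voice.ogg", mediaType: "audio/ogg" };
const fromPreflight: Attachment[] = [
  { path: "/tmp/voice.ogg", mime: "audio/ogg", alreadyTranscribed: true, transcript: "hello" },
];

const reused = applyMediaUnderstanding(message, { attachments: fromPreflight, transcribe: stt });
console.log(reused, calls); // [ 'hello' ] 0
```

Reusing the preflight text also keeps command parsing and the later prompt media pass on the same transcript, which addresses the divergence concern raised under "Why this matters".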

Notes

The proposed fixes require modifications to the shared SDK/media-understanding seam, and their implementation may vary depending on the specific requirements and constraints of each channel.

Recommendation

Prefer letting applyMediaUnderstanding accept pre-supplied attachment state. Of the three options, this seems the most straightforward: channels pass the preflight result to the downstream call directly, and the change can likely be made without significant changes to the existing architecture.
