openclaw - ✅(Solved) Fix [Feature]: Voice messages to agent don't work on Matrix [1 pull request, 1 comment, 2 participants]

frankdierolf · 2026-05-05T18:54:21Z

GitHub stats

openclaw/openclaw#78016 • Fetched 2026-05-06 06:17:54

Author: frankdierolf

Participants: clawsweeper[bot], frankdierolf

Timeline (top): mentioned ×2, subscribed ×2, commented ×1, cross-referenced ×1

Fix Action

PR fix notes

PR #78069: feat(matrix): transcribe inbound voice notes before mention gate

Repository: openclaw/openclaw
Author: frankdierolf
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/78069

Description (problem / solution / changelog)

Summary

Problem: Inbound voice messages on Matrix reach the agent as raw audio attachments with no transcript. The model improvises a reply instead of answering, and requireMention: true rooms drop voice notes entirely because there's no text to match the mention regex.
Why it matters: Discord, Telegram, WhatsApp, and Feishu already transcribe inbound voice via transcribeFirstAudio before the mention gate. Matrix is the lone holdout — voice-only Matrix users have no way to talk to their agent.
What changed: Wire the existing transcribeFirstAudio helper into the Matrix monitor handler before the mention gate, mirroring the Discord/Telegram pattern. The transcript feeds into the mention check (so a voice note that says the bot's name reaches the agent in requireMention rooms) and into BodyForAgent (so the agent reads the transcript instead of a placeholder). MediaTranscribedIndexes is set so downstream tools don't re-transcribe the same audio.
What did NOT change (scope boundary): No core media-understanding changes. No new config keys (operators control via the existing global tools.media.audio.enabled). Outbound TTS untouched. E2EE crypto path untouched (decryption stays inside downloadMatrixMedia; preflight receives the plaintext path). No changes to other channels.
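
A minimal sketch of why the transcript has to exist before the mention gate runs, assuming the gate is a regex match over the available text. Only `requireMention` and `mentionRegexes` come from the PR; `GateInput` and `passesMentionGate` are hypothetical names used for illustration, not the handler's actual API.

```typescript
// Hypothetical illustration; the real gate lives in the Matrix monitor handler.
interface GateInput {
  requireMention: boolean;
  mentionRegexes: RegExp[];
  bodyText: string;    // event body, often just a filename for voice notes
  transcript?: string; // present only when audio preflight produced text
}

function passesMentionGate(input: GateInput): boolean {
  if (!input.requireMention) return true;
  const candidates = [input.bodyText, input.transcript ?? ""];
  // String.prototype.search ignores lastIndex, so regexes with the g flag are safe here.
  return input.mentionRegexes.some((re) =>
    candidates.some((text) => text.search(re) !== -1),
  );
}
```

Without a transcript the only candidate text is the body (for example `voice.ogg`), so the regex cannot match and the voice note is dropped.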

Change Type (select all)

Feature

Scope (select all touched areas)

Integrations

Linked Issue/PR

Closes #78016
This PR fixes a bug or regression: No (not selected).

Root Cause (if applicable)

N/A — feature parity request, not a regression.

Regression Test Plan (if applicable)

N/A — feature parity. Test coverage:

15 unit tests in extensions/matrix/src/matrix/monitor/preflight-audio.test.ts covering the audio-detection predicate, transcript formatter, and runtime caller (happy paths, error swallowing, abort-signal short-circuit, MediaPaths/MediaTypes ctx shape).
9 integration tests in extensions/matrix/src/matrix/monitor/handler.audio-preflight.test.ts covering: DM voice notes, m.file with audio mime, mention-gate bypass via transcript, mention-gate drop without transcript match, transcription failure fallback, non-audio bypass, single-download verification, encrypted (E2EE) audio, and size-limit handling.
Existing handler.test.ts, handler.media-failure.test.ts, and the rest of the matrix monitor suite: 396/396 passing, no regressions.

User-visible / Behavior Changes

Inbound voice notes (m.audio, plus m.file carrying an audio/* mimetype) on Matrix now get transcribed before the mention gate. A voice note that mentions the bot by name (per the existing mentionRegexes) bypasses requireMention: true rooms, matching Discord/Telegram.
The transcript is wrapped with [Audio transcript (machine-generated, untrusted)]: … framing (with JSON.stringify escaping) so prompt-injection content inside the audio cannot impersonate system instructions.
Bare-filename audio bodies (e.g. voice.ogg, auto-set by Element) are replaced with the existing [matrix audio attachment] placeholder so the agent sees a clear audio marker rather than a stray filename. The download-failed path was already doing this; we extend it to the success path for audio specifically.
Operators can disable globally via tools.media.audio.enabled: false.
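
The detection rule and the transcript framing described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `preflight-audio.ts`: `MatrixContentLike`, `isAudioContent`, and `formatTranscript` are hypothetical names, and the content shape is reduced to the fields the rule needs.

```typescript
// Assumed, simplified shape of the Matrix event content fields involved.
interface MatrixContentLike {
  msgtype?: string;
  info?: { mimetype?: string };
}

// True for m.audio, or for m.file whose mimetype starts with "audio/".
function isAudioContent(content: MatrixContentLike): boolean {
  if (content.msgtype === "m.audio") return true;
  const mime = content.info?.mimetype?.toLowerCase() ?? "";
  return content.msgtype === "m.file" && mime.startsWith("audio/");
}

// Wraps the transcript in the untrusted framing quoted in the PR.
// JSON.stringify escapes quotes and control characters, so spoken text
// cannot close the quoted string and pose as an instruction.
function formatTranscript(transcript: string): string {
  return `[Audio transcript (machine-generated, untrusted)]: ${JSON.stringify(transcript)}`;
}
```

For example, a transcript containing a quote and a newline stays inside one quoted string, with the quote and newline escaped.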

Diagram (if applicable)

Before:
[user voice note] -> [Matrix handler]
  -> mention gate (drops if requireMention:true & no @mention text)
  -> media download
  -> agent ctx { MediaPath, MediaUrl, BodyForAgent: "[matrix media]" }
  -> agent improvises a reply about "an audio attachment"

After:
[user voice note] -> [Matrix handler]
  -> audio detect + early download + transcribeFirstAudio
  -> mention gate (sees transcript text, can match @bot mentions)
  -> agent ctx { MediaPath, MediaUrl, BodyForAgent: "[Audio transcript ...]: \"...\"", MediaTranscribedIndexes: [0] }
  -> agent answers the spoken question
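
The "After" ordering can be sketched as a small function, assuming the download and transcription steps are injected. `PreflightDeps`, `AgentCtx`, and `handleVoiceNote` are hypothetical names; the real handler calls `downloadMatrixMedia` and `transcribeFirstAudio`, and the fallback body used when transcription fails is an assumption here.

```typescript
// Hypothetical dependencies standing in for the real download and STT calls.
interface PreflightDeps {
  download: () => Promise<string>; // resolves to a local plaintext path
  transcribe: (path: string) => Promise<string | undefined>;
  mentionsBot: (text: string) => boolean;
}

interface AgentCtx {
  MediaPath: string;
  BodyForAgent: string;
  MediaTranscribedIndexes?: number[];
}

async function handleVoiceNote(
  deps: PreflightDeps,
  requireMention: boolean,
): Promise<AgentCtx | undefined> {
  // 1. Early download, before the mention gate.
  const mediaPath = await deps.download();

  // 2. Transcribe; a failure is swallowed and treated as "no transcript".
  let transcript: string | undefined;
  try {
    transcript = await deps.transcribe(mediaPath);
  } catch {
    transcript = undefined;
  }

  // 3. Mention gate now sees the transcript text.
  if (requireMention && !(transcript && deps.mentionsBot(transcript))) return undefined;

  // 4. Build the agent context (fallback placeholder is an assumption).
  if (!transcript) return { MediaPath: mediaPath, BodyForAgent: "[matrix audio attachment]" };
  return {
    MediaPath: mediaPath,
    BodyForAgent: `[Audio transcript (machine-generated, untrusted)]: ${JSON.stringify(transcript)}`,
    MediaTranscribedIndexes: [0], // tells downstream tools index 0 is already transcribed
  };
}
```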

Security Impact (required)

New permissions/capabilities? No.
Secrets/tokens handling changed? No.
New/changed network calls? No new outbound endpoint — uses the operator's existing tools.media.audio.provider, the same one Discord/Telegram/WhatsApp/Feishu already call. No new API keys required.
Command/tool execution surface changed? No.
Data access scope changed? Yes (minor) — audio attachment bytes (decrypted plaintext for E2EE rooms) are now sent to the operator-configured STT provider on Matrix, matching peer-channel behavior. Mitigation: gated by tools.media.audio.enabled; documented in docs/channels/matrix.md.

Repro + Verification

Environment

OS: Linux (Debian)
Runtime/container: OpenClaw gateway in Docker
Model/provider: OpenAI gpt-4o-mini-transcribe (via tools.media.audio.provider: openai)
Integration/channel: Matrix (self-hosted Synapse, Element clients)
Relevant config (redacted): tools.media.audio.{enabled, provider: openai, model: gpt-4o-mini-transcribe}
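
For orientation, the dotted keys above would correspond to a nested structure along these lines. This is written as a TypeScript object purely for illustration; the nesting is inferred from the dotted path and the actual on-disk config format may differ.

```typescript
// Assumed nesting for tools.media.audio.*; values are the ones listed in the PR's environment.
const audioConfig = {
  tools: {
    media: {
      audio: {
        enabled: true, // set to false to disable inbound transcription globally
        provider: "openai",
        model: "gpt-4o-mini-transcribe",
      },
    },
  },
};
```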

Steps

Send a voice note from Element to a Matrix bot in a requireMention: true room, saying the bot's name in the recording.
Send a voice note in a DM to the bot, with no text caption.
Send a non-audio attachment (e.g. an image) to the same room to confirm unchanged behavior.

Expected

(1) Bot transcribes the voice note, sees the spoken bot mention in the transcript, and replies normally.
(2) Bot replies based on the transcribed body.
(3) Behavior unchanged from main.

Actual

(1) (2) (3) Asserted by integration tests in handler.audio-preflight.test.ts. Live end-to-end on a personal homeserver is deferred (see Human Verification below).

Evidence

Failing test/log before + passing after — the integration test file was authored TDD-style. Initial run showed 5 failed assertions on expect(transcribeFirstAudioMock).toHaveBeenCalledTimes(1); once the handler was wired, all 9 cases turned green.
Trace/log snippets — N/A; tests mock downloadMatrixMedia and transcribeFirstAudio following the Discord pattern.
Screenshot/recording — N/A.
Perf numbers — N/A.

Human Verification (required)

Verified scenarios: TDD-driven integration tests covering DM voice, room-mention-bypass via transcript, room-mention-drop without match, transcription-failure fallback, non-audio bypass, single-download verification, E2EE-encrypted audio, and size-limit handling. All 9 cases green. Full matrix monitor suite runs clean (396/396, including pre-existing handler.media-failure.test.ts and handler.body-for-agent.test.ts).
Edge cases checked: Encrypted media (file: { url, key, iv, hashes, v }) decryption + transcription path. m.file with audio mimetype. Bare-filename body normalization. Abort signal short-circuit before SDK load. Empty / undefined transcript fallthrough.
What I did NOT verify: Live end-to-end run against my own Synapse + Element clients on this branch. Reason: testing a fork build requires a custom Docker image; the framework-side path (transcribeFirstAudio) is exercised by the existing src/media-understanding test suite, and our handler integration is locked in by the new tests.
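
The encrypted-media edge case listed above hinges on where the media URL lives in the event content. A minimal sketch, assuming the `file: { url, key, iv, hashes, v }` shape quoted in the PR for encrypted events and a top-level `url` for unencrypted ones; `resolveMediaSource` and the interface names are hypothetical.

```typescript
// Field names follow the shape quoted in the PR; value types are simplified assumptions.
interface EncryptedFileLike {
  url: string;
  key: unknown;
  iv: string;
  hashes: Record<string, string>;
  v: string;
}

interface MediaContentLike {
  url?: string;             // set for unencrypted media
  file?: EncryptedFileLike; // set for E2EE media
}

// Picks the download URL and reports whether decryption is needed.
function resolveMediaSource(
  content: MediaContentLike,
): { url: string; encrypted: boolean } | undefined {
  if (content.file?.url) return { url: content.file.url, encrypted: true };
  if (content.url) return { url: content.url, encrypted: false };
  return undefined;
}
```

Per the PR, decryption itself stays inside `downloadMatrixMedia`, so the preflight step only ever receives a plaintext path.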

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes — additive only.
Config/env changes? No — reuses existing tools.media.audio configuration.
Migration needed? No.

Risks and Mitigations

Risk: Prompt injection via voice transcript ("ignore prior instructions and …").
- Mitigation: Transcript is wrapped with [Audio transcript (machine-generated, untrusted)]: ${JSON.stringify(transcript)} framing before reaching the agent body, mirroring Telegram exactly. JSON.stringify escapes control characters and quote chars.
Risk: STT cost amplification — every audio message in a watched room triggers an STT API call.
- Mitigation: Operator-controlled via tools.media.audio.enabled. Existing room/sender allowlist in the Matrix handler runs BEFORE the preflight code, so unauthorized senders never trigger transcription.
Risk: Reordering the audio download (now runs before mention gate) could shift failure semantics.
- Mitigation: The existing media block now reuses the preflight result via !media && !mediaDownloadFailed guards. Same exception types propagate through the same logger paths. Existing handler.media-failure.test.ts still passes unchanged.
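
The reuse guard mentioned in the last mitigation can be sketched like this, assuming the preflight step records either a downloaded file or a failure flag. `MediaState` and `ensureMedia` are hypothetical names, and swallowing the error into a flag is a simplification; the PR states that the real handler keeps the existing exception and logger paths.

```typescript
interface DownloadedMedia {
  path: string;
  contentType?: string;
}

interface MediaState {
  media?: DownloadedMedia;
  mediaDownloadFailed: boolean;
}

// Runs the download only when preflight has neither produced media nor
// recorded a failure, mirroring the `!media && !mediaDownloadFailed` guard.
async function ensureMedia(
  state: MediaState,
  download: () => Promise<DownloadedMedia>,
): Promise<MediaState> {
  if (state.media || state.mediaDownloadFailed) return state;
  try {
    return { media: await download(), mediaDownloadFailed: false };
  } catch {
    return { mediaDownloadFailed: true };
  }
}
```

Calling `ensureMedia` a second time with the returned state does not download again, which is the behavior the "single-download verification" test case is described as covering.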

Notes for reviewers

The audio download + error-handling block in handler.ts duplicates ~25 lines with the existing media block (different scope, slightly different encrypted: Boolean(...) flag). Kept as-is for minimal-diff. Happy to extract into a helper if preferred.
The early content-extraction block at handler.ts ~876-887 (earlyContentInfo etc.) duplicates the later block at ~1088-1096 because the audio path needs the info BEFORE the mention gate while non-audio doesn't. Same minimal-diff trade-off.
No disableAudioPreflight per-room knob (Telegram has one). Operators rely on global tools.media.audio.enabled: false. Easy to add if the team wants finer-grained control — happy to send a follow-up.

This PR was developed with AI — [AI-assisted].

Suggested reviewer: @gumadeiras (Matrix-area maintainer).

P.S. I haven't tested it live so far. Still need to. First let's send the PR and see from there.

Changed files

CHANGELOG.md (modified, +4/-0)
docs/channels/matrix.md (modified, +15/-0)
extensions/matrix/CHANGELOG.md (modified, +1/-0)
extensions/matrix/src/matrix/monitor/handler.audio-preflight.test.ts (added, +352/-0)
extensions/matrix/src/matrix/monitor/handler.ts (modified, +97/-6)
extensions/matrix/src/matrix/monitor/preflight-audio.runtime.ts (added, +9/-0)
extensions/matrix/src/matrix/monitor/preflight-audio.test.ts (added, +158/-0)
extensions/matrix/src/matrix/monitor/preflight-audio.ts (added, +72/-0)

RAW_BUFFER

Hola, Frank here — I encountered a problem.

Problem

When I send a voice message to my agent on Matrix (Element & co.), it doesn't work. The agent gets the audio but doesn't actually hear it — it just makes up a polite reply instead of answering my question.

How I encountered it

Running my own OpenClaw instance with seven friends. Over the last weeks, outbound voice replies got shipped (those lovely native voice bubbles) and E2EE got more stable — really happy about both. So naturally we wanted to send voice messages the other way too. Tried it, didn't work.

I noticed the other channels (Discord, Telegram, WhatsApp, Feishu) handle this already. Matrix doesn't seem to.

Solution

There should be a solution — probably similar to what the other channels are doing. I don't know exactly, but hopefully not too hard to wire up.

Best regards, and thanks a lot for all the work, Frank

P.S. I read and approve this message ^^

extent analysis

TL;DR

Implementing support for inbound voice messages in Matrix, similar to other channels like Discord and Telegram, is likely the fix.

Guidance

Investigate how other channels (Discord, Telegram, WhatsApp, Feishu) handle inbound voice messages to identify a potential solution.
Review the OpenClaw instance configuration to ensure it is set up to support inbound voice messages.
Check the Matrix protocol documentation to see if there are any specific requirements or limitations for handling voice messages.
Consider reaching out to the OpenClaw community or Matrix developers for guidance on implementing inbound voice message support.

Notes

The issue lacks technical details about the OpenClaw instance and Matrix configuration, making it difficult to provide a more specific solution.

Recommendation

Apply workaround: Implement a custom solution to handle inbound voice messages, potentially using existing implementations from other channels as a reference, until official support is available.
