openclaw - ✅(Solved) Fix [Feature]: Voice messages to agent don't work on Matrix [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#78016 · Fetched 2026-05-06 06:17:54
Comments: 1
Participants: 2
Timeline: 6
Reactions: 2
Timeline (top): mentioned ×2, subscribed ×2, commented ×1, cross-referenced ×1

Fix Action

PR fix notes

PR #78069: feat(matrix): transcribe inbound voice notes before mention gate

Description (problem / solution / changelog)

Summary

  • Problem: Inbound voice messages on Matrix reach the agent as raw audio attachments with no transcript. The model improvises a reply instead of answering, and requireMention: true rooms drop voice notes entirely because there's no text to match the mention regex.
  • Why it matters: Discord, Telegram, WhatsApp, and Feishu already transcribe inbound voice via transcribeFirstAudio before the mention gate. Matrix is the lone holdout — voice-only Matrix users have no way to talk to their agent.
  • What changed: Wire the existing transcribeFirstAudio helper into the Matrix monitor handler before the mention gate, mirroring the Discord/Telegram pattern. The transcript feeds into the mention check (so a voice note that says the bot's name reaches the agent in requireMention rooms) and into BodyForAgent (so the agent reads the transcript instead of a placeholder). MediaTranscribedIndexes is set so downstream tools don't re-transcribe the same audio.
  • What did NOT change (scope boundary): No core media-understanding changes. No new config keys (operators control via the existing global tools.media.audio.enabled). Outbound TTS untouched. E2EE crypto path untouched (decryption stays inside downloadMatrixMedia; preflight receives the plaintext path). No changes to other channels.

Change Type (select all)

  • Feature

Scope (select all touched areas)

  • Integrations

Linked Issue/PR

  • Closes #78016
  • This PR fixes a bug or regression

Root Cause (if applicable)

N/A — feature parity request, not a regression.

Regression Test Plan (if applicable)

N/A — feature parity. Test coverage:

  • 15 unit tests in extensions/matrix/src/matrix/monitor/preflight-audio.test.ts covering the audio-detection predicate, transcript formatter, and runtime caller (happy paths, error swallowing, abort-signal short-circuit, MediaPaths/MediaTypes ctx shape).
  • 9 integration tests in extensions/matrix/src/matrix/monitor/handler.audio-preflight.test.ts covering: DM voice notes, m.file with audio mime, mention-gate bypass via transcript, mention-gate drop without transcript match, transcription failure fallback, non-audio bypass, single-download verification, encrypted (E2EE) audio, and size-limit handling.
  • Existing handler.test.ts, handler.media-failure.test.ts, and the rest of the matrix monitor suite: 396/396 passing, no regressions.

User-visible / Behavior Changes

  • Inbound voice notes (m.audio, plus m.file carrying an audio/* mimetype) on Matrix now get transcribed before the mention gate. A voice note that mentions the bot by name (per the existing mentionRegexes) bypasses requireMention: true rooms, matching Discord/Telegram.
  • The transcript is wrapped with [Audio transcript (machine-generated, untrusted)]: … framing (with JSON.stringify escaping) so prompt-injection content inside the audio cannot impersonate system instructions.
  • Bare-filename audio bodies (e.g. voice.ogg, auto-set by Element) are replaced with the existing [matrix audio attachment] placeholder so the agent sees a clear audio marker rather than a stray filename. The download-failed path was already doing this; we extend it to the success path for audio specifically.
  • Operators can disable globally via tools.media.audio.enabled: false.
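
The three behaviors above (audio detection, untrusted transcript framing, bare-filename normalization) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the interface and function names are hypothetical, and the bare-filename heuristic is an assumption. Only the framing string and the placeholder text come from the description.

```typescript
// Hypothetical shapes and names for illustration; the PR's real helpers may differ.
interface MatrixContentLike {
  msgtype?: string;
  body?: string;
  info?: { mimetype?: string };
}

// True for m.audio events and for m.file events that carry an audio/* mimetype.
function isAudioContent(content: MatrixContentLike): boolean {
  if (content.msgtype === "m.audio") return true;
  const mime = content.info?.mimetype?.toLowerCase() ?? "";
  return content.msgtype === "m.file" && mime.startsWith("audio/");
}

// Wraps the transcript in the untrusted framing quoted in the PR description.
// JSON.stringify escapes quotes and control characters inside the transcript.
function formatTranscript(transcript: string): string {
  return `[Audio transcript (machine-generated, untrusted)]: ${JSON.stringify(transcript)}`;
}

// Assumption: a "bare filename" body is a single token ending in a short extension.
function isBareFilename(body: string): boolean {
  return /^[^\s/\\]+\.[A-Za-z0-9]{1,5}$/.test(body.trim());
}

const AUDIO_PLACEHOLDER = "[matrix audio attachment]";

// Replaces empty or bare-filename bodies with the audio placeholder.
function normalizeAudioBody(body: string | undefined): string {
  const text = (body ?? "").trim();
  return text === "" || isBareFilename(text) ? AUDIO_PLACEHOLDER : text;
}
```

Because the transcript is passed through JSON.stringify, a spoken newline or quote character arrives as an escaped sequence inside one quoted string rather than as raw text that could break out of the framing.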

Diagram (if applicable)

Before:
[user voice note] -> [Matrix handler]
  -> mention gate (drops if requireMention:true & no @mention text)
  -> media download
  -> agent ctx { MediaPath, MediaUrl, BodyForAgent: "[matrix media]" }
  -> agent improvises a reply about "an audio attachment"

After:
[user voice note] -> [Matrix handler]
  -> audio detect + early download + transcribeFirstAudio
  -> mention gate (sees transcript text, can match @bot mentions)
  -> agent ctx { MediaPath, MediaUrl, BodyForAgent: "[Audio transcript ...]: \"...\"", MediaTranscribedIndexes: [0] }
  -> agent answers the spoken question
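
As a rough sketch of the "After" ordering, the mention gate can be modeled as a pure decision that runs once the preflight transcript (if any) is known. The types and the function name gateAfterPreflight are hypothetical. Only the field names BodyForAgent and MediaTranscribedIndexes and the framing string are taken from the description.

```typescript
// Hypothetical model of the gate decision; the real handler carries far more context.
interface GateInput {
  body: string; // original message body (may be a placeholder)
  transcript?: string; // result of the audio preflight, if it produced one
}

interface GateOptions {
  requireMention: boolean;
  mentionRegexes: RegExp[];
}

interface GateDecision {
  delivered: boolean;
  BodyForAgent?: string;
  MediaTranscribedIndexes?: number[];
}

function gateAfterPreflight(input: GateInput, opts: GateOptions): GateDecision {
  // Treat empty or whitespace-only transcripts as "no transcript".
  const transcript = input.transcript?.trim() || undefined;

  // The gate sees the transcript when there is one, otherwise the plain body.
  const textForGate = transcript ?? input.body;
  const mentioned = opts.mentionRegexes.some((re) => textForGate.search(re) !== -1);
  if (opts.requireMention && !mentioned) return { delivered: false };

  if (!transcript) return { delivered: true, BodyForAgent: input.body };

  return {
    delivered: true,
    BodyForAgent: `[Audio transcript (machine-generated, untrusted)]: ${JSON.stringify(transcript)}`,
    MediaTranscribedIndexes: [0], // marks the first attachment as already transcribed
  };
}
```

With requireMention enabled, a voice note whose transcript contains the bot's name is delivered, while one without a match is dropped exactly as a non-mentioning text message would be.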

Security Impact (required)

  • New permissions/capabilities? No.
  • Secrets/tokens handling changed? No.
  • New/changed network calls? No new outbound endpoint — uses the operator's existing tools.media.audio.provider, the same one Discord/Telegram/WhatsApp/Feishu already call. No new API keys required.
  • Command/tool execution surface changed? No.
  • Data access scope changed? Yes (minor) — audio attachment bytes (decrypted plaintext for E2EE rooms) are now sent to the operator-configured STT provider on Matrix, matching peer-channel behavior. Mitigation: gated by tools.media.audio.enabled; documented in docs/channels/matrix.md.

Repro + Verification

Environment

  • OS: Linux (Debian)
  • Runtime/container: OpenClaw gateway in Docker
  • Model/provider: OpenAI gpt-4o-mini-transcribe (via tools.media.audio.provider: openai)
  • Integration/channel: Matrix (self-hosted Synapse, Element clients)
  • Relevant config (redacted): tools.media.audio.{enabled, provider: openai, model: gpt-4o-mini-transcribe}

Steps

  1. Send a voice note from Element to a Matrix bot in a requireMention: true room, saying the bot's name in the recording.
  2. Send a voice note in a DM to the bot, with no text caption.
  3. Send a non-audio attachment (e.g. an image) to the same room to confirm unchanged behavior.

Expected

  • (1) Bot transcribes the voice note, sees the spoken bot mention in the transcript, and replies normally.
  • (2) Bot replies based on the transcribed body.
  • (3) Behavior unchanged from main.

Actual

  • (1) (2) (3) Asserted by integration tests in handler.audio-preflight.test.ts. Live end-to-end on a personal homeserver is deferred (see Human Verification below).

Evidence

  • Failing test/log before + passing after — the integration test file was authored TDD-style. Initial run showed 5 failed assertions on expect(transcribeFirstAudioMock).toHaveBeenCalledTimes(1); once the handler was wired, all 9 cases turned green.
  • Trace/log snippets — N/A; tests mock downloadMatrixMedia and transcribeFirstAudio following the Discord pattern.
  • Screenshot/recording — N/A.
  • Perf numbers — N/A.

Human Verification (required)

  • Verified scenarios: TDD-driven integration tests covering DM voice, room-mention-bypass via transcript, room-mention-drop without match, transcription-failure fallback, non-audio bypass, single-download verification, E2EE-encrypted audio, and size-limit handling. All 9 cases green. Full matrix monitor suite runs clean (396/396, including pre-existing handler.media-failure.test.ts and handler.body-for-agent.test.ts).
  • Edge cases checked: Encrypted media (file: { url, key, iv, hashes, v }) decryption + transcription path. m.file with audio mimetype. Bare-filename body normalization. Abort signal short-circuit before SDK load. Empty / undefined transcript fallthrough.
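
The abort short-circuit, error swallowing, and empty-transcript fallthrough mentioned above could look roughly like this. It is a sketch under assumptions: preflightTranscribe and the TranscribeFn signature are hypothetical stand-ins, and the real transcribeFirstAudio takes a richer context than the two fields shown here.

```typescript
// Hypothetical signature; only the MediaPaths/MediaTypes field names come from the PR text.
type TranscribeFn = (ctx: {
  MediaPaths: string[];
  MediaTypes: string[];
}) => Promise<string | undefined>;

async function preflightTranscribe(params: {
  mediaPath: string;
  mediaType: string;
  transcribe: TranscribeFn;
  abortSignal?: AbortSignal;
  onError?: (err: unknown) => void;
}): Promise<string | undefined> {
  // Short-circuit before doing any work if the caller has already aborted.
  if (params.abortSignal?.aborted) return undefined;

  try {
    const transcript = await params.transcribe({
      MediaPaths: [params.mediaPath],
      MediaTypes: [params.mediaType],
    });
    // Empty or missing transcripts fall through as "no transcript".
    const trimmed = transcript?.trim();
    return trimmed ? trimmed : undefined;
  } catch (err) {
    // Swallow the error: a failed transcription should not drop the message.
    params.onError?.(err);
    return undefined;
  }
}
```

Returning undefined in every failure mode lets the caller treat "aborted", "failed", and "empty" identically and continue down the pre-existing untranscribed path.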
  • What I did NOT verify: Live end-to-end run against my own Synapse + Element clients on this branch. Reason: testing a fork build requires a custom Docker image; the framework-side path (transcribeFirstAudio) is exercised by the existing src/media-understanding test suite, and our handler integration is locked in by the new tests.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes — additive only.
  • Config/env changes? No — reuses existing tools.media.audio configuration.
  • Migration needed? No.

Risks and Mitigations

  • Risk: Prompt injection via voice transcript ("ignore prior instructions and …").
    • Mitigation: Transcript is wrapped with [Audio transcript (machine-generated, untrusted)]: ${JSON.stringify(transcript)} framing before reaching the agent body, mirroring Telegram exactly. JSON.stringify escapes control characters and quote chars.
  • Risk: STT cost amplification — every audio message in a watched room triggers an STT API call.
    • Mitigation: Operator-controlled via tools.media.audio.enabled. Existing room/sender allowlist in the Matrix handler runs BEFORE the preflight code, so unauthorized senders never trigger transcription.
  • Risk: Reordering the audio download (now runs before mention gate) could shift failure semantics.
    • Mitigation: The existing media block now reuses the preflight result via !media && !mediaDownloadFailed guards. Same exception types propagate through the same logger paths. Existing handler.media-failure.test.ts still passes unchanged.
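
The single-download guarantee behind the last mitigation can be illustrated with a small model. Only the `!media && !mediaDownloadFailed` guard is taken from the description. The resolveMedia function, its types, and the download counter are hypothetical and exist purely to make the behavior observable.

```typescript
// Hypothetical model: one download helper shared by the preflight and the later media block.
interface DownloadedMedia {
  path: string;
  contentType: string;
}

type Downloader = () => Promise<DownloadedMedia>;

async function resolveMedia(
  isAudio: boolean,
  download: Downloader,
): Promise<{ media?: DownloadedMedia; mediaDownloadFailed: boolean; downloads: number }> {
  let media: DownloadedMedia | undefined;
  let mediaDownloadFailed = false;
  let downloads = 0;

  const attempt = async (): Promise<void> => {
    downloads += 1;
    try {
      media = await download();
    } catch {
      mediaDownloadFailed = true;
    }
  };

  // Preflight: audio is fetched early, before the mention gate.
  if (isAudio) await attempt();

  // ... the mention gate would run here ...

  // Later media block: download only if the preflight neither succeeded nor failed.
  if (!media && !mediaDownloadFailed) await attempt();

  return { media, mediaDownloadFailed, downloads };
}
```

In this model an audio message is downloaded once whether the early attempt succeeds or fails, and a non-audio message is downloaded once in the later block, which is the property the single-download integration test is described as checking.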

Notes for reviewers

  • The audio download + error-handling block in handler.ts duplicates ~25 lines with the existing media block (different scope, slightly different encrypted: Boolean(...) flag). Kept as-is for minimal-diff. Happy to extract into a helper if preferred.
  • The early content-extraction block at handler.ts ~876-887 (earlyContentInfo etc.) duplicates the later block at ~1088-1096 because the audio path needs the info BEFORE the mention gate while non-audio doesn't. Same minimal-diff trade-off.
  • No disableAudioPreflight per-room knob (Telegram has one). Operators rely on global tools.media.audio.enabled: false. Easy to add if the team wants finer-grained control — happy to send a follow-up.

This PR was developed with AI — [AI-assisted].

Suggested reviewer: @gumadeiras (Matrix-area maintainer).


P.S. I haven't tested it live so far. I still need to. First let's send the PR and see from there.

Changed files

  • CHANGELOG.md (modified, +4/-0)
  • docs/channels/matrix.md (modified, +15/-0)
  • extensions/matrix/CHANGELOG.md (modified, +1/-0)
  • extensions/matrix/src/matrix/monitor/handler.audio-preflight.test.ts (added, +352/-0)
  • extensions/matrix/src/matrix/monitor/handler.ts (modified, +97/-6)
  • extensions/matrix/src/matrix/monitor/preflight-audio.runtime.ts (added, +9/-0)
  • extensions/matrix/src/matrix/monitor/preflight-audio.test.ts (added, +158/-0)
  • extensions/matrix/src/matrix/monitor/preflight-audio.ts (added, +72/-0)

Hola, Frank here — I encountered a problem.

Problem

When I send a voice message to my agent on Matrix (Element & co.), it doesn't work. The agent gets the audio but doesn't actually hear it — it just makes up a polite reply instead of answering my question.

How I encountered it

Running my own OpenClaw instance with seven friends. Over the last weeks, outbound voice replies got shipped (those lovely native voice bubbles) and E2EE got more stable — really happy about both. So naturally we wanted to send voice messages the other way too. Tried it, didn't work.

I noticed the other channels (Discord, Telegram, WhatsApp, Feishu) handle this already. Matrix doesn't seem to.

Solution

There should be a solution — probably similar to what the other channels are doing. I don't know exactly, but hopefully not too hard to wire up.

Best regards, and thanks a lot for all the work, Frank

P.S. I read and approve this message ^^

extent analysis

TL;DR

Implementing support for inbound voice messages in Matrix, similar to other channels like Discord and Telegram, is likely the fix.

Guidance

  • Investigate how other channels (Discord, Telegram, WhatsApp, Feishu) handle inbound voice messages to identify a potential solution.
  • Review the OpenClaw instance configuration to ensure it is set up to support inbound voice messages.
  • Check the Matrix protocol documentation to see if there are any specific requirements or limitations for handling voice messages.
  • Consider reaching out to the OpenClaw community or Matrix developers for guidance on implementing inbound voice message support.

Notes

The issue lacks technical details about the OpenClaw instance and Matrix configuration, making it difficult to provide a more specific solution.

Recommendation

Apply workaround: Implement a custom solution to handle inbound voice messages, potentially using existing implementations from other channels as a reference, until official support is available.
