openclaw: Feature: canonical voice-input path for custom chat clients (mobile/webchat) without local ASR

GitHub stats
openclaw/openclaw#70007 · Fetched 2026-04-23 07:30:27
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0


Summary

OpenClaw already has strong voice-note behavior on native channels like Telegram / WhatsApp, but there does not seem to be a clear, documented, supported way for a custom chat client (for example a mobile app or WebChat-like surface) to send a user-initiated voice message into the Gateway and let OpenClaw handle the audio server-side.

Right now, custom clients appear to have to either:

  1. run local ASR and send text, or
  2. rely on undocumented / inconsistent audio-upload behavior.

For mobile clients this is a real gap, because on-device streaming ASR quality is often noticeably worse than server-side handling.

Why this matters

A common product shape is:

  • custom Android / iOS app
  • long-lived Gateway connection for proactive updates / notifications
  • press-to-talk voice input from the phone
  • server-side transcription / media understanding on the Gateway side

That is a different use case from:

  • a node exposing device capabilities to the agent, or
  • Telegram / WhatsApp / WeChat channel ingestion

In other words: this is about custom chat ingress, not node control.

Current gap

From the public docs + public source, the picture seems incomplete for custom clients:

  • POST /v1/responses publicly documents message, input_image, and input_file, but not input_audio
  • input_file is currently text / markdown / html / csv / json / pdf only
  • the Gateway protocol schema includes chat.send.attachments
  • but the current public attachment parser in src/gateway/chat-attachments.ts appears image-focused and drops non-image attachments
  • issue #11133 (WebChat/Canvas image upload) explicitly treats video/audio/PDF attachments as non-goals for the first iteration

That leaves custom clients without a canonical voice-input contract, even though voice notes are an important input mode on mobile.
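
To make the gap concrete, here is a minimal sketch of what an image-focused attachment filter does to a voice note. It is not the code in src/gateway/chat-attachments.ts; the ChatAttachment shape, its field names, and keepImagesOnly are hypothetical and only illustrate the behavior described above.

```typescript
// Hypothetical attachment shape; the real Gateway schema may differ.
interface ChatAttachment {
  mimeType: string;
  fileName?: string;
  content: string; // base64 payload (assumed encoding)
}

// Illustrative image-only filter: anything that is not image/* is silently dropped.
function keepImagesOnly(attachments: ChatAttachment[]): ChatAttachment[] {
  return attachments.filter((a) => a.mimeType.toLowerCase().startsWith("image/"));
}

// A client sends one screenshot and one voice note (placeholder payloads).
const sent: ChatAttachment[] = [
  { mimeType: "image/png", fileName: "screenshot.png", content: "AAAA" },
  { mimeType: "audio/ogg", fileName: "voice-note.ogg", content: "AAAA" },
];

// Only the screenshot survives; the voice note never reaches the agent.
const accepted = keepImagesOnly(sent);
```

With a filter like this the client gets no error for the audio attachment, which is why the behavior looks inconsistent from the outside.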

Related issues

These issues look adjacent, but not identical:

  • #14374 Automatic Voice Note Transcription
  • #5741 audio transcription not triggered for some voice messages
  • #13924 WhatsApp voice messages broken on all model providers
  • #40229 Native Voice Input with Press-to-Talk (WeChat-style)
  • #41363 Native Voice Input for Control UI (ChatGPT-style)
  • #11133 WebChat/Canvas image upload UI + unified attachment flow across channels

They suggest audio is important and still evolving, but I could not find a single issue focused on the missing custom-client audio ingress contract.

Request

Please consider defining one supported path for user-initiated voice input from custom clients, for example one of:

  1. A documented input_audio shape for POST /v1/responses
  2. A supported chat.send audio-attachment path with a documented attachment schema
  3. A documented upload flow like upload media -> get media ref -> send via chat.send, where the Gateway routes the audio through tools.media.audio / media understanding / transcript injection
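
For discussion, the three options could look roughly like the types below. Every name here (InputAudioPart, InlineAudioAttachment, MediaRefAttachment, ChatSendParams, buildVoiceChatSend, and all field names) is an assumption made for illustration; none of it is a documented OpenClaw contract.

```typescript
// All names below are hypothetical; none is a documented OpenClaw contract.

// Option 1: an input_audio content part for POST /v1/responses.
interface InputAudioPart {
  type: "input_audio";
  mimeType: string; // e.g. "audio/ogg"
  data: string;     // base64-encoded audio bytes
}

// Option 2: an audio attachment carried inline on chat.send.
interface InlineAudioAttachment {
  kind: "audio";
  mimeType: string;
  data: string;        // base64-encoded audio bytes
  durationMs?: number; // optional hint for the server
}

// Option 3: a reference to media uploaded in a separate step.
interface MediaRefAttachment {
  kind: "media_ref";
  mediaRef: string;
}

type VoiceAttachment = InlineAudioAttachment | MediaRefAttachment;

// Assumed chat.send parameters; the real schema may use other field names.
interface ChatSendParams {
  sessionKey: string;
  message: string;
  attachments: VoiceAttachment[];
}

// Builds an option-2 payload from audio the client has already base64-encoded.
function buildVoiceChatSend(sessionKey: string, base64Audio: string, mimeType: string): ChatSendParams {
  return {
    sessionKey,
    message: "", // the text would come from server-side transcription
    attachments: [{ kind: "audio", mimeType, data: base64Audio }],
  };
}

// Placeholder base64 bytes, not a real recording.
const params = buildVoiceChatSend("main", "T2dnUw==", "audio/ogg");

// The same audio expressed as an option-1 content part.
const part: InputAudioPart = { type: "input_audio", mimeType: "audio/ogg", data: "T2dnUw==" };
```

Whichever shape is chosen, the point of the request is that exactly one of them is documented, so clients do not have to probe for what the Gateway accepts.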

What would help most

Even if full implementation takes time, a short docs answer would already unblock integrators:

  • "For custom clients, voice input should use X"
  • "Audio attachments on chat.send are/are not supported"
  • "/v1/responses will/will not accept input_audio"
  • "If unsupported today, the recommended workaround is Y"

That would make it much easier to build mobile clients without guessing between channel-only behavior, node behavior, and partial OpenAI-compatible HTTP surfaces.

extent analysis

TL;DR

The most likely fix is to define a supported path for user-initiated voice input from custom clients, such as a documented input_audio shape for POST /v1/responses or a supported chat.send audio-attachment path.

Guidance

  • Review the Gateway protocol schema and src/gateway/chat-attachments.ts to understand the current attachment parsing limitations and potential areas for extension.
  • Consider adding support for audio attachments to the chat.send endpoint, including a documented attachment schema.
  • Evaluate the feasibility of introducing a new input_audio shape for POST /v1/responses to handle voice input from custom clients.
  • Investigate the upload media -> get media ref -> send via chat.send flow as a potential long-term solution; the issue proposes it but does not document it as available today.

Example

The issue contains no code snippet and defines no concrete audio contract, so any example can only be a sketch under stated assumptions.
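
Because the issue itself gives no concrete contract, the following is only a sketch, under stated assumptions, of the upload media -> get media ref -> send via chat.send flow named in the Guidance. The VoiceTransport interface, uploadMedia, the chatSend parameter shape, and sendVoiceNote are all hypothetical; the transport is injected so the flow can be exercised without a real Gateway.

```typescript
// Hypothetical transport: neither uploadMedia nor this chat.send shape is a documented API.
interface VoiceTransport {
  uploadMedia(audio: Uint8Array, mimeType: string): Promise<{ mediaRef: string }>;
  chatSend(params: {
    sessionKey: string;
    message: string;
    attachments: { mediaRef: string }[];
  }): Promise<void>;
}

// Uploads the recording first, then sends a chat message that references it.
async function sendVoiceNote(
  transport: VoiceTransport,
  sessionKey: string,
  audio: Uint8Array,
  mimeType: string,
): Promise<string> {
  if (audio.length === 0) throw new Error("empty recording");
  const { mediaRef } = await transport.uploadMedia(audio, mimeType);
  await transport.chatSend({ sessionKey, message: "", attachments: [{ mediaRef }] });
  return mediaRef;
}

// In-memory fake that records the order of calls instead of talking to a Gateway.
const calls: string[] = [];
const fakeTransport: VoiceTransport = {
  async uploadMedia(audio, mimeType) {
    calls.push(`upload:${mimeType}:${audio.length}`);
    return { mediaRef: "media-1" };
  },
  async chatSend(params) {
    calls.push(`chat.send:${params.attachments[0].mediaRef}`);
  },
};

const pending = sendVoiceNote(fakeTransport, "main", new Uint8Array([1, 2, 3]), "audio/ogg");
```

Keeping the upload and the chat message as separate steps means a failed upload never produces a chat message with a dangling reference.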

Notes

The solution may require updates to the Gateway protocol schema, attachment parsing, and documentation. It is essential to ensure backward compatibility and consider the impact on existing custom clients.

Recommendation

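Until a supported path for user-initiated voice input from custom clients is defined and implemented, the only option the issue describes as available today without relying on undocumented behavior is to run local ASR on the client and send text. The upload media -> get media ref -> send via chat.send flow is a proposal in the issue, not a documented feature, and the current attachment parser appears to drop non-image attachments, so treat it as a candidate design rather than a workaround.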
