openclaw - Feature: canonical voice-input path for custom chat clients (mobile/webchat) without local ASR

openclaw • 2026-04-22 06:00:59

GitHub stats

openclaw/openclaw#70007 • Fetched 2026-04-23 07:30:27

Author

susu3621

Participants

susu3621

OpenClaw already has strong voice-note behavior on native channels like Telegram / WhatsApp, but there does not seem to be a clear, documented, supported way for a custom chat client (for example a mobile app or WebChat-like surface) to send a user-initiated voice message into the Gateway and let OpenClaw handle the audio server-side.

Right now, custom clients appear to have to either:

run local ASR and send text, or
rely on undocumented / inconsistent audio-upload behavior.

For mobile clients this is a real gap, because on-device streaming ASR quality is often noticeably worse than server-side handling.

Why this matters

A common product shape is:

custom Android / iOS app
long-lived Gateway connection for proactive updates / notifications
press-to-talk voice input from the phone
server-side transcription / media understanding on the Gateway side

That is a different use case from:

a node exposing device capabilities to the agent, or
Telegram / WhatsApp / WeChat channel ingestion

In other words: this is about custom chat ingress, not node control.

Current gap

From the public docs + public source, the picture seems incomplete for custom clients:

POST /v1/responses publicly documents message, input_image, and input_file, but not input_audio
input_file is currently text / markdown / html / csv / json / pdf only
the Gateway protocol schema includes chat.send.attachments
but the current public attachment parser in src/gateway/chat-attachments.ts appears image-focused and drops non-image attachments
issue #11133 (WebChat/Canvas image upload) explicitly treats video/audio/PDF attachments as non-goals for the first iteration

That leaves custom clients without a canonical voice-input contract, even though voice notes are an important input mode on mobile.
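To make the gap concrete, here is a minimal sketch of how an image-focused attachment filter would silently drop a voice note, next to one way the classification could be widened. Everything in it is an assumption for illustration: the `ChatAttachment` shape and the function names are hypothetical and are not taken from `src/gateway/chat-attachments.ts`.

```typescript
// Hypothetical attachment shape; the real chat.send schema may differ.
interface ChatAttachment {
  mimeType: string;
  fileName?: string;
  content: string; // base64-encoded payload
}

type AttachmentKind = "image" | "audio" | "unsupported";

// Classify by MIME type prefix, ignoring case and any parameters
// such as "; codecs=opus".
function classifyAttachment(a: ChatAttachment): AttachmentKind {
  const base = (a.mimeType.split(";")[0] ?? "").trim().toLowerCase();
  if (base.startsWith("image/")) return "image";
  if (base.startsWith("audio/")) return "audio";
  return "unsupported";
}

// Mimics an image-only parser: everything that is not an image is dropped.
function keepImagesOnly(attachments: ChatAttachment[]): ChatAttachment[] {
  return attachments.filter((a) => classifyAttachment(a) === "image");
}

// One possible extension: keep audio in its own bucket so it could be routed
// to transcription instead of being discarded.
function partitionAttachments(attachments: ChatAttachment[]) {
  const images: ChatAttachment[] = [];
  const audio: ChatAttachment[] = [];
  const dropped: ChatAttachment[] = [];
  for (const a of attachments) {
    const kind = classifyAttachment(a);
    if (kind === "image") images.push(a);
    else if (kind === "audio") audio.push(a);
    else dropped.push(a);
  }
  return { images, audio, dropped };
}
```

With an image-only filter the client gets no error and no transcript, which is why the behavior looks inconsistent from the outside; an explicit `dropped` bucket would at least let the Gateway report what it ignored.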

Related issues

These issues look adjacent, but not identical:

#14374 Automatic Voice Note Transcription
#5741 audio transcription not triggered for some voice messages
#13924 WhatsApp voice messages broken on all model providers
#40229 Native Voice Input with Press-to-Talk (WeChat-style)
#41363 Native Voice Input for Control UI (ChatGPT-style)
#11133 WebChat/Canvas image upload UI + unified attachment flow across channels

They suggest audio is important and still evolving, but I could not find a single issue focused on the missing custom-client audio ingress contract.

Request

Please consider defining one supported path for user-initiated voice input from custom clients, for example one of:

A documented input_audio shape for POST /v1/responses
A supported chat.send audio-attachment path with a documented attachment schema
A documented upload flow like upload media -> get media ref -> send via chat.send, where the Gateway routes the audio through tools.media.audio / media understanding / transcript injection
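As a rough illustration of what "a documented shape" could mean for each of the three options, the sketch below models them as a TypeScript discriminated union with a small validator. None of these field names come from the OpenClaw docs or source; `InputAudioPart`, `AudioAttachment`, `MediaRefAttachment`, and the `format` values are all hypothetical.

```typescript
// All three shapes are hypothetical illustrations of the options listed above.

// Option 1: an input_audio content part for POST /v1/responses.
interface InputAudioPart {
  type: "input_audio";
  data: string; // base64-encoded audio
  format: "wav" | "mp3" | "ogg" | "m4a";
}

// Option 2: an inline audio attachment on chat.send.
interface AudioAttachment {
  type: "audio";
  mimeType: string;
  content: string; // base64-encoded audio
  durationMs?: number;
}

// Option 3: a reference to media uploaded in a previous step.
interface MediaRefAttachment {
  type: "media_ref";
  mediaRef: string;
}

type VoiceInput = InputAudioPart | AudioAttachment | MediaRefAttachment;

// Loose base64 check: non-empty, length divisible by 4, valid alphabet.
function isBase64(s: string): boolean {
  return s.length > 0 && s.length % 4 === 0 && /^[A-Za-z0-9+/]+={0,2}$/.test(s);
}

// Returns a list of problems; an empty list means the input looks well-formed.
function validateVoiceInput(input: VoiceInput): string[] {
  const errors: string[] = [];
  switch (input.type) {
    case "input_audio":
      if (!isBase64(input.data)) errors.push("data must be base64");
      break;
    case "audio":
      if (!input.mimeType.toLowerCase().startsWith("audio/")) {
        errors.push("mimeType must be audio/*");
      }
      if (!isBase64(input.content)) errors.push("content must be base64");
      break;
    case "media_ref":
      if (input.mediaRef.trim() === "") errors.push("mediaRef must not be empty");
      break;
  }
  return errors;
}
```

Whichever option is chosen, publishing the shape as a type like this would give integrators something to validate against before sending.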

What would help most

Even if full implementation takes time, a short docs answer would already unblock integrators:

"For custom clients, voice input should use X"
"Audio attachments on chat.send are/are not supported"
"/v1/responses will/will not accept input_audio"
"If unsupported today, the recommended workaround is Y"

That would make it much easier to build mobile clients without guessing between channel-only behavior, node behavior, and partial OpenAI-compatible HTTP surfaces.

extent analysis

TL;DR

The most likely fix is to define a supported path for user-initiated voice input from custom clients, such as a documented input_audio shape for POST /v1/responses or a supported chat.send audio-attachment path.

Guidance

Review the Gateway protocol schema and src/gateway/chat-attachments.ts to understand the current attachment parsing limitations and potential areas for extension.
Consider adding support for audio attachments to the chat.send endpoint, including a documented attachment schema.
Evaluate the feasibility of introducing a new input_audio shape for POST /v1/responses to handle voice input from custom clients.
Investigate the upload media -> get media ref -> send via chat.send flow as a potential long-term solution; the issue proposes it but does not confirm that it exists today.
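The upload media -> get media ref -> send via chat.send flow can be modelled in a few lines to see which contracts would need documenting. This is a sketch under stated assumptions: `InMemoryMediaStore`, `buildChatSend`, `handleChatSend`, and the `media_ref` attachment shape are hypothetical, the store stands in for a Gateway endpoint that the issue does not confirm exists, and a real client would make these calls asynchronously over the network.

```typescript
// Hypothetical in-memory stand-in for a Gateway media store.
class InMemoryMediaStore {
  private nextId = 1;
  private readonly items = new Map<string, { bytes: Uint8Array; mimeType: string }>();

  // Step 1: upload media and get back an opaque reference.
  upload(bytes: Uint8Array, mimeType: string): string {
    if (!mimeType.toLowerCase().startsWith("audio/")) {
      throw new Error(`expected audio/*, got ${mimeType}`);
    }
    const ref = `media-${this.nextId++}`;
    this.items.set(ref, { bytes, mimeType });
    return ref;
  }

  resolve(ref: string): { bytes: Uint8Array; mimeType: string } | undefined {
    return this.items.get(ref);
  }
}

// Hypothetical chat.send parameters carrying a media reference.
interface ChatSendParams {
  sessionKey: string;
  message: string;
  attachments: { type: "media_ref"; mediaRef: string }[];
}

// Step 2: the client sends the reference, not the audio bytes.
function buildChatSend(sessionKey: string, mediaRef: string, message = ""): ChatSendParams {
  return { sessionKey, message, attachments: [{ type: "media_ref", mediaRef }] };
}

type Transcriber = (bytes: Uint8Array, mimeType: string) => string;

// Step 3 (Gateway side): resolve each reference, transcribe it, and inject
// the transcript into the text that reaches the agent.
function handleChatSend(
  store: InMemoryMediaStore,
  params: ChatSendParams,
  transcribe: Transcriber,
): string {
  const parts: string[] = [];
  if (params.message) parts.push(params.message);
  for (const att of params.attachments) {
    const item = store.resolve(att.mediaRef);
    if (!item) throw new Error(`unknown media ref: ${att.mediaRef}`);
    parts.push(`[voice transcript] ${transcribe(item.bytes, item.mimeType)}`);
  }
  return parts.join("\n");
}
```

The model surfaces three things a spec would have to pin down: the accepted MIME types at upload, the lifetime of a media reference, and how the transcript is presented to the agent.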

Example

The issue does not include a code snippet or a confirmed API shape, so any example for this request is necessarily hypothetical.

Notes

The solution may require updates to the Gateway protocol schema, attachment parsing, and documentation. It is essential to ensure backward compatibility and consider the impact on existing custom clients.

Recommendation

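Until a supported path for user-initiated voice input from custom clients is defined and implemented, the only workaround the issue identifies is to run ASR on the client and send the resulting text. The upload media -> get media ref -> send via chat.send flow is one of the requested designs, not confirmed existing behavior, so custom clients should not depend on it unless the maintainers document it. Audio uploads that appear to work today fall under the undocumented and inconsistent behavior the issue warns about.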
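A client can be written now so that it switches paths once the maintainers give a docs answer. The sketch below assumes a hypothetical `VoiceCapabilities` descriptor that the Gateway might publish; no such descriptor is described in the issue, and the preference order is an arbitrary choice for illustration.

```typescript
// Hypothetical capability flags; the Gateway does not document these today.
interface VoiceCapabilities {
  responsesInputAudio: boolean;
  chatSendAudioAttachment: boolean;
  mediaUpload: boolean;
}

type VoicePath =
  | "media.upload"
  | "chat.send.attachment"
  | "responses.input_audio"
  | "local-asr-text";

// Prefer any server-side audio path that is explicitly advertised; otherwise
// fall back to on-device ASR and send plain text.
function chooseVoicePath(caps: Partial<VoiceCapabilities>): VoicePath {
  if (caps.mediaUpload) return "media.upload";
  if (caps.chatSendAudioAttachment) return "chat.send.attachment";
  if (caps.responsesInputAudio) return "responses.input_audio";
  return "local-asr-text";
}
```

Treating a missing flag as unsupported keeps the client on the local-ASR fallback until a server-side path is explicitly advertised.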
