openclaw - 💡(How to fix) Fix [Feature]: Session-bound IM voice mode / voice handoff [1 comments, 2 participants]

richardmqq · 2026-05-05T12:13:06Z

Summary

Add a session-bound IM voice mode / voice-room handoff so a user can move the current Telegram/Discord/Slack/etc. OpenClaw session into a live voice interaction without losing channel/session context.

Problem to solve

OpenClaw now has several voice-related surfaces, but they solve different problems:

- Control UI Talk mode: good browser realtime voice UX, but it lives in the OpenClaw web/control surface rather than feeling native to the current IM thread.
- macOS Voice Wake / push-to-talk: good local microphone capture into the active OpenClaw session, but it is local-device oriented and not an IM call experience.
- voice-call plugin: good telephony/inbound/outbound phone-call workflow via Twilio/Telnyx/Plivo, but it is phone-number/webhook oriented rather than “continue this exact Telegram/Discord/Slack thread as a voice call/room”.
- Third-party experiments such as https://github.com/davidguttman/clawkie-talkie fill the gap by generating a session-bound browser voice handoff URL from an existing OpenClaw conversation.

The missing UX is: from an IM channel, say something like “switch to voice”, receive/open a voice entrypoint that is bound to this exact OpenClaw session/thread, talk naturally, and have the transcript + assistant response remain anchored to the original IM session.

Without this, users either:

1. use Control UI Talk mode and mentally map it back to the IM conversation,
2. use phone-call plugins even when they do not want a phone call,
3. send voice notes/audio files asynchronously, or
4. install a separate web handoff daemon/project.

Proposed solution

Provide a first-class OpenClaw “voice handoff” / “IM talk mode” abstraction that can be invoked from any supported channel session.

Desired UX:

1. In Telegram/Discord/Slack/etc., user says “switch to voice” or runs a slash command such as `/voice`.
2. OpenClaw replies with a session-bound voice link or launches a supported native channel call/voice-room integration when available.
3. The voice room is scoped to the current OpenClaw session identity (prefer actual sessionId; include sessionKey/channel/target/accountId only as routing/debug metadata where safe).
4. User speaks in the voice room. STT transcript is mirrored into the original IM thread/session.
5. OpenClaw runs the agent turn against the same session context and delivers the answer back to the original IM channel/thread.
6. The voice surface can optionally play TTS back to the user.
7. The written IM transcript remains canonical.
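
Steps 4 to 7 can be sketched as a single voice-turn handler. This is only an illustration of the intended ordering: `HandoffBinding`, `VoiceTurnDeps`, and `handleVoiceTurn` are hypothetical names, not existing OpenClaw APIs, and the STT, agent, and TTS calls are injected so the sketch makes no assumption about how OpenClaw implements them.

```typescript
// Hypothetical sketch: none of these names are real OpenClaw APIs.
interface HandoffBinding {
  sessionId: string; // stored session the voice room is bound to
  channel: string;   // routing metadata only, e.g. "telegram"
  target: string;    // chat/thread identifier inside the channel
}

interface VoiceTurnDeps {
  // Posts text into the original IM thread.
  deliverToChannel(binding: HandoffBinding, text: string): Promise<void>;
  // Runs one agent turn against the bound session and returns the reply text.
  runAgentTurn(sessionId: string, userText: string): Promise<string>;
  // Optional: speaks the reply on the voice surface.
  playTts?(text: string): Promise<void>;
}

async function handleVoiceTurn(
  binding: HandoffBinding,
  transcript: string,
  deps: VoiceTurnDeps,
): Promise<string | null> {
  const text = transcript.trim();
  if (text.length === 0) return null; // ignore empty STT results

  // Step 4: mirror the STT transcript into the original IM thread.
  await deps.deliverToChannel(binding, `[voice] ${text}`);

  // Step 5: run the agent turn on the same session and reply in the thread.
  const reply = await deps.runAgentTurn(binding.sessionId, text);
  await deps.deliverToChannel(binding, reply);

  // Step 6: TTS playback is optional; the written thread stays canonical.
  if (deps.playTts) await deps.playTts(reply);
  return reply;
}
```

Because the transcript and the reply are both written to the IM thread before any audio is played, the thread remains complete even if the voice surface disconnects mid-turn.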

Possible architecture:

- Add a core voice-handoff capability that builds short-lived/session-bound voice entrypoints from trusted runtime context.
- Reuse existing Talk mode realtime provider plumbing where possible.
- Reuse `openclaw infer audio` and `openclaw infer tts` for fallback/non-realtime mode.
- Let channel plugins optionally implement native voice-room/call support; otherwise fall back to OpenClaw-hosted web/PWA voice room.
- Keep provider credentials on the Gateway. Browser/IM clients should receive only constrained/ephemeral session material, never standard provider API keys.
- Preserve channel delivery semantics: the agent reply should be delivered as if the user had typed in the original channel/thread.
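
One way a short-lived, session-bound entrypoint could be built is an HMAC-signed token minted on the Gateway. The sketch below is an assumption about how this might look, not a description of OpenClaw internals: `mintHandoffToken`, `verifyHandoffToken`, and the claim layout are hypothetical, and only Node's built-in `node:crypto` module is used.

```typescript
import { createHmac, randomBytes, timingSafeEqual } from "node:crypto";

// Hypothetical claim layout; not an OpenClaw format.
interface HandoffClaims {
  sessionId: string; // session the voice room is bound to
  exp: number;       // expiry, milliseconds since epoch
  nonce: string;     // makes every minted token unique
}

function sign(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// Mint on the Gateway, from trusted runtime context only.
function mintHandoffToken(sessionId: string, secret: string, ttlMs: number, now: number = Date.now()): string {
  const claims: HandoffClaims = { sessionId, exp: now + ttlMs, nonce: randomBytes(8).toString("hex") };
  const payload = Buffer.from(JSON.stringify(claims), "utf8").toString("base64url");
  return `${payload}.${sign(payload, secret)}`;
}

// Returns the claims if the signature is valid and the token has not expired.
function verifyHandoffToken(token: string, secret: string, now: number = Date.now()): HandoffClaims | null {
  const dot = token.indexOf(".");
  if (dot < 0) return null;
  const payload = token.slice(0, dot);
  const mac = Buffer.from(token.slice(dot + 1), "utf8");
  const expected = Buffer.from(sign(payload, secret), "utf8");
  // timingSafeEqual requires equal lengths, so check that first.
  if (mac.length !== expected.length || !timingSafeEqual(mac, expected)) return null;
  const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8")) as HandoffClaims;
  return claims.exp > now ? claims : null;
}
```

The signing secret stays on the Gateway, so the browser only ever holds a token that names one session and expires on its own. A production design would likely also want single-use or revocable tokens, which this sketch does not cover.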

Alternatives considered

- Control UI Talk mode only: useful, but it is not an IM-native continuation path and is awkward when the conversation started in Telegram/Discord/Slack.
- voice-call plugin only: useful for telephony, inbound/outbound calls, and meeting integrations, but overkill and semantically different for “continue this IM thread by voice”. It also requires phone/webhook/provider setup.
- Voice notes/audio attachments: asynchronous, not a live talk mode, and often lack full-duplex/interruption/room semantics.
- Third-party Clawkie Talkie style handoff: promising proof of concept, but it requires a separate daemon/skill/install path and duplicates session-routing logic that likely belongs in OpenClaw core or an official plugin.

Impact

Affected users/systems/channels:

- Users who primarily interact with OpenClaw through Telegram, Discord, Slack, WhatsApp, etc.
- Mobile users who want to continue a rich existing session by voice while walking/driving/away from the desk.
- Multi-channel OpenClaw setups where the written IM thread is the source of truth.

Severity: medium. Not a core correctness bug, but it blocks a high-value mobile/voice workflow and creates pressure for parallel third-party bridges.

Frequency: whenever a user wants long-form voice steering for an already-active IM conversation.

Consequence:

- Extra manual context switching into Control UI Talk mode.
- Confusion between Talk mode, voice-call telephony, voice notes, and external voice handoff projects.
- More custom glue code/daemon installs for a workflow that could be a first-class OpenClaw capability.

Evidence/examples

Existing OpenClaw docs indicate several related but distinct surfaces:
- Control UI Talk mode: browser realtime voice sessions using provider-specific realtime transports / Gateway relay.
- macOS Voice Wake / push-to-talk: local capture forwarded to active gateway/agent, reply delivered to last-used provider.
- voice-call plugin: Twilio/Telnyx/Plivo telephony with inbound/outbound calls, realtime voice, streaming transcription.
Third-party proof of concept: https://github.com/davidguttman/clawkie-talkie
- It creates a “switch to voice” handoff URL for an existing OpenClaw session.
- It uses a local daemon + browser UI + WebRTC, shells out to OpenClaw for STT/agent/TTS, and preserves the original OpenClaw session as the transcript source of truth.

Additional information

This request is not necessarily asking for Telegram/Discord native group-call automation on day one. A generic OpenClaw-hosted browser/PWA voice room that is safely bound to the current IM session would already cover the main UX gap. Native channel call/voice-room integrations could be optional adapters later.

Important constraints:

- Do not expose provider API keys to browser or IM clients.
- Do not rely on guessable/forged session keys from untrusted user text.
- Handoff data should be generated only from trusted runtime/session context.
- Prefer actual stored sessionId for the agent turn; use sessionKey/channel/target/accountId as optional routing metadata.
- Written IM session transcript should remain canonical.
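
These constraints could also be met with an opaque, server-side handle instead of putting any session identifier in the link. The following is a minimal in-memory sketch under that assumption: `TrustedRuntimeContext` and `HandoffRegistry` are hypothetical names, and a real Gateway would need persistence and cleanup that are omitted here.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical shape of what the runtime already knows about the current turn.
interface TrustedRuntimeContext {
  sessionId: string;   // stored session id, preferred for the agent turn
  channel: string;     // optional routing metadata
  target: string;
  accountId?: string;
}

interface HandoffRecord {
  sessionId: string;
  routing: { channel: string; target: string; accountId?: string };
  expiresAt: number;   // milliseconds since epoch
}

class HandoffRegistry {
  private readonly records = new Map<string, HandoffRecord>();

  // Called only with context produced by the runtime, never with user text.
  create(ctx: TrustedRuntimeContext, ttlMs: number, now: number = Date.now()): string {
    const roomId = randomBytes(32).toString("hex"); // unguessable handle
    this.records.set(roomId, {
      sessionId: ctx.sessionId,
      routing: { channel: ctx.channel, target: ctx.target, accountId: ctx.accountId },
      expiresAt: now + ttlMs,
    });
    return roomId;
  }

  // Returns the bound session, or null for unknown or expired handles.
  resolve(roomId: string, now: number = Date.now()): HandoffRecord | null {
    const record = this.records.get(roomId);
    if (record === undefined) return null;
    if (record.expiresAt <= now) {
      this.records.delete(roomId);
      return null;
    }
    return record;
  }
}
```

With this shape the client only ever sees the random `roomId`; knowing or guessing a sessionId or sessionKey does not open a room, because lookups go through the registry rather than through anything the user typed.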

extent analysis

TL;DR

Implement a session-bound voice handoff capability in OpenClaw to enable users to switch from an IM conversation to a live voice interaction without losing context.

Guidance

- Introduce a core voice-handoff capability: Develop a feature that generates short-lived, session-bound voice entrypoints from trusted runtime context, reusing existing Talk mode realtime provider plumbing where possible.
- Reuse existing STT and TTS capabilities: Leverage `openclaw infer audio` and `openclaw infer tts` for fallback or non-realtime modes to ensure seamless voice interactions.
- Implement native voice-room/call support: Allow channel plugins to optionally implement native voice-room/call support, falling back to OpenClaw-hosted web/PWA voice rooms when necessary.
- Preserve channel delivery semantics: Ensure that agent replies are delivered as if the user had typed in the original channel/thread, maintaining the written IM transcript as canonical.
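
The fallback order described in this guidance (native channel voice first, then a realtime web room, then a non-realtime web room built on STT/TTS) could be expressed as a small selection function. The capability flags and the `VoiceSurface` names below are hypothetical and only illustrate the ordering.

```typescript
// Hypothetical capability flags; not part of any OpenClaw plugin interface.
type VoiceSurface = "native" | "web-realtime" | "web-fallback";

interface ChannelCapabilities {
  nativeVoiceRoom: boolean;   // channel plugin implements native call/voice-room support
}

interface GatewayCapabilities {
  realtimeProvider: boolean;  // a Talk-mode style realtime provider is configured
}

function pickVoiceSurface(channel: ChannelCapabilities, gateway: GatewayCapabilities): VoiceSurface {
  if (channel.nativeVoiceRoom) return "native";
  if (gateway.realtimeProvider) return "web-realtime";
  // Non-realtime path: record audio, transcribe, run the turn, synthesize the reply.
  return "web-fallback";
}
```

Keeping the decision in one place means a channel plugin only has to declare a capability, and every channel without native support still gets a working voice room.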

Example

A possible implementation could involve creating a /voice slash command that triggers the generation of a session-bound voice link, which the user can click to join a voice room. The voice room would then mirror the STT transcript into the original IM thread, and the OpenClaw agent would respond accordingly.
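
A minimal sketch of the command side of that example follows. It assumes the handoff token has already been produced elsewhere from trusted session context; `isVoiceHandoffRequest`, `buildVoiceReply`, the `/voice` path, and the `t` query parameter are all hypothetical choices made for illustration.

```typescript
// Hypothetical helpers; the URL layout is an assumption, not an OpenClaw route.

// Recognizes the slash command or the plain-language trigger from the issue.
function isVoiceHandoffRequest(text: string): boolean {
  const normalized = text.trim().toLowerCase();
  return (
    normalized === "/voice" ||
    normalized.startsWith("/voice ") ||
    normalized === "switch to voice"
  );
}

// Builds the reply that is posted back into the IM thread.
function buildVoiceReply(gatewayBaseUrl: string, token: string): string {
  const url = new URL("/voice", gatewayBaseUrl);
  url.searchParams.set("t", token); // opaque token; never a raw session key
  return `Voice room ready: ${url.toString()}`;
}
```

For example, `buildVoiceReply("https://gw.example", "abc")` returns `Voice room ready: https://gw.example/voice?t=abc`. The message text is used only to detect the trigger; the session binding comes from the token, not from anything the user typed.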

Notes

The implementation should prioritize security and adhere to the constraints outlined in the issue, such as not exposing provider API keys to browser or IM clients and generating handoff data only from trusted runtime/session context.

Recommendation

As a first step, implement a generic OpenClaw-hosted browser/PWA voice room that is safely bound to the current IM session, and leave native channel call/voice-room integrations as optional adapters for later. This addresses the main UX gap while honoring the security constraints listed in the issue.
