openclaw - 💡(How to fix) Fix [Feature]: Pluggable STT backend for macOS Push-to-Talk

StepCodex · 2026-05-27T03:23:17Z


Fix Action

Fix / Workaround

The workaround is to build a separate sidecar PTT wrapper that records audio, sends it to the preferred ASR backend, then injects text back into OpenClaw. That works, but it duplicates hotkey handling, recording state, failure handling, session routing, and delivery behavior that the macOS app already handles better.

Summary

Allow the macOS VoicePushToTalk flow to use a configurable speech-to-text backend, including Gateway tools.media.audio or a private/local ASR service, instead of always using Apple Speech.

Problem to solve

The macOS app's native Push-to-Talk experience has the right UX primitives: global hotkey handling, microphone capture, overlay state, app lifecycle integration, and existing forwarding into OpenClaw sessions.

However, the current flow appears to couple VoicePushToTalk directly to Apple Speech. That makes the native PTT path hard to use for users who already have a preferred OpenClaw audio transcription stack, such as tools.media.audio, a local Whisper/Qwen/Parakeet service, or a trusted private LAN ASR host.

This is difficult to solve as a normal third-party plugin because the missing abstraction is inside the native macOS PTT capture/transcription pipeline, before text reaches the Gateway/session layer.

Proposed solution

Add an STT backend abstraction for macOS VoicePushToTalk.

Suggested behavior:

- Keep Apple Speech as the default backend and preserve the existing realtime partial transcript behavior.
- Add a batch backend for Push-to-Talk:
  - Hotkey down: record microphone audio locally.
  - Hotkey up: finalize the audio.
  - Send the audio to the configured backend.
  - Receive transcript.
  - Reuse the existing VoiceWakeForwarder/session delivery path.
- Support a Gateway-backed option that uses OpenClaw's configured tools.media.audio transcription path.
- Optionally support a local/private HTTP backend or command backend later, if maintainers think that is acceptable from a security and UX perspective.
- In batch mode, the overlay can show state transitions such as Recording -> Transcribing -> Sent/Failed instead of live partial transcript text.
- Keep wake-word mode on Apple Speech initially unless a streaming/wake-word backend abstraction is added later.
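
The batch-mode overlay transitions (Recording -> Transcribing -> Sent/Failed) could be modeled as a small pure state machine. The following is only a sketch, written in TypeScript for brevity; the macOS app would need a native equivalent. The names `OverlayState`, `PttEvent`, `nextOverlayState`, and `runEvents` are hypothetical and do not come from the OpenClaw codebase. Ignoring a hotkey press while a recording or transcription is already in progress is an assumption, not something this request specifies.

```typescript
// Hypothetical sketch: none of these names exist in OpenClaw.
type OverlayState = "idle" | "recording" | "transcribing" | "sent" | "failed";
type PttEvent = "hotkeyDown" | "hotkeyUp" | "transcriptDelivered" | "backendError";

// Pure transition function: returns the next overlay state, or the current
// state unchanged when the event does not apply to it.
function nextOverlayState(state: OverlayState, event: PttEvent): OverlayState {
  switch (event) {
    case "hotkeyDown":
      // Assumption: a press is ignored while audio is being captured or transcribed.
      return state === "recording" || state === "transcribing" ? state : "recording";
    case "hotkeyUp":
      // Releasing the hotkey finalizes the audio and hands it to the backend.
      return state === "recording" ? "transcribing" : state;
    case "transcriptDelivered":
      return state === "transcribing" ? "sent" : state;
    case "backendError":
      return state === "transcribing" ? "failed" : state;
  }
}

// Folds a sequence of events over the transition function.
function runEvents(events: PttEvent[], start: OverlayState = "idle"): OverlayState {
  return events.reduce(nextOverlayState, start);
}

console.log(runEvents(["hotkeyDown", "hotkeyUp", "transcriptDelivered"])); // prints: sent
```

Keeping the transitions pure would let the overlay render from a single state value and make the failure path testable without a microphone or a live backend.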

This request is specifically for native macOS Push-to-Talk voice input. It is not asking for full realtime voice conversation mode or speech-to-speech.

Alternatives considered

- Separate wrapper app: works, but duplicates hotkey capture, microphone capture, overlay state, retries, notifications, and session delivery logic. It is also more fragile across macOS permissions and app lifecycle changes.
- Arbitrary external shell command from the macOS app: flexible, but may be too broad from a security and support perspective.
- Full realtime/talk-mode backend replacement: useful, but larger than this request. Batch PTT transcription is a smaller and more incremental capability.
- Continue using only Apple Speech: simple, but prevents users from using their existing private/local OpenClaw transcription setup.

Impact

Affected users/systems/channels: macOS users who use native Push-to-Talk and want local, private, or custom speech-to-text routing.

Severity: Medium. Current native PTT works, but the STT backend is not configurable, so users with privacy, language quality, cost, or deployment constraints need a sidecar workaround.

Frequency: Every macOS PTT voice input for users who do not want Apple Speech as the transcription backend.

Consequence: Extra custom tooling, duplicated native app behavior, harder debugging, weaker status/failure UX, and less reliable session/delivery integration than the built-in macOS app could provide.

Evidence/examples

- macOS Voice Wake / Push-to-Talk documentation describes the current PTT path: https://docs.openclaw.ai/platforms/mac/voicewake
- OpenClaw already has audio transcription configuration through tools.media.audio: https://docs.openclaw.ai/nodes/audio
- Related but not identical issues:
  - Telegram/Groq audio transcription regression: https://github.com/openclaw/openclaw/issues/59502
  - Telegram voice STT regression: https://github.com/openclaw/openclaw/issues/62205
  - Telegram voice -> local Whisper STT/TTS feature: https://github.com/openclaw/openclaw/issues/18424

Additional information

A concrete target architecture could be:

VoicePushToTalk capture layer
-> STTBackend protocol/interface
-> AppleSpeechSTTBackend default implementation
-> GatewayMediaAudioSTTBackend batch implementation
-> existing VoiceWakeForwarder/session delivery path
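
To make the layering above concrete, here is a rough interface sketch in TypeScript (the real implementation would live in the native macOS app). `STTBackend`, `AppleSpeechSTTBackend`, and `GatewayMediaAudioSTTBackend` are taken from the diagram; `Transcriber`, `Forwarder`, and `handleUtterance` are hypothetical helpers introduced for illustration. The actual Apple Speech and Gateway calls are not specified in this request, so both backends take an injected function instead of calling a real API, and realtime partial transcripts are left out.

```typescript
// Hypothetical sketch of the proposed layering; not OpenClaw's actual API.
type Transcriber = (audio: Uint8Array) => Promise<string>;

interface STTBackend {
  readonly id: string;
  readonly mode: "streaming" | "batch";
  // Resolves to the final transcript for one finalized recording.
  transcribe(audio: Uint8Array): Promise<string>;
}

// Default backend. The injected function stands in for the Apple Speech call.
class AppleSpeechSTTBackend implements STTBackend {
  readonly id = "apple-speech";
  readonly mode = "streaming";
  private readonly recognize: Transcriber;
  constructor(recognize: Transcriber) {
    this.recognize = recognize;
  }
  transcribe(audio: Uint8Array): Promise<string> {
    return this.recognize(audio);
  }
}

// Batch backend. The injected function stands in for whatever request the
// Gateway's tools.media.audio path would need.
class GatewayMediaAudioSTTBackend implements STTBackend {
  readonly id = "gateway-media-audio";
  readonly mode = "batch";
  private readonly send: Transcriber;
  constructor(send: Transcriber) {
    this.send = send;
  }
  transcribe(audio: Uint8Array): Promise<string> {
    return this.send(audio);
  }
}

// Stand-in for the existing VoiceWakeForwarder/session delivery path.
interface Forwarder {
  forward(text: string): Promise<void>;
}

// Capture layer -> backend -> forwarder. Returns false when there was
// nothing to forward (empty or whitespace-only transcript).
async function handleUtterance(
  audio: Uint8Array,
  backend: STTBackend,
  forwarder: Forwarder,
): Promise<boolean> {
  const text = (await backend.transcribe(audio)).trim();
  if (text.length === 0) return false;
  await forwarder.forward(text);
  return true;
}
```

With this shape, the capture layer and the forwarder would not need to know which backend produced the text, which is the main point of the abstraction.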

The important compatibility constraint is that existing Apple Speech behavior should remain the default. Users who do not configure another backend should see no behavior change.
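
The default-preserving behavior could be expressed as a small resolution step. This is a sketch under assumptions: the `sttBackend` setting name, the backend identifiers, and `resolveBackend` are all hypothetical, and OpenClaw's real configuration keys may differ. Silently falling back to Apple Speech on an unrecognized value is also a design assumption; maintainers might prefer to surface an error instead.

```typescript
// Hypothetical setting shape; not an existing OpenClaw configuration key.
type BackendId = "apple-speech" | "gateway-media-audio";

interface PushToTalkSettings {
  sttBackend?: string;
}

const DEFAULT_BACKEND: BackendId = "apple-speech";

// Picks the backend for Push-to-Talk. Anything other than an explicit,
// recognized opt-in keeps the current Apple Speech behavior.
function resolveBackend(settings: PushToTalkSettings | undefined): BackendId {
  const requested = settings?.sttBackend;
  if (requested === "gateway-media-audio") return requested;
  // Missing, empty, or unrecognized values fall back to the default.
  return DEFAULT_BACKEND;
}

console.log(resolveBackend(undefined)); // prints: apple-speech
console.log(resolveBackend({ sttBackend: "gateway-media-audio" })); // prints: gateway-media-audio
```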
