openclaw - [Feature]: Pluggable STT backend for macOS Push-to-Talk

Summary

Allow the macOS VoicePushToTalk flow to use a configurable speech-to-text backend, including Gateway tools.media.audio or a private/local ASR service, instead of always using Apple Speech.

Problem to solve

The macOS app's native Push-to-Talk experience has the right UX primitives: global hotkey handling, microphone capture, overlay state, app lifecycle integration, and existing forwarding into OpenClaw sessions.

However, the current flow appears to couple VoicePushToTalk directly to Apple Speech. That makes the native PTT path hard to use for users who already have a preferred OpenClaw audio transcription stack, such as tools.media.audio, a local Whisper/Qwen/Parakeet service, or a trusted private LAN ASR host.

The workaround is to build a separate sidecar PTT wrapper that records audio, sends it to the preferred ASR backend, then injects text back into OpenClaw. That works, but it duplicates hotkey handling, recording state, failure handling, session routing, and delivery behavior that the macOS app already handles better.

This is difficult to solve as a normal third-party plugin because the missing abstraction is inside the native macOS PTT capture/transcription pipeline, before text reaches the Gateway/session layer.

Proposed solution

Add an STT backend abstraction for macOS VoicePushToTalk.

Suggested behavior:

  • Keep Apple Speech as the default backend and preserve the existing realtime partial transcript behavior.
  • Add a batch backend for Push-to-Talk:
    • Hotkey down: record microphone audio locally.
    • Hotkey up: finalize the audio.
    • Send the audio to the configured backend.
    • Receive transcript.
    • Reuse the existing VoiceWakeForwarder/session delivery path.
  • Support a Gateway-backed option that uses OpenClaw's configured tools.media.audio transcription path.
  • Optionally support a local/private HTTP backend or command backend later, if maintainers think that is acceptable from a security and UX perspective.
  • In batch mode, the overlay can show state transitions such as Recording -> Transcribing -> Sent/Failed instead of live partial transcript text.
  • Keep wake-word mode on Apple Speech initially unless a streaming/wake-word backend abstraction is added later.
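
The batch flow in the list above can be sketched as a small state machine. This is a minimal TypeScript illustration under stated assumptions, not the app's actual code: `BatchSTTBackend`, `BatchPushToTalkSession`, `OverlayState`, and the `deliver` callback are hypothetical names, and the native macOS app would write the same shape in its own language. `deliver` stands in for the existing VoiceWakeForwarder/session delivery path.

```typescript
// Hypothetical names throughout; none of these types are confirmed OpenClaw APIs.
type OverlayState = "idle" | "recording" | "transcribing" | "sent" | "failed";

// Assumed shape of a batch backend: finalized audio in, transcript out.
interface BatchSTTBackend {
  transcribe(audio: Uint8Array): Promise<string>;
}

class BatchPushToTalkSession {
  state: OverlayState = "idle";
  private chunks: Uint8Array[] = [];
  private backend: BatchSTTBackend;
  private deliver: (text: string) => Promise<void>;
  private onState: (s: OverlayState) => void;

  constructor(
    backend: BatchSTTBackend,
    deliver: (text: string) => Promise<void>, // stand-in for the existing delivery path
    onState: (s: OverlayState) => void = () => {},
  ) {
    this.backend = backend;
    this.deliver = deliver;
    this.onState = onState;
  }

  private setState(s: OverlayState): void {
    this.state = s;
    this.onState(s); // the overlay would render this transition
  }

  // Hotkey down: start buffering microphone audio locally.
  hotkeyDown(): void {
    this.chunks = [];
    this.setState("recording");
  }

  // Called by the capture layer while the hotkey is held.
  appendAudio(chunk: Uint8Array): void {
    if (this.state === "recording") this.chunks.push(chunk);
  }

  // Hotkey up: finalize the audio, transcribe it, then reuse the delivery path.
  async hotkeyUp(): Promise<void> {
    if (this.state !== "recording") return;
    const audio = concat(this.chunks);
    this.setState("transcribing");
    try {
      const text = (await this.backend.transcribe(audio)).trim();
      // Assumption: an empty transcript is reported as a failure, not delivered.
      if (text.length === 0) throw new Error("empty transcript");
      await this.deliver(text);
      this.setState("sent");
    } catch {
      this.setState("failed");
    }
  }
}

// Joins the buffered chunks into one finalized audio buffer.
function concat(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```

Backend errors and delivery errors both end in the `failed` state here. That keeps the overlay to the Recording -> Transcribing -> Sent/Failed sequence proposed above. A real implementation might want to distinguish the two cases for retries.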

This request is specifically for native macOS Push-to-Talk voice input. It is not asking for full realtime voice conversation mode or speech-to-speech.

Alternatives considered

  • Separate wrapper app: works, but duplicates hotkey capture, microphone capture, overlay state, retries, notifications, and session delivery logic. It is also more fragile across macOS permissions and app lifecycle changes.
  • Arbitrary external shell command from the macOS app: flexible, but may be too broad from a security and support perspective.
  • Full realtime/talk-mode backend replacement: useful, but larger than this request. Batch PTT transcription is a smaller and more incremental capability.
  • Continue using only Apple Speech: simple, but prevents users from using their existing private/local OpenClaw transcription setup.

Impact

Affected users/systems/channels: macOS users who use native Push-to-Talk and want local, private, or custom speech-to-text routing.

Severity: Medium. Current native PTT works, but the STT backend is not configurable, so users with privacy, language quality, cost, or deployment constraints need a sidecar workaround.

Frequency: Every macOS PTT voice input for users who do not want Apple Speech as the transcription backend.

Consequence: Extra custom tooling, duplicated native app behavior, harder debugging, weaker status/failure UX, and less reliable session/delivery integration than the built-in macOS app could provide.

Evidence/examples

Additional information

A concrete target architecture could be:

VoicePushToTalk capture layer
-> STTBackend protocol/interface
   -> AppleSpeechSTTBackend (default implementation)
   -> GatewayMediaAudioSTTBackend (batch implementation)
-> existing VoiceWakeForwarder/session delivery path
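
As a rough illustration of that layering, here is a TypeScript sketch. The type names mirror the diagram, but every signature is an assumption. The issue does not specify how the app would call Apple Speech or reach tools.media.audio, so both are injected as plain functions. Modeling Apple Speech as a single `transcribe` call is also a simplification, since the existing path streams partial results.

```typescript
// Names mirror the diagram; signatures are assumptions made for illustration.
interface STTBackend {
  readonly id: string;
  // True when the backend can show live partial transcripts while recording.
  readonly supportsPartials: boolean;
  transcribe(audio: Uint8Array): Promise<string>;
}

type Transcriber = (audio: Uint8Array) => Promise<string>;

// Default backend: wraps whatever the app does with Apple Speech today (injected here).
class AppleSpeechSTTBackend implements STTBackend {
  readonly id = "apple-speech";
  readonly supportsPartials = true;
  private recognize: Transcriber;

  constructor(recognize: Transcriber) {
    this.recognize = recognize;
  }

  transcribe(audio: Uint8Array): Promise<string> {
    return this.recognize(audio);
  }
}

// Batch backend: sends finalized audio to the Gateway transcription path (injected here).
class GatewayMediaAudioSTTBackend implements STTBackend {
  readonly id = "gateway-media-audio";
  readonly supportsPartials = false;
  private sendToGateway: Transcriber;

  constructor(sendToGateway: Transcriber) {
    this.sendToGateway = sendToGateway;
  }

  transcribe(audio: Uint8Array): Promise<string> {
    return this.sendToGateway(audio);
  }
}

// The overlay shows live text only when the backend can provide partials.
function overlayModeFor(backend: STTBackend): "live-partials" | "state-only" {
  return backend.supportsPartials ? "live-partials" : "state-only";
}

// The capture layer depends only on the interface, then hands text to the
// existing delivery path (represented by `forward`).
async function transcribeAndForward(
  backend: STTBackend,
  audio: Uint8Array,
  forward: (text: string) => Promise<void>,
): Promise<string> {
  const text = await backend.transcribe(audio);
  await forward(text);
  return text;
}
```

Because the capture layer only sees `STTBackend`, a local or private HTTP backend could be added later as a third implementation without touching hotkey handling or delivery.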

The important compatibility constraint is that existing Apple Speech behavior should remain the default. Users who do not configure another backend should see no behavior change.
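
One way to satisfy that constraint is to resolve the backend from an optional setting and fall back to Apple Speech. This is a minimal TypeScript sketch: the `sttBackend` setting name and its values are hypothetical, not existing OpenClaw configuration keys. It also keeps wake-word mode on Apple Speech, as proposed earlier.

```typescript
// Hypothetical identifiers; not existing OpenClaw configuration.
type BackendId = "apple-speech" | "gateway-media-audio";
type VoiceMode = "push-to-talk" | "wake-word";

interface VoiceSettings {
  // Absent for every existing user, so the default must match today's behavior.
  sttBackend?: string;
}

function resolveBackendId(settings: VoiceSettings, mode: VoiceMode): BackendId {
  // Wake-word mode stays on Apple Speech until a streaming abstraction exists.
  if (mode === "wake-word") return "apple-speech";

  switch (settings.sttBackend) {
    case "gateway-media-audio":
      return "gateway-media-audio";
    default:
      // Missing or unrecognized values keep the current Apple Speech behavior.
      return "apple-speech";
  }
}
```

Silently falling back on an unrecognized value is a design choice in this sketch. Surfacing a configuration warning instead may be preferable, so that a typo does not quietly route audio to a backend the user did not intend.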
