openclaw - 💡(How to fix) Fix Pluggable STT Providers for voice-call Plugin [1 participants]

GitHub stats
openclaw/openclaw#68697 · Fetched 2026-04-19 15:08:31
Comments: 0 · Participants: 1 · Timeline events: 1 (labeled ×1) · Reactions: 0

Add a pluggable STT provider interface to the voice-call plugin, mirroring the existing TTS provider pattern (api.registerSpeechProvider).



Summary

Add a pluggable STT provider interface to the voice-call plugin, mirroring the existing TTS provider pattern (api.registerSpeechProvider).

Problem to solve

The voice-call plugin's streaming STT is hardcoded to openai-realtime in three places: the zod schema enum, the initializeMediaStreaming() method, and the openclaw.plugin.json config schema. There is no way to use an alternative STT provider (AWS Transcribe, Deepgram, local Whisper, etc.) without patching compiled dist files. This forces all voice-call users to depend on OpenAI for transcription regardless of their infrastructure preferences.

Proposed solution

Add api.registerRealtimeTranscriptionProvider(provider) to the plugin SDK. The provider interface already exists informally — OpenAIRealtimeSTTProvider has a clean contract: createSession() returning a session with sendAudio(), onTranscript(), onPartial(), onSpeechStart(), close(), and isConnected(). Making this pluggable requires: expanding the sttProvider config to accept any registered provider ID, and having initializeMediaStreaming() resolve the provider from the registry instead of directly instantiating OpenAIRealtimeSTTProvider.
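The registry-based resolution described above could look roughly like the following sketch. It is an illustration under stated assumptions, not the plugin's actual code: `providerRegistry`, `resolveProvider`, `createCheckedSession`, the `id` field on the provider, and the stub provider are hypothetical names. Only `registerRealtimeTranscriptionProvider`, `createSession()`, and the six session methods come from the proposal.

```javascript
// Hypothetical sketch of a provider registry; not the actual openclaw SDK.
const providerRegistry = new Map();

// Session methods the issue lists for the object returned by createSession().
const SESSION_METHODS = [
  'sendAudio', 'onTranscript', 'onPartial', 'onSpeechStart', 'close', 'isConnected',
];

// Proposed SDK entry point. The `id` field on the provider is an assumption.
function registerRealtimeTranscriptionProvider(provider) {
  if (!provider || typeof provider.id !== 'string' || typeof provider.createSession !== 'function') {
    throw new TypeError('provider must have a string id and a createSession() function');
  }
  providerRegistry.set(provider.id, provider);
}

// What initializeMediaStreaming() might call instead of instantiating
// OpenAIRealtimeSTTProvider directly.
function resolveProvider(id) {
  const provider = providerRegistry.get(id);
  if (!provider) throw new Error(`Unknown STT provider: ${id}`);
  return provider;
}

// Create a session and verify that it satisfies the informal contract.
function createCheckedSession(id) {
  const session = resolveProvider(id).createSession();
  for (const name of SESSION_METHODS) {
    if (typeof session[name] !== 'function') {
      throw new TypeError(`session from "${id}" is missing ${name}()`);
    }
  }
  return session;
}

// Stub provider for illustration only: reports how many bytes it received.
registerRealtimeTranscriptionProvider({
  id: 'stub',
  createSession() {
    let connected = true;
    const transcriptHandlers = [];
    return {
      sendAudio(chunk) {
        if (connected) transcriptHandlers.forEach((cb) => cb(`received ${chunk.length} bytes`));
      },
      onTranscript(cb) { transcriptHandlers.push(cb); },
      onPartial() {},
      onSpeechStart() {},
      close() { connected = false; },
      isConnected() { return connected; },
    };
  },
});
```

Checking the session shape at creation time is one possible design choice: it surfaces an incomplete third-party provider when the call starts rather than when the first audio frame arrives.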

Alternatives considered

  • Patching the compiled dist files after each update (current workaround — fragile, requires a systemd watcher or manual reapplication)
  • Using a SOCKS proxy or external adapter to intercept the OpenAI WebSocket and redirect to another provider (over-engineered, adds latency)
  • Disabling streaming STT and using buffered transcription (loses real-time capability)

Impact

Opens the voice-call plugin to the broader STT ecosystem. AWS Transcribe, Deepgram, Azure Speech, and local Whisper all have streaming transcription APIs. Users running on their own hardware (edge boxes, self-hosted) benefit most — they can choose providers based on cost, latency, privacy, or regulatory requirements rather than being locked to OpenAI.

Evidence/examples

I've built an AWS Transcribe STT provider that implements the same interface as OpenAIRealtimeSTTProvider, including mu-law to PCM decoding for Twilio, speech detection from partial results, and configurable silence thresholds. Full source: https://github.com/agenticbrian/openclaw-provider-aws-polly/blob/master/transcribe-stt.js — ready to contribute as a PR if the pluggable interface lands.
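For readers unfamiliar with the mu-law to PCM step mentioned above, here is a minimal sketch of standard G.711 mu-law expansion to 16-bit linear PCM. It is not taken from the linked transcribe-stt.js; the function names `muLawDecodeSample` and `decodeMuLaw` are hypothetical, and the linked provider may implement the conversion differently.

```javascript
// Standard G.711 mu-law expansion. Illustrative only; not the linked provider's code.
const MULAW_BIAS = 0x84; // 132, the bias added during mu-law compression

// Decode one 8-bit mu-law byte to a signed 16-bit linear sample.
function muLawDecodeSample(byte) {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;             // top bit: sign
  const exponent = (u >> 4) & 0x07;  // next 3 bits: segment number
  const mantissa = u & 0x0f;         // low 4 bits: step within the segment
  const magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS;
  return sign ? -magnitude : magnitude;
}

// Decode a whole buffer of mu-law bytes into an Int16Array of PCM samples.
function decodeMuLaw(bytes) {
  const pcm = new Int16Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) {
    pcm[i] = muLawDecodeSample(bytes[i]);
  }
  return pcm;
}
```

A provider whose backend expects linear PCM would run each incoming audio chunk through a conversion like `decodeMuLaw` before forwarding it from `sendAudio()`.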

Additional information

Also related: the responseAgent config field from #9635 would complement this — currently const agentId = "main" is hardcoded in the response generator, requiring a separate patch to route voice calls to a specific agent. Together, pluggable STT + responseAgent would make the voice-call plugin fully configurable for multi-agent, multi-provider voice setups.
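As a rough illustration of the responseAgent idea, the hardcoded agent ID could be replaced with a lookup that falls back to "main". This is a sketch only: `resolveAgentId` and the shape of the config object are assumptions, and #9635 may define the field differently.

```javascript
// Hypothetical sketch: pick the agent for a voice call from plugin config.
// `responseAgent` is the field proposed in #9635; the config shape is assumed.
const DEFAULT_AGENT_ID = 'main'; // the value currently hardcoded, per the issue

function resolveAgentId(config) {
  const candidate = config && config.responseAgent;
  // Fall back to the default when the field is absent, empty, or not a string.
  if (typeof candidate !== 'string' || candidate.trim() === '') {
    return DEFAULT_AGENT_ID;
  }
  return candidate.trim();
}
```

Falling back to "main" keeps existing deployments unchanged when the field is not set.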

extent analysis

TL;DR

Implement a pluggable STT provider interface in the voice-call plugin to allow users to choose from various transcription providers.

Guidance

  • Define the api.registerRealtimeTranscriptionProvider(provider) method in the plugin SDK to register alternative STT providers.
  • Expand the sttProvider config to accept any registered provider ID, enabling users to select their preferred provider.
  • Modify the initializeMediaStreaming() method to resolve the provider from the registry instead of directly instantiating OpenAIRealtimeSTTProvider.
  • Consider contributing the existing AWS Transcribe STT provider implementation as a reference example for other providers.

Example

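// Example provider registration (sketch). The `id` field is an assumption;
// the issue only specifies registerRealtimeTranscriptionProvider(provider).
api.registerRealtimeTranscriptionProvider({
  id: 'aws-transcribe',
  // createSession() returns the session object that carries the streaming methods
  createSession: () => ({
    sendAudio: (chunk) => { /* forward audio to the STT backend */ },
    onTranscript: (cb) => { /* register a final-transcript callback */ },
    onPartial: (cb) => { /* register a partial-result callback */ },
    onSpeechStart: (cb) => { /* register a speech-start callback */ },
    close: () => { /* tear down the stream */ },
    isConnected: () => false, // report the real connection state here
  }),
});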

Notes

The provider interface and the registry need to be specified explicitly (session lifecycle, callback signatures, and how a provider ID maps to a registered provider) so that providers other than OpenAI can plug in without further changes to the voice-call plugin.

Recommendation

Implement the pluggable STT provider interface as proposed. Unlike the workarounds listed under "Alternatives considered", it lets users choose their preferred transcription provider without patching compiled files.
