openclaw - Pluggable STT Providers for voice-call Plugin [1 participant]

openclaw • 2026-04-18 21:12:01

openclaw/openclaw#68697 • Fetched 2026-04-19 15:08:31

Author

agenticbrian

Participants

agenticbrian

Timeline (top)

labeled ×1

Add a pluggable STT provider interface to the voice-call plugin, mirroring the existing TTS provider pattern (api.registerSpeechProvider).

Fix / Workaround

The voice-call plugin's streaming STT is hardcoded to openai-realtime in three places: the zod schema enum, the initializeMediaStreaming() method, and the openclaw.plugin.json config schema. There is no way to use an alternative STT provider (AWS Transcribe, Deepgram, local Whisper, etc.) without patching compiled dist files. This forces all voice-call users to depend on OpenAI for transcription regardless of their infrastructure preferences.


Also related: the responseAgent config field from #9635 would complement this — currently const agentId = "main" is hardcoded in the response generator, requiring a separate patch to route voice calls to a specific agent. Together, pluggable STT + responseAgent would make the voice-call plugin fully configurable for multi-agent, multi-provider voice setups.
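As a rough illustration of the responseAgent idea, the hardcoded agent ID could become a config lookup with "main" as the fallback. This is only a sketch: the helper name resolveResponseAgentId and the placement of responseAgent as a top-level config field are assumptions, not taken from the plugin source or from #9635.

```javascript
// Hypothetical helper: picks the agent that answers voice calls.
// Assumes `responseAgent` is a top-level string field in the plugin config.
function resolveResponseAgentId(config) {
  const configured =
    typeof config?.responseAgent === "string" ? config.responseAgent.trim() : "";
  // Fall back to the currently hardcoded value when nothing is configured.
  return configured || "main";
}

const defaultAgent = resolveResponseAgentId({});
const customAgent = resolveResponseAgentId({ responseAgent: "support-line" });
console.log(defaultAgent, customAgent);
```

Keeping "main" as the default would leave existing installations unchanged when the field is absent.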


Summary

Add a pluggable STT provider interface to the voice-call plugin, mirroring the existing TTS provider pattern (api.registerSpeechProvider).

Problem to solve

Proposed solution

Add api.registerRealtimeTranscriptionProvider(provider) to the plugin SDK. The provider interface already exists informally — OpenAIRealtimeSTTProvider has a clean contract: createSession() returning a session with sendAudio(), onTranscript(), onPartial(), onSpeechStart(), close(), and isConnected(). Making this pluggable requires: expanding the sttProvider config to accept any registered provider ID, and having initializeMediaStreaming() resolve the provider from the registry instead of directly instantiating OpenAIRealtimeSTTProvider.
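The registry lookup described above might look roughly like the following. It is a minimal sketch, not the plugin's code: the RealtimeTranscriptionRegistry class, the provider `id` field, the createSttSession helper, and the "echo" test provider are all hypothetical. Only the session method names and the "openai-realtime" default come from the issue.

```javascript
// Hypothetical registry that api.registerRealtimeTranscriptionProvider() could delegate to.
class RealtimeTranscriptionRegistry {
  constructor() {
    this.providers = new Map();
  }

  // Assumes each provider exposes a string `id` and a createSession() function.
  register(provider) {
    if (!provider || typeof provider.id !== "string" || typeof provider.createSession !== "function") {
      throw new TypeError("provider must have a string id and a createSession() function");
    }
    this.providers.set(provider.id, provider);
  }

  resolve(id) {
    const provider = this.providers.get(id);
    if (!provider) {
      const known = [...this.providers.keys()].join(", ") || "(none)";
      throw new Error(`Unknown STT provider "${id}"; registered: ${known}`);
    }
    return provider;
  }
}

// Stand-in for the lookup initializeMediaStreaming() would perform instead of
// instantiating OpenAIRealtimeSTTProvider directly.
function createSttSession(registry, config) {
  const providerId = config.sttProvider ?? "openai-realtime";
  return registry.resolve(providerId).createSession();
}

// Toy provider used only to exercise the flow; it reports the size of each audio chunk.
function createEchoProvider() {
  return {
    id: "echo",
    createSession() {
      let connected = true;
      let transcriptHandler = () => {};
      return {
        sendAudio(chunk) {
          if (connected) transcriptHandler(`received ${chunk.length} bytes`);
        },
        onTranscript(callback) { transcriptHandler = callback; },
        onPartial() {},
        onSpeechStart() {},
        close() { connected = false; },
        isConnected() { return connected; },
      };
    },
  };
}

const registry = new RealtimeTranscriptionRegistry();
registry.register(createEchoProvider());

const session = createSttSession(registry, { sttProvider: "echo" });
const transcripts = [];
session.onTranscript((text) => transcripts.push(text));
session.sendAudio(new Uint8Array(160));
session.close();
console.log(transcripts);
```

Failing loudly on an unknown ID is one possible choice; silently falling back to the OpenAI provider would hide configuration mistakes.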

Alternatives considered

Patching the compiled dist files after each update (current workaround — fragile, requires a systemd watcher or manual reapplication)
Using a SOCKS proxy or external adapter to intercept the OpenAI WebSocket and redirect to another provider (over-engineered, adds latency)
Disabling streaming STT and using buffered transcription (loses real-time capability)

Impact

Opens the voice-call plugin to the broader STT ecosystem. AWS Transcribe, Deepgram, Azure Speech, and local Whisper all have streaming transcription APIs. Users running on their own hardware (edge boxes, self-hosted) benefit most — they can choose providers based on cost, latency, privacy, or regulatory requirements rather than being locked to OpenAI.

Evidence/examples

I've built an AWS Transcribe STT provider that implements the same interface as OpenAIRealtimeSTTProvider, including mu-law to PCM decoding for Twilio, speech detection from partial results, and configurable silence thresholds. Full source: https://github.com/agenticbrian/openclaw-provider-aws-polly/blob/master/transcribe-stt.js (ready to contribute as a PR if the pluggable interface lands).
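For readers unfamiliar with the mu-law step mentioned above: a provider that expects 16-bit linear PCM has to expand each 8-bit mu-law byte first. The sketch below uses the standard G.711 mu-law expansion. It is not copied from the linked transcribe-stt.js, and the function names are illustrative.

```javascript
// Expand one G.711 mu-law byte to a signed 16-bit linear PCM sample.
function decodeMulawSample(byte) {
  const u = ~byte & 0xff;           // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  // 0x84 (132) is the bias added during encoding; remove it after scaling.
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a buffer of mu-law bytes into an Int16Array of PCM samples.
function decodeMulaw(bytes) {
  const pcm = new Int16Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) {
    pcm[i] = decodeMulawSample(bytes[i]);
  }
  return pcm;
}

// 0xff and 0x7f are the two encodings of silence; 0x00 and 0x80 are the extremes.
const pcm = decodeMulaw(Uint8Array.from([0xff, 0x7f, 0x00, 0x80]));
console.log(Array.from(pcm));
```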

Additional information

extent analysis

TL;DR

Implement a pluggable STT provider interface in the voice-call plugin to allow users to choose from various transcription providers.

Guidance

Define the api.registerRealtimeTranscriptionProvider(provider) method in the plugin SDK to register alternative STT providers.
Expand the sttProvider config to accept any registered provider ID, enabling users to select their preferred provider.
Modify the initializeMediaStreaming() method to resolve the provider from the registry instead of directly instantiating OpenAIRealtimeSTTProvider.
Consider contributing the existing AWS Transcribe STT provider implementation as a reference example for other providers.
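The config change in the second point could be approximated by checking sttProvider against the registered IDs at load time rather than against a fixed enum. The sketch below is plain JavaScript so it runs without dependencies. The real schema is described as a zod enum, and the helper name validateSttProvider and the example ID list are assumptions.

```javascript
// Hypothetical validator: accepts any provider ID that has been registered,
// instead of a fixed enum containing only "openai-realtime".
function validateSttProvider(config, registeredIds) {
  const value = config.sttProvider;
  if (typeof value !== "string" || value.length === 0) {
    return { ok: false, error: "sttProvider must be a non-empty string" };
  }
  if (!registeredIds.includes(value)) {
    return {
      ok: false,
      error: `sttProvider "${value}" is not registered (available: ${registeredIds.join(", ")})`,
    };
  }
  return { ok: true, value };
}

// Example IDs only; the actual list would come from the provider registry.
const registeredIds = ["openai-realtime", "aws-transcribe"];
const accepted = validateSttProvider({ sttProvider: "aws-transcribe" }, registeredIds);
const rejected = validateSttProvider({ sttProvider: "deepgram" }, registeredIds);
console.log(accepted, rejected);
```

Because the set of valid IDs is only known after plugins have registered, this check would have to run after provider registration rather than in a static schema.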

Example

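// Example provider registration (sketch; the `id` field is an assumption,
// the session methods follow the contract described in the proposal)
api.registerRealtimeTranscriptionProvider({
  id: 'aws-transcribe',
  // createSession() returns the session object that carries the streaming methods
  createSession: () => ({
    sendAudio: (chunk) => { /* forward audio to the provider */ },
    onTranscript: (callback) => { /* register final-transcript handler */ },
    onPartial: (callback) => { /* register partial-result handler */ },
    onSpeechStart: (callback) => { /* register speech-start handler */ },
    close: () => { /* tear down the stream */ },
    isConnected: () => false, // placeholder; return the real connection state
  }),
});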

Notes

The provider interface and the registry need to be specified carefully so that providers other than OpenAI can integrate without provider-specific changes to the plugin.

Recommendation

Implement the pluggable STT provider interface. It is the proposed fix rather than a workaround, and it lets users choose their transcription provider without patching compiled dist files.
