openclaw - Feature request: add Speaches providers for STT and TTS

StepCodex · 2026-05-11T17:04:55Z



Please add first-class Speaches provider support for both speech-to-text (STT) and text-to-speech (TTS) in OpenClaw voice calls.

Speaches exposes OpenAI-compatible endpoints and can run local/free models such as faster-whisper for STT and Kokoro ONNX for TTS. It would be useful to configure it directly as a supported provider instead of relying on custom extension glue or treating it as generic OpenAI-compatible plumbing.



Motivation

Local voice calls benefit from a fully local/free speech stack:

- STT: Speaches + faster-whisper models, e.g. `Systran/faster-distil-whisper-small.en`
- TTS: Speaches + Kokoro, e.g. `speaches-ai/Kokoro-82M-v1.0-ONNX`

This is especially useful for development, privacy-sensitive installs, and low-cost personal deployments.

Requested behavior

Add documented provider support for:

1. Speaches STT provider
   - Realtime transcription for Twilio Media Streams / voice-call streaming
   - Configurable `baseUrl`, `apiKey`, `model`, VAD/silence settings, and Twilio μ-law conversion if needed
   - Works with Speaches `/v1/realtime` transcription sessions
2. Speaches TTS provider
   - Voice-call TTS via the Speaches OpenAI-compatible `/v1/audio/speech` endpoint
   - Configurable `baseUrl`, `apiKey`, `model`, and `voice`
   - Should work with Kokoro models served by Speaches
3. Docs/config examples
   - Example OpenClaw config for local Speaches STT + TTS
   - Recommended models for latency-sensitive calls
   - Notes about CPU latency and preloading models
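The TTS path requested above amounts to a single HTTP POST. Below is a minimal sketch, assuming the standard OpenAI-style speech request body (`model`, `voice`, `input`, `response_format`). The helper name `build_speech_request` and the choice of `wav` output are illustrative assumptions, not part of OpenClaw or Speaches.

```python
import json
import urllib.request


def build_speech_request(base_url: str, api_key: str, model: str,
                         voice: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible speech endpoint.

    Hypothetical helper: the field names follow the OpenAI speech API shape,
    which Speaches is described as being compatible with.
    """
    url = base_url.rstrip("/") + "/audio/speech"
    body = {
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": "wav",  # assumption: pick whatever the call pipeline needs
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


req = build_speech_request(
    "http://127.0.0.1:8000/v1",
    "secret",
    "speaches-ai/Kokoro-82M-v1.0-ONNX",
    "af_sky",
    "Hello from a local voice call.",
)
```

To actually fetch audio, pass `req` to `urllib.request.urlopen(req, timeout=30)` and read the response bytes. This requires a running Speaches server at the configured `baseUrl`.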

Example config shape

```jsonc
{
  "plugins": {
    "entries": {
      "voice-call": {
        "config": {
          "streaming": {
            "enabled": true,
            "provider": "speaches",
            "providers": {
              "speaches": {
                "baseUrl": "http://127.0.0.1:8000/v1",
                "model": "Systran/faster-distil-whisper-small.en",
                "apiKey": "...",
                "silenceDurationMs": 500,
                "vadThreshold": 0.5,
                "convertTwilioMulaw": true
              }
            }
          },
          "tts": {
            "provider": "speaches",
            "providers": {
              "speaches": {
                "baseUrl": "http://127.0.0.1:8000/v1",
                "model": "speaches-ai/Kokoro-82M-v1.0-ONNX",
                "voice": "af_sky",
                "apiKey": "..."
              }
            }
          }
        }
      }
    }
  }
}
```

Why not just use OpenAI-compatible config?

OpenAI-compatible endpoints cover part of this, but Speaches has practical differences that are worth making first-class:

- local model names and preload behavior
- realtime transcription websocket behavior
- Twilio audio format conversion concerns
- model latency guidance for CPU-only installs
- clearer docs for fully local voice-call setups
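To illustrate the Twilio audio format concern: Twilio Media Streams deliver audio as base64-encoded 8 kHz μ-law, while speech models generally expect linear PCM. The sketch below shows what a `convertTwilioMulaw`-style step might do, using the standard G.711 μ-law expansion. The function names are hypothetical, and any resampling to the rate the transcription session expects is left out.

```python
import base64
import struct


def mulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_twilio_payload(payload_b64: str) -> bytes:
    """Decode a base64 mu-law payload into little-endian PCM16 bytes.

    Hypothetical helper: the sample rate stays at the source rate (8 kHz
    for Twilio); resampling is a separate step not shown here.
    """
    raw = base64.b64decode(payload_b64)
    samples = [mulaw_byte_to_pcm16(b) for b in raw]
    return struct.pack(f"<{len(samples)}h", *samples)


# Example: three mu-law bytes (silence, loudest negative, loudest positive).
example_payload = base64.b64encode(bytes([0xFF, 0x00, 0x80])).decode("ascii")
pcm = decode_twilio_payload(example_payload)
```

A pure-Python table lookup like this is fast enough for a single call's 8 kHz stream. A production provider would likely precompute the 256-entry table once.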

Acceptance criteria

- `speaches` is selectable as an STT/realtime transcription provider for voice-call streaming.
- `speaches` is selectable as a TTS provider for voice calls.
- Docs include a working local Speaches config example.
- Provider config validates with helpful errors when Speaches is unreachable or model config is missing.
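The last criterion could be approached with a small static check that runs before any network call. The following is only a sketch under assumptions: `validate_speaches_config` is a hypothetical name, the keys mirror the example config in this request, and reachability of the server would need a separate probe that is not shown.

```python
from urllib.parse import urlparse


def validate_speaches_config(cfg: dict, *, require_voice: bool = False) -> list[str]:
    """Return human-readable problems with a Speaches provider config.

    Hypothetical validator: checks only the static shape of the config.
    An empty list means no problems were found.
    """
    errors: list[str] = []

    base_url = cfg.get("baseUrl")
    if not base_url:
        errors.append("speaches.baseUrl is missing (e.g. http://127.0.0.1:8000/v1)")
    else:
        parsed = urlparse(base_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append(f"speaches.baseUrl is not a valid http(s) URL: {base_url!r}")

    if not cfg.get("model"):
        errors.append("speaches.model is missing; set the model id served by Speaches")

    # Only the TTS provider needs a voice in the example config.
    if require_voice and not cfg.get("voice"):
        errors.append("speaches.voice is missing; TTS needs a voice name such as af_sky")

    return errors


stt_errors = validate_speaches_config(
    {"baseUrl": "http://127.0.0.1:8000/v1",
     "model": "Systran/faster-distil-whisper-small.en"}
)
tts_errors = validate_speaches_config({"baseUrl": "localhost:8000"}, require_voice=True)
```

Collecting all problems into a list, rather than raising on the first one, lets the caller report every misconfigured field in one pass.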
