openclaw: Feature request: add Speaches providers for STT and TTS



Summary

Please add first-class Speaches provider support for both speech-to-text (STT) and text-to-speech (TTS) in OpenClaw voice calls.

Speaches exposes OpenAI-compatible endpoints and can run local/free models such as faster-whisper for STT and Kokoro ONNX for TTS. It would be useful to configure it directly as a supported provider instead of relying on custom extension glue or treating it as generic OpenAI-compatible plumbing.

Motivation

Local voice calls benefit from a fully local/free speech stack:

  • STT: Speaches + faster-whisper models, e.g. Systran/faster-distil-whisper-small.en
  • TTS: Speaches + Kokoro, e.g. speaches-ai/Kokoro-82M-v1.0-ONNX

This is especially useful for development, privacy-sensitive installs, and low-cost personal deployments.

Requested behavior

Add documented provider support for:

  1. Speaches STT provider

    • Realtime transcription for Twilio Media Streams / voice-call streaming
    • Configurable baseUrl, apiKey, model, VAD/silence settings, and Twilio μ-law conversion if needed
    • Works with Speaches /v1/realtime transcription sessions
  2. Speaches TTS provider

    • Voice-call TTS via Speaches OpenAI-compatible /v1/audio/speech
    • Configurable baseUrl, apiKey, model, and voice
    • Should work with Kokoro models served by Speaches
  3. Docs/config examples

    • Example OpenClaw config for local Speaches STT + TTS
    • Recommended models for latency-sensitive calls
    • Notes about CPU latency and preloading models
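As a rough illustration of the TTS half of this request, the sketch below builds a request against the OpenAI-compatible `/v1/audio/speech` route. The JSON fields (`model`, `voice`, `input`, `response_format`) follow the OpenAI speech API; the helper names `build_speech_request` and `synthesize` are hypothetical, and whether Speaches accepts `wav` for every Kokoro model is an assumption to verify against a running server.

```python
import json
import urllib.request


def build_speech_request(base_url: str, api_key: str, model: str, voice: str,
                         text: str, response_format: str = "wav") -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible speech endpoint."""
    body = json.dumps({
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": response_format,  # assumption: server supports this format
    }).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Local installs may run without a key, so the header is optional here.
        headers["Authorization"] = f"Bearer {api_key}"
    url = base_url.rstrip("/") + "/audio/speech"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def synthesize(request: urllib.request.Request, timeout: float = 30.0) -> bytes:
    """Send the request and return the raw audio bytes from the response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the values from the example config, `build_speech_request("http://127.0.0.1:8000/v1", "", "speaches-ai/Kokoro-82M-v1.0-ONNX", "af_sky", "Hello")` targets `http://127.0.0.1:8000/v1/audio/speech`. Keeping request construction separate from sending makes the payload easy to inspect without a server.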

Example config shape

{
  "plugins": {
    "entries": {
      "voice-call": {
        "config": {
          "streaming": {
            "enabled": true,
            "provider": "speaches",
            "providers": {
              "speaches": {
                "baseUrl": "http://127.0.0.1:8000/v1",
                "model": "Systran/faster-distil-whisper-small.en",
                "apiKey": "...",
                "silenceDurationMs": 500,
                "vadThreshold": 0.5,
                "convertTwilioMulaw": true
              }
            }
          },
          "tts": {
            "provider": "speaches",
            "providers": {
              "speaches": {
                "baseUrl": "http://127.0.0.1:8000/v1",
                "model": "speaches-ai/Kokoro-82M-v1.0-ONNX",
                "voice": "af_sky",
                "apiKey": "..."
              }
            }
          }
        }
      }
    }
  }
}
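One way a provider might translate the `streaming` settings above into a realtime session setup message is sketched below. The event and field names (`transcription_session.update`, `turn_detection`, `silence_duration_ms`, `g711_ulaw`) are modeled on the OpenAI Realtime API; whether Speaches accepts exactly this shape on `/v1/realtime` is an assumption, as is the reading that `convertTwilioMulaw: true` means audio is converted to PCM16 before it is sent. The function name is hypothetical.

```python
from typing import Any


def build_transcription_session_update(cfg: dict[str, Any]) -> dict[str, Any]:
    """Map the proposed Speaches STT config onto an OpenAI-Realtime-style event."""
    # Assumption: when conversion is enabled the client sends PCM16,
    # otherwise it forwards Twilio's mu-law audio unchanged.
    audio_format = "pcm16" if cfg.get("convertTwilioMulaw", False) else "g711_ulaw"
    return {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": audio_format,
            "input_audio_transcription": {"model": cfg["model"]},
            "turn_detection": {
                "type": "server_vad",
                "threshold": cfg.get("vadThreshold", 0.5),
                "silence_duration_ms": cfg.get("silenceDurationMs", 500),
            },
        },
    }
```

The default values mirror the example config rather than any documented Speaches defaults.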

Why not just use OpenAI-compatible config?

OpenAI-compatible endpoints cover part of this, but Speaches has practical differences that are worth making first-class:

  • local model names and preload behavior
  • realtime transcription websocket behavior
  • Twilio audio format conversion concerns
  • model latency guidance for CPU-only installs
  • clearer docs for fully local voice-call setups
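To make the audio format concern concrete: Twilio Media Streams deliver base64-encoded 8 kHz μ-law audio, so it typically has to be decoded to PCM before a Whisper model can consume it. The sketch below is a minimal pure-Python G.711 μ-law decoder. The function names are hypothetical, and resampling from 8 kHz to the 16 kHz that Whisper models expect is deliberately not shown.

```python
import base64
import struct


def _decode_mulaw_byte(value: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    value = ~value & 0xFF            # mu-law bytes are stored bit-inverted
    sign = value & 0x80
    exponent = (value >> 4) & 0x07
    mantissa = value & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Precompute all 256 values once so per-frame decoding is a table lookup.
_MULAW_TABLE = [_decode_mulaw_byte(b) for b in range(256)]


def mulaw_to_pcm16(mulaw: bytes) -> bytes:
    """Convert mu-law bytes to little-endian signed 16-bit PCM."""
    samples = [_MULAW_TABLE[b] for b in mulaw]
    return struct.pack(f"<{len(samples)}h", *samples)


def twilio_payload_to_pcm16(payload_b64: str) -> bytes:
    """Decode the base64 `media.payload` string from a Twilio media message."""
    return mulaw_to_pcm16(base64.b64decode(payload_b64))
```

Each μ-law byte becomes one 16-bit sample, so the output is twice the size of the input.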

Acceptance criteria

  • speaches is selectable as an STT/realtime transcription provider for voice-call streaming.
  • speaches is selectable as a TTS provider for voice calls.
  • Docs include a working local Speaches config example.
  • Provider config validation reports helpful errors when Speaches is unreachable or required model config is missing.
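The last criterion could look roughly like the sketch below. It is not OpenClaw's validation code: the function names, the required-key sets, and the error wording are all hypothetical, and probing `/models` assumes Speaches serves the OpenAI-compatible model listing route under the configured `baseUrl`.

```python
import urllib.error
import urllib.request
from typing import Any, Optional
from urllib.parse import urlparse

# Assumption: these are the keys each provider kind cannot work without.
REQUIRED_KEYS = {"stt": ("baseUrl", "model"), "tts": ("baseUrl", "model", "voice")}


def validate_speaches_config(kind: str, cfg: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the config looks usable."""
    errors = []
    for key in REQUIRED_KEYS[kind]:
        value = cfg.get(key)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"speaches {kind}: '{key}' is missing or empty")
    base_url = cfg.get("baseUrl")
    if isinstance(base_url, str) and base_url.strip():
        parsed = urlparse(base_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append(f"speaches {kind}: 'baseUrl' must be an http(s) URL, got {base_url!r}")
    return errors


def check_reachable(base_url: str, timeout: float = 2.0) -> Optional[str]:
    """Probe the server; return an error message, or None if it answered."""
    url = base_url.rstrip("/") + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return None
    except urllib.error.HTTPError:
        return None  # the server answered, even if with an error status
    except (urllib.error.URLError, OSError) as exc:
        return f"Speaches is unreachable at {url}: {exc}"
```

Separating static checks from the network probe lets the config be validated at load time without requiring the Speaches server to be running.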
