openclaw - 💡(How to fix) Fix [Feature]: voice-call plugin: Support Google Gemini Live as end-to-end audio provider (STT+LLM+TTS bypass) [1 comments, 2 participants]

openclaw/openclaw#60093 (fetched 2026-04-08 02:36:25)

Original Request

Create a github feature request on openclaw to support Google Live

English Translation: Add support for Google Gemini Live API in the voice-call plugin, enabling end-to-end audio conversations without the separate STT→LLM→TTS pipeline.

Agent's Two Cents (could be wrong)

Everything below is the AI agent's best guess based on the current codebase. Take with a grain of salt — the original request above is the only thing that came from a human.

Problem / Motivation

The current voice-call architecture uses a 3-step pipeline: OpenAI Realtime STT → Agent LLM (e.g., Claude Opus) → ElevenLabs TTS. This introduces ~4-7 seconds of latency per turn, making real-time phone conversations feel unnatural. Google Gemini Live API (gemini-2.5-flash) offers native bidirectional audio — audio in, audio out — in a single model call, potentially reducing latency to ~1-1.5 seconds.

Proposed Solution

Add a new streaming mode to the voice-call plugin that routes Twilio media stream audio directly to the Gemini Live WebSocket API, bypassing the separate STT and TTS steps entirely. The Gemini model would handle speech understanding, reasoning, and speech synthesis in one pass.

Architecture Diagram

Current (3-step, ~5-7s latency):
┌──────────┐    ┌─────────────────┐    ┌───────────┐    ┌──────────────┐    ┌──────────┐
│  Caller  │───▶│ Twilio Media    │───▶│ OpenAI    │───▶│ Agent LLM    │───▶│ ElevenLabs│
│ (Phone)  │◀───│ Stream (WS)     │◀───│ Realtime  │    │ (Claude/GPT) │    │ TTS      │
└──────────┘    └─────────────────┘    │ STT       │    └──────────────┘    └──────────┘
                                       └───────────┘

Proposed (1-step, ~1-1.5s latency):
┌──────────┐    ┌─────────────────┐    ┌──────────────────────┐
│  Caller  │───▶│ Twilio Media    │───▶│ Gemini Live API      │
│ (Phone)  │◀───│ Stream (WS)     │◀───│ (audio-in/audio-out) │
└──────────┘    └─────────────────┘    └──────────────────────┘

Dependencies & Potential Blockers

  • The voice-call plugin's STT provider is currently hardcoded to a single-value enum, z.enum(["openai-realtime"]), in config.ts
  • Need to abstract the streaming pipeline to support alternative end-to-end audio models
  • Gemini Live API uses a different WebSocket protocol than OpenAI Realtime
  • Agent context/tools integration: Gemini would need the system prompt and tool definitions that currently go through the Agent LLM
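
The enum constraint in the first bullet could be widened roughly as follows. The issue says the plugin validates with zod (z.enum([...])); this sketch is a dependency-free stand-in that uses a plain TypeScript union and a type guard instead. The names STT_PROVIDERS, isSttProvider, and parseSttProvider are hypothetical and not taken from config.ts.

```typescript
// Hypothetical stand-in for the z.enum(["openai-realtime"]) check in config.ts.
// Adding a provider means adding one string to this tuple.
const STT_PROVIDERS = ["openai-realtime", "gemini-live"] as const;
type SttProvider = (typeof STT_PROVIDERS)[number];

// Type guard: true only for one of the known provider names.
function isSttProvider(value: unknown): value is SttProvider {
  return typeof value === "string" && STT_PROVIDERS.some((p) => p === value);
}

// Parse-or-throw helper, mirroring how a schema parse rejects unknown values.
function parseSttProvider(value: unknown): SttProvider {
  if (!isSttProvider(value)) {
    throw new Error(`Unsupported sttProvider: ${String(value)}`);
  }
  return value;
}
```

In the real plugin the equivalent change would presumably be a one-line edit to the zod enum; the guard above only illustrates the behavior that edit should preserve, namely rejecting unknown provider names.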

External Setup Required

  • ⚠️ API key / credentials: Google AI API key with Gemini Live access
  • ⚠️ Model access: gemini-2.5-flash-native-audio-preview may require allowlist/waitlist

How to Validate

  • Configure sttProvider: "gemini-live" in voice-call config
  • Make an outbound call, speak to the agent, and receive audio responses
  • Measure end-to-end latency: should be < 2 seconds per turn
  • Verify the agent can access conversation context and tools
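
One way to make the latency check above repeatable is to record two timestamps per turn and compare them against the 2-second target. This is only a sketch: the TurnTiming shape and the function names are hypothetical, and it assumes the plugin can observe when the caller stops speaking and when the first response audio is sent back to Twilio.

```typescript
// Hypothetical per-turn record; timestamps are milliseconds (e.g. from Date.now()).
interface TurnTiming {
  callerStoppedSpeakingAt: number; // end of the caller's utterance
  firstResponseAudioAt: number;    // first response audio forwarded to Twilio
}

// Latency of one turn in milliseconds.
function turnLatencyMs(t: TurnTiming): number {
  return t.firstResponseAudioAt - t.callerStoppedSpeakingAt;
}

// Mean and worst latency across turns, plus whether every turn beat the budget.
function summarizeLatency(turns: TurnTiming[], budgetMs = 2000) {
  const latencies = turns.map(turnLatencyMs);
  const total = latencies.reduce((sum, ms) => sum + ms, 0);
  return {
    meanMs: latencies.length > 0 ? total / latencies.length : 0,
    worstMs: latencies.length > 0 ? Math.max(...latencies) : 0,
    withinBudget: latencies.every((ms) => ms < budgetMs),
  };
}
```

Reporting the worst turn alongside the mean matters here because a single slow turn is what makes a phone conversation feel broken.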

Scope Estimate

large

Key Files/Modules Likely Involved

  • extensions/voice-call/src/config.ts — add gemini-live to sttProvider enum
  • extensions/voice-call/src/providers/stt-openai-realtime.ts — reference for new provider
  • extensions/voice-call/src/media-stream.ts — audio routing
  • extensions/voice-call/src/manager.ts — call lifecycle management
  • New file: extensions/voice-call/src/providers/gemini-live.ts

Rough Implementation Sketch

  • Abstract current OpenAIRealtimeSTTProvider into a generic AudioStreamProvider interface
  • Implement GeminiLiveProvider that:
    • Opens WebSocket to generativelanguage.googleapis.com Live API
    • Sends audio from the Twilio media stream, converted from mulaw 8kHz to the PCM format Gemini expects
    • Receives audio responses and forwards back to Twilio
    • Handles tool calls via Gemini function calling
  • Add config option: streaming.provider: "openai-realtime" | "gemini-live"
  • Support injecting system prompt and conversation context into Gemini session
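
The provider abstraction sketched in this list might look something like the following. Everything here is an assumption for illustration: the method names on AudioStreamProvider, the LoopbackProvider test double, and the createProvider registry are hypothetical and not part of the existing plugin. No real Gemini or OpenAI calls are made.

```typescript
// Hypothetical interface for any end-to-end or STT-only audio backend.
interface AudioStreamProvider {
  connect(): Promise<void>;
  sendAudio(chunk: Uint8Array): void;                  // caller audio in
  onAudio(handler: (chunk: Uint8Array) => void): void; // response audio out
  close(): Promise<void>;
}

// Matches the proposed config option streaming.provider.
type StreamingProviderName = "openai-realtime" | "gemini-live";

// Test double that echoes caller audio back, useful for wiring tests
// before a real provider exists.
class LoopbackProvider implements AudioStreamProvider {
  private handlers: Array<(chunk: Uint8Array) => void> = [];

  async connect(): Promise<void> {}

  sendAudio(chunk: Uint8Array): void {
    for (const handler of this.handlers) handler(chunk);
  }

  onAudio(handler: (chunk: Uint8Array) => void): void {
    this.handlers.push(handler);
  }

  async close(): Promise<void> {
    this.handlers = [];
  }
}

// Picks a provider factory by config name.
function createProvider(
  name: StreamingProviderName,
  registry: Record<StreamingProviderName, () => AudioStreamProvider>,
): AudioStreamProvider {
  return registry[name]();
}
```

Keeping the media-stream code dependent only on the interface would let the existing pipeline and an audio-native mode coexist, which leaves the "replace or alternative" open question undecided.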

Open Questions

  • Should Gemini Live completely replace the STT→LLM→TTS pipeline, or be an alternative mode?
  • How should agent tools/skills that currently rely on the intermediate text transcript be handled?
  • Should conversation transcripts still be captured for logging even in audio-native mode?
  • Is Gemini Live's audio format compatible with Twilio's (mulaw 8kHz vs PCM 16kHz)?

Potential Risks or Gotchas

  • Audio format conversion between Twilio (mulaw 8kHz) and Gemini (likely PCM 16kHz)
  • Gemini Live API is still in preview — may have breaking changes
  • Loss of flexibility: with separate STT/LLM/TTS you can mix providers; with Gemini Live it's all-in-one
  • Tool calling latency in Gemini may differ from dedicated LLM providers
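
The audio format conversion in the first bullet can be sketched as below for the inbound direction (Twilio to model). The mu-law decoder follows the standard G.711 expansion. The 8kHz to 16kHz step uses plain linear interpolation, which is a crude resampler chosen for brevity and not a recommendation. The 16kHz target is the issue's own guess ("likely PCM 16kHz"), and all function names are hypothetical.

```typescript
// Expand one G.711 mu-law byte to a signed 16-bit linear PCM sample.
function muLawToPcm16(byte: number): number {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a whole mu-law frame (e.g. one Twilio media payload after base64 decoding).
function decodeMuLaw(frame: Uint8Array): Int16Array {
  const out = new Int16Array(frame.length);
  frame.forEach((byte, i) => {
    out[i] = muLawToPcm16(byte);
  });
  return out;
}

// Double the sample rate by inserting the midpoint between neighbouring samples.
// The final sample is repeated because it has no right-hand neighbour.
function upsample8kTo16k(pcm: Int16Array): Int16Array {
  const out = new Int16Array(pcm.length * 2);
  pcm.forEach((cur, i) => {
    const next = pcm[i + 1] ?? cur;
    out[2 * i] = cur;
    out[2 * i + 1] = Math.round((cur + next) / 2);
  });
  return out;
}
```

The return path would need the reverse steps (resampling the model's output down to 8kHz and mu-law encoding it), at whatever sample rate the model actually emits.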

Related Issues

  • #45561 — [Feature]: Native Gemini Live API Support (gemini-2.5-flash-native-audio-preview) — broader request for Gemini Live support across OpenClaw
  • #7200 — Feature Request: Real-time Voice Conversation Support

extent analysis

TL;DR

To support the Google Gemini Live API in the voice-call plugin, implement a new streaming mode that routes Twilio media stream audio directly to the Gemini Live WebSocket API, bypassing the separate STT and TTS steps.

Guidance

  • Abstract the current OpenAIRealtimeSTTProvider into a generic AudioStreamProvider interface to support alternative end-to-end audio models.
  • Implement a GeminiLiveProvider that handles speech understanding, reasoning, and speech synthesis in one pass, and integrates with the Agent context and tools.
  • Update the config.ts file to include gemini-live as an option for the sttProvider enum and add a new config option for the streaming provider.
  • Verify the implementation by configuring sttProvider: "gemini-live" and measuring end-to-end latency, which should be less than 2 seconds per turn.

Example

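// extensions/voice-call/src/providers/gemini-live.ts
// Skeleton only. AudioStreamProvider is the proposed interface and does not exist yet.
import type { AudioStreamProvider } from './audio-stream-provider';

export class GeminiLiveProvider implements AudioStreamProvider {
  async init(): Promise<void> {
    // Open WebSocket to Gemini Live API
    throw new Error('GeminiLiveProvider.init is not implemented');
  }

  async sendAudio(audio: Buffer): Promise<void> {
    // Convert Twilio media stream audio (mulaw 8kHz) to PCM and send it to Gemini Live API
    throw new Error('GeminiLiveProvider.sendAudio is not implemented');
  }

  async receiveAudio(): Promise<Buffer> {
    // Receive audio responses from Gemini Live API and forward back to Twilio
    throw new Error('GeminiLiveProvider.receiveAudio is not implemented');
  }
}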

Notes

The implementation may require handling audio format conversion between Twilio and Gemini Live API, as well as potential breaking changes in the Gemini Live API.

Recommendation

Implement the GeminiLiveProvider and update config.ts to support the new streaming mode. This is a new feature, not a workaround, and it should give lower-latency voice conversations if the estimated ~1-1.5 seconds per turn holds in practice.
