openclaw - [Feature]: voice-call plugin: Support Google Gemini Live as end-to-end audio provider (STT+LLM+TTS bypass) [1 comment, 2 participants]

xinbenlv · 2026-04-03T05:41:13Z


Original Request

Create a github feature request on openclaw to support Google Live

English Translation: Add support for Google Gemini Live API in the voice-call plugin, enabling end-to-end audio conversations without the separate STT→LLM→TTS pipeline.

Agent's Two Cents (could be wrong)

Everything below is the AI agent's best guess based on the current codebase. Take with a grain of salt — the original request above is the only thing that came from a human.

Problem / Motivation

The current voice-call architecture uses a 3-step pipeline: OpenAI Realtime STT → Agent LLM (e.g., Claude Opus) → ElevenLabs TTS. This introduces ~4-7 seconds of latency per turn, making real-time phone conversations feel unnatural. Google Gemini Live API (gemini-2.5-flash) offers native bidirectional audio — audio in, audio out — in a single model call, potentially reducing latency to ~1-1.5 seconds.

Proposed Solution

Add a new streaming mode to the voice-call plugin that routes Twilio media stream audio directly to the Gemini Live WebSocket API, bypassing the separate STT and TTS steps entirely. The Gemini model would handle speech understanding, reasoning, and speech synthesis in one pass.
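
To make the routing step concrete, here is a minimal sketch of the Twilio side of the bridge. It assumes Twilio Media Streams deliver JSON text frames whose `media` events carry base64-encoded mulaw audio, and that audio is sent back in the same envelope with the `streamSid`. The helper names (`extractInboundAudio`, `buildOutboundMedia`) are hypothetical and do not exist in the plugin today.

```typescript
// Sketch only: message shapes follow Twilio's Media Streams JSON events;
// the helper names are hypothetical.
interface TwilioMediaEvent {
  event: "media";
  streamSid: string;
  media: { payload: string }; // base64-encoded mulaw, 8 kHz, mono
}

type TwilioEvent =
  | TwilioMediaEvent
  | { event: "connected" | "start" | "stop" | "mark" };

/** Returns the raw mulaw bytes of an inbound frame, or null for non-media events. */
function extractInboundAudio(raw: string): Uint8Array | null {
  const msg = JSON.parse(raw) as TwilioEvent;
  if (msg.event !== "media") return null;
  return new Uint8Array(Buffer.from(msg.media.payload, "base64"));
}

/** Wraps outbound mulaw bytes in the JSON envelope sent back on the same socket. */
function buildOutboundMedia(streamSid: string, mulaw: Uint8Array): string {
  const payload = Buffer.from(mulaw).toString("base64");
  return JSON.stringify({ event: "media", streamSid, media: { payload } });
}
```

A Gemini Live provider would sit between these two helpers: bytes from `extractInboundAudio` go to the model (after any format conversion), and the model's audio comes back through `buildOutboundMedia`.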

Architecture Diagram

Current (3-step, ~5-7s latency):
┌──────────┐    ┌─────────────────┐    ┌───────────┐    ┌──────────────┐    ┌────────────┐
│  Caller  │───▶│ Twilio Media    │───▶│ OpenAI    │───▶│ Agent LLM    │───▶│ ElevenLabs │
│ (Phone)  │◀───│ Stream (WS)     │◀───│ Realtime  │    │ (Claude/GPT) │    │ TTS        │
└──────────┘    └─────────────────┘    │ STT       │    └──────────────┘    └────────────┘
                                       └───────────┘

Proposed (1-step, ~1-1.5s latency):
┌──────────┐    ┌─────────────────┐    ┌──────────────────────┐
│  Caller  │───▶│ Twilio Media    │───▶│ Gemini Live API      │
│ (Phone)  │◀───│ Stream (WS)     │◀───│ (audio-in/audio-out) │
└──────────┘    └─────────────────┘    └──────────────────────┘

Dependencies & Potential Blockers

The voice-call plugin's STT provider is currently restricted to a single-value enum, z.enum(["openai-realtime"]), in config.ts
Need to abstract the streaming pipeline to support alternative end-to-end audio models
Gemini Live API uses a different WebSocket protocol than OpenAI Realtime
Agent context/tools integration: Gemini would need the system prompt and tool definitions that currently go through the Agent LLM
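
The first blocker could be relaxed along these lines. This is a dependency-free sketch of the validation logic only; in the plugin itself the equivalent change would presumably be widening the existing `z.enum([...])` to include `"gemini-live"`. The names `STREAMING_PROVIDERS` and `parseStreamingProvider` are hypothetical.

```typescript
// Hypothetical provider list; per the issue, today's schema accepts only "openai-realtime".
const STREAMING_PROVIDERS = ["openai-realtime", "gemini-live"] as const;
type StreamingProvider = (typeof STREAMING_PROVIDERS)[number];

/** Validates an untrusted config value and narrows it to a known provider name. */
function parseStreamingProvider(value: unknown): StreamingProvider {
  const known = STREAMING_PROVIDERS as readonly string[];
  if (typeof value === "string" && known.indexOf(value) !== -1) {
    return value as StreamingProvider;
  }
  throw new Error(`Unsupported streaming provider: ${String(value)}`);
}
```

Keeping the list in one `as const` array means the runtime check and the compile-time union cannot drift apart when a provider is added.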

External Setup Required

⚠️ API key / credentials: Google AI API key with Gemini Live access
⚠️ Model access: gemini-2.5-flash-native-audio-preview may require allowlist/waitlist

How to Validate

Configure sttProvider: "gemini-live" in voice-call config
Make an outbound call, speak to the agent, and receive audio responses
Measure end-to-end latency: should be < 2 seconds per turn
Verify the agent can access conversation context and tools
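
For the latency check, one possible approach is to timestamp the end of caller speech and the first response audio frame for each turn. The sketch below assumes the provider can signal both moments; `TurnLatencyTracker` and its method names are hypothetical, and the clock is injectable so the logic can be tested without real calls.

```typescript
/** Records per-turn latency: caller stops speaking -> first response audio frame. */
class TurnLatencyTracker {
  readonly samplesMs: number[] = [];
  private speechEndedAt: number | null = null;
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  markCallerSpeechEnd(): void {
    this.speechEndedAt = this.now();
  }

  markFirstResponseAudio(): void {
    if (this.speechEndedAt === null) return; // no pending turn, nothing to measure
    this.samplesMs.push(this.now() - this.speechEndedAt);
    this.speechEndedAt = null;
  }

  /** True when every recorded turn finished under the given budget. */
  allTurnsUnder(budgetMs: number): boolean {
    return this.samplesMs.every((ms) => ms < budgetMs);
  }
}
```

With the target stated above, a validation run would end with something like `tracker.allTurnsUnder(2000)`.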

Scope Estimate

large

Key Files/Modules Likely Involved

extensions/voice-call/src/config.ts — add gemini-live to sttProvider enum
extensions/voice-call/src/providers/stt-openai-realtime.ts — reference for new provider
extensions/voice-call/src/media-stream.ts — audio routing
extensions/voice-call/src/manager.ts — call lifecycle management
New file: extensions/voice-call/src/providers/gemini-live.ts

Rough Implementation Sketch

Abstract current OpenAIRealtimeSTTProvider into a generic AudioStreamProvider interface
Implement GeminiLiveProvider that:
- Opens WebSocket to generativelanguage.googleapis.com Live API
- Sends audio from the Twilio media stream (mulaw 8kHz, converted to PCM if Gemini requires it)
- Receives audio responses and forwards back to Twilio
- Handles tool calls via Gemini function calling
Add config option: streaming.provider: "openai-realtime" | "gemini-live"
Support injecting system prompt and conversation context into Gemini session
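
The provider selection in this sketch could look roughly like the following. Everything here is an assumption for illustration: the `endToEnd` flag, the `StreamingConfig` shape, and the factory names are hypothetical, and the real providers would carry WebSocket state rather than plain objects.

```typescript
type StreamingProviderName = "openai-realtime" | "gemini-live";

interface AudioStreamProvider {
  readonly name: StreamingProviderName;
  /** True when the provider returns synthesized audio itself (no separate LLM/TTS step). */
  readonly endToEnd: boolean;
}

interface StreamingConfig {
  provider: StreamingProviderName;
  systemPrompt?: string; // would be injected into the Gemini session
}

// One factory per provider; adding a provider means adding one entry here.
const providerFactories: Record<
  StreamingProviderName,
  (cfg: StreamingConfig) => AudioStreamProvider
> = {
  "openai-realtime": () => ({ name: "openai-realtime", endToEnd: false }),
  "gemini-live": () => ({ name: "gemini-live", endToEnd: true }),
};

function createAudioStreamProvider(cfg: StreamingConfig): AudioStreamProvider {
  return providerFactories[cfg.provider](cfg);
}
```

The call manager could then branch on `endToEnd` to decide whether the Agent LLM and TTS stages run at all, which keeps the existing pipeline available as an alternative mode.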

Open Questions

Should Gemini Live completely replace the STT→LLM→TTS pipeline, or be an alternative mode?
How to handle agent tools/skills that currently rely on the intermediate text transcript?
Should conversation transcripts still be captured for logging even in audio-native mode?
Is Gemini Live's audio format compatible with Twilio's (mulaw 8kHz vs. PCM 16kHz)?

Potential Risks or Gotchas

Audio format conversion between Twilio (mulaw 8kHz) and Gemini (likely PCM 16kHz)
Gemini Live API is still in preview — may have breaking changes
Loss of flexibility: with separate STT/LLM/TTS you can mix providers; with Gemini Live it's all-in-one
Tool calling latency in Gemini may differ from dedicated LLM providers
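
The format conversion risk is mostly mechanical. Below is a minimal sketch of the inbound direction, assuming Twilio delivers G.711 mulaw at 8kHz and Gemini expects 16-bit PCM at 16kHz (the latter is this issue's guess, not a confirmed requirement). The decoder is the standard G.711 expansion; the upsampler uses plain linear interpolation, which a production implementation would likely replace with a properly filtered resampler. The return path (Gemini audio back to mulaw) is not shown.

```typescript
/** Decodes one G.711 mu-law byte to a 16-bit linear PCM sample. */
function muLawToPcm16(byte: number): number {
  const u = ~byte & 0xff; // mu-law bytes are stored bit-inverted
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return (u & 0x80) !== 0 ? -magnitude : magnitude;
}

/** Decodes a whole mu-law frame into 16-bit PCM at the same sample rate. */
function decodeMuLaw(frame: Uint8Array): Int16Array {
  const pcm = new Int16Array(frame.length);
  for (let i = 0; i < frame.length; i++) {
    pcm[i] = muLawToPcm16(frame[i]);
  }
  return pcm;
}

/** Doubles the sample rate (8 kHz -> 16 kHz) by linear interpolation. */
function upsample2x(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 2);
  for (let i = 0; i < input.length; i++) {
    const current = input[i];
    const next = i + 1 < input.length ? input[i + 1] : current;
    out[2 * i] = current;
    out[2 * i + 1] = (current + next) >> 1; // midpoint between neighbours
  }
  return out;
}
```

A frame from Twilio would pass through `upsample2x(decodeMuLaw(frame))` before being sent on. In mulaw, the byte `0xFF` decodes to silence (sample value 0), which is handy for sanity-checking the decoder.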

Related Issues

#45561 — [Feature]: Native Gemini Live API Support (gemini-2.5-flash-native-audio-preview) — broader request for Gemini Live support across OpenClaw
#7200 — Feature Request: Real-time Voice Conversation Support

extent analysis

TL;DR

To support the Google Gemini Live API in the voice-call plugin, implement a new streaming mode that routes Twilio media stream audio directly to the Gemini Live WebSocket API, bypassing the separate STT and TTS steps.

Guidance

Abstract the current OpenAIRealtimeSTTProvider into a generic AudioStreamProvider interface to support alternative end-to-end audio models.
Implement a GeminiLiveProvider that handles speech understanding, reasoning, and speech synthesis in one pass, and integrates with the Agent context and tools.
Update the config.ts file to include gemini-live as an option for the sttProvider enum and add a new config option for the streaming provider.
Verify the implementation by configuring sttProvider: "gemini-live" and measuring end-to-end latency, which should be less than 2 seconds per turn.

Example

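// extensions/voice-call/src/providers/gemini-live.ts
// Skeleton only: AudioStreamProvider is the proposed interface (it does not exist yet),
// declared inline here so this file compiles on its own.
export interface AudioStreamProvider {
  init(): Promise<void>;
  sendAudio(audio: Buffer): Promise<void>;
  receiveAudio(): Promise<Buffer>;
}

export class GeminiLiveProvider implements AudioStreamProvider {
  async init(): Promise<void> {
    // Open WebSocket to Gemini Live API
    throw new Error("GeminiLiveProvider.init is not implemented yet");
  }

  async sendAudio(_audio: Buffer): Promise<void> {
    // Convert Twilio audio (mulaw 8kHz) as needed and send it to Gemini Live API
    throw new Error("GeminiLiveProvider.sendAudio is not implemented yet");
  }

  async receiveAudio(): Promise<Buffer> {
    // Receive audio responses from Gemini Live API and forward back to Twilio
    throw new Error("GeminiLiveProvider.receiveAudio is not implemented yet");
  }
}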

Notes

The implementation may require handling audio format conversion between Twilio and Gemini Live API, as well as potential breaking changes in the Gemini Live API.

Recommendation

Implement the GeminiLiveProvider and update config.ts to support the new streaming mode. This is new feature work rather than a workaround, and it should give a lower-latency voice conversation experience if the estimates above hold.
