ollama - 💡(How to fix) Fix Feature Request: Native Realtime Voice Chat (Bidirectional Streaming Audio Conversation) [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#15807Fetched 2026-04-26 05:06:08
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0
Author
Participants

Fix Action

Fix / Workaround

IssueTitleStateRelevance
#1168Support WhisperForConditionalGenerationOpen (63 👍)Canonical issue for STT support. #7514 was merged here. Covers speech-to-text, but not TTS or bidirectional streaming.
#5424Supports voice recognition and text-to-speech capabilitiesOpenGeneric request for STT + TTS with extension framework. Not specific to streaming/realtime conversation.
#9804Sesame family models, Realtime voice mode?OpenModel-specific request for Sesame CSM-1B and realtime voice. Overlaps but narrower in scope.
#11798Add Audio Input Support for Multimodal ModelsOpen (10 comments)Audio-in for text-out (e.g., Qwen2-Audio). Already partially supported; not voice-to-voice.
#7514Realtime API like OpenAIClosed (merged into #1168)Was the closest prior issue to this exact request. Closed by maintainer jmorganca on 2024-12-23.

Code Example

POST /api/realtime
Upgrade: websocket

---

┌─────────────┐      WebSocket       ┌─────────────────────────────────────┐
Client    │ ◄──────────────────► │           Ollama Server  (Mic/Spk)  │    audio/text/events │                                     │
└─────────────┘                      │  ┌─────────┐  ┌──────┐  ┌────────┐  │
                                     │  │  STT   │──►│ LLM  │──►│  TTS   │  │
(local) │  │      │  (local) │  │
                                     │  └─────────┘  └──────┘  └────────┘  │
                                     │         ▲              │            │
VAD / Buffer   Streaming Audio                                     └─────────────────────────────────────┘

---

# Start a realtime voice session
ollama realtime llama3.2 --voice default

---

import ollama

with ollama.realtime(model="llama3.2", voice="nova") as session:
    session.on_audio_delta = lambda chunk: play_audio(chunk)
    session.on_transcript = lambda text, speaker: print(f"{speaker}: {text}")
    session.start()  # blocks, streams mic audio
RAW_BUFFERClick to expand / collapse

Feature Request: Native Realtime Voice Chat (Bidirectional Streaming Audio Conversation)

Status

  • File Created: 2026-04-25
  • Upstream Submitted: No — awaiting Ben's approval to submit to ollama/ollama

Feature Description

Requesting native support for realtime bidirectional voice conversation in Ollama — the ability to hold a natural, low-latency spoken dialogue with an LLM, similar to OpenAI's Realtime API. This is distinct from the current audio-input-only multimodal support or external STT→LLM→TTS chaining.

What "Realtime Voice Chat" Means Here

CapabilityCurrent OllamaThis Feature Request
Audio file input for multimodal models✅ (e.g., Qwen2-Audio)Not the same thing
Speech-to-text (STT)❌ No native supportNot sufficient
Text-to-speech (TTS)❌ No native supportNot sufficient
Streaming audio-in → streaming audio-out❌ Not supportedThis is the ask
Conversational turn-taking with voice activity detection (VAD)❌ Not supportedThis is the ask
Low-latency (<500ms) voice response❌ Not supportedThis is the ask

The desired behavior: a single WebSocket (or SSE) connection where:

  1. Client streams raw audio (e.g., PCM16 @ 24kHz) from the microphone.
  2. Ollama handles speech recognition, LLM inference, and speech synthesis in a continuous pipeline.
  3. Ollama streams synthesized audio back to the client in near real-time.
  4. Turn-taking, interruption handling, and VAD are managed natively or exposed as events.

Use Cases

  1. Accessibility: Hands-free, eyes-free interaction for users with motor or vision impairments.
  2. Productivity: Dictate and converse with local models during coding, driving, or manual work.
  3. Education/Language Learning: Practice speaking with a local AI tutor without sending voice data to third parties.
  4. Embedded & Edge Devices: Voice-enabled local assistants on Raspberry Pi, home servers, or offline workstations.
  5. Privacy: Full voice-to-voice AI interaction without cloud audio processing.

Why It Matters

  • Privacy & Sovereignty: Open-source voice models (Sesame, Qwen2-Audio, Whisper) are advancing rapidly. Users want to run them locally, but Ollama only supports audio input for text output — not true voice conversation.
  • Gap vs. OpenAI Realtime API: Cloud providers are pulling ahead in conversational UX. A local-first alternative keeps the open-source ecosystem competitive.
  • Foundation exists: Ollama already runs audio-capable models and has a streaming API. Extending it to handle audio-out and manage the audio pipeline is a natural evolution.
  • Community demand: Multiple issues (see below) show sustained interest, but they are fragmented across STT-only, TTS-only, or model-specific requests.

Related Existing Issues

IssueTitleStateRelevance
#1168Support WhisperForConditionalGenerationOpen (63 👍)Canonical issue for STT support. #7514 was merged here. Covers speech-to-text, but not TTS or bidirectional streaming.
#5424Supports voice recognition and text-to-speech capabilitiesOpenGeneric request for STT + TTS with extension framework. Not specific to streaming/realtime conversation.
#9804Sesame family models, Realtime voice mode?OpenModel-specific request for Sesame CSM-1B and realtime voice. Overlaps but narrower in scope.
#11798Add Audio Input Support for Multimodal ModelsOpen (10 comments)Audio-in for text-out (e.g., Qwen2-Audio). Already partially supported; not voice-to-voice.
#7514Realtime API like OpenAIClosed (merged into #1168)Was the closest prior issue to this exact request. Closed by maintainer jmorganca on 2024-12-23.

Conclusion: No existing open issue specifically covers native bidirectional streaming voice conversation as a first-class Ollama feature. #1168 is the closest but is STT-only. This feature request is broader and distinct enough to warrant its own issue.


Suggested Implementation Approach

The following is a high-level proposal for discussion. Ollama maintainers should define the canonical design.

1. API Surface

Extend the Ollama API with a new realtime endpoint:

POST /api/realtime
Upgrade: websocket

Client → Server:

  • session.init — model name, voice settings, system prompt
  • audio.append — base64-encoded audio chunks (PCM16, 24kHz)
  • input_audio_buffer.commit — signal end of user turn
  • conversation.item.create — inject text/tools/events
  • session.update — change voice, instructions, or temperature mid-session

Server → Client:

  • conversation.item.created — transcript (user + assistant)
  • response.audio.delta — base64-encoded synthesized audio chunks
  • response.audio.done — end of assistant response
  • response.done — response complete
  • input_audio_buffer.speech_started / speech_stopped — VAD events

2. Architecture

┌─────────────┐      WebSocket       ┌─────────────────────────────────────┐
│   Client    │ ◄──────────────────► │           Ollama Server             │
│  (Mic/Spk)  │    audio/text/events │                                     │
└─────────────┘                      │  ┌─────────┐  ┌──────┐  ┌────────┐  │
                                     │  │  STT   │──►│ LLM  │──►│  TTS   │  │
                                     │  │(local) │  │      │  │(local) │  │
                                     │  └─────────┘  └──────┘  └────────┘  │
                                     │         ▲              │            │
                                     │    VAD / Buffer   Streaming Audio  │
                                     └─────────────────────────────────────┘

3. Component Breakdown

ComponentOptions / Notes
STT EngineWhisper (ggml/gguf via whisper.cpp), or native model audio encoder (Qwen2-Audio, etc.)
LLMAny Ollama text model; system prompt controls persona and tool use
TTS EngineLocal option: Sesame CSM-1B, Piper, Coqui TTS, or MeloTTS. Could be model-specific.
VADSilero VAD, webrtcvad, or native model attention. Detects speech start/stop to trigger STT.
Audio FormatPCM16, 24kHz mono (input); PCM16 or opus (output). Matches OpenAI Realtime API conventions.
InterruptionClient sends conversation.item.truncate on new speech detection; server cancels in-flight TTS.

4. CLI & SDK

# Start a realtime voice session
ollama realtime llama3.2 --voice default
import ollama

with ollama.realtime(model="llama3.2", voice="nova") as session:
    session.on_audio_delta = lambda chunk: play_audio(chunk)
    session.on_transcript = lambda text, speaker: print(f"{speaker}: {text}")
    session.start()  # blocks, streams mic audio

5. Incremental Rollout Phases

  1. Phase 1 — STT streaming: Accept streaming audio, emit text transcript events (extends #1168).
  2. Phase 2 — TTS streaming: Add TTS model support; emit audio deltas from text responses.
  3. Phase 3 — Native voice models: Support end-to-end audio-in/audio-out models (e.g., GPT-4o-style native audio).
  4. Phase 4 — Interruptions & VAD: Full conversational turn-taking, barge-in, and voice activity detection.

6. Backwards Compatibility

  • New /api/realtime endpoint; no breaking changes to existing /api/generate or /api/chat.
  • Audio models loaded via standard ollama run / Modelfile mechanism.

Open Questions for Maintainers

  1. Should Ollama bundle a default TTS/STT model, or should users bring their own?
  2. Is the goal to support pipeline STT→LLM→TTS, or native end-to-end audio models (or both)?
  3. What is the preferred transport: WebSocket, SSE, or HTTP/2 bidirectional streams?
  4. Should voice conversations support tools/function calling mid-stream?

References

extent analysis

TL;DR

Implement a new WebSocket endpoint /api/realtime to enable native bidirectional streaming voice conversation in Ollama.

Guidance

  1. Extend Ollama API: Create a new endpoint /api/realtime to handle streaming audio and text events.
  2. Define API Surface: Establish clear protocols for client-server communication, including session.init, audio.append, and response.audio.delta events.
  3. Choose Components: Select suitable STT, LLM, TTS, and VAD components, considering options like Whisper, Sesame CSM-1B, and Silero VAD.
  4. Implement Phased Rollout: Break down development into phases, starting with STT streaming, then adding TTS streaming, native voice models, and finally interruptions and VAD.
  5. Ensure Backwards Compatibility: Design the new endpoint to coexist with existing API endpoints without introducing breaking changes.

Example

import ollama

with ollama.realtime(model="llama3.2", voice="nova") as session:
    session.on_audio_delta = lambda chunk: play_audio(chunk)
    session.on_transcript = lambda text, speaker: print(f"{speaker}: {text}")
    session.start()  # blocks, streams mic audio

Notes

The implementation details, such as the choice of components and transport protocol, should be discussed and decided by Ollama maintainers.

Recommendation

Apply the proposed workaround by implementing the new /api/realtime endpoint and phased rollout plan to enable native bidirectional streaming voice conversation in Ollama. This approach allows for a structured development process and ensures backwards compatibility with existing API endpoints.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING