ollama - Feature Request: Native Realtime Voice Chat (Bidirectional Streaming Audio Conversation) [1 participant]

volzb · 2026-04-25T06:23:45Z



ollama/ollama#15807•Fetched 2026-04-26 05:06:08


Feature Request: Native Realtime Voice Chat (Bidirectional Streaming Audio Conversation)

Status

File Created: 2026-04-25
Upstream Submitted: No — awaiting Ben's approval to submit to ollama/ollama

Feature Description

Requesting native support for realtime bidirectional voice conversation in Ollama — the ability to hold a natural, low-latency spoken dialogue with an LLM, similar to OpenAI's Realtime API. This is distinct from the current audio-input-only multimodal support or external STT→LLM→TTS chaining.

What "Realtime Voice Chat" Means Here

| Capability | Current Ollama | This Feature Request |
|------------|----------------|----------------------|
| Audio file input for multimodal models | ✅ (e.g., Qwen2-Audio) | Not the same thing |
| Speech-to-text (STT) | ❌ No native support | Not sufficient |
| Text-to-speech (TTS) | ❌ No native support | Not sufficient |
| **Streaming audio-in → streaming audio-out** | ❌ Not supported | **This is the ask** |
| Conversational turn-taking with voice activity detection (VAD) | ❌ Not supported | **This is the ask** |
| Low-latency (<500ms) voice response | ❌ Not supported | **This is the ask** |

The desired behavior: a single WebSocket connection (or SSE for the server-to-client half, paired with a separate upload channel, since SSE cannot carry client audio) where:

1. Client streams raw audio (e.g., PCM16 @ 24kHz) from the microphone.
2. Ollama handles speech recognition, LLM inference, and speech synthesis in a continuous pipeline.
3. Ollama streams synthesized audio back to the client in near real-time.
4. Turn-taking, interruption handling, and VAD are managed natively or exposed as events.
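
As a rough illustration of step 1, the sketch below splits raw PCM16 mono audio at 24 kHz into fixed-size frames and base64-encodes each one for transport. The 20 ms frame length and the helper names (`frame_size_bytes`, `iter_frames`, `encode_frame`) are assumptions made for this example; the request itself only specifies the sample format and rate.

```python
import base64

SAMPLE_RATE_HZ = 24_000   # from the request: PCM16 @ 24 kHz, mono
BYTES_PER_SAMPLE = 2      # PCM16 means 16-bit (2-byte) samples
FRAME_MS = 20             # assumption: 20 ms frames, not specified in the request


def frame_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ, frame_ms: int = FRAME_MS) -> int:
    """Number of bytes in one mono PCM16 frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000 * BYTES_PER_SAMPLE


def iter_frames(pcm: bytes, frame_bytes: int):
    """Yield consecutive frames; a trailing partial frame is yielded as-is."""
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


def encode_frame(frame: bytes) -> str:
    """Base64-encode one frame so it can travel inside a JSON text message."""
    return base64.b64encode(frame).decode("ascii")


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    frames = list(iter_frames(one_second_of_silence, frame_size_bytes()))
    print(len(frames), len(frames[0]), len(encode_frame(frames[0])))
```

With these assumptions, one second of audio is 48,000 bytes, which becomes 50 frames of 960 bytes, each 1,280 characters once base64-encoded. Smaller frames lower per-chunk latency at the cost of more messages, which matters for the sub-500ms target above.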

Use Cases

1. **Accessibility**: Hands-free, eyes-free interaction for users with motor or vision impairments.
2. **Productivity**: Dictate and converse with local models during coding, driving, or manual work.
3. **Education/Language Learning**: Practice speaking with a local AI tutor without sending voice data to third parties.
4. **Embedded & Edge Devices**: Voice-enabled local assistants on Raspberry Pi, home servers, or offline workstations.
5. **Privacy**: Full voice-to-voice AI interaction without cloud audio processing.

Why It Matters

- **Privacy & Sovereignty**: Open-source voice models (Sesame, Qwen2-Audio, Whisper) are advancing rapidly. Users want to run them locally, but Ollama only supports audio *input* for text output, not true voice conversation.
- **Gap vs. OpenAI Realtime API**: Cloud providers are pulling ahead in conversational UX. A local-first alternative keeps the open-source ecosystem competitive.
- **Foundation exists**: Ollama already runs audio-capable models and has a streaming API. Extending it to handle audio-out and manage the audio pipeline is a natural evolution.
- **Community demand**: Multiple issues (see below) show sustained interest, but they are fragmented across STT-only, TTS-only, or model-specific requests.

Related Existing Issues

| Issue | Title | State | Relevance |
|-------|-------|-------|-----------|
| [#1168](https://github.com/ollama/ollama/issues/1168) | Support WhisperForConditionalGeneration | **Open** (63 👍) | Canonical issue for STT support. #7514 was merged here. Covers speech-to-text, but not TTS or bidirectional streaming. |
| [#5424](https://github.com/ollama/ollama/issues/5424) | Supports voice recognition and text-to-speech capabilities | **Open** | Generic request for STT + TTS with extension framework. Not specific to streaming/realtime conversation. |
| [#9804](https://github.com/ollama/ollama/issues/9804) | Sesame family models, Realtime voice mode? | **Open** | Model-specific request for Sesame CSM-1B and realtime voice. Overlaps but narrower in scope. |
| [#11798](https://github.com/ollama/ollama/issues/11798) | Add Audio Input Support for Multimodal Models | **Open** (10 comments) | Audio-in for text-out (e.g., Qwen2-Audio). Already partially supported; not voice-to-voice. |
| ~~#7514~~ | ~~Realtime API like OpenAI~~ | **Closed** (merged into #1168) | Was the closest prior issue to this exact request. Closed by maintainer `jmorganca` on 2024-12-23. |

Conclusion: No existing open issue specifically covers native bidirectional streaming voice conversation as a first-class Ollama feature. #1168 is the closest but is STT-only. This feature request is broader and distinct enough to warrant its own issue.

Suggested Implementation Approach

The following is a high-level proposal for discussion. Ollama maintainers should define the canonical design.

1. API Surface

Extend the Ollama API with a new realtime endpoint:

GET /api/realtime
Upgrade: websocket
Connection: Upgrade

Client → Server:

- `session.init`: model name, voice settings, system prompt
- `audio.append`: base64-encoded audio chunks (PCM16, 24kHz)
- `input_audio_buffer.commit`: signal end of user turn
- `conversation.item.create`: inject text/tools/events
- `session.update`: change voice, instructions, or temperature mid-session

Server → Client:

- `conversation.item.created`: transcript (user + assistant)
- `response.audio.delta`: base64-encoded synthesized audio chunks
- `response.audio.done`: end of assistant response
- `response.done`: response complete
- `input_audio_buffer.speech_started` / `speech_stopped`: VAD events
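
On the receiving side, a client needs to route each server event to the right handler. The following is one possible sketch of that routing. The `EventRouter` class, the `collect_audio` helper, and the assumption that each message is a JSON object with a `type` key and (for audio deltas) an `audio` key are all hypothetical; only the event names are taken from the list above.

```python
import base64
import json


class EventRouter:
    """Dispatch decoded server events to handlers registered by event type."""

    def __init__(self):
        self._handlers = {}

    def on(self, event_type: str, handler) -> None:
        """Register a callable that receives the decoded event dict."""
        self._handlers[event_type] = handler

    def dispatch(self, raw_message: str) -> bool:
        """Decode one JSON message and call its handler; return False if unhandled."""
        event = json.loads(raw_message)
        handler = self._handlers.get(event.get("type"))
        if handler is None:
            return False
        handler(event)
        return True


def collect_audio(buffer: bytearray):
    """Return a handler that appends decoded audio delta bytes to `buffer`."""
    def handle(event: dict) -> None:
        buffer.extend(base64.b64decode(event["audio"]))
    return handle


if __name__ == "__main__":
    audio = bytearray()
    router = EventRouter()
    router.on("response.audio.delta", collect_audio(audio))
    router.on("response.done", lambda event: print("response complete"))

    delta = json.dumps({"type": "response.audio.delta",
                        "audio": base64.b64encode(b"\x10\x00").decode("ascii")})
    router.dispatch(delta)
    router.dispatch(json.dumps({"type": "response.done"}))
    print(len(audio), "bytes of audio received")
```

Returning `False` for unknown event types lets a client ignore events added in later protocol revisions instead of failing on them.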

2. Architecture

┌─────────────┐      WebSocket       ┌─────────────────────────────────────┐
│   Client    │ ◄──────────────────► │           Ollama Server             │
│  (Mic/Spk)  │    audio/text/events │                                     │
└─────────────┘                      │  ┌─────────┐  ┌──────┐  ┌────────┐  │
                                     │  │   STT   │─►│ LLM  │─►│  TTS   │  │
                                     │  │ (local) │  │      │  │(local) │  │
                                     │  └─────────┘  └──────┘  └────────┘  │
                                     │       ▲                      │      │
                                     │  VAD / Buffer       Streaming Audio │
                                     └─────────────────────────────────────┘

3. Component Breakdown

| Component | Options / Notes |
|-----------|-----------------|
| STT Engine | Whisper (ggml/gguf via whisper.cpp), or native model audio encoder (Qwen2-Audio, etc.) |
| LLM | Any Ollama text model; system prompt controls persona and tool use |
| TTS Engine | Local option: Sesame CSM-1B, Piper, Coqui TTS, or MeloTTS. Could be model-specific. |
| VAD | Silero VAD, webrtcvad, or native model attention. Detects speech start/stop to trigger STT. |
| Audio Format | PCM16, 24kHz mono (input); PCM16 or Opus (output). Matches OpenAI Realtime API conventions. |
| Interruption | Client sends `conversation.item.truncate` on new speech detection; server cancels in-flight TTS. |
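
To show what the VAD row means in practice, the sketch below uses a simple RMS energy threshold with a hangover counter to turn a stream of PCM16 frames into `speech_started` / `speech_stopped` events. This is a deliberately naive stand-in, not Silero VAD or webrtcvad, and the threshold of 500 and the 10-frame hangover are arbitrary assumptions that would need tuning on real audio.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian PCM16 mono frame."""
    sample_count = len(frame) // 2
    if sample_count == 0:
        return 0.0
    samples = struct.unpack(f"<{sample_count}h", frame[:sample_count * 2])
    return math.sqrt(sum(s * s for s in samples) / sample_count)


def detect_speech_events(frames, threshold: float = 500.0, hangover_frames: int = 10):
    """Yield (frame_index, event_name) pairs for speech start and stop.

    Speech starts on the first frame at or above `threshold`; it stops once
    `hangover_frames` consecutive frames fall below it, which avoids cutting
    the turn off during short pauses between words.
    """
    speaking = False
    quiet_run = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= threshold:
            quiet_run = 0
            if not speaking:
                speaking = True
                yield index, "speech_started"
        elif speaking:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                speaking = False
                quiet_run = 0
                yield index, "speech_stopped"


if __name__ == "__main__":
    silence = bytes(960)                              # 20 ms of silence at 24 kHz
    tone = struct.pack("<480h", *([2000] * 480))      # 20 ms of constant loud signal
    stream = [silence] * 3 + [tone] * 5 + [silence] * 12
    for index, name in detect_speech_events(stream):
        print(index, name)
```

In the proposed pipeline, `speech_started` would gate audio into the STT engine and `speech_stopped` would play the role of an automatic `input_audio_buffer.commit`. A model-based VAD should be far more robust to background noise than this energy heuristic.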

4. CLI & SDK

# Start a realtime voice session
ollama realtime llama3.2 --voice default

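import ollama  # `ollama.realtime` is the API proposed in this request, not part of the current SDK

def play_audio(chunk: bytes) -> None:
    """Placeholder: hand the audio chunk to your audio output of choice."""

with ollama.realtime(model="llama3.2", voice="nova") as session:
    # Play each synthesized audio chunk as it arrives.
    session.on_audio_delta = lambda chunk: play_audio(chunk)
    # Print the running transcript for both speakers.
    session.on_transcript = lambda text, speaker: print(f"{speaker}: {text}")
    session.start()  # blocks, streams mic audio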

5. Incremental Rollout Phases

1. **Phase 1 (STT streaming)**: Accept streaming audio, emit text transcript events (extends #1168).
2. **Phase 2 (TTS streaming)**: Add TTS model support; emit audio deltas from text responses.
3. **Phase 3 (Native voice models)**: Support end-to-end audio-in/audio-out models (e.g., GPT-4o-style native audio).
4. **Phase 4 (Interruptions & VAD)**: Full conversational turn-taking, barge-in, and voice activity detection.
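
The barge-in behavior in Phase 4 (cancelling in-flight TTS when the user starts speaking) could be structured roughly as follows with `asyncio`. This is a sketch under assumptions: `speak` merely simulates synthesis by appending chunks to a list, and the `speech_started` event object stands in for whatever signal the VAD or the client would deliver. None of these names come from the proposal.

```python
import asyncio


async def speak(chunks, played: list) -> None:
    """Simulated TTS output: emit chunks one by one, yielding to the loop each time."""
    for chunk in chunks:
        await asyncio.sleep(0)   # stands in for synthesis and network time
        played.append(chunk)


async def respond(chunks, speech_started: asyncio.Event, played: list) -> str:
    """Stream a response, but cancel it as soon as the user starts speaking."""
    tts_task = asyncio.create_task(speak(chunks, played))
    barge_in_task = asyncio.create_task(speech_started.wait())

    done, _pending = await asyncio.wait(
        {tts_task, barge_in_task}, return_when=asyncio.FIRST_COMPLETED
    )

    if barge_in_task in done and not tts_task.done():
        tts_task.cancel()                    # stop the in-flight response
        try:
            await tts_task
        except asyncio.CancelledError:
            pass
        return "interrupted"

    barge_in_task.cancel()                   # response finished without interruption
    return "completed"


async def main() -> None:
    played = []
    user_is_talking = asyncio.Event()
    user_is_talking.set()                    # simulate the user barging in immediately
    outcome = await respond(list(range(1000)), user_is_talking, played)
    print(outcome, "after", len(played), "chunks")


if __name__ == "__main__":
    asyncio.run(main())
```

The number of chunks already played at the moment of cancellation is what a client would need to report back (for example via `conversation.item.truncate`) so the conversation history reflects only what the user actually heard.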

6. Backwards Compatibility

- New `/api/realtime` endpoint; no breaking changes to existing `/api/generate` or `/api/chat`.
- Audio models loaded via standard `ollama run` / Modelfile mechanism.

Open Questions for Maintainers

1. Should Ollama bundle a default TTS/STT model, or should users bring their own?
2. Is the goal to support pipeline STT→LLM→TTS, or native end-to-end audio models (or both)?
3. What is the preferred transport: WebSocket, SSE, or HTTP/2 bidirectional streams?
4. Should voice conversations support tools/function calling mid-stream?

References

OpenAI Realtime API
Sesame CSM-1B (Conversational Speech Model)
Qwen2-Audio
whisper.cpp
Ollama issues: #1168, #5424, #9804, #11798

extent analysis

TL;DR

Implement a new WebSocket endpoint /api/realtime to enable native bidirectional streaming voice conversation in Ollama.

Guidance

Extend Ollama API: Create a new endpoint /api/realtime to handle streaming audio and text events.
Define API Surface: Establish clear protocols for client-server communication, including session.init, audio.append, and response.audio.delta events.
Choose Components: Select suitable STT, LLM, TTS, and VAD components, considering options like Whisper, Sesame CSM-1B, and Silero VAD.
Implement Phased Rollout: Break down development into phases, starting with STT streaming, then adding TTS streaming, native voice models, and finally interruptions and VAD.
Ensure Backwards Compatibility: Design the new endpoint to coexist with existing API endpoints without introducing breaking changes.

Notes

The implementation details, such as the choice of components and transport protocol, should be discussed and decided by Ollama maintainers.

Recommendation

Implement the proposed /api/realtime endpoint following the phased rollout plan to enable native bidirectional streaming voice conversation in Ollama. Because this is a feature request rather than a workaround, nothing can be applied today; the phased approach gives development a clear structure and keeps the existing API endpoints backwards compatible.
