[Feature]: Real-time Voice Conversation Mode (voice-in, voice-out, low-latency)

Problem

Interaction with Hermes Agent is currently text-only (CLI, TUI, and IM platforms). There is no way to hold a natural, real-time voice conversation with the agent: you speak, it listens, thinks, and answers aloud with low latency, as in a conversation with a person.

With the rise of Thinking Machines Lab's Interaction Model, GPT-5 Realtime, and Google Gemini Omni, voice-first agent interaction is becoming a major trend in 2026. Hermes Agent, as the fastest-growing open-source agent framework, should lead in this direction too.

Proposed Feature

A real-time voice conversation mode in Hermes Agent's terminal/TUI that enables:

Microphone input → Capture user's speech
Streaming STT → Real-time speech-to-text (e.g., Whisper live / Deepgram / Realtime API)
Low-latency LLM response → Agent processes input and generates response with minimal delay
Streaming TTS → Natural-sounding voice output (e.g., ElevenLabs / edge-tts / OpenAI TTS)
Turn-taking → Natural conversation flow with interruption handling
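
The turn-taking item is the least obvious of the five, so here is a minimal sketch of how interruption handling could be modeled as a small state machine. The state names, event names, and the `TurnManager` class are all hypothetical and not part of Hermes Agent; the events are assumed to come from a voice-activity detector and the audio player.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()       # nobody is talking
    LISTENING = auto()  # user speech is being captured and transcribed
    THINKING = auto()   # agent is generating a response
    SPEAKING = auto()   # TTS audio is being played back


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "speech_start"): State.LISTENING,
    (State.LISTENING, "speech_end"): State.THINKING,
    (State.THINKING, "response_start"): State.SPEAKING,
    (State.SPEAKING, "playback_done"): State.IDLE,
    # Interruptions: the user starts talking before the agent has finished.
    (State.THINKING, "speech_start"): State.LISTENING,
    (State.SPEAKING, "speech_start"): State.LISTENING,
}


class TurnManager:
    """Tracks whose turn it is and counts user interruptions."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self.interruptions = 0

    def on_event(self, event: str) -> State:
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            return self.state  # ignore events that make no sense in this state
        if event == "speech_start" and self.state in (State.THINKING, State.SPEAKING):
            # A real implementation would cancel LLM generation or playback here.
            self.interruptions += 1
        self.state = next_state
        return self.state


if __name__ == "__main__":
    tm = TurnManager()
    for event in ["speech_start", "speech_end", "response_start", "speech_start"]:
        print(f"{event:>15} -> {tm.on_event(event).name}")
```

Keeping the transitions in a table makes the interruption rules easy to audit and to extend, for example with a timeout event that returns to `IDLE`.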

Use Cases

Hands-free coding assistance (while debugging, reading code)
Voice-driven research and Q&A
Accessibility for users who prefer speech over typing
Mobile-focused interactions (voice in = voice out)
Natural conversation for brainstorming / thinking out loud with the agent

Suggested Architecture (High-Level)

┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│ Microphone   │────▶│ STT Engine   │────▶│ Agent       │
│ (pyaudio/    │     │ (Whisper     │     │ (LLM)       │
│  portaudio)  │     │  live/RT API)│     │             │
└──────────────┘     └──────────────┘     └─────────────┘
                                                 │
                                                 ▼
┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│ Speaker      │◀────│ TTS Engine   │◀────│ Agent       │
│ (pyaudio)    │     │ (ElevenLabs/ │     │ Response    │
│              │     │  edge-tts)   │     │ Text        │
└──────────────┘     └──────────────┘     └─────────────┘
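
One way to read the diagram is as a set of swappable stages behind small interfaces. The sketch below is an assumption about how that might look, not an existing Hermes Agent API: `STTEngine`, `Agent`, `TTSEngine`, and `VoicePipeline` are hypothetical names, and the `Fake*` classes are stand-ins so the flow can run without audio hardware or network services.

```python
from typing import Callable, Iterable, Iterator, List, Protocol


class STTEngine(Protocol):
    def transcribe(self, frames: Iterable[bytes]) -> str: ...


class Agent(Protocol):
    def respond(self, text: str) -> Iterator[str]: ...


class TTSEngine(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class VoicePipeline:
    """Wires microphone frames -> STT -> agent -> TTS -> speaker for one turn."""

    def __init__(self, stt: STTEngine, agent: Agent, tts: TTSEngine,
                 play: Callable[[bytes], None]) -> None:
        self.stt = stt
        self.agent = agent
        self.tts = tts
        self.play = play  # e.g. a function that writes to an audio output stream

    def run_turn(self, frames: Iterable[bytes]) -> str:
        user_text = self.stt.transcribe(frames)
        reply: List[str] = []
        # Speak each piece as soon as the agent yields it instead of waiting
        # for the full response.
        for piece in self.agent.respond(user_text):
            self.play(self.tts.synthesize(piece))
            reply.append(piece)
        return " ".join(reply)


# Stand-in engines (hypothetical) that treat audio as UTF-8 text.
class FakeSTT:
    def transcribe(self, frames: Iterable[bytes]) -> str:
        return b"".join(frames).decode("utf-8")


class FakeAgent:
    def respond(self, text: str) -> Iterator[str]:
        yield "You said:"
        yield text


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


if __name__ == "__main__":
    played: List[bytes] = []
    pipeline = VoicePipeline(FakeSTT(), FakeAgent(), FakeTTS(), played.append)
    print(pipeline.run_turn([b"hel", b"lo"]))
    print(played)
```

Because each stage is only a structural interface, a Whisper-based or hosted STT backend and an ElevenLabs or edge-tts backend could be swapped in without touching the pipeline itself.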

Implementation Ideas

Option A: Built-in voice mode in the TUI, toggled between text and voice with a hotkey
Option B: Plugin/skill-based, where voice is a standalone skill that wraps existing agent capabilities
Option C: WebRTC-based voice gateway, connecting via SIP/WebRTC for a phone-like experience
Latency optimization: Use streaming LLM responses + chunked TTS to target sub-500ms end-to-end latency
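
The latency idea above depends on sending text to TTS before the LLM has finished. A minimal sketch of that chunking step follows, assuming the LLM yields arbitrary token strings; `chunk_for_tts` and `max_chars` are hypothetical names, and the sentence-boundary regex is a simple heuristic rather than a full sentence segmenter.

```python
import re
from typing import Iterable, Iterator

# End of sentence: terminal punctuation followed by whitespace, so "3.14"
# is not treated as a boundary.
_SENTENCE_END = re.compile(r"([.!?])\s")


def chunk_for_tts(tokens: Iterable[str], max_chars: int = 200) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match:
                cut = match.end(1)   # cut right after the punctuation mark
            elif len(buffer) >= max_chars:
                cut = max_chars      # hard cut so a long sentence cannot stall audio
            else:
                break
            chunk, buffer = buffer[:cut].strip(), buffer[cut:].lstrip()
            if chunk:
                yield chunk
    if buffer.strip():
        yield buffer.strip()         # flush whatever is left when the stream ends


if __name__ == "__main__":
    stream = ["Hel", "lo the", "re. How", " are you", "? Fine"]
    for chunk in chunk_for_tts(stream):
        print(repr(chunk))
```

With this approach, the first audio can start once the first sentence is complete, so perceived latency depends on time-to-first-sentence rather than on total generation time.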

Related Issues

#32820 Architecture v0: control plane, brain, memory, voice, client, workers
#33898 Hide or configure voice transcript status rows in editable dictation mode (TTS/STT infra already exists)
#35622 Allow tts over ssh if pulseaudio is reachable

Why Now

2026 is being called "the year of voice agents" on X/Twitter. GPT-5 Realtime, Thinking Machines Lab's Interaction Model, and Google Gemini all point to voice-first interaction as the next frontier. Hermes Agent has a unique advantage: it is already the most-used open agent on OpenRouter. Adding real-time voice would be a killer differentiator.
