hermes - [Feature]: Real-time Voice Conversation Mode (voice-in, voice-out, low-latency)

Problem

Interaction with Hermes Agent is currently text-only (CLI, TUI, and IM platforms). There is no way to hold a natural, real-time voice conversation with the agent: you speak, it listens, thinks, and answers aloud with low latency, the way a person would.

With the rise of Thinking Machines Lab's Interaction Model, GPT-5 Realtime, and Google Gemini Omni, voice-first agent interaction is becoming a major trend in 2026. Hermes Agent, as the fastest-growing open-source agent framework, should lead in this direction too.

Proposed Feature

A real-time voice conversation mode in Hermes Agent's terminal/TUI that enables:

  1. Microphone input → Capture user's speech
  2. Streaming STT → Real-time speech-to-text (e.g., Whisper live / Deepgram / Realtime API)
  3. Low-latency LLM response → Agent processes input and generates response with minimal delay
  4. Streaming TTS → Natural-sounding voice output (e.g., ElevenLabs / edge-tts / OpenAI TTS)
  5. Turn-taking → Natural conversation flow with interruption handling
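
As an illustration of step 5, turn-taking with interruption handling can be modeled as a small state machine. This is only a sketch under assumptions: the names `State`, `Event`, and `TurnManager` are hypothetical and not part of Hermes Agent, and the events are assumed to come from a voice-activity detector and the audio player.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


class Event(Enum):
    SPEECH_START = auto()    # assumed to come from a voice-activity detector
    SPEECH_END = auto()
    RESPONSE_READY = auto()  # first agent output is available for TTS
    PLAYBACK_DONE = auto()


# (current state, event) -> next state. Pairs not listed here are ignored.
TRANSITIONS = {
    (State.IDLE, Event.SPEECH_START): State.LISTENING,
    (State.LISTENING, Event.SPEECH_END): State.THINKING,
    (State.THINKING, Event.RESPONSE_READY): State.SPEAKING,
    (State.SPEAKING, Event.PLAYBACK_DONE): State.IDLE,
    # Barge-in: the user starts talking while the agent thinks or speaks.
    (State.THINKING, Event.SPEECH_START): State.LISTENING,
    (State.SPEAKING, Event.SPEECH_START): State.LISTENING,
}


class TurnManager:
    """Hypothetical helper that tracks whose turn it is."""

    def __init__(self):
        self.state = State.IDLE
        self.interruptions = 0

    def handle(self, event):
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            return self.state  # event is not meaningful in this state
        if event is Event.SPEECH_START and self.state in (State.THINKING, State.SPEAKING):
            # A real implementation would cancel the LLM stream and stop playback here.
            self.interruptions += 1
        self.state = next_state
        return self.state
```

Keeping the transitions in a table makes the barge-in rule explicit: speech detected during `THINKING` or `SPEAKING` returns control to the user instead of being dropped.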

Use Cases

  • Hands-free coding assistance (while debugging, reading code)
  • Voice-driven research and Q&A
  • Accessibility for users who prefer speech over typing
  • Mobile-focused interactions (speak a request, hear the reply)
  • Natural conversation for brainstorming / thinking out loud with the agent

Suggested Architecture (High-Level)

┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│ Microphone   │────▶│ STT Engine   │────▶│ Agent       │
│ (pyaudio/    │     │ (Whisper     │     │ (LLM)       │
│  portaudio)  │     │  live/RT API)│     │             │
└──────────────┘     └──────────────┘     └─────────────┘
┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│ Speaker      │◀────│ TTS Engine   │◀────│ Agent       │
│ (pyaudio)    │     │ (ElevenLabs/ │     │ Response    │
│              │     │  edge-tts)   │     │ Text        │
└──────────────┘     └──────────────┘     └─────────────┘
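
One way to read the diagram is as a chain of concurrent stages connected by queues, so each stage can start work as soon as the previous one emits something. The sketch below uses only Python's standard `asyncio`. The functions `fake_stt`, `fake_agent`, and `fake_tts` are hypothetical stand-ins for a real STT engine, the agent, and a TTS engine; none of them reflect Hermes Agent's actual code.

```python
import asyncio

END = None  # sentinel marking the end of a stream


async def stage(transform, inbox, outbox):
    """Read items from inbox, apply transform, and forward results to outbox."""
    while True:
        item = await inbox.get()
        if item is END:
            await outbox.put(END)  # propagate shutdown downstream
            return
        await outbox.put(transform(item))


# Hypothetical stand-ins for the boxes in the diagram.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


async def run_pipeline(mic_chunks):
    """Feed microphone chunks through STT -> agent -> TTS and collect speaker output."""
    mic_q, text_q, reply_q, speaker_q = (asyncio.Queue() for _ in range(4))
    tasks = [
        asyncio.create_task(stage(fake_stt, mic_q, text_q)),
        asyncio.create_task(stage(fake_agent, text_q, reply_q)),
        asyncio.create_task(stage(fake_tts, reply_q, speaker_q)),
    ]
    for chunk in mic_chunks:
        await mic_q.put(chunk)
    await mic_q.put(END)

    played = []
    while True:
        audio = await speaker_q.get()
        if audio is END:
            break
        played.append(audio)  # a real speaker stage would write this to the audio device
    await asyncio.gather(*tasks)
    return played
```

Calling `asyncio.run(run_pipeline([b"hello"]))` pushes one chunk through all three stages. Swapping a stand-in for a real engine should only require changing the function passed to `stage`.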

Implementation Ideas

  • Option A: Built-in voice mode in the TUI — toggle between text and voice with a hotkey
  • Option B: Plugin/skill-based — voice is a standalone skill that wraps existing agent capabilities
  • Option C: WebRTC-based voice gateway — connect via SIP/WebRTC for phone-like experience
  • Latency optimization (applies to any option): Stream LLM responses and synthesize TTS in chunks, targeting sub-500 ms end-to-end latency
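
The latency idea above (streaming LLM output plus chunked TTS) could look roughly like the following: buffer streamed tokens and release text to the TTS engine at sentence boundaries, so playback starts before the full response exists. This is a minimal sketch; `chunk_for_tts` and its `min_chars` threshold are hypothetical, and the token stream is assumed to be an iterable of strings.

```python
import re
from typing import Iterable, Iterator

# A sentence boundary: ., ! or ? followed by whitespace. Requiring the trailing
# whitespace avoids splitting inside numbers such as "3.14".
_SENTENCE_END = re.compile(r"[.!?](?=\s)")


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield sentence-sized chunks from a token stream as soon as they complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            # Pick the first boundary that leaves at least min_chars of buffered text.
            match = next(
                (m for m in _SENTENCE_END.finditer(buffer) if m.end() >= min_chars),
                None,
            )
            if match is None:
                break
            chunk, buffer = buffer[: match.end()].strip(), buffer[match.end():]
            yield chunk
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever remains when the stream ends
```

A small `min_chars` lowers time-to-first-audio but may produce choppy prosody; a larger value does the opposite. The right trade-off likely depends on the TTS engine in use.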

Related

  • #32820 Architecture v0: control plane, brain, memory, voice, client, workers
  • #33898 Hide or configure voice transcript status rows in editable dictation mode (TTS/STT infra already exists)
  • #35622 Allow tts over ssh if pulseaudio is reachable

Why Now

2026 is being called "the year of voice agents" on X/Twitter. GPT-5 Realtime, Thinking Machines Lab's Interaction Model, and Google Gemini all point to voice-first interaction as the next frontier. Hermes Agent has a unique advantage — it's already the most-used open agent on OpenRouter. Adding real-time voice would be a killer differentiator.
