Idea: Discord voice-channel participation / opt-in auto-join mode

Feature / Idea

Explore a Discord voice-channel participation mode for Hermes where the bot can join a voice channel, listen to authorized users, transcribe speech, route it through the normal Hermes agent pipeline, and speak replies back into the same voice channel.

Hermes already has the core voice-channel path documented/implemented (/voice join): Discord VC audio → STT → agent → TTS → VC playback. This issue is to capture the idea and possible next-step UX around making that capability easier and safer to use.

Motivation

In Discord-heavy workflows, a natural interaction model is:

a voice channel starts or a user enters a VC,
Hermes can be invited or optionally auto-join under policy,
authorized users speak naturally,
Hermes understands the utterance, responds with voice, and mirrors transcript/replies into the bound text channel.

This would make Hermes feel less like a text bot with voice attachments and more like a live assistant in a Discord room.

Current Understanding

The currently documented flow appears to be explicit and manual:

User joins a Discord voice channel.
User runs /voice join from a text channel.
Hermes joins the user's current VC.
Hermes listens to allowed users only, transcribes via STT, processes through the agent, and speaks replies via TTS.
/voice leave disconnects.

This is a good privacy-preserving default. The idea here is not to remove that, but to consider a controlled mode that can react to voice-channel state more automatically.

Possible UX / Scope

Phase 1: Improve explicit mode

Make /voice join setup failures actionable: name the missing piece and how to fix it.
Add a concise voice doctor/status message when dependencies are missing:
- PyNaCl
- davey
- faster-whisper or cloud STT key
- Opus / ffmpeg
Surface whether the current Discord bot has the Connect, Speak, and Use Voice Activity permissions.
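
A dependency check along these lines could back the voice doctor/status message. This is only a sketch, not Hermes's actual implementation: `voice_doctor`, `module_present`, and `binary_present` are hypothetical names, the import names (`nacl` for PyNaCl, `davey`, `faster_whisper`) and the `pip install` hints are assumptions, and the Opus and Discord permission checks are left out.

```python
import importlib.util
import shutil
from typing import Callable, Optional

# Assumed import names: PyNaCl is imported as `nacl`, faster-whisper as
# `faster_whisper`; `davey` is assumed to import under its package name.
PYTHON_MODULES = {
    "PyNaCl": "nacl",
    "davey": "davey",
    "faster-whisper": "faster_whisper",
}


def module_present(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def binary_present(name: str) -> bool:
    """Return True if an executable with this name is on PATH."""
    return shutil.which(name) is not None


def voice_doctor(
    has_module: Callable[[str], bool] = module_present,
    has_binary: Callable[[str], bool] = binary_present,
    cloud_stt_key: Optional[str] = None,
) -> list[str]:
    """Return one actionable line per missing piece; an empty list means ready."""
    problems = []
    for pkg in ("PyNaCl", "davey"):
        if not has_module(PYTHON_MODULES[pkg]):
            problems.append(f"Missing {pkg}: pip install {pkg}")
    # STT is satisfied by either a local model or a configured cloud key.
    if not has_module(PYTHON_MODULES["faster-whisper"]) and not cloud_stt_key:
        problems.append(
            "No STT provider: pip install faster-whisper or set a cloud STT key"
        )
    if not has_binary("ffmpeg"):
        problems.append("Missing ffmpeg: install it and make sure it is on PATH")
    return problems
```

Calling `voice_doctor()` with no arguments probes the real environment; the probe functions are injectable so the check can be unit-tested without installing anything.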

Phase 2: Opt-in auto-join policy

Add a disabled-by-default config such as:

discord:
  voice:
    auto_join: false
    auto_join_channels: []       # allowlisted VC IDs
    auto_join_users: []          # optional user allowlist
    require_text_opt_in: true    # require a text-channel command/session binding first
    idle_timeout_seconds: 300

Possible behavior:

If an authorized user enters an allowlisted voice channel, Hermes can join automatically.
Or Hermes can post a text prompt/button asking whether to join.
Hermes should leave after idle timeout or when no authorized users remain.
Transcripts and replies should be mirrored to a configured/bound text channel.
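
The join and leave rules above could be expressed as two pure functions over the proposed config. The sketch below is illustrative only: `VoiceAutoJoinConfig`, `should_auto_join`, and `should_leave` are hypothetical names, and treating an empty `auto_join_users` list as "no extra user filter" is an assumption about what "optional user allowlist" means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceAutoJoinConfig:
    """Mirrors the proposed discord.voice keys; everything is off by default."""
    auto_join: bool = False
    auto_join_channels: frozenset = frozenset()
    auto_join_users: frozenset = frozenset()  # assumption: empty = no extra filter
    require_text_opt_in: bool = True
    idle_timeout_seconds: int = 300


def should_auto_join(
    cfg: VoiceAutoJoinConfig,
    user_id: int,
    channel_id: int,
    user_is_authorized: bool,
    text_opt_in_active: bool,
) -> bool:
    """Decide whether a user entering a voice channel should trigger a join."""
    if not cfg.auto_join or not user_is_authorized:
        return False
    if channel_id not in cfg.auto_join_channels:
        return False
    if cfg.auto_join_users and user_id not in cfg.auto_join_users:
        return False
    if cfg.require_text_opt_in and not text_opt_in_active:
        return False
    return True


def should_leave(
    cfg: VoiceAutoJoinConfig,
    authorized_users_present: int,
    seconds_since_last_speech: float,
) -> bool:
    """Leave when nobody authorized remains or the idle timeout has elapsed."""
    return (
        authorized_users_present == 0
        or seconds_since_last_speech >= cfg.idle_timeout_seconds
    )
```

Keeping the decision free of Discord objects means the same functions could serve either the fully automatic variant or the "post a prompt first" variant, which would simply ask before acting on a `True` result.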

Phase 3: Voice activity awareness

Detect who is speaking and map audio to user identity via Discord voice events / SSRC mapping.
Ignore unauthorized users silently.
Prevent echo by pausing the listener while Hermes's TTS is playing.
Consider wake-word or push-to-talk style activation to avoid sending every utterance to the agent.
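
The four points above could live in one small gate that sits between the audio receiver and STT. This is a sketch under assumptions: `VoiceGate` and its methods are hypothetical, the SSRC-to-user mapping is assumed to be fed from Discord's speaking events by the adapter, and the wake-word match is a naive prefix check on the transcript rather than real keyword spotting.

```python
from typing import Iterable, Optional


class VoiceGate:
    """Decides which audio is transcribed and which transcripts reach the agent."""

    def __init__(self, allowed_user_ids: Iterable[int], wake_word: Optional[str] = None):
        self.allowed = set(allowed_user_ids)
        self.wake_word = wake_word.lower() if wake_word else None
        self.ssrc_to_user: dict[int, int] = {}
        self.tts_playing = False  # set True while Hermes is speaking

    def on_speaking(self, ssrc: int, user_id: int) -> None:
        """Record which user an audio stream (SSRC) belongs to."""
        self.ssrc_to_user[ssrc] = user_id

    def accept_audio(self, ssrc: int) -> Optional[int]:
        """Return the speaker's user ID if the audio should be transcribed."""
        if self.tts_playing:
            return None  # avoid transcribing Hermes's own playback
        user_id = self.ssrc_to_user.get(ssrc)
        if user_id is None or user_id not in self.allowed:
            return None  # unknown or unauthorized speakers are dropped silently
        return user_id

    def extract_command(self, transcript: str) -> Optional[str]:
        """Return the text to send to the agent, or None to ignore the utterance."""
        text = transcript.strip()
        if self.wake_word is None:
            return text or None
        if text.lower().startswith(self.wake_word):
            rest = text[len(self.wake_word):].lstrip(" ,.:!")
            return rest or None
        return None
```

For example, with `VoiceGate({1}, wake_word="hermes")`, audio from an unmapped SSRC is ignored, and the transcript "Hermes, what time is it?" is reduced to "what time is it?" before it reaches the agent.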

STT / TTS Options

Recommended lightweight defaults:

STT: local faster-whisper (tiny/base/small) for no API cost.
STT alternative: Groq Whisper for lower latency.
TTS: Edge TTS as a free default.
Premium TTS: ElevenLabs / OpenAI / MiniMax if configured.
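
One way to read the STT defaults above is as a fallback order: local faster-whisper when installed, Groq when a key is configured and latency matters more. The sketch below only illustrates that ordering; `choose_stt`, its parameters, and the returned dictionary shape are hypothetical and not part of Hermes.

```python
from typing import Optional

LOCAL_MODELS = ("tiny", "base", "small")


def choose_stt(
    local_available: bool,
    groq_api_key: Optional[str] = None,
    prefer_low_latency: bool = False,
    local_model: str = "base",
) -> Optional[dict]:
    """Pick an STT provider, or return None if nothing usable is configured."""
    if local_model not in LOCAL_MODELS:
        raise ValueError(f"local_model must be one of {LOCAL_MODELS}")
    # Cloud STT is only preferred when explicitly requested and a key exists.
    if prefer_low_latency and groq_api_key:
        return {"provider": "groq"}
    if local_available:
        return {"provider": "faster-whisper", "model": local_model}
    if groq_api_key:
        return {"provider": "groq"}
    return None
```

A `None` result is the case where the status message should tell the user that no STT provider is available.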

Privacy / Safety Considerations

This feature should remain explicitly opt-in.

Important guardrails:

Never auto-join arbitrary VCs by default.
Only process audio from DISCORD_ALLOWED_USERS or configured allowlists.
Clearly announce in the text channel when Hermes joins and is listening.
Mirror transcripts to a bound text channel for transparency.
Provide /voice leave and automatic idle timeout.
Avoid recording/storing raw audio unless explicitly configured.

Open Questions

Should auto-join be fully automatic, or should it post an approval prompt first?
Should voice interaction require a wake word, push-to-talk, or mention phrase?
How should Hermes choose the text channel for transcripts if a VC has no obvious paired text channel?
What is the right default local STT model (tiny, base, or small) for latency vs quality?
Should this be implemented inside the existing Discord platform adapter or as a separate voice orchestrator/plugin?

Notes from local investigation

On one tested Hermes environment, the existing Discord bot permissions were already sufficient (Connect, Speak, Use VAD, Send Voice Messages) and edge-tts, ffmpeg, and Opus were present. Missing runtime pieces were PyNaCl, davey, and an STT provider (faster-whisper locally or a cloud STT key).

This suggests the feature is feasible with the existing architecture; the main work is around UX, dependency checking, configuration, and safe opt-in behavior.
