hermes - Feature: live meeting voice bridge via Vexa /speak

StepCodex · 2026-06-05T15:50:31Z

Feature description

Add a first-class live meeting voice bridge so a Hermes agent can participate in an online meeting as a real-time voice participant, not only as a post-meeting summarizer or text backchannel.

Target behavior:

- The agent joins or attaches to an active Teams/Google Meet/Zoom meeting through a meeting-bot runtime such as Vexa.
- Hermes receives live transcript/audio events from the meeting.
- The user can explicitly ask the agent to speak, or a configured policy can allow limited autonomous interventions.
- Hermes generates a short answer, renders TTS, and injects it into the bot microphone so meeting participants hear the agent in the call.
- The agent can also optionally send meeting chat messages, but voice is the main missing path.

Motivation

Hermes already has:

- messaging gateway + tools;
- TTS providers;
- Teams meeting pipeline for post-meeting summaries/transcripts;
- live meeting/backchannel patterns;
- external meeting-bot candidates such as Vexa, which exposes `/speak`, chat, screen, and transcript APIs.

The gap is an official, safe integration layer that turns these pieces into an operator-facing workflow:

live meeting transcript → Hermes reasoning/policy → TTS → meeting bot microphone injection.

Use case: an agent should be able to join a meeting as a named assistant, listen, and speak only when authorized or when a strict policy permits it.
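
The four-stage workflow above can be sketched as a single function that wires the stages together. This is only an illustration under stated assumptions: `TranscriptEvent`, `run_pipeline`, and the stub stages are hypothetical names, not existing Hermes or Vexa APIs, and the real stages would call the reasoning, TTS, and meeting-bot layers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TranscriptEvent:
    """One live transcript segment (hypothetical shape)."""
    speaker: str
    text: str


def run_pipeline(
    event: TranscriptEvent,
    decide: Callable[[TranscriptEvent], Optional[str]],
    synthesize: Callable[[str], bytes],
    inject: Callable[[bytes], None],
) -> bool:
    """Pass one transcript event through the stages; True if audio was injected."""
    reply = decide(event)      # reasoning + policy; None means "stay silent"
    if reply is None:
        return False
    audio = synthesize(reply)  # TTS stage
    inject(audio)              # meeting-bot microphone injection
    return True


def decide_stub(event: TranscriptEvent) -> Optional[str]:
    """Stub policy: answer only when the agent is addressed by name."""
    if "hermes" in event.text.lower():
        return "Noted."
    return None


def tts_stub(text: str) -> bytes:
    """Stub TTS: real code would return rendered audio, not encoded text."""
    return text.encode("utf-8")


played: List[bytes] = []  # stands in for the bot microphone
spoke = run_pipeline(
    TranscriptEvent("alice", "Hermes, can you note that?"),
    decide_stub,
    tts_stub,
    played.append,
)
```

Keeping each stage behind a plain callable lets the policy step veto speech before any TTS or injection work happens.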

Proposed solution

Add a meeting voice bridge plugin/toolset, initially backed by Vexa because it is open-source and already has interactive meeting controls.

Suggested tools:

- `meeting_join(platform, meeting_url | native_meeting_id, mode="observer|voice")`
- `meeting_status(meeting_id)`
- `meeting_transcript(meeting_id, since=None)`
- `meeting_say(meeting_id, text, voice=None, provider=None)`
- `meeting_chat_send(meeting_id, text)`
- `meeting_leave(meeting_id)`
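
As one possible reading of the `meeting_join` signature, the sketch below validates its arguments before any bot is created. It assumes that exactly one of `meeting_url` and `native_meeting_id` must be given, that the platform names match the keys in the suggested config, and that the default mode is `observer`; `validate_join_args` is a hypothetical helper, not part of Hermes.

```python
from typing import Optional

# Assumed to mirror the platform keys and modes named in this proposal.
ALLOWED_PLATFORMS = {"teams", "google_meet", "zoom"}
ALLOWED_MODES = {"observer", "voice"}


def validate_join_args(
    platform: str,
    meeting_url: Optional[str] = None,
    native_meeting_id: Optional[str] = None,
    mode: str = "observer",
) -> dict:
    """Return a normalized join request, or raise ValueError on bad input."""
    if platform not in ALLOWED_PLATFORMS:
        raise ValueError(f"unsupported platform: {platform!r}")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    # Exactly one way of identifying the meeting must be supplied.
    if (meeting_url is None) == (native_meeting_id is None):
        raise ValueError("provide exactly one of meeting_url or native_meeting_id")
    if meeting_url is not None:
        target = {"meeting_url": meeting_url}
    else:
        target = {"native_meeting_id": native_meeting_id}
    return {"platform": platform, "mode": mode, **target}


request = validate_join_args("teams", native_meeting_id="abc123")
```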

Suggested config:

meeting_voice:
  provider: vexa
  default_mode: observer
  require_explicit_speak_approval: true
  max_utterance_seconds: 20
  tts_provider: edge  # or openai/minimax/elevenlabs/local
  platforms:
    teams:
      enabled: true
    google_meet:
      enabled: true
    zoom:
      enabled: true
  vexa:
    base_url: http://127.0.0.1:18056
    api_key_env: VEXA_API_KEY
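
A minimal sketch of how the `api_key_env` indirection might be resolved so that the secret never appears in the config file. It assumes the YAML above has already been parsed into a dict; `VexaSettings` and `load_vexa_settings` are hypothetical names used only for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VexaSettings:
    base_url: str
    api_key: str


def load_vexa_settings(config: dict, environ: Mapping[str, str] = os.environ) -> VexaSettings:
    """Read the Vexa section and fetch the key from the named environment variable."""
    vexa = config["meeting_voice"]["vexa"]
    env_name = vexa["api_key_env"]
    api_key = environ.get(env_name)
    if not api_key:
        # Fail early instead of calling Vexa without credentials.
        raise RuntimeError(f"environment variable {env_name} is not set")
    return VexaSettings(base_url=vexa["base_url"].rstrip("/"), api_key=api_key)


# Parsed form of the relevant part of the suggested config.
EXAMPLE_CONFIG = {
    "meeting_voice": {
        "provider": "vexa",
        "vexa": {"base_url": "http://127.0.0.1:18056", "api_key_env": "VEXA_API_KEY"},
    }
}

# A fake environment is passed in here; real code would rely on os.environ.
settings = load_vexa_settings(EXAMPLE_CONFIG, environ={"VEXA_API_KEY": "example-secret"})
```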

Safety / governance requirements

This should default to safe behavior:

- Observer-only by default.
- Speaking into a meeting is external/reputational output; require explicit user approval unless a profile explicitly opts into autonomous speech.
- Keep max utterance length short.
- Log every `meeting_say` with timestamp, meeting ID, text, and triggering user/policy.
- Support interruption/cancel when the user or another participant starts speaking.
- Allow profile-level policies like `backchannel_only`, `chat_only`, `voice_on_explicit_command`, `autonomous_voice_allowed`.
- Avoid committing commercial scope, pricing, deadlines, legal positions, or third-party actions through autonomous speech unless explicitly authorized.
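
The approval and logging requirements above could be sketched as a policy gate plus an audit record. The policy names come from this proposal, but their exact semantics are an assumption here, and `may_speak` and `audit_record` are hypothetical helpers rather than existing Hermes functions.

```python
import json
from datetime import datetime, timezone


def may_speak(policy: str, user_approved: bool) -> bool:
    """Decide whether a meeting_say call may proceed under a profile policy."""
    if policy == "autonomous_voice_allowed":
        return True
    if policy == "voice_on_explicit_command":
        return user_approved
    # backchannel_only, chat_only, and unknown policies never produce speech.
    return False


def audit_record(meeting_id: str, text: str, trigger: str, now: float) -> str:
    """Serialize one meeting_say event; `now` is a Unix timestamp in seconds."""
    stamp = datetime.fromtimestamp(now, tz=timezone.utc).isoformat()
    entry = {
        "timestamp": stamp,
        "meeting_id": meeting_id,
        "text": text,
        "trigger": trigger,  # the triggering user or policy
    }
    return json.dumps(entry, sort_keys=True)


line = audit_record("m1", "Noted.", "user:alice", now=0.0)
```

Treating unknown policy names as silent keeps the gate closed by default, which matches the observer-only default.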

Vexa integration notes

Relevant Vexa capabilities observed:

- `POST /bots` can create bots for Teams/Meet/Zoom.
- WebSocket transcript stream exists.
- Interactive endpoints exist or are documented:
  - `POST /bots/{platform}/{native_meeting_id}/speak`
  - `DELETE /bots/{platform}/{native_meeting_id}/speak`
  - chat read/write
  - screen/avatar controls
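
To make the two `/speak` endpoints concrete, the sketch below builds (but does not send) the corresponding HTTP requests with the standard library. Only the paths and methods come from the notes above. The `X-API-Key` header name and the `{"text": ...}` JSON body are assumptions that should be checked against the Vexa documentation, and the function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def speak_url(base_url: str, platform: str, native_meeting_id: str) -> str:
    """Build the /speak URL, percent-encoding both path segments."""
    return (
        f"{base_url.rstrip('/')}/bots/"
        f"{quote(platform, safe='')}/{quote(native_meeting_id, safe='')}/speak"
    )


def build_speak_request(base_url: str, api_key: str, platform: str,
                        native_meeting_id: str, text: str) -> Request:
    """POST .../speak; the body shape and header name are ASSUMPTIONS."""
    body = json.dumps({"text": text}).encode("utf-8")
    return Request(
        speak_url(base_url, platform, native_meeting_id),
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
    )


def build_cancel_request(base_url: str, api_key: str, platform: str,
                         native_meeting_id: str) -> Request:
    """DELETE .../speak, intended to stop speech that is in progress."""
    return Request(
        speak_url(base_url, platform, native_meeting_id),
        method="DELETE",
        headers={"X-API-Key": api_key},
    )


say = build_speak_request("http://127.0.0.1:18056", "key", "teams", "abc123", "Noted.")
stop = build_cancel_request("http://127.0.0.1:18056", "key", "teams", "abc123")
```

Either request could then be passed to `urllib.request.urlopen` once the payload format is confirmed.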

Related Vexa issues/docs:

- Vexa issue #120 documents a meeting interaction interface with `/speak`, chat, and screen sharing.
- Vexa issue #333 requests external AI agent integration via agent URL / bot camera+mic.

Acceptance criteria

1. Local/self-hosted Vexa can be configured from Hermes without hardcoding secrets.
2. Hermes can join a test meeting in observer mode and stream transcript/backchannel.
3. `meeting_say` causes participants to hear the agent through the meeting bot microphone.
4. The speak path returns success/failure based on real playback status, not just command enqueue.
5. A user can interrupt/cancel current speech.
6. The integration works at least for Teams in a live test; Meet/Zoom can follow.
7. All meeting speech events are auditable.
8. Documentation explains the difference between post-meeting transcript pipelines and live meeting voice participation.
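
Criterion 4 implies that `meeting_say` waits for a playback outcome rather than returning as soon as the command is queued. One way to sketch that is a polling loop with a timeout. The status strings (`queued`, `playing`, `played`, `failed`, `cancelled`) and the `get_status` callable are hypothetical; how Vexa actually reports playback state is not specified in this issue.

```python
import time
from typing import Callable


def wait_for_playback(
    get_status: Callable[[], str],
    timeout_s: float = 5.0,
    poll_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True only once playback is confirmed; False on failure or timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "played":
            return True
        if status in ("failed", "cancelled"):
            return False
        if clock() >= deadline:
            # Still queued or playing after the deadline: report failure.
            return False
        sleep(poll_s)


# Simulated status source standing in for a real playback-status query.
statuses = iter(["queued", "playing", "played"])
ok = wait_for_playback(lambda: next(statuses), sleep=lambda _s: None)
```

A cancelled utterance is reported as a failure here, so an interruption (criterion 5) is never mistaken for completed speech.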

Alternatives considered

- Continue using Teams Graph transcript pipeline only: good for post-meeting summaries, but it cannot speak live.
- Use Telegram/Slack voice backchannel only: safe and useful, but not a true meeting participant.
- Build a native Graph Communications bot from scratch: powerful, but much heavier than integrating an existing meeting-bot runtime first.
- Browser-only automation: possible, but fragile without a dedicated audio bridge and bot lifecycle API.
