hermes: Auto voice reply silently dropped on long-running voice-in turns (streaming + tool calls)

Environment

  • Hermes Agent v0.13.0 (2026.5.7) (commit 498bfc7)
  • Python 3.11
  • Platform: Telegram, DM chat
  • Voice mode: voice_only (set via /voice on), persisted in gateway_voice_mode.json
  • TTS provider: ElevenLabs (reproduced on both eleven_multilingual_v2 and eleven_v3)
  • Streaming enabled (default), reasoning lane = deep_reasoning (L5)

What happened

User sent a voice message in a non-English language (~149-char STT result). The agent ran for ~142s, made 5 OpenRouter API calls and several MCP tool calls, including ~10 failed tool calls returning HTTP 400 (separate, unrelated tool bug). The agent eventually produced a ~1000-char text response, which streamed to Telegram successfully.

No auto voice reply (auto-TTS) was sent. The user had to send a follow-up message asking for audio to get the audio version, which then succeeded via an explicit text_to_speech tool call from the agent.

For shorter voice-in turns on the same chat earlier in the same session (under 200 chars, single API call, no tool failures), auto voice reply fired correctly. The difference correlates with response length, total turn duration, and the presence of failed tool calls, but I have not isolated which of these is the trigger.

Expected

For chats in voice_mode=voice_only, voice input should always trigger an auto-TTS audio reply alongside the text response — that is what the runner's _send_voice_reply fallback exists for when streaming has consumed the response and the base adapter's _process_message auto-TTS path can't fire.

Actual

_send_voice_reply either was not invoked, or was invoked and exited silently. The except Exception wrapper in _send_voice_reply would have logged "Auto voice reply failed" at WARNING level, but no such line appears in gateway.log, errors.log, or mcp-stderr.log for that window. The function has multiple silent exit paths (if not tts_text: return, plus except Exception not catching asyncio.CancelledError, which has derived from BaseException since Python 3.8), so observability is limited.

Relevant gateway log excerpt (sanitized)

[T+0s]    INFO gateway.platforms.telegram: [Telegram] Cached user voice at .../audio_<hash>.ogg
[T+0s]    INFO gateway.run: inbound message: platform=telegram chat=<id> msg=''
[T+4s]    INFO tools.transcription_tools: Transcribed audio_<hash>.ogg via OpenAI API (149 chars)
[T+4s]    INFO gateway.run: lane_router ... input_kind=voice-transcript text_len=149 lane=L5 ...
   [10 failed tool calls between T+115s and T+128s — separate, unrelated issue]
[T+143s]  INFO gateway.run: Suppressing normal final send for session ...: final delivery already confirmed (streamed=True previewed=False).
[T+143s]  INFO gateway.run: response ready: platform=telegram time=142.6s api_calls=5 response=1088 chars
   [no "Auto voice reply" log line of any kind — neither success nor warning]
[T+207s]  INFO gateway.run: inbound message: ... msg='Audio?'   # user follow-up
[T+243s]  INFO gateway.run: response ready: ... time=36.4s api_calls=2 response=93 chars
   # tts_<timestamp>.ogg appeared in the audio cache — audio delivered for this follow-up turn only

voice_mode state at the time:

{"telegram:<chat_id>": "voice_only"}

Diagnostic notes

_should_send_voice_reply (gateway/run.py:8983) has a dedup branch:

# Dedup: base adapter auto-TTS already handles voice input
# (play_tts plays in VC when connected, so runner can skip).
# When streaming already delivered the text (already_sent=True),
# the base adapter will receive None and can't run auto-TTS,
# so the runner must take over.
if is_voice_input and not already_sent:
    return False

For the failing turn, already_sent=True is confirmed set by the "Suppressing normal final send" log line (emitted at gateway/run.py:14981 inside _run_agent, before that function returns to _handle_message_with_agent). So this branch should not fire and the runner should proceed to _send_voice_reply.
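
The dedup branch can be checked in isolation with a small truth table. This is only a sketch: should_skip_runner_voice_reply is a hypothetical stand-in that reproduces the single condition quoted above, not the full _should_send_voice_reply, which presumably has other branches.

```python
def should_skip_runner_voice_reply(is_voice_input: bool, already_sent: bool) -> bool:
    """Hypothetical stand-in for the dedup branch quoted above.

    Returns True when the runner defers to the base adapter's auto-TTS.
    """
    return is_voice_input and not already_sent


# Print every combination so the failing turn's case is easy to spot.
for is_voice_input in (False, True):
    for already_sent in (False, True):
        skip = should_skip_runner_voice_reply(is_voice_input, already_sent)
        print(
            f"is_voice_input={is_voice_input!s:5} "
            f"already_sent={already_sent!s:5} -> runner skips: {skip}"
        )
```

The failing turn corresponds to is_voice_input=True, already_sent=True, for which this condition is False, so the early return should not have been taken.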

If _send_voice_reply did run, the silent failure is the issue. Two candidate exits in that function:

  1. if not tts_text: return after _strip_markdown_for_tts(text[:4000]) — silent return with no log if the markdown stripper returns empty for the input.

  2. except Exception does not catch asyncio.CancelledError in Python 3.8+ (it inherits from BaseException, not Exception). If the parent task is cancelled mid-await on asyncio.to_thread(text_to_speech_tool, ...) — which can take 60–90s for a 1000+-char ElevenLabs Multilingual v2 generation — the cancellation propagates silently. The finally cleanup runs but no log is emitted.
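
The second candidate can be reproduced outside the gateway. The sketch below is not the real _send_voice_reply: slow_tts and send_voice_reply are hypothetical stand-ins, and the 0.3s sleep replaces the 60–90s TTS call. It only shows that cancelling a task while it awaits asyncio.to_thread skips the except Exception handler and still runs finally.

```python
import asyncio
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("voice_reply_demo")

events = []  # records which handlers actually ran


def slow_tts(text: str) -> str:
    """Hypothetical stand-in for a slow, blocking TTS call."""
    time.sleep(0.3)
    return f"audio for {len(text)} chars"


async def send_voice_reply(text: str) -> None:
    try:
        await asyncio.to_thread(slow_tts, text)
        events.append("sent")
    except Exception as exc:  # CancelledError is not an Exception subclass
        events.append("warning_logged")
        logger.warning("Auto voice reply failed: %s", exc)
    finally:
        events.append("finally")  # cleanup runs, but nothing is logged


async def main() -> None:
    task = asyncio.create_task(send_voice_reply("x" * 1000))
    await asyncio.sleep(0.05)  # let the task reach the to_thread await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        events.append("cancelled")


asyncio.run(main())
print(events)  # ['finally', 'cancelled']
```

Neither "sent" nor "warning_logged" is recorded, which matches the symptom of a turn with no "Auto voice reply" log line of any kind. Note that the worker thread itself is not interrupted by the cancellation; only the awaiting coroutine is.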

Repro hypothesis (untested)

  1. Set voice_mode=voice_only for a Telegram chat
  2. Send a voice message that triggers a long agent run (>60s) with multiple tool calls
  3. Expect auto-TTS audio reply alongside the streamed text response
  4. Observe: text arrives, audio does not, no error logged

Suggested investigation

  • Log at INFO level when _send_voice_reply is entered and at INFO/DEBUG when each exit path is taken (incl. the if not tts_text early return)
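  • Add an explicit except asyncio.CancelledError handler in _send_voice_reply that logs and then re-raises (preferred), or change except Exception to except BaseException and re-raise after logging. A bare except BaseException that does not re-raise would swallow the cancellation, as well as KeyboardInterrupt and SystemExit.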
  • Consider whether the parent message-handler task can complete (return) while _send_voice_reply is still awaiting asyncio.to_thread(text_to_speech_tool, ...); if so, the long TTS generation may be at risk of cancellation when the next inbound message arrives
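
The first two suggestions could look roughly like the sketch below. It is an illustration under assumptions, not a patch: strip_markdown_for_tts and text_to_speech are hypothetical placeholders for the real helpers, the log wording is invented, and the string return value exists only so each exit path can be checked here.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.run")


def strip_markdown_for_tts(text: str) -> str:
    """Hypothetical placeholder for the real markdown stripper."""
    return text.replace("*", "").replace("`", "").strip()


def text_to_speech(text: str) -> str:
    """Hypothetical placeholder for the blocking TTS tool; returns a fake path."""
    return "/tmp/tts_demo.ogg"


async def send_voice_reply(text: str) -> str:
    """Log every exit path and return a label naming the one taken."""
    logger.info("Auto voice reply: entered (text_len=%d)", len(text))
    tts_text = strip_markdown_for_tts(text[:4000])
    if not tts_text:
        logger.info("Auto voice reply: skipped, empty text after markdown strip")
        return "empty"
    try:
        path = await asyncio.to_thread(text_to_speech, tts_text)
        logger.info("Auto voice reply: generated %s", path)
        return "sent"
    except asyncio.CancelledError:
        logger.warning("Auto voice reply: cancelled during TTS generation")
        raise  # always re-raise so cancellation still propagates
    except Exception as exc:
        logger.warning("Auto voice reply failed: %s", exc)
        return "failed"


print(asyncio.run(send_voice_reply("**hello**")))
print(asyncio.run(send_voice_reply("***")))
```

With this shape, every turn that reaches the function leaves at least one log line, so "not invoked" and "invoked but exited silently" become distinguishable in gateway.log.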

Workaround

A possible workaround is to have the agent explicitly call text_to_speech when the v0.13 voice-transcript wrapper format ([The user sent a voice message~ Here's what they said: "..."]) is detected in the user message. This routes the audio reply through the has_agent_tts dedup path that the runner respects, sidestepping _send_voice_reply entirely.
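
If the detection were done in code rather than by the agent itself, it might look like the sketch below. The prefix and suffix are copied from the wrapper example above; the exact punctuation and whitespace used by the gateway are assumptions, and extract_voice_transcript is a hypothetical helper name.

```python
import re
from typing import Optional

# Built from the wrapper example quoted above; exact spacing is an assumption.
PREFIX = "[The user sent a voice message~ Here's what they said: \""
SUFFIX = "\"]"
VOICE_WRAPPER_RE = re.compile(
    re.escape(PREFIX) + r"(?P<transcript>.*)" + re.escape(SUFFIX),
    re.DOTALL,
)


def extract_voice_transcript(message: str) -> Optional[str]:
    """Return the transcript if the message uses the voice wrapper, else None."""
    match = VOICE_WRAPPER_RE.fullmatch(message.strip())
    return match.group("transcript") if match else None


wrapped = PREFIX + "hola, que tal" + SUFFIX
print(extract_voice_transcript(wrapped))
print(extract_voice_transcript("Audio?"))
```

A non-None result would be the signal to call text_to_speech explicitly for that turn.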
