hermes - Telegram voice batching can drop the in-flight response when subsequent voices arrive mid-processing [3 pull requests]

hermes, 2026-05-24 05:35:43

In a Telegram group chat with multiple rapid-fire voice messages, the gateway's batching pipeline can silently drop the in-flight assistant response for a middle message when newer voices arrive before that response is sent.

Root Cause

Fix Action

Fixed

Fixed by PR: fix(telegram): transcribe voice notes before mention filter so wake-word groups can hear them (https://github.com/NousResearch/hermes-agent/pull/31340)
Fixed by PR: fix(telegram): queue voice follow-ups instead of interrupting in-flight reply (#31328) (https://github.com/NousResearch/hermes-agent/pull/31342)
Fixed by PR: Task: revert PR #101 (voice eager-transcribe patch no longer needed) (https://github.com/leebaroneau/template-agent/pull/104)

Summary

Environment

Hermes Agent 0.14.0
Provider: anthropic, model claude-haiku-4-5-20251001
Telegram via python-telegram-bot polling mode
STT: local faster-whisper, model=base/small
Voice messages, ~2s each, sent ~3 seconds apart

Repro

In a group chat where messages reach the agent (we used free_response_chats to bypass the mention filter for the test), send voice 1 and wait ~60 seconds. → response arrives normally.
Send voice 2, then voice 3 ~3 seconds later, then voice 4 ~3 seconds after that (rapid-fire).
Voices 3 + 4 are batched into one agent turn (expected, documented).
Voice 2 is transcribed and a conversation turn at history=N is logged for it — but no response ready / Sending response line appears for it. The session jsonl has no record of voice 2 or any assistant reply to it.
The combined response that gets sent acknowledges voices 3 + 4 but not voice 2.

Observed (anonymized log excerpt)

HH:MM:00  [Telegram] Cached user voice at audio_AAA.ogg
HH:MM:00  inbound message: msg=''
HH:MM:12  response ready: time=11.6s response=78 chars       # voice 1 — fine

HH:MM:72  [Telegram] Cached user voice at audio_BBB.ogg      # voice 2
HH:MM:72  inbound message: msg=''
HH:MM:75  [Telegram] Cached user voice at audio_CCC.ogg      # voice 3
HH:MM:77  tools.transcription_tools: Transcribed audio_BBB ('hermes test 2')
HH:MM:78  [Telegram] Cached user voice at audio_DDD.ogg      # voice 4
HH:MM:79  run_agent: conversation turn: history=6 msg='[The user sent a voice message~ Here\'s what they said: "hermes test 2"]'
HH:MM:86  tools.transcription_tools: Transcribed audio_CCC ('test 3')
HH:MM:90  tools.transcription_tools: Transcribed audio_DDD ('test 4')
HH:MM:91  run_agent: conversation turn: history=8 msg='[The user sent a voice message~ Here\'s what they said: "test 3"]  [The user sent a voice message~ Here\'s what they said: "test 4"]'
HH:MM:93  response ready: time=21.0s response=71 chars       # voices 3+4 — fine

There is no response ready line between the two conversation turn entries. Voice 2's history=6 turn appears to start, but neither its response nor the user message itself makes it into the session jsonl. The final session has entries for voice 1, the batched voices 3+4, and their two responses, but nothing for voice 2.
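To spot this pattern in a longer log, a small script can flag every conversation turn that is followed by another turn (or the end of the log) without a response ready line in between. This is only a sketch: the function name `find_orphaned_turns` and the regular expressions are assumptions based on the log lines quoted above, not part of Hermes.

```python
import re

# Patterns are guesses derived from the log excerpt in this report.
TURN = re.compile(r"run_agent: conversation turn: history=(\d+)")
READY = re.compile(r"response ready:")

def find_orphaned_turns(lines):
    """Return the history= values of turns that never logged 'response ready'."""
    orphaned = []
    pending = None  # history value of the turn still waiting for a response
    for line in lines:
        match = TURN.search(line)
        if match:
            if pending is not None:
                # A new turn started before the previous one reported a response.
                orphaned.append(pending)
            pending = int(match.group(1))
        elif READY.search(line):
            pending = None
    if pending is not None:
        orphaned.append(pending)
    return orphaned

# Shortened version of the excerpt above (message bodies elided).
SAMPLE_LOG = """\
HH:MM:12  response ready: time=11.6s response=78 chars
HH:MM:79  run_agent: conversation turn: history=6 msg='...'
HH:MM:91  run_agent: conversation turn: history=8 msg='...'
HH:MM:93  response ready: time=21.0s response=71 chars
"""

print(find_orphaned_turns(SAMPLE_LOG.splitlines()))  # [6]
```

On the excerpt, only the history=6 turn (voice 2) is reported.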

Expected

Voice 2 should either:

Get its own response, or
Be batched into the voices 3+4 turn (so the combined message acknowledges all three voice notes — like other media batching paths)

Either is fine; silent loss isn't.

Source pointer

gateway/platforms/telegram.py — _handle_media_message enqueues per-voice; the batch flush path appears to cancel the prior pending media turn when a new voice arrives within the batching window, but only if the previous turn was actually pending in the queue. If the prior turn has already started but not finished, its response gets orphaned.
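One way a response could be lost like this is if a turn that has already started gets interrupted when a newer voice arrives. The toy asyncio model below shows that failure mode in isolation. It is not the Hermes code path: `run_turn`, `sent`, and the cancel call are all hypothetical stand-ins, and the actual mechanism in the gateway has not been confirmed here.

```python
import asyncio

async def demo_orphaned_turn():
    sent = []  # responses that actually reached the "send" step

    async def run_turn(batch):
        await asyncio.sleep(0.05)       # stand-in for the agent/LLM call
        sent.append(" + ".join(batch))  # only reached if the turn is not interrupted

    # Voice 2's turn starts.
    in_flight = asyncio.create_task(run_turn(["voice 2"]))
    await asyncio.sleep(0.01)

    # A newer voice arrives mid-processing and the started turn is cancelled.
    in_flight.cancel()
    try:
        await in_flight
    except asyncio.CancelledError:
        pass

    # Only the later voices are processed as a fresh batch.
    await run_turn(["voice 3", "voice 4"])
    return sent

print(asyncio.run(demo_orphaned_turn()))  # ['voice 3 + voice 4']
```

Voice 2 never reaches the send step, which matches the symptom in the log: a turn is logged, but no response follows.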

I haven't tested the fix, but I think the resolution is to either (a) wait for in-flight turns to finish before starting a new batch, or (b) merge late-arriving voices into the in-flight turn instead of starting a new one.
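Option (a) can be sketched as a per-chat queue that never interrupts a running turn: voices that arrive mid-processing are buffered and handled as the next batch once the current turn has responded. The class `ChatTurnQueue` and the `run_turn` callback are hypothetical names for illustration, not Hermes APIs, and this has not been tested against the gateway.

```python
import asyncio

class ChatTurnQueue:
    """Hypothetical per-chat serializer that never cancels an in-flight turn."""

    def __init__(self, run_turn):
        self._run_turn = run_turn  # async callable: list[str] -> str
        self._pending = []         # transcripts waiting for the next turn
        self._worker = None        # task that drains the pending list
        self.responses = []

    def submit(self, transcript):
        # Must be called from within a running event loop.
        self._pending.append(transcript)
        if self._worker is None or self._worker.done():
            self._worker = asyncio.create_task(self._drain())

    async def _drain(self):
        while self._pending:
            # Take everything queued so far as one batch, then run it to completion.
            batch, self._pending = self._pending, []
            self.responses.append(await self._run_turn(batch))

    async def wait_idle(self):
        if self._worker is not None:
            await self._worker

async def demo():
    async def fake_run_turn(batch):
        await asyncio.sleep(0.05)  # stand-in for the agent call
        return "reply to: " + " | ".join(batch)

    queue = ChatTurnQueue(fake_run_turn)
    queue.submit("hermes test 2")
    await asyncio.sleep(0.01)      # voice 2's turn is now in flight
    queue.submit("test 3")
    queue.submit("test 4")
    await queue.wait_idle()
    return queue.responses

print(asyncio.run(demo()))
# ['reply to: hermes test 2', 'reply to: test 3 | test 4']
```

Voice 2 gets its own response and voices 3 and 4 are still batched together, so neither expected outcome is lost. Option (b) would instead need the agent turn to accept new input while it is running, which this sketch does not attempt.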

While investigating, also noticed: _should_process_message filters voice messages against mention_patterns before transcription. Since voice has no text/caption at that point, voice messages never match a regex like "hermes". This forces users to either set free_response_chats (which lets everything through) or only accept voice via reply-to-bot. Happy to file that as a separate issue if useful.
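The ordering problem can be shown with a minimal sketch. The helper names and the example pattern below are assumptions for illustration; only `mention_patterns` and the idea of filtering before transcription come from the report.

```python
import re

# Example wake-word pattern; real deployments configure their own mention_patterns.
MENTION_PATTERNS = [re.compile(r"\bhermes\b", re.IGNORECASE)]

def matches_mention(text):
    return any(pattern.search(text) for pattern in MENTION_PATTERNS)

def filter_before_transcription(caption):
    # A voice note has no text or caption yet, so this can never match.
    return matches_mention(caption or "")

def filter_after_transcription(caption, transcribe):
    # Transcribe first (only when there is no caption), then apply the filter.
    text = caption or transcribe()
    return matches_mention(text)

print(filter_before_transcription(""))                         # False
print(filter_after_transcription("", lambda: "hermes test 2"))  # True
print(filter_after_transcription("", lambda: "test 3"))         # False
```

Filtering after transcription means every voice note in the group is transcribed, including ones that end up being ignored, so the trade-off is extra STT work.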

Thanks for the great framework.
