hermes - ✅(Solved) Fix Telegram: audio file attachments misclassified as voice messages, routed to STT pipeline [5 pull requests, 1 participants]

GitHub stats
NousResearch/hermes-agent#24870 · Fetched 2026-05-14 03:51:01
Comments: 0
Participants: 1
Timeline: 10
Reactions: 0
Timeline (top): cross-referenced ×5, labeled ×5

Fix Action

Fixed

PR fix notes

PR #24879: fix(gateway): route Telegram audio file attachments away from STT pipeline (#24870)

Description (problem / solution / changelog)

Summary

Fixes #24870 — Telegram audio file attachments were being misclassified as voice messages and auto-transcribed by the STT pipeline.

Root Cause

gateway/run.py's inbound message routing block matched both MessageType.VOICE and MessageType.AUDIO into audio_paths, which were then fed unconditionally to _enrich_message_with_transcription.

Per the Telegram Bot API, three distinct payload fields exist:

| Field | Type | Correct handling |
| --- | --- | --- |
| message.voice | Opus/OGG voice message | STT pipeline |
| message.audio | Audio file attachment (.mp3, .m4a, etc.) | Save as file, NOT STT (was broken) |
| message.document (audio mime) | Generic file | Existing document route |

Fix

  • Introduce a new audio_file_paths list populated exclusively by MessageType.AUDIO events.
  • Narrow the audio_paths selector to MessageType.VOICE (and bare audio/ MIME-type events that are not explicitly AUDIO or DOCUMENT).
  • After the STT block, inject a document-style context note for each audio file path, giving the agent the file path and asking what to do with it — consistent with how plain documents are handled.
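
The routing split described in the list above can be pictured with a small sketch. Only the list names `audio_paths` and `audio_file_paths` come from the PR; the `MessageType` enum values, the `InboundEvent` shape, and the `split_audio_media` function are hypothetical stand-ins, not the gateway's actual code.

```python
from dataclasses import dataclass, field
from enum import Enum


class MessageType(Enum):
    # Hypothetical stand-in for the gateway's enum; values are illustrative.
    TEXT = "text"
    VOICE = "voice"
    AUDIO = "audio"
    DOCUMENT = "document"


@dataclass
class InboundEvent:
    # Hypothetical event shape: cached media paths plus their MIME types.
    message_type: MessageType
    media_paths: list = field(default_factory=list)
    media_types: list = field(default_factory=list)


def split_audio_media(event):
    """Return (audio_paths, audio_file_paths) for one inbound event."""
    audio_paths = []       # candidates for speech-to-text
    audio_file_paths = []  # kept as plain file attachments
    for path, mtype in zip(event.media_paths, event.media_types):
        if event.message_type == MessageType.AUDIO:
            audio_file_paths.append(path)
        elif event.message_type == MessageType.VOICE:
            audio_paths.append(path)
        elif mtype.startswith("audio/") and event.message_type != MessageType.DOCUMENT:
            # Bare audio/ MIME type on an event that is neither AUDIO nor DOCUMENT.
            audio_paths.append(path)
    return audio_paths, audio_file_paths
```

Under these assumptions a voice note lands in `audio_paths`, an `.mp3` attachment lands in `audio_file_paths`, and an audio document lands in neither, which leaves it to the existing document route.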

Before / After

Before — sending song.mp3 via Telegram attachment:

[The user said: "[STT transcript of your mp3 here]"]

…the transcribe skill never received the file path.

After — sending song.mp3 via Telegram attachment:

[The user sent an audio file attachment: 'song.mp3'. It is saved at: /path/to/cache/song.mp3.
Ask the user what they'd like you to do with it, or pass the path to a transcription or media tool.]
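
A note in this shape could be produced by a helper along the following lines. This is a minimal sketch: the function name `build_audio_file_note` and the fallback to the cached file's basename are assumptions, and only the note wording is taken from the example above.

```python
import os


def build_audio_file_note(path, display_name=None):
    """Format a context note for one cached audio attachment."""
    # Assumption: fall back to the cached file's basename when no display name is known.
    name = display_name or os.path.basename(path)
    return (
        f"[The user sent an audio file attachment: '{name}'. "
        f"It is saved at: {path}.\n"
        "Ask the user what they'd like you to do with it, "
        "or pass the path to a transcription or media tool.]"
    )
```

Calling `build_audio_file_note("/path/to/cache/song.mp3")` reproduces the two-line note shown above.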

Testing

5 new tests in tests/gateway/test_telegram_audio_vs_voice.py:

  • test_voice_message_still_transcribed — regression guard, VOICE still goes to STT
  • test_audio_attachment_skips_stt — core fix, AUDIO never calls transcribe_audio
  • test_audio_attachment_context_note_format — verifies note content and display name
  • test_audio_attachment_skips_stt_when_stt_disabled — STT-disabled notice must not appear for file attachments
  • test_telegram_media_type_detection_audio_vs_voice — sanity: AUDIO != VOICE enum values

All 5 new tests + existing test_stt_config.py (5 tests) pass.

Changed files

  • gateway/run.py (modified, +24/-1)
  • tests/gateway/test_telegram_audio_vs_voice.py (added, +184/-0)

PR #24883: fix(telegram): keep audio attachments as files

Description (problem / solution / changelog)

Summary

  • treat Telegram message.audio as an attached file/document instead of a voice message
  • allow audio documents such as mp3/m4a/ogg/wav/flac to be cached and passed to the agent as files
  • skip automatic STT for document audio while preserving STT for Telegram voice notes

Fixes #24870

Tests

  • scripts/run_tests.sh tests/gateway/test_telegram_documents.py tests/gateway/test_stt_config.py tests/gateway/test_tts_media_routing.py
  • .venv/bin/ruff check gateway/platforms/base.py gateway/platforms/telegram.py gateway/run.py tests/gateway/test_telegram_documents.py tests/gateway/test_stt_config.py

Changed files

  • gateway/platforms/base.py (modified, +6/-0)
  • gateway/platforms/telegram.py (modified, +19/-5)
  • gateway/run.py (modified, +12/-2)
  • tests/gateway/test_stt_config.py (modified, +42/-0)
  • tests/gateway/test_telegram_documents.py (modified, +48/-0)

PR #25097: fix(gateway): route audio file attachments as files, not STT input

Description (problem / solution / changelog)

Summary

Telegram distinguishes between msg.voice (voice messages) and msg.audio (audio file attachments). The gateway was routing both types to the STT pipeline, causing:

  • Audio files (.mp3, .m4a, etc.) sent as file attachments being auto-transcribed instead of preserved as files
  • No way to bypass STT for audio file attachments
  • The transcribe skill receiving transcribed text instead of the actual audio file

Root Cause

gateway/run.py:6769 included MessageType.AUDIO in the STT routing condition alongside MessageType.VOICE:

# Before (buggy)
if mtype.startswith("audio/") or event.message_type in {MessageType.VOICE, MessageType.AUDIO}:
    audio_paths.append(path)

Fix

Changed the condition to only match MessageType.VOICE:

# After (fixed)
if mtype.startswith("audio/") and event.message_type == MessageType.VOICE:
    audio_paths.append(path)

Audio files (MessageType.AUDIO) now fall through to the media URL text placeholder ([User sent audio: /path]) and remain accessible as file attachments, while voice messages continue to be transcribed normally.
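
To see what the change from `or` to `and` does, the two quoted conditions can be evaluated side by side. The sketch below wraps them in two hypothetical functions and uses a stand-in `MessageType` enum; the conditions themselves are copied from the snippets above.

```python
from enum import Enum


class MessageType(Enum):
    # Hypothetical stand-in for the gateway's enum.
    VOICE = "voice"
    AUDIO = "audio"
    DOCUMENT = "document"


def routes_to_stt_before(mtype, message_type):
    # Condition quoted as "Before (buggy)".
    return mtype.startswith("audio/") or message_type in {MessageType.VOICE, MessageType.AUDIO}


def routes_to_stt_after(mtype, message_type):
    # Condition quoted as "After (fixed)".
    return mtype.startswith("audio/") and message_type == MessageType.VOICE


CASES = [
    ("audio/ogg", MessageType.VOICE),
    ("audio/mpeg", MessageType.AUDIO),
    ("audio/mpeg", MessageType.DOCUMENT),
    ("", MessageType.VOICE),
]

for mtype, message_type in CASES:
    before = routes_to_stt_before(mtype, message_type)
    after = routes_to_stt_after(mtype, message_type)
    print(f"{mtype!r:14} {message_type.name:9} before={before} after={after}")
```

With the condition as quoted, only a VOICE event with an `audio/` MIME type still reaches STT. A VOICE event whose MIME type does not start with `audio/` would also skip STT; the PR does not say whether that case can occur in the gateway.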

Testing

  • 6 new regression tests in tests/gateway/test_audio_voice_routing.py
  • 237 existing related tests passing (STT, telegram documents, voice commands)
  • Zero regressions

Closes #24870

Changed files

  • gateway/platforms/telegram.py (modified, +3/-1)
  • gateway/run.py (modified, +4/-1)
  • hermes_cli/model_switch.py (modified, +32/-1)
  • tests/gateway/test_audio_voice_routing.py (added, +161/-0)
  • tests/hermes_cli/test_model_switch_token_validation.py (added, +121/-0)

PR #25274: feat(telegram): skip-STT audio path + 2GB cap via local Bot API server

Description (problem / solution / changelog)

Two coordinated changes that unblock downstream audio pipelines (diarization, custom transcription, archival) on attachments larger than the public Bot API's 20MB getFile ceiling.

What's new

  • stt.enabled: false no longer drops voice/audio with a generic "transcription disabled" note. The gateway probes the cached file's duration (wave → mutagen → ffprobe ladder) and surfaces [The user sent a voice message: <abs path> (duration: M:SS)] to the agent so a skill or tool can pick up the raw file. The previous placeholder is replaced rather than appended when present.

  • platforms.telegram.extra.base_url set → adapter auto-lifts its document size cap from 20MB to 2GB (the local telegram-bot-api --local ceiling) and the "too large" reply reports the active limit dynamically. No new config knob; presence of base_url is the opt-in.

  • platforms.telegram.extra.local_mode: true wires Application.builder().local_mode(True) on the python-telegram-bot builder. PTB then reads files from disk instead of HTTP, which is required when telegram-bot-api runs in --local mode (the server returns absolute filesystem paths, not /file/bot... URLs).
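
The "replaced rather than appended" behaviour in the first item can be sketched as follows. The helper name `merge_voice_note` is hypothetical, the placeholder string is the one quoted in the later revision of this PR (#25280), and placing the note before real user text is an assumption rather than something the PR states.

```python
NO_TEXT_PLACEHOLDER = "(The user sent a message with no text content)"


def merge_voice_note(user_text, note):
    """Combine the existing message text with a voice-message note."""
    text = (user_text or "").strip()
    if not text or text == NO_TEXT_PLACEHOLDER:
        # Nothing meaningful to keep: the note replaces the placeholder.
        return note
    # Real user text is kept; putting the note first is an assumed ordering.
    return f"{note}\n\n{text}"
```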

Files

  • gateway/run.py: rewrites the stt.enabled: false branch of _enrich_message_with_transcription. New _format_duration + _probe_audio_duration helpers.
  • gateway/platforms/telegram.py: _max_doc_bytes instance attribute derived from extra.base_url; local_mode builder wiring; dynamic "too large" message.
  • tests/gateway/test_stt_config.py: covers path-surfacing with and without an existing user message, and placeholder replacement.
  • tests/gateway/test_telegram_max_doc_bytes.py: 3 cases — default 20MB without base_url, 2GB when set, empty-string base_url keeps default.
  • website/docs/user-guide/messaging/telegram.md: new "Skipping STT" subsection under Voice Messages and a full "Large Files (>20MB) via Local Bot API Server" walkthrough (api_id/api_hash, docker-compose, one-time logOut migration, platforms.telegram.extra config, the local_mode disk-access requirement, the silent HTTP-fallback 404).
  • website/docs/user-guide/features/voice-mode.md: documents the stt.enabled knob in the config reference.

Validation

  • pytest tests/gateway/test_telegram_max_doc_bytes.py tests/gateway/test_stt_config.py → 9/9 passing.
  • Verified end-to-end on a live deployment: gateway log shows Using custom Telegram base_url: http://... and Using Telegram local_mode (read files from disk) on startup; voice messages above 20MB cache to disk and surface their path to the agent.

Related Issue

Fixes #24870 #15145

Type of Change

<!-- Check the one that applies. -->
  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • With STT disabled (stt.enabled: false), the gateway forwards the file path and info about the audio to the agent
  • Setting base_url in the Telegram config allows a custom telegram-bot-api server and file sizes above 20 MB, up to 2 GB

How to Test

  1. Disable STT and set up the telegram-bot-api Docker container according to the docs
  2. Send an audio file larger than 20 MB
  3. Observe the logs

Checklist

<!-- Complete these before requesting review. -->

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Ubuntu

Documentation & Housekeeping

<!-- Check all that apply. It's OK to check "N/A" if a category doesn't apply to your change. -->
  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Changed files

  • gateway/platforms/telegram.py (modified, +19/-3)
  • gateway/run.py (modified, +71/-9)
  • tests/gateway/test_stt_config.py (modified, +28/-2)
  • tests/gateway/test_telegram_max_doc_bytes.py (added, +56/-0)
  • website/docs/user-guide/features/voice-mode.md (modified, +5/-0)
  • website/docs/user-guide/messaging/telegram.md (modified, +148/-0)

PR #25280: feat(telegram): skip-STT audio path + 2GB cap via local Bot API server

Description (problem / solution / changelog)

What & Why

Two coordinated changes that unblock downstream audio pipelines (diarization, custom transcription, archival) on Telegram attachments larger than the public Bot API's 20 MB getFile ceiling.

1. `stt.enabled: false` surfaces audio file paths to the agent

Previously a no-op note: "transcription disabled." Now the gateway still caches the voice/audio attachment, probes its duration (`wave` → `mutagen` → `ffprobe` ladder), and surfaces:

```
[The user sent a voice message: /home/<user>/.hermes/cache/audio/<hash>.ogg (duration: 12:34)]
```

…so a skill or tool can pick up the raw file. The previous `(The user sent a message with no text content)` placeholder is replaced rather than appended when present.
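
The duration formatting and the first rung of the probe ladder might look like the sketch below. The PR names its helpers `_format_duration` and `_probe_audio_duration` but does not show their bodies, so the functions here are hypothetical. Only the standard-library `wave` step is covered; the `mutagen` and `ffprobe` fallbacks are omitted, and dropping the duration suffix when no duration is known is an assumption.

```python
import wave


def format_duration(seconds):
    """Render a duration in seconds as M:SS."""
    total = max(0, int(round(seconds)))
    minutes, secs = divmod(total, 60)
    return f"{minutes}:{secs:02d}"


def probe_wav_duration(path):
    """Return the duration in seconds of a WAV file, or None if it cannot be read."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            return wav.getnframes() / rate if rate else None
    except (wave.Error, OSError, EOFError):
        # Not a WAV file or not readable: a real ladder would try the next prober here.
        return None


def voice_note(abs_path, seconds):
    """Build the bracketed note; the duration suffix is dropped when unknown."""
    suffix = f" (duration: {format_duration(seconds)})" if seconds is not None else ""
    return f"[The user sent a voice message: {abs_path}{suffix}]"
```

For example, `voice_note("/tmp/a.ogg", 754)` yields a note ending in `(duration: 12:34)`.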

2. Local Bot API server unlocks 2 GB downloads

When `platforms.telegram.extra.base_url` is set, the adapter:

  • Auto-lifts the document size cap from 20 MB → 2 GB (the `telegram-bot-api` `--local` ceiling).
  • Reports the active limit dynamically in the "too large" reply.
  • No new top-level config knob: presence of `base_url` is the opt-in.

A new `platforms.telegram.extra.local_mode: true` wires `Application.builder().local_mode(True)` on the python-telegram-bot builder. PTB then reads files from disk instead of HTTP, which is required when `telegram-bot-api` runs in `--local` mode (the server returns absolute filesystem paths, not `/file/bot...` URLs).
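
The size-cap rule above (presence of `base_url` is the opt-in) reduces to a small derivation. In this sketch the function names, the reply wording, and the choice of binary units (1 MB = 1024 × 1024 bytes) are assumptions; the PR only states the 20 MB and 2 GB figures and that an empty `base_url` keeps the default.

```python
DEFAULT_MAX_DOC_BYTES = 20 * 1024 * 1024      # public Bot API getFile ceiling
LOCAL_MAX_DOC_BYTES = 2 * 1024 * 1024 * 1024  # telegram-bot-api --local ceiling


def max_doc_bytes(extra):
    """Pick the document size cap from the adapter's `extra` config mapping."""
    base_url = (extra or {}).get("base_url")
    # A missing or empty base_url keeps the default cap.
    return LOCAL_MAX_DOC_BYTES if base_url else DEFAULT_MAX_DOC_BYTES


def too_large_reply(extra):
    """Report the active limit in the rejection message (wording is hypothetical)."""
    limit = max_doc_bytes(extra)
    if limit >= 1024 ** 3:
        return f"File too large (limit: {limit // 1024 ** 3} GB)."
    return f"File too large (limit: {limit // 1024 ** 2} MB)."
```

Deriving the cap from `base_url` keeps the configuration surface small, at the cost of coupling two settings: an operator who points `base_url` at a server that is not running in `--local` mode would presumably still see the 2 GB limit advertised.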

How to test

Path 1 — STT-skip path (no local server required)

  1. Set `stt.enabled: false` in `~/.hermes/config.yaml`.
  2. Restart the gateway.
  3. Send the bot a voice note ≤ 20 MB.
  4. Check the inbound log message contains `[The user sent a voice message: /path/to/cache/audio/<hash>.ogg (duration: M:SS)]`.

Path 2 — Local Bot API server (full pipeline)

Follow the new docs at `website/docs/user-guide/messaging/telegram.md` → Large Files (>20MB) via Local Bot API Server. Six steps cover: getting api_id/api_hash, running the docker container with `TELEGRAM_LOCAL=1`, the one-time `logOut` migration, Hermes config, the `local_mode` disk-access requirement, and a smoke test with a >20 MB voice message.

Successful startup log lines:

```
[Telegram] Using custom Telegram base_url: http://...
[Telegram] Using Telegram local_mode (read files from disk)
```

Automated

  • `scripts/run_tests.sh tests/gateway/test_telegram_max_doc_bytes.py tests/gateway/test_stt_config.py` → 9/9 passing.
  • `scripts/check-windows-footguns.py --diff main` → clean.

Test plan

  • Unit tests for both code paths (`test_telegram_max_doc_bytes.py`, `test_stt_config.py`)
  • CI-parity test runner (`scripts/run_tests.sh`) green on touched files
  • Windows-footguns check clean
  • Manual end-to-end on Linux: bot connects to local server, voice messages above 20 MB cache to disk, audio path surfaced to agent
  • Pre-existing test failures (`test_tts_media_routing.py` × 3, `test_api_server.py` etc. import errors) reproduce against `HEAD~1` — not introduced by this PR.

Platforms tested

Linux (Ubuntu 24.04). The new code uses portable APIs (`os.path.abspath`, `asyncio.create_subprocess_exec` with try/except fallback for ffprobe); no Unix-only syscalls introduced.

Security note

The new docs include a prominent warning that the local Bot API server takes the bot token in the URL path with no additional auth — operators must keep it on a private network and not expose port 8081 publicly. No change to Hermes-side security posture; the warning is purely advisory for operators running the optional local server.

Out of scope (deferred)

  • Slack's 20 MB cap and WeCom's 20 MB cap (other adapters; operator confirmed Telegram is the blocker).
  • MTProto migration (much larger blast radius; local Bot API server covers the use case).
  • Streaming-to-disk for ≥ 1 GB downloads (PTB's `download_as_bytearray` still loads the full payload into memory; worth revisiting under measured memory pressure).

Related Issue

Fixes #24870 #15145

Type of Change

<!-- Check the one that applies. -->
  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • With STT disabled (`stt.enabled: false`), the gateway forwards the file path and info about the audio to the agent
  • Setting `base_url` in the Telegram config allows a custom `telegram-bot-api` server and file sizes above 20 MB, up to 2 GB

How to Test

  1. Disable STT and set up the `telegram-bot-api` Docker container according to the docs
  2. Send an audio file larger than 20 MB
  3. Observe the logs

Checklist

<!-- Complete these before requesting review. -->

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Ubuntu

Documentation & Housekeeping

<!-- Check all that apply. It's OK to check "N/A" if a category doesn't apply to your change. -->
  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

🤖 Generated with Claude Code

Changed files

  • gateway/platforms/telegram.py (modified, +19/-3)
  • gateway/run.py (modified, +71/-9)
  • tests/gateway/test_stt_config.py (modified, +28/-2)
  • tests/gateway/test_telegram_max_doc_bytes.py (added, +56/-0)
  • website/docs/user-guide/features/voice-mode.md (modified, +5/-0)
  • website/docs/user-guide/messaging/telegram.md (modified, +148/-0)

Bug Description

On Telegram, Hermes Agent fails to distinguish between message.audio (audio file attachments) and message.voice (voice messages). Both types are routed to the STT pipeline, resulting in:

  • Audio files sent as file attachments being auto-transcribed instead of saved as files
  • The transcribe skill never receives the actual audio file, making it unusable
  • No way to bypass STT for audio file attachments

Steps to Reproduce

  1. Send an audio file via Telegram attachment (any format: .mp3, .m4a, .ogg, .wav, etc.)
  2. Alternatively: save audio to Files app, then attach via Telegram
  3. Observe that Hermes Agent treats it as a voice message and runs STT

Expected Behavior

Per Telegram API, there are three distinct message fields:

  • message.voice → voice messages (Opus/OGG), should go to STT
  • message.audio → audio files/music, should be saved as files, NOT to STT
  • message.document → generic files, need mime type check

The correct cascading logic should be:

if msg.voice:
    pass  # voice message: send to the STT pipeline
elif msg.audio:
    pass  # audio file attachment: save as a file only, do NOT run STT
elif msg.document:
    pass  # generic file: check the MIME type; if audio, save the file only

Actual Behavior

Hermes Agent routes all audio-related fields (voice, audio, and audio document) to the STT pipeline without distinguishing between them.

Environment

  • Hermes Agent version: latest
  • OS: macOS
  • Platform: Telegram
  • STT provider: local (faster-whisper)

Additional Context

The transcribe skill depends on receiving actual audio file paths to process with Whisper CLI. Since all audio is routed through STT, the skill is effectively broken for Telegram platform.
