hermes - ✅(Solved) Fix Inbound videos from all platforms (WeChat, etc.) are silently ignored — no transcription, no vision analysis [1 pull requests, 1 participants]

yyufoyy02 · 2026-05-01T03:26:18Z



GitHub stats

NousResearch/hermes-agent#18204•Fetched 2026-05-02 05:49:58


Author

yyufoyy02

Participants

yyufoyy02

Timeline (top)

labeled ×3, cross-referenced ×1

Root Cause

In `gateway/run.py` (around line 5093), the media processing loop only handles two types:

```python
if event.media_urls:
    image_paths = []
    audio_paths = []
    for i, path in enumerate(event.media_urls):
        mtype = event.media_types[i] if i < len(event.media_types) else ""
        if mtype.startswith("image/") or event.message_type == MessageType.PHOTO:
            image_paths.append(path)
        if mtype.startswith("audio/") or event.message_type in (MessageType.VOICE, MessageType.AUDIO):
            audio_paths.append(path)
    # video/mp4 → NOT handled at all ❌
```
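To see the gap in isolation, the loop can be run against stand-in types. This is a minimal sketch, not gateway code: `MessageType` and `Event` below are simplified stand-ins with assumed members and fields, `route_media_before_fix` is a name introduced for illustration, and the file name `clip.mp4` is hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class MessageType(Enum):  # stand-in for the gateway enum (assumed members)
    TEXT = auto()
    PHOTO = auto()
    VOICE = auto()
    AUDIO = auto()
    VIDEO = auto()


@dataclass
class Event:  # stand-in carrying only the fields the loop reads
    message_type: MessageType
    media_urls: list = field(default_factory=list)
    media_types: list = field(default_factory=list)


def route_media_before_fix(event):
    """Mirror of the pre-fix loop: returns (image_paths, audio_paths)."""
    image_paths, audio_paths = [], []
    for i, path in enumerate(event.media_urls):
        mtype = event.media_types[i] if i < len(event.media_types) else ""
        if mtype.startswith("image/") or event.message_type == MessageType.PHOTO:
            image_paths.append(path)
        if mtype.startswith("audio/") or event.message_type in (MessageType.VOICE, MessageType.AUDIO):
            audio_paths.append(path)
    return image_paths, audio_paths


video_event = Event(MessageType.VIDEO, ["cache/videos/clip.mp4"], ["video/mp4"])
print(route_media_before_fix(video_event))  # ([], [])
```

Both lists come back empty, so nothing downstream has any media to enrich the prompt with.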

Fix Action

Fixed

Fixed by PR: fix(gateway): process inbound video messages via ffmpeg extraction (https://github.com/NousResearch/hermes-agent/pull/18243)

PR fix notes

PR #18243: fix(gateway): process inbound video messages via ffmpeg extraction

Repository: NousResearch/hermes-agent
Author: luyao618
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/18243

Description (problem / solution / changelog)

Summary

Fixes #18204: Inbound videos from all platforms (WeChat, Telegram, etc.) were silently ignored because the media processing loop in `gateway/run.py` only handled `image/*` and `audio/*` MIME types. `video/*` was completely skipped, resulting in empty prompts sent to the LLM API and HTTP 400 errors.

Changes

`gateway/run.py`

- Media routing loop (~line 5113): Added `video_paths` collection alongside `image_paths` and `audio_paths`. Videos are detected by `video/*` MIME type or `MessageType.VIDEO`.
- `_extract_video_components()`: New async method that uses ffmpeg to extract:
  - Audio track → WAV (16 kHz mono PCM) for STT transcription
  - Up to 3 keyframes → JPEG for vision analysis (I-frame extraction first, falling back to `fps=1/10` sampling)
  - Handles missing ffmpeg, timeouts, and errors gracefully
- `_enrich_message_with_video()`: New async method that orchestrates video processing: it extracts components, delegates to the existing `_enrich_message_with_transcription` and `_enrich_message_with_vision`, and provides a fallback note when ffmpeg is unavailable. Cleans up temp files after processing.
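The two extraction steps described above map onto ffmpeg command lines roughly as follows. This is a sketch under assumptions: the PR diff is not shown here, so the exact flags, the helper names `audio_extract_cmd` and `keyframe_cmd`, and the output file pattern are illustrative choices rather than what `_extract_video_components()` actually runs.

```python
from pathlib import Path


def audio_extract_cmd(video: Path, wav_out: Path) -> list[str]:
    """ffmpeg argv: drop video, decode audio to 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y", "-i", str(video),
        "-vn",                   # ignore the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        str(wav_out),
    ]


def keyframe_cmd(video: Path, out_dir: Path, max_frames: int = 3,
                 fallback: bool = False) -> list[str]:
    """ffmpeg argv: write up to max_frames JPEGs.

    Primary mode keeps only I-frames; fallback samples one frame every 10 s.
    """
    vf = "fps=1/10" if fallback else "select='eq(pict_type,I)'"
    cmd = ["ffmpeg", "-y", "-i", str(video), "-vf", vf]
    if not fallback:
        # Emit only the selected frames instead of padding to a constant rate.
        # Newer ffmpeg builds prefer "-fps_mode vfr" over "-vsync vfr".
        cmd += ["-vsync", "vfr"]
    cmd += ["-frames:v", str(max_frames), str(out_dir / "frame_%02d.jpg")]
    return cmd


print(" ".join(audio_extract_cmd(Path("clip.mp4"), Path("audio.wav"))))
print(" ".join(keyframe_cmd(Path("clip.mp4"), Path("frames"), fallback=True)))
```

Building the argument list separately from running it keeps the commands easy to inspect and test without ffmpeg installed.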

`tests/gateway/test_video_media_processing.py` (new)

7 tests covering:

- MIME type routing (`video/mp4`)
- `MessageType.VIDEO` routing
- Mixed media routing (image + video + audio)
- Graceful handling when ffmpeg is missing
- Timeout handling
- Fallback note when extraction fails
- Audio-only enrichment path
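The missing-ffmpeg and timeout cases in this list can be exercised without ffmpeg at all. The sketch below assumes a hypothetical helper, `run_with_timeout`, that returns `None` for both failure modes; the real method's signature and return values are not shown in the PR summary. A nonexistent binary name and a sleeping Python child stand in for ffmpeg.

```python
import asyncio
import shutil
import sys


async def run_with_timeout(argv, timeout):
    """Run argv; return its exit code, or None if the binary is missing or it times out."""
    if shutil.which(argv[0]) is None:
        return None  # missing binary: the caller can fall back to a text note
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        return await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()        # do not leave the child running
        await proc.wait()  # reap it
        return None


# Stand-ins so the sketch runs anywhere: a missing binary, a slow child, a quick child.
missing = asyncio.run(run_with_timeout(["definitely-not-ffmpeg-xyz"], 1.0))
slow = asyncio.run(run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 0.5))
quick = asyncio.run(run_with_timeout([sys.executable, "-c", "pass"], 30.0))
print(missing, slow, quick)
```

Killing and then awaiting the child on timeout avoids leaving a stray process behind when a video takes too long to decode.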

Testing

- All 7 new tests pass
- All 3836 existing gateway tests pass (1 pre-existing failure in `test_teams.py` unrelated to this change)

Notes

- ffmpeg is an optional runtime dependency; videos degrade gracefully to a text note if ffmpeg is not installed
- Temp files are cleaned up in a `finally`-like pattern via `shutil.rmtree`
- No changes to platform adapters needed, since they already download videos correctly
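The degrade-to-a-note and cleanup behaviour in these notes can be sketched as follows. Everything here is an assumption for illustration: `enrich_with_video`, the injected `extract`, `transcribe`, and `describe` callables, the bracketed labels, and the wording of `FALLBACK_NOTE` are hypothetical, and the sketch uses a literal `try`/`finally` where the PR describes a "`finally`-like pattern".

```python
import asyncio
import os
import shutil
import tempfile

FALLBACK_NOTE = "[The user sent a video, but it could not be processed.]"  # hypothetical wording


async def enrich_with_video(text, video_path, extract, transcribe, describe):
    """Append transcript and frame descriptions to text, or a fallback note.

    extract(video_path, workdir) -> (wav_path or None, [jpeg_path, ...])
    transcribe(wav_path) -> str, describe(frame_paths) -> str (all async, assumed).
    """
    workdir = tempfile.mkdtemp(prefix="video_")
    try:
        wav, frames = await extract(video_path, workdir)
        parts = [text] if text else []
        if wav:
            parts.append("[Video audio] " + await transcribe(wav))
        if frames:
            parts.append("[Video frames] " + await describe(frames))
        if not wav and not frames:
            parts.append(FALLBACK_NOTE)  # never hand the model an empty prompt
        return "\n".join(parts)
    finally:
        shutil.rmtree(workdir, ignore_errors=True)  # runs on success and on error


seen = []


async def no_ffmpeg(video_path, workdir):
    seen.append(workdir)  # remember the temp dir so cleanup can be checked
    return None, []


async def unused(_):
    return ""


result = asyncio.run(enrich_with_video("", "clip.mp4", no_ffmpeg, unused, unused))
print(result)
print(os.path.exists(seen[0]))
```

Injecting the extractor makes the fallback path testable without ffmpeg, which matches the spirit of the "fallback note when extraction fails" test.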

Changed files

- `gateway/run.py` (modified, +160/-0)
- `tests/gateway/test_video_media_processing.py` (added, +165/-0)


Bug Description

When a user sends a video message through any platform (WeChat, Telegram, etc.), the gateway downloads the video file successfully but then silently ignores it in the message processing pipeline. The video is not transcribed (audio extraction) and not analyzed (vision), resulting in an empty prompt being sent to the LLM API, which returns HTTP 400.

Steps to Reproduce

1. Send a video message via WeChat (or any platform)
2. Observe gateway log: `inbound ... media=1` (video detected and downloaded)
3. The video file is cached locally (e.g., `cache/videos/`)
4. But in `gateway/run.py`, only `image/*` and `audio/*` media types are processed
5. `video/*` is completely ignored → empty prompt → HTTP 400: "The prompt parameter was not received normally"

Expected Behavior

Inbound videos should be processed similarly to how images and audio are handled:

1. Extract audio track via ffmpeg → run through STT (whisper) → append transcription to prompt
2. Extract key frames via ffmpeg → run through vision analysis → append visual description to prompt
3. This way the model receives meaningful content instead of an empty prompt

Environment

- Hermes Agent v0.12.0 (2026.4.30)
- Platform: WeChat (Weixin), but affects all platforms
- Python 3.11.15

Related

- Telegram video caching was added in commit `9fdfb09ae` (platform adapter level), but the gateway `run.py` processing is still missing
- WeChat adapter already downloads videos successfully (`_download_video` in `weixin.py`)

extent analysis

TL;DR

The issue can be fixed by modifying `gateway/run.py` to route `video/*` media types and process them the same way images and audio are processed.

Guidance

1. Modify the media processing loop in `gateway/run.py` to collect videos by adding a check for `mtype.startswith("video/")` (the PR also matches `MessageType.VIDEO`).
2. Extract the audio track from the video using ffmpeg and run it through STT (whisper) to append the transcription to the prompt.
3. Extract key frames from the video using ffmpeg and run them through vision analysis to append a visual description to the prompt.
4. Verify the fix by sending a video message, then check the gateway log for successful processing and confirm the LLM API no longer returns HTTP 400.

Example

```python
if event.media_urls:
    image_paths = []
    audio_paths = []
    video_paths = []  # new: collect video attachments
    for i, path in enumerate(event.media_urls):
        mtype = event.media_types[i] if i < len(event.media_types) else ""
        if mtype.startswith("image/") or event.message_type == MessageType.PHOTO:
            image_paths.append(path)
        if mtype.startswith("audio/") or event.message_type in (MessageType.VOICE, MessageType.AUDIO):
            audio_paths.append(path)
        # new: match by MIME type or by message type, as the PR describes
        if mtype.startswith("video/") or event.message_type == MessageType.VIDEO:
            video_paths.append(path)
    # Next step: extract audio and key frames from each path in video_paths
```
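The routing change can be checked on its own, in the spirit of the PR's mixed-media test. This is a sketch with stand-in names: `route_media` is a hypothetical free function, `MessageType` is a simplified enum, and `DOCUMENT` is an assumed neutral member used only so that the message type does not influence routing.

```python
from enum import Enum, auto


class MessageType(Enum):  # stand-in; member names follow the issue text
    DOCUMENT = auto()  # hypothetical neutral type for mixed attachments
    PHOTO = auto()
    VOICE = auto()
    AUDIO = auto()
    VIDEO = auto()


def route_media(media_urls, media_types, message_type):
    """Split paths into (image_paths, audio_paths, video_paths)."""
    image_paths, audio_paths, video_paths = [], [], []
    for i, path in enumerate(media_urls):
        mtype = media_types[i] if i < len(media_types) else ""
        if mtype.startswith("image/") or message_type == MessageType.PHOTO:
            image_paths.append(path)
        if mtype.startswith("audio/") or message_type in (MessageType.VOICE, MessageType.AUDIO):
            audio_paths.append(path)
        if mtype.startswith("video/") or message_type == MessageType.VIDEO:
            video_paths.append(path)
    return image_paths, audio_paths, video_paths


mixed = route_media(
    ["a.jpg", "b.mp4", "c.ogg"],
    ["image/jpeg", "video/mp4", "audio/ogg"],
    MessageType.DOCUMENT,
)
print(mixed)  # (['a.jpg'], ['c.ogg'], ['b.mp4'])

# A video with no MIME type is still routed via the message type.
print(route_media(["d.bin"], [], MessageType.VIDEO))  # ([], [], ['d.bin'])
```

Note that `message_type` applies to the whole event, so an event typed as `VIDEO` routes every attachment to `video_paths` regardless of MIME type.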

Notes

The example covers routing only. Extraction depends on ffmpeg, which the PR treats as an optional runtime dependency: when it is not installed, videos degrade to a text note. The STT (whisper) and vision analysis components must also be set up to handle the extracted audio and key frames.

Recommendation

Apply the fix by modifying `gateway/run.py` to handle `video/*` media types, so that the gateway processes videos and sends meaningful content to the LLM API instead of an empty prompt.
