hermes - How to fix: MEDIA: tag silently drops .md (and other) files due to regex whitelist mismatch

hermes · 2026-05-29 09:06:55

Fix Action

Fix

Align extract_media's extension whitelist with extract_local_files's supported set. Missing extensions include: md, json, xml, ya?ml, tsv, odt, rtf, bmp, tiff, svg, tar, gz, tgz, bz2, xz, ods, odp, key. (xls and ppt need no change: the existing xlsx? and pptx? alternatives already match them.)
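
One way to keep the whitelist from drifting again is to build the MEDIA: regex from a single list of extension fragments. The following is a minimal sketch under stated assumptions, not the project's code: build_media_pattern, MEDIA_EXTS and MISSING_EXTS are hypothetical names, and the path part (\S+?) is a simplified stand-in for the sub-pattern the issue elides. Only the extension names, the lookahead, and the optional quotes are taken from the issue.

```python
import re

# Extension fragments already present in the extract_media regex quoted in the issue.
MEDIA_EXTS = (
    "png", "jpe?g", "gif", "webp", "mp4", "mov", "avi", "mkv", "webm",
    "ogg", "opus", "mp3", "wav", "m4a", "flac", "epub", "pdf", "zip",
    "rar", "7z", "docx?", "xlsx?", "pptx?", "txt", "csv", "apk", "ipa",
)

# Extensions the issue reports as missing. xls and ppt are left out here
# because the existing xlsx? and pptx? fragments already match them.
MISSING_EXTS = (
    "md", "json", "xml", "ya?ml", "tsv", "odt", "rtf", "bmp", "tiff", "svg",
    "tar", "gz", "tgz", "bz2", "xz", "ods", "odp", "key",
)


def build_media_pattern(ext_fragments):
    # Hypothetical helper: compiles a MEDIA: tag regex from extension fragments.
    # The path part is a simplified stand-in (assumption); the real sub-pattern
    # is elided in the issue.
    alternation = "|".join(ext_fragments)
    return re.compile(
        r"""[`"']?MEDIA:\s*(?P<path>\S+?\.(?:"""
        + alternation
        + r"""))(?=[\s`"',;:)\]}]|$)[`"']?"""
    )


old_pattern = build_media_pattern(MEDIA_EXTS)
new_pattern = build_media_pattern(MEDIA_EXTS + MISSING_EXTS)

text = "Here is your report: MEDIA:/tmp/paid_users_up_analysis.md"
assert old_pattern.search(text) is None
assert new_pattern.search(text).group("path") == "/tmp/paid_users_up_analysis.md"
```

If extract_local_files keeps its own extension list, deriving both from the same tuple would make this kind of mismatch harder to reintroduce.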

Code Example

import re

# extract_media pattern (line 2524). The "..." abbreviates the leading part of
# the path sub-pattern; as a literal regex it matches any three characters, so
# this shortened form compiles but is not the full production pattern.
media_pattern = re.compile(
    r'''[`"']?MEDIA:\s*(?P<path>...\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s\`"',;:)\]}]|$))[`"']?'''
)

# cleanup pattern (line 3709): strips every MEDIA: tag, matched or not
cleanup = re.compile(r'MEDIA:\s*\S+')

text = 'Here is your report: MEDIA:/tmp/paid_users_up_analysis.md'

assert media_pattern.search(text) is None        # not extracted
cleaned = cleanup.sub('', text).strip()
assert '/tmp/paid_users_up_analysis.md' not in cleaned  # path gone


Introduced by PR #28350 (diagnosable MEDIA rejections + canonical cache roots + null-path guard).

Problem

extract_media uses a strict extension whitelist that does not include .md (nor .json, .yaml, .xml, .tsv, etc.), while the fallback extract_local_files does support them.

However, line 3709 unconditionally strips all MEDIA: tags from the response text with a loose regex (MEDIA:\s*\S+), including tags that extract_media failed to match.

This creates a black hole for unsupported extensions:

extract_media (strict regex) → no match for .md
Cleanup regex re.sub(r"MEDIA:\s*\S+", "", ...) → removes the path from text
extract_local_files (broad extension list) → runs on already-cleaned text, path is gone

Result: The file is neither extracted as media nor detected as a bare path. The user receives nothing.
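
The three steps above can be sketched end to end. This is a toy model, not the project's implementation: the function names mirror the issue, but their bodies, the abbreviated extension lists, and the deliver helper are assumptions made so the ordering problem can be run in isolation.

```python
import re

# Abbreviated extension lists (assumption): the strict one lacks md, the broad one has it.
STRICT_EXTS = "png|jpe?g|gif|pdf|txt|csv"
BROAD_EXTS = STRICT_EXTS + "|md|json|ya?ml|xml|tsv"

# Simplified stand-ins for the real patterns; only CLEANUP_RE is quoted from the issue.
MEDIA_RE = re.compile(r"MEDIA:\s*(?P<path>\S+?\.(?:" + STRICT_EXTS + r"))(?=\s|$)")
CLEANUP_RE = re.compile(r"MEDIA:\s*\S+")
LOCAL_FILE_RE = re.compile(r"(?P<path>/\S+?\.(?:" + BROAD_EXTS + r"))(?=\s|$)")


def extract_media(text):
    # Step 1: strict whitelist, only tags with a supported extension match.
    return [m.group("path") for m in MEDIA_RE.finditer(text)]


def extract_local_files(text):
    # Step 3: broad extension list, looks for bare absolute paths.
    return [m.group("path") for m in LOCAL_FILE_RE.finditer(text)]


def deliver(text):
    # Hypothetical driver reproducing the order described in the issue.
    media = extract_media(text)
    cleaned = CLEANUP_RE.sub("", text).strip()  # Step 2: strips every MEDIA: tag
    local = extract_local_files(cleaned)        # runs on already-cleaned text
    return media, cleaned, local


raw = "Here is your report: MEDIA:/tmp/paid_users_up_analysis.md"
media, cleaned, local = deliver(raw)

assert media == []                     # strict regex rejects .md
assert cleaned == "Here is your report:"
assert local == []                     # path was removed before the fallback ran
# The fallback would have found the file if it had seen the original text.
assert extract_local_files(raw) == ["/tmp/paid_users_up_analysis.md"]
```

The last assertion shows that the fallback itself is fine; the path is lost only because the cleanup runs between the two extractors.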
