hermes - MEDIA: tag silently drops .md (and other) files due to regex whitelist mismatch



Related

Introduced by PR #28350 (diagnosable MEDIA rejections + canonical cache roots + null-path guard).

Problem

extract_media uses a strict extension whitelist that does not include .md (nor .json, .yaml, .xml, .tsv, etc.), while the fallback extract_local_files does support them.

However, line 3709 unconditionally strips all MEDIA: tags from the response text with a loose regex (MEDIA:\s*\S+), including those that extract_media failed to match.

This creates a black hole for unsupported extensions:

  1. extract_media (strict regex) → no match for .md
  2. Cleanup regex re.sub(r"MEDIA:\s*\S+", "", ...) → removes the path from text
  3. extract_local_files (broad extension list) → runs on already-cleaned text, path is gone

Result: The file is neither extracted as media nor detected as a bare path. The user receives nothing.
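
The three steps above can be sketched end to end. This is only an illustration: the function name `deliver` and all three patterns are hypothetical, simplified stand-ins (with assumed subsets of the extension lists), not the project's actual code. Only the loose cleanup regex is taken verbatim from the description.

```python
import re

# Hypothetical, simplified stand-ins for the real patterns. Only the relative
# breadth of the two extension lists matters for this illustration.
STRICT_EXTS = r"(?:png|jpe?g|pdf|txt|csv)"         # assumed subset of the extract_media whitelist
BROAD_EXTS = r"(?:png|jpe?g|pdf|txt|csv|md|json)"  # assumed subset of what extract_local_files accepts

MEDIA_TAG = re.compile(r"MEDIA:\s*(?P<path>/\S+?\." + STRICT_EXTS + r")(?=\s|$)")
CLEANUP = re.compile(r"MEDIA:\s*\S+")              # the loose cleanup regex quoted above
BARE_PATH = re.compile(r"(?<!\S)(?P<path>/\S+?\." + BROAD_EXTS + r")(?=\s|$)")


def deliver(text: str) -> list[str]:
    """Return the file paths that survive the three steps, in order found."""
    found = [m.group("path") for m in MEDIA_TAG.finditer(text)]      # step 1: strict match
    cleaned = CLEANUP.sub("", text)                                  # step 2: strip every MEDIA: tag
    found += [m.group("path") for m in BARE_PATH.finditer(cleaned)]  # step 3: bare paths in cleaned text
    return found


print(deliver("Report: MEDIA:/tmp/report.pdf"))  # whitelisted extension is extracted
print(deliver("Report: MEDIA:/tmp/report.md"))   # tag stripped before the fallback can see it
print(deliver("Report: /tmp/report.md"))         # the same path without a tag is found
```

In this sketch the same .md path is delivered when written bare but lost when wrapped in a MEDIA: tag, which is the asymmetry described above.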

Reproduction

import re

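# extract_media pattern (line 2524)
# "..." stands for the path sub-pattern elided in this report; the unbalanced
# ")" that followed it is removed so the pattern compiles.
media_pattern = re.compile(
    r'''[`"']?MEDIA:\s*(?P<path>...\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s\`"',;:)\]}]|$))[`"']?'''
)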

# cleanup pattern (line 3709)
cleanup = re.compile(r'MEDIA:\s*\S+')

text = 'Here is your report: MEDIA:/tmp/paid_users_up_analysis.md'

assert media_pattern.search(text) is None        # not extracted
cleaned = cleanup.sub('', text).strip()
assert '/tmp/paid_users_up_analysis.md' not in cleaned  # path gone
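
To see which extensions actually fall into this gap, the extension alternation from the pattern above can be tested on its own against the extensions this issue proposes adding. The variable names are illustrative, and "ya?ml" is expanded to its two concrete forms.

```python
import re

# Extension alternation copied from the extract_media pattern quoted above.
WHITELIST = re.compile(
    r"(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf"
    r"|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)"
)

# Extensions named in this issue as missing ("ya?ml" expanded to yaml and yml).
CANDIDATES = [
    "md", "json", "xml", "yaml", "yml", "tsv", "odt", "rtf", "bmp", "tiff",
    "svg", "tar", "gz", "tgz", "bz2", "xz", "xls", "ods", "ppt", "odp", "key",
]

# fullmatch requires the whole extension to match one alternative.
already_covered = [ext for ext in CANDIDATES if WHITELIST.fullmatch(ext)]
dropped = [ext for ext in CANDIDATES if not WHITELIST.fullmatch(ext)]

print("already covered:", already_covered)
print("not in whitelist:", dropped)
```

Because `xlsx?` and `pptx?` make the trailing `x` optional, `xls` and `ppt` come out as already covered; the remaining candidates do not match the whitelist.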

Fix

Align extract_media's extension whitelist with extract_local_files's supported set. Missing extensions include: md, json, xml, ya?ml, tsv, odt, rtf, bmp, tiff, svg, tar, gz, tgz, bz2, xz, ods, odp, key. (xls and ppt are already matched by the existing xlsx? and pptx? alternatives.)
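
One way to keep the two lists from drifting apart again is to build the pattern from a single shared constant. The following is a minimal sketch under stated assumptions: `SUPPORTED_EXTS` and `MEDIA_TAG` are hypothetical names, and the path matching is deliberately simplified (absolute paths without spaces, a plain whitespace-or-end lookahead) rather than a copy of the real extract_media regex.

```python
import re

# Hypothetical shared constant: the existing whitelist plus the extensions
# listed above as missing. The real code base may organize this differently.
SUPPORTED_EXTS = (
    "png|jpe?g|gif|webp|bmp|tiff|svg"
    "|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac"
    "|epub|pdf|docx?|xlsx?|pptx?|odt|ods|odp|rtf|key"
    "|txt|csv|tsv|md|json|xml|ya?ml"
    "|zip|rar|7z|tar|gz|tgz|bz2|xz|apk|ipa"
)

# Simplified MEDIA: matcher (assumption: absolute paths containing no spaces).
MEDIA_TAG = re.compile(
    r"MEDIA:\s*(?P<path>/\S+?\.(?:" + SUPPORTED_EXTS + r"))(?=\s|$)"
)

text = "Here is your report: MEDIA:/tmp/paid_users_up_analysis.md"
match = MEDIA_TAG.search(text)
print(match.group("path"))  # /tmp/paid_users_up_analysis.md
```

If the fallback bare-path detector were built from the same constant, a new extension would only need to be added in one place.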
