hermes: [BUG] extract_media regex truncates Windows spaced paths and rejects GIS extensions (.kmz/.kml/.geojson/.gpx)



Summary

The MEDIA: tag extractor in gateway/platforms/base.py (extract_media) fails on Windows absolute paths that contain spaces (e.g. C:\Users\Foo\OneDrive\My Folder\file.pdf). The path is silently truncated at the first whitespace, so the file is never attached. Additionally, several common GIS / structured-data extensions (kmz, kml, geojson, gpx, json, xml, html) are absent from the spaced-path allowlist, so even POSIX-style spaced paths fail for those types.

Repro

On Windows (any messaging platform; Telegram makes it most obvious), have the agent emit:

MEDIA:C:\Users\Confera\OneDrive\Nusa Alam Kreasindo\Project\Foo\report.pdf

Expected: the file is delivered as an attachment. Actual: the path is truncated to C:\Users\Confera\OneDrive\Nusa, the gateway logs file not found (or silently drops the attachment), and the rest of the path (Alam Kreasindo\Project\Foo\report.pdf) leaks into the user-visible text.

Also reproducible with:

MEDIA:/home/user/My Folder/coords.kmz

Even with the existing spaced-path branch, .kmz is not in the extension allowlist, so the regex falls through to the \S+ branch and truncates at the first space.
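The two repro inputs above can be checked outside the gateway with a few lines of Python. This is a minimal sketch: the pattern is copied from this issue, while the helper name extract_path_and_leftover is hypothetical and not part of the gateway code.

```python
import re

# Pattern copied from this issue (gateway/platforms/base.py, before the patch).
CURRENT_PATTERN = re.compile(
    r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|(?:~/|/)\S+(?:[^\S\n]+\S+)*?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s`"',;:)\]}]|$)|\S+)[`"']?'''
)


def extract_path_and_leftover(text):
    """Hypothetical helper: return (captured path, text left after the match)."""
    match = CURRENT_PATTERN.search(text)
    return match.group("path"), text[match.end():]


windows_msg = r"MEDIA:C:\Users\Confera\OneDrive\Nusa Alam Kreasindo\Project\Foo\report.pdf"
posix_msg = "MEDIA:/home/user/My Folder/coords.kmz"

for msg in (windows_msg, posix_msg):
    path, leftover = extract_path_and_leftover(msg)
    # The leftover is what would leak into the user-visible text.
    print(repr(path), "| leftover:", repr(leftover))
```

In both cases the captured path stops at the first space, and the remainder of the path is left behind in the message text.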

Root cause

In gateway/platforms/base.py around line 2067, the current pattern is:

media_pattern = re.compile(
    r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|(?:~/|/)\S+(?:[^\S\n]+\S+)*?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s`"',;:)\]}]|$)|\S+)[`"']?'''
)

Two problems:

  1. The spaced-path branch is anchored to (?:~/|/), i.e. only POSIX paths starting with ~/ or /. Windows drive paths (C:\…, D:\…) and UNC paths (\\server\share\…) skip this branch and fall into the final \S+, which stops at the first whitespace.

  2. GIS/structured extensions are missing from the allowlist (kmz, kml, geojson, gpx, json, xml, htm/html). Any user delivering coordinate exports, OpenAPI specs, sitemaps, etc. from a spaced path hits the same truncation even on Linux/macOS.
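To see both problems at the branch level, the spaced-path alternative can be pulled out of the full pattern and tested on its own. This is a sketch for diagnosis only: SPACED_BRANCH, CANDIDATES and spaced_branch_accepts are illustrative names, and the example paths (including the UNC one) are made up.

```python
import re

# The spaced-path alternative of the current pattern, isolated for inspection.
SPACED_BRANCH = re.compile(
    r'''(?:~/|/)\S+(?:[^\S\n]+\S+)*?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s`"',;:)\]}]|$)'''
)

# Illustrative paths, not taken from the gateway test suite.
CANDIDATES = [
    "/home/user/My Folder/report.pdf",             # POSIX, allowlisted extension
    "/home/user/My Folder/coords.kmz",             # POSIX, extension not allowlisted
    r"C:\Users\Foo\OneDrive\My Folder\file.pdf",   # Windows drive path
    r"\\server\share\My Folder\file.pdf",          # UNC path
]


def spaced_branch_accepts(path):
    """True if the spaced-path branch alone matches the whole path."""
    return SPACED_BRANCH.fullmatch(path) is not None


for candidate in CANDIDATES:
    print(spaced_branch_accepts(candidate), candidate)
```

Only the first candidate is accepted. The other three fall through to the final \S+ alternative in the full pattern, which is where the truncation happens.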

Suggested fix

Drop the (?:~/|/) anchor (since MEDIA: is itself the start-of-token marker, and the regex is already terminated by an extension + lookahead) and extend the allowlist. Diff against current main:

-r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|(?:~/|/)\S+(?:[^\S\n]+\S+)*?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s`"',;:)\]}]|$)|\S+)[`"']?'''
+r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|[^\n]+?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa|kmz|kml|json|xml|html?|geojson|gpx)(?=[\s`"',;:)\]}]|$)|\S+)[`"']?'''

[^\n]+? (non-greedy, line-bounded) handles Windows drive paths, UNC paths, and POSIX paths uniformly. The trailing extension + lookahead (?=[\s`"',;:)\]}]|$) still terminates the match cleanly so it doesn't swallow following sentences.
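The patched pattern from the diff can be exercised the same way. This is a rough check rather than the project's test suite: extract_path is a hypothetical helper, and the UNC sample path is invented for illustration.

```python
import re

# Patched pattern copied from the "+" line of the diff in this issue.
PATCHED_PATTERN = re.compile(
    r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|[^\n]+?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa|kmz|kml|json|xml|html?|geojson|gpx)(?=[\s`"',;:)\]}]|$)|\S+)[`"']?'''
)


def extract_path(text):
    """Hypothetical helper: return the captured path, or None if no MEDIA: tag."""
    match = PATCHED_PATTERN.search(text)
    return match.group("path") if match else None


samples = [
    r"MEDIA:C:\Users\Confera\OneDrive\Nusa Alam Kreasindo\Project\Foo\report.pdf",
    "MEDIA:/home/user/My Folder/coords.kmz, sent.",
    r"MEDIA:\\server\share\My Docs\map.geojson",  # made-up UNC path
]

for sample in samples:
    print(repr(extract_path(sample)))
```

All three paths are captured whole, and the trailing ", sent." in the second sample is left out. One behavior worth checking during review: because the branch is no longer anchored to a path prefix, the lazy match runs to the first allowlisted extension on the line, so an unquoted MEDIA:foo see bar.pdf captures foo see bar.pdf.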

The same fix needs to be mirrored at the other call sites that use MEDIA:\S+ for cleanup/history/UI:

  • gateway/platforms/base.py cleanup re.sub(r"MEDIA:[^\n]+", …) (line ~2993) — already correct
  • gateway/platforms/stream_consumer.py — cleanup regex
  • gateway/run.py — history dedup (2 occurrences)
  • gateway/mcp_serve.py — MCP attachments
  • ui-tui/src/components/markdown.tsx — UI renderer
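The difference between a token-bounded and a line-bounded cleanup can be sketched as follows. The exact expressions used at the call sites listed above are not reproduced here; the two functions below are assumptions that contrast MEDIA:\S+ with the MEDIA:[^\n]+ form quoted for base.py.

```python
import re

# Illustrative agent reply containing a MEDIA: tag with a spaced Windows path.
MEDIA_LINE = r"MEDIA:C:\Users\Foo\OneDrive\My Folder\file.pdf"
reply = "Here is the report.\n" + MEDIA_LINE + "\nDone."


def strip_media_token(text):
    """Cleanup bounded by whitespace: stops at the first space in the path."""
    return re.sub(r"MEDIA:\S+", "", text)


def strip_media_line(text):
    """Cleanup bounded by the line: removes the tag and the whole path."""
    return re.sub(r"MEDIA:[^\n]+", "", text)


print(repr(strip_media_token(reply)))  # part of the path is left behind
print(repr(strip_media_line(reply)))   # the tag line is removed entirely
```

With the token-bounded form, the tail of the path (" Folder\file.pdf") survives cleanup and shows up in the message, which matches the leak described in the repro.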

Tests

tests/gateway/test_media_extraction.py passes after the patch (4/4), including a new fixture for Windows spaced paths. Happy to open a PR if maintainers want.

Related

  • #21527 (Telegram media path escaping — different root cause, same area)
  • #6249 (MEDIA path echo as text — same symptom for spaced paths on Linux)
  • #23759 (Markdown ** in MEDIA paths — same pattern of regex not handling a path character)

Environment

  • Hermes Agent main @ 271883447 (May 12 2026)
  • Windows 11 Pro, Python 3.x, Telegram gateway
  • Real-world trigger: OneDrive-rooted project paths (C:\Users\<user>\OneDrive\<Org With Spaces>\…)
