Hermes: Gateway turn can stall after Codex response while Python regex holds the GIL



Summary

A Hermes gateway turn can go silent from the user's point of view after a successful Codex Responses API call. In one captured incident, the API call completed and Hermes logged the usage/latency line, but no subsequent tool execution or final response was emitted until the gateway process was restarted.

A macOS sample taken while the process was silent showed the active CPU path inside Python's regex engine:

_sre_SRE_Pattern_search
  sre_search
    sre_ucs1_match
    sre_ucs1_count

Other Python threads in the same process were sampled waiting on take_gil, so this appears to be a CPU-bound regex path holding the GIL rather than a provider/network wait.
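
To illustrate how a single regex search can pin a CPU core, here is a minimal sketch using a textbook backtracking-prone pattern. The pattern `(a+)+$` and the helper name `time_search` are assumptions chosen for illustration; the pattern involved in the Hermes incident has not been identified. CPython's `re` module runs a search as one C call and, as far as I know, does not release the GIL during it, which would be consistent with the other threads waiting in take_gil.

```python
import re
import time

# Illustrative pattern only: nested quantifiers cause exponential backtracking
# on inputs that almost match. This is NOT the pattern from the Hermes incident.
RISKY = re.compile(r"(a+)+$")


def time_search(n: int) -> float:
    """Return seconds spent searching a near-miss input of n 'a's plus one 'b'."""
    text = "a" * n + "b"  # the trailing 'b' makes every attempt fail late
    start = time.perf_counter()
    result = RISKY.search(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the pattern can never match this input
    return elapsed


if __name__ == "__main__":
    # Work grows steeply with each extra character; keep n small when trying this.
    for n in (14, 16, 18, 20):
        print(n, f"{time_search(n):.4f}s")
```

Running the script with slightly larger values of n should show how quickly a short input can turn into seconds or minutes of uninterrupted CPU time.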

Environment

  • Hermes Agent: v0.14.0 (2026.5.16)
  • Local checkout commit: 39b8d1d31
  • Checkout status: clean working tree at time of report
  • Update state: local checkout was behind upstream main when observed
  • Python: 3.11.14
  • OpenAI SDK: 2.24.0
  • OS: macOS 26.5 (25F71), ARM64
  • Runtime: Hermes gateway
  • Platform: Discord gateway
  • Provider/API mode: OpenAI Codex / Codex Responses (gpt-5.5)
  • Streaming: Codex Responses stream path internally, normal gateway final-response delivery externally

What Happened

A Discord-triggered agent turn ran normally through several model/tool iterations:

  1. Codex Responses API call completed successfully.
  2. Hermes logged the API call as complete with normal usage/latency.
  3. Immediately after that, there was no tool execution log and no Turn ended log.
  4. The user saw no response in Discord.
  5. A process sample during the stall showed Python in _sre_SRE_Pattern_search / sre_search, with other threads waiting on the GIL.
  6. Restarting the gateway cleared the wedged process.

A later retry of the same user task hit a separate stale-provider timeout, but the first stall was different: it happened after an API call had already returned and before tool execution resumed.

Expected Behavior

Post-response parsing, tool-call normalization, safety checks, media extraction, and message cleanup should not be able to monopolize the gateway process indefinitely. If a regex path becomes pathological, Hermes should either:

  • avoid the pathological regex,
  • bound the input/pattern,
  • use a regex engine with timeout support for risky patterns,
  • or emit a diagnostic stack/timeout and fail the turn without wedging the gateway.
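
As a rough sketch of the "bound the input" option above, a small guard can refuse to run a risky pattern over oversized text and fail fast instead. The names `bounded_search`, `InputTooLarge`, and the `MAX_SCAN_CHARS` limit are hypothetical and not part of Hermes.

```python
import re
from typing import Optional

# Hypothetical limit; a real value would depend on typical response sizes.
MAX_SCAN_CHARS = 64_000


class InputTooLarge(ValueError):
    """Raised instead of scanning an oversized string with a risky pattern."""


def bounded_search(pattern: re.Pattern, text: str,
                   max_chars: int = MAX_SCAN_CHARS) -> Optional[re.Match]:
    """Search text only if it is within max_chars; otherwise fail fast."""
    if len(text) > max_chars:
        raise InputTooLarge(
            f"refusing to scan {len(text)} chars (limit {max_chars})"
        )
    return pattern.search(text)
```

A length cap mainly helps with patterns whose cost grows polynomially with input size. A pattern with exponential backtracking can stall on a few dozen characters, so those would still need to be rewritten or replaced with plain string parsing.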

Actual Behavior

The gateway process became effectively silent for the active Discord turn. The sampled CPU stack was inside Python re / _sre, and other Python threads were blocked on the GIL.

Sanitized Evidence

The relevant sample excerpt:

Thread A:
  ...
  _sre_SRE_Pattern_search
    sre_search
      sre_ucs1_match
      sre_ucs1_count

Other threads:
  ...
  take_gil

The relevant log shape:

conversation turn started
API call #N completed: model=..., provider=codex_responses, latency=...
# no subsequent tool execution
# no Turn ended
# gateway later restarted

I am intentionally omitting local paths, user identifiers, Discord IDs, exact process IDs, and private tool-command payloads.

Suspected Area

The stall appears to happen after response receipt and before tool execution, so likely candidates are post-response parsing/normalization paths such as:

  • Codex Responses output normalization
  • leaked tool-call text detection
  • tool-call argument validation/sanitization
  • tool dispatch/destructive-command checks
  • gateway post-processing regexes

This does not look like the already-known context=~0 stale timeout issue, because the first incident occurred after an API call completed and the process was then sampled inside _sre.
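
The native sample shows the C frames inside _sre but not which Python caller invoked the search. One way to narrow down the candidates above might be an on-demand Python-level stack dump using the standard library's `faulthandler.register`. This is a sketch under assumptions: the function name `install_stack_dump_signal` and the choice of SIGUSR1 are illustrative, and `faulthandler.register` is not available on Windows.

```python
import faulthandler
import signal
import sys


def install_stack_dump_signal(dump_file=None, signum=signal.SIGUSR1) -> None:
    """Dump every thread's Python traceback to dump_file when signum arrives.

    faulthandler installs a C-level signal handler, so the dump does not
    depend on the interpreter getting a chance to run Python-level
    signal-handling code first.
    """
    target = dump_file if dump_file is not None else sys.stderr
    faulthandler.register(signum, file=target, all_threads=True, chain=False)
```

With this installed at gateway startup, sending the signal to the stalled process (for example `kill -USR1 <pid>`) should print the Python frames of each thread, including the one that called into `re`.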

Request

Could Hermes add diagnostics or hardening around post-response regex paths?

Useful fixes might include:

  • logging the current post-response phase before/after risky regex operations,
  • adding a watchdog/faulthandler dump for long CPU-bound post-response processing,
  • replacing risky re patterns with bounded parsing,
  • or isolating agent turns so a regex CPU spin cannot starve the gateway.
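
The first two suggestions could be combined roughly as follows. This is a minimal sketch, not Hermes code: the context manager `post_response_phase`, the logger name, and the 30-second default are hypothetical. It logs the phase before and after, and arms `faulthandler.dump_traceback_later` so that an overrunning phase produces a traceback dump.

```python
import contextlib
import faulthandler
import logging
import sys
import time

log = logging.getLogger("post_response")  # hypothetical logger name


@contextlib.contextmanager
def post_response_phase(name: str, dump_after_s: float = 30.0, dump_file=None):
    """Log entry/exit of a phase and arm a traceback dump if it overruns."""
    target = dump_file if dump_file is not None else sys.stderr
    log.info("phase start: %s", name)
    start = time.monotonic()
    # The dump is triggered by faulthandler's own watchdog thread rather than
    # by Python code, so it is meant to fire even while this thread is stuck.
    faulthandler.dump_traceback_later(dump_after_s, file=target, exit=False)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()
        log.info("phase end: %s (%.3fs)", name, time.monotonic() - start)
```

A call site might wrap each step, for example `with post_response_phase("normalize_output"): ...`, where the phase names are placeholders. Two caveats: `dump_traceback_later` is a single process-wide timer, so nested or concurrent phases would overwrite each other, and the dump is diagnostic only. It does not abort the stuck turn.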
