hermes - ✅(Solved) Fix Stack overflow / SIGSEGV in _process_message_background due to direct recursion on pending-queue drain [4 pull requests, 1 participant]

hermes · 2026-04-30 04:39:12

GitHub stats

NousResearch/hermes-agent#17758 • Fetched 2026-05-01 05:56:05

Author

irsaliev-ai

Participants

irsaliev-ai

Timeline (top)

referenced ×5, cross-referenced ×4, labeled ×3, closed ×1

gateway/platforms/base.py::_process_message_background recursively awaits itself when there is a pending message queued during processing. Each pending follow-up adds another frame to the call stack instead of starting fresh. Under sustained pending-queue activity this exhausts the C stack and crashes the process with SIGSEGV.

In a real failure the stack reached ~2000 nested _process_message_background frames before segfaulting.

Error Message

Program terminated with signal SIGSEGV, Segmentation fault.
#0  vgetargskeywords (... format="|$OO:AttributeError" ...) at Python/getargs.c:1592

Traceback (most recent call first):
  File "/usr/lib/python3.12/pathlib.py", line 441, in __str__
  File "/usr/lib/python3.12/pathlib.py", line 448, in __fspath__
  File "/usr/lib/python3.12/pathlib.py", line 842, in stat
  File "/usr/lib/python3.12/pathlib.py", line 862, in exists
  File "/opt/hermes/gateway/pairing.py", line 101, in _load_json
  File "/opt/hermes/gateway/pairing.py", line 115, in is_approved
  File "/opt/hermes/gateway/run.py", line 3274, in _is_user_authorized
  File "/opt/hermes/gateway/run.py", line 3452, in _handle_message
  File "/opt/hermes/gateway/platforms/base.py", line 2320, in _process_message_background
  File "/opt/hermes/gateway/platforms/base.py", line ?, in _process_message_background
  File "/opt/hermes/gateway/platforms/base.py", line ?, in _process_message_background
  ... (~2000 frames) ...

Root Cause

The bug is silent — the process can absorb dozens of follow-ups before crashing, so it surfaces as an apparently random segfault under load (long bot replies that get queued behind quick user follow-ups, Telegram message-splitting, etc.).

It also defeats the systemd auto-restart safety net for users on platforms with at-least-once delivery: the queued message that triggered the crash is replayed on restart and can crash the new process the same way.

Fix Action

Fixed

Fixed by PR: fix(gateway): drain pending messages via fresh task, not recursion (#17758) (https://github.com/NousResearch/hermes-agent/pull/17772)
Fixed by PR: fix(gateway): drain pending messages via independent task, not recursive await (https://github.com/NousResearch/hermes-agent/pull/17863)
Fixed by PR: fix(gateway): drain pending messages via fresh task, not recursion (#17758) (https://github.com/NousResearch/hermes-agent/pull/17896)
Fixed by PR: test(gateway): pin cleanup invariants for #17758 in-band drain hand-off (https://github.com/NousResearch/hermes-agent/pull/17930)

PR fix notes

PR #17772: fix(gateway): drain pending messages via fresh task, not recursion (#17758)

Repository: NousResearch/hermes-agent
Author: briandevans
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/17772

Description (problem / solution / changelog)

Summary

Convert the in-band pending-drain in BasePlatformAdapter._process_message_background from a recursive await self._process_message_background(...) to asyncio.create_task(...), mirroring the existing late-arrival drain pattern in the same function.
Add a guard in the late-arrival drain path so the in-band hand-off can't race with itself across the finally boundary and spawn two concurrent tasks for the same session_key.
Add a regression test that proves the in-band drain no longer grows the call stack across chained follow-ups.

Fixes #17758.

The bug

When a user sent message A, then B arrived while A was being processed, B was queued in _pending_messages[session_key]. After A's turn finished, _process_message_background drained B by awaiting itself recursively:

# gateway/platforms/base.py (pre-fix)
await self._process_message_background(pending_event, session_key)
return  # Already cleaned up

Each chained follow-up added a frame to the call stack instead of starting fresh. Under sustained pending-queue activity, the C stack would exhaust at ~2000 nested _process_message_background frames and the process would crash with SIGSEGV — exactly what was reported in #17758.
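
A minimal sketch of why the recursive drain deepens the stack while a spawned task does not. ToyAdapter, process, and nested_depth are hypothetical stand-ins, not the gateway's real classes; the frame walk only illustrates the idea of counting nested handler frames at entry.

```python
import asyncio
import sys


def nested_depth(name: str) -> int:
    """Count frames on the caller's Python stack whose function is `name`."""
    depth = 0
    frame = sys._getframe(1)  # start at the caller's frame
    while frame is not None:
        if frame.f_code.co_name == name:
            depth += 1
        frame = frame.f_back
    return depth


class ToyAdapter:
    """Hypothetical adapter with a list of queued follow-ups."""

    def __init__(self, followups, use_task: bool):
        self.pending = list(followups)
        self.use_task = use_task
        self.depths = []    # nesting depth seen at each handler entry
        self.tasks = set()  # strong references to spawned drain tasks

    async def process(self, event):
        self.depths.append(nested_depth("process"))
        await asyncio.sleep(0)  # stand-in for real handler work
        if not self.pending:
            return
        nxt = self.pending.pop(0)
        if self.use_task:
            # Post-fix style: hand off to a fresh task and return.
            task = asyncio.create_task(self.process(nxt))
            self.tasks.add(task)
            task.add_done_callback(self.tasks.discard)
            return
        # Pre-fix style: awaiting ourselves nests one more frame.
        await self.process(nxt)


async def run(use_task: bool):
    adapter = ToyAdapter(range(5), use_task)
    await adapter.process("A")
    while adapter.tasks:  # let any spawned drain tasks finish
        await asyncio.gather(*list(adapter.tasks))
    return adapter.depths


recursive_depths = asyncio.run(run(use_task=False))
task_depths = asyncio.run(run(use_task=True))
```

With five chained follow-ups, the recursive variant should record depths growing from 1 to 6, while the task variant should stay at 1 for every entry.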

The fix

Hand off the pending event to a brand-new asyncio.create_task(...) and return, mirroring the late-arrival drain pattern that already exists ~80 lines below in finally:

drain_task = asyncio.create_task(
    self._process_message_background(pending_event, session_key)
)
self._session_tasks[session_key] = drain_task
try:
    self._background_tasks.add(drain_task)
    drain_task.add_done_callback(self._background_tasks.discard)
except TypeError:
    pass
return  # Drain task owns the session now.

That on its own would let the in-band hand-off race with the late-arrival drain in finally: during await typing_task and await self.stop_typing(...), a brand-new message C could land in _pending_messages via the busy-handler. Without a guard, finally would pop C and spawn a second concurrent task for the same session_key, clobbering the in-band drain task's ownership.

So the late-arrival block now also checks: if _session_tasks[session_key] is no longer the current task, an in-band drain already spawned a follow-up — put the late-arrival event back into _pending_messages so the existing drain task picks it up at the end of its own turn, instead of starting a competing task.

late_pending = self._pending_messages.pop(session_key, None)
if late_pending is not None:
    current_task = asyncio.current_task()
    existing_task = self._session_tasks.get(session_key)
    if existing_task is not None and existing_task is not current_task:
        # Re-queue: the in-band drain task will pick it up.
        self._pending_messages[session_key] = late_pending
    else:
        # Spawn the drain task (existing behavior, elided in the PR description).
        ...

The existing single-pending-slot semantic of _pending_messages still holds — the busy-handler at line 2363 was already overwriting the slot, and we never queue more than one pending message per session_key.
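
The ownership check described above can be exercised in isolation. This is a sketch under stated assumptions: ToySessions and drain_late_arrival are hypothetical names, and the "spawn" branch only records the event instead of creating a real drain task.

```python
import asyncio


class ToySessions:
    """Hypothetical holder for the pending slot and the task-ownership map."""

    def __init__(self):
        self.pending = {}        # session_key -> single pending event
        self.session_tasks = {}  # session_key -> task that owns the session
        self.spawned = []        # events that would get a fresh drain task

    def drain_late_arrival(self, session_key) -> str:
        late = self.pending.pop(session_key, None)
        if late is None:
            return "idle"
        owner = self.session_tasks.get(session_key)
        if owner is not None and owner is not asyncio.current_task():
            # Another task already owns the session: put the event back.
            self.pending[session_key] = late
            return "requeued"
        # Stand-in for spawning a drain task.
        self.spawned.append(late)
        return "spawned"


async def demo():
    sessions = ToySessions()
    results = []

    # Case 1: the current task still owns the session, so it may spawn.
    sessions.session_tasks["k"] = asyncio.current_task()
    sessions.pending["k"] = "C"
    results.append(sessions.drain_late_arrival("k"))

    # Case 2: ownership was handed to another task, so the event is re-queued.
    other = asyncio.create_task(asyncio.sleep(0))
    sessions.session_tasks["k"] = other
    sessions.pending["k"] = "D"
    results.append(sessions.drain_late_arrival("k"))

    await other
    return results, sessions.pending, sessions.spawned


results, pending, spawned = asyncio.run(demo())
```

Comparing against asyncio.current_task() is what lets one code path serve both cases without an extra flag.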

Test plan

New regression: tests/gateway/test_pending_drain_no_recursion.py::test_in_band_drain_does_not_grow_stack chains 12 follow-ups and asserts the maximum nested _process_message_background frame count at handler entry stays ≤ 2. Confirmed failing on pre-fix code with depths=[1,2,3,4,5,6,7,8,9,10,11,12]; passing post-fix with all depths = 1.
tests/gateway/test_pending_drain_race.py (3 tests) — existing race regression suite still green.
tests/gateway/test_cancel_background_drain.py (3 tests) — drain-on-shutdown still green.
tests/gateway/test_pending_event_none.py (6 tests) — control-message handling still green.
tests/gateway/test_duplicate_reply_suppression.py (22 tests) — including the one test that called _process_message_background directly and implicitly relied on the old recursive await semantic; updated to await the spawned drain task before asserting on adapter.sent.
Broader gateway suite (tests/gateway/, ~4000 tests) — only pre-existing baselines failing on this checkout (DingTalk needs [dingtalk] extra, Matrix needs matrix-nio, WhatsApp needs the bundled bridge); none touch the changed code path. DingTalk verified 62/62 passing under the CI install (--with 'alibabacloud-dingtalk>=2.0.0' --with 'dingtalk-stream>=0.20,<1').
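
The test update mentioned for test_duplicate_reply_suppression.py follows from the new semantics: once the drain runs in its own task, a caller that awaits the handler directly sees only the first message's output. A hedged sketch of the waiting pattern, where ToyAdapter and wait_for_background are hypothetical helpers rather than the suite's actual fixtures:

```python
import asyncio


class ToyAdapter:
    """Hypothetical adapter that records sent replies."""

    def __init__(self):
        self.sent = []
        self.background_tasks = set()

    async def process(self, text, followup=None):
        await asyncio.sleep(0)  # stand-in for handler work
        self.sent.append(text)
        if followup is not None:
            # Drain the follow-up in a fresh task instead of awaiting it.
            task = asyncio.create_task(self.process(followup))
            self.background_tasks.add(task)
            task.add_done_callback(self.background_tasks.discard)


async def wait_for_background(tasks: set) -> None:
    """Wait until no tracked task is left, including tasks spawned meanwhile."""
    while tasks:
        await asyncio.gather(*list(tasks), return_exceptions=True)


async def demo():
    adapter = ToyAdapter()
    await adapter.process("A", followup="B")
    before = list(adapter.sent)  # the follow-up has not run yet
    await wait_for_background(adapter.background_tasks)
    return before, list(adapter.sent)


before, after = asyncio.run(demo())
```

Asserting on the sent list only after wait_for_background returns avoids depending on when the event loop schedules the drain task.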

Related

Fixes #17758 ("Stack overflow / SIGSEGV in _process_message_background due to direct recursion on pending-queue drain")
The late-arrival drain block (#12471, #12371) already used the create_task pattern; this PR makes the in-band drain match it and adds the cross-block ownership guard.

Changed files

gateway/platforms/base.py (modified, +70/-23)
tests/gateway/test_duplicate_reply_suppression.py (modified, +9/-0)
tests/gateway/test_pending_drain_no_recursion.py (added, +188/-0)

PR #17863: fix(gateway): drain pending messages via independent task, not recursive await

Repository: NousResearch/hermes-agent
Author: vominh1919
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/17863

Description (problem / solution / changelog)

Summary

_process_message_background() in gateway/platforms/base.py recursively awaits itself when a pending message is queued during processing. Each follow-up message adds another frame to the call stack. Under sustained pending-queue activity, the stack grows to ~2000 nested frames and the process crashes with SIGSEGV.

Root Cause

At line 2674, after processing message A, if message B was queued during A's processing:

# Recursive — each pending message adds a stack frame
await self._process_message_background(pending_event, session_key)
return

If message C arrives during B's processing, B recursively calls itself for C, and so on. No bound on recursion depth.

Fix

Replace the direct await with asyncio.create_task(), which schedules the drain as an independent task on the event loop with zero stack growth. This matches the existing pattern already used for late-arrival pending messages at line ~2745:

drain_task = asyncio.create_task(
    self._process_message_background(pending_event, session_key)
)
self._session_tasks[session_key] = drain_task
try:
    self._background_tasks.add(drain_task)
    drain_task.add_done_callback(self._background_tasks.discard)
except TypeError:
    pass
return

Why This Is Safe

The _active_sessions guard is preserved — the interrupt event is cleared but the entry stays live, preventing concurrent agents on the same session key
Task tracking (_session_tasks, _background_tasks) ensures the drain task participates in graceful shutdown via cancel_background_tasks()
The try/except TypeError mirrors the defensive pattern at line ~2753 for test stubs
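
The task-tracking point can be illustrated with a small sketch. spawn_tracked and cancel_tracked are hypothetical helpers, not the gateway's cancel_background_tasks(); they only show the general pattern of keeping strong references to spawned tasks and cancelling them at shutdown.

```python
import asyncio

# Strong references: the event loop itself keeps only weak references to tasks.
background_tasks: set = set()


def spawn_tracked(coro) -> asyncio.Task:
    """Create a task and track it until it finishes."""
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return task


async def cancel_tracked() -> list:
    """Cancel every tracked task and wait for the cancellations to land."""
    tasks = list(background_tasks)
    for task in tasks:
        task.cancel()
    return await asyncio.gather(*tasks, return_exceptions=True)


async def demo():
    for _ in range(3):
        spawn_tracked(asyncio.sleep(3600))  # stand-in for long-running drains
    await asyncio.sleep(0)  # let the tasks start
    results = await cancel_tracked()
    return results, len(background_tasks)


results, remaining = asyncio.run(demo())
```

A task that is spawned but never added to such a set would be invisible to a shutdown routine like this, which is why the drain task is registered before the handler returns.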

Fixes #17758

Changed files

gateway/platforms/base.py (modified, +13/-2)
gateway/run.py (modified, +17/-5)

PR #17896: fix(gateway): drain pending messages via fresh task, not recursion (#17758)

Repository: NousResearch/hermes-agent
Author: teknium1
State: closed | merged: True
Link: https://github.com/NousResearch/hermes-agent/pull/17896

Description (problem / solution / changelog)

Salvages #17772 by @briandevans onto current main. Closes #17758. Also supersedes @vominh1919's #17863 (same core fix, submitted 4h later — both contributors credited).

Problem

When the gateway drains pending follow-up messages, _process_message_background used to recursively await itself. Every chained follow-up added one frame to the call stack. Under sustained pending-queue activity the C stack exhausted at ~2000 nested frames and the process crashed with SIGSEGV. Real-world crash reported on Hermes v0.11.0, Python 3.12 native install.

Fix (author: @briandevans, 2 commits)

Commit 1: fix(gateway): drain pending messages via fresh task, not recursion
Replace await self._process_message_background(pending_event, session_key) with asyncio.create_task(...) that owns the session guard through _session_tasks and _background_tasks, mirroring the existing late-arrival drain pattern. Stack stays at depth 1 regardless of chain length.

Commit 2: fix(gateway): preserve session guard across in-band drain handoff
Without this, the in-band hand-off could race with the late-arrival drain in finally: during the typing-task cleanup, a new message C landing in _pending_messages would spawn a second concurrent _process_message_background task for the same session_key. The late-arrival block now checks whether ownership has already been transferred to an in-band drain task and re-queues its event instead of spawning a duplicate.

Merge conflict resolution

PR branch was stale against aa7bf329b (gateway typing-task helper refactor). Trivially resolved: kept main's await _stop_typing_task() helper call and layered @briandevans' fresh-task drain logic on top.

Validation

scripts/run_tests.sh tests/gateway/test_duplicate_reply_suppression.py tests/gateway/test_pending_drain_no_recursion.py
24 passed in 0.96s

scripts/run_tests.sh tests/gateway/
4213 passed, 7 skipped in 86s

New regression test tests/gateway/test_pending_drain_no_recursion.py asserts the invariant directly by counting nested _process_message_background frames at handler entry across a chain of N follow-ups — recursion makes depth grow linearly, task spawning keeps it constant at 1.

Before / after

Scenario	Before	After
Drain 10 chained follow-ups	stack depth = 10 frames	stack depth = 1 frame (always)
Drain ~2000 chained follow-ups	SIGSEGV	stack depth = 1 frame
Two drain paths racing	2 concurrent agents on same session_key	late arrival re-queued, single drain task processes it

Authorship preserved for @briandevans via plain cherry-pick. Thanks also to @vominh1919 who independently identified and fixed the same issue in #17863.

Changed files

gateway/platforms/base.py (modified, +70/-23)
tests/gateway/test_duplicate_reply_suppression.py (modified, +9/-0)
tests/gateway/test_pending_drain_no_recursion.py (added, +188/-0)

PR #17930: test(gateway): pin cleanup invariants for #17758 in-band drain hand-off

Repository: NousResearch/hermes-agent
Author: teknium1
State: closed | merged: True
Link: https://github.com/NousResearch/hermes-agent/pull/17930

Description (problem / solution / changelog)

Follow-up to #17758. @briandevans' fix (commits 663ba9a58 + f44f1f961) landed on main earlier today — this PR is pure test coverage on top, no production code changes.

Why

During review of the #17758 fix I had three specific concerns that the fix reasoned about in commit messages but didn't pin with tests. Adding them so future refactors can't silently regress the invariants.

What it adds

Three new async tests in tests/gateway/test_pending_drain_no_recursion.py:

1. `test_normal_path_releases_session_guard`

The #17758 fix moved _release_session_guard(...) under if current_task is self._session_tasks.get(session_key). For the 99%-common case (one message, nothing queued) current_task IS the stored task, so the guard must still fire. Without this test, a future tightening of the conditional could leave sessions permanently pinned busy after normal messages.

2. `test_drain_task_cancellation_releases_session`

If the drain task spawned by the in-band hand-off is cancelled mid-handler (e.g. /stop fired while draining a follow-up), the drain task's own finally must fire _release_session_guard. Without this, a cancel mid-drain would leave _active_sessions[sk] populated forever — the session stays stuck busy.

3. `test_late_arrival_drain_still_fires_when_no_in_band_drain`

The #17758 follow-up commit added a re-queue branch to the late-arrival drain block that only fires when ownership was already handed off to another task. For the common case (late-arrival with no prior in-band drain), the else branch must still spawn a fresh drain task — otherwise a message that arrives during stop_typing gets silently dropped.
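
The cancellation invariant in the second test comes down to releasing the guard in a finally block. A minimal sketch, assuming a plain dict as the session guard; active_sessions and handle are hypothetical stand-ins for _active_sessions and the real handler.

```python
import asyncio

active_sessions = {}  # hypothetical guard: session_key -> busy marker


async def handle(session_key: str, started: asyncio.Event) -> None:
    active_sessions[session_key] = True
    try:
        started.set()
        await asyncio.sleep(3600)  # stand-in for a long-running drain
    finally:
        # Runs on normal return and on cancellation alike.
        active_sessions.pop(session_key, None)


async def demo():
    started = asyncio.Event()
    task = asyncio.create_task(handle("sk", started))
    await started.wait()
    held_while_running = "sk" in active_sessions
    task.cancel()  # e.g. a stop command arriving mid-drain
    try:
        await task
    except asyncio.CancelledError:
        pass
    return held_while_running, "sk" in active_sessions


held_while_running, held_after_cancel = asyncio.run(demo())
```

If the release lived after the await instead of in finally, the cancelled task would leave the session marked busy.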

Validation

scripts/run_tests.sh tests/gateway/test_pending_drain_no_recursion.py
5 passed in 4.95s

All 5 tests (2 existing + 3 new) pass against current main.

Scope

1 file changed: tests/gateway/test_pending_drain_no_recursion.py (+163, -0)
No production code modified
No behavior change
Pure regression guards

Changed files

tests/gateway/test_pending_drain_no_recursion.py (modified, +163/-0)

Code Example

# gateway/platforms/base.py:2542-2544
# Process pending message in new background task
await self._process_message_background(pending_event, session_key)
return  # Already cleaned up

---

Program terminated with signal SIGSEGV, Segmentation fault.
#0  vgetargskeywords (... format="|$OO:AttributeError" ...) at Python/getargs.c:1592

Traceback (most recent call first):
  File "/usr/lib/python3.12/pathlib.py", line 441, in __str__
  File "/usr/lib/python3.12/pathlib.py", line 448, in __fspath__
  File "/usr/lib/python3.12/pathlib.py", line 842, in stat
  File "/usr/lib/python3.12/pathlib.py", line 862, in exists
  File "/opt/hermes/gateway/pairing.py", line 101, in _load_json
  File "/opt/hermes/gateway/pairing.py", line 115, in is_approved
  File "/opt/hermes/gateway/run.py", line 3274, in _is_user_authorized
  File "/opt/hermes/gateway/run.py", line 3452, in _handle_message
  File "/opt/hermes/gateway/platforms/base.py", line 2320, in _process_message_background
  File "/opt/hermes/gateway/platforms/base.py", line ?, in _process_message_background
  File "/opt/hermes/gateway/platforms/base.py", line ?, in _process_message_background
  ... (~2000 frames) ...

---

# gateway/platforms/base.py:2542-2544
# Process pending message in new background task
asyncio.create_task(
    self._process_message_background(pending_event, session_key)
)
return

Stack overflow / SIGSEGV in `_process_message_background` due to direct recursion on pending-queue drain

Summary

In a real failure the stack reached ~2000 nested _process_message_background frames before segfaulting.

Environment

Hermes Agent v0.11.0 (2026.4.23)
Python 3.12.3
OpenAI SDK 2.33.0
Ubuntu 24.04, kernel 6.17.0-20-generic
Native install (systemd service, not container)

Reproduction (logical)

User sends message A → handle_message (line 2025) creates background task running _process_message_background(A, key).
While A is still being processed, user sends message B. The busy-handler path stores B in self._pending_messages[key] and sets the interrupt event.

After A's handler returns, the post-processing block at lines 2521–2544 detects the pending entry and awaits the same coroutine recursively:

# gateway/platforms/base.py:2542-2544
# Process pending message in new background task
await self._process_message_background(pending_event, session_key)
return  # Already cleaned up

If during the processing of B another follow-up C arrives and is queued, the same branch fires again, adding another stack frame. With N queued follow-ups in a row, depth grows linearly to N.

The comment on line 2542 says "in new background task" but the code is a direct await, not asyncio.create_task(...).
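
The busy-handler step in the reproduction can be sketched as follows. ToyGateway and its attributes are hypothetical; the sketch assumes a single pending slot per session that newer follow-ups overwrite, plus an interrupt event, as the report describes.

```python
import asyncio


class ToyGateway:
    """Hypothetical front door that queues follow-ups for busy sessions."""

    def __init__(self):
        self.active = {}   # session_key -> interrupt event for the running turn
        self.pending = {}  # session_key -> latest queued follow-up (one slot)
        self.started = []  # events that started a processing turn

    def handle_message(self, session_key: str, event: str) -> str:
        interrupt = self.active.get(session_key)
        if interrupt is not None:
            # Session is busy: overwrite the single slot and signal the turn.
            self.pending[session_key] = event
            interrupt.set()
            return "queued"
        self.active[session_key] = asyncio.Event()
        self.started.append(event)
        return "started"


async def demo():
    gateway = ToyGateway()
    outcomes = [gateway.handle_message("k", e) for e in ("A", "B", "C")]
    return outcomes, gateway.pending, gateway.active["k"].is_set()


outcomes, pending, interrupted = asyncio.run(demo())
```

Because the slot holds one event, the recursion depth in the pre-fix code depends on how many turns in a row find the slot occupied when they finish, not on how many messages arrive during a single turn.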

Observed crash

Program terminated with signal SIGSEGV, Segmentation fault.
#0  vgetargskeywords (... format="|$OO:AttributeError" ...) at Python/getargs.c:1592

Traceback (most recent call first):
  File "/usr/lib/python3.12/pathlib.py", line 441, in __str__
  File "/usr/lib/python3.12/pathlib.py", line 448, in __fspath__
  File "/usr/lib/python3.12/pathlib.py", line 842, in stat
  File "/usr/lib/python3.12/pathlib.py", line 862, in exists
  File "/opt/hermes/gateway/pairing.py", line 101, in _load_json
  File "/opt/hermes/gateway/pairing.py", line 115, in is_approved
  File "/opt/hermes/gateway/run.py", line 3274, in _is_user_authorized
  File "/opt/hermes/gateway/run.py", line 3452, in _handle_message
  File "/opt/hermes/gateway/platforms/base.py", line 2320, in _process_message_background
  File "/opt/hermes/gateway/platforms/base.py", line ?, in _process_message_background
  File "/opt/hermes/gateway/platforms/base.py", line ?, in _process_message_background
  ... (~2000 frames) ...

The terminal pathlib.__str__ frame is irrelevant — it's just the unlucky call that ran out of stack first. The cause is the depth, not pathlib.

hermes[PID]: segfault at 7ffd27c69fd8 ip ...sp 7ffd27c69fe0 error 6 — SP and fault address differ by 8 bytes, classic stack overflow signature.
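
The 8-byte gap can be checked directly from the two addresses in the log line. A small sketch; reading the gap as a write just below an exhausted stack follows the report's interpretation and is not an independent diagnosis.

```python
# Values copied from the kernel log line in the report.
fault_address = 0x7FFD27C69FD8
stack_pointer = 0x7FFD27C69FE0

gap = stack_pointer - fault_address          # distance in bytes
faults_below_sp = fault_address < stack_pointer
```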

Suggested fix

Replace the await with a fresh task and return:

# gateway/platforms/base.py:2542-2544
# Process pending message in new background task
asyncio.create_task(
    self._process_message_background(pending_event, session_key)
)
return

The surrounding cleanup (typing_task cancel, _active.clear()) has already happened, so spawning is safe. This removes the unbounded recursion entirely — the new task starts at depth 1.

The concurrency comment at lines 2525–2533 explains why _active_sessions[session_key] is cleared rather than deleted; that invariant is unaffected by switching from await to create_task. The Level-1 guard still treats follow-ups as busy.

Why this matters

Artifacts

I have the full core dump, full py-bt output, and gdb bt 2000 output if useful. No proprietary data is included in this report — only function/file names from the open-source gateway/ module.

extent analysis

TL;DR

Replace the direct await with asyncio.create_task to prevent unbounded recursion in _process_message_background.

Guidance

Identify the recursive call to _process_message_background and replace it with asyncio.create_task to start a new task instead of awaiting the coroutine directly.
Verify that the change fixes the issue by testing with a high volume of pending messages and checking for SIGSEGV crashes.
Review the surrounding code to ensure that the cleanup and concurrency invariants are maintained after switching to create_task.
Consider adding logging or monitoring to detect and alert on potential stack overflow issues in the future.
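
For the monitoring suggestion, one possible approach is a cheap depth check at handler entry plus the standard-library faulthandler module, which can dump Python tracebacks when the process receives a fatal signal. This is a sketch under stated assumptions: stack_depth, warn_if_deep, install_crash_dump, and the default limit of 200 are hypothetical choices, not part of the gateway.

```python
import faulthandler
import logging
import sys

logger = logging.getLogger("stack_watch")  # hypothetical logger name


def stack_depth() -> int:
    """Number of Python frames on the caller's stack."""
    depth = 0
    frame = sys._getframe(1)
    while frame is not None:
        depth += 1
        frame = frame.f_back
    return depth


def warn_if_deep(limit: int = 200) -> bool:
    """Log a warning when the stack is deeper than `limit` frames."""
    depth = stack_depth()
    if depth > limit:
        logger.warning("stack depth %d exceeds limit %d", depth, limit)
        return True
    return False


def install_crash_dump() -> None:
    """Ask faulthandler to write tracebacks to stderr on fatal signals."""
    faulthandler.enable(file=sys.stderr, all_threads=True)


def nest(levels: int, limit: int) -> bool:
    """Recurse `levels` times, then run the depth check (for demonstration)."""
    if levels == 0:
        return warn_if_deep(limit)
    return nest(levels - 1, limit)


shallow_flagged = warn_if_deep(limit=10_000)
deep_flagged = nest(100, limit=50)
```

A check like this would have flagged the growing chain well before it reached the depth reported in the crash, at the cost of a frame walk per handler entry.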

Example

# gateway/platforms/base.py:2542-2544
# Process pending message in new background task
asyncio.create_task(
    self._process_message_background(pending_event, session_key)
)
return

Notes

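Per PR #17772 and the merged #17896, the bare create_task swap is not sufficient on its own: the in-band hand-off can race with the late-arrival drain in finally and spawn two concurrent tasks for the same session_key. The merged fix therefore also registers the drain task in _session_tasks and _background_tasks and re-queues late arrivals when ownership has already been handed off. Additional testing and review may still be necessary to ensure that the fix does not introduce new issues.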

Recommendation

Apply the suggested fix by replacing the await with asyncio.create_task to prevent the unbounded recursion and potential SIGSEGV crashes. This change should fix the issue and prevent future crashes under load.
