hermes: Gateway restart can lose long-running sessions during shutdown drain [2 pull requests]

Root Cause

The restart/shutdown flow writes the durable resume_pending marker only after _drain_active_agents(timeout) returns timed_out=True. If the service manager kills the process while it is still inside the drain wait, that branch never runs. A session whose updated_at is older than the startup suspend_recently_active() fallback window is then left with no durable marker for the next boot to act on.

Fix Action

Fixed

Bug Description

Long-running gateway sessions can be lost across a service restart if the process is killed while it is still waiting for active agents to drain.

The current restart/shutdown flow only writes the durable resume_pending marker after _drain_active_agents(timeout) returns timed_out=True. Service managers such as systemd can terminate the process while it is still inside that drain wait, before the timeout branch runs.
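The ordering problem can be reproduced in miniature. The following is a minimal sketch, not the gateway's code: ToyStore, drain_active_agents, vulnerable_shutdown and both simulate_* helpers are hypothetical names, and cancelling the asyncio task merely stands in for the service manager killing the process.

```python
import asyncio


class ToyStore:
    """Hypothetical in-memory stand-in for the durable session store."""

    def __init__(self):
        self.resume_pending = set()

    def mark_resume_pending(self, session_id):
        self.resume_pending.add(session_id)


async def drain_active_agents(timeout):
    """Toy drain: waits on an agent that never finishes; True means timed out."""
    try:
        await asyncio.wait_for(asyncio.sleep(3600), timeout)
        return False
    except asyncio.TimeoutError:
        return True


async def vulnerable_shutdown(store, active_ids, timeout):
    """Marks sessions only AFTER the drain wait reports a timeout."""
    timed_out = await drain_active_agents(timeout)
    if timed_out:
        for session_id in active_ids:
            store.mark_resume_pending(session_id)


async def simulate_kill_during_drain():
    """The process dies inside the drain wait, before the timeout branch."""
    store = ToyStore()
    task = asyncio.create_task(vulnerable_shutdown(store, ["chat-1"], timeout=60))
    await asyncio.sleep(0.01)  # the shutdown is now blocked in the drain wait
    task.cancel()              # assumption: models the service manager's kill
    try:
        await task
    except asyncio.CancelledError:
        pass
    return store.resume_pending


async def simulate_timeout():
    """The drain is allowed to time out, so the marker is written."""
    store = ToyStore()
    await vulnerable_shutdown(store, ["chat-1"], timeout=0.01)
    return store.resume_pending


if __name__ == "__main__":
    print("killed during drain:", asyncio.run(simulate_kill_during_drain()))
    print("drain timed out:", asyncio.run(simulate_timeout()))
```

In the killed case the marker set stays empty, which is the state the next boot inherits.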

If the session's updated_at is older than the startup suspend_recently_active() fallback window, the next gateway boot has no durable marker to resume or suspend the in-flight session cleanly.
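To make the freshness gap concrete, here is a rough sketch of a startup check that combines the durable marker with a recency window. Everything in it is an assumption for illustration: the real suspend_recently_active() signature and window length are not given in this report, so the function name, the SessionEntry fields other than updated_at, and the two-minute window are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SessionEntry:
    """Hypothetical shape; only updated_at and resume_pending are named in the report."""
    session_id: str
    updated_at: datetime
    resume_pending: bool = False


def sessions_recovered_at_startup(entries, now, window):
    """Return the ids a boot-time check would treat as interrupted.

    A session is picked up if it carries the durable marker or if its
    updated_at falls inside the fallback window.
    """
    recovered = []
    for entry in entries:
        recently_active = now - entry.updated_at <= window
        if entry.resume_pending or recently_active:
            recovered.append(entry.session_id)
    return recovered


NOW = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
WINDOW = timedelta(minutes=2)  # assumption: the real window is not stated
ENTRIES = [
    # Short turn: updated 30 seconds before the restart, caught by the window.
    SessionEntry("short-turn", NOW - timedelta(seconds=30)),
    # Long turn killed mid-drain: stale updated_at and no marker, so it is missed.
    SessionEntry("long-turn-unmarked", NOW - timedelta(minutes=20)),
    # Long turn that did get its marker: recovered despite the stale updated_at.
    SessionEntry("long-turn-marked", NOW - timedelta(minutes=20), resume_pending=True),
]

recovered = sessions_recovered_at_startup(ENTRIES, NOW, WINDOW)
print(recovered)
```

The unmarked long turn is the only entry that falls through both checks.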

Steps to Reproduce

  1. Start a gateway session from a messaging platform.
  2. Run a task long enough that SessionEntry.updated_at falls outside the startup fallback window while the turn is still in progress.
  3. Restart the gateway via the service manager while the task is still active.
  4. Let the service manager terminate the old process while it is still inside the drain wait.
  5. Start the gateway again and send a message in the same chat/thread/topic.

Expected Behavior

The active session is durably marked before the vulnerable drain wait begins, so the next process can recover the interrupted session state instead of treating it as a normal idle session.

Actual Behavior

The durable resume_pending marker may never be written if the old process is killed during the drain wait. Long-running sessions outside the startup freshness window can then appear stopped or reset after restart.

Notes

This is in the same failure-mode family as the existing restart resume work, but the race happens earlier than the post-timeout mark_resume_pending() path: the process can die before that branch gets control.

A narrow fix is to pre-mark currently running sessions before awaiting drain, then clear only those early markers if the drain completes gracefully.
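The pre-mark approach could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged change: MarkerStore, clear_resume_pending, shutdown_with_premark and the toy drain functions are hypothetical names, the store is in-memory, and a real implementation would need the marker flushed to durable storage before the await.

```python
import asyncio


class MarkerStore:
    """Hypothetical in-memory stand-in for durable resume_pending markers."""

    def __init__(self, existing=()):
        self.resume_pending = set(existing)

    def mark_resume_pending(self, session_id):
        self.resume_pending.add(session_id)

    def clear_resume_pending(self, session_id):
        self.resume_pending.discard(session_id)


async def shutdown_with_premark(store, active_ids, drain, timeout):
    """Mark running sessions before the drain wait, then undo only those marks."""
    # Remember which markers this shutdown adds, so older ones are left alone.
    early = [sid for sid in active_ids if sid not in store.resume_pending]
    for sid in early:
        store.mark_resume_pending(sid)

    timed_out = await drain(timeout)  # the process may be killed while waiting here

    if not timed_out:
        # Graceful drain: the turns finished, so the early markers are stale.
        for sid in early:
            store.clear_resume_pending(sid)
    return timed_out


async def graceful_drain(timeout):
    return False  # all agents finished in time


async def timed_out_drain(timeout):
    return True  # agents were still running at the deadline


async def hanging_drain(timeout):
    await asyncio.sleep(3600)  # never returns within the test; used for the kill case
    return False


async def run_case(drain, kill=False):
    store = MarkerStore(existing={"older-marker"})
    coro = shutdown_with_premark(store, ["older-marker", "chat-1"], drain, 60)
    task = asyncio.create_task(coro)
    if kill:
        await asyncio.sleep(0.01)
        task.cancel()  # assumption: models the service manager's kill
    try:
        await task
    except asyncio.CancelledError:
        pass
    return store.resume_pending


if __name__ == "__main__":
    print("graceful:", asyncio.run(run_case(graceful_drain)))
    print("timed out:", asyncio.run(run_case(timed_out_drain)))
    print("killed:", asyncio.run(run_case(hanging_drain, kill=True)))
```

Tracking the early list separately is what keeps a graceful drain from erasing a marker that an earlier restart left behind.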
