hermes - 💡(How to fix) Fix Gateway restart can lose long-running sessions during shutdown drain [2 pull requests]

hermes, 2026-05-18 08:05:49

Root Cause

The restart/shutdown flow writes the durable resume_pending marker only after _drain_active_agents(timeout) returns timed_out=True. If the service manager terminates the process while it is still inside the drain wait, that branch never runs and no marker is persisted for the in-flight session.

Fix Action

Fixed

Fixed by PR: fix(gateway): premark active sessions before drain (https://github.com/NousResearch/hermes-agent/pull/27831)
Fixed by PR: fix(gateway): pre-mark sessions as resume_pending before drain to pre… (https://github.com/NousResearch/hermes-agent/pull/28217)

Bug Description

Long-running gateway sessions can be lost across a service restart if the process is killed while it is still waiting for active agents to drain.

The current restart/shutdown flow only writes the durable resume_pending marker after _drain_active_agents(timeout) returns timed_out=True. Service managers such as systemd can terminate the process while it is still inside that drain wait, before the timeout branch runs.
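
The ordering problem can be illustrated with a small simulation. This is only a sketch under stated assumptions: FakeStore, drain, and shutdown_mark_after_timeout are hypothetical stand-ins rather than the gateway's real code, and cancelling the task stands in for the service manager killing the process.

```python
import asyncio


class FakeStore:
    """Hypothetical stand-in for the durable session store."""

    def __init__(self):
        self.resume_pending = set()

    def mark_resume_pending(self, session_id):
        self.resume_pending.add(session_id)


async def drain(timeout):
    """Pretend the agents never finish: wait the full timeout, then report timed_out=True."""
    await asyncio.sleep(timeout)
    return True


async def shutdown_mark_after_timeout(store, active, timeout):
    # The marker is written only on the timeout branch, after the drain wait.
    timed_out = await drain(timeout)
    if timed_out:
        for session_id in active:
            store.mark_resume_pending(session_id)


async def simulate(kill_after, drain_timeout):
    """Run the shutdown and cancel it after kill_after seconds (stand-in for a kill)."""
    store = FakeStore()
    task = asyncio.create_task(
        shutdown_mark_after_timeout(store, ["chat-1"], drain_timeout)
    )
    try:
        await asyncio.wait_for(task, timeout=kill_after)
    except asyncio.TimeoutError:
        pass  # the "process" died while still inside the drain wait
    return store.resume_pending


def run(kill_after, drain_timeout):
    return asyncio.run(simulate(kill_after, drain_timeout))
```

When the simulated kill arrives before the drain timeout, the store ends up with no marker at all; when the drain timeout is reached first, the marker is written as intended.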

If the session's updated_at is older than the startup suspend_recently_active() fallback window, the next gateway boot has no durable marker to resume or suspend the in-flight session cleanly.
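
A rough sketch of why the fallback misses long-running turns follows. The Entry class, the startup_action function, and the two-minute window are all assumptions made for illustration; the issue does not state the real window length or how suspend_recently_active() is implemented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Entry:
    """Hypothetical, reduced stand-in for SessionEntry."""
    updated_at: datetime
    resume_pending: bool = False


def startup_action(entry, now, window):
    """Decide what the next boot can do with a session (illustrative logic only)."""
    if entry.resume_pending:
        return "resume"    # durable marker survived the restart
    if now - entry.updated_at <= window:
        return "suspend"   # caught by the freshness fallback
    return "none"          # indistinguishable from an ordinary idle session


NOW = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
WINDOW = timedelta(minutes=2)  # illustrative value, not taken from the issue

# A turn that started 30 minutes ago and has not completed, so updated_at is stale.
long_task = Entry(updated_at=NOW - timedelta(minutes=30))
# A turn that updated the entry 30 seconds ago.
short_task = Entry(updated_at=NOW - timedelta(seconds=30))
# The same long turn, but with a durable marker written before the kill.
marked_task = Entry(updated_at=NOW - timedelta(minutes=30), resume_pending=True)
```

Under these assumptions, only the unmarked long-running entry falls through both checks, which matches the failure described above.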

Steps to Reproduce

1. Start a gateway session from a messaging platform.
2. Run a long task whose SessionEntry.updated_at is older than the startup fallback window because the turn has not completed yet.
3. Restart the gateway via the service manager while the task is still active.
4. Let the service manager terminate the old process while it is still inside the drain wait.
5. Start the gateway again and send a message in the same chat/thread/topic.

Expected Behavior

The active session is durably marked before the vulnerable drain wait begins, so the next process can recover the interrupted session state instead of treating it as a normal idle session.

Actual Behavior

The durable resume_pending marker may never be written if the old process is killed during the drain wait. Long-running sessions outside the startup freshness window can then appear stopped or reset after restart.

Notes

This is in the same failure-mode family as the existing restart resume work, but the race happens earlier than the post-timeout mark_resume_pending() path: the process can die before that branch gets control.

A narrow fix is to pre-mark currently running sessions before awaiting drain, then clear only those early markers if the drain completes gracefully.
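
The pre-mark approach described in this note might look roughly like the sketch below. It is not the code from the linked pull requests: FakeStore, clear_resume_pending, shutdown_with_premark, and the drain callables are hypothetical names, and task cancellation again stands in for the process being killed.

```python
import asyncio


class FakeStore:
    """Hypothetical stand-in for the durable session store."""

    def __init__(self, already_marked=()):
        self.resume_pending = set(already_marked)

    def mark_resume_pending(self, session_id):
        self.resume_pending.add(session_id)

    def clear_resume_pending(self, session_id):
        self.resume_pending.discard(session_id)


async def shutdown_with_premark(store, active, drain, timeout):
    # Pre-mark before the vulnerable wait, remembering which markers are ours.
    early = [sid for sid in active if sid not in store.resume_pending]
    for sid in early:
        store.mark_resume_pending(sid)

    timed_out = await drain(timeout)

    # On a graceful drain, clear only the early markers added above.
    if not timed_out:
        for sid in early:
            store.clear_resume_pending(sid)
    return timed_out


async def graceful(timeout):
    return False  # all agents finished in time


async def stuck(timeout):
    await asyncio.sleep(timeout)
    return True  # agents still running when the timeout expired


def demo(drain, kill_after=None):
    """Run one shutdown; kill_after simulates the service manager's kill."""
    async def main():
        store = FakeStore(already_marked={"older"})
        task = asyncio.create_task(
            shutdown_with_premark(store, ["chat-1", "older"], drain, 0.2)
        )
        try:
            await asyncio.wait_for(task, timeout=kill_after)
        except asyncio.TimeoutError:
            pass  # killed while still inside the drain wait
        return store.resume_pending
    return asyncio.run(main())
```

Tracking the early markers separately is what keeps the fix narrow: a marker that existed before shutdown began (here "older") is never cleared by the graceful path, while a kill during the drain wait still leaves "chat-1" marked for the next boot.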
