hermes - Session liveness: 29% of sessions leak with ended_at=NULL on hard exits

Summary

Hermes SessionDB.end_session() is called only from happy-path exits. Sessions terminated via SIGKILL, terminal force-close, OOM kill, machine crash, or ad-hoc script exit leak silently with ended_at IS NULL, polluting downstream consumers (session listings, activity dashboards, our overlay visualizer that aggregates 4 gateways).

Empirical observation

Snapshot of one production state.db (Mac Mini, 4-gateway fleet, ~3 months of usage):

Total sessions ............ 3,049
ended_at IS NULL .......... 887  (29%)

end_reason histogram (all 3,049 sessions, including unmarked ones):
  cron_complete .... 1,275  ← cron jobs end deterministically
  (NULL) ............ 887  ← the leaks
  recovered ......... 640  ← gateway-side crash recovery (gateway/session.py)
  cli_close ......... 229  ← clean CLI exit (the only CLI path that sets it)
  session_reset ..... 17
  compression ........ 1
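
For anyone who wants to run the same census on their own state.db, here is a minimal sketch. It assumes state.db is a SQLite file and that the sessions table has at least the ended_at and end_reason columns named in this issue; the id column and the in-memory demo rows are hypothetical stand-ins, not real data.

```python
import sqlite3


def census(conn: sqlite3.Connection) -> dict:
    """Count total sessions, leaked sessions, and sessions per end_reason."""
    total, leaked = conn.execute(
        # "ended_at IS NULL" evaluates to 0 or 1 in SQLite, so SUM counts leaks.
        "SELECT COUNT(*), COALESCE(SUM(ended_at IS NULL), 0) FROM sessions"
    ).fetchone()
    histogram = conn.execute(
        "SELECT COALESCE(end_reason, '(NULL)'), COUNT(*) FROM sessions "
        "GROUP BY end_reason ORDER BY COUNT(*) DESC"
    ).fetchall()
    return {"total": total, "leaked": leaked, "histogram": histogram}


# Demo against a throwaway in-memory database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id TEXT PRIMARY KEY, ended_at REAL, end_reason TEXT)"
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [
        ("a", 100.0, "cli_close"),
        ("b", None, None),  # a leaked session
        ("c", 200.0, "cron_complete"),
        ("d", 300.0, "cron_complete"),
    ],
)
report = census(conn)
print(report)
```

Against a real database, replace the in-memory connection with sqlite3.connect pointed at a copy of state.db rather than the live file.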

The asymmetry is striking: source='gateway' has a crash-recovery backstop (end_reason='recovered', 640 rows), but source='cli' does not. The only CLI path that marks ended_at is the finally block at cli.py:11968-11972, which runs only when the Python interpreter gets the chance to unwind and exit normally.
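
The failure mode is easy to reproduce in isolation. The sketch below is not Hermes code: it spawns a child Python process whose finally block writes a marker file (a stand-in for the end_session() call), hard-kills the child, and checks whether the marker ever appeared. The CHILD script and the marker file are hypothetical names used only for this illustration.

```python
import os
import subprocess
import sys
import tempfile

# Child process: signals readiness, then sleeps inside a try/finally.
CHILD = """
import sys, time
marker = sys.argv[1]
try:
    print("ready", flush=True)
    time.sleep(60)
finally:
    # Stand-in for the end_session() call in the CLI's finally block.
    open(marker, "w").close()
"""


def finally_ran_after_kill() -> bool:
    """Return True if the child's finally block ran despite a hard kill."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "ended")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker], stdout=subprocess.PIPE
        )
        proc.stdout.readline()  # block until the child is inside the try block
        proc.kill()             # SIGKILL on POSIX, TerminateProcess on Windows
        proc.wait()
        proc.stdout.close()
        return os.path.exists(marker)


leaked = not finally_ran_after_kill()
print("session leaked:", leaked)
```

Because the kernel tears the process down without giving the interpreter a chance to run, no in-process cleanup (finally, atexit, signal handlers) can cover this case, which is why the fix has to come from outside the dying process.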

Root cause (grep census)

Calls to end_session() in the codebase:

  • cli.py:5171 — /new
  • cli.py:5309 — /resume
  • cli.py:5411 — /branch
  • cli.py:11970 — atexit / finally block (cli_close)
  • run_agent.py:9153 — compression split
  • gateway/session.py:939, 1164, 1219 — gateway resets

No coverage for: SIGKILL, force-close terminal, OOM kill, ssh drop before SIGHUP grace fires, batch_runner exit, ad-hoc programmatic exits.

Proposed fix (sketch)

Two-part:

  1. last_heartbeat_at column on sessions. run_agent.py updates it every ~15s during the agent loop. Single-row UPDATE; cheap.
  2. Stale-session reaper. Periodic sweep (e.g. every 60s, in the dashboard process or as a harness janitor subcommand): for any row with ended_at IS NULL AND last_heartbeat_at < now - 90s (six missed heartbeats at the 15s interval), set ended_at and end_reason='reaped'. Could subsume the existing recovered mechanism for gateway sessions.
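
The two parts above could look roughly like the following sketch. It assumes a SQLite-backed sessions table with epoch-second float timestamps and an id primary key; heartbeat() and reap_stale() are hypothetical helper names, not existing Hermes functions. Setting ended_at to the last heartbeat (the last moment the session was known to be alive) is one possible choice; using the reap time instead would be equally defensible.

```python
import sqlite3
import time

HEARTBEAT_INTERVAL_S = 15  # write cadence from the proposal
STALE_AFTER_S = 90         # staleness threshold from the proposal


def heartbeat(conn, session_id, now=None):
    """Single-row UPDATE, intended to be called from the agent loop."""
    now = time.time() if now is None else now
    conn.execute(
        "UPDATE sessions SET last_heartbeat_at = ? "
        "WHERE id = ? AND ended_at IS NULL",
        (now, session_id),
    )
    conn.commit()


def reap_stale(conn, now=None):
    """Mark open sessions with a stale heartbeat as reaped; return the count."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE sessions "
        "SET ended_at = last_heartbeat_at, end_reason = 'reaped' "
        "WHERE ended_at IS NULL "
        "AND last_heartbeat_at IS NOT NULL "  # skip rows that never beat
        "AND last_heartbeat_at < ?",
        (now - STALE_AFTER_S,),
    )
    conn.commit()
    return cur.rowcount


# Demo on a throwaway in-memory database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id TEXT PRIMARY KEY, ended_at REAL, "
    "end_reason TEXT, last_heartbeat_at REAL)"
)
conn.executemany(
    "INSERT INTO sessions (id) VALUES (?)", [("live",), ("dead",), ("legacy",)]
)
heartbeat(conn, "dead", now=1000.0)    # last beat 100s before the sweep
heartbeat(conn, "live", now=1080.0)    # last beat 20s before the sweep
reaped = reap_stale(conn, now=1100.0)  # cutoff is 1100 - 90 = 1010
rows = dict(conn.execute("SELECT id, end_reason FROM sessions"))
print(reaped, rows)
```

Passing now explicitly keeps the sweep deterministic in tests; in production both functions would fall back to the wall clock.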

Backwards-compatible: last_heartbeat_at is nullable, old rows just don't have it. Reaper only acts on the heartbeat-equipped subset.
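
A sketch of what that migration might look like, again assuming SQLite: the column is added as nullable with no default, and the check makes the step safe to run on every startup. ensure_heartbeat_column() is a hypothetical name, and the REAL type assumes epoch-second timestamps.

```python
import sqlite3


def ensure_heartbeat_column(conn: sqlite3.Connection) -> bool:
    """Add last_heartbeat_at if missing. Returns True if the column was added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(sessions)")}
    if "last_heartbeat_at" in columns:
        return False
    # Nullable, no default: existing rows simply read back as NULL.
    conn.execute("ALTER TABLE sessions ADD COLUMN last_heartbeat_at REAL")
    conn.commit()
    return True


# Demo: a pre-migration table holding one old, already-leaked row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id TEXT PRIMARY KEY, ended_at REAL, end_reason TEXT)"
)
conn.execute("INSERT INTO sessions (id) VALUES ('old-leak')")

first_run = ensure_heartbeat_column(conn)
second_run = ensure_heartbeat_column(conn)  # no-op on the second call
legacy_heartbeat = conn.execute(
    "SELECT last_heartbeat_at FROM sessions WHERE id = 'old-leak'"
).fetchone()[0]
print(first_run, second_run, legacy_heartbeat)
```

One consequence worth noting: rows that leaked before the migration keep a NULL heartbeat, so a reaper restricted to heartbeat-equipped rows leaves them untouched and they would need a separate one-time cleanup if that matters.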

What I'm looking for

Is this a known issue with a planned fix? If not, would you accept a PR? I'm happy to scope it carefully — schema migration, heartbeat write site, reaper, tests. Would prefer to validate on our fleet for ~2 weeks via a fork patch before opening the PR, so the upstream change ships with empirical leak-rate-before/after data.

Filed because we hit it concretely while building Space (a desktop visualizer that polls /v1/space/activity from 4 gateways and renders a sprite per session). The 600s message-activity heuristic is a workable approximation, but better signals would help everyone consuming session state, not just us.
