hermes - 💡(How to fix) Fix Session liveness: 29% of sessions leak with ended_at=NULL on hard exits

hermes 2026-05-08 20:48:02


Hermes SessionDB.end_session() is called only from happy-path exits. Sessions terminated via SIGKILL, terminal force-close, OOM kill, machine crash, or ad-hoc script exit leak silently with ended_at IS NULL, polluting downstream consumers (session listings, activity dashboards, our overlay visualizer that aggregates 4 gateways).


Is this a known issue with a planned fix? If not, would you accept a PR? I'm happy to scope it carefully — schema migration, heartbeat write site, reaper, tests. Would prefer to validate on our fleet for ~2 weeks via a fork patch before opening the PR, so the upstream change ships with empirical leak-rate-before/after data.


Summary

Empirical observation

Snapshot of one production state.db (Mac Mini, 4-gateway fleet, ~3 months of usage):

Total sessions ............ 3,049
ended_at IS NULL .......... 887  (29%)

end_reason histogram across all sessions (rows sum to 3,049):
  cron_complete .... 1,275  ← cron jobs end deterministically
  (NULL) ............ 887  ← the leaks
  recovered ......... 640  ← gateway-side crash recovery (gateway/session.py)
  cli_close ......... 229  ← clean CLI exit (the only CLI path that sets it)
  session_reset ..... 17
  compression ........ 1
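
A census like the one above can be reproduced with a short query. The following is a minimal sketch, assuming state.db is a SQLite file with a `sessions` table that exposes the `ended_at` and `end_reason` columns named in this issue; the function name `end_reason_census` is hypothetical.

```python
import sqlite3


def end_reason_census(conn: sqlite3.Connection):
    """Return (total, leaked, histogram) for the sessions table.

    Assumes a `sessions` table with `ended_at` and `end_reason` columns,
    as described in the issue. The rest of the schema is not needed here.
    """
    total = conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
    leaked = conn.execute(
        "SELECT COUNT(*) FROM sessions WHERE ended_at IS NULL"
    ).fetchone()[0]
    # GROUP BY puts all NULL end_reason rows into one bucket; COALESCE only
    # gives that bucket a printable label.
    histogram = conn.execute(
        "SELECT COALESCE(end_reason, '(NULL)'), COUNT(*) "
        "FROM sessions GROUP BY end_reason ORDER BY COUNT(*) DESC"
    ).fetchall()
    return total, leaked, histogram
```

Calling it with `sqlite3.connect("state.db")` (path assumed) and printing `leaked / total` gives the leak rate directly, which makes it easy to compare before and after any fix.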

The asymmetry is striking: source='gateway' has a crash-recovery backstop (end_reason='recovered', 640 rows), while source='cli' does not. The only CLI path that marks ended_at is the finally block at cli.py:11968-11972, which runs only when the interpreter gets to unwind normally. A process that is killed outright never reaches its finally blocks or atexit handlers.

Root cause (grep census)

Calls to end_session() in the codebase:

cli.py:5171 — /new
cli.py:5309 — /resume
cli.py:5411 — /branch
cli.py:11970 — atexit / finally block (cli_close)
run_agent.py:9153 — compression split
gateway/session.py:939, 1164, 1219 — gateway resets

No coverage for: SIGKILL, force-close terminal, OOM kill, ssh drop before SIGHUP grace fires, batch_runner exit, ad-hoc programmatic exits.
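
The gap is easy to reproduce outside Hermes. The sketch below is a standalone illustration, not Hermes code: a child process writes a marker file from a finally block (a stand-in for the end_session() call), and the parent either lets it exit cleanly or kills it. The names `CHILD` and `cleanup_ran` are hypothetical.

```python
import os
import subprocess
import sys
import tempfile

# Child script: the finally block stands in for the cli_close cleanup path.
CHILD = """
import sys
marker = sys.argv[1]
try:
    print("ready", flush=True)   # tell the parent we are inside the try block
    sys.stdin.readline()         # block until the parent closes stdin or kills us
finally:
    open(marker, "w").close()    # stand-in for end_session()
"""


def cleanup_ran(hard_kill: bool) -> bool:
    """Run the child, end it cleanly or by kill, and report whether cleanup ran."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "ended")
        with subprocess.Popen(
            [sys.executable, "-c", CHILD, marker],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        ) as proc:
            proc.stdout.readline()   # wait for "ready"
            if hard_kill:
                proc.kill()          # SIGKILL on POSIX, TerminateProcess on Windows
            else:
                proc.stdin.close()   # EOF lets readline() return, so finally runs
            proc.wait()
        return os.path.exists(marker)
```

With `hard_kill=False` the marker is written; with `hard_kill=True` it is not, because the kernel removes the process before any Python-level cleanup can run. No in-process handler can close that gap, which is why the fix has to come from an observer outside the dying process.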

Proposed fix (sketch)

Two-part:

last_heartbeat_at column on sessions. run_agent.py updates it every ~15s during the agent loop. Single-row UPDATE; cheap.
Stale-session reaper. Periodic sweep (e.g. every 60s, in the dashboard process or as a harness janitor subcommand): mark any ended_at IS NULL AND last_heartbeat_at < now - 90s as end_reason='reaped'. Could subsume the existing recovered mechanism for gateway sessions.

Backwards-compatible: last_heartbeat_at is nullable, old rows just don't have it. Reaper only acts on the heartbeat-equipped subset.

What I'm looking for

Filed because we hit it concretely while building Space (a desktop visualizer that polls /v1/space/activity from 4 gateways and renders a sprite per session). The 600s message-activity heuristic is a workable approximation, but better signals would help everyone consuming session state, not just us.
