hermes - 💡(How to fix) Fix Gateway restart resume can lose immediate pre-restart context (possible JSONL vs SQLite transcript mismatch)

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

Error Output

Code Example

Gateway drain timed out after 60.0s with 1 active agent(s); interrupting remaining work.
RAW_BUFFERClick to expand / collapse

Bug Description

Gateway restart recovery does not reliably preserve the immediate pre-restart conversation state.

In practice, after hermes gateway restart interrupts an active turn and the user sends a follow-up like Continue, Hermes often reopens the same session but fails to recover the most recent interrupted context. Older history is present, but the exact in-flight turn right before restart is partially or fully lost.

From code inspection, this appears to be a real bug/gap rather than just UX confusion:

  1. restart recovery preserves the session_key and marks the session resume_pending
  2. the next message reuses the same session_id and transcript lane
  3. but there is no true checkpoint/restore of in-flight agent execution state
  4. additionally, interrupted-turn transcript recovery may prefer stale JSONL history over newer SQLite-persisted interrupted messages

The likely concrete failure mode is:

  • AIAgent persists interrupted/early-exit messages to SQLite via _persist_session() / _flush_messages_to_session_db()
  • gateway JSONL append/write happens later in normal post-run flow
  • if the gateway is restarted during that window, SQLite may contain newer interrupted-turn messages than JSONL
  • gateway/session.py::load_transcript() then returns whichever source has more messages, not whichever is newer or more complete
  • older-but-longer JSONL can therefore win over newer SQLite rows, silently dropping the latest interrupted context on resume

This makes the current restart UX promise stronger than the implementation delivers. The gateway says:

"Send any message after restart and I'll try to resume where you left off."

But in practice that can fail specifically for the immediate pre-restart turn.

Steps to Reproduce

  1. Start an active gateway conversation on Discord/Telegram (or another gateway platform) where Hermes is in the middle of a tool/agent turn.
  2. Trigger a gateway restart with:
    • hermes gateway restart
  3. Let the running turn be interrupted during shutdown/drain.
  4. After the gateway comes back, send a follow-up such as:
    • Continue
  5. Compare the recovered context with the pre-restart active turn.

Expected Behavior

  • The same session should reopen.
  • The most recent interrupted user/tool/assistant context should still be available.
  • If full in-flight resume is not supported, the recovered transcript should at least contain the newest persisted interrupted-turn messages so the model can continue from the latest saved point.
  • The user-facing restart notice should match the actual guarantee.

Actual Behavior

  • The same session lane is often reused, but the immediate pre-restart working context is not reliably restored.
  • Older conversation history is present, while the last interrupted turn may be missing or weakened.
  • A simple Continue can feel like Hermes forgot what happened right before restart.

Environment

  • OS: Ubuntu Linux (user systemd service)
  • Hermes version: Hermes Agent v0.10.0 (2026.4.16)
  • Repo commit inspected: 093aec5a4c66997ddf053bebc8d7720bf673facc
  • Python version: 3.11.15
  • Gateway context: Discord
  • Restart path used: hermes gateway restart

Error Output

Relevant live observations/logs from one reproduced case:

Gateway drain timed out after 60.0s with 1 active agent(s); interrupting remaining work.

Related code paths inspected:

  • gateway/run.py
    • shutdown warning text around lines ~1735-1742
    • resume_pending marking around ~2604-2638
    • transcript/history load around ~4127-4128
    • resume-pending clear on successful turn around ~4520-4528
    • transcript persistence around ~4683-4732
    • restart recovery note injection around ~9988-10013
  • gateway/session.py
    • resume_pending session entry fields around ~381-390
    • get_or_create_session() reuse of resume_pending session around ~748-759
    • mark_resume_pending() around ~849-875
    • load_transcript() source-selection logic around ~1170-1216
  • run_agent.py
    • _persist_session() around ~2932-2943
    • _flush_messages_to_session_db() around ~2945-2991

Additional Context

Current evidence suggests two separate issues:

  1. Product/behavior gap
    • restart resume currently means "reload saved transcript and continue", not true in-flight checkpoint/restore
  2. Likely transcript source bug
    • load_transcript() chooses whichever source has more messages (JSONL vs SQLite)
    • that can prefer stale JSONL over newer interrupted-turn SQLite rows

Suggested fix directions:

  • Make interrupted-turn recovery choose the freshest/most complete transcript source, not just the longest
  • Consider transcript merge/reconciliation instead of simple length preference
  • Tighten the guarantee in the user-facing restart message if exact pre-restart state cannot be restored
  • Add a regression test covering interrupted gateway restart where SQLite contains newer interrupted-turn messages than JSONL

extent analysis

TL;DR

The most likely fix involves modifying the load_transcript() function to prioritize the freshest transcript source, rather than the longest, to ensure that the most recent interrupted-turn messages are recovered after a gateway restart.

Guidance

  1. Review and modify load_transcript(): Update the load_transcript() function in gateway/session.py to prefer the transcript source with the newest messages, rather than the source with the most messages.
  2. Consider transcript merge/reconciliation: Instead of simply choosing one transcript source over the other, consider implementing a merge or reconciliation mechanism to combine the messages from both sources, ensuring that the most complete and up-to-date transcript is recovered.
  3. Tighten the restart guarantee: Review the user-facing restart message and consider revising it to reflect the actual capabilities of the restart feature, avoiding promises that may not be fulfilled in all cases.
  4. Add regression testing: Develop a regression test to cover the scenario where SQLite contains newer interrupted-turn messages than JSONL, ensuring that the fix addresses the identified issue and preventing similar problems in the future.

Example

# Example of how load_transcript() could be modified to prefer the freshest source
def load_transcript(self):
    # ...
    jsonl_messages = self.load_jsonl_transcript()
    sqlite_messages = self.load_sqlite_transcript()
    
    # Compare the timestamps of the most recent messages in each source
    if sqlite_messages and jsonl_messages:
        sqlite_latest_timestamp = max(message['timestamp'] for message in sqlite_messages)
        jsonl_latest_timestamp = max(message['timestamp'] for message in jsonl_messages)
        
        # Prefer the source with the newest messages
        if sqlite_latest_timestamp > jsonl_latest_timestamp:
            return sqlite_messages
        else:
            return jsonl_messages
    # ...

Notes

The provided solution focuses on addressing the identified issue with transcript source selection. However, the underlying problem of not having true in-flight checkpoint/restore functionality may still exist and should be considered for future improvements.

Recommendation

Apply the workaround by modifying the load_transcript() function to prioritize the freshest transcript source, as this directly addresses the identified issue and improves the reliability of the restart feature.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING