hermes - 💡(How to fix) Fix Gateway restart resume can lose immediate pre-restart context (possible JSONL vs SQLite transcript mismatch)

StepCodex · 2026-04-20T17:10:47Z

[hermes] Bug Description Gateway restart recovery does not reliably preserve the immediate pre-restart conversation state. In practice, after hermes gateway re… ## Bug Description Gateway restart recovery does not reliably preserve the immediate pre-restart conversation state. In practice, after `hermes gateway restart` interrupts an active turn and the user sends a follow-up like `Continue`, Hermes often reopens the same session but fails to recover the most recent interrupted context. Older history is present, but the exact in-flight turn right before restart is partially or fully lost. From code inspection, this appears to be a real bug/gap rather than just UX confusion: 1. restart recovery preserves the `session_key` and marks the session `resume_pending` 2. the next message reuses the same `session_id` and transcript lane 3. but there is no true checkpoint/restore of in-flight agent execution state 4. additionally, interrupted-turn transcript recovery may prefer stale JSONL history over newer SQLite-persisted interrupted messages The likely concrete failure mode is: - `AIAgent` persists interrupted/early-exit messages to SQLite via `_persist_session()` / `_flush_messages_to_session_db()` - gateway JSONL append/write happens later in normal post-run flow - if the gateway is restarted during that window, SQLite may contain newer interrupted-turn messages than JSONL - `gateway/session.py::load_transcript()` then returns whichever source has more messages, not whichever is newer or more complete - older-but-longer JSONL can therefore win over newer SQLite rows, silently dropping the latest interrupted context on resume This makes the current restart UX promise stronger than the implementation delivers. The gateway says: > "Send any message after restart and I'll try to resume where you left off." But in practice that can fail specifically for the immediate pre-restart turn. ## Steps to Reproduce 1. Start an active gateway conversation on Discord/Telegram (or another gateway platform) where Hermes is in the middle of a tool/agent turn. 2. Trigger a gateway restart with: - `hermes gateway restart` 3. Let the running turn be interrupted during shutdown/drain. 4. After the gateway comes back, send a follow-up such as: - `Continue` 5. Compare the recovered context with the pre-restart active turn. ## Expected Behavior - The same session should reopen. - The most recent interrupted user/tool/assistant context should still be available. - If full in-flight resume is not supported, the recovered transcript should at least contain the newest persisted interrupted-turn messages so the model can continue from the latest saved point. - The user-facing restart notice should match the actual guarantee. ## Actual Behavior - The same session lane is often reused, but the immediate pre-restart working context is not reliably restored. - Older conversation history is present, while the last interrupted turn may be missing or weakened. - A simple `Continue` can feel like Hermes forgot what happened right before restart. ## Environment - OS: Ubuntu Linux (user systemd service) - Hermes version: `Hermes Agent v0.10.0 (2026.4.16)` - Repo commit inspected: `093aec5a4c66997ddf053bebc8d7720bf673facc` - Python version: `3.11.15` - Gateway context: Discord - Restart path used: `hermes gateway restart` ## Error Output Relevant live observations/logs from one reproduced case: ```text Gateway drain timed out after 60.0s with 1 active agent(s); interrupting remaining work. ``` Related code paths inspected: - `gateway/run.py` - shutdown warning text around lines ~1735-1742 - `resume_pending` marking around ~2604-2638 - transcript/history load around ~4127-4128 - resume-pending clear on successful turn around ~4520-4528 - transcript persistence around ~4683-4732 - restart recovery note injection around ~9988-10013 - `gateway/session.py` - `resume_pending` session entry fields around ~381-390 - `get_or_create_session()` reuse of `resume_pending` session around ~748-759 - `mark_resume_pending()` around ~849-875 - `load_transcript()` source-selection logic around ~1170-1216 - `run_agent.py` - `_persist_session()` around ~2932-2943 - `_flush_messages_to_session_db()` around ~2945-2991 ## Additional Context Current evidence suggests two separate issues: 1. **Product/behavior gap** - restart resume currently means "reload saved transcript and continue", not true in-flight checkpoint/restore 2. **Likely transcript source bug** - `load_transcript()` chooses whichever source has more messages (`JSONL` vs `SQLite`) - that can prefer stale JSONL over newer interrupted-turn SQLite rows Suggested fix directions: - Make interrupted-turn recovery choose the freshest/most complete transcript source, not just the longest - Consider transcript merge/reconciliation instead of simple length preference - Tighten the guarantee in the user-facing restart message if exact pre-restart state cannot be restored - Add a regression test covering interrupted gateway r

Bug Description

Gateway restart recovery does not reliably preserve the immediate pre-restart conversation state.

In practice, after hermes gateway restart interrupts an active turn and the user sends a follow-up like Continue, Hermes often reopens the same session but fails to recover the most recent interrupted context. Older history is present, but the exact in-flight turn right before restart is partially or fully lost.

From code inspection, this appears to be a real bug/gap rather than just UX confusion:

restart recovery preserves the session_key and marks the session resume_pending
the next message reuses the same session_id and transcript lane
but there is no true checkpoint/restore of in-flight agent execution state
additionally, interrupted-turn transcript recovery may prefer stale JSONL history over newer SQLite-persisted interrupted messages

The likely concrete failure mode is:

AIAgent persists interrupted/early-exit messages to SQLite via _persist_session() / _flush_messages_to_session_db()
gateway JSONL append/write happens later in normal post-run flow
if the gateway is restarted during that window, SQLite may contain newer interrupted-turn messages than JSONL
gateway/session.py::load_transcript() then returns whichever source has more messages, not whichever is newer or more complete
older-but-longer JSONL can therefore win over newer SQLite rows, silently dropping the latest interrupted context on resume

This makes the current restart UX promise stronger than the implementation delivers. The gateway says:

"Send any message after restart and I'll try to resume where you left off."

But in practice that can fail specifically for the immediate pre-restart turn.

Steps to Reproduce

Start an active gateway conversation on Discord/Telegram (or another gateway platform) where Hermes is in the middle of a tool/agent turn.
Trigger a gateway restart with:
- hermes gateway restart
Let the running turn be interrupted during shutdown/drain.
After the gateway comes back, send a follow-up such as:
- Continue
Compare the recovered context with the pre-restart active turn.

Expected Behavior

The same session should reopen.
The most recent interrupted user/tool/assistant context should still be available.
If full in-flight resume is not supported, the recovered transcript should at least contain the newest persisted interrupted-turn messages so the model can continue from the latest saved point.
The user-facing restart notice should match the actual guarantee.

Actual Behavior

The same session lane is often reused, but the immediate pre-restart working context is not reliably restored.
Older conversation history is present, while the last interrupted turn may be missing or weakened.
A simple Continue can feel like Hermes forgot what happened right before restart.

Environment

OS: Ubuntu Linux (user systemd service)
Hermes version: Hermes Agent v0.10.0 (2026.4.16)
Repo commit inspected: 093aec5a4c66997ddf053bebc8d7720bf673facc
Python version: 3.11.15
Gateway context: Discord
Restart path used: hermes gateway restart

Error Output

Relevant live observations/logs from one reproduced case:

Gateway drain timed out after 60.0s with 1 active agent(s); interrupting remaining work.

Related code paths inspected:

gateway/run.py
- shutdown warning text around lines ~1735-1742
- resume_pending marking around ~2604-2638
- transcript/history load around ~4127-4128
- resume-pending clear on successful turn around ~4520-4528
- transcript persistence around ~4683-4732
- restart recovery note injection around ~9988-10013
gateway/session.py
- resume_pending session entry fields around ~381-390
- get_or_create_session() reuse of resume_pending session around ~748-759
- mark_resume_pending() around ~849-875
- load_transcript() source-selection logic around ~1170-1216
run_agent.py
- _persist_session() around ~2932-2943
- _flush_messages_to_session_db() around ~2945-2991

Additional Context

Current evidence suggests two separate issues:

Product/behavior gap
- restart resume currently means "reload saved transcript and continue", not true in-flight checkpoint/restore
Likely transcript source bug
- load_transcript() chooses whichever source has more messages (JSONL vs SQLite)
- that can prefer stale JSONL over newer interrupted-turn SQLite rows

Suggested fix directions:

Make interrupted-turn recovery choose the freshest/most complete transcript source, not just the longest
Consider transcript merge/reconciliation instead of simple length preference
Tighten the guarantee in the user-facing restart message if exact pre-restart state cannot be restored
Add a regression test covering interrupted gateway restart where SQLite contains newer interrupted-turn messages than JSONL

extent analysis

TL;DR

The most likely fix involves modifying the load_transcript() function to prioritize the freshest transcript source, rather than the longest, to ensure that the most recent interrupted-turn messages are recovered after a gateway restart.

Guidance

Review and modify load_transcript(): Update the load_transcript() function in gateway/session.py to prefer the transcript source with the newest messages, rather than the source with the most messages.
Consider transcript merge/reconciliation: Instead of simply choosing one transcript source over the other, consider implementing a merge or reconciliation mechanism to combine the messages from both sources, ensuring that the most complete and up-to-date transcript is recovered.
Tighten the restart guarantee: Review the user-facing restart message and consider revising it to reflect the actual capabilities of the restart feature, avoiding promises that may not be fulfilled in all cases.
Add regression testing: Develop a regression test to cover the scenario where SQLite contains newer interrupted-turn messages than JSONL, ensuring that the fix addresses the identified issue and preventing similar problems in the future.

Example

# Example of how load_transcript() could be modified to prefer the freshest source
def load_transcript(self):
    # ...
    jsonl_messages = self.load_jsonl_transcript()
    sqlite_messages = self.load_sqlite_transcript()
    
    # Compare the timestamps of the most recent messages in each source
    if sqlite_messages and jsonl_messages:
        sqlite_latest_timestamp = max(message['timestamp'] for message in sqlite_messages)
        jsonl_latest_timestamp = max(message['timestamp'] for message in jsonl_messages)
        
        # Prefer the source with the newest messages
        if sqlite_latest_timestamp > jsonl_latest_timestamp:
            return sqlite_messages
        else:
            return jsonl_messages
    # ...

Notes

The provided solution focuses on addressing the identified issue with transcript source selection. However, the underlying problem of not having true in-flight checkpoint/restore functionality may still exist and should be considered for future improvements.

Recommendation

Apply the workaround by modifying the load_transcript() function to prioritize the freshest transcript source, as this directly addresses the identified issue and improves the reliability of the restart feature.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

hermes - 💡(How to fix) Fix Gateway restart resume can lose immediate pre-restart context (possible JSONL vs SQLite transcript mismatch)

Recommended Tools

GitHub issue graph ai analysis

Error Message

Error Output

Code Example

Bug Description

Steps to Reproduce

Expected Behavior

Actual Behavior

Environment

Error Output

Additional Context

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

hermes - 💡(How to fix) Fix Gateway restart resume can lose immediate pre-restart context (possible JSONL vs SQLite transcript mismatch)

Recommended Tools

GitHub issue graph ai analysis

Error Message

Error Output

Code Example

Bug Description

Steps to Reproduce

Expected Behavior

Actual Behavior

Environment

Error Output

Additional Context

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING