hermes - ✅(Solved) Fix Bug: cold-start with conversation= mints new session_id, breaking client-side conversation continuity [2 pull requests, 1 participant]

GitHub stats
NousResearch/hermes-agent#16517 (fetched 2026-04-28 06:52:54)
Comments: 0 · Participants: 1 · Timeline: 5 · Reactions: 0
Timeline (top): labeled ×3, cross-referenced ×2

When a client POSTs /v1/responses with a conversation=<slug> body field but the gateway's in-memory _response_store no longer contains the previous response_id (gateway restart, LRU eviction, process recycle), hermes mints a new state.db session_id rather than continuing the existing one for that conversation. The result is N "fork" rows in sessions per logical conversation, with messages distributed across them, no parent_session_id linkage between them.

Fix Action

Fix / Workaround

We've added a JS-side merge in our proxy that collapses session rows sharing a sidekick-* slug, plus a cascade-delete that fans out to all forks. Works fine, but feels like we're papering over a backend invariant that should hold. Hence the issue.

PR fix notes

PR #16607: fix(api_server): reuse session_id for known conversation slug on cold-start (#16517)

Description (problem / solution / changelog)

Summary

Fixes #16517 — POST /v1/responses with {conversation: <slug>, ...} was minting a brand-new state.db session_id whenever the in-memory _response_store no longer held the previous response_id (gateway restart, LRU eviction, process recycle). Result: N independent root sessions per logical conversation, each carrying a slice of the messages, with no parent_session_id linking them.

This PR makes the conversation slug a first-class continuity key: the slug → session_id mapping is now persisted on the conversations table itself, so cold-start requests continue the existing session even after the LRU forgets the response_id.

Decision: Option 1 (slug → session reuse) over Option 2 (parent_session_id linkage)

The issue offered three paths. Option 1 — "look up the latest session_id for conversation=<slug> via the conversations table and reuse it" — is the cleanest because:

  • It preserves the invariant the issue actually wants: one logical conversation = one row in sessions.
  • It avoids the cross-walking work for sidebar UIs / drawer dedup / search CTEs (those still keep their parent_session_id walkers for the compression-rotation case at run_agent.py:7580).
  • The conversations table already exists; we just needed one more column. No new tables, no schema rework.

Option 2 (linking via parent_session_id) was tempting (cheaper at the call site) but it leaves the "N rows per conversation" pathology in place — clients still have to dedup. That's exactly the workaround the issue is trying to eliminate.

Option 3 (document as intentional) didn't fit: run_agent.py:7580 already sets parent_session_id for the compression-rotation fork case, so the lack of linkage on cold-start is clearly an inconsistency, not a deliberate design.

Root cause

conversations:  name → response_id
responses:      response_id → { ..., "session_id": "<sess>", ... }   ← LRU bounded

The session_id only lived inside responses.data (JSON blob). Once an LRU eviction (or restart with a smaller cache) dropped the responses row, the cold-start path at gateway/platforms/api_server.py:1694 fell through to session_id = stored_session_id or str(uuid.uuid4()) with stored_session_id=None, and a fresh fork was born.
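
The fall-through can be reproduced in miniature. The sketch below is not the gateway code: `TinyLRU`, the `conversations` dict, and `pre_fix_session_id` are hypothetical stand-ins that assume only what the paragraph above states, namely that the session_id lives inside the LRU-bounded responses payload and that the resolver ends in `stored_session_id or str(uuid.uuid4())`.

```python
import uuid
from collections import OrderedDict


class TinyLRU:
    """Hypothetical stand-in for the LRU-bounded responses store."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._data = OrderedDict()

    def put(self, response_id, payload):
        self._data[response_id] = payload
        self._data.move_to_end(response_id)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the oldest entry

    def get(self, response_id):
        return self._data.get(response_id)


responses = TinyLRU(max_size=1)
conversations = {}  # name -> response_id only, as before the fix


def pre_fix_session_id(slug):
    """Resolve a session_id the way the pre-fix path is described."""
    response_id = conversations.get(slug)
    stored = responses.get(response_id) if response_id else None
    stored_session_id = stored["session_id"] if stored else None
    return stored_session_id or str(uuid.uuid4())


# Turn 1: brand-new slug, so a fresh session_id is minted and recorded.
first = pre_fix_session_id("foo")
responses.put("resp_1", {"session_id": first})
conversations["foo"] = "resp_1"

# Turn 2 with the responses row intact: the session is reused.
intact = pre_fix_session_id("foo")

# Unrelated traffic evicts resp_1; the slug now points at a missing row.
responses.put("resp_other", {"session_id": "other"})
forked = pre_fix_session_id("foo")
```

Here `intact` equals `first`, while `forked` is a new UUID even though the slug never changed, which is the fork described in the issue.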

Fix

Schema

conversations gains a session_id TEXT column. Existing databases are migrated transparently with an idempotent ALTER TABLE ... ADD COLUMN session_id TEXT that swallows the duplicate-column OperationalError on already-migrated DBs.
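
A minimal sketch of such an idempotent migration using Python's `sqlite3` module follows. The helper name `ensure_session_id_column` and the two-column legacy schema are assumptions for illustration; the PR's actual migration code is not shown on this page. As an extra precaution that the PR description does not mention, this sketch re-raises any `OperationalError` other than the duplicate-column one.

```python
import sqlite3


def ensure_session_id_column(conn: sqlite3.Connection) -> bool:
    """Add conversations.session_id if it is missing; return True if added."""
    try:
        conn.execute("ALTER TABLE conversations ADD COLUMN session_id TEXT")
    except sqlite3.OperationalError as exc:
        # SQLite reports "duplicate column name: session_id" on re-runs.
        if "duplicate column name" not in str(exc).lower():
            raise
        return False
    conn.commit()
    return True


# Assumed legacy schema: name -> response_id, no session_id column yet.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE conversations (name TEXT PRIMARY KEY, response_id TEXT NOT NULL)"
)
conn.execute("INSERT INTO conversations VALUES ('foo', 'resp_1')")

added_first = ensure_session_id_column(conn)   # column is added
added_second = ensure_session_id_column(conn)  # already present, no-op
legacy_row = conn.execute(
    "SELECT name, response_id, session_id FROM conversations"
).fetchone()
```

Rows written before the migration keep their data and read back with `session_id` set to NULL, which matches the legacy-row behavior described under "Migration / compatibility".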

ResponseStore API

# New
def get_session_id_for_conversation(self, name: str) -> Optional[str]: ...

# Extended (old call signature remains valid)
def set_conversation(
    self,
    name: str,
    response_id: str,
    session_id: Optional[str] = None,
) -> None: ...
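
One way these two methods could be backed by SQLite is sketched below. `SketchResponseStore` is a hypothetical, simplified class, not the real `ResponseStore`; the table layout and the use of `INSERT OR REPLACE` are assumptions.

```python
import sqlite3
from typing import Optional


class SketchResponseStore:
    """Hypothetical, simplified store; only the conversations table."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS conversations ("
            "name TEXT PRIMARY KEY, response_id TEXT NOT NULL, session_id TEXT)"
        )

    def set_conversation(
        self,
        name: str,
        response_id: str,
        session_id: Optional[str] = None,
    ) -> None:
        # The two-argument form still works; it simply stores NULL.
        self._conn.execute(
            "INSERT OR REPLACE INTO conversations (name, response_id, session_id) "
            "VALUES (?, ?, ?)",
            (name, response_id, session_id),
        )
        self._conn.commit()

    def get_conversation(self, name: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT response_id FROM conversations WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else None

    def get_session_id_for_conversation(self, name: str) -> Optional[str]:
        # Reads only the conversations row, so it works after eviction.
        row = self._conn.execute(
            "SELECT session_id FROM conversations WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else None


store = SketchResponseStore()
store.set_conversation("foo", "resp_1", session_id="sess_A")
store.set_conversation("foo", "resp_2", session_id="sess_A")  # chain advance
store.set_conversation("legacy", "resp_9")  # old two-argument call
```

In this sketch a two-argument call on an existing slug would overwrite its session_id with NULL; whether the real implementation preserves the previous value in that case is not stated in the PR description.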

/v1/responses handler

  • Compute a slug_session_id_hint at the top of the handler (single query, costs nothing on slug-less requests).
  • When previous_response_id is unresolvable AND the request was keyed by conversation, fall back to the slug-only session_id lookup instead of returning 404. Logged at INFO level so operators see when the fallback fires.
  • When previous_response_id was provided directly (no slug), preserve the existing 404 — there's no slug to fall back to and silently swallowing it would mask client bugs.
  • session_id = stored_session_id or slug_session_id_hint or str(uuid.uuid4()) — a brand-new slug still gets a fresh UUID (no behavior change for first-time conversations).
  • All set_conversation(...) call sites now pass session_id=session_id so the column gets populated on every chain advance.
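
The decision logic in the bullets above can be condensed into one function. This is a sketch under stated assumptions: `resolve_session_id`, `PreviousResponseNotFound`, and the two lookup callables are hypothetical names, and the caller is assumed to have already turned the slug into a `previous_response_id` via the conversations table when one exists.

```python
import uuid
from typing import Callable, Optional


class PreviousResponseNotFound(Exception):
    """Stands in for the handler's 404 response."""


def resolve_session_id(
    conversation: Optional[str],
    previous_response_id: Optional[str],
    get_response: Callable[[str], Optional[dict]],
    get_slug_session_id: Callable[[str], Optional[str]],
) -> str:
    # Single lookup up front; skipped entirely for slug-less requests.
    slug_hint = get_slug_session_id(conversation) if conversation else None

    stored_session_id = None
    if previous_response_id is not None:
        stored = get_response(previous_response_id)
        if stored is not None:
            stored_session_id = stored.get("session_id")
        elif slug_hint is None:
            # No slug, or a legacy slug with no session_id: keep the 404.
            raise PreviousResponseNotFound(previous_response_id)

    # A brand-new slug has neither value and gets a fresh UUID.
    return stored_session_id or slug_hint or str(uuid.uuid4())


live = {"resp_2": {"session_id": "sess_A"}}   # responses still cached
slug_map = {"foo": "sess_A"}                  # conversations.session_id

intact = resolve_session_id("foo", "resp_2", live.get, slug_map.get)
evicted = resolve_session_id("foo", "resp_1", live.get, slug_map.get)
brand_new = resolve_session_id("bar", None, live.get, slug_map.get)
```

Both `intact` and `evicted` resolve to `sess_A`, while `brand_new` is a fresh UUID. Calling the function with `conversation=None` and an evicted `previous_response_id` raises `PreviousResponseNotFound`, mirroring the preserved 404.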

Behavior

| Scenario | Before | After |
|---|---|---|
| conversation=foo, first turn | Mint new session_id (unchanged) | Mint new session_id, persist on conversations row |
| conversation=foo, second turn, response_id intact | Reuse session_id via responses.data | Reuse via responses.data (unchanged) |
| conversation=foo, second turn, response_id evicted | 404 / new fork | Reuse same session_id from conversations.session_id |
| previous_response_id=evicted_id (no slug) | 404 | 404 (unchanged) |
| conversation=foo, legacy row from before this fix (no session_id column populated) | N/A | 404 (no fallback path; client falls back to a fresh slug) |
| Brand-new slug | New session_id | New session_id (unchanged) |

Tests

11 new tests in tests/gateway/test_api_server.py, all green. 130/130 tests in the file pass after the change.

TestResponseStore (6 new):

  • set_conversation persists session_id on the row.
  • session_id survives LRU eviction of the underlying responses row.
  • get_session_id_for_conversation returns None for unknown slugs.
  • The old 2-arg set_conversation(name, response_id) call signature still works (backwards compat).
  • set_conversation overwrites session_id correctly on chain advance.
  • A legacy conversations table without the session_id column migrates in place without crashing.

TestConversationParameter (5 new):

  • First turn populates the new column.
  • Second turn after manually evicting the responses row continues the same session_id (the regression).
  • Legacy slug with no session_id mapping still 404s (no silent regression of the explicit error).
  • Brand-new slug still mints a fresh session_id.
  • Direct previous_response_id path with no slug still 404s on eviction.
$ pytest tests/gateway/test_api_server.py -q
130 passed, 79 warnings in 2.71s

Migration / compatibility

  • Online databases: the ALTER TABLE runs on every ResponseStore.__init__(). The first time it runs against an existing DB, the column is added; subsequent runs hit the duplicate-column branch and no-op. Zero downtime.
  • Legacy conversations rows that pre-date the migration have session_id = NULL. Those rows fall back to the existing 404 — no silent recovery, no crash. Clients that hit one of those will simply get fresh sessions on their next request, which is the same outcome they get today.
  • Old call sites that still pass set_conversation(name, response_id) without session_id= keep working. They just don't benefit from the new fallback.

Changed files

  • gateway/platforms/api_server.py (modified, +110/-19)
  • tests/gateway/test_api_server.py (modified, +202/-0)

PR #16635: fix(api_server): persist session_id on conversation slug to survive LRU eviction (#16517)

Description (problem / solution / changelog)

Summary

  • /v1/responses with conversation=<slug> lost session continuity after the underlying response row was LRU-evicted.
  • Persist session_id alongside response_id on the conversations row and recover it on the read side when the response is gone.
  • Adds three tests covering the new column, the cold-start recovery flow, and the preserved 404 behavior when no slug exists.

The bug

The ResponseStore.conversations table only stored name -> response_id. When a turn's response was later evicted by the LRU policy, get_conversation(slug) still returned the orphaned response_id but get(response_id) returned None. The resolver in _handle_responses then either:

  1. 404'd the client with Previous response not found: <id>, or
  2. Fell through to mint a fresh session_id, breaking client-side conversation continuity (the symptom called out in #16517 — N forked sessions per logical conversation, none of them linked).

The conversations table did not carry enough state to recover the Hermes session by slug alone.

The fix

Structural fix at the persistence layer: store session_id on the conversations row, recover it when the response row is gone.

  • gateway/platforms/api_server.py
    • Schema: add session_id TEXT to the conversations table. Existing installs are migrated via an idempotent ALTER TABLE ADD COLUMN that swallows sqlite3.OperationalError if the column already exists, so re-running on a populated DB is a no-op.
    • set_conversation(name, response_id, session_id=None) now writes session_id too. Both call sites in _handle_responses (snapshot persist + final persist) pass the running session_id.
    • New helper get_conversation_session_id(name) reads back just the session_id — independent of the responses row, which may have been evicted.
    • _handle_responses: when _response_store.get(previous_response_id) returns None and the request used conversation=<slug>, recover session_id via get_conversation_session_id(slug) and continue. Without a slug, the original 404 path is preserved — there is genuinely no recovery context.

Contract Protected

  • Invariant: for any conversation slug whose row was ever written, the latest associated session_id is recoverable independently of whether the chained response row still exists.
  • Known-bad inputs: evicted response_id + intact slug, gateway-restart cycle that leaves the conversations row but loses the responses LRU contents.
  • Future-input coverage: any new code path that writes the conversations row inherits the session_id requirement via the updated method signature.
  • Negative case: previous_response_id with no slug still 404s — the fallback only fires when there is a slug to carry continuity.

Test plan

  • Focused regression test: tests/gateway/test_api_server.py::TestConversationParameter::test_conversation_continues_session_after_response_eviction — turn 1 establishes the slug, the response row is then deleted (LRU proxy), turn 2 with the same slug reuses the same session_id.
  • New invariant test: test_conversation_session_persisted_on_set. set_conversation writes session_id, get_conversation_session_id reads it back, and get_conversation still returns the response_id (backward-compatible).
  • Negative test: test_previous_response_id_still_404s_without_conversation — raw previous_response_id with no slug still 404s.
  • Regression guard: with the production fix reverted, the new tests fail (AttributeError: 'ResponseStore' object has no attribute 'get_conversation_session_id' + session_id assertion mismatch). With the fix, all 122 tests in tests/gateway/test_api_server.py pass.

Related

  • Fixes #16517

Changed files

  • gateway/platforms/api_server.py (modified, +79/-14)
  • tests/gateway/test_api_server.py (modified, +99/-0)
Original issue text

Summary

When a client POSTs /v1/responses with a conversation=<slug> body field but the gateway's in-memory _response_store no longer contains the previous response_id (gateway restart, LRU eviction, process recycle), hermes mints a new state.db session_id rather than continuing the existing one for that conversation. The result is N "fork" rows in sessions per logical conversation, with messages distributed across them, no parent_session_id linkage between them.

Reproduce

  1. POST /v1/responses with {conversation: "my-conv-foo", input: "hi"}. Hermes creates session A.
  2. systemctl restart hermes-gateway (or wait for _response_store LRU to evict the response_id of the last turn).
  3. POST /v1/responses with {conversation: "my-conv-foo", input: "still here?"}. Hermes creates session B, no parent_session_id link to A.
  4. Repeat through several restart cycles → N independent root sessions all keyed to "my-conv-foo".

(Reference: api_server.py:1694 — when previous_response_id is unresolvable, session creation falls through to a new session_id even though the conversation slug → session mapping is queryable in the conversations table.)
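
The two POSTs in the steps above could be scripted as follows. This sketch only builds the requests; the base URL is a placeholder (the issue does not give the gateway's host or port), `build_turn` is a hypothetical helper, and the response shape is not assumed. Send each request with `urllib.request.urlopen`, restart the gateway between them, then compare the rows in the sessions table of state.db.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # placeholder; adjust to your gateway


def build_turn(base_url: str, slug: str, text: str) -> urllib.request.Request:
    """Build a POST /v1/responses request keyed by a conversation slug."""
    body = json.dumps({"conversation": slug, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/responses",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


first_turn = build_turn(BASE_URL, "my-conv-foo", "hi")
# ... restart hermes-gateway here, or wait for LRU eviction ...
second_turn = build_turn(BASE_URL, "my-conv-foo", "still here?")
```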

Why it matters

  • Sidebar UIs that key on session_id show the same logical conversation as N rows.
  • Compression-rotation forks (run_agent.py:7580) DO set parent_session_id so they can be walked. Cold-start forks are the gap — same fork pathology, no link.
  • /api/responses already owns the slug→sessions map (via conversations.response_id → responses.data.session_id). The information needed to continue the existing session exists; it just isn't consulted on the fallback path.

Possible fixes (we'd love guidance — there may be a deliberate reason)

  1. On fallback: when previous_response_id is unresolvable, look up the latest session_id for conversation=<slug> via the conversations table and reuse it. Keeps conversations as one session row.
  2. Set parent_session_id: cheapest. New session_id, but linked to the previous one for the same slug. Existing CTE walkers (drawer dedup, search) handle the chain automatically.
  3. Document as intentional: if there's a reason cold-start should mint fresh (security boundary? state isolation?), document so client UIs know to dedup by slug themselves.
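
For context on option 2, a parent_session_id chain is typically walked with a recursive CTE. The sketch below assumes a minimal `sessions(id, parent_session_id)` table and an acyclic chain; the real state.db schema and the existing walkers are not shown in this issue, so `root_session` is a hypothetical illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real sessions table has more columns.
conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, parent_session_id TEXT)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("A", None), ("B", "A"), ("C", "B"), ("X", None)],
)


def root_session(conn, session_id):
    """Follow parent_session_id links upward and return the root id."""
    row = conn.execute(
        """
        WITH RECURSIVE chain(id, parent_session_id, depth) AS (
            SELECT id, parent_session_id, 0 FROM sessions WHERE id = ?
            UNION ALL
            SELECT s.id, s.parent_session_id, chain.depth + 1
            FROM sessions AS s
            JOIN chain ON s.id = chain.parent_session_id
        )
        SELECT id FROM chain ORDER BY depth DESC LIMIT 1
        """,
        (session_id,),
    ).fetchone()
    return row[0] if row else None


root_of_c = root_session(conn, "C")
root_of_x = root_session(conn, "X")
root_of_missing = root_session(conn, "missing")
```

With linked forks, `C` resolves to `A`. Cold-start forks have no link, so each one looks like `X`: its own root, which is why a walker of this kind cannot group them.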

What we're doing in our client

We've added a JS-side merge in our proxy that collapses session rows sharing a sidekick-* slug, plus a cascade-delete that fans out to all forks. Works fine, but feels like we're papering over a backend invariant that should hold. Hence the issue.
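
The shape of that client-side merge can be sketched in a few lines. The real workaround is written in JavaScript and is not shown in the issue; the Python below, including `collapse_forks` and the row fields `slug`, `session_id`, `started_at`, and `message_count`, is a hypothetical illustration.

```python
from collections import defaultdict


def collapse_forks(rows):
    """Group session rows by slug so each conversation appears once."""
    by_slug = defaultdict(list)
    for row in rows:
        by_slug[row["slug"]].append(row)

    merged = []
    for slug, forks in by_slug.items():
        forks.sort(key=lambda r: r["started_at"])
        merged.append(
            {
                "slug": slug,
                # A cascade-delete would fan out to every id in this list.
                "session_ids": [r["session_id"] for r in forks],
                "message_count": sum(r["message_count"] for r in forks),
            }
        )
    return merged


rows = [
    {"slug": "sidekick-foo", "session_id": "B", "started_at": 2, "message_count": 3},
    {"slug": "sidekick-foo", "session_id": "A", "started_at": 1, "message_count": 2},
    {"slug": "sidekick-bar", "session_id": "C", "started_at": 5, "message_count": 1},
]
merged = collapse_forks(rows)
```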

Versions / context

  • hermes-agent v0.11.0
  • Sidekick PWA + audio bridge talking to local hermes-gateway via Tailscale
  • See related: #6507 (child-session search drops), #11793 (CLI resume)

extent analysis

TL;DR

The most likely fix is to modify the fallback path in api_server.py to look up the latest session_id for a conversation slug and reuse it when previous_response_id is unresolvable.

Guidance

  • Inspect api_server.py:1694 to see why session creation falls through to a new session_id when previous_response_id is unresolvable.
  • Consider implementing the suggested fix of looking up the latest session_id for a conversation slug via the conversations table and reusing it to maintain a single session row per conversation.
  • Alternatively, setting parent_session_id for new sessions could provide a link to the previous session for the same slug, allowing existing CTE walkers to handle the chain.
  • Review the code to determine if there's a deliberate reason for the current behavior, and document it if so.

Example

No code snippet is provided as the issue does not contain sufficient information to generate a specific example.

Notes

The issue centers on how api_server.py handles previous_response_id and conversation slugs on the cold-start path. Both linked PRs (#16607 and #16635) address it by persisting session_id on the conversations row.

Recommendation

Apply the fix: on cold start, look up the latest session_id for the conversation slug and reuse it, so that each logical conversation keeps a single session row. This is the most direct of the three options listed in the issue.
