openclaw - ✅(Solved) Fix Gateway session-resume: orphaned LCM bindings + stuck-session deadlock (regression since post-Apr 29 update) [3 pull requests, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#77871 (fetched 2026-05-06 06:19:59)
Comments: 1 · Participants: 2 · Timeline: 9 · Reactions: 2
Timeline (top): referenced ×5, cross-referenced ×3, commented ×1

Two related defects in the gateway's session-resume path. Worked correctly until ~Apr 29, 2026; broke after a subsequent OpenClaw update.

Net symptom: chat sessions silently lose continuity. The model appears to "have context" on the first turn after a restore (because lcm-files history is re-injected at bootstrap) but every subsequent turn is treated as first-turn fresh, and nothing persists.

Fix Action

Fixed

PR fix notes

PR #77891: fix(sessions): unbind conversation bindings when missing transcripts are pruned

Description (problem / solution / changelog)

Summary

  • Problem: When sessions cleanup --fix-missing removes a session store entry because its transcript file is missing, the matching conversation binding in current-conversations.json is left intact. Subsequent messages resolve this stale binding and are routed to the pruned session key. The gateway finds no durable session there, silently spawns a fresh in-memory session per message, and discards every turn — giving the appearance of correct operation while persisting nothing.
  • Root Cause: pruneMissingTranscriptEntries in src/config/sessions/cleanup-service.ts deletes session store entries when their transcript files are absent, but never calls getSessionBindingService().unbind() for those keys. The explicit session-reset path (emitSessionUnboundLifecycleEvent in session-reset-service.ts) correctly unbinds on reset and delete; the cleanup codepath omitted this step entirely.
  • Fix: After updateSessionStore commits the missing-entry prune, iterate over the collected pruned session keys and call getSessionBindingService().unbind({ targetSessionKey, reason: "cleanup-missing-transcript" }) for each. This mirrors the lifecycle contract already enforced by the reset and delete paths. Additionally, resolveRuntimeConversationBindingRoute in src/channels/plugins/binding-routing.ts now accepts an optional targetSessionExists predicate; when provided and the target session is absent, the stale binding is dropped and unbound immediately at route-resolution time rather than routing to a phantom session — providing defense-in-depth for gateway callers that have session-store visibility.
  • What changed:
    • src/config/sessions/cleanup-service.ts: import getSessionBindingService; collect pruned session keys via the onPruned callback already present on pruneMissingTranscriptEntries; call unbind for each key after the store update commits.
    • src/channels/plugins/binding-routing.ts: reorder plugin-owned binding check before the session-existence check to preserve its existing touch-and-return behavior; add optional targetSessionExists predicate to resolveRuntimeConversationBindingRoute; when provided and the target session is missing, log a verbose warning, fire-and-forget unbind, and return the default route.
    • src/channels/plugins/binding-routing.test.ts: add three new cases covering stale-binding drop with unbind, plugin-owned binding exemption from the session check, and successful route rewrite when the session exists.
    • src/config/sessions/cleanup-service.test.ts (new): five cases covering single-key unbind, multi-key unbind, dry-run no-op, fixMissing: false no-op, and existing-transcript no-op.
    • CHANGELOG.md: entry under ## Unreleased.
  • What did NOT change (scope boundary): The preview/dry-run path in runSessionsCleanup is unchanged — it still only operates on a cloned store with no side-effects. No changes to session-reset-service.ts, session-binding-service.ts, current-conversation-bindings.ts, or any extension code. The targetSessionExists parameter is optional and backward-compatible; existing callers require no changes.
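The cleanup-side fix described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `SessionStore` and `BindingService` shapes, the `cleanupMissingTranscripts` wrapper, and the `transcriptExists` predicate are assumptions made for the example, and `unbind` is kept synchronous for brevity even though the PR describes the real call as async. Only the `onPruned` callback idea and the `"cleanup-missing-transcript"` reason come from the PR text.

```typescript
// Hypothetical, simplified shapes; the real OpenClaw types are not shown in the PR text.
type SessionEntry = { sessionFile: string };
type SessionStore = { [sessionKey: string]: SessionEntry };

interface BindingService {
  unbind(args: { targetSessionKey: string; reason: string }): void;
}

// Deletes entries whose transcript is missing and reports each pruned key.
function pruneMissingTranscriptEntries(
  store: SessionStore,
  transcriptExists: (file: string) => boolean,
  onPruned: (sessionKey: string) => void,
): void {
  for (const key of Object.keys(store)) {
    if (!transcriptExists(store[key].sessionFile)) {
      delete store[key];
      onPruned(key);
    }
  }
}

// Hypothetical wrapper: prune first, then unbind every pruned key.
function cleanupMissingTranscripts(
  store: SessionStore,
  transcriptExists: (file: string) => boolean,
  bindings: BindingService,
  dryRun: boolean,
): string[] {
  const prunedKeys: string[] = [];
  // A dry run works on a shallow copy, so the real store keeps all its keys.
  const target = dryRun ? { ...store } : store;
  pruneMissingTranscriptEntries(target, transcriptExists, (key) => prunedKeys.push(key));
  if (!dryRun) {
    // An empty prunedKeys array makes this loop a no-op.
    for (const targetSessionKey of prunedKeys) {
      bindings.unbind({ targetSessionKey, reason: "cleanup-missing-transcript" });
    }
  }
  return prunedKeys;
}
```

Unbinding only after the prune has been applied keeps the dry-run path free of side effects, which matches the scope boundary stated in the PR.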

Reproduction

  1. Start the gateway with a Discord channel bound to a persistent session (agent:review:acp:session-1).
  2. Stop the gateway; manually delete the session transcript file from the state directory while leaving the session store entry and the current-conversations.json binding intact.
  3. Run openclaw sessions cleanup --fix-missing (or trigger automatic maintenance). The session store entry is removed.
  4. Restart the gateway and send a Discord message. Without this fix, every message routes to the deleted session key, spawns an ephemeral in-memory session, and produces zero persistence — the model behaves as if each turn is the first.
  5. With this fix, the binding is cleared by step 3, so the gateway falls through to a fresh persisted session on the next message.

Risk / Mitigation

  • Risk: Calling getSessionBindingService().unbind() in the cleanup path is an additional async operation per pruned session. In practice, missing-transcript entries are rare edge cases triggered only with --fix-missing; the loop is bounded by the number of pruned entries.
  • Mitigation: The unbind call is guarded by prunedMissingKeys.length > 0 implicitly (the for...of loop does nothing when the array is empty). The unbind implementation is the same service already called by session reset and delete paths; no new failure modes are introduced. Five targeted tests cover the apply path, dry-run, and no-op scenarios.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway
  • Sessions / cleanup
  • Channel binding routing
  • Session binding service

Linked Issue/PR

Refs #77871

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/channels/plugins/binding-routing.test.ts (modified, +99/-0)
  • src/channels/plugins/binding-routing.ts (modified, +28/-1)
  • src/config/sessions/cleanup-service.test.ts (added, +237/-0)
  • src/config/sessions/cleanup-service.ts (modified, +16/-0)

PR #328: 🦅 Scout: Critical Inherited Defect Report - 2026-05-18

Description (problem / solution / changelog)

🦅 Scout: Critical Inherited Defect Report - 2026-05-18

  • Upstream Issue #71234: Gateway OOM crash: sessions.json (31MB / 1,407 sessions) causes heap exhaustion during every sessions.list/chat.history poll

    • Location in our code: src/gateway/session-utils.ts in loadCombinedSessionStoreForGateway
    • Observed Behavior: The function unconditionally reads and deserializes the entire sessions.json index into an unbounded in-memory object graph on every poll, causing severe V8 heap exhaustion and fatal OOM crashes on systems with many sessions.
    • Expected Behavior: The function should use pagination, lazy-loading, or caching so that it does not load the entire unbounded session history into memory simultaneously.
    • Impact Severity: CRITICAL (Causes immediate process crash under moderate load)
  • Upstream Issue #70559: runUnsafeReindex crashes with "no such table: chunks_vec" when sqlite-vec is enabled

    • Location in our code: src/memory/manager-sync-ops.ts in dropVectorTable and resetIndex
    • Observed Behavior: During an unsafe reindex (triggered by OPENCLAW_TEST_FAST=1), dropping the virtual vector table fails to clear prepared statements that still reference chunks_vec, which then immediately crash the reindex when executed.
    • Expected Behavior: The function should row-delete from the virtual vector table when preserving the schema instead of dropping it out from under the active prepared statements.
    • Impact Severity: HIGH (Breaks vector memory reindexing entirely on Windows and during test sequences)
  • Upstream Issue #77871: Gateway session-resume: orphaned LCM bindings + stuck-session deadlock

    • Location in our code: src/logging/diagnostic.ts around stuckSessionWarnMs
    • Observed Behavior: The runtime diagnostic subsystem identifies and logs stuck sessions via stuckSessionWarnMs but lacks an automated recovery mechanism to abort them, causing the gateway process to hang indefinitely.
    • Expected Behavior: The diagnostic system should escalate stuck-session warnings to action (e.g., abort, queue drain) after N intervals to prevent deadlocks.
    • Impact Severity: HIGH (Indefinite hanging during certain operations leading to memory exhaustion)
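Of the options listed for Upstream Issue #71234 (pagination, lazy-loading, or caching), the caching one can be sketched as below. This is an illustration under assumptions, not the project's code: `StoreSource` and `createCachedStoreLoader` are hypothetical names, and the file is abstracted behind two callbacks so the example does not depend on `node:fs`. Note that a cache of this kind only avoids re-reading and re-parsing on every poll; it does not lower the peak memory of a single parse.

```typescript
// Hypothetical file abstraction so the sketch stays independent of node:fs.
interface StoreSource {
  mtimeMs(): number; // last-modified time of the index file
  readText(): string; // full file contents
}

// Returns a loader that re-parses only when the modification time changes.
function createCachedStoreLoader<T>(
  source: StoreSource,
  parse: (text: string) => T,
): () => T {
  let cached: { mtimeMs: number; value: T } | undefined;
  return () => {
    const mtimeMs = source.mtimeMs();
    if (cached && cached.mtimeMs === mtimeMs) {
      return cached.value; // unchanged on disk: skip the read and the parse
    }
    const value = parse(source.readText());
    cached = { mtimeMs, value };
    return value;
  };
}
```

In a real gateway the two callbacks would presumably wrap a stat call and a file read on sessions.json; that wiring is left out because the source does not show it.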

PR created automatically by Jules for task 5473837547395383246 started by @MillionthOdin16


PR #78036: Drop stale conversation session bindings

Description (problem / solution / changelog)

Fixes #77871.

Summary

  • validate runtime conversation binding targets before touching or rewriting routes
  • unbind stale runtime targets and fall back to the original route
  • unbind conversation bindings for session keys removed by applied session cleanup
  • keep dry-run cleanup read-only
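The first two bullets (validate the binding target, then unbind and fall back when it is stale) can be sketched roughly as below. All names here are illustrative assumptions: `resolveBindingRoute`, `RouteDeps`, `lookupBinding`, and the `"stale-target"` reason are hypothetical, and the optional `targetSessionExists` predicate is borrowed from the description of PR #77891 rather than confirmed for this PR. The sketch is synchronous, while the real unbind may well be async.

```typescript
type Route = { sessionKey: string };
type Binding = { targetSessionKey: string };

// Hypothetical dependency bundle; the real service interfaces are not shown.
interface RouteDeps {
  lookupBinding(conversationId: string): Binding | undefined;
  unbind(targetSessionKey: string, reason: string): void;
  // Optional so callers without session-store visibility keep the old behavior.
  targetSessionExists?: (sessionKey: string) => boolean;
}

function resolveBindingRoute(
  conversationId: string,
  originalRoute: Route,
  deps: RouteDeps,
): Route {
  const binding = deps.lookupBinding(conversationId);
  if (!binding) {
    return originalRoute; // nothing bound: keep the original route
  }
  const exists = deps.targetSessionExists;
  if (exists && !exists(binding.targetSessionKey)) {
    // Stale binding: drop it and fall back instead of routing to a phantom session.
    deps.unbind(binding.targetSessionKey, "stale-target");
    return originalRoute;
  }
  // Target is valid (or cannot be checked): rewrite the route to the bound session.
  return { sessionKey: binding.targetSessionKey };
}
```

Checking the target before rewriting the route means a stale binding is healed on the first message that hits it, rather than only when a cleanup job runs.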

Verification

  • PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/test-projects.mjs src/channels/plugins/binding-routing.test.ts src/config/sessions/store.pruning.integration.test.ts src/infra/outbound/session-binding-service.test.ts src/commands/sessions-cleanup.test.ts --maxWorkers=1
  • git diff --check

Known unrelated blocker

  • PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/check-changed.mjs currently fails in the extension typecheck lane on extensions/codex/src/app-server/run-attempt.ts(19,3): Module "openclaw/plugin-sdk/agent-harness-runtime" has no exported member "resolveAgentDir". This branch does not touch Codex harness exports.

Changed files

  • src/channels/plugins/binding-routing.test.ts (modified, +36/-1)
  • src/channels/plugins/binding-routing.ts (modified, +86/-2)
  • src/config/sessions/cleanup-service.ts (modified, +27/-0)
  • src/config/sessions/store.pruning.integration.test.ts (modified, +53/-0)

Summary

Two related defects in the gateway's session-resume path. Worked correctly until ~Apr 29, 2026; broke after a subsequent OpenClaw update.

Net symptom: chat sessions silently lose continuity. The model appears to "have context" on the first turn after a restore (because lcm-files history is re-injected at bootstrap) but every subsequent turn is treated as first-turn fresh, and nothing persists.

Bug 1 — No reconciliation between LCM binding and on-disk session file

When the gateway resolves agent:main:<channel>:channel:<id> and the bound sessionId points to a session file that no longer exists on disk, it does not heal the binding. Instead it silently spawns an ephemeral in-memory session per message.

Consequences:

  • Every turn looks "first-turn fresh" to the model.
  • Writes target a phantom path (e.g. d1565397-….jsonl) that is never created, so nothing is persisted.
  • The illusion of restored context only holds for the bootstrap turn (lcm-files history dump); turn 2 onward has no continuity.

Expected behavior — one of:

  • (a) Auto-clear the stale binding and spawn a fresh persisted session, OR
  • (b) Restore the lcm-files history dump into a new session file and rebind to it.

Bug 2 — Stuck-session deadlock with no escalation

On Apr 30, a session sat in state=processing age=875s queueDepth=1 for ~14 minutes while new messages queued behind it.

The watchdog detected it ([diagnostic] stuck session… log lines were emitted continuously) but did not act:

  • No auto-abort
  • No queue drain
  • No recovery

The session eventually transitioned to failed, the file was later pruned, but the LCM binding remained — which is how Bug 1 gets triggered downstream.

Suggested fixes

  1. Validate sessionFile existence when resolving an LCM binding. Missing file → drop binding, emit warning, fall through to fresh-session path.
  2. Escalate stuck-session diagnostics to action after N intervals (e.g. 5 min). Today they log forever with no remediation.
  3. Session-file cleanup must also clear matching LCM bindings. Orphaned pointers should be impossible by construction — whatever job pruned the dead 4a3674de-….jsonl must invalidate the binding that still points to it.
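Suggested fix 2 (escalating from logging to action after N intervals) could look something like the sketch below. It is a hypothetical decision function, not the gateway's watchdog: `evaluateStuckSession`, `escalateAfterWarnings`, and the action names are assumptions, and only `stuckSessionWarnMs` is a name that appears in the linked PR notes. The thresholds used in any real deployment are not stated in the issue.

```typescript
type SessionState = "idle" | "processing" | "failed";

interface TrackedSession {
  key: string;
  state: SessionState;
  processingSinceMs: number; // timestamp when the current turn started
  queueDepth: number;
}

interface WatchdogOptions {
  stuckSessionWarnMs: number; // age at which a session counts as stuck
  escalateAfterWarnings: number; // hypothetical: warnings allowed before acting
}

type WatchdogAction =
  | { kind: "none" }
  | { kind: "warn"; ageMs: number }
  | { kind: "abort-and-drain"; ageMs: number };

// Pure decision step: the caller does the logging, aborting, and queue draining.
function evaluateStuckSession(
  session: TrackedSession,
  nowMs: number,
  warningsSoFar: number,
  opts: WatchdogOptions,
): WatchdogAction {
  if (session.state !== "processing") {
    return { kind: "none" };
  }
  const ageMs = nowMs - session.processingSinceMs;
  if (ageMs < opts.stuckSessionWarnMs) {
    return { kind: "none" };
  }
  // Counting this tick, has the warning budget been used up?
  if (warningsSoFar + 1 >= opts.escalateAfterWarnings) {
    return { kind: "abort-and-drain", ageMs };
  }
  return { kind: "warn", ageMs };
}
```

Keeping the decision separate from the side effects makes the escalation rule easy to test, and the abort handler is a natural place to also clear the session's binding so that fix 3 holds for sessions that fail this way.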

Regression window

  • Last known good: Apr 29, 2026
  • First observed broken: Apr 30, 2026 (14-minute deadlock)
  • Trigger: Recent OpenClaw update in that window

Environment

  • Host: Mac mini (Darwin 24.6.0)
  • Node: v22.22.0
  • Channels affected: at least Discord-bound main session; likely any persisted-session channel

extent analysis

TL;DR

Validate session file existence when resolving LCM bindings and escalate stuck-session diagnostics to action to prevent silent session loss and deadlocks.

Guidance

  • Validate sessionFile existence when resolving an LCM binding to prevent ephemeral in-memory sessions and ensure session continuity.
  • Escalate stuck-session diagnostics to action after a specified interval (e.g., 5 minutes) to prevent deadlocks and queue buildup.
  • Ensure session-file cleanup also clears matching LCM bindings to prevent orphaned pointers and related issues.
  • Review the recent OpenClaw update for potential changes that may have triggered these issues.

Example

No code snippet is provided as the issue does not contain sufficient code details.

Notes

The provided guidance is based on the information given in the issue and may not cover all possible scenarios or edge cases. Additional testing and verification may be necessary to ensure the suggested fixes resolve the issues.

Recommendation

Apply the suggested workarounds, specifically validating session file existence and escalating stuck-session diagnostics, as these directly address the identified bugs and can help restore session continuity and prevent deadlocks.
