openclaw - 💡(How to fix) Fix prepare.runtime invalidates live claude-cli sessions as missing-transcript, causing mid-thread context loss [2 comments, 2 participants]

jnard0ne · 2026-05-06T14:04:15Z

Bug

When OpenClaw starts a new turn, prepare.runtime probes for the bound CLI session's transcript file on disk. If the file is absent or contains no assistant-role records yet, the runtime calls clearCliSessionInStore and spawns a fresh claude-cli session — discarding all prior conversation context before the turn even starts.

This silently resets the AI's memory mid-thread, with no warning to the user.

Reproduction

The most reliable trigger:

1. Cause any claude-cli turn to fail before it produces its first assistant message (e.g. hit a usage-limit 400, or kill the process mid-turn).
2. Wait for the process to recover, then send a follow-up message.
3. The runtime emits "cli session reset: provider=claude-cli reason=missing-transcript", and the next turn runs in a brand-new session with zero prior context.

It also fires spontaneously during normal use, with no preceding failure: the gateway log shows a dozen or more missing-transcript resets per day, on both trigger=user and trigger=heartbeat turns.

Log evidence (gateway.log excerpt)

06:32:00  cli exec: … trigger=user useResume=true session=present reuse=reusable
06:32:08  claude live session turn: durationMs=7873          ← turn 1 finishes OK
06:32:48  cli session reset: provider=claude-cli reason=missing-transcript   ← BUG
06:32:48  cli exec: … trigger=user useResume=false session=none reuse=invalidated:missing-transcript
06:32:48  claude live session close: reason=restart
06:32:48  claude live session start                           ← fresh session, no context
06:32:57  claude live session turn: durationMs=8867

Turn 2 arrives 40 seconds after turn 1 completes (06:32:08 to 06:32:48), well past any plausible race with the transcript flush.
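To measure how often this happens, and how long after the last completed turn each reset fires, a small scanner over gateway.log can help. This is a sketch under stated assumptions: every line is taken to start with an HH:MM:SS timestamp, the matched message formats are copied from the excerpt above, the function names are hypothetical, and timestamps that wrap past midnight are not handled.

```javascript
// Hypothetical gateway.log scanner. Assumes each line starts with an
// HH:MM:SS timestamp and uses the message formats from the excerpt above.
// Timestamps that wrap past midnight are not handled.
function toSeconds(hhmmss) {
  const [h, m, s] = hhmmss.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

function scanGatewayLog(logText) {
  const byTrigger = {}; // trigger value -> number of missing-transcript invalidations
  const gapsSec = []; // seconds from the last completed turn to each reset
  let lastTurnEnd = null;
  for (const line of logText.split("\n")) {
    const ts = line.match(/^(\d{2}:\d{2}:\d{2})/);
    if (!ts) continue;
    const t = toSeconds(ts[1]);
    if (line.includes("claude live session turn:")) {
      lastTurnEnd = t;
    } else if (line.includes("cli session reset:") && line.includes("reason=missing-transcript")) {
      if (lastTurnEnd !== null) gapsSec.push(t - lastTurnEnd);
    } else if (line.includes("cli exec:") && line.includes("reuse=invalidated:missing-transcript")) {
      // The "cli exec" line after a reset carries the trigger (user, heartbeat, ...).
      const trig = line.match(/trigger=(\S+)/);
      const key = trig ? trig[1] : "unknown";
      byTrigger[key] = (byTrigger[key] || 0) + 1;
    }
  }
  return { byTrigger, gapsSec };
}
```

Fed the seven-line excerpt above, it returns { byTrigger: { user: 1 }, gapsSec: [40] }: one user-triggered invalidation, 40 seconds after the last completed turn.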

Root cause

File: dist/prepare.runtime-DbLtKpjH.js:719
Helper: dist/attempt-execution.helpers-D6YqcgiA.js:16–69
Destructive branch: dist/attempt-execution-C4Upu-4a.js:247–256

The probe (claudeCliSessionTranscriptHasContent) scans ~/.claude/projects/<dir>/<sessionId>.jsonl for any record where obj.message.role === "assistant". It returns false if:

The file doesn't exist at all (binding desync — OpenClaw store points at a session ID that differs from the live claude-cli session).
The file exists but the session errored before its first assistant output (e.g., after a 400 response).
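The probe can fire in the brief window before the latest transcript records are flushed to disk.

In the desync case, the OpenClaw-store binding's sessionId does not match the session ID the live claude-cli process is actually using. The probe checks the wrong file, gets a miss, and hard-resets, even though the live session is perfectly healthy.

Whenever the probe returns false, clearCliSessionInStore is called before the new run starts, so there is no recovery path.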
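The way all three cases collapse into a single false can be seen in a reconstruction of the probe from the description above. This is not the code in dist/attempt-execution.helpers-D6YqcgiA.js: the name transcriptHasAssistantRecord is hypothetical, and the function takes the file contents as a string (or null when the file is absent) so the cases are easy to compare side by side.

```javascript
// Reconstruction of the probe's behaviour as described in this issue;
// NOT the compiled code in dist/. `jsonlText` is the file contents, or null
// when ~/.claude/projects/<dir>/<sessionId>.jsonl does not exist.
function transcriptHasAssistantRecord(jsonlText) {
  if (jsonlText === null) return false; // case 1: file missing (binding desync)
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue;
    let obj;
    try {
      obj = JSON.parse(line);
    } catch {
      continue; // in this sketch, an unparseable (e.g. half-written) line is ignored
    }
    if (obj && obj.message && obj.message.role === "assistant") return true;
  }
  // case 2 (errored before the first reply) and case 3 (not flushed yet) end here
  return false;
}
```

A missing file (null), a transcript holding only a user record, and an empty file all return false, and the caller has no way to tell which of the three cases it hit.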

What makes this worse

buildClaudeCliFallbackContextPrelude already exists in attempt-execution.helpers for failover paths. It is not invoked on missing-transcript resets, so even when OpenClaw holds a complete unified history it could replay into the new session, that history is discarded.
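For illustration, a context prelude amounts to rendering the unified history as text that the new session sees before the user's next message. The function below is a hypothetical stand-in: the real buildClaudeCliFallbackContextPrelude's signature, inputs, and output format are not shown in this issue, and the history shape ({ role, text } entries), header wording, and character budget are all assumptions.

```javascript
// Hypothetical stand-in for buildClaudeCliFallbackContextPrelude; the real
// helper's signature and output are not shown in the issue.
// `history` is an array of { role: "user" | "assistant", text: string }.
function buildContextPrelude(history, maxChars = 4000) {
  const lines = [];
  let used = 0;
  // Walk backwards so the most recent turns survive when the budget is tight.
  for (let i = history.length - 1; i >= 0; i--) {
    const entry = `${history[i].role}: ${history[i].text}`;
    if (used + entry.length > maxChars) break;
    lines.unshift(entry); // restore oldest-first order
    used += entry.length;
  }
  if (lines.length === 0) return "";
  return ["The previous session was reset. Prior conversation, oldest first:", ...lines].join("\n");
}
```

Keeping the newest turns when the budget runs out is a design choice here, on the assumption that recent context matters most for continuing a thread.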

The setCliSessionBinding call in attempt-execution-C4Upu-4a.js:306 fires only on the FailoverError path, not on normal session rotation, so the stored binding can drift silently each time a new CLI session starts.

Proposed fix

Short-term (most defensive):

1. Distinguish "file not found" from "file has no assistant records." If the transcript file simply doesn't exist, treat the binding as unknown rather than missing: keep it, let claude-cli attempt a resume, and let claude-cli reject the resume if the session is genuinely dead.
2. When a reset is unavoidable, invoke buildClaudeCliFallbackContextPrelude (already imported in prepare.runtime) to feed the prior context to the new session. This infrastructure exists; it's just not wired to this code path.
3. Log the stored sessionId alongside the actual running session ID whenever they diverge. This is the diagnostic gap that makes the bug hard to catch.
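The three short-term steps can be combined into one decision function. This sketch rests on several assumptions: that prepare.runtime can learn the live claude-cli session ID, that the transcript is passed in as a string or null, and that the names classifyTranscript, renderPrelude, and decideResume, as well as the mismatch log text, are purely hypothetical. renderPrelude is a trivial join standing in for buildClaudeCliFallbackContextPrelude.

```javascript
// Hypothetical tri-state probe: separates "no file" from "file without replies".
// `jsonlText` is the transcript contents, or null when the file does not exist.
function classifyTranscript(jsonlText) {
  if (jsonlText === null) return "unknown"; // no file: the binding may just be stale
  for (const line of jsonlText.split("\n")) {
    try {
      const obj = JSON.parse(line);
      if (obj && obj.message && obj.message.role === "assistant") return "has-content";
    } catch {
      // blank or partially flushed line: skip it
    }
  }
  return "no-assistant";
}

// Trivial stand-in for buildClaudeCliFallbackContextPrelude (real signature unknown).
function renderPrelude(history) {
  return history.map((h) => `${h.role}: ${h.text}`).join("\n");
}

// Hypothetical decision for prepare.runtime following the three short-term steps.
function decideResume({ storedSessionId, liveSessionId, jsonlText, history }) {
  const log = [];
  // Step 3: surface a desynced binding instead of letting the probe hide it
  // (the log message text is hypothetical).
  if (liveSessionId && liveSessionId !== storedSessionId) {
    log.push(`cli session binding mismatch: stored=${storedSessionId} live=${liveSessionId}`);
  }
  const state = classifyTranscript(jsonlText);
  // Step 1: "unknown" (no file) keeps the binding; claude-cli accepts or rejects the resume.
  if (state !== "no-assistant") {
    return { useResume: true, sessionId: storedSessionId, prelude: "", log };
  }
  // Step 2: a reset is unavoidable, so carry the prior context into the new session.
  return { useResume: false, sessionId: null, prelude: renderPrelude(history), log };
}
```

Under this sketch the failed-turn reproduction (file present, no assistant record) still resets, but with the prior context attached, while the desync case keeps its binding and leaves the final verdict to claude-cli.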

Medium-term:

Reverse the order in the destructive branch: only call clearCliSessionInStore after a successful new turn confirms the old binding is dead, not before.
Fire setCliSessionBinding whenever runCliAgent returns a cliSessionBinding whose sessionId differs from the stored one, not only on FailoverError.
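Both medium-term changes can be sketched against an in-memory store: the old binding stays in place while the fresh turn runs, and the store is cleared and rebound only once that turn succeeds with a different sessionId. BindingStore, rotateSession, and the { ok, cliSessionBinding } result shape are hypothetical; the turn runner is synchronous here for brevity, whereas a real runCliAgent call would presumably be awaited.

```javascript
// Hypothetical in-memory stand-in for the OpenClaw session store.
class BindingStore {
  constructor() {
    this.bindings = new Map();
  }
  get(key) {
    return this.bindings.get(key) ?? null;
  }
  set(key, binding) {
    this.bindings.set(key, binding);
  }
  clear(key) {
    this.bindings.delete(key);
  }
}

// Sketch of the medium-term order: run the new turn first, touch the store after.
// `runTurn` is hypothetical and returns { ok, cliSessionBinding }.
function rotateSession(store, key, runTurn) {
  const previous = store.get(key);
  const result = runTurn(); // the old binding is still intact while this runs
  if (!result.ok) return previous; // nothing was cleared, so a later resume is still possible
  const next = result.cliSessionBinding;
  // Rebind on ANY sessionId change, not only on the FailoverError path.
  if (next && (!previous || previous.sessionId !== next.sessionId)) {
    store.clear(key); // the successful new turn has superseded the old binding
    store.set(key, next);
  }
  return store.get(key);
}
```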

Long-term:

Stop using JSONL assistant-record presence as a liveness probe. Either query claude-cli directly, or maintain an OpenClaw-side liveness flag that is set when the runtime receives the first assistant chunk over the live channel (the claude live session turn events already carry this information).
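An OpenClaw-side liveness flag could look like the tracker below, driven by live-channel events rather than by reading JSONL from disk. The class and its hook names are hypothetical; the issue only states that the claude live session turn events already carry the needed information. The key property is that a session OpenClaw has never observed reports "unknown" instead of being treated as dead.

```javascript
// Hypothetical OpenClaw-side liveness tracker, driven by live-channel events
// rather than by reading claude-cli's JSONL transcript from disk.
class SessionLiveness {
  constructor() {
    this.sawReply = new Map(); // sessionId -> has an assistant chunk arrived?
  }
  // Hypothetical hook for a "claude live session start" event.
  onSessionStart(sessionId) {
    // A resumed session that already replied keeps its flag.
    if (!this.sawReply.has(sessionId)) this.sawReply.set(sessionId, false);
  }
  // Hypothetical hook for an assistant chunk arriving on the live channel.
  onAssistantChunk(sessionId) {
    this.sawReply.set(sessionId, true);
  }
  // A session OpenClaw never observed is "unknown", never assumed dead.
  status(sessionId) {
    if (!this.sawReply.has(sessionId)) return "unknown";
    return this.sawReply.get(sessionId) ? "live" : "no-reply-yet";
  }
}
```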

What this is NOT

Not a claw-messenger (or other channel plugin) bug — the plugin's inbound routing is correct.
Not a claude-cli bug — given a fresh session ID, claude-cli behaves as expected.
Not a config issue — no gateway setting addresses this; the logic is in compiled runtime code.
