openclaw - ✅(Solved) Fix Gateway update.run can leave half-installed package, killing live session transcripts [2 pull requests, 2 comments, 3 participants]

ChattanoogaDan · 2026-05-04T00:41:22Z

GitHub stats

openclaw/openclaw#77011•Fetched 2026-05-04 04:59:27

Timeline (top)

referenced ×5, commented ×2, cross-referenced ×2, mentioned ×2

A botched update.run (npm upgrade triggered via the gateway) left the package half-installed. The post-restart maintenance step then tried to load a module that no longer existed at the path the new build expected, crashed, and a live Telegram session lost its <sessionId>.jsonl transcript. Every subsequent message to that Telegram chat then hit claude --resume <id> against a missing file, failed in ~400 ms, and surfaced as "Something went wrong while processing your request" to the user.

Error Message

[gateway] request handler failed: Error: Cannot find module
[gateway] shutdown error: ERR_MODULE_NOT_FOUND ... server-close-D1yUo6cN.js
14:05:51 [gateway] shutdown error: ERR_MODULE_NOT_FOUND server-close
14:52:43 claude live session turn failed: provider=claude-cli ... durationMs=300011 error=FailoverError
15:10:22 claude live session turn failed: durationMs=384 error=FailoverError (immediate)

Root Cause

After restart, every Telegram turn fails with FailoverError in ~400 ms (both Opus 4.7 and Sonnet 4.6), because the session's transcript file no longer exists.
openclaw sessions cleanup --enforce --fix-missing confirmed the record pointed at a missing transcript and pruned it.

Fix Action

Workaround

Run openclaw sessions cleanup --enforce --fix-missing
Or schedule it daily via openclaw cron add --system-event 'run: ...'

PR fix notes

PR #77030: fix(cli-runner): drop stale claude-cli sessionId when transcript missing

Repository: openclaw/openclaw
Author: openperf
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/77030

Description (problem / solution / changelog)

Summary

Problem: After a half-installed update.run (#77011), live Telegram-direct sessions stop responding indefinitely. The Claude CLI live session repeatedly invokes claude --resume <stale-sid> against a transcript that no longer exists at ~/.claude/projects/<project>/<sessionId>.jsonl. The first attempt hangs to the 5-minute hard timeout (durationMs=300011 error=FailoverError); every following turn fast-fails (durationMs=384 error=FailoverError) without ever clearing the dead binding. Concretely, src/agents/cli-runner/prepare.ts:259-271 only validates auth-profile / auth-epoch / system-prompt / mcp hashes via resolveCliSessionReuse (src/agents/cli-session.ts:127); it never checks that the transcript file the persisted cliSessionBindings.claude-cli.sessionId points at is still on disk.
Root Cause: A transcript-existence probe (claudeCliSessionTranscriptHasContent in src/agents/command/attempt-execution.helpers.ts:80) is already wired into the user-driven path at src/agents/command/attempt-execution.ts:443-467 to clear stale bindings before invoking runCliAgent. The auto-reply / followup / Telegram-direct path takes a different entrypoint — src/auto-reply/reply/agent-runner-execution.ts:1305 calls runCliAgent directly and does not go through runAgentAttempt, so the existing probe is not consulted on this code path. The session_expired retry inside src/agents/cli-runner.ts:331 does not compensate either, because claude --resume against a missing transcript surfaces as reason=timeout, not reason=session_expired; the retry branch never fires and the persisted cliSessionBinding is never overwritten with a fresh sessionId. Every subsequent turn re-reads the same dead sessionId until the operator explicitly runs openclaw sessions cleanup --enforce --fix-missing. The cron isolated-agent path (src/cron/isolated-agent/run-execution.runtime.ts) is in the same category — also enters runCliAgent directly.
Fix: Push the transcript-existence pre-flight down into prepareCliRunContext, the single common entry point for all CLI-run callers (attempt-execution.ts, agent-runner-execution.ts, cron/isolated-agent/run-execution.runtime.ts). When the candidate sessionId is non-empty, the provider is claude-cli, and the on-disk Claude CLI transcript has no assistant message (i.e. resume target is dead), drop the binding for this turn and return a CliReusableSession with invalidatedReason: "missing-transcript". The current turn then runs as a fresh claude session (no --resume), and the existing post-run flow in src/agents/command/session-store.ts:163-168 writes the brand-new cliSessionBinding back to the session store, replacing the dead one. The loop is broken at the per-turn level — no manual cleanup needed, no schema migration, no protocol bump. The check is gated on isClaudeCliProvider(params.provider) so non-claude providers pay zero cost. The helper is exposed via the existing prepareDeps injection seam so tests stay hermetic and don't touch real ~/.claude/projects/.
What changed:
- src/agents/cli-runner/prepare.ts — import isClaudeCliProvider and claudeCliSessionTranscriptHasContent; expose the latter through prepareDeps for test injection; before computing reusableCliSession, run the claude-cli transcript probe and short-circuit to { invalidatedReason: "missing-transcript" } when the resume target is dead.
- src/agents/cli-runner/types.ts — extend CliReusableSession.invalidatedReason union with the new "missing-transcript" case.
- src/agents/cli-runner/prepare.test.ts — add a vi.mock for the plugin-sdk/anthropic-cli.js facade (so the test runs without bundled-plugin runtime), plus three behavior tests: drop on missing transcript, keep on present transcript, no probe for non-claude providers.
- CHANGELOG.md — single Fixes line under Unreleased referencing the issue.
What did NOT change (scope boundary):
- No protocol changes; update.run orchestration / atomic-swap logic in src/infra/update-runner.ts and src/infra/package-update-steps.ts is untouched (the npm path already stages and atomically swaps).
- No changes to non-claude-cli backends (codex, gemini, etc.) — provider gate prevents cross-provider impact.
- No session-store schema changes; no migration; no doctor changes.
- The existing user-driven check in attempt-execution.ts:443-467 is intentionally left in place. It still has the value of clearing the persisted binding synchronously for that path (so a follow-up read inside the same request sees the cleared entry); this PR's pre-flight is a cross-caller safety net, not a replacement. Both call the same helper, so the two layers cannot diverge.
- The session_expired retry path at cli-runner.ts:331-358 is unchanged. The two failure modes are complementary: session_expired is the clean-error case (Claude returns "Conversation not found"), already handled in-process by the retry; timeout is the silent-hang case (Claude blocks until the per-turn deadline), addressed here at the preparation layer above.
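The pre-flight described under "Fix" can be sketched roughly as follows. This is a minimal illustration, not the merged code: `resolveReusableSession`, `PreflightParams`, and `TranscriptProbe` are hypothetical names, the provider check is reduced to a literal comparison, and the probe is injected synchronously (the real helper most likely does async file I/O).

```typescript
// Hypothetical shapes; the real types live in src/agents/cli-runner/types.ts.
type InvalidatedReason = "missing-transcript";

interface CliReusableSession {
  sessionId?: string;
  invalidatedReason?: InvalidatedReason;
}

// Injected probe standing in for claudeCliSessionTranscriptHasContent.
// Injection keeps tests away from the real ~/.claude/projects/ directory.
type TranscriptProbe = (sessionId: string) => boolean;

interface PreflightParams {
  provider: string;
  candidateSessionId?: string;
  transcriptHasContent: TranscriptProbe;
}

// Assumption: the real isClaudeCliProvider may accept more than this literal.
function isClaudeCliProvider(provider: string): boolean {
  return provider === "claude-cli";
}

function resolveReusableSession(params: PreflightParams): CliReusableSession {
  const sessionId = (params.candidateSessionId || "").trim();
  if (!sessionId) {
    return {}; // nothing to resume, run fresh
  }
  // Only claude-cli pays for the probe; other providers skip it entirely.
  if (isClaudeCliProvider(params.provider) && !params.transcriptHasContent(sessionId)) {
    // Resume target is dead: drop the binding for this turn so the run
    // starts without --resume and the post-run flow can persist a new id.
    return { invalidatedReason: "missing-transcript" };
  }
  return { sessionId };
}
```

Because every caller goes through one function, the three behaviors named in the test list (drop on missing transcript, keep on present transcript, no probe for other providers) fall out of a single branch.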

Reproduction

1. Start the gateway with an active Claude-CLI agent on a Telegram-direct session and let it persist a cliSessionBindings.claude-cli.sessionId (any successful turn does this).
2. Simulate the half-installed update.run outcome by removing the corresponding ~/.claude/projects/<project>/<sessionId>.jsonl file (or by replacing it with an empty file). The persisted session entry still references the now-missing sessionId.
3. Send a follow-up message to the Telegram bot. Without this fix, observe claude live session turn failed: ... durationMs=300011 error=FailoverError (the 5-minute resume hang), with no cli session reset line logged. Every subsequent turn fails immediately with the same FailoverError; the binding never refreshes, and the user is stuck.
4. With this fix, observe cli session reset: provider=claude-cli reason=missing-transcript ... once at the start of the next turn. The run completes against a fresh Claude session, and the new sessionId is persisted by the existing post-run flow. Subsequent turns succeed.

Risk / Mitigation

Risk 1 — false positives: Could a healthy binding be dropped? claudeCliSessionTranscriptHasContent reads up to SESSION_FILE_MAX_RECORDS=500 lines and returns true once it sees an assistant message. A binding is only persisted by session-store.ts:setCliSessionBinding after a successful run, which by definition has flushed at least one assistant turn — so a freshly-written binding always passes the probe. The probe walks ~/.claude/projects/* (already done in attempt-execution.helpers.ts), so home-dir / project-prefix mismatches behave the same as in the existing user-driven check. Mitigation: covered by the new "keeps the claude-cli sessionId when the on-disk transcript is present" test, plus the existing claudeCliSessionTranscriptHasContent suite in attempt-execution.test.ts:366-432 (already validates symlink rejection, path-traversal rejection, and assistant-message detection).
Risk 2 — non-claude regressions: Could other providers' resume paths break? The probe is gated on isClaudeCliProvider(params.provider), so any non-claude provider follows the unmodified existing branch. Mitigation: new "does not probe the transcript for non-claude-cli providers" test asserts transcriptCheck is never called and the existing { sessionId } flow is preserved.
Risk 3 — extra I/O per turn: The probe adds one fs.readdir plus a small bounded fs.open+readline per claude-cli turn. The same helper is already on the user-driven hot path via attempt-execution.ts:447; we are not introducing a new I/O class, only widening its coverage. Worst-case is bounded by SESSION_FILE_MAX_RECORDS=500 lines and short-circuits on the first assistant message (typically line 1-2). Mitigation: the helper itself is unchanged and already production-tested.
Risk 4 — cross-package import: The new import of claudeCliSessionTranscriptHasContent crosses from agents/cli-runner/ into agents/command/, and isClaudeCliProvider is loaded through the plugin-sdk/anthropic-cli.js facade. Mitigation: both directions are already used in core (prepare.ts already imports ../command/types.js; attempt-execution.ts already loads the same facade). No new architectural seam, no cycle (helpers.ts does not import back into cli-runner). Test file mocks the facade locally so the unit suite stays hermetic.
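The bounded probe discussed in Risks 1 and 3 can be illustrated with a small pure function. This is a sketch under assumptions: `transcriptHasAssistantMessage` is a hypothetical name, the caller is assumed to have already read the transcript lines, and assistant records are assumed to be JSON objects with `type: "assistant"`. The real helper reads the file itself and also rejects symlinks and path traversal, which this sketch does not attempt.

```typescript
// Cap taken from the PR text; counting every line (including blank ones)
// against the cap is an assumption of this sketch.
const SESSION_FILE_MAX_RECORDS = 500;

function transcriptHasAssistantMessage(
  lines: readonly string[],
  maxRecords: number = SESSION_FILE_MAX_RECORDS,
): boolean {
  const limit = Math.min(lines.length, maxRecords);
  for (let i = 0; i < limit; i += 1) {
    const trimmed = lines[i].trim();
    if (!trimmed) {
      continue; // blank line, nothing to parse
    }
    try {
      const record: unknown = JSON.parse(trimmed);
      if (
        typeof record === "object" &&
        record !== null &&
        (record as { type?: unknown }).type === "assistant"
      ) {
        return true; // short-circuit on the first assistant message
      }
    } catch (_err) {
      // Malformed JSONL record: skip it rather than fail the whole probe.
    }
  }
  // Empty file, no assistant turn, or none within the cap: treat as dead.
  return false;
}
```

An empty file and a missing file both end up as "no content", which matches the reproduction step that replaces the transcript with an empty file.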

Change Type (select all)

Bug fix

Scope (select all touched areas)

Agents (cli-runner / claude-cli resume path)
Auto-reply / followup runs (indirectly: now goes through the new pre-flight)
Tests (prepare.test.ts unit coverage)
Changelog (Unreleased Fixes entry)

Linked Issue/PR

Refs #77011 — addresses the missing-transcript auto-recovery scenario described in the issue. The other two items in the same issue (update.run atomicity and shell-only cron systemEvent run session rows) are independent failure modes; they remain open and out of scope here so they can be tracked and shipped separately.

Changed files

CHANGELOG.md (modified, +1/-0)
src/agents/cli-runner/prepare.test.ts (modified, +129/-0)
src/agents/cli-runner/prepare.ts (modified, +36/-14)
src/agents/cli-runner/types.ts (modified, +6/-1)

PR #77104: fix(cron): keep pre-transcript rows non-resumable

Repository: openclaw/openclaw
Author: steipete
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/77104

Description (problem / solution / changelog)

Summary

keep default isolated cron metadata rows non-resumable while their transcript file is missing
restore sessionId/sessionFile once the cron transcript exists, leaving persistent session:<id> cron targets untouched
add regression coverage for pre-transcript and transcript-backed cron persistence

Refs #77011.
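One way to read the three summary points above is as a single decision when a cron run's row is persisted. The sketch below is illustrative only: `CronSessionRow`, `CronRunIdentity`, `persistCronRow`, and the `sessionTarget` field are hypothetical, and the actual logic in src/cron/isolated-agent/run-session-state.ts may be structured quite differently.

```typescript
// Hypothetical row shape; the real session-store row has more fields.
interface CronSessionRow {
  sessionKey: string;
  sessionId?: string;
  sessionFile?: string;
}

interface CronRunIdentity {
  sessionId: string;
  sessionFile: string;
  // Assumption: persistent cron targets are written as "session:<id>".
  sessionTarget: string;
}

function isPersistentTarget(sessionTarget: string): boolean {
  return sessionTarget.indexOf("session:") === 0;
}

function persistCronRow(
  row: CronSessionRow,
  run: CronRunIdentity,
  transcriptExists: boolean,
): CronSessionRow {
  if (isPersistentTarget(run.sessionTarget)) {
    return row; // persistent session:<id> targets are left untouched
  }
  if (transcriptExists) {
    // Transcript is on disk: the row becomes resumable again.
    return { ...row, sessionId: run.sessionId, sessionFile: run.sessionFile };
  }
  // Pre-transcript: keep the metadata row but make it non-resumable.
  return { sessionKey: row.sessionKey };
}
```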

Verification

pnpm exec oxfmt --check --threads=1 src/cron/isolated-agent/run-session-state.ts src/cron/isolated-agent/run-session-state.test.ts
pnpm test src/cron/isolated-agent/run-session-state.test.ts src/cron/isolated-agent/run.session-key-isolation.test.ts src/cron/isolated-agent/run.fast-mode.test.ts src/cron/isolated-agent.session-identity.test.ts
pnpm test src/infra/heartbeat-runner.ghost-reminder.test.ts src/cron/service.runs-one-shot-main-job-disables-it.test.ts
Crabbox Testbox tbx_01kqre4ppknrf3sfje8rt3kbmc: pnpm test:docker:cron-mcp-cleanup
Crabbox Testbox tbx_01kqre8m7vmngddbp9b4g9pn7v: pnpm check:changed

Changed files

CHANGELOG.md (modified, +1/-0)
src/cron/isolated-agent/run-session-state.test.ts (modified, +87/-0)
src/cron/isolated-agent/run-session-state.ts (modified, +27/-2)

Summary

Version

openclaw 2026.5.2 (8b2a6e5), macOS, Node v25.9.0 via nvm

Reproduction (observed)

Gateway running with active Telegram session.
update.run initiated (npm package upgrade) by openclaw-control-ui.

Module resolution failed mid-restart:

[gateway] request handler failed: Error: Cannot find module
  '/opt/homebrew/lib/node_modules/openclaw/dist/task-registry.maintenance-DuW0FRWY.js'
  imported from .../dist/status.summary-D7d6QRTx.js
[gateway] shutdown error: ERR_MODULE_NOT_FOUND ... server-close-D1yUo6cN.js

After restart, every Telegram turn fails with FailoverError in ~400 ms (both Opus 4.7 and Sonnet 4.6), because the session's transcript file no longer exists.
openclaw sessions cleanup --enforce --fix-missing confirmed the record pointed at a missing transcript and pruned it.

Two underlying bugs

Bug 1 — update.run is not atomic

A failed/in-progress install leaves modules referenced by sibling modules at hashed paths that no longer match. The gateway should download → verify all module imports resolve → swap atomically. Right now a partial install is reachable by the live process and crashes the maintenance routine.
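The "verify all module imports resolve" step could look something like the check below, run against the staged build before it is swapped in. This is a sketch, not openclaw's update code: `findUnresolvedImports` is a hypothetical name, the build is modeled as an in-memory map of file name to source text, and only relative `./` imports within one flat dist directory are considered.

```typescript
// Returns "importer -> target" entries for relative imports whose target
// file is absent from the staged build. An empty result means the staged
// directory is safe to swap in (for example with a single directory rename).
function findUnresolvedImports(files: Record<string, string>): string[] {
  // Matches: from "./x.js", import("./x.js"), and bare import "./x.js".
  const importPattern = /(?:from\s+|import\s*\(\s*|import\s+)["']\.\/([^"']+)["']/g;
  const missing: string[] = [];
  for (const importer of Object.keys(files)) {
    importPattern.lastIndex = 0; // reset shared regex state per file
    let match: RegExpExecArray | null = importPattern.exec(files[importer]);
    while (match !== null) {
      const target = match[1];
      if (!Object.prototype.hasOwnProperty.call(files, target)) {
        missing.push(`${importer} -> ${target}`);
      }
      match = importPattern.exec(files[importer]);
    }
  }
  return missing;
}
```

With the file names from the log above, a staged build that contains status.summary-D7d6QRTx.js but not task-registry.maintenance-DuW0FRWY.js would be reported as unresolved, and the swap would be skipped instead of exposing the partial install to the live process.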

Bug 2 — session-store entries for shell-only cron jobs

systemEvent run: cron jobs (pure shell, no LLM turn) get a row in sessions.json and a .trajectory-path.json + .trajectory.jsonl, but never a <id>.jsonl. openclaw doctor then reports them as "missing transcripts" forever, and cleanup --fix-missing prunes them — but they should never have been registered as resumable sessions in the first place. Example session ids that exhibit this: a1287341-..., 04ab26a5-..., cb3b51f9-... (all cron-driven systemEvent jobs).

Bug 3 (nice-to-have) — auto-recover from missing transcripts

When claude --resume <id> fails because the transcript file is gone, the gateway loops forever on the same broken session record. It should auto-prune the record and start a fresh session for that sessionKey instead of failing every turn for the user.

Workaround

Run openclaw sessions cleanup --enforce --fix-missing
Or schedule it daily via openclaw cron add --system-event 'run: ...'

Logs (trimmed)

14:05:50 [gateway] request handler failed: ERR_MODULE_NOT_FOUND task-registry.maintenance
14:05:51 [gateway] shutdown error: ERR_MODULE_NOT_FOUND server-close
14:10:23 [diagnostic] stuck session: sessionKey=agent:main:telegram:direct:<redacted>
         age=135s reason=queued_work_without_active_run
14:52:43 claude live session turn failed: provider=claude-cli ... durationMs=300011 error=FailoverError
15:10:22 claude live session turn failed: durationMs=384 error=FailoverError (immediate)

extent analysis

TL;DR

Run openclaw sessions cleanup --enforce --fix-missing to prune session records that point at missing transcript files, so the next turn starts a fresh session instead of failing.

Guidance

Identify and address the root cause of the failed update.run to prevent partial installs and module resolution failures.
Verify that the cleanup command prunes the session records whose transcript files are missing.
Consider scheduling the cleanup command daily via openclaw cron add --system-event 'run: ...' to prevent similar issues in the future.
Investigate and fix the underlying bugs, including making update.run atomic and preventing session-store entries for shell-only cron jobs.

Example

No code snippet is provided as it is not explicitly supported by the issue.

Notes

The provided workaround may not fix the underlying issues but can help mitigate the symptoms. It is essential to address the root causes to prevent similar problems in the future.

Recommendation

Apply the workaround by running openclaw sessions cleanup --enforce --fix-missing to immediately address the issue, and then investigate and fix the underlying bugs to prevent future occurrences.
