openclaw - ✅(Solved) Fix Gateway update.run can leave half-installed package, killing live session transcripts [2 pull requests, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#77011 · Fetched 2026-05-04 04:59:27
Comments: 2 · Participants: 3 · Timeline: 13 · Reactions: 2
Timeline (top): referenced ×5, commented ×2, cross-referenced ×2, mentioned ×2

A botched update.run (npm upgrade triggered via the gateway) left the package half-installed. The post-restart maintenance step then tried to load a module that no longer existed at the path the new build expected, crashed, and a live Telegram session lost its <sessionId>.jsonl transcript. Every subsequent message to that Telegram chat then hit claude --resume <id> against a missing file, failed in ~400 ms, and surfaced as "Something went wrong while processing your request" to the user.

Error Message

[gateway] request handler failed: Error: Cannot find module
[gateway] shutdown error: ERR_MODULE_NOT_FOUND ... server-close-D1yUo6cN.js
14:05:51 [gateway] shutdown error: ERR_MODULE_NOT_FOUND server-close
14:52:43 claude live session turn failed: provider=claude-cli ... durationMs=300011 error=FailoverError
15:10:22 claude live session turn failed: durationMs=384 error=FailoverError (immediate)

Root Cause

  1. After restart, every Telegram turn fails with FailoverError in ~400 ms (both Opus 4.7 and Sonnet 4.6), because the session's transcript file no longer exists.
  2. openclaw sessions cleanup --enforce --fix-missing confirmed the record pointed at a missing transcript and pruned it.

Fix Action

Workaround

  • Run openclaw sessions cleanup --enforce --fix-missing
  • Or schedule it daily via openclaw cron add --system-event 'run: ...'

PR fix notes

PR #77030: fix(cli-runner): drop stale claude-cli sessionId when transcript missing

Description (problem / solution / changelog)

Summary

  • Problem: After a half-installed update.run (#77011), live Telegram-direct sessions stop responding indefinitely. The Claude CLI live session repeatedly invokes claude --resume <stale-sid> against a transcript that no longer exists at ~/.claude/projects/<project>/<sessionId>.jsonl. The first attempt hangs to the 5-minute hard timeout (durationMs=300011 error=FailoverError); every following turn fast-fails (durationMs=384 error=FailoverError) without ever clearing the dead binding. Concretely, src/agents/cli-runner/prepare.ts:259-271 only validates auth-profile / auth-epoch / system-prompt / mcp hashes via resolveCliSessionReuse (src/agents/cli-session.ts:127); it never checks that the transcript file the persisted cliSessionBindings.claude-cli.sessionId points at is still on disk.

  • Root Cause: A transcript-existence probe (claudeCliSessionTranscriptHasContent in src/agents/command/attempt-execution.helpers.ts:80) is already wired into the user-driven path at src/agents/command/attempt-execution.ts:443-467 to clear stale bindings before invoking runCliAgent. The auto-reply / followup / Telegram-direct path takes a different entrypoint — src/auto-reply/reply/agent-runner-execution.ts:1305 calls runCliAgent directly and does not go through runAgentAttempt, so the existing probe is not consulted on this code path. The session_expired retry inside src/agents/cli-runner.ts:331 does not compensate either, because claude --resume against a missing transcript surfaces as reason=timeout, not reason=session_expired; the retry branch never fires and the persisted cliSessionBinding is never overwritten with a fresh sessionId. Every subsequent turn re-reads the same dead sessionId until the operator explicitly runs openclaw sessions cleanup --enforce --fix-missing. The cron isolated-agent path (src/cron/isolated-agent/run-execution.runtime.ts) is in the same category — also enters runCliAgent directly.

  • Fix: Push the transcript-existence pre-flight down into prepareCliRunContext, the single common entry point for all CLI-run callers (attempt-execution.ts, agent-runner-execution.ts, cron/isolated-agent/run-execution.runtime.ts). When the candidate sessionId is non-empty, the provider is claude-cli, and the on-disk Claude CLI transcript has no assistant message (i.e. resume target is dead), drop the binding for this turn and return a CliReusableSession with invalidatedReason: "missing-transcript". The current turn then runs as a fresh claude session (no --resume), and the existing post-run flow in src/agents/command/session-store.ts:163-168 writes the brand-new cliSessionBinding back to the session store, replacing the dead one. The loop is broken at the per-turn level — no manual cleanup needed, no schema migration, no protocol bump. The check is gated on isClaudeCliProvider(params.provider) so non-claude providers pay zero cost. The helper is exposed via the existing prepareDeps injection seam so tests stay hermetic and don't touch real ~/.claude/projects/.

  • What changed:

    • src/agents/cli-runner/prepare.ts — import isClaudeCliProvider and claudeCliSessionTranscriptHasContent; expose the latter through prepareDeps for test injection; before computing reusableCliSession, run the claude-cli transcript probe and short-circuit to { invalidatedReason: "missing-transcript" } when the resume target is dead.
    • src/agents/cli-runner/types.ts — extend CliReusableSession.invalidatedReason union with the new "missing-transcript" case.
    • src/agents/cli-runner/prepare.test.ts — add a vi.mock for the plugin-sdk/anthropic-cli.js facade (so the test runs without bundled-plugin runtime), plus three behavior tests: drop on missing transcript, keep on present transcript, no probe for non-claude providers.
    • CHANGELOG.md — single Fixes line under Unreleased referencing the issue.
  • What did NOT change (scope boundary):

    • No protocol changes; update.run orchestration / atomic-swap logic in src/infra/update-runner.ts and src/infra/package-update-steps.ts is untouched (the npm path already stages and atomically swaps).
    • No changes to non-claude-cli backends (codex, gemini, etc.) — provider gate prevents cross-provider impact.
    • No session-store schema changes; no migration; no doctor changes.
    • The existing user-driven check in attempt-execution.ts:443-467 is intentionally left in place. It still has the value of clearing the persisted binding synchronously for that path (so a follow-up read inside the same request sees the cleared entry); this PR's pre-flight is a cross-caller safety net, not a replacement. Both call the same helper, so the two layers cannot diverge.
    • The session_expired retry path at cli-runner.ts:331-358 is unchanged. The two failure modes are complementary: session_expired is the clean-error case (Claude returns "Conversation not found"), already handled in-process by the retry; timeout is the silent-hang case (Claude blocks until the per-turn deadline), addressed here at the preparation layer above.
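
The pre-flight decision described under "Fix" can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name resolveReusableSession, the PreflightParams and PreflightDeps shapes, and the synchronous probe signature are all assumptions made for the example. Only the provider gate, the "missing-transcript" reason, and the idea of injecting the probe for tests come from the description above.

```typescript
// Hypothetical, simplified types. The real CliReusableSession lives in
// src/agents/cli-runner/types.ts and may carry more fields and reasons.
type InvalidatedReason = "missing-transcript";

interface CliReusableSession {
  sessionId?: string;
  invalidatedReason?: InvalidatedReason;
}

// Assumed parameter shape for this sketch.
interface PreflightParams {
  provider: string;
  candidateSessionId?: string;
}

// Injected dependencies, so tests never touch the real ~/.claude/projects/.
// The real probe may well be async; it is synchronous here for brevity.
interface PreflightDeps {
  isClaudeCliProvider: (provider: string) => boolean;
  transcriptHasContent: (sessionId: string) => boolean;
}

function resolveReusableSession(
  params: PreflightParams,
  deps: PreflightDeps,
): CliReusableSession {
  const sessionId = params.candidateSessionId?.trim();
  if (!sessionId) {
    return {}; // nothing persisted, so there is nothing to resume
  }
  if (!deps.isClaudeCliProvider(params.provider)) {
    return { sessionId }; // other providers skip the probe entirely
  }
  if (!deps.transcriptHasContent(sessionId)) {
    // Resume target is dead: drop the binding for this turn only.
    return { invalidatedReason: "missing-transcript" };
  }
  return { sessionId };
}
```

A caller would run the turn without --resume whenever sessionId is absent, and the post-run flow would then persist the fresh binding, which is what breaks the loop per turn.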

Reproduction

  1. Start the gateway with an active Claude-CLI agent on a Telegram-direct session and let it persist a cliSessionBindings.claude-cli.sessionId (any successful turn does this).
  2. Simulate the half-installed update.run outcome by removing the corresponding ~/.claude/projects/<project>/<sessionId>.jsonl file (or replace its contents with an empty file). The persisted session entry still references the now-missing sessionId.
  3. Send a follow-up message to the Telegram bot. Without this fix, observe claude live session turn failed: ... durationMs=300011 error=FailoverError (5-minute resume hang) and cli session reset is not logged. Every subsequent turn fails immediately with the same FailoverError; the binding never refreshes; user is stuck.
  4. With this fix, observe cli session reset: provider=claude-cli reason=missing-transcript ... once at the start of the next turn, the run completes against a fresh Claude session, and the new sessionId is persisted by the existing post-run flow. Subsequent turns succeed.

Risk / Mitigation

  • Risk 1 — false positives: Could a healthy binding be dropped? claudeCliSessionTranscriptHasContent reads up to SESSION_FILE_MAX_RECORDS=500 lines and returns true once it sees an assistant message. A binding is only persisted by session-store.ts:setCliSessionBinding after a successful run, which by definition has flushed at least one assistant turn — so a freshly-written binding always passes the probe. The probe walks ~/.claude/projects/* (already done in attempt-execution.helpers.ts), so home-dir / project-prefix mismatches behave the same as in the existing user-driven check. Mitigation: covered by the new "keeps the claude-cli sessionId when the on-disk transcript is present" test, plus the existing claudeCliSessionTranscriptHasContent suite in attempt-execution.test.ts:366-432 (already validates symlink rejection, path-traversal rejection, and assistant-message detection).
  • Risk 2 — non-claude regressions: Could other providers' resume paths break? The probe is gated on isClaudeCliProvider(params.provider), so any non-claude provider follows the unmodified existing branch. Mitigation: new "does not probe the transcript for non-claude-cli providers" test asserts transcriptCheck is never called and the existing { sessionId } flow is preserved.
  • Risk 3 — extra I/O per turn: The probe adds one fs.readdir plus a small bounded fs.open+readline per claude-cli turn. The same helper is already on the user-driven hot path via attempt-execution.ts:447; we are not introducing a new I/O class, only widening its coverage. Worst-case is bounded by SESSION_FILE_MAX_RECORDS=500 lines and short-circuits on the first assistant message (typically line 1-2). Mitigation: the helper itself is unchanged and already production-tested.
  • Risk 4 — cross-package import: The new import of claudeCliSessionTranscriptHasContent crosses from agents/cli-runner/ into agents/command/, and isClaudeCliProvider is loaded through the plugin-sdk/anthropic-cli.js facade. Mitigation: both directions are already used in core (prepare.ts already imports ../command/types.js; attempt-execution.ts already loads the same facade). No new architectural seam, no cycle (helpers.ts does not import back into cli-runner). Test file mocks the facade locally so the unit suite stays hermetic.
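
To make the bounded-probe behaviour in Risks 1 and 3 concrete, here is a small sketch of a transcript check that stops at the first assistant record or after a fixed number of lines. It is an assumption-laden stand-in, not claudeCliSessionTranscriptHasContent itself: the function name, the `type: "assistant"` record shape, and the choice to read the file in one call are all hypothetical. The description above says the real helper uses fs.open plus readline and also walks ~/.claude/projects/*.

```typescript
import * as fs from "node:fs";

// Bound quoted in the PR description.
const SESSION_FILE_MAX_RECORDS = 500;

// Returns true when one of the first `maxRecords` lines of a JSONL transcript
// parses to an object whose `type` is "assistant" (an assumed record shape).
// A missing, unreadable, or empty file counts as "no content".
function transcriptHasAssistantMessage(
  filePath: string,
  maxRecords: number = SESSION_FILE_MAX_RECORDS,
): boolean {
  let raw: string;
  try {
    // Simplification: reads the whole file, then bounds the scan.
    raw = fs.readFileSync(filePath, "utf8");
  } catch {
    return false;
  }
  for (const line of raw.split("\n").slice(0, maxRecords)) {
    if (line.trim() === "") continue;
    try {
      const record: unknown = JSON.parse(line);
      if (
        typeof record === "object" &&
        record !== null &&
        (record as { type?: unknown }).type === "assistant"
      ) {
        return true; // short-circuit on the first assistant message
      }
    } catch {
      // Malformed line: skip it rather than fail the whole probe.
    }
  }
  return false;
}
```

Treating an empty file the same as a missing one matches reproduction step 2, where either deleting the transcript or replacing it with an empty file leaves the binding dead.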

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Agents (cli-runner / claude-cli resume path)
  • Auto-reply / followup runs (indirectly: now goes through the new pre-flight)
  • Tests (prepare.test.ts unit coverage)
  • Changelog (Unreleased Fixes entry)

Linked Issue/PR

Refs #77011 — addresses the missing-transcript auto-recovery scenario described in the issue. The other two items in the same issue (update.run atomicity and shell-only cron systemEvent run session rows) are independent failure modes; they remain open and out of scope here so they can be tracked and shipped separately.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/cli-runner/prepare.test.ts (modified, +129/-0)
  • src/agents/cli-runner/prepare.ts (modified, +36/-14)
  • src/agents/cli-runner/types.ts (modified, +6/-1)

PR #77104: fix(cron): keep pre-transcript rows non-resumable

Description (problem / solution / changelog)

Summary

  • keep default isolated cron metadata rows non-resumable while their transcript file is missing
  • restore sessionId/sessionFile once the cron transcript exists, leaving persistent session:<id> cron targets untouched
  • add regression coverage for pre-transcript and transcript-backed cron persistence

Refs #77011.
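
The rule in the summary bullets could look something like the sketch below. Everything here is hypothetical: the CronSessionRow and CronRunState shapes, the persistentTarget flag, and the function name are invented for illustration and are not taken from run-session-state.ts. The sketch only shows the stated behaviour: omit sessionId/sessionFile while the transcript is missing, include them once it exists, and leave persistent session:<id> targets alone.

```typescript
import * as fs from "node:fs";

// Hypothetical persisted row shape, for illustration only.
interface CronSessionRow {
  sessionKey: string;
  updatedAt: number;
  sessionId?: string;
  sessionFile?: string;
}

// Hypothetical in-memory state for one isolated cron run.
interface CronRunState {
  sessionKey: string;
  sessionId: string;
  sessionFile: string;
  persistentTarget: boolean; // true for session:<id> cron targets
}

function toPersistedCronRow(
  state: CronRunState,
  now: number,
  transcriptExists: (file: string) => boolean = (file) => fs.existsSync(file),
): CronSessionRow {
  const row: CronSessionRow = { sessionKey: state.sessionKey, updatedAt: now };
  if (state.persistentTarget || transcriptExists(state.sessionFile)) {
    // Resumable: the transcript is on disk, or the target is persistent.
    return { ...row, sessionId: state.sessionId, sessionFile: state.sessionFile };
  }
  // Pre-transcript row: keep the metadata but give nothing to resume.
  return row;
}
```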

Verification

  • pnpm exec oxfmt --check --threads=1 src/cron/isolated-agent/run-session-state.ts src/cron/isolated-agent/run-session-state.test.ts
  • pnpm test src/cron/isolated-agent/run-session-state.test.ts src/cron/isolated-agent/run.session-key-isolation.test.ts src/cron/isolated-agent/run.fast-mode.test.ts src/cron/isolated-agent.session-identity.test.ts
  • pnpm test src/infra/heartbeat-runner.ghost-reminder.test.ts src/cron/service.runs-one-shot-main-job-disables-it.test.ts
  • Crabbox Testbox tbx_01kqre4ppknrf3sfje8rt3kbmc: pnpm test:docker:cron-mcp-cleanup
  • Crabbox Testbox tbx_01kqre8m7vmngddbp9b4g9pn7v: pnpm check:changed

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/cron/isolated-agent/run-session-state.test.ts (modified, +87/-0)
  • src/cron/isolated-agent/run-session-state.ts (modified, +27/-2)

RAW_BUFFER

Summary

A botched update.run (npm upgrade triggered via the gateway) left the package half-installed. The post-restart maintenance step then tried to load a module that no longer existed at the path the new build expected, crashed, and a live Telegram session lost its <sessionId>.jsonl transcript. Every subsequent message to that Telegram chat then hit claude --resume <id> against a missing file, failed in ~400 ms, and surfaced as "Something went wrong while processing your request" to the user.

Version

  • openclaw 2026.5.2 (8b2a6e5), macOS, Node v25.9.0 via nvm

Reproduction (observed)

  1. Gateway running with active Telegram session.

  2. update.run initiated (npm package upgrade) by openclaw-control-ui.

  3. Module resolution failed mid-restart:

    [gateway] request handler failed: Error: Cannot find module
      '/opt/homebrew/lib/node_modules/openclaw/dist/task-registry.maintenance-DuW0FRWY.js'
      imported from .../dist/status.summary-D7d6QRTx.js
    [gateway] shutdown error: ERR_MODULE_NOT_FOUND ... server-close-D1yUo6cN.js
  4. After restart, every Telegram turn fails with FailoverError in ~400 ms (both Opus 4.7 and Sonnet 4.6), because the session's transcript file no longer exists.

  5. openclaw sessions cleanup --enforce --fix-missing confirmed the record pointed at a missing transcript and pruned it.

Two underlying bugs, plus one nice-to-have

Bug 1 — update.run is not atomic

A failed/in-progress install leaves modules referenced by sibling modules at hashed paths that no longer match. The gateway should download → verify all module imports resolve → swap atomically. Right now a partial install is reachable by the live process and crashes the maintenance routine.
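
The download, verify, swap sequence proposed here might be sketched as below. This is one possible shape under stated assumptions, not how src/infra/update-runner.ts works (PR #77030 notes that the npm path already stages and swaps). The function names, the flat dist/ layout, the `.previous` backup suffix, and the regex-based import scan are all hypothetical simplifications.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Collect relative import specifiers such as "./chunk-ABC.js" from built JS.
// A regex scan is a deliberate simplification; a real check could use a parser.
function relativeImports(source: string): string[] {
  const pattern = /(?:from|import)\s*\(?\s*["'](\.{1,2}\/[^"']+)["']/g;
  const specs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    specs.push(match[1]);
  }
  return specs;
}

// List every relative import in distDir/*.js whose target file is absent.
function findUnresolvedImports(distDir: string): string[] {
  const missing: string[] = [];
  for (const name of fs.readdirSync(distDir)) {
    if (!name.endsWith(".js")) continue;
    const file = path.join(distDir, name);
    for (const spec of relativeImports(fs.readFileSync(file, "utf8"))) {
      if (!fs.existsSync(path.resolve(path.dirname(file), spec))) {
        missing.push(`${name} -> ${spec}`);
      }
    }
  }
  return missing;
}

// Swap the staged install into place only when every import resolves.
// Returns the unresolved imports; an empty array means the swap happened.
function swapIfVerified(stagedDir: string, liveDir: string): string[] {
  const missing = findUnresolvedImports(path.join(stagedDir, "dist"));
  if (missing.length > 0) return missing; // live install is left untouched
  const backup = `${liveDir}.previous`;
  fs.rmSync(backup, { recursive: true, force: true });
  // Each rename is atomic on a single filesystem, but the pair leaves a
  // brief gap; repointing a symlink would be one way to close it.
  if (fs.existsSync(liveDir)) fs.renameSync(liveDir, backup);
  fs.renameSync(stagedDir, liveDir);
  return [];
}
```

The point of the verification pass is that a hashed chunk name mismatch, like the task-registry.maintenance import in the log above, would be caught while the old install is still the one the live process sees.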

Bug 2 — session-store entries for shell-only cron jobs

systemEvent run: cron jobs (pure shell, no LLM turn) get a row in sessions.json and a .trajectory-path.json + .trajectory.jsonl, but never a <id>.jsonl. openclaw doctor then reports them as "missing transcripts" forever, and cleanup --fix-missing prunes them — but they should never have been registered as resumable sessions in the first place. Example session ids that exhibit this: a1287341-..., 04ab26a5-..., cb3b51f9-... (all cron-driven systemEvent jobs).

Bug 3 (nice-to-have) — auto-recover from missing transcripts

When claude --resume <id> fails because the transcript file is gone, the gateway loops forever on the same broken session record. It should auto-prune the record and start a fresh session for that sessionKey instead of failing every turn for the user.

Workaround

  • Run openclaw sessions cleanup --enforce --fix-missing
  • Or schedule it daily via openclaw cron add --system-event 'run: ...'

Logs (trimmed)

14:05:50 [gateway] request handler failed: ERR_MODULE_NOT_FOUND task-registry.maintenance
14:05:51 [gateway] shutdown error: ERR_MODULE_NOT_FOUND server-close
14:10:23 [diagnostic] stuck session: sessionKey=agent:main:telegram:direct:<redacted>
         age=135s reason=queued_work_without_active_run
14:52:43 claude live session turn failed: provider=claude-cli ... durationMs=300011 error=FailoverError
15:10:22 claude live session turn failed: durationMs=384 error=FailoverError (immediate)

Extended analysis

TL;DR

Run openclaw sessions cleanup --enforce --fix-missing to prune session records that point at missing transcript files, which stops the repeated resume failures.

Guidance

  • Identify and address the root cause of the failed update.run to prevent partial installs and module resolution failures.
  • Verify that the cleanup command pruned the session records that pointed at missing transcript files.
  • Consider scheduling the cleanup command daily via openclaw cron add --system-event 'run: ...' to prevent similar issues in the future.
  • Investigate and fix the underlying bugs, including making update.run atomic and preventing session-store entries for shell-only cron jobs.

Example

No code snippet is provided as it is not explicitly supported by the issue.

Notes

The provided workaround may not fix the underlying issues but can help mitigate the symptoms. It is essential to address the root causes to prevent similar problems in the future.

Recommendation

Apply the workaround by running openclaw sessions cleanup --enforce --fix-missing to immediately address the issue, and then investigate and fix the underlying bugs to prevent future occurrences.
