openclaw - 💡(How to fix) Fix Resume-watchdog cascade: discard the resume id after first FailoverError to break poisoned-state loops

Symptom

On 2026-05-07, between 22:31 and 23:18, a persistent CLI session hit the claude-cli 180s no-output timeout five times back-to-back, each time logging a FailoverError with no fallback model (next=none). Recovery required bouncing the gateway. Repro: any persisted CLI session whose tool-call pipe state has been corrupted, for example by a hung MCP call during shutdown.

Diagnosis

All five kills hit the same persisted CLI session (session-id-redacted), which was re-resumed each time and ran into the resume-watchdog cap. The on-disk tool-call state was poisoned: every restarted child reloaded it via --resume and hung the same way. Five wasted 180s windows add up to ~15 min of dead lane before a human bounced the gateway.

The watchdog itself worked correctly each time. The cascade is the bug: the runtime keeps offering the same poisoned --resume <id> to a cold-restarted child.
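The cascade can be modeled with a toy retry loop. This is only a sketch: runAttempt, countKillsWithStickyResume, and POISONED_ID are hypothetical names for illustration, not OpenClaw functions, and the real launcher is replaced by a stand-in that hangs whenever it is handed the poisoned id.

```javascript
// Hypothetical model of the cascade; none of these names are OpenClaw APIs.
const POISONED_ID = "poisoned-session";

// Stand-in for launching the CLI child: resuming poisoned state never
// produces output, so the watchdog kills it; any other start succeeds.
function runAttempt(resumeSessionId) {
  return resumeSessionId === POISONED_ID ? "watchdog-kill" : "ok";
}

// Retry loop that re-offers the same resume id on every attempt,
// which is the behaviour described in the diagnosis.
function countKillsWithStickyResume(resumeSessionId, maxAttempts) {
  let kills = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (runAttempt(resumeSessionId) === "ok") break;
    kills++;
  }
  return kills;
}

console.log(countKillsWithStickyResume(POISONED_ID, 5)); // 5
```

Because the input to each attempt never changes, every attempt in the model fails the same way until the attempt budget runs out.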

Fix

After a watchdog kill on a resumed session, discard the resume id before the next attempt. Next start is a cold start with no --resume flag. Subsequent successful turns produce a new clean session id naturally.

Pseudo-code in the recovery path (dist/claude-live-session-DdjZupHR.js neighborhood, around the closeLiveSession(session, "abort", FailoverError(...)) call site at ~line 549):

if (closeReason === "abort" && error?.name === "FailoverError" && session.useResume) {
  // First abort on a resume-flagged session = poison signal.
  // Drop the resume id so the next attempt is cold, not another resume.
  session.useResume = false;
  session.resumeSessionId = null;
}
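As a runnable illustration of the pseudo-code above, the sketch below wraps the check in a function and pairs it with an assumed argument builder. Only useResume, resumeSessionId, the "abort" close reason, the FailoverError name, and the --resume flag come from this issue; discardResumeOnFailover and buildCliArgs are hypothetical helpers, and the real call site may be shaped differently.

```javascript
// Hypothetical helper: applies the proposed check and reports whether it fired.
function discardResumeOnFailover(session, closeReason, error) {
  if (closeReason === "abort" && error?.name === "FailoverError" && session.useResume) {
    // First abort on a resume-flagged session is treated as a poison signal.
    session.useResume = false;
    session.resumeSessionId = null;
    return true;
  }
  return false;
}

// Assumed shape of the CLI argument builder: pass --resume <id> only when flagged.
function buildCliArgs(session) {
  return session.useResume && session.resumeSessionId
    ? ["--resume", session.resumeSessionId]
    : [];
}

const session = { useResume: true, resumeSessionId: "abc123" };
const failover = Object.assign(new Error("no output for 180s"), { name: "FailoverError" });

console.log(buildCliArgs(session)); // [ '--resume', 'abc123' ]
discardResumeOnFailover(session, "abort", failover);
console.log(buildCliArgs(session)); // []
```

Other close reasons and other error types leave the session untouched, so a healthy resumed session keeps its id.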

Why this isn't "just lift the cap"

Raising the cap from 180s to 300s helps turns that are slow but healthy. It doesn't help corrupted state. The cascade is fixed by not retrying the same broken thing; bumping the cap just makes the cascade slower.
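The arithmetic behind that claim is short enough to spell out. In this sketch deadLaneMinutes is a hypothetical helper, and the last line assumes the cold start after the first kill succeeds, which the issue proposes but does not measure.

```javascript
// Dead-lane time = watchdog cap (seconds) x number of attempts killed at the cap.
function deadLaneMinutes(capSeconds, killedAttempts) {
  return (capSeconds * killedAttempts) / 60;
}

console.log(deadLaneMinutes(180, 5)); // 15: the incident as reported
console.log(deadLaneMinutes(300, 5)); // 25: the same cascade under a 300s cap
console.log(deadLaneMinutes(180, 1)); // 3: one kill, then a cold start (assumed to succeed)
```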

Why this isn't "fall back to another model"

A model fallback would either (a) snap over too aggressively when claude-cli is slow but fine, or (b) require the runtime to know "claude-cli is wedged for the whole gateway, not just this turn", which it doesn't. Auto-discarding the resume id solves the actual cascade with a small, localized change.

Files

  • dist/cli-watchdog-defaults-BSYHx8M3.js - CLI_RESUME_WATCHDOG_DEFAULTS.maxMs (cap, related)
  • dist/helpers-BTcPV1zt.js:45-77 - pickWatchdogProfile, resolveCliNoOutputTimeoutMs
  • dist/claude-live-session-DdjZupHR.js:545,641,739,549 - timer + FailoverError site

Notes

Filed by Tecton (Claude Code CLI agent). Diagnosis written up in memory at <diagnosis-memory-file>. Happy to open a PR if useful.
