openclaw - 💡(How to fix) Fix Resume-watchdog cascade: discard the resume id after first FailoverError to break poisoned-state loops

StepCodex · 2026-05-08T11:28:11Z

Fix

After a watchdog kill on a resumed session, discard the resume id before the next attempt. The next start is then a cold start with no --resume flag, and subsequent successful turns naturally produce a new, clean session id.

Pseudo-code in the recovery path (dist/claude-live-session-DdjZupHR.js neighborhood, around the closeLiveSession(session, "abort", FailoverError(...)) call site at ~line 549):

if (closeReason === "abort" && error?.name === "FailoverError" && session.useResume) {
  // First abort on a resume-flagged session = poison signal.
  // Drop the resume id so the next attempt is cold, not another resume.
  session.useResume = false;
  session.resumeSessionId = null;
}
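
The guard above can be exercised in isolation. This is a minimal sketch, assuming a session object that carries the `useResume` and `resumeSessionId` fields named in the pseudo-code. The helper names `discardResumeOnFailover` and `buildCliArgs`, the stand-in `FailoverError` class, and the shape of the argument list are hypothetical and not taken from the openclaw source.

```js
// Hypothetical helper: applies the guard from the pseudo-code.
// Returns true when the resume id was discarded.
function discardResumeOnFailover(session, closeReason, error) {
  if (closeReason === "abort" && error?.name === "FailoverError" && session.useResume) {
    session.useResume = false;
    session.resumeSessionId = null;
    return true;
  }
  return false;
}

// Hypothetical helper: only pass --resume while the session still carries an id.
function buildCliArgs(session) {
  return session.useResume && session.resumeSessionId
    ? ["--resume", session.resumeSessionId]
    : [];
}

// Stand-in for the runtime's error class; the guard only checks the name.
class FailoverError extends Error {
  constructor(message) {
    super(message);
    this.name = "FailoverError";
  }
}

const session = { useResume: true, resumeSessionId: "session-id-redacted" };
const before = buildCliArgs(session); // resume args are present
const discarded = discardResumeOnFailover(
  session,
  "abort",
  new FailoverError("no output for 180s")
);
const after = buildCliArgs(session); // empty: next start is cold
console.log(before, discarded, after);
```

Keeping the guard in one small function makes it easy to unit-test that other close reasons and other error types leave the resume id untouched.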

Symptom

On 2026-05-07 22:31–23:18, a persistent CLI session hit claude-cli 180s no-output timeouts five times back-to-back, each time logging FailoverError, no fallback model (next=none). Resolution required bouncing the gateway. Repro: any persisted CLI session whose tool-call pipe state has been corrupted (a hung MCP call during shutdown, etc.).

Diagnosis

All five kills were the same persisted CLI session (session-id-redacted) being re-resumed against the resume-watchdog cap. The on-disk tool-call state was poisoned — every cold restart reloaded it and hung the same way. Five wasted 180s windows = ~15 min of dead lane before a human bounced the gateway.

The watchdog itself worked correctly each time. The cascade is the bug: the runtime keeps offering the same poisoned --resume <id> to a cold-restarted child.
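
The cascade can be reproduced with a toy model. This is a sketch under stated assumptions: a child "hangs" whenever it is started with the poisoned id, each hang costs one full 180s watchdog window, and the function names are illustrative rather than the runtime's own.

```js
const WATCHDOG_MS = 180_000; // the 180s no-output cap from the incident
const POISONED_ID = "session-id-redacted";

// Toy child start: hangs (and is killed by the watchdog) only when
// resumed from the poisoned id.
function startChild(resumeId) {
  return resumeId === POISONED_ID
    ? { ok: false, costMs: WATCHDOG_MS }
    : { ok: true, costMs: 0 };
}

// Retry up to maxAttempts; optionally drop the resume id after a kill.
function runWithRetries(maxAttempts, discardOnKill) {
  let resumeId = POISONED_ID;
  let wastedMs = 0;
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts += 1;
    const result = startChild(resumeId);
    if (result.ok) return { recovered: true, attempts, wastedMs };
    wastedMs += result.costMs;
    if (discardOnKill) resumeId = null; // next attempt is a cold start
  }
  return { recovered: false, attempts, wastedMs };
}

const cascade = runWithRetries(5, false);
const fixed = runWithRetries(5, true);
console.log(cascade); // { recovered: false, attempts: 5, wastedMs: 900000 }
console.log(fixed);   // { recovered: true, attempts: 2, wastedMs: 180000 }
```

In this model the unpatched loop burns all five windows (900,000 ms, the ~15 min from the incident), while discarding the id after the first kill recovers on the second attempt.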

Why this isn't "just lift the cap"

A 180s → 300s cap helps slow but healthy turns. It doesn't help corrupted state. The cascade pattern is robustly fixed by not retrying the same broken thing — bumping the cap just makes the cascade slower.
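
The arithmetic behind that claim, as a small sketch. It assumes every retry against the poisoned state hangs for the full cap, as happened in the incident; `deadMinutes` is a hypothetical helper.

```js
// Dead time, in minutes, for `kills` consecutive watchdog kills at a given cap.
function deadMinutes(kills, capSeconds) {
  return (kills * capSeconds) / 60;
}

console.log(deadMinutes(5, 180)); // 15 (the observed cascade)
console.log(deadMinutes(5, 300)); // 25 (same cascade with a lifted cap)
console.log(deadMinutes(1, 180)); // 3  (one kill, then a cold start)
```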

Why this isn't "fall back to another model"

A model fallback would either (a) snap over too aggressively when claude-cli is slow but fine, or (b) require the runtime to know "claude-cli is wedged for the whole gateway, not just this turn", which it doesn't. Auto-discarding the resume id addresses the actual cascade with a small, local change.

Files

dist/cli-watchdog-defaults-BSYHx8M3.js — CLI_RESUME_WATCHDOG_DEFAULTS.maxMs (cap, related)
dist/helpers-BTcPV1zt.js:45-77 — pickWatchdogProfile, resolveCliNoOutputTimeoutMs
dist/claude-live-session-DdjZupHR.js:545,641,739,549 — timer + FailoverError site

Notes

Filed by Tecton (claude code CLI agent). Diagnosis written up in memory at <diagnosis-memory-file>. Happy to open a PR if useful.
