openclaw - ✅(Solved) Fix CLI backend: retry fresh session on any FailoverError, not just session_expired [2 pull requests, 1 comment, 2 participants]

GitHub stats
Issue: openclaw/openclaw#77089 (fetched 2026-05-05 05:52:24)
Comments: 1 · Participants: 2 · Timeline events: 5 · Reactions: 2
Timeline (top): cross-referenced ×2, referenced ×2, commented ×1

Error Message

log.warn(`CLI session failed (reason=${err.reason ?? "unknown"}), clearing stale binding and retrying fresh: ...`);

Root Cause

In attempt-execution (dist: attempt-execution-BGTy1BCv.js:546), the catch block only handles session_expired:

if (err instanceof FailoverError && err.reason === "session_expired" && activeCliSessionBinding?.sessionId && ...) {

The retry-fresh logic (clear binding → runCliWithSession(void 0) → re-store new binding) is already implemented but gated behind this narrow condition.

Fix Action

Fixed

PR fix notes

PR #77141: fix(agents): clear stale CLI session on any FailoverError, not only session_expired

Description (problem / solution / changelog)

Problem

The stale-session recovery path in runPreparedCliAgent only cleared the binding and retried fresh when err.reason === "session_expired". Failures with reason "timeout" (watchdog kill) or "unknown" (generic CLI crash) would rethrow immediately, leaving the dead session binding in place. Every subsequent turn then tried to resume the same dead session, failed in milliseconds, and cascaded to API fallback (Opus → Sonnet).

The incident in #77089 shows a gateway that degraded for 4+ hours because a single timeout failure left a stale binding that nothing cleared until a gateway restart.

Fix

Remove the err.reason === "session_expired" narrowing. Any FailoverError from an active reused CLI session now clears the binding and retries once fresh. If the fresh retry also fails, it falls through to the API fallback chain as before. Add a cliBackendLog.warn so operators can see when a non-session_expired reason triggered the recovery.
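
The recovery path described in this fix can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from src/agents/cli-runner.ts: the shape of `FailoverError`, the `SessionStore` and `RunCli` types, and the `runWithFreshRetry` function are hypothetical stand-ins, and the runner is synchronous only to keep the example short.

```typescript
// Hypothetical stand-ins; the real OpenClaw types are not shown in this report.
type FailoverReason =
  | "session_expired" | "timeout" | "unknown"
  | "auth" | "billing" | "rate_limit";

class FailoverError extends Error {
  readonly reason?: FailoverReason;
  constructor(message: string, reason?: FailoverReason) {
    super(message);
    this.name = "FailoverError";
    this.reason = reason;
  }
}

interface SessionStore {
  sessionId: string | undefined;
}

// Assumption: passing `undefined` starts a fresh CLI session.
type RunCli = (resumeSessionId: string | undefined) => {
  sessionId: string;
  output: string;
};

function runWithFreshRetry(
  store: SessionStore,
  runCli: RunCli,
  warn: (msg: string) => void,
): string {
  const reused = store.sessionId;
  try {
    const result = runCli(reused);
    store.sessionId = result.sessionId;
    return result.output;
  } catch (err) {
    // Only a FailoverError on a reused session triggers recovery;
    // failures of fresh sessions are rethrown unchanged.
    if (!(err instanceof FailoverError) || reused === undefined) throw err;
    warn(
      `CLI session failed (reason=${err.reason ?? "unknown"}), ` +
        "clearing stale binding and retrying fresh",
    );
    store.sessionId = undefined; // clear the stale binding first
    const fresh = runCli(undefined); // one retry; a second failure propagates
    store.sessionId = fresh.sessionId;
    return fresh.output;
  }
}
```

Clearing the binding before the retry means that even if the fresh attempt also throws, the next turn starts without the dead session ID, and the caller's API fallback chain handles the error as before.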

Changes

  • src/agents/cli-runner.ts — remove reason === "session_expired" guard, add warn log, import cliBackendLog
  • src/agents/cli-runner.reliability.test.ts — two new regression tests:
    • FailoverError reason timeout → retry succeeds with fresh session
    • FailoverError reason unknown → retry succeeds with fresh session

Tests

38/38 reliability tests pass. pnpm oxlint clean. pnpm tsc --noEmit clean.

Fixes #77089.

Co-Authored-By: Claude Sonnet 4.6 [email protected]

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/cli-runner.reliability.test.ts (modified, +70/-0)
  • src/agents/cli-runner.ts (modified, +5/-7)
  • src/agents/command/attempt-execution.cli.test.ts (modified, +130/-0)
  • src/agents/command/attempt-execution.ts (modified, +4/-2)

PR #77385: fix(agents): clear stale CLI session binding for all FailoverError reasons (#77089)

Description (problem / solution / changelog)

Problem

attempt-execution.ts only cleared the persisted CLI session binding from the store when err.reason === "session_expired". When a FailoverError with reason auth, billing, or rate_limit was thrown (either as the primary error or as a secondary error after cli-runner.ts's internal session_expired retry), the stale binding was left in the store. The next turn would try to resume the same dead CLI session and fail again.

Fix

Clear the stale binding unconditionally for any FailoverError that has an activeCliSessionBinding. Then:

  • Recoverable (session_expired, timeout, unknown): clear binding + retry with a fresh session
  • Non-recoverable (auth, billing, rate_limit): clear binding + rethrow immediately so the caller surfaces the real error

This also covers the case where cli-runner.ts's internal session_expired retry succeeds with a fresh session but fails on a subsequent FailoverError (e.g. auth): attempt-execution.ts receives the secondary error, correctly clears the original stale binding, and rethrows.
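
The recoverable versus non-recoverable split above can be expressed as a small decision function. This is a sketch only: `decideRecovery`, `RecoveryAction`, and the stand-in `FailoverError` class are hypothetical names, and treating a missing reason as "unknown" is an assumption that mirrors the `err.reason ?? "unknown"` expression in the log line.

```typescript
// Hypothetical stand-ins; the real OpenClaw types are not shown in this report.
type FailoverReason =
  | "session_expired" | "timeout" | "unknown"
  | "auth" | "billing" | "rate_limit";

class FailoverError extends Error {
  readonly reason?: FailoverReason;
  constructor(message: string, reason?: FailoverReason) {
    super(message);
    this.name = "FailoverError";
    this.reason = reason;
  }
}

type RecoveryAction = "retry_fresh" | "clear_and_rethrow" | "rethrow";

// Reasons the PR description lists as recoverable.
const RECOVERABLE_REASONS: readonly FailoverReason[] = [
  "session_expired",
  "timeout",
  "unknown",
];

function decideRecovery(
  err: unknown,
  boundSessionId: string | undefined,
): RecoveryAction {
  // Not a FailoverError, or no active binding: nothing to clear.
  if (!(err instanceof FailoverError) || boundSessionId === undefined) {
    return "rethrow";
  }
  // Assumption: a missing reason is treated like "unknown".
  const reason: FailoverReason = err.reason ?? "unknown";
  return RECOVERABLE_REASONS.includes(reason)
    ? "retry_fresh" // clear binding, then retry with a fresh session
    : "clear_and_rethrow"; // clear binding, surface the real error
}
```

Keeping the decision separate from the side effects (clearing the store, re-running the CLI) makes each path easy to cover with a unit test, as the auth-reason test in this PR does for the non-recoverable case.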

Changes

  • src/agents/command/attempt-execution.ts: widen the FailoverError gate from session_expired-only to any FailoverError; add rethrow-without-retry path for non-recoverable reasons
  • src/agents/command/attempt-execution.cli.test.ts: add test for non-recoverable path (auth reason) — verifies binding is cleared (prevents dead-session resume) but no retry is attempted
  • CHANGELOG.md: entry for #77141

Test

pnpm test src/agents/command/attempt-execution.cli.test.ts

17/17 pass.

pnpm exec oxfmt --check --threads=1 src/agents/command/attempt-execution.ts src/agents/command/attempt-execution.cli.test.ts CHANGELOG.md

0 errors.

Fixes #77089.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/command/attempt-execution.cli.test.ts (modified, +145/-0)
  • src/agents/command/attempt-execution.ts (modified, +14/-2)

Code Example

Before (retry gated on session_expired):

if (err instanceof FailoverError && err.reason === "session_expired" && activeCliSessionBinding?.sessionId && ...) {

After (retry on any FailoverError with an active binding):

if (err instanceof FailoverError && activeCliSessionBinding?.sessionId && ...) {
    log.warn(`CLI session failed (reason=${err.reason ?? "unknown"}), clearing stale binding and retrying fresh: ...`);
    // existing clear + retry logic
}

Problem

When a CLI backend session hangs or crashes, OpenClaw only clears the stale session binding and retries fresh for FailoverError.reason === "session_expired". All other failure reasons (timeout, generic crash) keep the stale binding and immediately cascade to the model fallback chain (API Opus → API Sonnet).

This means a single stuck CLI session can take the entire CLI backend offline for hours — every subsequent turn tries to resume the dead session, fails instantly, and falls back to paid API models.

Observed incident (2026-05-03)

  1. 17:27 — CLI session 666dfaad8096 hung, produced no output for 180s, watchdog killed it. Reason: timeout. Fell to API Opus.
  2. 21:44 — Same session binding still stored. OpenClaw tried useResume=true, resumeSession=666dfaad8096. CLI crashed in 476ms with FailoverError. Fell to API Opus → timed out → fell to API Sonnet.
  3. 22:11 — A heartbeat came in with useResume=false, session=none (binding was cleared by a gateway restart). Fresh CLI session succeeded in 13s.

The binding was never cleared between steps 1 and 3 because the FailoverError reasons were timeout and unknown, not session_expired.
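
The timeline above can be replayed in a toy model to show why the binding persisted. This is a simplified sketch, not OpenClaw code: `replayTurns`, `narrowGuard`, and `wideGuard` are hypothetical names, and the model assumes that resuming the dead session always fails while a fresh session always succeeds (as the 22:11 heartbeat did).

```typescript
type Reason = "session_expired" | "timeout" | "unknown";

// A guard decides whether a failure reason clears the binding and retries fresh.
type Guard = (reason: Reason) => boolean;

const narrowGuard: Guard = (r) => r === "session_expired"; // behavior before the fix
const wideGuard: Guard = () => true; // proposed behavior

interface TurnResult {
  resumed: string | undefined;
  path: "cli" | "cli_fresh_retry" | "api_fallback";
}

const DEAD_SESSION = "666dfaad8096";

// failureReasons[i] is the reason turn i would fail with if it resumed the dead session.
function replayTurns(failureReasons: Reason[], clearsBinding: Guard): TurnResult[] {
  let binding: string | undefined = DEAD_SESSION;
  return failureReasons.map((reason, i): TurnResult => {
    const resumed = binding;
    if (resumed !== DEAD_SESSION) {
      return { resumed, path: "cli" }; // healthy binding, normal CLI turn
    }
    if (clearsBinding(reason)) {
      binding = `fresh-${i}`; // stale binding replaced by the fresh session
      return { resumed, path: "cli_fresh_retry" };
    }
    return { resumed, path: "api_fallback" }; // binding left untouched
  });
}

const before = replayTurns(["timeout", "unknown"], narrowGuard);
const after = replayTurns(["timeout", "unknown"], wideGuard);
console.log(before.map((t) => t.path)); // both turns fall back to the API
console.log(after.map((t) => t.path)); // first turn retries fresh, second stays on CLI
```

Under the narrow guard both turns resume the same dead session, which matches steps 1 and 2 of the incident; under the wide guard the first failure already replaces the binding.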

Root cause

In attempt-execution (dist: attempt-execution-BGTy1BCv.js:546), the catch block only handles session_expired:

if (err instanceof FailoverError && err.reason === "session_expired" && activeCliSessionBinding?.sessionId && ...) {

The retry-fresh logic (clear binding → runCliWithSession(void 0) → re-store new binding) is already implemented but gated behind this narrow condition.

Proposed fix

Widen the condition to catch any FailoverError when a session binding exists:

if (err instanceof FailoverError && activeCliSessionBinding?.sessionId && ...) {
    log.warn(`CLI session failed (reason=${err.reason ?? "unknown"}), clearing stale binding and retrying fresh: ...`);
    // existing clear + retry logic
}

This way:

  • Any CLI failure on a resumed session clears the stale binding
  • Retries once with a fresh session before falling back to API
  • If the fresh retry also fails, cascades to fallbacks as before
  • No behavior change for non-resume failures (no activeCliSessionBinding → falls through to throw err)

Impact

  • Prevents hours-long silent degradation from CLI → API Sonnet
  • Self-healing: one bad session gets cleared automatically instead of requiring a gateway restart
  • No change to fresh session failure behavior

Additional suggestions

  • Consider adding "timeout" and "unknown" to the explicit reason list if a broad catch feels too aggressive
  • The reliability.watchdog default of 180s may be too aggressive for Opus 4.7 — consider bumping the default or documenting tuning guidance
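
For the watchdog suggestion, a no-output timer with a configurable threshold could look roughly like this. This is a sketch under assumptions: the real schema behind reliability.watchdog is not shown in this report, so `WatchdogConfig`, `noOutputTimeoutMs`, and `NoOutputWatchdog` are hypothetical names; only the 180s default comes from the issue.

```typescript
// Hypothetical config shape; not the actual reliability.watchdog schema.
interface WatchdogConfig {
  noOutputTimeoutMs: number;
}

// 180s default, as reported in the issue.
const DEFAULT_WATCHDOG: WatchdogConfig = { noOutputTimeoutMs: 180 * 1000 };

class NoOutputWatchdog {
  private readonly config: WatchdogConfig;
  private lastOutputAtMs: number;

  constructor(config: WatchdogConfig, startedAtMs: number) {
    this.config = config;
    this.lastOutputAtMs = startedAtMs;
  }

  // Call whenever the CLI process emits output.
  noteOutput(nowMs: number): void {
    this.lastOutputAtMs = nowMs;
  }

  // True once the process has been silent for the configured duration.
  shouldKill(nowMs: number): boolean {
    return nowMs - this.lastOutputAtMs >= this.config.noOutputTimeoutMs;
  }
}

// A slower model could be given a longer threshold than the default.
const tuned = new NoOutputWatchdog({ noOutputTimeoutMs: 300 * 1000 }, 0);
console.log(tuned.shouldKill(180 * 1000)); // still within the longer threshold
```

Passing the clock value in explicitly keeps the class deterministic, so threshold tuning can be tested without real timers.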

Environment

  • OpenClaw 2026.5.2 (8b2a6e5)
  • CLI backend: claude-cli → Claude CLI (Max subscription)
  • Primary model: claude-cli/claude-opus-4-7

extent analysis

TL;DR

The proposed fix widens the condition in the attempt-execution catch block so that any FailoverError raised while a session binding exists clears the stale binding and retries with a fresh session.

Guidance

  • Implement the proposed fix by modifying the condition in attempt-execution to catch any FailoverError with a session binding, as shown in the provided code snippet.
  • Consider adding specific reason checks for "timeout" and "unknown" if a broad catch is deemed too aggressive.
  • Review the reliability.watchdog default of 180s and consider adjusting it or providing tuning guidance for Opus 4.7.

Example

if (err instanceof FailoverError && activeCliSessionBinding?.sessionId && ...) {
    log.warn(`CLI session failed (reason=${err.reason ?? "unknown"}), clearing stale binding and retrying fresh: ...`);
    // existing clear + retry logic
}

Notes

The proposed fix aims to prevent hours-long silent degradation by clearing the stale session binding and retrying fresh on any FailoverError. Monitor its effect after rollout, in particular that failures of fresh (non-resumed) sessions still behave as before.

Recommendation

Apply the proposed fix: widen the condition in attempt-execution to catch any FailoverError raised while a session binding is active. This should prevent silent degradation and let a bad session self-heal without a gateway restart.
