openclaw - 💡(How to fix) Fix session lock deadlock on model timeout + failover rotation [1 comment, 1 participant]

GitHub stats
openclaw/openclaw#62351 · Fetched 2026-04-08 03:05:35
Comments: 1 · Participants: 1 · Timeline: 2 (closed ×1, commented ×1) · Reactions: 0


Bug

When a model times out, shouldRotateAssistant() in failover-policy.ts treats the timeout as retryable and rotates to the next profile/model. Each retry runs under the same session write lock (*.jsonl.lock), which is not released between attempts. External probes (new Discord messages, CLI commands, heartbeats) block on that lock until all fallback candidates are exhausted: 20+ minutes of lock contention per session.
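To see why running every retry under one lock hurts, here is a minimal arithmetic sketch. It is not OpenClaw code: `Attempt`, `longestLockHoldMs`, the candidate labels, and the 400-second per-attempt timeout are all hypothetical, picked only so that three attempts add up to the 20 minutes reported in this issue.

```typescript
// Hypothetical model of the contention described above; not OpenClaw code.
interface Attempt {
  candidate: string;  // illustrative profile/model label
  durationMs: number; // how long the attempt runs before it times out
}

// Longest continuous interval during which other callers are locked out.
function longestLockHoldMs(attempts: Attempt[], releaseBetweenAttempts: boolean): number {
  if (attempts.length === 0) return 0;
  const durations = attempts.map((a) => a.durationMs);
  return releaseBetweenAttempts
    ? Math.max(...durations) // each attempt is its own critical section
    : durations.reduce((sum, d) => sum + d, 0); // one critical section spans every retry
}

// Assumed per-attempt timeout of 400 s, for illustration only.
const attempts: Attempt[] = ["candidate-1", "candidate-2", "candidate-3"].map(
  (candidate) => ({ candidate, durationMs: 400000 }),
);

const heldAcrossRetriesMs = longestLockHoldMs(attempts, false); // 1200000 ms = 20 min
const releasedBetweenMs = longestLockHoldMs(attempts, true); // 400000 ms
```

Under these assumed numbers, a probe that arrives just after the first attempt starts waits for the sum of all attempts rather than for a single one.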

Impact

On 2026-04-06 ~23:26 PDT, a Copilot API outage caused all 4 fleet princes to time out simultaneously. The failover policy rotated each through 3 model candidates under the same session lock, producing 20+ minutes of total unresponsiveness. From the outside: "all dead."

Root Cause

// failover-policy.ts (before fix)
function shouldRotateAssistant(params) {
  return (
    (!params.aborted && (params.failoverFailure || params.failoverReason !== null)) ||
    (params.timedOut && !params.timedOutDuringCompaction)  // ← this line
  );
}

A timedOut result triggers rotation. Each rotation attempt holds the session lock, and the lock is never released between retries.
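The effect of the flagged line can be checked by evaluating the pre-fix expression on a few inputs. In this sketch the `RotationParams` field types are inferred from the snippet above and are an assumption, as are the names `shouldRotateBeforeFix` and `quiet`; the boolean expression itself is copied unchanged.

```typescript
// Field types are inferred from the snippet above (assumption, not the real type).
interface RotationParams {
  aborted: boolean;
  failoverFailure: boolean;
  failoverReason: string | null;
  timedOut: boolean;
  timedOutDuringCompaction: boolean;
}

// Same boolean expression as the pre-fix shouldRotateAssistant.
function shouldRotateBeforeFix(params: RotationParams): boolean {
  return (
    (!params.aborted && (params.failoverFailure || params.failoverReason !== null)) ||
    (params.timedOut && !params.timedOutDuringCompaction)
  );
}

// Baseline with no failure signal set.
const quiet: RotationParams = {
  aborted: false,
  failoverFailure: false,
  failoverReason: null,
  timedOut: false,
  timedOutDuringCompaction: false,
};

// A bare timeout is enough to request rotation.
const plainTimeout = shouldRotateBeforeFix({ ...quiet, timedOut: true }); // true

// The second clause never consults `aborted`, so this rotates as well.
const abortedTimeout = shouldRotateBeforeFix({ ...quiet, timedOut: true, aborted: true }); // true

// Timeouts during compaction are excluded by the second clause.
const compactionTimeout = shouldRotateBeforeFix({
  ...quiet,
  timedOut: true,
  timedOutDuringCompaction: true,
}); // false
```

With no other failure signal set, a timeout alone returns true, which is what sends the run into the next rotation attempt.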

Fix

Branch: fix/timeout-failover-lock-deadlock at 09bfe615fd

Treat timedOut and failoverReason === "timeout" as terminal (return false from shouldRotateAssistant). Timeout-triggered compaction is unaffected (separate code path in run.ts).

4 files changed, tests pass (30/30 failover-policy, 16/16 timeout-compaction, 11/11 overflow-compaction).

Evidence

  • Gateway logs: embedded run timeout → rotate_profile → fallback_model → all 3 candidates failed → FailoverError
  • Session lock held by pid 396 for 20+ minutes
  • CLI probe confirmed: session file locked (timeout 10000ms) on all 3 fallback attempts
  • Found by Codex diagnostic probe on Silas (urudyne)

Upstream-worthy

This affects all OpenClaw deployments with fallback models configured. Any provider timeout triggers the same lock contention pattern.

extent analysis

TL;DR

Treat timedOut and failoverReason === "timeout" as terminal conditions in the shouldRotateAssistant function to prevent lock contention.

Guidance

  • Review the failover-policy.ts file and update the shouldRotateAssistant function to return false when timedOut is set or failoverReason is "timeout".
  • Verify in the gateway logs that a timeout no longer triggers rotation: an embedded run timeout event should not be followed by rotate_profile or fallback_model events.
  • Test the updated shouldRotateAssistant function with timeout scenarios to ensure it correctly handles terminal conditions.
  • Consider implementing a timeout for the session lock acquisition to prevent prolonged lock contention.
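The testing bullet above can be sketched as a small table-driven check that accepts any candidate predicate, so it does not depend on one particular rewrite. The `RotationInput` shape is inferred from the snippets on this page; `findRotatingTimeouts`, `baseline`, and the scenario names are hypothetical, and the scenarios read the Fix section literally (any timedOut or failoverReason === "timeout" input is terminal).

```typescript
// Shape inferred from the snippets on this page (assumption).
interface RotationInput {
  aborted: boolean;
  failoverFailure: boolean;
  failoverReason: string | null;
  timedOut: boolean;
  timedOutDuringCompaction: boolean;
}

type RotationPredicate = (params: RotationInput) => boolean;

const baseline: RotationInput = {
  aborted: false,
  failoverFailure: false,
  failoverReason: null,
  timedOut: false,
  timedOutDuringCompaction: false,
};

// Timeout scenarios that should all be terminal (predicate returns false).
const timeoutScenarios: Array<[string, RotationInput]> = [
  ["timedOut only", { ...baseline, timedOut: true }],
  ["failoverReason is timeout", { ...baseline, failoverReason: "timeout" }],
  [
    "failoverFailure with timeout reason",
    { ...baseline, failoverFailure: true, failoverReason: "timeout" },
  ],
  [
    "timedOut during compaction",
    { ...baseline, timedOut: true, timedOutDuringCompaction: true },
  ],
];

// Returns the names of scenarios in which the predicate still asks for rotation.
function findRotatingTimeouts(predicate: RotationPredicate): string[] {
  return timeoutScenarios
    .filter(([, params]) => predicate(params))
    .map(([name]) => name);
}
```

An empty result means every listed timeout scenario is treated as terminal. Passing the pre-fix expression instead reports the first three scenarios.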

Example

// failover-policy.ts (updated; a sketch of the fix as described above, not the actual patch)
function shouldRotateAssistant(params) {
  // Timeouts are terminal: return false so the run does not rotate while holding the session lock.
  if (params.timedOut || params.failoverReason === "timeout") {
    return false;
  }
  // Every other failover signal rotates as before.
  return !params.aborted && (params.failoverFailure || params.failoverReason !== null);
}

Notes

This fix assumes that the timedOut and failoverReason conditions are correctly set in the params object. Additional logging or debugging may be necessary to ensure the correct behavior.

Recommendation

Apply the fix by updating the shouldRotateAssistant function to treat timedOut and failoverReason === "timeout" as terminal conditions. A timed-out run then fails once instead of rotating through every fallback candidate under the session lock, so the lock is freed sooner and external probes are no longer blocked.
