openclaw - ✅(Solved) Fix Auth profile failover blocked by file lock contention on rate_limit errors [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#57281 (fetched 2026-04-08 01:51:37)
Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 0
Timeline (top): cross-referenced ×1

Error Message

  • Error: rate_limit_error ("Extra usage is required for long context requests")

Root Cause

In src/agents/pi-embedded-runner/run.ts, the shouldRotate block calls await maybeMarkAuthProfileFailure() BEFORE await advanceAuthProfile(). For rate_limit errors, markAuthProfileFailure acquires a file lock via updateAuthProfileStoreWithLock. Under contention (common with 10+ concurrent agents), this blocks for minutes — observed 17-minute gap in production logs.

By the time the lock releases and advanceAuthProfile runs, the run's abort signal or session has been cleaned up. The rotation is logged but leads nowhere.

For timeout errors, maybeMarkAuthProfileFailure returns immediately (early exit on reason === "timeout"), so no lock contention occurs.
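
The ordering problem can be reduced to a small timeline model. The sketch below is illustrative only: `rotationStartDelaySec`, `rotationIsUseful`, and the 60-second run lifetime are hypothetical, and only the two timestamps (13:27:55 and 13:44:17) come from the logs reported in this issue.

```typescript
// Hypothetical timeline model of the failover ordering; not openclaw code.
type Order = "mark-then-rotate" | "rotate-then-mark";

/** Seconds after the rate_limit error at which advanceAuthProfile would start. */
function rotationStartDelaySec(order: Order, lockWaitSec: number): number {
  // Awaiting the cooldown mark first makes rotation inherit the full lock wait.
  return order === "mark-then-rotate" ? lockWaitSec : 0;
}

/** True if rotation starts while the run is still alive. */
function rotationIsUseful(
  order: Order,
  lockWaitSec: number,
  runLifetimeSec: number,
): boolean {
  return rotationStartDelaySec(order, lockWaitSec) < runLifetimeSec;
}

// Lock wait derived from the reported log timestamps: 13:27:55 -> 13:44:17.
const observedLockWaitSec =
  (13 * 3600 + 44 * 60 + 17) - (13 * 3600 + 27 * 60 + 55);

// Assumption for illustration: the run is cleaned up 60 s after the error.
const assumedRunLifetimeSec = 60;

const markFirstUseful = rotationIsUseful(
  "mark-then-rotate", observedLockWaitSec, assumedRunLifetimeSec);
const rotateFirstUseful = rotationIsUseful(
  "rotate-then-mark", observedLockWaitSec, assumedRunLifetimeSec);

console.log(observedLockWaitSec); // 982
console.log(markFirstUseful);     // false
console.log(rotateFirstUseful);   // true
```

Whatever the real cleanup deadline is, the mark-first order only works when the lock wait is shorter than that deadline, while the rotate-first order does not depend on the lock at all.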

Fix

Reorder operations: rotate FIRST, then mark cooldown non-blocking.

Since cooldown checks are per-profile (isProfileInCooldown only reads the candidate profile's stats), marking the failed profile's cooldown does not affect the next profile's eligibility. Making it fire-and-forget with error logging eliminates the lock contention without changing semantics.

Applied at both failover sites: assistant-side (~line 1185) and prompt-side (~line 1049).
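
The per-profile argument can be checked with a minimal sketch. Everything here is an assumption made for illustration: the `ProfileStats` shape, the `Map`-based store, the signature given to `isProfileInCooldown`, the `nextEligibleProfile` helper, and the profile ids. The real store in openclaw may look different; the point is only that the candidate's eligibility is computed from the candidate's own stats.

```typescript
// Hypothetical data shapes; the real auth profile store may differ.
interface ProfileStats {
  cooldownUntilMs?: number;
}
type StatsStore = Map<string, ProfileStats>;

// Reads only the candidate profile's stats, as described in the issue.
function isProfileInCooldown(
  store: StatsStore,
  profileId: string,
  nowMs: number,
): boolean {
  const until = store.get(profileId)?.cooldownUntilMs;
  return until !== undefined && until > nowMs;
}

// Hypothetical helper: first profile after `failedId` that is not cooling down.
function nextEligibleProfile(
  order: string[],
  failedId: string,
  store: StatsStore,
  nowMs: number,
): string | undefined {
  const start = order.indexOf(failedId);
  for (let step = 1; step < order.length; step++) {
    const candidate = order[(start + step) % order.length];
    if (candidate !== undefined && !isProfileInCooldown(store, candidate, nowMs)) {
      return candidate;
    }
  }
  return undefined;
}

const profileOrder = ["anthropic:a", "anthropic:b"]; // hypothetical ids
const nowMs = 1_000_000;
const statsStore: StatsStore = new Map();

// Rotation decided before the failed profile has been marked.
const beforeMark = nextEligibleProfile(profileOrder, "anthropic:a", statsStore, nowMs);

// Rotation decided after the failed profile has been marked.
statsStore.set("anthropic:a", { cooldownUntilMs: nowMs + 60_000 });
const afterMark = nextEligibleProfile(profileOrder, "anthropic:a", statsStore, nowMs);

console.log(beforeMark, afterMark); // anthropic:b anthropic:b
```

Both calls select the same profile, which is why deferring the mark does not change which profile the rotation lands on.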

PR fix notes

PR #57283: fix(failover): defer profile cooldown marking to unblock rate-limit rotation

Description (problem / solution / changelog)

Summary

  • markAuthProfileFailure acquires a file lock (updateAuthProfileStoreWithLock) before advanceAuthProfile can rotate to the next profile. Under contention (10+ concurrent agents), this blocks for minutes — observed 17-minute gap in production logs
  • Rate-limit failover logs decision=rotate_profile but never executes it. Timeout failover works because maybeMarkAuthProfileFailure returns immediately (early exit on reason === "timeout")
  • Fix: rotate FIRST, then mark cooldown non-blocking (fire-and-forget with error logging). Cooldown checks are per-profile, so marking the failed profile doesn't affect the next profile's eligibility

Test plan

  • Rate-limit on profile A triggers immediate rotation to profile B (no multi-minute stall)
  • Timeout failover path unchanged (still skips marking)
  • Cooldown is still eventually marked on the failed profile
  • Multi-profile concurrent agents don't deadlock on auth store file lock
  • Prompt-side failover also rotates before marking

Fixes #57281

🤖 Generated with Claude Code

Changed files

  • src/agents/pi-embedded-runner/run.ts (modified, +46/-24)

Code Example

13:27:55 embedded run agent end: isError=true (retry 4, rate_limit)
13:44:17 auth profile failure state updated: window=cooldown  ← 17 min gap (lock wait)
13:44:17 failover decision: rotate_profile                    ← logged but too late
         (no follow-up — run already dead)

Bug

When an auth profile hits a 429 rate_limit_error, the gateway logs decision=rotate_profile but never executes the rotation to the next profile. The run terminates as failed. Timeout errors rotate correctly.

Evidence

Full bug report with 6 affected runIds (3 broken rate_limit, 3 working timeout) and log analysis: see linked report.

Environment

  • OpenClaw 2026.3.28
  • macOS, 10+ concurrent agents
  • Two Anthropic auth profiles in auth.order.anthropic
  • Error: rate_limit_error ("Extra usage is required for long context requests")

extent analysis

Fix Plan

To fix the issue, we need to reorder the operations in src/agents/pi-embedded-runner/run.ts to rotate the auth profile first and then mark it as failed in a non-blocking manner.

Here are the concrete steps:

  • Modify the shouldRotate block to call await advanceAuthProfile() before maybeMarkAuthProfileFailure().
  • Make maybeMarkAuthProfileFailure non-blocking by using a fire-and-forget approach with error logging.

Example code changes:

// Before
if (shouldRotate) {
  await maybeMarkAuthProfileFailure();
  await advanceAuthProfile();
}

// After
if (shouldRotate) {
  await advanceAuthProfile();
  maybeMarkAuthProfileFailure().catch((error) => {
    // Log the error
    console.error('Error marking auth profile as failed:', error);
  });
}
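
One caveat with the pattern above: `.catch` only handles a rejected promise. If the marking function were ever to throw before returning a promise, the error would escape into the failover path. A small wrapper can cover both cases. This is a sketch, not code from the PR: `fireAndForget` and its parameters are hypothetical names.

```typescript
// Hypothetical helper; not part of the openclaw codebase.
function fireAndForget(
  task: () => Promise<unknown>,
  onError: (error: unknown) => void,
): void {
  try {
    // Attach the handler right away so a rejection never goes unhandled.
    task().catch(onError);
  } catch (error) {
    // Covers a task that throws before it returns a promise.
    onError(error);
  }
}

const loggedErrors: unknown[] = [];
const logError = (error: unknown): void => {
  loggedErrors.push(error);
};

// A task that throws synchronously is reported, not propagated.
fireAndForget(() => {
  throw new Error("simulated synchronous failure");
}, logError);

// A task that never settles (a stuck lock, say) does not block the caller.
fireAndForget(() => new Promise<void>(() => {}), logError);

console.log(loggedErrors.length); // 1
```

In the failover block this would be used as `fireAndForget(() => maybeMarkAuthProfileFailure(), logError)` after `await advanceAuthProfile()`, assuming the marking function takes no required arguments at that call site.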

Verification

To verify that the fix worked, check the production logs for the presence of successful auth profile rotations after encountering a rate_limit_error. The log pattern should change to:

13:27:55 embedded run agent end: isError=true (retry 4, rate_limit)
13:27:56 failover decision: rotate_profile
13:27:56 auth profile failure state updated: window=cooldown

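The gap between the error and the rotation decision should shrink from the 17-minute stall reported in this issue to about a second, and the rotation decision should now be logged before the cooldown update rather than after it.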
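
The gap can be measured directly from log lines instead of being read by eye. The sketch below assumes each line starts with an `HH:MM:SS` timestamp and contains the message text quoted in this issue; `timestampSec` and `rotationGapSec` are hypothetical helpers, and the "after" lines are the expected pattern from this section, not captured output. It does not handle a gap that crosses midnight.

```typescript
// Converts a leading "HH:MM:SS" timestamp to seconds since midnight.
function timestampSec(line: string): number {
  const match = /^(\d{2}):(\d{2}):(\d{2})\b/.exec(line);
  if (!match) throw new Error(`no timestamp in: ${line}`);
  return Number(match[1]) * 3600 + Number(match[2]) * 60 + Number(match[3]);
}

// Seconds between the failed run ending and the rotate_profile decision.
function rotationGapSec(lines: string[]): number {
  const errorLine = lines.find((l) => l.includes("embedded run agent end"));
  const decisionLine = lines.find((l) =>
    l.includes("failover decision: rotate_profile"));
  if (!errorLine || !decisionLine) {
    throw new Error("expected both an error line and a decision line");
  }
  return timestampSec(decisionLine) - timestampSec(errorLine);
}

// Log lines reported in the issue (before the fix).
const beforeFix = [
  "13:27:55 embedded run agent end: isError=true (retry 4, rate_limit)",
  "13:44:17 auth profile failure state updated: window=cooldown",
  "13:44:17 failover decision: rotate_profile",
];

// Expected pattern after the fix, as given in this section.
const afterFix = [
  "13:27:55 embedded run agent end: isError=true (retry 4, rate_limit)",
  "13:27:56 failover decision: rotate_profile",
  "13:27:56 auth profile failure state updated: window=cooldown",
];

console.log(rotationGapSec(beforeFix)); // 982
console.log(rotationGapSec(afterFix));  // 1
```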

Extra Tips

  • Make sure to test the changes with multiple concurrent agents to ensure that the fix works under contention.
  • Consider adding additional logging to monitor the performance of the auth profile rotation mechanism.
