openclaw - OAuth auth-profiles.json lock contention: gateway times out when external processes hold non-sidecar locks

Filing this so others hit by the same wall can find context.

Symptom

Users running scripts that interact with ~/.openclaw/agents/<agent>/agent/auth-profiles.json (e.g. helpers that share an OAuth profile to make their own calls) see the gateway intermittently surface:

⚠️ Model login failed on the gateway for xai. Please try again. If this keeps happening, re-auth with openclaw models auth login --provider xai.

with logs showing:

OAuth token refresh failed for xai: file lock timeout for /Users/.../auth-profiles.json. Please try again or re-authenticate.

…even though the OAuth profile is healthy and has a valid refresh token.

Root cause

OpenClaw uses openclaw.plugin-sdk.file-lock (acquireFileLock in dist/file-lock-CCOJxG89.js) to coordinate writes to auth-profiles.json. It creates a .lock sidecar file with a structured payload (pid, createdAt) and uses shouldRemoveDeadOwnerOrExpiredLock for staleness recovery.
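As a rough illustration of the sidecar payload described above, the Python sketch below classifies a .lock file as absent, foreign, stale, or held. The on-disk layout (JSON with an integer pid and a createdAt in epoch milliseconds), the function names, and the category labels are assumptions made for illustration; they are not taken from the OpenClaw source. The 180-second threshold is the stale value quoted in this issue.

```python
import json
import os
import time
from typing import Optional

STALE_AFTER_S = 180  # the "180s stale" figure quoted in this issue


def read_sidecar(lock_path: str) -> Optional[dict]:
    """Return the payload if it looks like {pid, createdAt}, else None.

    ASSUMPTION: the sidecar is JSON with an integer pid and a numeric
    createdAt in epoch milliseconds. The real format may differ.
    """
    try:
        with open(lock_path, "r", encoding="utf-8") as fh:
            payload = json.load(fh)
    except (OSError, ValueError):
        return None  # unreadable, empty, or not JSON
    if not isinstance(payload, dict):
        return None
    pid = payload.get("pid")
    created = payload.get("createdAt")
    if isinstance(pid, bool) or not isinstance(pid, int):
        return None
    if isinstance(created, bool) or not isinstance(created, (int, float)):
        return None
    return payload


def pid_alive(pid: int) -> bool:
    """POSIX-only liveness probe: signal 0 checks existence without signalling."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def classify_sidecar(lock_path: str, now_ms: Optional[float] = None) -> str:
    """Return 'absent', 'foreign', 'stale', or 'held' (hypothetical labels)."""
    if not os.path.exists(lock_path):
        return "absent"
    payload = read_sidecar(lock_path)
    if payload is None:
        return "foreign"  # written by some other lock mechanism
    if now_ms is None:
        now_ms = time.time() * 1000
    age_s = (now_ms - payload["createdAt"]) / 1000
    if not pid_alive(payload["pid"]) or age_s > STALE_AFTER_S:
        return "stale"  # dead owner or expired lock
    return "held"
```

An empty lock file, which flock-style tools often leave behind, falls into the "foreign" bucket here because it does not parse as JSON.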

External processes that use a different lock mechanism (fcntl.flock on Linux/macOS, Windows LockFileEx, or proper-lockfile with different defaults) are invisible to the OpenClaw lock library and vice versa. The two locks don't conflict at the OS level, so both processes think they hold the file. Both sides can also create .lock sidecars that interfere with the other side's staleness detection.

When this happens, the gateway's OAUTH_REFRESH_LOCK_OPTIONS (20 retries, exponential backoff to 10s max, 180s stale) eventually times out with code: "file_lock_timeout", surfacing as the "Model login failed" user-facing error from agent-runner-execution-Cgdfai7Y.js. The fallback logic in resolveOAuthAccess does try to re-read the store and adopt a fresher credential on lock failure, but only when the lock itself fails to acquire, not when a stale sidecar from a different lock library blocks it indefinitely.

Reproducer

  1. Hold an exclusive fcntl.flock on auth-profiles.json for >60s from a separate Python process.
  2. Trigger an xAI request whose token is within refreshMarginMs (5 min) of expiry.
  3. Gateway request fails with the symptom above; the .lock sidecar from the Python process can linger after the script exits.
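Step 1 can be approximated with a few lines of Python. This is a minimal sketch, not the script that originally triggered the problem: the helper names are made up for illustration, it only takes the flock and does not create any .lock sidecar, and it runs only on platforms that provide the fcntl module (Linux/macOS).

```python
import contextlib
import fcntl
import time


@contextlib.contextmanager
def exclusive_flock(path: str):
    """Hold an exclusive advisory flock on `path` for the duration of the block."""
    # Opened read-only so the profile file is never modified or truncated.
    with open(path, "rb") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def hold_flock(path: str, seconds: float) -> None:
    """Keep the exclusive flock for `seconds`, then release it."""
    with exclusive_flock(path):
        time.sleep(seconds)
```

Calling `hold_flock(path_to_auth_profiles, 90)` from a separate Python process, where `path_to_auth_profiles` points at the agent's auth-profiles.json, keeps the lock for longer than the 60 seconds the reproducer asks for.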

Suggested upstream improvements

  1. Document the lock contract. Tell skill/script authors that auth-profiles.json should be treated as gateway-owned and only read (atomic-rename makes consistent reads safe). External writes need to use the same sidecar lock format and payload schema, or stick to reading.
  2. Make the lock failure self-healing. When acquireFileLock times out, before re-throwing, do an unlocked read of the store and adopt any credential that's fresher than the one we entered with. This is what resolveOAuthAccess already tries inside the catch, but it only runs after the full timeout (currently ~12 min worst case across both nested locks). A short bypass-read on first failure would let "someone else just refreshed" cases resolve in under a second.
  3. Detect non-OpenClaw .lock sidecars. If acquireFileLock sees a .lock file that doesn't deserialize to the expected payload shape ({pid, createdAt}), treat it as a non-OpenClaw lock and either (a) wait a shorter timeout before forcing removal, or (b) emit a one-line diagnostic so users know who they're contending with.
  4. Publish a stable Python/Go helper under @openclaw/lock-helpers that wraps the sidecar protocol so external scripts can coordinate properly when they genuinely need to.
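To make suggestion 2 more concrete, here is a hedged Python sketch of a bypass-read on the first lock failure. Everything specific in it is an assumption for illustration: the store schema ({"profiles": {id: {"expires": <epoch ms>}}}), the LockTimeout exception, and the try_once / retry_until_timeout callables stand in for whatever the gateway actually uses. Only the 5-minute margin comes from the refreshMarginMs value mentioned in the reproducer.

```python
import json
import time
from typing import Callable, Optional

REFRESH_MARGIN_MS = 5 * 60 * 1000  # refreshMarginMs (5 min) from this issue


class LockTimeout(Exception):
    """Hypothetical stand-in for a file_lock_timeout error."""


def read_credential(store_path: str, profile_id: str) -> Optional[dict]:
    """Unlocked read of one credential. The schema is an ASSUMPTION."""
    try:
        with open(store_path, "r", encoding="utf-8") as fh:
            store = json.load(fh)
    except (OSError, ValueError):
        return None
    if not isinstance(store, dict) or not isinstance(store.get("profiles"), dict):
        return None
    cred = store["profiles"].get(profile_id)
    return cred if isinstance(cred, dict) else None


def adopt_if_fresher(current: dict, store_path: str, profile_id: str,
                     now_ms: Optional[float] = None) -> Optional[dict]:
    """Return the stored credential if it outlives `current` and is usable."""
    candidate = read_credential(store_path, profile_id)
    if candidate is None:
        return None
    if now_ms is None:
        now_ms = time.time() * 1000
    cand_exp = candidate.get("expires")
    cur_exp = current.get("expires")
    if not isinstance(cand_exp, (int, float)):
        return None
    if not isinstance(cur_exp, (int, float)):
        cur_exp = 0
    # Fresher than what we entered with, and not itself about to expire.
    if cand_exp > cur_exp and cand_exp - now_ms > REFRESH_MARGIN_MS:
        return candidate
    return None


def refresh_with_bypass(current: dict, store_path: str, profile_id: str,
                        try_once: Callable[[], dict],
                        retry_until_timeout: Callable[[], dict],
                        now_ms: Optional[float] = None) -> dict:
    """One quick lock attempt, then a bypass-read, then the full retry loop."""
    try:
        return try_once()
    except LockTimeout:
        adopted = adopt_if_fresher(current, store_path, profile_id, now_ms)
        if adopted is not None:
            return adopted  # someone else already refreshed
        return retry_until_timeout()
```

The design point is ordering: the unlocked read happens after the first failed attempt rather than after the whole retry budget, so the common "another process just refreshed" case returns quickly while genuine contention still falls through to the existing retry path.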

Workaround (no patch needed for users)

External scripts can avoid the problem by never writing to auth-profiles.json. They should:

  • Read the file under no lock (the gateway uses atomic rename, so reads are consistent snapshots).
  • If they need to refresh on their own, write to a separate cache file they control. Never touch the source-of-truth profile.
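A minimal Python sketch of the two rules above, assuming the profile store is JSON and the script keeps its own cache file at a path of its choosing. The function names are illustrative and do not come from coding-router or OpenClaw.

```python
import json
import os
import tempfile


def read_profiles(profile_path: str) -> dict:
    """Read the gateway-owned store without taking any lock and without writing."""
    with open(profile_path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def write_cache(cache_path: str, data: dict) -> None:
    """Atomically write a script-owned cache file (temp file, then os.replace)."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    os.makedirs(directory, exist_ok=True)
    # mkstemp creates the file readable and writable only by the current user,
    # which suits token material.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".cache-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # The rename is atomic within one filesystem, so readers of the cache
        # see either the old file or the new one, never a partial write.
        os.replace(tmp_path, cache_path)
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The script reads the shared profile with `read_profiles`, performs its own refresh if needed, and stores the result with `write_cache` in a file that only it owns, so it never competes with the gateway for auth-profiles.json.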

This is what I did in HangGlidersRule/coding-router 1c6b7d9 after hitting this issue. The script previously held an fcntl.flock across its OAuth refresh, which broke every gateway xAI call running concurrently. Removing that lock (the script is now read-only on auth-profiles.json and refreshes into a script-local cache) fully resolved the gateway-side symptom.

Environment

  • OpenClaw 2026.5.19 (a185ca2) on macOS Darwin 24.3.0 (arm64)
  • Node v26.0.0
  • xAI Tier 4 OAuth (SuperGrok Heavy)

Happy to PR the docs change for the lock contract if that'd be useful.
