openclaw - 💡(How to fix) Fix OAuth auth-profiles.json lock contention: gateway times out when external processes hold non-sidecar locks

Error Message

When an external process contends for the lock, the gateway's OAUTH_REFRESH_LOCK_OPTIONS (20 retries, exponential backoff to 10s max, 180s stale) eventually times out with code: "file_lock_timeout", surfacing as the "Model login failed" user-facing error from agent-runner-execution-Cgdfai7Y.js. The fallback logic in resolveOAuthAccess does try to re-read the store and adopt a fresher credential on lock failure. That fallback only runs when the lock acquisition itself fails; it does not help while a stale sidecar from a different lock library blocks acquisition indefinitely.

Root Cause

OpenClaw uses openclaw.plugin-sdk.file-lock (acquireFileLock in dist/file-lock-CCOJxG89.js) to coordinate writes to auth-profiles.json. It creates a .lock sidecar file with a structured payload (pid, createdAt) and uses shouldRemoveDeadOwnerOrExpiredLock for staleness recovery.

External processes that use a different lock mechanism, such as fcntl.flock (Linux/macOS), Windows LockFileEx, or proper-lockfile with different defaults, are invisible to the OpenClaw lock library and vice versa. The two locks don't conflict at the OS level, so both processes think they hold the file. Both sides can also create .lock sidecars that interfere with the other side's staleness detection.
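
The "invisible to each other" point can be illustrated with a small Python sketch (POSIX only). It takes an exclusive fcntl.flock on a file and then creates a sidecar next to it with exclusive-create semantics; both succeed, because the kernel lock and the sidecar file know nothing about each other. The `<file>.lock` naming and the function name are assumptions for illustration, not OpenClaw's actual implementation.

```python
import fcntl
import os
import tempfile


def demo_invisible_locks(directory: str) -> tuple[bool, bool]:
    """Return (flock_acquired, sidecar_created) for one target file."""
    target = os.path.join(directory, "auth-profiles.json")
    sidecar = target + ".lock"  # assumed sidecar naming, illustration only
    with open(target, "w") as fh:
        fh.write("{}")

    # Style A: an advisory kernel lock on the file itself.
    holder = open(target, "r+")
    fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)
    flock_acquired = True

    # Style B: exclusive creation of a sidecar file next to the target.
    try:
        fd = os.open(sidecar, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        sidecar_created = True
    except FileExistsError:
        sidecar_created = False
    finally:
        fcntl.flock(holder, fcntl.LOCK_UN)
        holder.close()
    return flock_acquired, sidecar_created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_invisible_locks(tmp))  # (True, True)
```

Neither acquisition blocks or fails, which is why each process can believe it owns the file.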

Filing this so others hit by the same wall can find context.

Symptom

Users running scripts that interact with ~/.openclaw/agents/<agent>/agent/auth-profiles.json (e.g. helpers that share an OAuth profile to make their own calls) see the gateway intermittently surface:

⚠️ Model login failed on the gateway for xai. Please try again. If this keeps happening, re-auth with openclaw models auth login --provider xai.

with logs showing:

OAuth token refresh failed for xai: file lock timeout for /Users/.../auth-profiles.json. Please try again or re-authenticate.

…even though the OAuth profile is healthy and has a valid refresh token.

Reproducer

1. Hold an exclusive fcntl.flock on auth-profiles.json for more than 60s from a separate Python process.
2. Trigger an xAI request whose token is within refreshMarginMs (5 min) of expiry.
3. The gateway request fails with the symptom above; the .lock sidecar from the Python process can linger after the script exits.
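
The lock-holding step could look roughly like the sketch below (POSIX only). It takes only the flock and does not write a .lock sidecar, so on its own it does not recreate the lingering sidecar mentioned in the last step. The function name `hold_flock` is illustrative.

```python
import fcntl
import time


def hold_flock(path: str, seconds: float) -> float:
    """Hold an exclusive flock on `path` for `seconds`; return the time held."""
    with open(path, "r+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until the lock is granted
        started = time.monotonic()
        try:
            time.sleep(seconds)
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
        return time.monotonic() - started
```

For the reproducer, call it with the real profile path and a duration above 60 seconds, for example `hold_flock(os.path.expanduser("~/.openclaw/agents/<agent>/agent/auth-profiles.json"), 90)` with `<agent>` filled in.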

Suggested upstream improvements

Document the lock contract. Tell skill/script authors that auth-profiles.json should be treated as gateway-owned and only read (atomic rename makes consistent reads safe). External writers need to use the same sidecar lock format and payload schema, or stick to reading.
Make the lock failure self-healing. When acquireFileLock times out, before re-throwing, do an unlocked read of the store and adopt any credential that's fresher than the one we entered with. This is what resolveOAuthAccess already tries inside the catch, but it only runs after the full timeout (currently ~12 min worst case across both nested locks). A short bypass-read on first failure would let "someone else just refreshed" cases resolve in under a second.
Detect non-OpenClaw .lock sidecars. If acquireFileLock sees a .lock file that doesn't deserialize to the expected payload shape ({pid, createdAt}), treat it as a non-OpenClaw lock and either (a) wait a shorter timeout before forcing removal, or (b) emit a one-line diagnostic so users know who they're contending with.
Publish a stable Python/Go helper under @openclaw/lock-helpers that wraps the sidecar protocol so external scripts can coordinate properly when they genuinely need to.
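
The sidecar-detection idea can be sketched as a small classifier. This is written in Python for illustration only (the real check would live inside acquireFileLock), and it assumes the payload is serialized as JSON, which the issue does not confirm; only the {pid, createdAt} shape comes from the description above. The name `classify_sidecar` is hypothetical.

```python
import json
from typing import Literal

SidecarKind = Literal["missing", "openclaw", "foreign"]


def classify_sidecar(lock_path: str) -> SidecarKind:
    """Classify a .lock sidecar by whether it parses to {pid, createdAt}."""
    try:
        with open(lock_path, "r", encoding="utf-8") as fh:
            payload = json.load(fh)  # assumption: payload is JSON
    except FileNotFoundError:
        return "missing"
    except (ValueError, OSError):
        # Empty, non-JSON, undecodable, or not a regular file (e.g. a
        # directory): not the expected payload shape.
        return "foreign"
    if (
        isinstance(payload, dict)
        and isinstance(payload.get("pid"), int)
        and "createdAt" in payload
    ):
        return "openclaw"
    return "foreign"
```

A "foreign" result is where the shorter timeout or the one-line diagnostic would hook in.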

Workaround (no patch needed for users)

External scripts can simply avoid writing to auth-profiles.json altogether. They should:

Read the file under no lock (the gateway uses atomic rename, so reads are consistent snapshots).
If they need to refresh on their own, write to a separate cache file they control. Never touch the source-of-truth profile.
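
A minimal sketch of that pattern, assuming the store is JSON and that the script keeps its own cache file somewhere it controls. The helper names and the cache contents are hypothetical, and the profile is treated as an opaque document because its schema is not described here.

```python
import json
import os
import tempfile
from typing import Any


def read_profiles(path: str) -> Any:
    """Read the gateway-owned store without taking any lock."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def write_cache(cache_path: str, data: Any) -> None:
    """Atomically write script-local state; never point this at auth-profiles.json."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    # mkstemp creates the temp file readable and writable by the owner only.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, cache_path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing the cache through a temp file plus os.replace mirrors the atomic-rename behaviour the gateway relies on, so readers of the cache also see consistent snapshots.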

This is what I did in HangGlidersRule/coding-router 1c6b7d9 after hitting this issue. The script previously held an fcntl.flock across its OAuth refresh, which broke every gateway xAI call running concurrently. Switching to read-only access on auth-profiles.json, with refreshes written to a script-local cache, fully resolved the gateway-side symptom.

Environment

OpenClaw 2026.5.19 (a185ca2) on macOS Darwin 24.3.0 (arm64)
Node v26.0.0
xAI Tier 4 OAuth (SuperGrok Heavy)

Happy to PR the docs change for the lock contract if that would be useful.
