openclaw - ✅(Solved) Fix EPERM on auth-profiles.json causes full gateway failure cascade (Windows) [2 pull requests, 1 participant]

openclaw · 2026-04-06 19:46:07


openclaw/openclaw#62099 • Fetched 2026-04-08 03:09:00


Author: Hag-Fish

Participants: Hag-Fish

auth-profiles.json can acquire a Windows ReadOnly attribute during concurrent config writes. Once it does, every LLM request fails with EPERM: operation not permitted. The failed write is only post-request bookkeeping, but it is treated as a fatal request error, so it cascades through the fallback chain and leaves the gateway completely unresponsive.

Error Message

Error: EPERM: operation not permitted, copyfile 
  'C:\Users\OpenClaw\.openclaw\agents\main\agent\auth-profiles.json.<uuid>.tmp' 
  -> 'C:\Users\OpenClaw\.openclaw\agents\main\agent\auth-profiles.json'
    at Object.copyFileSync (node:fs:3104:11)
    at renameJsonFileWithFallback (json-file-1PGlTqjr.js:63:7)
    at saveJsonFile (json-file-1PGlTqjr.js:98:3)
    at saveAuthProfileStore (store-HF_Z-jKz.js:427:2)
    at markAuthProfileGood (profiles-DKQdaSwr.js:76:2)
    at pi-embedded-DWASRjxE.js:36473:7

Workaround

attrib -R "C:\Users\OpenClaw\.openclaw\agents\main\agent\auth-profiles.json"

Then restart the gateway.

PR fix notes

PR #67064: fix(auth-profiles): make post-success bookkeeping saves non-fatal

Repository: openclaw/openclaw
Author: ademczuk
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/67064

Description (problem / solution / changelog)

Summary

Fixes #62099. On Windows, concurrent config hot-reload can leave auth-profiles.json with a ReadOnly attribute. The atomic write in saveAuthProfileStore then throws EPERM, and because markAuthProfileGood / markAuthProfileUsed / markAuthProfileFailure run as post-completion bookkeeping, that throw used to cascade into the LLM request that had already succeeded. Fallback triggers, hits the same read-only file, fails the same way. The gateway becomes unresponsive; restarts don't help because the file attribute persists.

The fix wraps the body of each mark* function in try/catch, logging the persistence failure and continuing. Caller-visible behavior is unchanged on the happy path.

What this does NOT change

saveAuthProfileStore itself still throws on failure. OAuth token refresh (in oauth.ts) depends on that behavior, since a silent token-save failure would be a security concern. Only the three mark* functions that run after a successful provider call now tolerate save errors.
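
The split described above can be sketched as follows. This is a minimal illustration, not the PR's code: the function names are borrowed from the description, but the signatures, the `log` object, and the `warnings` array are assumptions made up for the example.

```javascript
// Hypothetical stand-ins: the real saveAuthProfileStore and markAuthProfileGood
// live in src/agents/auth-profiles/ and have different signatures.
const warnings = [];
const log = { warn: (msg) => warnings.push(msg) };

// Simulates the persistence layer hitting a read-only file.
function saveAuthProfileStore(store) {
  const err = new Error("EPERM: operation not permitted, copyfile");
  err.code = "EPERM";
  throw err; // still throws, so callers such as OAuth refresh see the failure
}

// Post-success bookkeeping: update in-memory state, then try to persist.
// The whole body is wrapped so a failed save is logged instead of rethrown.
async function markAuthProfileGood(store, profileId) {
  try {
    store.lastGood = profileId;
    saveAuthProfileStore(store);
  } catch (err) {
    log.warn(`auth profile bookkeeping not persisted: ${err.message}`);
  }
}
```

In this sketch the in-memory `store` is still updated when the save fails, which mirrors the PR's statement that disk state only goes stale until the next successful write.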

Scope

src/agents/auth-profiles/profiles.ts - wrap markAuthProfileGood body in try/catch, log warn
src/agents/auth-profiles/usage.ts - wrap markAuthProfileUsed and markAuthProfileFailure bodies in try/catch, log warn
src/agents/auth-profiles/usage.persist-nonfatal.test.ts - new regression tests, one per mark function, simulating EPERM from both the lock-guarded and direct save paths

markAuthProfileCooldown delegates to markAuthProfileFailure so it's covered transitively.

Before

Error: EPERM: operation not permitted, copyfile
  'auth-profiles.json.<uuid>.tmp' -> 'auth-profiles.json'
    at Object.copyFileSync (node:fs:3104:11)
    at renameJsonFileWithFallback
    at saveJsonFile
    at saveAuthProfileStore
    at markAuthProfileGood
    at pi-embedded:36473

The LLM response arrived, the save threw, every fallback hit the same file, and the gateway ran out of models. To recover, the user had to run attrib -R on the file and restart.

After

The save throws, the catch runs, the warning lands in the subsystem log, the mark function returns void, the LLM request completes normally. lastGood / usage stats stay slightly stale until the next successful save, which is the right tradeoff for a bookkeeping write.

Testing

New tests in usage.persist-nonfatal.test.ts pass (3/3). Mocks both updateAuthProfileStoreWithLock and saveAuthProfileStore to throw EPERM, asserts the mark* functions resolve without throwing.
Existing usage.test.ts (39 tests) and auth-profiles.markauthprofilefailure.test.ts (9 tests) still pass. Also verified auth-profiles.runtime-snapshot-save.test.ts (1 test).
Two pre-existing test failures in state-observation.test.ts and oauth.fallback-to-main-agent.test.ts already fail on main, unrelated to this change.
oxlint --type-aware clean on modified files.
tsgo --noEmit clean (exit 0).
oxfmt --check clean.

Risk

Low. Behavioral change is scoped to the error path of a bookkeeping function. Callers that previously got an unhandled rejection on EPERM now get a resolved promise, which is the intended outcome. Profile state in memory stays authoritative; disk just gets slightly stale until the next successful write.

AI disclosure

This change was drafted with Claude Code acting as coding assistant. The issue was picked from the triaged backlog, the root-cause analysis and implementation plan were produced interactively, and tests were written against the exact stack trace in the issue report. The patch and regression tests were human-reviewed before submission.

Understanding confirmation

Yes, I've read CONTRIBUTING.md and VISION.md. This is a single-concern bug fix with no dependency updates, no schema changes, and no new public APIs. The affected module is src/agents/auth-profiles/. No Carbon changes, no @ts-nocheck, no lint suppression.

Changed files

CHANGELOG.md (modified, +1/-0)
src/agents/auth-profiles/profiles.ts (modified, +32/-20)
src/agents/auth-profiles/usage.persist-nonfatal.test.ts (added, +87/-0)
src/agents/auth-profiles/usage.ts (modified, +113/-91)

PR #67077: fix(auth-profiles): make post-success bookkeeping saves non-fatal

Repository: openclaw/openclaw
Author: ademczuk
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/67077

Description (problem / solution / changelog)

Summary

The fix wraps the body of each mark* function in try/catch, logging the persistence failure and continuing. Caller-visible behavior is unchanged on the happy path.

Change Type (select all)

Bug fix

Scope (select all touched areas)

Auth / tokens

Linked Issue/PR

Closes #62099

User-visible / Behavior Changes

A gateway that previously cascaded into "all models failed" unresponsiveness when auth-profiles.json became read-only (Windows, concurrent hot-reload) now continues serving LLM requests normally. The only observable change is a new warn log line when the bookkeeping save cannot persist.

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No

Repro + Verification

Environment

OS: Windows 11 (reproduced from user report); Linux in CI
Runtime/container: Node 22
Model/provider: Anthropic primary, Ollama fallback (per user report)
Integration/channel (if any): None
Relevant config (redacted): auth-profiles.json with Windows ReadOnly attribute set

Steps

Gateway runs on Windows with primary + fallback providers configured
User adds a new model to openclaw.json while the gateway is hot-reloading
Windows sets ReadOnly on auth-profiles.json during the concurrent rename
Every subsequent LLM request fails with EPERM: operation not permitted, copyfile

Expected

LLM requests complete. Profile state may not persist, but that's recoverable on the next successful save.

Actual

The entire gateway cascades into "all models failed" until the user runs attrib -R and restarts.

Evidence

Failing test/log before + passing after - see new usage.persist-nonfatal.test.ts
Stack trace from the issue:

Error: EPERM: operation not permitted, copyfile
  'auth-profiles.json.<uuid>.tmp' -> 'auth-profiles.json'
    at Object.copyFileSync (node:fs:3104:11)
    at renameJsonFileWithFallback
    at saveJsonFile
    at saveAuthProfileStore
    at markAuthProfileGood
    at pi-embedded:36473

Human Verification (required)

Verified scenarios:

Ran new regression tests (usage.persist-nonfatal.test.ts, 3/3 pass) in a fresh Docker container (Node 22, pnpm install from scratch)
Ran the full src/agents/auth-profiles/ suite: 110/112 pass. The 2 failures (variously session-override.test.ts and oauth.openai-codex-refresh-fallback.test.ts) also fail on clean main with identical counts, so they are pre-existing flakiness unrelated to this change
pnpm tsgo --noEmit clean (exit 0)
oxlint --type-aware on modified files: 0 warnings, 0 errors
oxfmt --check on modified files: clean

Edge cases checked:

Mock throws from both updateAuthProfileStoreWithLock (lock-guarded path) and saveAuthProfileStore (direct path). Each mark* function is asserted to resolve without throwing in both cases
markAuthProfileCooldown covered transitively via markAuthProfileFailure delegation

What I did not verify:

Live Windows reproduction of the ReadOnly race. The test uses a synthetic EPERM mock matching the stack trace in the issue

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Failure Recovery (if this breaks)

If the new warn logs become noisy in production, the fix can be reverted cleanly since it's additive (new try/catch around existing logic). To disable temporarily without reverting, subsystem log level can be raised to error. Files to restore: src/agents/auth-profiles/profiles.ts, src/agents/auth-profiles/usage.ts. Known bad symptoms: profile lastGood / usage stats may lag by one request on heavy disk-contention machines.

Risks and Mitigations

Risk: Silencing all throws in mark* could mask a genuine bug in the usage-stats computation
- Mitigation: The try/catch is narrow (function body only), computation is pure (no IO), and failures land in log.warn with the full error message so ops can still see them
Risk: saveAuthProfileStore being called from other paths (OAuth token refresh, store inheritance) might also hit EPERM and still throw
- Mitigation: Intentional. OAuth token save failure is a security-critical signal that must still propagate; only the post-success bookkeeping paths are made non-fatal here

Changed files

CHANGELOG.md (modified, +1/-0)
src/agents/auth-profiles/profiles.ts (modified, +38/-20)
src/agents/auth-profiles/usage.persist-nonfatal.test.ts (added, +89/-0)
src/agents/auth-profiles/usage.ts (modified, +142/-109)


Bug Report: EPERM on auth-profiles.json causes full gateway failure cascade

Environment

OpenClaw: 2026.4.5 (3e72c03)
OS: Windows 11 (10.0.26200, x64)
Node: v24.14.1
Providers: Anthropic (claude-opus-4-6), Ollama (glm-4.7-flash, gemma4:26b)

Steps to Reproduce

Have a running gateway with Anthropic as primary model and Ollama as fallback (or vice versa)
Add a new Ollama model to the config while the gateway is running (e.g., adding gemma4:26b to openclaw.json models list)
The gateway hot-reloads the config and updates models.json and auth-profiles.json
Under certain timing conditions, auth-profiles.json acquires the Windows ReadOnly file attribute
All subsequent LLM requests fail

Observed Behavior

Once ReadOnly is set on auth-profiles.json:

Every LLM request attempts to write to auth-profiles.json (via markAuthProfileGood)
The atomic write (copyFileSync from .tmp to target) fails with EPERM
This error is treated as a request-level failure, not just a profile-save failure
The fallback system activates: primary model (e.g., ollama/glm-4.7-flash) → fallback model (e.g., anthropic/claude-opus-4-6)
The fallback model hits the same EPERM on the same file → also fails
Result: "All models failed" — complete gateway unresponsiveness
Gateway restarts do NOT fix it (the ReadOnly attribute persists on disk)
Each retry cycle inflates the session context with error metadata, rapidly consuming the context window
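
The cascade above can be reproduced in miniature. The sketch below is an assumption-laden model, not OpenClaw's fallback code: `runWithFallback`, `callProvider`, and `failingBookkeeping` are hypothetical names, and the only point is that a shared post-success failure defeats every model in the chain.

```javascript
// Simulated post-success bookkeeping that always fails, like a save onto a
// read-only auth-profiles.json.
function failingBookkeeping() {
  const err = new Error("EPERM: operation not permitted, copyfile");
  err.code = "EPERM";
  throw err;
}

// Hypothetical fallback runner: tries each model in order and treats any
// throw, including one raised after the provider answered, as a model failure.
function runWithFallback(models, callProvider, bookkeeping) {
  const errors = [];
  for (const model of models) {
    try {
      const reply = callProvider(model); // the provider call itself succeeds
      bookkeeping(model); // a throw here discards the reply
      return { ok: true, model, reply };
    } catch (err) {
      errors.push(`${model}: ${err.code}`);
    }
  }
  return { ok: false, errors };
}

const models = ["ollama/glm-4.7-flash", "anthropic/claude-opus-4-6"];
const callProvider = (model) => `reply from ${model}`;

// Fatal bookkeeping: every model "fails" even though every provider answered.
const fatal = runWithFallback(models, callProvider, failingBookkeeping);

// Non-fatal bookkeeping: the first model's reply is returned.
const nonFatal = runWithFallback(models, callProvider, (model) => {
  try {
    failingBookkeeping(model);
  } catch (err) {
    // A real implementation would log a warning here and continue.
  }
});
```

Because both models share the same file, adding more fallbacks does not help in the fatal case; each one only adds another failed attempt.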

Stack Trace

Error: EPERM: operation not permitted, copyfile 
  'C:\Users\OpenClaw\.openclaw\agents\main\agent\auth-profiles.json.<uuid>.tmp' 
  -> 'C:\Users\OpenClaw\.openclaw\agents\main\agent\auth-profiles.json'
    at Object.copyFileSync (node:fs:3104:11)
    at renameJsonFileWithFallback (json-file-1PGlTqjr.js:63:7)
    at saveJsonFile (json-file-1PGlTqjr.js:98:3)
    at saveAuthProfileStore (store-HF_Z-jKz.js:427:2)
    at markAuthProfileGood (profiles-DKQdaSwr.js:76:2)
    at pi-embedded-DWASRjxE.js:36473:7

Expected Behavior

Profile write failure should be non-fatal. Failing to save "this API key worked" should not abort the entire LLM request. The response was already received — the profile write is bookkeeping.
Atomic file writes should handle ReadOnly gracefully. renameJsonFileWithFallback should detect the ReadOnly attribute and either clear it or log a warning rather than throwing a fatal error.
Error-loop inflation should be bounded. Failed retries should not dump error metadata into the session context, as this accelerates context exhaustion.
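
The second expectation could be sketched like this. `writeJsonClearingReadOnly` is a hypothetical helper written for this example and is not OpenClaw's renameJsonFileWithFallback; it assumes that setting the owner write bit through Node's `fs.chmodSync` clears the Windows ReadOnly attribute, which is how Node documents chmod on Windows.

```javascript
const fs = require("node:fs");

// Hypothetical helper, not OpenClaw's renameJsonFileWithFallback.
// Writes JSON via a temp file, and if replacing the target is refused with
// EPERM or EACCES, sets the owner write bit and retries once.
function writeJsonClearingReadOnly(targetPath, value) {
  const tmpPath = `${targetPath}.${process.pid}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(value, null, 2));
  try {
    try {
      fs.renameSync(tmpPath, targetPath);
    } catch (err) {
      if (err.code !== "EPERM" && err.code !== "EACCES") throw err;
      // On Windows, chmod with the write bit set clears the ReadOnly attribute.
      const mode = fs.statSync(targetPath).mode;
      fs.chmodSync(targetPath, mode | 0o200);
      console.warn(`cleared read-only state on ${targetPath}`);
      fs.renameSync(tmpPath, targetPath);
    }
  } finally {
    // Remove the temp file if it was not consumed by a successful rename.
    fs.rmSync(tmpPath, { force: true });
  }
}
```

Whether silently clearing the attribute is acceptable is a design choice; a stricter variant could log the warning and rethrow, leaving the decision to the caller.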

Workaround

attrib -R "C:\Users\OpenClaw\.openclaw\agents\main\agent\auth-profiles.json"

Then restart the gateway.

Impact

Gateway becomes completely unresponsive (no LLM requests succeed)
Gateway restarts do not fix it (file attribute persists)
Fallback chain burns paid API tokens on requests that will fail anyway
Session context inflates rapidly from error metadata (~84% of 200k context window in minutes)
User must manually identify and fix the file attribute — no error message points to the actual cause
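
The last point suggests the error could carry a hint. A minimal sketch, assuming a hypothetical `describeSaveFailure` helper that is not part of OpenClaw: it inspects the target's mode and, when the owner write bit is missing, appends the workaround from this report to the message.

```javascript
const fs = require("node:fs");

// Hypothetical helper: turns a bare EPERM/EACCES from a save into a message
// that names the likely cause and the documented workaround.
function describeSaveFailure(err, targetPath) {
  if (err.code !== "EPERM" && err.code !== "EACCES") return err.message;
  let readOnly = false;
  try {
    // Node reports a read-only file as a mode without the owner write bit.
    readOnly = (fs.statSync(targetPath).mode & 0o200) === 0;
  } catch {
    // The target may not exist; fall through to the generic message.
  }
  return readOnly
    ? `${err.message} (target is read-only; on Windows try: attrib -R "${targetPath}")`
    : err.message;
}
```

A caller would pass the caught error and the path it was trying to write, then log the returned string.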

Probable Root Cause

Race condition in the atomic JSON file write logic (renameJsonFileWithFallback) when multiple config files are being updated concurrently during hot-reload. On Windows, a failed rename falling back to copyFileSync may leave the target file with a ReadOnly attribute under certain timing conditions, or Windows itself may set ReadOnly as a protective measure during concurrent file access.

Extent analysis

TL;DR

The most likely fix is to modify the renameJsonFileWithFallback function to handle the ReadOnly attribute on Windows by either clearing it or logging a warning instead of throwing a fatal error.

Guidance

Identify and modify the renameJsonFileWithFallback function (it appears in the bundled file json-file-1PGlTqjr.js in the stack trace) to check for and handle the ReadOnly attribute before attempting to write to auth-profiles.json.
Consider implementing a retry mechanism with a bounded number of attempts to prevent error-loop inflation and session context exhaustion.
Review the atomic file write logic to ensure it can handle concurrent updates and avoid leaving files in a ReadOnly state.
Test the changes on Windows to ensure the fix works as expected and does not introduce new issues.
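
The bounded-retry suggestion above can be sketched as follows. `retryBounded` is a hypothetical helper for illustration; it caps the number of attempts and returns one compact summary rather than per-attempt error detail, which is one way to keep failed retries from inflating session context.

```javascript
// Hypothetical helper: retries `operation` at most `maxAttempts` times and
// keeps one compact summary instead of accumulating per-attempt error detail.
function retryBounded(operation, maxAttempts) {
  if (!Number.isInteger(maxAttempts) || maxAttempts < 1) {
    throw new RangeError("maxAttempts must be a positive integer");
  }
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return { ok: true, value: operation(attempt), attempts: attempt };
    } catch (err) {
      lastError = err; // only the most recent error is kept
    }
  }
  return {
    ok: false,
    attempts: maxAttempts,
    summary: `${lastError.code || "ERROR"} after ${maxAttempts} attempts`,
  };
}
```

For a persistent condition such as a read-only file, retrying cannot succeed, so the cap mainly limits cost; the summary string is what would be surfaced to the session.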

Example

// Example of how to check and clear the ReadOnly attribute in Node.js on Windows.
// Node exposes the Windows ReadOnly attribute as a missing owner write bit (0o200),
// and on Windows fs.chmod can only change that write permission.
const fs = require('fs');

function clearReadOnlyAttribute(filePath) {
  try {
    const stats = fs.statSync(filePath);
    // The file is read-only when the owner write bit is not set
    if ((stats.mode & 0o200) === 0) {
      // Setting the write bit clears the ReadOnly attribute
      fs.chmodSync(filePath, stats.mode | 0o200);
    }
  } catch (error) {
    console.error(`Error clearing ReadOnly attribute: ${error}`);
  }
}

// Call clearReadOnlyAttribute before attempting to write to auth-profiles.json
clearReadOnlyAttribute('C:\\Users\\OpenClaw\\.openclaw\\agents\\main\\agent\\auth-profiles.json');

Notes

The attrib -R workaround is a temporary fix: it does not address the underlying issue, and it will need to be reapplied if the file becomes read-only again.
The root cause of the issue is likely related to the atomic file write logic and the handling of concurrent updates on Windows.

Recommendation

Apply the workaround using attrib -R as a temporary fix, and then modify the renameJsonFileWithFallback function to handle the ReadOnly attribute as described in the guidance section. This will provide a more permanent solution to the issue.
