openclaw - ✅(Solved) Fix [Bug]: Secret provider crash-loop exhausts 1Password service account rate limits [2 pull requests, 1 participant]

GitHub stats
openclaw/openclaw#56217 (fetched 2026-04-08 01:43:28)
Comments: 0
Participants: 1
Timeline: 5
Reactions: 0
Timeline (top): cross-referenced ×2, labeled ×2, referenced ×1

When 1Password service account credentials fail to resolve (rate limit, network issue, token expiry), the gateway enters a crash-loop via launchd KeepAlive, repeatedly invoking op read for every configured secret provider on each restart. This quickly exhausts 1Password's account-wide daily rate limit, making recovery impossible for up to 24 hours.

Error Message

The gateway exits with SecretProviderResolutionError after one or more op read calls fail (timeout or error).

Fix Action

Workaround

We've implemented these protections in our own code that calls op (Mission Control), but the gateway's internal secret provider resolution doesn't have them. The current workaround is to manually unload the LaunchAgent when failures occur:

launchctl bootout gui/$(id -u)/ai.openclaw.gateway

PR fix notes

PR #56499: fix: add exponential backoff for secret provider resolution at startup

Description (problem / solution / changelog)

Problem

Fixes #56217

When a secret provider (e.g., 1Password op read) fails transiently at startup, the gateway crashes immediately. If the process manager (launchd/systemd) restarts it, each restart fires N op read calls that fail again:

Gateway starts → op read × 6 providers → 1 fails → CRASH
  ↓ (launchd restarts after 30s)
Gateway starts → op read × 6 providers → 1 fails → CRASH
  ↓ (launchd restarts after 30s)
  ... (281,000+ requests in 24 hours)

Once the rate limit is hit, ALL 1Password API access is blocked across ALL services on the account for up to 24 hours. Not just OpenClaw, everything.

Root Cause

activateRuntimeSecrets (server.impl.ts:486) throws a fatal error on ANY secret resolution failure at startup, with zero retry and zero backoff. Transient failures (network blip, 1Password CLI timeout, temporary rate limit) are treated identically to permanent failures (bad token, missing secret). The process exits, the OS restarts it, and the cycle repeats.

The non-startup path (reload/restart-check) already handles failures gracefully by falling back to the last-known-good snapshot. But at startup there's no last-known-good, so any failure is fatal.
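The difference between the two paths can be sketched as follows. This is a minimal illustration, not the gateway's code: the Snapshot type, the activateSecrets function, and the lastKnownGood variable are hypothetical names, and only the three reason values are taken from the PR description.

```typescript
type Snapshot = Record<string, string>;
type Reason = "startup" | "reload" | "restart-check";

// Hypothetical holder; the gateway's real snapshot type and storage are not shown in the PR.
let lastKnownGood: Snapshot | undefined;

function activateSecrets(reason: Reason, resolve: () => Snapshot): Snapshot {
  try {
    lastKnownGood = resolve();
    return lastKnownGood;
  } catch (err) {
    // Reload paths can keep serving the previously resolved snapshot.
    if (reason !== "startup" && lastKnownGood !== undefined) {
      return lastKnownGood;
    }
    // At startup nothing has been resolved yet, so the failure is fatal.
    throw err;
  }
}
```

A failed reload returns the earlier snapshot, while a failed startup rethrows, which is the case the PR targets.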

Fix

Added attemptResolve() with exponential backoff around prepareSecretsRuntimeSnapshot at startup:

  • 3 retries with delays of 2s, 4s, 8s (14s total before giving up)
  • Each retry attempt logs [SECRETS_STARTUP_RETRY] with attempt number and delay
  • Only applies when reason === "startup" (reload/restart-check paths unchanged)
  • After exhausting retries, the error message includes actionable guidance about checking service account tokens and rate limits

Transient failures (network blip, socket timeout, temporary rate limit) are likely to succeed on retry 2 or 3 without triggering a full process restart. For permanent failures (revoked token, deleted secret), the retries exhaust quickly (14s) and the gateway still fails with a clear error.
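A minimal sketch of the retry behavior described above, assuming a generic async resolver. The helper names backoffDelaysMs and retryWithBackoff are hypothetical; only the 2s, 4s, 8s schedule and the [SECRETS_STARTUP_RETRY] log tag come from the PR description, and the exact log wording is an assumption.

```typescript
// Delay schedule: baseMs, 2*baseMs, 4*baseMs, ... for the given number of retries.
function backoffDelaysMs(retries: number, baseMs: number): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

// Calls resolve() once, then retries after each delay in delaysMs.
// sleep and log are injectable so the loop can be exercised without real waiting.
async function retryWithBackoff<T>(
  resolve: () => Promise<T>,
  delaysMs: number[],
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((done) => setTimeout(done, ms)),
  log: (msg: string) => void = console.warn,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await resolve();
    } catch (err) {
      // Retries exhausted: surface the last failure to the caller.
      if (attempt >= delaysMs.length) throw err;
      const delay = delaysMs[attempt];
      log(`[SECRETS_STARTUP_RETRY] attempt ${attempt + 1}/${delaysMs.length}, retrying in ${delay}ms`);
      await sleep(delay);
    }
  }
}
```

With backoffDelaysMs(3, 2000) the delays are 2000, 4000, and 8000 ms, which sum to the 14s quoted above.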

Impact

With the reporter's setup (6 providers, 30s restart interval):

  • Before: ~720 op calls/hour during crash-loop (0 backoff)
  • After: Transient failures recovered in-process without restart. Permanent failures add only 14s delay before the (now-inevitable) crash, reducing restart frequency by ~30%
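The figures above can be checked with a short calculation. This is a sketch of the arithmetic only: it counts restarts, and whether each in-process retry re-invokes every provider is not stated in the PR description, so op calls per restart are left out of the "after" case.

```typescript
const SECONDS_PER_HOUR = 3600;

// Restarts per hour when each cycle lasts the restart interval plus any in-process retry time.
function restartsPerHour(restartIntervalS: number, inProcessRetryS = 0): number {
  return SECONDS_PER_HOUR / (restartIntervalS + inProcessRetryS);
}

const restartsBefore = restartsPerHour(30);         // 120 restarts/hour
const restartsAfter = restartsPerHour(30, 14);      // about 81.8 restarts/hour
const reduction = 1 - restartsAfter / restartsBefore; // about 0.318, roughly 30%
const opCallsBefore = 6 * restartsBefore;           // 720 op calls/hour with 6 providers
```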

Changed files

  • extensions/whatsapp/src/inbound/access-control.ts (modified, +11/-2)
  • extensions/whatsapp/src/inbound/monitor.ts (modified, +15/-8)
  • src/gateway/server.impl.ts (modified, +31/-4)

PR #56514: fix: prevent infinite retry loop when provider returns 401

Description (problem / solution / changelog)

Problem

Fixes #56501

When a provider returns HTTP 401 (invalid/expired API key), the agent enters an infinite retry loop generating ~1500+ ERROR entries per minute:

moonshot/kimi-k2.5 → 401 auth error
  → failover to modelstudio/qwen3.5-plus
    → live session model switch check reads persisted preference → moonshot/kimi-k2.5
    → "different from current" → throws LiveSessionModelSwitchError
      → outer loop catches, switches back to moonshot/kimi-k2.5
        → 401 auth error again → ...infinite loop

Root Cause

Three systems interact to create the oscillation:

  1. Model fallback (model-fallback.ts): Correctly switches from moonshot to modelstudio on 401
  2. Persisted session store (live-model-switch.ts): Still says providerOverride: "moonshot" (never cleared on auth failure)
  3. Live switch check (run.ts:452-458): Sees current=modelstudio vs persisted=moonshot, throws LiveSessionModelSwitchError which forces the outer loop back to moonshot

The model fallback system has auth cooldown logic (resolveCooldownDecision), but the LiveSessionModelSwitchError path completely bypasses it.

Fix

Track providers that fail with auth errors in a per-run Set<string>. Before honoring a LiveSessionModelSwitchError, check whether the target provider is in the auth-failed set. If it is, log the suppression and let the fallback model proceed.

moonshot/kimi-k2.5 → 401 → authFailedProviders.add("moonshot")
  → failover to modelstudio/qwen3.5-plus
    → live switch check: persisted = moonshot
    → authFailedProviders.has("moonshot")? YES
    → suppress switch, continue with modelstudio ✓

21 lines in 1 file. The auth-failed set is scoped to the current run, so it resets on fresh agent invocations (allowing recovery after the user fixes their API key).
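The suppression check can be sketched as a small per-run tracker. The class and method names here are hypothetical; the PR description only states that a per-run Set<string> of auth-failed providers is consulted before a live model switch is honored.

```typescript
// One instance per agent run, so the set resets on a fresh invocation.
class AuthFailureTracker {
  private readonly authFailedProviders = new Set<string>();

  // Call when a provider returns an auth error such as HTTP 401.
  recordAuthFailure(provider: string): void {
    this.authFailedProviders.add(provider);
  }

  // A switch back to a provider that already failed auth in this run is suppressed.
  shouldHonorSwitch(targetProvider: string): boolean {
    return !this.authFailedProviders.has(targetProvider);
  }
}

const tracker = new AuthFailureTracker();
tracker.recordAuthFailure("moonshot");
const switchToMoonshot = tracker.shouldHonorSwitch("moonshot");       // false: suppressed
const switchToModelstudio = tracker.shouldHonorSwitch("modelstudio"); // true: allowed
```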

Changed files

  • extensions/whatsapp/src/inbound/access-control.ts (modified, +11/-2)
  • extensions/whatsapp/src/inbound/monitor.ts (modified, +15/-8)
  • src/agents/pi-embedded-runner/run.ts (modified, +21/-4)
  • src/agents/session-transcript-repair.ts (modified, +20/-2)
  • src/cron/service/timer.ts (modified, +7/-2)
  • src/gateway/server.impl.ts (modified, +31/-4)
  • src/infra/host-env-security.ts (modified, +12/-0)


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

When 1Password service account credentials fail to resolve (rate limit, network issue, token expiry), the gateway enters a crash-loop via launchd KeepAlive, repeatedly invoking op read for every configured secret provider on each restart. This quickly exhausts 1Password's account-wide daily rate limit, making recovery impossible for up to 24 hours.

Steps to reproduce

  1. Configure gateway with multiple op read-based secret providers in openclaw.json
  2. Run gateway as a LaunchAgent with KeepAlive: true
  3. Trigger a transient 1Password failure (e.g., expired token, network blip, or rate limit)
  4. Observe the restart loop

Expected behavior

The gateway should detect repeated secret resolution failures and back off rather than crash-looping into a rate limit:

  • Exponential backoff on secret provider failures (e.g., 1s → 2s → 4s → ... → cap at 5 minutes)
  • Circuit breaker — after N consecutive failures, stop retrying and log a clear error instead of exiting (which triggers launchd restart)
  • Batch secret resolution — use op run to resolve all secrets in a single CLI invocation instead of individual op read calls per provider
  • Rate limit awareness — detect "Too many requests" responses and pause rather than retrying immediately
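Three of the behaviors requested above can be sketched in a few lines. All names are hypothetical, the 1s base and 5 minute cap are the example values from the list, and matching the "Too many requests" text in the CLI's error output is an assumption about how the rate-limit response surfaces.

```typescript
// Exponential backoff: 1s, 2s, 4s, ... capped at 5 minutes by default.
function cappedBackoffMs(attempt: number, baseMs = 1_000, capMs = 300_000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Opens after `threshold` consecutive failures; any success closes it again.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
  }

  get isOpen(): boolean {
    return this.consecutiveFailures >= this.threshold;
  }
}

// Assumption: the rate-limit response contains the text quoted in the issue.
function isRateLimited(errorOutput: string): boolean {
  return /too many requests/i.test(errorOutput);
}
```

When the breaker is open, the caller would stop invoking op and log an error while keeping the process alive, so launchd has nothing to restart.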

Actual behavior

  1. Gateway starts → secret providers invoke op read for each configured secret
  2. One or more op read calls fail (timeout or error)
  3. Gateway exits with SecretProviderResolutionError
  4. launchd restarts gateway after ThrottleInterval (30s default)
  5. Go to step 1

Each restart cycle fires N op read calls (one per provider). With ~6 providers and a 30s restart interval, this produces ~720 op calls/hour. In practice we observed 281,000+ requests in under 24 hours, far exceeding 1Password's account-wide daily limit (50K for Business, 5K for Teams).

Once the daily limit is hit, all op calls fail across all service accounts on the 1Password account, and recovery requires waiting up to 24 hours.
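The estimate above can be set against the quoted limits with a short calculation. This is only arithmetic on the numbers in the issue; the function names are hypothetical.

```typescript
// Daily limits as quoted in the issue.
const DAILY_LIMITS = { business: 50_000, teams: 5_000 };

// One op read per provider on every restart.
function opCallsPerHour(providers: number, restartIntervalS: number): number {
  return (providers * 3600) / restartIntervalS;
}

function hoursUntilLimit(dailyLimit: number, callsPerHour: number): number {
  return dailyLimit / callsPerHour;
}

const estimatedPerHour = opCallsPerHour(6, 30);   // 720
const estimatedPerDay = estimatedPerHour * 24;    // 17,280
const teamsHours = hoursUntilLimit(DAILY_LIMITS.teams, estimatedPerHour); // about 6.9 hours
const observedPerHour = 281_000 / 24;             // about 11,708
```

At the estimated rate the Teams limit is reached in about seven hours. The observed 281,000 requests are roughly 16 times the estimated daily total, so the real call rate was well above this simple model; the issue does not say why.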

OpenClaw version

2026.3.24

Operating system

macOS 26.3.1 (25D2128)

Install method

No response

Model

NA

Provider / routing chain

NA

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

  • Exhausts 1Password rate limits for the entire account (not just one service account)
  • Blocks all 1Password API access for up to 24 hours
  • Affects all services sharing the same 1Password account
  • Requires manual intervention to stop the loop

Additional information

Environment

  • OpenClaw: 2026.3.24 (cff6dc9)
  • macOS (Apple Silicon), launchd LaunchAgent with KeepAlive: true
  • 1Password CLI v2.33.0, service account auth via OP_SERVICE_ACCOUNT_TOKEN
  • Secret providers configured via openclaw.json secrets.providers (exec-based op read)

Suggestion

Consider adding a secrets.retryPolicy option to openclaw.json, e.g.:

{
  "secrets": {
    "retryPolicy": {
      "maxRetries": 3,
      "backoffMs": [1000, 2000, 4000],
      "circuitBreakerThreshold": 2,
      "cooldownMs": 300000
    }
  }
}

Or at minimum, if all secret providers fail on startup, exit with a non-zero code that signals "do not restart immediately" rather than a generic crash that launchd treats as restartable.
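If such an option were added, the gateway would need to read it with defaults and basic validation. The sketch below assumes the field names from the suggested JSON; the RetryPolicy interface, DEFAULT_RETRY_POLICY, and parseRetryPolicy are hypothetical and do not exist in OpenClaw.

```typescript
interface RetryPolicy {
  maxRetries: number;
  backoffMs: number[];
  circuitBreakerThreshold: number;
  cooldownMs: number;
}

// Defaults copied from the example in the suggestion.
const DEFAULT_RETRY_POLICY: RetryPolicy = {
  maxRetries: 3,
  backoffMs: [1000, 2000, 4000],
  circuitBreakerThreshold: 2,
  cooldownMs: 300000,
};

// Merges a raw config value over the defaults and rejects obviously invalid settings.
function parseRetryPolicy(raw: unknown): RetryPolicy {
  const input = (typeof raw === "object" && raw !== null ? raw : {}) as Partial<RetryPolicy>;
  const policy: RetryPolicy = { ...DEFAULT_RETRY_POLICY, ...input };
  if (!Number.isInteger(policy.maxRetries) || policy.maxRetries < 0) {
    throw new Error("secrets.retryPolicy.maxRetries must be a non-negative integer");
  }
  if (!Array.isArray(policy.backoffMs) || policy.backoffMs.some((ms) => typeof ms !== "number" || ms < 0)) {
    throw new Error("secrets.retryPolicy.backoffMs must be an array of non-negative numbers");
  }
  return policy;
}
```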

extent analysis

Fix Plan

To address the issue, we will implement the following:

  • Exponential backoff on secret provider failures
  • Circuit breaker to stop retrying after N consecutive failures
  • Batch secret resolution using op run
  • Rate limit awareness to detect "Too many requests" responses

Code Changes

We will introduce a retryPolicy option in openclaw.json to configure the retry behavior. Here's an example:

{
  "secrets": {
    "retryPolicy": {
      "maxRetries": 3,
      "backoffMs": [1000, 2000, 4000],
      "circuitBreakerThreshold": 2,
      "cooldownMs": 300000
    }
  }
}

We will then update the secret provider resolution code to use the configured retry policy:

import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { setTimeout as sleep } from "node:timers/promises";

const execFileAsync = promisify(execFile);

// `config` is the parsed openclaw.json; secrets.retryPolicy is the option proposed above.
const { maxRetries, backoffMs, cooldownMs } = config.secrets.retryPolicy;

// `secretRefs` (hypothetical) is the list of op:// references to resolve.
async function readSecrets(secretRefs) {
  // The plan proposes batching with op run, but the exact invocation is not
  // given, so this sketch still issues one op read per reference.
  const secrets = [];
  for (const ref of secretRefs) {
    const { stdout } = await execFileAsync("op", ["read", ref]);
    secrets.push(stdout.trim());
  }
  return secrets;
}

async function resolveSecrets(secretRefs) {
  for (let retryCount = 0; ; retryCount++) {
    try {
      return await readSecrets(secretRefs);
    } catch (error) {
      if (retryCount >= maxRetries) {
        // Circuit breaker: stop retrying and report the failure rather than
        // exiting, because an exit triggers another launchd restart.
        console.error(`Secret resolution failed after ${maxRetries} retries`);
        return null;
      }
      // Assumption: rate limiting surfaces as "Too many requests" in stderr.
      const rateLimited = /too many requests/i.test(String(error.stderr ?? error.message));
      // Rate-limited: wait the full cooldown. Otherwise take the next backoff step.
      await sleep(rateLimited ? cooldownMs : (backoffMs[retryCount] ?? cooldownMs));
    }
  }
}

Verification

To verify the fix, we will:

  • Configure the retry policy in openclaw.json
  • Trigger a transient 1Password failure (e.g., expired token, network blip, or rate limit)
  • Observe the gateway's behavior and verify that it:
    • Backs off and retries with exponential backoff
    • Stops retrying after N consecutive failures (circuit breaker)
    • Resolves secrets in batches using op run
    • Detects "Too many requests" responses and pauses rather than retrying immediately

Extra Tips

To prevent similar issues in the future, we recommend:

  • Implementing rate limiting and circuit breakers for all external service calls
  • Using batch processing and exponential backoff for retrying
