openclaw - ✅(Solved) Fix [Bug]: Secret provider crash-loop exhausts 1Password service account rate limits [2 pull requests, 1 participant]

openclaw 2026-03-28 06:07:30

GitHub stats

openclaw/openclaw#56217 • Fetched 2026-04-08 01:43:28

Author

stevenc317

Participants

stevenc317

Timeline (top)

cross-referenced ×2, labeled ×2, referenced ×1

When 1Password service account credentials fail to resolve (rate limit, network issue, token expiry), the gateway enters a crash-loop via launchd KeepAlive, repeatedly invoking op read for every configured secret provider on each restart. This quickly exhausts 1Password's account-wide daily rate limit, making recovery impossible for up to 24 hours.

Error Message

One or more op read calls fail (timeout or error), and the gateway exits with SecretProviderResolutionError.

Workaround

We've implemented these protections in our own code that calls op (Mission Control), but the gateway's internal secret provider resolution doesn't have them. Current workaround is to manually unload the LaunchAgent when failures occur:

launchctl bootout gui/$(id -u)/ai.openclaw.gateway

PR fix notes

PR #56499: fix: add exponential backoff for secret provider resolution at startup

Repository: openclaw/openclaw
Author: claygeo
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/56499

Description (problem / solution / changelog)

Problem

Fixes #56217

When a secret provider (e.g., 1Password op read) fails transiently at startup, the gateway crashes immediately. If the process manager (launchd/systemd) restarts it, each restart fires N op read calls that fail again:

Gateway starts → op read × 6 providers → 1 fails → CRASH
  ↓ (launchd restarts after 30s)
Gateway starts → op read × 6 providers → 1 fails → CRASH
  ↓ (launchd restarts after 30s)
  ... (281,000+ requests in 24 hours)

Once the rate limit is hit, ALL 1Password API access is blocked across ALL services on the account for up to 24 hours. Not just OpenClaw, everything.

Root Cause

activateRuntimeSecrets (server.impl.ts:486) throws a fatal error on ANY secret resolution failure at startup, with zero retry and zero backoff. Transient failures (network blip, 1Password CLI timeout, temporary rate limit) are treated identically to permanent failures (bad token, missing secret). The process exits, the OS restarts it, and the cycle repeats.

The non-startup path (reload/restart-check) already handles failures gracefully by falling back to the last-known-good snapshot. But at startup there's no last-known-good, so any failure is fatal.

Fix

Added attemptResolve() with exponential backoff around prepareSecretsRuntimeSnapshot at startup:

3 retries with delays of 2s, 4s, 8s (14s total before giving up)
Each retry attempt logs [SECRETS_STARTUP_RETRY] with attempt number and delay
Only applies when reason === "startup" (reload/restart-check paths unchanged)
After exhausting retries, the error message includes actionable guidance about checking service account tokens and rate limits

Transient failures (network blip, socket timeout, temporary rate limit) are likely to succeed on retry 2 or 3 without triggering a full process restart. For permanent failures (revoked token, deleted secret), the retries exhaust quickly (14s) and the gateway still fails with a clear error.
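
The retry behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual code: retryWithBackoff, backoffDelayMs, and the injectable sleep and log parameters are hypothetical names. Only the 2s/4s/8s schedule, the retry count of 3, and the [SECRETS_STARTUP_RETRY] log tag come from the description.

```typescript
// Hypothetical sketch of startup retry with exponential backoff.
// Function and parameter names are illustrative, not taken from server.impl.ts.
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise((done) => setTimeout(done, ms));

// Delay before retry number `attempt` (1-based): 2s, 4s, 8s, ...
function backoffDelayMs(attempt: number, baseMs = 2000): number {
  return baseMs * 2 ** (attempt - 1);
}

async function retryWithBackoff<T>(
  resolveSecrets: () => Promise<T>,
  maxRetries = 3,
  sleep: Sleep = realSleep,
  log: (line: string) => void = console.warn,
): Promise<T> {
  let lastError: unknown;
  // Attempt 0 is the initial try; attempts 1..maxRetries are retries.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (attempt > 0) {
      const delayMs = backoffDelayMs(attempt);
      log(`[SECRETS_STARTUP_RETRY] attempt=${attempt} delayMs=${delayMs}`);
      await sleep(delayMs);
    }
    try {
      return await resolveSecrets();
    } catch (error) {
      lastError = error;
    }
  }
  throw new Error(
    `Secret resolution failed after ${maxRetries} retries ` +
      `(check service account token and rate limits): ${String(lastError)}`,
  );
}
```

Passing sleep as a parameter keeps the schedule testable without real waiting; a caller at startup would use the default.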

Impact

With the reporter's setup (6 providers, 30s restart interval):

Before: ~720 op calls/hour during crash-loop (0 backoff)
After: Transient failures recovered in-process without restart. Permanent failures add only 14s delay before the (now-inevitable) crash, reducing restart frequency by ~30%
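
The figures above can be reproduced with a little arithmetic. The sketch below assumes the time spent inside the op calls themselves is negligible, and it counts restarts only; how many op calls each in-process retry issues is not stated in the PR text, so that is left out.

```typescript
// Arithmetic behind the impact figures; inputs come from the reporter's setup.
const providers = 6;
const restartIntervalSec = 30;
const retryDelaysSec = [2, 4, 8];

const restartsPerHourBefore = 3600 / restartIntervalSec;        // 120
const opCallsPerHourBefore = restartsPerHourBefore * providers; // 720

// With the fix, a permanent failure spends the full retry budget before crashing.
const retryBudgetSec = retryDelaysSec.reduce((a, b) => a + b, 0);   // 14
const cycleSecAfter = restartIntervalSec + retryBudgetSec;          // 44
const restartsPerHourAfter = 3600 / cycleSecAfter;                  // about 81.8
const reduction = 1 - restartsPerHourAfter / restartsPerHourBefore; // about 0.32

console.log({ opCallsPerHourBefore, restartsPerHourAfter, reduction });
```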

Changed files

extensions/whatsapp/src/inbound/access-control.ts (modified, +11/-2)
extensions/whatsapp/src/inbound/monitor.ts (modified, +15/-8)
src/gateway/server.impl.ts (modified, +31/-4)

PR #56514: fix: prevent infinite retry loop when provider returns 401

Repository: openclaw/openclaw
Author: claygeo
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/56514

Description (problem / solution / changelog)

Problem

Fixes #56501

When a provider returns HTTP 401 (invalid/expired API key), the agent enters an infinite retry loop generating ~1500+ ERROR entries per minute:

moonshot/kimi-k2.5 → 401 auth error
  → failover to modelstudio/qwen3.5-plus
    → live session model switch check reads persisted preference → moonshot/kimi-k2.5
    → "different from current" → throws LiveSessionModelSwitchError
      → outer loop catches, switches back to moonshot/kimi-k2.5
        → 401 auth error again → ...infinite loop

Root Cause

Three systems interact to create the oscillation:

Model fallback (model-fallback.ts): Correctly switches from moonshot to modelstudio on 401
Persisted session store (live-model-switch.ts): Still says providerOverride: "moonshot" (never cleared on auth failure)
Live switch check (run.ts:452-458): Sees current=modelstudio vs persisted=moonshot, throws LiveSessionModelSwitchError which forces the outer loop back to moonshot

The model fallback system has auth cooldown logic (resolveCooldownDecision), but the LiveSessionModelSwitchError path completely bypasses it.

Fix

Track providers that fail with auth errors in a per-run Set<string>. Before honoring a LiveSessionModelSwitchError, check whether the target provider is in the auth-failed set. If it is, log the suppression and let the fallback model proceed.

moonshot/kimi-k2.5 → 401 → authFailedProviders.add("moonshot")
  → failover to modelstudio/qwen3.5-plus
    → live switch check: persisted = moonshot
    → authFailedProviders.has("moonshot")? YES
    → suppress switch, continue with modelstudio ✓

21 lines in 1 file. The auth-failed set is scoped to the current run, so it resets on fresh agent invocations (allowing recovery after the user fixes their API key).
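
A minimal sketch of the guard described above, assuming a small helper class. AuthFailureGuard and its method names are hypothetical and are not the identifiers used in run.ts; only the idea of a per-run Set<string> of auth-failed providers comes from the PR description.

```typescript
// Hypothetical sketch of the per-run auth-failure guard; not the actual run.ts code.
class AuthFailureGuard {
  // Scoped to one run: a fresh guard (and an empty set) is created per agent invocation.
  private readonly authFailedProviders = new Set<string>();

  // Record a provider that returned an auth error (e.g. HTTP 401) during this run.
  recordAuthFailure(provider: string): void {
    this.authFailedProviders.add(provider);
  }

  // A live switch back to a provider that already failed auth should be suppressed.
  shouldHonorSwitch(targetProvider: string): boolean {
    return !this.authFailedProviders.has(targetProvider);
  }
}

const guard = new AuthFailureGuard();
guard.recordAuthFailure("moonshot");
const honorMoonshot = guard.shouldHonorSwitch("moonshot");       // suppressed, stay on fallback
const honorModelstudio = guard.shouldHonorSwitch("modelstudio"); // allowed
```

Because the set lives on the guard instance, creating a new guard for the next run lets the original provider be tried again after the API key is fixed.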

Changed files

extensions/whatsapp/src/inbound/access-control.ts (modified, +11/-2)
extensions/whatsapp/src/inbound/monitor.ts (modified, +15/-8)
src/agents/pi-embedded-runner/run.ts (modified, +21/-4)
src/agents/session-transcript-repair.ts (modified, +20/-2)
src/cron/service/timer.ts (modified, +7/-2)
src/gateway/server.impl.ts (modified, +31/-4)
src/infra/host-env-security.ts (modified, +12/-0)

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Steps to reproduce

Configure gateway with multiple op read-based secret providers in openclaw.json
Run gateway as a LaunchAgent with KeepAlive: true
Trigger a transient 1Password failure (e.g., expired token, network blip, or rate limit)
Observe the restart loop

Expected behavior

The gateway should detect repeated secret resolution failures and back off rather than crash-looping into a rate limit:

Exponential backoff on secret provider failures (e.g., 1s → 2s → 4s → ... → cap at 5 minutes)
Circuit breaker — after N consecutive failures, stop retrying and log a clear error instead of exiting (which triggers launchd restart)
Batch secret resolution — use op run to resolve all secrets in a single CLI invocation instead of individual op read calls per provider
Rate limit awareness — detect "Too many requests" responses and pause rather than retrying immediately
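
The circuit breaker item above could look roughly like this. It is a sketch under assumptions: the class name, the threshold of consecutive failures, and the cooldown handling are illustrative and do not describe existing gateway code.

```typescript
// Hypothetical circuit breaker; names and thresholds are illustrative assumptions.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // True when a call may proceed: breaker closed, or cooldown has elapsed.
  canAttempt(nowMs: number): boolean {
    if (this.openedAtMs === null) return true;
    return nowMs - this.openedAtMs >= this.cooldownMs;
  }

  // A success closes the breaker and clears the failure count.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAtMs = null;
  }

  // At or past the threshold, each failure (re)opens the breaker and restarts the cooldown.
  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      this.openedAtMs = nowMs;
    }
  }
}

// Usage: skip the op calls entirely while the breaker is open.
const breaker = new CircuitBreaker(2, 300000);
if (breaker.canAttempt(Date.now())) {
  // ... invoke the secret provider, then call recordSuccess() or recordFailure(Date.now())
}
```

Keeping the process alive while the breaker is open is the point: no exit means no launchd restart and no new burst of op calls.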

Actual behavior

1. Gateway starts → secret providers invoke op read for each configured secret
2. One or more op read calls fail (timeout or error)
3. Gateway exits with SecretProviderResolutionError
4. launchd restarts gateway after ThrottleInterval (30s default)
5. Go to step 1

Each restart cycle fires N op read calls (one per provider). With ~6 providers and a 30s restart interval, this produces ~720 op calls/hour. In practice we observed 281,000+ requests in under 24 hours, far exceeding 1Password's account-wide daily limit (50K for Business, 5K for Teams).

Once the daily limit is hit, all op calls fail across all service accounts on the 1Password account, and recovery requires waiting up to 24 hours.

OpenClaw version

2026.3.24

Operating system

macOS 26.3.1 (25D2128)

Install method

No response

Model

Provider / routing chain

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

Exhausts 1Password rate limits for the entire account (not just one service account)
Blocks all 1Password API access for up to 24 hours
Affects all services sharing the same 1Password account
Requires manual intervention to stop the loop

Additional information

Summary

Environment

OpenClaw: 2026.3.24 (cff6dc9)
macOS (Apple Silicon), launchd LaunchAgent with KeepAlive: true
1Password CLI v2.33.0, service account auth via OP_SERVICE_ACCOUNT_TOKEN
Secret providers configured via openclaw.json secrets.providers (exec-based op read)

Workaround

launchctl bootout gui/$(id -u)/ai.openclaw.gateway

Suggestion

Consider adding a secrets.retryPolicy option to openclaw.json, e.g.:

{
  "secrets": {
    "retryPolicy": {
      "maxRetries": 3,
      "backoffMs": [1000, 2000, 4000],
      "circuitBreakerThreshold": 2,
      "cooldownMs": 300000
    }
  }
}

Or at minimum, if all secret providers fail on startup, exit with a non-zero code that signals "do not restart immediately" rather than a generic crash that launchd treats as restartable.
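
The suggested retryPolicy shape could be typed and validated along these lines. This is a sketch only: secrets.retryPolicy is a proposal in this issue, not an existing openclaw.json option, and RetryPolicy, loadRetryPolicy, and delayForAttempt are hypothetical names. Treating cooldownMs as the delay once backoffMs is exhausted is also an assumption.

```typescript
// Hypothetical typing of the proposed secrets.retryPolicy option.
interface RetryPolicy {
  maxRetries: number;
  backoffMs: number[];
  circuitBreakerThreshold: number;
  cooldownMs: number;
}

// Defaults mirror the values in the suggestion above.
const defaultRetryPolicy: RetryPolicy = {
  maxRetries: 3,
  backoffMs: [1000, 2000, 4000],
  circuitBreakerThreshold: 2,
  cooldownMs: 300000,
};

// Merge a partial user policy with the defaults and reject invalid values.
function loadRetryPolicy(raw: Partial<RetryPolicy> = {}): RetryPolicy {
  const policy: RetryPolicy = { ...defaultRetryPolicy, ...raw };
  if (!Number.isInteger(policy.maxRetries) || policy.maxRetries < 0) {
    throw new RangeError("maxRetries must be a non-negative integer");
  }
  if (policy.backoffMs.some((ms) => !(ms > 0)) || !(policy.cooldownMs > 0)) {
    throw new RangeError("backoffMs entries and cooldownMs must be positive");
  }
  return policy;
}

// Delay before retry `attempt` (1-based); past the end of backoffMs, use cooldownMs.
function delayForAttempt(policy: RetryPolicy, attempt: number): number {
  return policy.backoffMs[attempt - 1] ?? policy.cooldownMs;
}
```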

extent analysis

Fix Plan

To address the issue, we will implement the following:

Exponential backoff on secret provider failures
Circuit breaker to stop retrying after N consecutive failures
Batch secret resolution using op run
Rate limit awareness to detect "Too many requests" responses

Code Changes

We will introduce a retryPolicy option in openclaw.json to configure the retry behavior. Here's an example:

{
  "secrets": {
    "retryPolicy": {
      "maxRetries": 3,
      "backoffMs": [1000, 2000, 4000],
      "circuitBreakerThreshold": 2,
      "cooldownMs": 300000
    }
  }
}

We will then update the secret provider resolution code to use the configured retry policy:

import { execFileSync } from "node:child_process";

// Fall back to the suggested defaults when retryPolicy is not configured.
// circuitBreakerThreshold is part of the suggested policy but unused in this sketch.
const {
  maxRetries = 3,
  backoffMs = [1000, 2000, 4000],
  cooldownMs = 300000,
} = config.secrets?.retryPolicy ?? {};

let retryCount = 0;

function resolveSecrets() {
  try {
    // Use op run to resolve all secrets in a single CLI invocation.
    // Assumption: secrets.env maps variable names to op:// references.
    const output = execFileSync(
      "op",
      ["run", "--env-file=secrets.env", "--no-masking", "--", "printenv"],
      { encoding: "utf8" },
    );
    retryCount = 0; // success resets the failure counter
    // ... parse `output` and activate the secrets
  } catch (error) {
    handleError(error);
  }
}

function handleError(error) {
  if (retryCount >= maxRetries) {
    // Circuit breaker: stop retrying but stay alive, because exiting
    // would make launchd restart the gateway and repeat the op calls.
    console.error(`Secret resolution failed after ${maxRetries} retries`);
    return;
  }
  // Rate limited: pause for the full cooldown instead of retrying soon.
  const rateLimited = /too many requests/i.test(String(error.stderr ?? error.message));
  // Otherwise retry with exponential backoff, capped by the cooldown.
  const backoffTimeout = rateLimited ? cooldownMs : (backoffMs[retryCount] ?? cooldownMs);
  retryCount++;
  setTimeout(resolveSecrets, backoffTimeout);
}

Verification

To verify the fix, we will:

Configure the retry policy in openclaw.json
Trigger a transient 1Password failure (e.g., expired token, network blip, or rate limit)
Observe the gateway's behavior and verify that it:
- Backs off and retries with exponential backoff
- Stops retrying after N consecutive failures (circuit breaker)
- Resolves secrets in batches using op run
- Detects "Too many requests" responses and pauses rather than retrying immediately
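
The checks above could be exercised offline against a simulated sequence of failures before testing with a real 1Password account. The sketch below is an assumption-laden illustration: isRateLimited and planWaitsMs are hypothetical helpers, and matching on the "Too many requests" text follows the wording in this issue rather than a documented CLI contract.

```typescript
// Hypothetical helpers for verifying the wait schedule without calling op.
function isRateLimited(stderr: string): boolean {
  return /too many requests/i.test(stderr);
}

// For each consecutive failure, compute how long the gateway should wait before
// the next attempt: the full cooldown when rate limited, otherwise the backoff
// step for that attempt (falling back to the cooldown past the end of the list).
function planWaitsMs(failures: string[], backoffMs: number[], cooldownMs: number): number[] {
  return failures.map((stderr, i) =>
    isRateLimited(stderr) ? cooldownMs : (backoffMs[i] ?? cooldownMs),
  );
}

const simulatedFailures = ["connection timed out", "Too many requests", "connection timed out"];
const plannedWaits = planWaitsMs(simulatedFailures, [1000, 2000, 4000], 300000);
console.log(plannedWaits);
```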

Extra Tips

To prevent similar issues in the future, we recommend:

Implementing rate limiting and circuit breakers for all external service calls
Using batch processing and exponential backoff for retrying
