openclaw - ✅(Solved) Fix [Bug]:auth_permanent cache never expires — gateway restart required to recover from transient Google API failures [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#56838 · Fetched 2026-04-08 01:47:10
Comments: 0 · Participants: 1 · Timeline: 7 · Reactions: 0
Timeline (top): referenced ×3, closed ×1, cross-referenced ×1, labeled ×1

When a Google API call fails transiently (e.g., due to a momentary GCP outage or rate limit), OpenClaw marks the entire Google provider as auth_permanent, causing all Google models to be permanently skipped for the rest of the gateway session. Even after the GCP issue resolves, the gateway continues to skip all Google models indefinitely until manually restarted.

Root Cause

According to PR #60404, a transient auth error from a provider (e.g. API_KEY_INVALID during a GCP outage) is classified as auth_permanent and receives the same 5h–24h exponential backoff used for billing failures. The provider therefore stays disabled for hours after the upstream issue resolves, which in practice looks permanent: restarting the gateway is the only way to recover sooner.

Fix / Workaround

Frequency: Occurs every time GCP has even a brief transient failure.
Severity: Medium-High. The agent becomes completely non-functional until the user manually restarts the gateway.
Workaround: Run systemctl --user restart openclaw-gateway.service manually. This works but requires user intervention.

PR fix notes

PR #60404: fix(auth): use shorter backoff for auth_permanent failures

Description (problem / solution / changelog)

Problem

When a provider returns a transient auth error (e.g. API_KEY_INVALID during a GCP outage), OpenClaw marks it as auth_permanent and applies the same 5h–24h exponential backoff used for billing failures. This effectively disables the provider for hours even after the upstream issue resolves — the only workaround is restarting the gateway.

Reported in #56838.

Fix

Give auth_permanent its own backoff curve with much shorter defaults:

| Config key | Default | Description |
| --- | --- | --- |
| auth.cooldowns.authPermanentBackoffMinutes | 10 | Base backoff (minutes) |
| auth.cooldowns.authPermanentMaxMinutes | 60 | Backoff cap (minutes) |

The exponential progression (base × 2^(n-1), capped) still applies, so repeated failures ramp up to 1 hour max rather than 24 hours.
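The progression described above can be sketched in a few lines of Python. This is only an illustration of the formula base × 2^(n-1) with a cap, not the code in usage.ts; the function name and parameter names are hypothetical, and the defaults are the two config values from the PR description.

```python
def auth_permanent_backoff_minutes(failure_count: int,
                                   base_minutes: int = 10,
                                   max_minutes: int = 60) -> int:
    """Return the backoff for the n-th consecutive failure.

    Hypothetical helper: mirrors the formula base * 2^(n-1), capped at
    max_minutes. The defaults correspond to authPermanentBackoffMinutes
    and authPermanentMaxMinutes from the PR description.
    """
    if failure_count < 1:
        return 0  # no failure recorded yet, so no backoff
    return min(base_minutes * 2 ** (failure_count - 1), max_minutes)


# Backoff for the first five consecutive failures with the new defaults.
schedule = [auth_permanent_backoff_minutes(n) for n in range(1, 6)]
# schedule == [10, 20, 40, 60, 60]
```

Assuming the old behaviour used the same formula with a 5-hour base and a 24-hour cap (the PR only states the 5h–24h range), calling the helper with base_minutes=300 and max_minutes=1440 gives 300, 600, 1200 and 1440 minutes for the first four failures, which shows why the provider stayed disabled for so long.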

Changes

  • src/agents/auth-profiles/usage.ts — separate auth_permanent from billing in computeNextProfileUsageStats, add new config resolution
  • src/config/types.auth.ts — add authPermanentBackoffMinutes and authPermanentMaxMinutes to cooldown config type
  • src/config/schema.labels.ts / schema.help.ts — labels and help text for new config keys
  • src/agents/auth-profiles/usage.test.ts — update test expectations for the new backoff values

All existing tests pass (136/136).

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • docs/.generated/config-baseline.core.json (modified, +32/-0)
  • docs/.generated/config-baseline.json (modified, +32/-0)
  • docs/gateway/configuration-reference.md (modified, +4/-0)
  • src/agents/auth-profiles.markauthprofilefailure.test.ts (modified, +29/-1)
  • src/agents/auth-profiles/usage.test.ts (modified, +3/-3)
  • src/agents/auth-profiles/usage.ts (modified, +34/-5)
  • src/agents/failover-error.test.ts (modified, +9/-13)
  • src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts (modified, +15/-6)
  • src/agents/pi-embedded-helpers/failover-matches.ts (modified, +3/-4)
  • src/config/config-misc.test.ts (modified, +14/-0)
  • src/config/schema.base.generated.ts (modified, +18/-0)
  • src/config/schema.help.ts (modified, +4/-0)
  • src/config/schema.labels.ts (modified, +2/-0)
  • src/config/types.auth.ts (modified, +9/-0)
  • src/config/zod-schema.ts (modified, +2/-0)
RAW_BUFFER

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

When a Google API call fails transiently (e.g., due to a momentary GCP outage or rate limit), OpenClaw marks the entire Google provider as auth_permanent, causing all Google models to be permanently skipped for the rest of the gateway session. Even after the GCP issue resolves, the gateway continues to skip all Google models indefinitely until manually restarted.

Steps to reproduce

  1. Configure OpenClaw with Google Gemini as the primary model and fallback chain.
  2. Wait for a transient GCP API failure (e.g., INVALID_ARGUMENT: API_KEY_INVALID or UNAUTHENTICATED); this can happen during periods of GCP instability.
  3. Observe auth_permanent being set for the Google provider in the logs.
  4. Wait for GCP to recover (confirmed working via GCP Console).
  5. Send a message to the agent.

Expected behavior

After a transient failure, OpenClaw should:

  • Retry the provider after a cooldown period (e.g., 5–10 minutes), OR
  • expose a CLI command to clear auth_permanent state without restarting the gateway, OR
  • automatically clear auth_permanent on the next heartbeat/health check if the provider recovers.
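The third option, clearing the state when a health check sees the provider recover, could look roughly like the sketch below. Everything here is hypothetical: the blocked mapping, the probe callback, and the function name are stand-ins for illustration and do not correspond to OpenClaw's internals.

```python
from typing import Callable, Dict, List


def clear_recovered_providers(blocked: Dict[str, str],
                              probe: Callable[[str], bool]) -> List[str]:
    """Clear auth_permanent for every provider whose probe succeeds.

    blocked maps provider name -> failure reason (hypothetical structure).
    probe is a caller-supplied health check returning True on success.
    Returns the names of the providers that were cleared.
    """
    recovered = [
        provider
        for provider, reason in blocked.items()
        if reason == "auth_permanent" and probe(provider)
    ]
    for provider in recovered:
        del blocked[provider]
    return recovered


# Example: google has recovered, while a billing failure is left untouched.
blocked_state = {"google": "auth_permanent", "other": "billing"}
cleared = clear_recovered_providers(blocked_state, probe=lambda name: True)
# cleared == ["google"]; blocked_state == {"other": "billing"}
```

Only entries marked auth_permanent are probed, so other failure reasons keep whatever cooldown they already have.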

Actual behavior

model fallback decision: decision=skip_candidate reason=auth_permanent
model fallback decision: decision=skip_candidate reason=auth_permanent
model fallback decision: decision=skip_candidate reason=auth_permanent
Embedded agent failed before reply: All models failed:
  google/gemini-3-flash-preview: Provider google has auth_permanent issue (skipping all models)
  google/gemini-3.1-flash-lite-preview: Provider google has auth_permanent issue (skipping all models)
  ...

OpenClaw version

2026.3.24

Operating system

Rocky 10.1

Install method

No response

Model

Google Gemini (gemini-3-flash-preview, gemini-3.1-flash-lite-preview, gemini-3.1-pro-preview, gemini-2.5-flash)

Provider / routing chain

gemini-3-flash-preview, gemini-3.1-flash-lite-preview, gemini-3.1-pro-preview, gemini-2.5-flash

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

Frequency: Occurs every time GCP has even a brief transient failure.
Severity: Medium-High. The agent becomes completely non-functional until the user manually restarts the gateway.
Workaround: Run systemctl --user restart openclaw-gateway.service manually. This works but requires user intervention.

Additional information

Suggested fix: Add a TTL (e.g., 5–15 minutes) to the auth_permanent state so that the provider is automatically retried after a cooldown period, similar to how transient errors are handled in other systems. Alternatively, expose a CLI command, such as openclaw models auth reset --provider google, to clear the cached failure state without requiring a full gateway restart.

extent analysis

Fix Plan

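One way to address the issue is to add a TTL (time-to-live) to the auth_permanent state so that the Google provider is retried automatically after a cooldown period. This plan is a proposal and differs from PR #60404, which keeps the exponential backoff but shortens it for auth_permanent failures.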

Step-by-Step Solution

  1. Add a TTL to the auth_permanent state:
    • Introduce a new configuration option, e.g., auth_permanent_ttl, to set the cooldown period (e.g., 5-15 minutes).
    • Update the auth_permanent state to include a timestamp when it is set.
  2. Implement automatic retry:
    • Create a scheduled task (e.g., every 1-2 minutes) to check the auth_permanent state for each provider.
    • If the cooldown period has passed, clear the auth_permanent state and allow the provider to be retried.
  3. Expose a CLI command to clear the auth_permanent state:
    • Add a new CLI command, e.g., openclaw models auth reset --provider google, to manually clear the auth_permanent state for a specific provider.

Example Code (Python)

from datetime import datetime, timedelta, timezone

# Hypothetical configuration option: auth_permanent TTL in minutes
auth_permanent_ttl = 10

# In-memory failure state, keyed by provider name
auth_permanent_state = {}

# Mark a provider as auth_permanent, recording when the state was set
def set_auth_permanent(provider):
    auth_permanent_state[provider] = {
        'timestamp': datetime.now(timezone.utc),
        'ttl': auth_permanent_ttl,
    }

# Clear the auth_permanent state once its TTL has elapsed.
# Returns True if an expired state was cleared, False otherwise.
def check_auth_permanent(provider):
    state = auth_permanent_state.get(provider)
    if state and datetime.now(timezone.utc) - state['timestamp'] > timedelta(minutes=state['ttl']):
        del auth_permanent_state[provider]
        return True
    return False

# Handler for a hypothetical CLI command that clears the state manually
def clear_auth_permanent(provider):
    if provider in auth_permanent_state:
        del auth_permanent_state[provider]
        print(f"Cleared auth_permanent state for {provider}")
    else:
        print(f"No auth_permanent state found for {provider}")
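A manual reset command along the lines suggested in the issue could be wired up with argparse as sketched below. The command does not exist in OpenClaw as far as the source shows; the program name, the --provider flag handling, and the state dictionary are all assumptions for illustration. The sketch also only mutates in-process state, whereas a real CLI would have to reach the running gateway in some way.

```python
import argparse
from typing import List, Optional

# Hypothetical in-process failure state, pre-populated for the example.
auth_permanent_state = {"google": {"reason": "auth_permanent"}}


def reset_provider(state: dict, provider: str) -> bool:
    """Remove the cached failure for provider; return True if one existed."""
    return state.pop(provider, None) is not None


def build_parser() -> argparse.ArgumentParser:
    # "auth-reset-sketch" is a placeholder program name, not a real command.
    parser = argparse.ArgumentParser(prog="auth-reset-sketch")
    parser.add_argument("--provider", required=True,
                        help="provider whose auth_permanent state to clear")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Parse arguments and clear the state; return a shell-style exit code."""
    args = build_parser().parse_args(argv)
    if reset_provider(auth_permanent_state, args.provider):
        print(f"Cleared auth_permanent state for {args.provider}")
        return 0
    print(f"No auth_permanent state found for {args.provider}")
    return 1
```

Calling main(["--provider", "google"]) clears the entry and returns 0; a second call with the same arguments finds nothing to clear and returns 1, which lets scripts distinguish the two cases.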

Verification

To verify the fix, follow these steps:

  1. Configure the auth_permanent_ttl option to a suitable value (e.g., 5 minutes).
  2. Simulate a transient GCP API failure to trigger the auth_permanent state.
  3. Wait for the cooldown period to pass and verify that the auth_permanent state is automatically cleared.
  4. Use the CLI command to manually clear the auth_permanent state and verify that it is removed.
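Steps 2 and 3 can be checked without waiting in real time by injecting a clock. The sketch below is a self-contained illustration under that assumption; AuthPermanentCache, FakeClock, and run_verification are hypothetical names and not part of OpenClaw.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Tuple


class FakeClock:
    """Controllable clock so the test does not need to sleep."""

    def __init__(self, start: datetime) -> None:
        self.current = start

    def __call__(self) -> datetime:
        return self.current

    def advance(self, minutes: float) -> None:
        self.current += timedelta(minutes=minutes)


class AuthPermanentCache:
    """Hypothetical TTL cache for auth_permanent state, keyed by provider."""

    def __init__(self, ttl_minutes: float, now: Callable[[], datetime]) -> None:
        self._ttl = timedelta(minutes=ttl_minutes)
        self._now = now
        self._marked_at: Dict[str, datetime] = {}

    def mark(self, provider: str) -> None:
        self._marked_at[provider] = self._now()

    def is_blocked(self, provider: str) -> bool:
        marked_at = self._marked_at.get(provider)
        if marked_at is None:
            return False
        if self._now() - marked_at >= self._ttl:
            del self._marked_at[provider]  # TTL elapsed: allow a retry
            return False
        return True


def run_verification() -> Tuple[bool, bool, bool]:
    clock = FakeClock(datetime(2026, 1, 1, tzinfo=timezone.utc))
    cache = AuthPermanentCache(ttl_minutes=5, now=clock)
    cache.mark("google")                      # simulated transient failure
    blocked_at_once = cache.is_blocked("google")
    clock.advance(4)                          # still inside the cooldown
    blocked_before_ttl = cache.is_blocked("google")
    clock.advance(2)                          # 6 minutes elapsed, TTL is 5
    blocked_after_ttl = cache.is_blocked("google")
    return blocked_at_once, blocked_before_ttl, blocked_after_ttl
```

run_verification() returns (True, True, False): the provider is skipped immediately and at 4 minutes, then becomes eligible again once the 5-minute TTL has passed.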

Extra Tips

  • Monitor
