openclaw - [Bug]: Google API rate-limited connections time out and bypass rate-limit cooldown; failover classifies rate-limit hangs as transient timeouts

Issue investigated with AI assistance — root cause is a hypothesis based on static code reading of src/agents/pi-embedded-helpers/errors.ts, src/infra/retry-policy.ts, and src/agents/failover-error.ts. Author has not run the specific reproduction steps. Analysis, not observed behavior — maintainers should validate the abort-race hypothesis before choosing a fix option.

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

Bug: Google API rate-limited connections time out and bypass rate-limit cooldown logic

When a Google (Gemini) model provider is rate-limited, the API appears to throttle by hanging the connection rather than returning an immediate HTTP 429 response. The OpenClaw failover system interprets the resulting AbortError / timedOut: true as a transient timeout (not a rate-limit event), and immediately rotates to the next profile without applying any cooldown. If all profiles are rate-limited, subsequent requests repeat the same cycle — each one timing out on each profile — until the user force-restarts.

Observation (code investigation, not directly observed behavior)

Related user report: #81902 describes this exact pattern in logs:

failoverReason: "timeout"
timedOut: true, aborted: true
profileFailureReason: "timeout"

Code paths involved:

  1. src/agents/pi-embedded-helpers/errors.ts: classifyFailoverClassificationFromMessage() runs isTimeoutErrorMessage(raw) before checking HTTP status codes. If the request aborts before an HTTP 429 is received, the message-based timeout check fires, classifying the event as timeout rather than rate_limit. No Google-specific message patterns are present.

  2. src/agents/pi-embedded-helpers/provider-error-patterns.test.ts: The test matrix covers AWS Bedrock, Google Vertex context-overflow, Ollama, Mistral, Cohere, and llama.cpp. No Google 429 / RESOURCE_EXHAUSTED / quota-exceeded patterns are tested, so Google throttle-hang behavior has no regression coverage.

  3. src/infra/retry-policy.ts: CHANNEL_API_RETRY_RE = /429|timeout|connect|reset|closed|unavailable|temporarily/i retries both 429 and timeout uniformly. No "consecutive timeouts from same provider → treat as probable rate-limit" circuit breaker exists.

  4. src/agents/failover-error.ts: resolveFailoverStatus() maps rate_limit → 429, but the reverse path (429 → profile suspension) only triggers when the rate_limit classification fires. A timeout classification skips profile suspension.
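
The ordering concern in item 1 and the uniform retry pattern in item 3 can be illustrated with a minimal sketch. Only `CHANNEL_API_RETRY_RE` is quoted from the issue; `looksLikeTimeout`, `classifyTimeoutFirst`, and `classifyStatusFirst` are hypothetical, simplified stand-ins and not the actual OpenClaw implementation.

```ts
// Regex quoted in the issue from src/infra/retry-policy.ts.
const CHANNEL_API_RETRY_RE = /429|timeout|connect|reset|closed|unavailable|temporarily/i;

type FailoverReason = "timeout" | "rate_limit" | "unknown";

// Hypothetical, simplified stand-in for isTimeoutErrorMessage().
function looksLikeTimeout(raw: string): boolean {
  return /timed? ?out|aborted/i.test(raw);
}

// Message check first: an abort message wins even when a 429 status is known.
function classifyTimeoutFirst(raw: string, status?: number): FailoverReason {
  if (looksLikeTimeout(raw)) return "timeout";
  if (status === 429) return "rate_limit";
  return "unknown";
}

// Status check first: a known 429 wins over the abort message.
function classifyStatusFirst(raw: string, status?: number): FailoverReason {
  if (status === 429) return "rate_limit";
  if (looksLikeTimeout(raw)) return "timeout";
  return "unknown";
}
```

Note that reordering alone would not help when the abort fires before any status is received: with `status` undefined, both variants return `"timeout"`.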

Steps to Reproduce

  1. Configure two Google Gemini model auth profiles with a quota-constrained API key on the first profile
  2. Use the first profile until its quota is exhausted
  3. Make a new request — observe it times out (instead of immediately failing over with a rate-limit cooldown)
  4. Check logs: failoverReason: "timeout" appears; the second profile is tried; subsequent requests still attempt the first profile

Expected behavior

When a Google provider is rate-limited (quota exhausted), requests should:

  1. Fail with rate_limit classification rather than timeout
  2. Suspend the rate-limited profile for a cooldown period (same as other providers on explicit 429)
  3. NOT retry the rate-limited profile on the next user request
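
The expected behavior above can be sketched as a small profile-selection model. All names here (`ProfileState`, `recordFailure`, `pickProfile`) and the 60-second cooldown are assumptions for illustration; the issue does not show how OpenClaw stores or applies suspensions.

```ts
type FailureReason = "timeout" | "rate_limit";

interface ProfileState {
  id: string;
  suspendedUntilMs: number; // 0 means the profile is not suspended
}

// Assumed cooldown length; the real value is not stated in the issue.
const RATE_LIMIT_COOLDOWN_MS = 60000;

// Only a rate_limit classification suspends the profile. A timeout leaves it
// eligible, which mirrors the behavior the issue describes.
function recordFailure(profile: ProfileState, reason: FailureReason, nowMs: number): void {
  if (reason === "rate_limit") {
    profile.suspendedUntilMs = nowMs + RATE_LIMIT_COOLDOWN_MS;
  }
}

// Returns the first profile whose suspension has expired, if any.
function pickProfile(profiles: ProfileState[], nowMs: number): ProfileState | undefined {
  for (const profile of profiles) {
    if (profile.suspendedUntilMs <= nowMs) return profile;
  }
  return undefined;
}
```

In this model, a failure recorded as `"timeout"` leaves the first profile first in line for the next request, while `"rate_limit"` routes the next request to the second profile until the cooldown ends.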

Actual behavior

Rate-limited Google requests time out, are classified as timeout, profile rotation happens without suspension, and the rate-limited profile is retried on the next request.

Hypothesis (analysis only — not confirmed by running the code)

Google's API returns HTTP 429 with the body "RESOURCE_EXHAUSTED" or similar, but when the request has an active modelRequestTimeoutMs abort controller, the streaming fetch may not have sufficient time to read the 429 response body before the abort fires. The abort precedes the status-code-based classification path in errors.ts, so isTimeoutErrorMessage(raw) catches the AbortError and short-circuits to timeout.

Alternatively: Google's quota enforcement imposes a long back-pressure delay before closing the connection, so the timeout fires while the connection is still open.
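
Both variants of the hypothesis reduce to the same race: whichever happens first, a readable response or the abort, decides the classification. The following is a virtual-clock model of that race, not a reproduction. `simulateAttempt` and `reasonFor` are hypothetical names, and the timings in the usage note are invented for illustration.

```ts
type AttemptResult =
  | { kind: "response"; status: number }
  | { kind: "aborted" };

// The response becomes readable at responseReadyAtMs; the abort controller
// (driven by modelRequestTimeoutMs in the issue) fires at timeoutMs.
function simulateAttempt(responseReadyAtMs: number, status: number, timeoutMs: number): AttemptResult {
  return responseReadyAtMs < timeoutMs
    ? { kind: "response", status }
    : { kind: "aborted" };
}

// An aborted attempt carries no status, so it can only be seen as a timeout.
function reasonFor(result: AttemptResult): "rate_limit" | "timeout" | "other" {
  if (result.kind === "aborted") return "timeout";
  return result.status === 429 ? "rate_limit" : "other";
}
```

Under this model, a 429 that is readable after 200 ms with a 30 s timeout is classified as `rate_limit`, while the same 429 held back for 45 s is classified as `timeout`.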

Proposed fix options

Option A (highest confidence, widest fix): In src/agents/pi-embedded-helpers/errors.ts, if isTimeoutErrorMessage(raw) fires but the request metadata indicates a "consecutive timeout" pattern for the same profile within a short window (e.g., 3 timeouts in 60s), promote the classification from timeout to probable_rate_limit. Add the pattern to provider-error-patterns.test.ts.
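
A minimal sketch of the promotion rule in Option A, assuming an in-memory sliding window keyed by profile ID. `createTimeoutTracker` is a hypothetical helper; the issue does not say where such state would live or how request metadata is exposed in errors.ts.

```ts
type TimeoutClassification = "timeout" | "probable_rate_limit";

// Returns a classifier that promotes a timeout to probable_rate_limit once
// `threshold` timeouts for the same profile fall within `windowMs`.
function createTimeoutTracker(windowMs: number, threshold: number) {
  const recentByProfile: Record<string, number[] | undefined> = {};

  return function classifyTimeout(profileId: string, nowMs: number): TimeoutClassification {
    // Keep only timestamps still inside the window, then record this timeout.
    const recent = (recentByProfile[profileId] ?? []).filter((t) => nowMs - t < windowMs);
    recent.push(nowMs);
    recentByProfile[profileId] = recent;
    return recent.length >= threshold ? "probable_rate_limit" : "timeout";
  };
}
```

With the example values from the issue (`createTimeoutTracker(60000, 3)`), the third timeout on one profile within 60 seconds is promoted, while timeouts on other profiles are counted separately.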

Option B (narrow, Google-specific): Add a Google-specific wrapper in the transport layer that checks for RESOURCE_EXHAUSTED in the response body or gRPC status before the abort fires, and surfaces it as a proper 429. This prevents the status from being lost in the abort race.
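
One way to picture Option B is a transport wrapper that records whatever it observed before the abort and lets that evidence override the timeout. In this sketch, `ObservedResponse`, `isGoogleRateLimit`, and `surfaceAttempt` are hypothetical; how the real transport layer exposes partial responses is not shown in the issue. The value 8 is the standard gRPC status code for RESOURCE_EXHAUSTED.

```ts
interface ObservedResponse {
  httpStatus?: number; // set once response headers arrive
  bodyText?: string;   // whatever body text was read before the abort
  grpcCode?: number;   // gRPC status code, if the transport exposes one
}

const GRPC_RESOURCE_EXHAUSTED = 8;

function isGoogleRateLimit(observed: ObservedResponse): boolean {
  if (observed.httpStatus === 429) return true;
  if (observed.grpcCode === GRPC_RESOURCE_EXHAUSTED) return true;
  return /RESOURCE_EXHAUSTED/.test(observed.bodyText ?? "");
}

type Surfaced =
  | { kind: "http"; status: number }
  | { kind: "timeout" }
  | { kind: "unknown" };

// Rate-limit evidence seen before the abort takes priority over the timeout.
function surfaceAttempt(observed: ObservedResponse, aborted: boolean): Surfaced {
  if (isGoogleRateLimit(observed)) return { kind: "http", status: 429 };
  if (aborted) return { kind: "timeout" };
  if (observed.httpStatus !== undefined) return { kind: "http", status: observed.httpStatus };
  return { kind: "unknown" };
}
```

This only helps when some part of the response arrives before the abort. If the connection hangs with no headers at all, nothing is observed and the attempt still surfaces as a timeout.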

Option C (test coverage only — won't fix the race, but will catch regressions): Add Google 429 / quota-exceeded test cases to provider-error-patterns.test.ts to ensure any Google-specific error message pattern is classified as rate_limit by classifyFailoverClassificationFromMessage().
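
A framework-agnostic sketch of the kind of table Option C describes. `classifyMessage` is a hypothetical stand-in for classifyFailoverClassificationFromMessage(), and the sample messages are illustrative rather than captured from real Google responses; a real test would call the actual function and use the test runner already used in provider-error-patterns.test.ts.

```ts
type Reason = "rate_limit" | "timeout" | "unknown";

// Hypothetical stand-in classifier: rate-limit patterns are checked first.
function classifyMessage(raw: string): Reason {
  if (/RESOURCE_EXHAUSTED|quota exceeded|\b429\b/i.test(raw)) return "rate_limit";
  if (/timed? ?out|aborted/i.test(raw)) return "timeout";
  return "unknown";
}

// Illustrative cases; the message strings are assumptions, not logged output.
const googleCases: Array<{ raw: string; expected: Reason }> = [
  { raw: '{"error":{"code":429,"status":"RESOURCE_EXHAUSTED"}}', expected: "rate_limit" },
  { raw: "429 Too Many Requests", expected: "rate_limit" },
  { raw: "Quota exceeded for this API key", expected: "rate_limit" },
  { raw: "The operation was aborted", expected: "timeout" },
];

// Returns the raw messages whose classification does not match expectations.
function failingCases(): string[] {
  return googleCases
    .filter((c) => classifyMessage(c.raw) !== c.expected)
    .map((c) => c.raw);
}
```

An empty result from `failingCases()` means every listed pattern is classified as expected. The last case documents that a bare abort message remains a timeout, which is the gap Options A and B address.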

OpenClaw version

Observed on v2026.5.7 / v2026.5.12 (per related #81902 report)

Operating system

Various (not OS-specific)

Install method

npm global


