When a Google provider is rate-limited (quota exhausted), requests should: 1. Fail with `rate_limit` classification rather than `timeout` 2. Suspend the rate-limited profile for a cooldown period (same as other providers on explicit 429) 3. NOT retry the rate-limited profile on the next user request

openclaw - 💡(How to fix) Fix [Bug]: Google API rate-limited connections time out and bypass rate-limit cooldown — failover classifies rate-limit hangs as transient timeouts

Error Message

src/agents/pi-embedded-helpers/provider-error-patterns.test.ts — Test matrix covers AWS Bedrock, Google Vertex context-overflow, Ollama, Mistral, Cohere, llama.cpp. No Google 429 / RESOURCE_EXHAUSTED / quota-exceeded patterns are tested. This gap means Google throttle-hang behavior has no regression coverage.
src/agents/failover-error.ts — resolveFailoverStatus() maps rate_limit → 429, but the reverse path (429 → profile suspension) only triggers when the rate_limit classification fires. A timeout classification skips profile suspension. Option A (highest confidence, widest fix): In src/agents/pi-embedded-helpers/errors.ts, if isTimeoutErrorMessage(raw) fires but the request metadata indicates a "consecutive timeout" pattern for the same profile within a short window (e.g., 3 timeouts in 60s), promote the classification from timeout to probable_rate_limit. Add the pattern to provider-error-patterns.test.ts. Option C (test coverage only — won't fix the race, but will catch regressions): Add Google 429 / quota-exceeded test cases to provider-error-patterns.test.ts to ensure any Google-specific error message pattern is classified as rate_limit by classifyFailoverClassificationFromMessage().

Issue investigated with AI assistance — root cause is a hypothesis based on static code reading of src/agents/pi-embedded-helpers/errors.ts, src/infra/retry-policy.ts, and src/agents/failover-error.ts. Author has not run the specific reproduction steps. Analysis, not observed behavior — maintainers should validate the abort-race hypothesis before choosing a fix option.

Root Cause

Issue investigated with AI assistance — root cause is a hypothesis based on static code reading of src/agents/pi-embedded-helpers/errors.ts, src/infra/retry-policy.ts, and src/agents/failover-error.ts. Author has not run the specific reproduction steps. Analysis, not observed behavior — maintainers should validate the abort-race hypothesis before choosing a fix option.

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Bug: Google API rate-limited connections time out and bypass rate-limit cooldown logic

When a Google (Gemini) model provider is rate-limited, the API appears to throttle by hanging the connection rather than returning an immediate HTTP 429 response. The OpenClaw failover system interprets the resulting AbortError / timedOut: true as a transient timeout (not a rate-limit event), and immediately rotates to the next profile without applying any cooldown. If all profiles are rate-limited, subsequent requests repeat the same cycle — each one timing out on each profile — until the user force-restarts.

Observation (code investigation, not directly observed behavior)

Related user report: #81902 describes this exact pattern in logs:

failoverReason: "timeout"
timedOut: true, aborted: true
profileFailureReason: "timeout"

Code paths involved:

src/agents/pi-embedded-helpers/errors.ts — classifyFailoverClassificationFromMessage() runs isTimeoutErrorMessage(raw) before checking HTTP status codes. If the request aborts before an HTTP 429 is received, the message-based timeout check fires, classifying the event as timeout rather than rate_limit. No Google-specific message patterns are present.
src/agents/pi-embedded-helpers/provider-error-patterns.test.ts — Test matrix covers AWS Bedrock, Google Vertex context-overflow, Ollama, Mistral, Cohere, llama.cpp. No Google 429 / RESOURCE_EXHAUSTED / quota-exceeded patterns are tested. This gap means Google throttle-hang behavior has no regression coverage.
src/infra/retry-policy.ts — CHANNEL_API_RETRY_RE = /429|timeout|connect|reset|closed|unavailable|temporarily/i retries both 429 and timeout uniformly. No "consecutive timeouts from same provider → treat as probable rate-limit" circuit-breaker exists.
src/agents/failover-error.ts — resolveFailoverStatus() maps rate_limit → 429, but the reverse path (429 → profile suspension) only triggers when the rate_limit classification fires. A timeout classification skips profile suspension.

Steps to Reproduce

Configure two Google Gemini model auth profiles with a quota-constrained API key on the first profile
Use the first profile until its quota is exhausted
Make a new request — observe it times out (instead of immediately failing over with a rate-limit cooldown)
Check logs: failoverReason: "timeout" appears; the second profile is tried; subsequent requests still attempt the first profile

Expected behavior

When a Google provider is rate-limited (quota exhausted), requests should:

Fail with rate_limit classification rather than timeout
Suspend the rate-limited profile for a cooldown period (same as other providers on explicit 429)
NOT retry the rate-limited profile on the next user request

Actual behavior

Rate-limited Google requests time out, are classified as timeout, profile rotation happens without suspension, and the rate-limited profile is retried on the next request.

Hypothesis (analysis only — not confirmed by running the code)

Google's API returns HTTP 429 with the body "RESOURCE_EXHAUSTED" or similar, but when the request has an active modelRequestTimeoutMs abort controller, the streaming fetch may not have sufficient time to read the 429 response body before the abort fires. The abort precedes the status-code-based classification path in errors.ts, so isTimeoutErrorMessage(raw) catches the AbortError and short-circuits to timeout.

Alternatively: Google's quota enforcement imposes a long back-pressure delay before closing the connection, so the timeout fires while the connection is still open.

Proposed fix options

Option A (highest confidence, widest fix): In src/agents/pi-embedded-helpers/errors.ts, if isTimeoutErrorMessage(raw) fires but the request metadata indicates a "consecutive timeout" pattern for the same profile within a short window (e.g., 3 timeouts in 60s), promote the classification from timeout to probable_rate_limit. Add the pattern to provider-error-patterns.test.ts.

Option B (narrow, Google-specific): Add a Google-specific wrapper in the transport layer that checks for RESOURCE_EXHAUSTED in the response body or gRPC status before the abort fires, and surfaces it as a proper 429. This prevents the status from being lost in the abort race.

Option C (test coverage only — won't fix the race, but will catch regressions): Add Google 429 / quota-exceeded test cases to provider-error-patterns.test.ts to ensure any Google-specific error message pattern is classified as rate_limit by classifyFailoverClassificationFromMessage().

OpenClaw version

Observed on v2026.5.7 / v2026.5.12 (per related #81902 report)

Operating system

Various (not OS-specific)

Install method

npm global

Issue investigated with AI assistance — root cause is a hypothesis based on static code reading of src/agents/pi-embedded-helpers/errors.ts, src/infra/retry-policy.ts, and src/agents/failover-error.ts. Author has not run the specific reproduction steps. Analysis, not observed behavior — maintainers should validate the abort-race hypothesis before choosing a fix option.

FAQ

Expected behavior

When a Google provider is rate-limited (quota exhausted), requests should:

Fail with rate_limit classification rather than timeout
Suspend the rate-limited profile for a cooldown period (same as other providers on explicit 429)
NOT retry the rate-limited profile on the next user request

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix [Bug]: Google API rate-limited connections time out and bypass rate-limit cooldown — failover classifies rate-limit hangs as transient timeouts

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Code Example

Bug type

Beta release blocker

Summary

Bug: Google API rate-limited connections time out and bypass rate-limit cooldown logic

Observation (code investigation, not directly observed behavior)

Steps to Reproduce

Expected behavior

Actual behavior

Hypothesis (analysis only — not confirmed by running the code)

Proposed fix options

OpenClaw version

Operating system

Install method

FAQ

Expected behavior

Still need to ship something?

TRENDING

openclaw - 💡(How to fix) Fix [Bug]: Google API rate-limited connections time out and bypass rate-limit cooldown — failover classifies rate-limit hangs as transient timeouts

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Code Example

Bug type

Beta release blocker

Summary

Bug: Google API rate-limited connections time out and bypass rate-limit cooldown logic

Observation (code investigation, not directly observed behavior)

Steps to Reproduce

Expected behavior

Actual behavior

Hypothesis (analysis only — not confirmed by running the code)

Proposed fix options

OpenClaw version

Operating system

Install method

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING