openclaw - Feature Request: Per-candidate retry count for model fallback (support pool-based/proxy providers) [6 comments, 2 participants]

openclaw/openclaw#59413 (fetched 2026-04-08 02:24:08)

Problem

When using pool-based/proxy API providers (e.g., third-party Anthropic resellers that rotate through a pool of API keys), OpenClaw's model fallback mechanism switches to the next candidate after a single failure instead of retrying the same candidate. This makes pool-based providers nearly unusable as fallback options.

Real-World Scenario

We use a proxy provider (TicketPro) that forwards requests to Anthropic's API through a rotating key pool. We ran a test of 30 consecutive requests:

Metric                 Value
Success rate           40% (12/30)
Avg latency (success)  ~8s
Failure pattern        Connection/stream timeout (15s)
Pool behavior          ~1 in 3-5 keys is healthy; after 5-7 failures, a success usually appears

With OpenClaw's current behavior, the proxy provider is tried once, gets a 502, and immediately falls back to the next provider — never getting a chance to hit a healthy key in the pool.
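A rough way to quantify what a single attempt gives up: if each attempt is assumed to succeed independently at the measured 40% rate, the chance of at least one success in n attempts is 1 - (1 - p)^n. Independence is a simplifying assumption here (the pool behavior in the table suggests failures can cluster), and the function name is illustrative only.

```typescript
// Probability that at least one of `attempts` tries succeeds, assuming each
// try succeeds independently with probability `p` (a simplifying assumption,
// not a measured property of the pool).
function chanceOfSuccess(p: number, attempts: number): number {
  if (p < 0 || p > 1) throw new RangeError("p must be in [0, 1]");
  if (!Number.isInteger(attempts) || attempts < 0) {
    throw new RangeError("attempts must be a non-negative integer");
  }
  return 1 - Math.pow(1 - p, attempts);
}

const measuredRate = 12 / 30; // success rate from the 30-request test
for (const n of [1, 3, 5]) {
  const percent = (chanceOfSuccess(measuredRate, n) * 100).toFixed(1);
  console.log(`${n} attempt(s): ${percent}%`);
}
```

Under that assumption, three attempts raise the chance of reaching a healthy key from 40% to roughly 78%, and five attempts to roughly 92%.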

What Happens in Practice

TP fails (502) → switch to GLM → GLM rate_limit → switch to TP
→ TP in cooldown (30s skip) → probe TP → TP still 502
→ switch to GLM → GLM also cooldown → dead loop for 4-5 minutes

Every provider ends up in cooldown, and the fallback chain bounces between them until all are exhausted.

Root Cause Analysis

Based on source code inspection (model-fallback.ts):

  1. No per-candidate retry: The fallback loop (for (candidate of candidates)) calls runFallbackAttempt() once per candidate, then continues to the next on failure
  2. 502 classified as "timeout": classifyFailoverReasonFromHttpStatus(502) → "timeout", same as real network timeouts
  3. Aggressive cooldown: calculateAuthProfileCooldownMs applies 30s → 60s → 5min escalating cooldown after failures
  4. Cooldown skip + probe: During cooldown, candidates are skipped with a single probe attempt — but pool providers need multiple retries, not a probe

For pool-based providers, a single failure means "this particular key is bad, try another" — not "this provider is down, avoid it."
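To illustrate why the escalation in point 3 hurts pool providers, here is a minimal sketch of a 30s → 60s → 5min schedule. It is a hypothetical stand-in, not the real calculateAuthProfileCooldownMs: the function name, its input (a count of consecutive failures), and the exact step boundaries are assumptions.

```typescript
// Hypothetical stand-in for the escalation described above. The real
// calculateAuthProfileCooldownMs may take different inputs and use
// different thresholds; only the 30s / 60s / 5min values come from the issue.
function sketchCooldownMs(consecutiveFailures: number): number {
  if (consecutiveFailures <= 0) return 0;
  if (consecutiveFailures === 1) return 30 * 1000;
  if (consecutiveFailures === 2) return 60 * 1000;
  return 5 * 60 * 1000;
}

// Cooldown applied after each failure in a run of four.
const schedule = [1, 2, 3, 4].map((n) => sketchCooldownMs(n));
console.log(schedule);
```

With a provider that fails more often than it succeeds, the third failure in a row would already reach the longest cooldown in this sketch, which matches the multi-minute stall described under "What Happens in Practice".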

Expected Behavior

  • Allow configuring per-candidate retry count (e.g., maxRetriesPerCandidate: 3) so pool-based providers get multiple attempts before being marked as failed
  • Alternatively, treat timeout-classified errors differently from rate_limit/auth — timeouts from pool providers are often transient and worth retrying immediately
  • Optionally, support a retryBeforeFallback: true/false flag per provider or per model entry
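The first option could look roughly like the sketch below: an inner retry loop per candidate, wrapped by the existing outer loop over candidates. This is not OpenClaw's code. The types and the function runWithPerCandidateRetries are hypothetical, and retriesBeforeFallback is read here as the total number of attempts (so 1 means "try once"), which is one possible reading of the proposal.

```typescript
interface RetryOptions {
  retriesBeforeFallback: number; // total attempts for this candidate (assumed meaning)
  retryDelayMs: number; // pause between attempts on the same candidate
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

interface Candidate<T> {
  name: string;
  attempt: () => Promise<T>;
  options: RetryOptions;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Give each candidate its own attempt budget before falling back to the next.
async function runWithPerCandidateRetries<T>(candidates: Candidate<T>[]): Promise<T> {
  let failures = 0;
  for (const candidate of candidates) {
    const { retriesBeforeFallback, retryDelayMs, sleep = defaultSleep } = candidate.options;
    const attempts = Math.max(1, retriesBeforeFallback);
    for (let i = 1; i <= attempts; i++) {
      try {
        return await candidate.attempt();
      } catch {
        failures++;
        // Wait only if this candidate still has attempts left.
        if (i < attempts) await sleep(retryDelayMs);
      }
    }
  }
  throw new Error(`all candidates failed after ${failures} attempt(s)`);
}
```

A real implementation would also need to decide which failure reasons are worth retrying (for example timeout but not auth) and whether a cooldown is recorded per attempt or only once the budget is spent; the sketch leaves both out.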

Suggested Configuration

agents:
  defaults:
    models:
      fallbacks:
        - provider: ticketpro
          model: claude-opus-4-6
          retriesBeforeFallback: 3  # NEW: retry this candidate up to 3 times
          retryDelayMs: 2000        # NEW: delay between retries
        - provider: zhipu-coding
          model: glm-5-turbo
          retriesBeforeFallback: 1  # default behavior: try once

Or a simpler global setting:

agents:
  defaults:
    models:
      fallbackMaxRetriesPerCandidate: 3  # applies to all fallback candidates
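If both settings were supported, something would have to decide which one wins. A minimal sketch of one possible rule follows: the per-candidate value overrides the global one, and anything missing or invalid falls back to a single attempt (today's behavior). The interfaces and resolveAttempts are hypothetical, and the precedence is an assumption rather than part of the proposal.

```typescript
// Shapes mirror the YAML in the proposal; they are not OpenClaw's real types.
interface FallbackEntry {
  provider: string;
  model: string;
  retriesBeforeFallback?: number;
  retryDelayMs?: number;
}

interface ModelsConfig {
  fallbacks: FallbackEntry[];
  fallbackMaxRetriesPerCandidate?: number;
}

// Assumed precedence: per-candidate value, then global value, then 1.
function resolveAttempts(entry: FallbackEntry, config: ModelsConfig): number {
  const raw = entry.retriesBeforeFallback ?? config.fallbackMaxRetriesPerCandidate ?? 1;
  // Guard against zero, negative, or fractional values in user config.
  return Number.isInteger(raw) && raw >= 1 ? raw : 1;
}

const config: ModelsConfig = {
  fallbackMaxRetriesPerCandidate: 2,
  fallbacks: [
    { provider: "ticketpro", model: "claude-opus-4-6", retriesBeforeFallback: 3, retryDelayMs: 2000 },
    { provider: "zhipu-coding", model: "glm-5-turbo" },
  ],
};

for (const entry of config.fallbacks) {
  console.log(`${entry.provider}/${entry.model}: ${resolveAttempts(entry, config)} attempt(s)`);
}
```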

Environment

  • OpenClaw 2026.3.x
  • 17 agents in multi-tier hierarchy
  • Proxy provider: TicketPro (Anthropic reseller with key pool rotation)
  • Fallback chain: ticketpro/claude-opus-4-6 → zhipu-coding/glm-5-turbo

Related Issues

  • #57906 — proposes reducing retries (opposite direction, but same config surface)

Tested with real production data: 30 consecutive requests to TicketPro, 40% success rate, consistent pool rotation behavior observed.

extent analysis

TL;DR

Implement a per-candidate retry mechanism to allow pool-based providers multiple attempts before being marked as failed.

Guidance

  • Introduce a retriesBeforeFallback configuration option per provider or model entry to specify the number of retries before switching to the next candidate.
  • Consider adding a retryDelayMs setting to control the delay between retries.
  • Alternatively, explore treating timeout-classified errors differently from rate_limit/auth errors to allow for immediate retries.
  • Review the calculateAuthProfileCooldownMs function to ensure it doesn't overly penalize pool-based providers with aggressive cooldowns.

Example

agents:
  defaults:
    models:
      fallbacks:
        - provider: ticketpro
          model: claude-opus-4-6
          retriesBeforeFallback: 3
          retryDelayMs: 2000

Notes

The proposed solution focuses on introducing a retry mechanism to better handle pool-based providers. However, the optimal retry count and delay may vary depending on the specific provider and use case. Further testing and tuning may be necessary to achieve the desired behavior.

Recommendation

Add a retriesBeforeFallback configuration option so that pool-based providers get multiple attempts before being marked as failed. The option is proposed in this issue and is not an existing setting, so it cannot be applied as a workaround today. Once implemented, it should improve the success rate and reduce the likelihood of dead loops in the fallback chain.
