openclaw - Feature Request: Per-candidate retry count for model fallback (support pool-based/proxy providers) [6 comments, 2 participants]

openclaw • 2026-04-02 04:24:33


GitHub stats

openclaw/openclaw#59413•Fetched 2026-04-08 02:24:08


Author

1052326311

Participants

1052326311

rynomster

Timeline (top)

commented ×6, mentioned ×1, subscribed ×1


Problem

When using pool-based/proxy API providers (e.g., third-party Anthropic resellers that rotate through a pool of API keys), OpenClaw's model fallback mechanism switches to the next candidate after a single failure instead of retrying the same candidate. This makes pool-based providers nearly unusable as fallback options.

Real-World Scenario

We use a proxy provider (TicketPro) that forwards requests to Anthropic's API through a rotating key pool. We ran a test of 30 consecutive requests:

Metric	Value
Success rate	40% (12/30)
Avg latency (success)	~8s
Failure pattern	Connection/stream timeout (15s)
Pool behavior	~1 in 3-5 keys is healthy; after 5-7 failures, a success usually appears

With OpenClaw's current behavior, the proxy provider is tried once, gets a 502, and immediately falls back to the next provider — never getting a chance to hit a healthy key in the pool.
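
As a rough estimate of what retries could buy, suppose each attempt succeeded independently with the measured 40% rate. That independence is an assumption: the pool behavior in the table suggests failures may cluster, so the real numbers could be worse. Under that assumption the chance of at least one success in n attempts is 1 - 0.6^n, which is about 78% for three attempts and about 92% for five. The helper name below is illustrative only.

```typescript
// Probability that at least one of `attempts` tries succeeds, assuming each
// try succeeds independently with probability `p`. Independence is an
// assumption; failures in a real key pool may be correlated.
function successWithinAttempts(p: number, attempts: number): number {
  if (p < 0 || p > 1) throw new RangeError("p must be between 0 and 1");
  if (!Number.isInteger(attempts) || attempts < 0) {
    throw new RangeError("attempts must be a non-negative integer");
  }
  return 1 - Math.pow(1 - p, attempts);
}

const measuredSuccessRate = 12 / 30; // 40%, from the 30-request test

for (const attempts of [1, 3, 5]) {
  const pct = (successWithinAttempts(measuredSuccessRate, attempts) * 100).toFixed(1);
  console.log(`${attempts} attempt(s): ${pct}% chance of at least one success`);
}
```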

What Happens in Practice

TP fails (502) → switch to GLM → GLM rate_limit → switch to TP
→ TP in cooldown (30s skip) → probe TP → TP still 502
→ switch to GLM → GLM also cooldown → dead loop for 4-5 minutes

Every provider ends up in cooldown, and the fallback chain bounces between them until all are exhausted.
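
To make the skip behavior concrete, here is a minimal sketch of an escalating cooldown using the 30s, 60s, 5min steps quoted in this issue. It is not the real calculateAuthProfileCooldownMs; both function names and the cap at the last step are assumptions made for illustration.

```typescript
// Illustrative stand-in for the escalating cooldown described in the issue
// (30s, then 60s, then 5min). NOT the real calculateAuthProfileCooldownMs.
function sketchCooldownMs(consecutiveFailures: number): number {
  if (consecutiveFailures <= 0) return 0; // no failures, no cooldown
  if (consecutiveFailures === 1) return 30_000;
  if (consecutiveFailures === 2) return 60_000;
  return 300_000; // assumed cap: stay at 5 minutes from the third failure on
}

// A candidate is skipped while `nowMs` is still inside its cooldown window.
function isInCooldown(
  lastFailureAtMs: number,
  consecutiveFailures: number,
  nowMs: number,
): boolean {
  return nowMs < lastFailureAtMs + sketchCooldownMs(consecutiveFailures);
}

// Two quick failures on the same provider already cost 60s of skipping.
console.log(isInCooldown(0, 2, 45_000));
```

With only two candidates in the chain, a couple of failures on each is enough for both windows to overlap, which matches the bouncing pattern shown above.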

Root Cause Analysis

Based on source code inspection (model-fallback.ts):

No per-candidate retry: The fallback loop (for (candidate of candidates)) calls runFallbackAttempt() once per candidate, then continues to the next on failure
502 classified as "timeout": classifyFailoverReasonFromHttpStatus(502) → "timeout", same as real network timeouts
Aggressive cooldown: calculateAuthProfileCooldownMs applies 30s → 60s → 5min escalating cooldown after failures
Cooldown skip + probe: During cooldown, candidates are skipped with a single probe attempt — but pool providers need multiple retries, not a probe

For pool-based providers, a single failure means "this particular key is bad, try another" — not "this provider is down, avoid it."
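
The missing behavior can be sketched as an inner retry loop around each candidate. This is a hedged illustration, not OpenClaw's code: the Candidate shape, runWithPerCandidateRetries, and the injected wait function are all hypothetical, and retriesBeforeFallback is read as the total number of attempts (so 1 means "try once", as in the suggested configuration).

```typescript
// Hypothetical shapes; none of these names come from the OpenClaw source.
interface Candidate {
  provider: string;
  model: string;
  retriesBeforeFallback?: number; // proposed option: total attempts for this candidate
  retryDelayMs?: number;          // proposed option: pause between attempts
}

type Attempt<T> = (candidate: Candidate) => Promise<T>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Try each candidate up to its configured number of attempts before moving on.
async function runWithPerCandidateRetries<T>(
  candidates: Candidate[],
  attempt: Attempt<T>,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let lastError: unknown = new Error("no fallback candidates configured");
  for (const candidate of candidates) {
    const maxAttempts = Math.max(1, candidate.retriesBeforeFallback ?? 1);
    for (let n = 1; n <= maxAttempts; n++) {
      try {
        return await attempt(candidate);
      } catch (error) {
        lastError = error;
        // Only pause if another attempt on the same candidate will follow.
        if (n < maxAttempts) await wait(candidate.retryDelayMs ?? 0);
      }
    }
  }
  // Every candidate used up its attempts; surface the most recent failure.
  throw lastError;
}
```

Injecting wait keeps the loop testable without real delays. A production version would also need to decide how these retries interact with cooldown bookkeeping, which this sketch leaves out.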

Expected Behavior

Allow configuring per-candidate retry count (e.g., maxRetriesPerCandidate: 3) so pool-based providers get multiple attempts before being marked as failed
Alternatively, treat timeout-classified errors differently from rate_limit/auth — timeouts from pool providers are often transient and worth retrying immediately
Optionally, support a retryBeforeFallback: true/false flag per provider or per model entry
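
The second option, treating timeout-classified failures as worth an immediate retry, could look roughly like the following. The FailoverReason union lists only the reasons named in this issue, and shouldRetrySameCandidate is a hypothetical helper; the real classifier may return other values.

```typescript
// Reasons named in the issue; the real classifier may produce others.
type FailoverReason = "timeout" | "rate_limit" | "auth";

// Retry the same candidate right away only for timeout-classified failures
// (which, per the issue, include HTTP 502). Rate limits and auth errors
// fall through to the next candidate as they do today.
function shouldRetrySameCandidate(
  reason: FailoverReason,
  attemptsSoFar: number,
  maxAttempts: number,
): boolean {
  if (attemptsSoFar >= maxAttempts) return false; // attempt budget exhausted
  return reason === "timeout";
}

console.log(shouldRetrySameCandidate("timeout", 1, 3));    // retry same candidate
console.log(shouldRetrySameCandidate("rate_limit", 1, 3)); // move to next candidate
```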

Suggested Configuration

agents:
  defaults:
    models:
      fallbacks:
        - provider: ticketpro
          model: claude-opus-4-6
          retriesBeforeFallback: 3  # NEW: retry this candidate up to 3 times
          retryDelayMs: 2000        # NEW: delay between retries
        - provider: zhipu-coding
          model: glm-5-turbo
          retriesBeforeFallback: 1  # default behavior: try once

Or a simpler global setting:

agents:
  defaults:
    models:
      fallbackMaxRetriesPerCandidate: 3  # applies to all fallback candidates
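
If both settings were supported together, the effective attempt count for a candidate would need a precedence rule. The sketch below assumes the per-entry value wins, then the global value, then the current behavior of one attempt. The issue presents the two settings as alternatives, so this precedence and the type and function names are assumptions.

```typescript
// Hypothetical config shapes mirroring the two YAML proposals.
interface FallbackEntry {
  provider: string;
  model: string;
  retriesBeforeFallback?: number;
  retryDelayMs?: number;
}

interface ModelsConfig {
  fallbacks?: FallbackEntry[];
  fallbackMaxRetriesPerCandidate?: number;
}

// Per-entry value wins, then the global value, then today's behavior (1).
function resolveAttempts(entry: FallbackEntry, models: ModelsConfig): number {
  const configured =
    entry.retriesBeforeFallback ?? models.fallbackMaxRetriesPerCandidate ?? 1;
  // Guard against NaN, zero, negative, or fractional values in hand-written config.
  if (!Number.isFinite(configured)) return 1;
  return Math.max(1, Math.floor(configured));
}

const models: ModelsConfig = {
  fallbackMaxRetriesPerCandidate: 2,
  fallbacks: [
    { provider: "ticketpro", model: "claude-opus-4-6", retriesBeforeFallback: 3, retryDelayMs: 2000 },
    { provider: "zhipu-coding", model: "glm-5-turbo" },
  ],
};

for (const entry of models.fallbacks ?? []) {
  console.log(`${entry.provider}/${entry.model}: ${resolveAttempts(entry, models)} attempt(s)`);
}
```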

Environment

OpenClaw 2026.3.x
17 agents in multi-tier hierarchy
Proxy provider: TicketPro (Anthropic reseller with key pool rotation)
Fallback chain: ticketpro/claude-opus-4-6 → zhipu-coding/glm-5-turbo

Related Issues

#57906 — proposes reducing retries (opposite direction, but same config surface)

Tested with real production data: 30 consecutive requests to TicketPro, 40% success rate, consistent pool rotation behavior observed.

extent analysis

TL;DR

Implement a per-candidate retry mechanism to allow pool-based providers multiple attempts before being marked as failed.

Guidance

Introduce a retriesBeforeFallback configuration option per provider or model entry to specify the number of retries before switching to the next candidate.
Consider adding a retryDelayMs setting to control the delay between retries.
Alternatively, explore treating timeout-classified errors differently from rate_limit/auth errors to allow for immediate retries.
Review the calculateAuthProfileCooldownMs function to ensure it doesn't overly penalize pool-based providers with aggressive cooldowns.

Example

agents:
  defaults:
    models:
      fallbacks:
        - provider: ticketpro
          model: claude-opus-4-6
          retriesBeforeFallback: 3
          retryDelayMs: 2000

Notes

The proposed solution focuses on introducing a retry mechanism to better handle pool-based providers. However, the optimal retry count and delay may vary depending on the specific provider and use case. Further testing and tuning may be necessary to achieve the desired behavior.

Recommendation

Introduce a retriesBeforeFallback configuration option so that pool-based providers get multiple attempts before being marked as failed. This is a proposed feature rather than a workaround available today: the option is marked NEW in the suggested configuration. It should improve the success rate and reduce the likelihood of dead loops in the fallback chain.


