openclaw - ✅(Solved) Fix classifyFailoverReasonFromHttpStatus doesn't handle HTTP 500: no retry/failover on Anthropic server errors [1 pull request, 1 comment, 1 participant]


GitHub stats
openclaw/openclaw#55575 • Fetched 2026-04-08 01:37:45
Comments: 1 · Participants: 1 · Timeline: 4 · Reactions: 0
Timeline (top): closed ×1, commented ×1, cross-referenced ×1, locked ×1

Error Message

  1. OpenClaw doesn't classify HTTP 500 as a transient error. HTTP 500 from Anthropic (or any provider) is a transient server-side error, so it should be classified as "timeout" to trigger the same retry/failover logic as 502/504.

Fix Action

Workaround

Local patch: sed -i 's#status === 502 || status === 504#status === 500 || status === 502 || status === 504#' dist/pi-embedded-*.js
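The sed one-liner is not idempotent. A second run matches the "status === 502 || status === 504" that is still inside the patched text and prepends another "status === 500 ||". The result is harmless but cluttered. Below is a minimal TypeScript sketch of the same text edit that is safe to re-run. It assumes the bundle contains that exact literal. The helper name patchBundleSource is hypothetical, and reading and writing the dist/ files is left out.

```typescript
// Hypothetical helper mirroring the sed workaround. Assumes the bundle contains
// the literal text "status === 502 || status === 504".
const ORIGINAL = "status === 502 || status === 504";
const PATCHED = "status === 500 || status === 502 || status === 504";

function patchBundleSource(source: string): string {
  // First collapse any already-patched condition back to the original text,
  // then patch every occurrence exactly once, so repeated runs are no-ops.
  return source.split(PATCHED).join(ORIGINAL).split(ORIGINAL).join(PATCHED);
}

const sample = 'if (status === 502 || status === 504) return "timeout";';
const once = patchBundleSource(sample);
const twice = patchBundleSource(once);

console.log(once); // if (status === 500 || status === 502 || status === 504) return "timeout";
console.log(once === twice); // true
```

Using split/join instead of a regular expression avoids escaping the "||" operators and keeps the match strictly literal, the same way the sed pattern behaves.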

PR fix notes

PR #55579: fix(agents): classify nested HTTP 500 for failover

Description (problem / solution / changelog)

Summary

  • classify nested HTTP response statuses from provider/client errors, not just top-level status fields
  • ensure HTTP 500 follows the existing transient failover path used for 502/504
  • add a focused regression test covering nested response.status values for 500/502/504
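The first bullet means a status code may sit either on the error object itself or under a nested response object. The sketch below reads both. It assumes the field names status, statusCode, and response.status. The helpers extractHttpStatus and readStatus are hypothetical, and PR #55579 may not check exactly these fields.

```typescript
// Hypothetical sketch: the probed fields (status, statusCode, response.status)
// are assumptions about common HTTP client error shapes, not necessarily
// the exact set PR #55579 checks.
function readStatus(value: unknown): number | undefined {
  if (typeof value !== "object" || value === null) return undefined;
  const record = value as Record<string, unknown>;
  for (const key of ["status", "statusCode"]) {
    const candidate = record[key];
    if (typeof candidate === "number") return candidate;
  }
  return undefined;
}

function extractHttpStatus(err: unknown): number | undefined {
  // Prefer a top-level status; otherwise look one level down in err.response.
  const topLevel = readStatus(err);
  if (topLevel !== undefined) return topLevel;
  if (typeof err !== "object" || err === null) return undefined;
  return readStatus((err as { response?: unknown }).response);
}

console.log(extractHttpStatus({ status: 502 })); // 502
console.log(extractHttpStatus({ response: { status: 500 } })); // 500
console.log(extractHttpStatus(new Error("boom"))); // undefined
```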

Testing

  • pnpm test -- src/agents/failover-error.test.ts (fails in this environment because node_modules is missing)
  • pnpm install (fails in this environment due to an npm DNS/network error: EAI_AGAIN to registry.npmjs.org)

Closes #55575

Changed files

  • src/agents/failover-error.test.ts (modified, +18/-0)
  • src/agents/failover-error.ts (modified, +5/-1)

Code Example

// Before:
if (status === 502 || status === 504) return "timeout";

// After:
if (status === 500 || status === 502 || status === 504) return "timeout";
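Here is a minimal sketch of the patched mapping as a standalone function, for context. It covers only the status codes the issue lists. The names classifyHttpStatusSketch and looksOverloaded are hypothetical. The issue does not say how the real function chooses between "timeout" and "overloaded" for 503.

```typescript
// Sketch of the patched status mapping, limited to the codes the issue lists.
// classifyHttpStatusSketch and looksOverloaded are hypothetical; the issue does
// not say how the real function picks "timeout" vs "overloaded" for 503.
type FailoverReason = "timeout" | "overloaded";

function classifyHttpStatusSketch(status: number, looksOverloaded = false): FailoverReason | null {
  if (status === 500 || status === 502 || status === 504) return "timeout"; // 500 is the fix
  if (status === 503) return looksOverloaded ? "overloaded" : "timeout";
  if (status === 529) return "overloaded";
  if (status === 408) return "timeout";
  return null; // unclassified: no retry, no fallback
}

for (const code of [408, 500, 502, 503, 504, 529, 400]) {
  console.log(code, classifyHttpStatusSketch(code));
}
```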
Original issue

Bug

classifyFailoverReasonFromHttpStatus in pi-embedded returns null for HTTP 500, which means:

  1. OpenClaw doesn't classify it as a transient error
  2. The failover system doesn't trigger
  3. No retry, no fallback to the configured fallback model
  4. The run just dies with isError=true

Current behavior

The function handles:

  • 502, 504 → "timeout" (triggers retry/failover)
  • 503 → "timeout" or "overloaded"
  • 529 → "overloaded"
  • 408 → "timeout"

But skips 500 entirely, returning null.

Expected behavior

HTTP 500 from Anthropic (or any provider) is a transient server-side error. It should be classified as "timeout" to trigger the same retry/failover logic as 502/504.

Proposed fix

// Before:
if (status === 502 || status === 504) return "timeout";

// After:
if (status === 500 || status === 502 || status === 504) return "timeout";

Impact

When Anthropic returns HTTP 500 (which happens occasionally during high load), agents on models with configured fallbacks (e.g., Opus primary → GPT-5.4 fallback) don't fail over; the entire run fails instead of gracefully switching to the fallback model.
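The sketch below shows why a null classification ends the run. It is a hypothetical primary → fallback loop that moves to the next model only when the failure is classified. It compares a classifier without 500 against one with 500. The model labels and the helpers httpError and runWithFallbacks are illustrative assumptions, not OpenClaw's implementation. Retries on the same model are omitted.

```typescript
// Hypothetical primary -> fallback loop. The model labels, httpError, and
// runWithFallbacks are illustrative assumptions, not OpenClaw's implementation.
type FailoverReason = "timeout" | "overloaded";
type Classifier = (status: number) => FailoverReason | null;

// Before the fix: 500 falls through to null (503 simplified to "timeout").
const before: Classifier = (s) =>
  s === 408 || s === 502 || s === 503 || s === 504 ? "timeout" : s === 529 ? "overloaded" : null;
// After the fix: 500 is treated like 502/504.
const after: Classifier = (s) => (s === 500 ? "timeout" : before(s));

function httpError(status: number): Error & { status: number } {
  const err = new Error(`provider returned HTTP ${status}`) as Error & { status: number };
  err.status = status;
  return err;
}

function runWithFallbacks<T>(
  models: string[],
  call: (model: string) => T,
  classify: Classifier,
): { model: string; result: T } {
  let lastError: unknown = new Error("no models configured");
  for (const model of models) {
    try {
      return { model, result: call(model) };
    } catch (err) {
      lastError = err;
      const status =
        typeof err === "object" && err !== null ? (err as { status?: unknown }).status : undefined;
      const reason = typeof status === "number" ? classify(status) : null;
      // An unclassified failure aborts the whole run instead of trying the next model.
      if (reason === null) throw err;
    }
  }
  throw lastError;
}

const models = ["opus-primary", "gpt-5.4-fallback"]; // hypothetical labels
const call = (model: string): string => {
  if (model === "opus-primary") throw httpError(500); // simulated Anthropic 500
  return `answer from ${model}`;
};

console.log(runWithFallbacks(models, call, after).model); // gpt-5.4-fallback
try {
  runWithFallbacks(models, call, before);
} catch (err) {
  console.log("run failed:", (err as Error).message); // run failed: provider returned HTTP 500
}
```

Passing the classifier in as a parameter keeps the loop independent of any particular status mapping. That makes the before/after comparison a one-argument change.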

Reproduction

  1. Configure a model with fallbacks in openclaw.json
  2. Wait for an Anthropic 500 (or simulate one with a proxy)
  3. Observe that the run dies instead of failing over


extent analysis

Fix Plan

To fix the issue, update the classifyFailoverReasonFromHttpStatus function in pi-embedded to classify HTTP 500 as a transient error.

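  • Update the condition to include HTTP 500:
      if (status === 500 || status === 502 || status === 504) return "timeout";
  • Also classify HTTP statuses nested in provider/client errors (e.g. response.status), as PR #55579 does.
  • Ship the change in a release so users no longer need the local dist/ patch.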

Verification

To verify the fix:

  1. Configure a model with fallbacks in openclaw.json.
  2. Simulate an Anthropic 500 error (or wait for a natural occurrence).
  3. Check that the run fails over to the configured fallback model instead of dying.

Extra Tips

  • Consider adding a test case to cover this scenario and prevent regressions.
  • Review other HTTP status codes to ensure they are correctly classified as transient or non-transient errors.
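The first tip could take the form of a table-driven regression check that covers both top-level and nested statuses. This sketch uses plain TypeScript without a test framework. The error shapes, the classifier, and the null expectation for 400 are assumptions, not OpenClaw's actual test cases.

```typescript
// Hypothetical regression table in the spirit of the PR's test. The error
// shapes, classifier, and null expectation for 400 are assumptions.
type FailoverReason = "timeout" | "overloaded";

function statusOf(err: unknown): number | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const e = err as { status?: unknown; response?: { status?: unknown } };
  const raw = typeof e.status === "number" ? e.status : e.response?.status;
  return typeof raw === "number" ? raw : undefined;
}

function classifyError(err: unknown): FailoverReason | null {
  const status = statusOf(err);
  if (status === 500 || status === 502 || status === 504 || status === 408) return "timeout";
  if (status === 529) return "overloaded";
  return null; // 503 handling omitted in this sketch
}

const cases: Array<[label: string, err: unknown, expected: FailoverReason | null]> = [
  ["top-level 500", { status: 500 }, "timeout"],
  ["nested 500", { response: { status: 500 } }, "timeout"],
  ["nested 502", { response: { status: 502 } }, "timeout"],
  ["nested 504", { response: { status: 504 } }, "timeout"],
  ["529", { status: 529 }, "overloaded"],
  ["400 (assumed non-transient)", { status: 400 }, null],
  ["no status", new Error("boom"), null],
];

const failures = cases
  .filter(([, err, expected]) => classifyError(err) !== expected)
  .map(([label]) => label);

console.log(failures.length === 0 ? "all cases pass" : `failing: ${failures.join(", ")}`);
```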


