openclaw - ✅(Solved) Fix [Bug]: Server is busy, please retry later, freezes agents [1 pull requests, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#84236 · Fetched 2026-05-20 03:42:21
Comments: 2 · Participants: 3 · Timeline: 13 · Reactions: 1
Timeline (top): labeled ×9, commented ×2, cross-referenced ×1, referenced ×1

Since the official release of v2026.5.18, I have been seeing that when the primary provider returns an error, the agent does not automatically retry or even fail over. An example with a server-busy error is below.

7:02:18+00:00 warn agent/embedded {"subsystem":"agent/embedded"} {"event":"embedded_run_agent_end","tags":["error_handling","lifecycle","agent_end","assistant_error"],"runId":"48722696-143a-4a97-abb2-fbe99b100309","isError":true,"error":"The AI service is temporarily overloaded. Please try again in a moment.","failoverReason":"overloaded","model":"MiniMax-M2.7","provider":"minimax","rawErrorPreview":"{\"type\":\"error\",\"error\":{\"type\":\"overloaded_error\",\"message\":\"server is busy, please retry later\"}}","rawErrorHash":"sha256:a4575a5d6fd5","rawErrorFingerprint":"sha256:929b1addd15f","providerRuntimeFailureKind":"timeout","providerErrorType":"overloaded_error","providerErrorMessagePreview":"server is busy, please retry later"} embedded run agent end
17:02:41+00:00 info gateway/ws {"subsystem":"gateway/ws"} webchat disconnected code=1001 reason=n/a conn=99a23597-2661-4ccf-a295-9ef264fef5f8

Error Message

Provider error (from the agent/embedded log entry): "server is busy, please retry later" (providerErrorType: overloaded_error, failoverReason: overloaded), reported by the agent as "The AI service is temporarily overloaded. Please try again in a moment."

Error shown to the user in Telegram (not in openclaw control): "Something went wrong while processing your request. Please try again, or use /new to start a fresh session."

Root Cause

As described in PR #84265: when a single auth profile encounters a transient provider error (overloaded, rate_limit, server_error) and no fallback model is configured, the agent stops immediately instead of retrying on the same profile. The reporter first saw this after the official release of v2026.5.18.

Fix Action

Fixed

PR fix notes

PR #84265: fix(agents): retry transient provider errors when single profile has no fallback

Description (problem / solution / changelog)

Summary

When a single auth profile encounters transient provider errors (overloaded, rate_limit, server_error) and no fallback model is configured, the agent stops immediately instead of retrying on the same profile.

This PR adds retry logic for transient provider errors when:

  • Only one auth profile exists
  • No fallback model is configured
  • The error is transient (overloaded, rate_limit, server_error)

Non-transient errors (auth failures) and cases where a fallback is configured continue to use existing failover behavior.
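The three conditions above can be sketched as a small decision function. This is only an illustration of the rule as the summary states it, not the code from the PR: the name `isTransientFailoverReason` appears in the PR's test list, but its signature here is an assumption, and `FailoverReason`, `RetryContext`, `shouldRetrySameProfile`, and the `"auth"` reason label are hypothetical names chosen for the example.

```typescript
// Hypothetical types: the real OpenClaw definitions are not shown in this report.
type FailoverReason = "overloaded" | "rate_limit" | "server_error" | "auth";

interface RetryContext {
  reason: FailoverReason;
  authProfileCount: number; // how many auth profiles are configured
  hasFallbackModel: boolean; // whether a fallback model is configured
}

// The three reasons the PR summary lists as transient.
const TRANSIENT_REASONS: ReadonlySet<FailoverReason> = new Set<FailoverReason>([
  "overloaded",
  "rate_limit",
  "server_error",
]);

// Name taken from the PR's test list; the signature is an assumption.
function isTransientFailoverReason(reason: FailoverReason): boolean {
  return TRANSIENT_REASONS.has(reason);
}

// Hypothetical helper: retry on the same profile only when all three
// conditions from the summary hold at once.
function shouldRetrySameProfile(ctx: RetryContext): boolean {
  return (
    ctx.authProfileCount === 1 &&
    !ctx.hasFallbackModel &&
    isTransientFailoverReason(ctx.reason)
  );
}

// Single profile, no fallback, transient error: retry.
console.log(
  shouldRetrySameProfile({ reason: "overloaded", authProfileCount: 1, hasFallbackModel: false }),
); // true

// Auth failure is not transient: no same-profile retry.
console.log(
  shouldRetrySameProfile({ reason: "auth", authProfileCount: 1, hasFallbackModel: false }),
); // false

// A fallback is configured: the existing failover path handles it instead.
console.log(
  shouldRetrySameProfile({ reason: "overloaded", authProfileCount: 1, hasFallbackModel: true }),
); // false
```

Keeping the classification in a separate predicate means the list of transient reasons can be unit-tested on its own, which matches the PR's mention of dedicated isTransientFailoverReason tests.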

Changes

  • Add a helper to classify transient errors
  • Add a retry-on-same-profile path for the single-profile, transient-error case
  • Add 6 new test cases covering: overloaded retry, rate_limit retry, server_error retry, auth NOT retried, fallback configured NOT retried, and isTransientFailoverReason unit tests
  • Update CHANGELOG.md

Closes #84236


Real behavior proof

  • Behavior or issue addressed: Retry transient provider errors (overloaded, rate_limit, server_error) on the same auth profile when only one profile exists and no fallback model is configured, instead of stopping the agent immediately
  • Real environment tested: Linux RHEL 8 (kernel 4.19.112), Node v22.22.0, worktree /media/vdc/code/ai/openclaw-84236, HEAD 31d59196cb113c7f7699c53df0d36ea160bb9a0b, isolated HOME/OPENCLAW_HOME /media/vdc/code/ai/openclaw-84236/.automation/openclaw-home, Port 19842
  • OpenClaw startup command:
HOME=/media/vdc/code/ai/openclaw-84236/.automation/openclaw-home OPENCLAW_HOME=/media/vdc/code/ai/openclaw-84236/.automation/openclaw-home OPENCLAW_PORT=19842 node /media/vdc/code/ai/openclaw-84236/openclaw.mjs gateway --allow-unconfigured
  • Exact steps or command run after this patch:
  1. pnpm build — completed successfully
  2. pnpm test src/agents/pi-embedded-runner/run/failover-policy.test.ts src/agents/pi-embedded-runner/run/assistant-failover.test.ts — 50 tests passed
  3. pnpm test src/agents/pi-embedded-runner/run/ — 436 tests passed (25 test files, no regressions)
  4. Start isolated gateway with OPENCLAW_HOME and OPENCLAW_PORT — gateway ready in ~7s
  5. Gateway log file: /tmp/openclaw/openclaw-2026-05-20.log
  • Evidence after fix:
[1] command
$ pnpm test src/agents/pi-embedded-runner/run/failover-policy.test.ts src/agents/pi-embedded-runner/run/assistant-failover.test.ts
exit 0
Test Files  2 passed (2), Tests  50 passed (50)

[2] command
$ pnpm test src/agents/pi-embedded-runner/run/
exit 0
Test Files  25 passed (25), Tests  436 passed (436), Duration  82.31s

[3] command
$ pnpm build
exit 0
OK: All 4 required plugin-sdk exports verified.

[4] command
$ HOME=/media/vdc/code/ai/openclaw-84236/.automation/openclaw-home OPENCLAW_HOME=/media/vdc/code/ai/openclaw-84236/.automation/openclaw-home OPENCLAW_PORT=19842 openclaw gateway --allow-unconfigured
exit 0
[gateway] loading configuration…
[gateway] resolving authentication…
[gateway] starting...
[gateway] starting HTTP server...
[gateway] http server listening (8 plugins: acpx, browser, canvas, device-pair, file-transfer, memory-core, phone-control, talk-voice; 6.2s)
[gateway] log file: /tmp/openclaw/openclaw-2026-05-20.log
[gateway] ready

[5] log
path: /tmp/openclaw/openclaw-2026-05-20.log
gateway started successfully with isolated OPENCLAW_HOME, listening and ready
  • Observed result after fix: All 436 tests pass including 6 new tests covering: overloaded retry without rotation, rate_limit retry, server_error retry, auth NOT retried, fallback configured NOT retried, and isTransientFailoverReason unit tests. Build passes cleanly. Isolated gateway starts and reaches ready state from this worktree at HEAD ca8a3d9f.
  • What was not tested: Live provider returning overloaded errors in a real Telegram/webchat session. The fix targets the inner retry loop which requires a controllable mock provider returning overloaded_error on demand — the unit tests provide complete coverage of the decision logic.
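
The untested case above needs a provider that fails with overloaded_error on demand. A minimal sketch of such a mock follows, assuming the error payload shape shown in the issue's rawErrorPreview field. Every name in it (`createFlakyProvider`, `completeWithRetry`, `CompletionProvider`, `isOverloadedError`) is hypothetical and not part of OpenClaw, and the loop omits the backoff delay a real retry path would likely add.

```typescript
// Payload shape copied from the rawErrorPreview field in the issue log.
interface ProviderErrorPayload {
  type: "error";
  error: { type: string; message: string };
}

// Hypothetical provider interface, for illustration only.
interface CompletionProvider {
  complete(prompt: string): Promise<string>;
}

// Hypothetical mock: rejects with overloaded_error for the first `failures`
// calls, then succeeds on every later call.
function createFlakyProvider(failures: number): CompletionProvider & { callCount(): number } {
  let calls = 0;
  return {
    callCount: () => calls,
    async complete(prompt: string): Promise<string> {
      calls += 1;
      if (calls <= failures) {
        const payload: ProviderErrorPayload = {
          type: "error",
          error: { type: "overloaded_error", message: "server is busy, please retry later" },
        };
        throw Object.assign(new Error(payload.error.message), { payload });
      }
      return `echo: ${prompt}`;
    },
  };
}

// True when the thrown value carries an overloaded_error payload.
function isOverloadedError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const payload = (err as { payload?: { error?: { type?: unknown } } }).payload;
  return payload?.error?.type === "overloaded_error";
}

// Hypothetical retry loop: retries overloaded errors on the same provider up
// to `maxAttempts` total attempts and rethrows anything else immediately.
async function completeWithRetry(
  provider: CompletionProvider,
  prompt: string,
  maxAttempts: number,
): Promise<string> {
  if (maxAttempts < 1) throw new RangeError("maxAttempts must be at least 1");
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await provider.complete(prompt);
    } catch (err) {
      if (!isOverloadedError(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}

async function demo(): Promise<void> {
  const provider = createFlakyProvider(2);
  const reply = await completeWithRetry(provider, "ping", 3);
  console.log(reply, provider.callCount()); // echo: ping 3
}

demo().catch(console.error);
```

Because the mock counts its calls, a test can assert both the final reply and how many attempts were made, which is the behavior the unit tests can only cover at the decision-logic level.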

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/pi-embedded-runner/run.ts (modified, +44/-2)
  • src/agents/pi-embedded-runner/run/assistant-failover.test.ts (modified, +90/-0)
  • src/agents/pi-embedded-runner/run/assistant-failover.ts (modified, +21/-1)
  • src/agents/pi-embedded-runner/run/failover-policy.test.ts (modified, +36/-1)
  • src/agents/pi-embedded-runner/run/failover-policy.ts (modified, +4/-0)
  • src/agents/pi-embedded-runner/types.ts (modified, +1/-0)

Code Example

7:02:18+00:00 warn agent/embedded {"subsystem":"agent/embedded"} {"event":"embedded_run_agent_end","tags":["error_handling","lifecycle","agent_end","assistant_error"],"runId":"48722696-143a-4a97-abb2-fbe99b100309","isError":true,"error":"The AI service is temporarily overloaded. Please try again in a moment.","failoverReason":"overloaded","model":"MiniMax-M2.7","provider":"minimax","rawErrorPreview":"{\"type\":\"error\",\"error\":{\"type\":\"overloaded_error\",\"message\":\"server is busy, please retry later\"}}","rawErrorHash":"sha256:a4575a5d6fd5","rawErrorFingerprint":"sha256:929b1addd15f","providerRuntimeFailureKind":"timeout","providerErrorType":"overloaded_error","providerErrorMessagePreview":"server is busy, please retry later"} embedded run agent end
17:02:41+00:00 info gateway/ws {"subsystem":"gateway/ws"} webchat disconnected code=1001 reason=n/a conn=99a23597-2661-4ccf-a295-9ef264fef5f8
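
For triage, the fields that matter in an entry like the one above are failoverReason, provider, model, and providerErrorType. The sketch below pulls them out of a raw log line. It assumes the layout seen here (a text prefix, one or more JSON objects, then a trailing message) and that quotes inside rawErrorPreview are backslash-escaped as in this block. `extractJsonObjects` and `summarizeAgentEnd` are hypothetical helper names, not OpenClaw APIs, and the sample line is shortened from the one in the report.

```typescript
// Collect every top-level JSON object embedded in a log line by tracking
// brace depth while skipping over string contents and escape sequences.
function extractJsonObjects(line: string): unknown[] {
  const objects: unknown[] = [];
  let depth = 0;
  let start = -1;
  let inString = false;
  let escaped = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"' && depth > 0) {
      inString = true;
    } else if (ch === "{") {
      if (depth === 0) start = i;
      depth += 1;
    } else if (ch === "}" && depth > 0) {
      depth -= 1;
      if (depth === 0) {
        try {
          objects.push(JSON.parse(line.slice(start, i + 1)));
        } catch {
          // Not valid JSON (for example, a stray brace run); skip it.
        }
      }
    }
  }
  return objects;
}

interface AgentEndSummary {
  provider: string;
  model: string;
  failoverReason: string;
  providerErrorType: string;
}

// Return the key fields of the first embedded_run_agent_end event, if any.
function summarizeAgentEnd(line: string): AgentEndSummary | undefined {
  for (const obj of extractJsonObjects(line)) {
    if (typeof obj !== "object" || obj === null) continue;
    const rec = obj as Record<string, unknown>;
    if (rec["event"] !== "embedded_run_agent_end") continue;
    return {
      provider: String(rec["provider"]),
      model: String(rec["model"]),
      failoverReason: String(rec["failoverReason"]),
      providerErrorType: String(rec["providerErrorType"]),
    };
  }
  return undefined;
}

// Shortened copy of the log line from the report; String.raw keeps the
// backslash escapes inside rawErrorPreview intact.
const sample = String.raw`7:02:18+00:00 warn agent/embedded {"subsystem":"agent/embedded"} {"event":"embedded_run_agent_end","isError":true,"failoverReason":"overloaded","model":"MiniMax-M2.7","provider":"minimax","rawErrorPreview":"{\"type\":\"error\",\"error\":{\"type\":\"overloaded_error\",\"message\":\"server is busy, please retry later\"}}","providerErrorType":"overloaded_error"} embedded run agent end`;

console.log(summarizeAgentEnd(sample));
```

Lines without an embedded_run_agent_end event, such as the gateway/ws disconnect entry, yield undefined, so the helper can be mapped over a whole log file and the undefined results filtered out.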

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

Since the official release of v2026.5.18, I have been seeing that when the primary provider returns an error, the agent does not automatically retry or even fail over. An example with a server-busy error is below.

7:02:18+00:00 warn agent/embedded {"subsystem":"agent/embedded"} {"event":"embedded_run_agent_end","tags":["error_handling","lifecycle","agent_end","assistant_error"],"runId":"48722696-143a-4a97-abb2-fbe99b100309","isError":true,"error":"The AI service is temporarily overloaded. Please try again in a moment.","failoverReason":"overloaded","model":"MiniMax-M2.7","provider":"minimax","rawErrorPreview":"{\"type\":\"error\",\"error\":{\"type\":\"overloaded_error\",\"message\":\"server is busy, please retry later\"}}","rawErrorHash":"sha256:a4575a5d6fd5","rawErrorFingerprint":"sha256:929b1addd15f","providerRuntimeFailureKind":"timeout","providerErrorType":"overloaded_error","providerErrorMessagePreview":"server is busy, please retry later"} embedded run agent end
17:02:41+00:00 info gateway/ws {"subsystem":"gateway/ws"} webchat disconnected code=1001 reason=n/a conn=99a23597-2661-4ccf-a295-9ef264fef5f8

Steps to reproduce

If your primary LLM provider returns an error, the logs show that the agent does not retry or fail over.

Expected behavior

The agent should first retry the primary model; if that fails, it should then try the failover provider.

Actual behavior

The agent just stops, with no retry and no failover. The only error I get is in Telegram (not in openclaw control): "Something went wrong while processing your request. Please try again, or use /new to start a fresh session."

OpenClaw version

v2026.5.19-beta.1

Operating system

Ubuntu

Install method

Script and npm

Model

Minimax

Provider / routing chain

minimax

Additional provider/model setup details

No response

Logs, screenshots, and evidence

7:02:18+00:00 warn agent/embedded {"subsystem":"agent/embedded"} {"event":"embedded_run_agent_end","tags":["error_handling","lifecycle","agent_end","assistant_error"],"runId":"48722696-143a-4a97-abb2-fbe99b100309","isError":true,"error":"The AI service is temporarily overloaded. Please try again in a moment.","failoverReason":"overloaded","model":"MiniMax-M2.7","provider":"minimax","rawErrorPreview":"{\"type\":\"error\",\"error\":{\"type\":\"overloaded_error\",\"message\":\"server is busy, please retry later\"}}","rawErrorHash":"sha256:a4575a5d6fd5","rawErrorFingerprint":"sha256:929b1addd15f","providerRuntimeFailureKind":"timeout","providerErrorType":"overloaded_error","providerErrorMessagePreview":"server is busy, please retry later"} embedded run agent end
17:02:41+00:00 info gateway/ws {"subsystem":"gateway/ws"} webchat disconnected code=1001 reason=n/a conn=99a23597-2661-4ccf-a295-9ef264fef5f8

Impact and severity

The impact is that the agent stops working. I need to ask for status for it to continue.

Additional information

No response
