openclaw - Bug: Gateway falsely marks healthy local vLLM endpoints as timed out/overloaded, causing 1–23 min fallback cascades

GitHub stats: openclaw/openclaw#63229 (fetched 2026-04-09 07:56:37)
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0


Summary

The gateway's model-fallback/routing subsystem incorrectly marks healthy, responsive local vLLM endpoints as "timed out" or "overloaded", causing cascading fallback chains that take 1–23 minutes to resolve. The endpoints themselves respond in 0.27–0.93s when tested directly via curl.

Environment

  • OpenClaw: 2026.4.5 (container, Linux 6.12.63 x64)
  • Gateway: loopback bind, port 18789
  • Local vLLM endpoints:
    • vllm-8001 (gemma4, 27B) on jupiter.wg.local:8001 — dedicated GPU
    • vllm-7002 (qwen3.5-27b) on jupiter.wg.local:7002 — dedicated GPU
  • Remote providers: Novita (GLM-5, Kimi), DeepInfra (Kimi), Anthropic (Sonnet)
  • Config: agents.defaults.timeoutSeconds: 1200, agents.defaults.llm.idleTimeoutSeconds: 300

Observed Behaviour

1. Endpoints are fast (direct curl, concurrent)

Both GPUs idle, tested concurrently:

Endpoint               Avg latency (5 reqs)
vllm-8001/gemma4       0.28s
vllm-7002/qwen3.5-27b  0.91s
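
A minimal sketch of how such a concurrent latency check could be scripted instead of running curl by hand. The helper names (`time_call`, `average_latencies`, `http_sender`) are illustrative and not part of OpenClaw or vLLM, and the request sent to each endpoint is left to the caller, since the issue does not state which route or payload was used.

```python
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def time_call(send: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one call to send()."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def http_sender(url: str, timeout: float = 10.0) -> Callable[[], object]:
    """Build a sender that issues a plain GET; the URL is the caller's choice."""
    def send() -> bytes:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    return send


def average_latencies(
    senders: Dict[str, Callable[[], object]], requests_per_endpoint: int = 5
) -> Dict[str, float]:
    """Time all endpoints concurrently; return mean latency in seconds per name."""
    def run(send: Callable[[], object]) -> list:
        # Requests to one endpoint are sequential; endpoints run in parallel.
        return [time_call(send) for _ in range(requests_per_endpoint)]

    with ThreadPoolExecutor(max_workers=max(1, len(senders))) as pool:
        futures = {name: pool.submit(run, send) for name, send in senders.items()}
        return {name: statistics.mean(f.result()) for name, f in futures.items()}
```

To use it, pass one sender per endpoint, for example `average_latencies({"vllm-8001": http_sender(url_a), "vllm-7002": http_sender(url_b)})`, where `url_a` and `url_b` are whatever URLs were tested with curl.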

2. Gateway marks them as timed out or overloaded

From gateway logs, model fallback decisions for today:

Failure reasons:

  • timeout: 17 occurrences
  • unknown: 8
  • overloaded: 2

Error previews:

  • LLM request timed out.: 12
  • Gateway is draining for restart; new tasks are not accepted: 8
  • cron: job execution timed out: 4
  • Live session model switch requested: <model>: 2
  • Request was aborted.: 1
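
The counts above can be reproduced with a short tally script. This is a sketch under one loud assumption: that fallback decisions are logged one JSON object per line with `reason` and `errorPreview` fields. The issue only shows those two field names, not the actual log format, so the parsing would need adapting to the real logs.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def tally_fallback_failures(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count failure reasons and error previews in assumed JSON-lines logs."""
    reasons: Counter = Counter()
    previews: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict) or "reason" not in record:
            continue  # skip records that are not fallback decisions
        reasons[record["reason"]] += 1
        previews[record.get("errorPreview", "<none>")] += 1
    return reasons, previews
```

Calling `reasons.most_common()` and `previews.most_common()` on the result gives lists in the same shape as the two summaries above.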

3. Fallback chains take minutes

Example fallback chains from today's logs:

Run ID    Chain                                                  Total time
7c914aae  qwen→timeout → Kimi→timeout → gemma→timeout → Sonnet✓  23.4 min
21ca97c0  gemma→timeout → qwen✓                                  4.1 min
0cb06206  gemma→timeout → Kimi✓                                  56.6s
66f5e9e5  gemma→timeout → GLM-5✓                                 80.6s

4. Gateway can't even spawn subagents

Attempting sessions_spawn returns:

gateway timeout after 10000ms
Gateway target: ws://127.0.0.1:18789

Meanwhile, direct curl to the same endpoints returns in <1s.

5. "Overloaded" misclassification

The gateway logs show reason: "overloaded" with errorPreview: "Live session model switch requested: novita/zai-org/glm-4.7". A session model mismatch is being classified as a provider overload — the gateway is conflating an internal session state error with provider unavailability.
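
The distinction being asked for can be sketched as a classifier that gives session state errors their own category, so they never count as provider unavailability. Everything here is hypothetical: the class definitions are stand-ins (only the name `LiveSessionModelSwitchError` comes from the issue), and the gateway's real classification code is not shown in the report.

```python
from enum import Enum


class FailureKind(Enum):
    TIMEOUT = "timeout"
    OVERLOADED = "overloaded"
    SESSION_STATE = "session_state"  # hypothetical new category
    UNKNOWN = "unknown"


class LiveSessionModelSwitchError(Exception):
    """Stand-in for the gateway error of the same name (shape assumed)."""


class ProviderOverloadedError(Exception):
    """Hypothetical error for a provider that reports it is overloaded."""


def classify_failure(error: BaseException) -> FailureKind:
    """Map an exception to a failure kind, checking session errors first."""
    if isinstance(error, LiveSessionModelSwitchError):
        return FailureKind.SESSION_STATE
    if isinstance(error, TimeoutError):
        return FailureKind.TIMEOUT
    if isinstance(error, ProviderOverloadedError):
        return FailureKind.OVERLOADED
    return FailureKind.UNKNOWN


def should_fall_back(kind: FailureKind) -> bool:
    """Only provider-side failures justify trying the next provider."""
    return kind in (FailureKind.TIMEOUT, FailureKind.OVERLOADED)
```

With this split, a model switch request yields `SESSION_STATE` and `should_fall_back` returns `False`, so it would not start a fallback chain. Whether unknown failures should trigger fallback is a design choice this sketch leaves conservative.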

Expected Behaviour

  • Requests to healthy, sub-second local endpoints should not time out
  • Session model switch errors should not be classified as "overloaded"
  • Fallback chains should not take minutes when all providers are responsive
  • sessions_spawn should not time out when the gateway is under normal load

Root Cause Hypothesis

Two distinct bugs:

  1. Internal timeout too aggressive or misapplied: The gateway's LLM request timeout fires before the endpoint responds, or the timeout is applied to an internal queue wait rather than the actual HTTP request. Endpoints respond in <1s, yet the gateway recorded 17 timeout failures today, 12 of them with the preview "LLM request timed out.".

  2. LiveSessionModelSwitchError misclassified as "overloaded": When a cron job or isolated session requests a model different from the live session's current model, the gateway throws LiveSessionModelSwitchError and classifies this as reason: "overloaded" in the fallback system. This is semantically wrong and triggers unnecessary fallback cascading.
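
The first hypothesis can be illustrated with a small asyncio simulation. It is not the gateway's code: the semaphore stands in for whatever internal concurrency limit exists, `fake_request` stands in for a fast endpoint, and all durations are made-up illustrative values. The point is only to show how a timeout that also covers queue wait can fail requests to an endpoint that is itself fast.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fake_request(seconds: float) -> str:
    """Stand-in for a fast HTTP call to a healthy endpoint."""
    await asyncio.sleep(seconds)
    return "ok"


async def call_timeout_includes_queue(
    slots: asyncio.Semaphore, seconds: float, timeout: float
) -> str:
    """The timeout clock starts before the request has a slot."""
    async def queued() -> str:
        async with slots:
            return await fake_request(seconds)
    return await asyncio.wait_for(queued(), timeout)


async def call_timeout_request_only(
    slots: asyncio.Semaphore, seconds: float, timeout: float
) -> str:
    """The timeout clock starts only once the request is actually sent."""
    async with slots:
        return await asyncio.wait_for(fake_request(seconds), timeout)


async def run_batch(
    call: Callable[[asyncio.Semaphore, float, float], Awaitable[str]],
    jobs: int = 4,
    seconds: float = 0.05,
    timeout: float = 0.13,
) -> List[str]:
    """Run several jobs through a single slot and report each outcome."""
    slots = asyncio.Semaphore(1)
    results = await asyncio.gather(
        *(call(slots, seconds, timeout) for _ in range(jobs)),
        return_exceptions=True,
    )
    return [
        "timeout" if isinstance(r, asyncio.TimeoutError) else str(r)
        for r in results
    ]
```

Running `asyncio.run(run_batch(call_timeout_includes_queue))` reports timeouts for the jobs that waited longest in the queue, while `asyncio.run(run_batch(call_timeout_request_only))` completes every job, even though each simulated request takes the same 0.05s in both cases.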

Reproduction

  1. Configure two local vLLM providers with fast endpoints (<1s response)
  2. Configure 3+ agents with cron jobs using different model overrides
  3. Observe gateway logs: endpoints will be marked as "timed out" despite being healthy
  4. Run curl directly against the same endpoints to confirm sub-second response

Impact

  • Interactive sessions experience minutes-long delays for responses that should take seconds
  • Cron jobs time out and fail unnecessarily
  • Session continuity breaks (related to #63195)
  • Gateway becomes unresponsive to sessions_spawn and CLI commands
  • Users lose trust in the interface as a reliable work surface

Additional Context

  • Related: #63195 (sessions disappearing during normal use)
  • The LiveSessionModelSwitchError appears 17 times in today's logs (compared with only 2 matching entries in the fallback error previews above), suggesting this is the dominant failure mode
  • agents.defaults.maxConcurrent: 8 with 5 agents may amplify the issue but is not the root cause — the endpoints are idle when failures occur

Extent Analysis

TL;DR

Adjust the agents.defaults.timeoutSeconds configuration to a higher value to prevent premature timeouts, and correct the classification of LiveSessionModelSwitchError to prevent unnecessary fallback cascading.

Guidance

  • Increase agents.defaults.timeoutSeconds from its current value of 1200 (20 minutes) to at least 3000 (50 minutes) to allow more time before LLM requests time out, in case the current value is too aggressive.
  • Modify the gateway's error handling to correctly classify LiveSessionModelSwitchError as a session state error rather than an "overloaded" provider, preventing unnecessary fallback chains.
  • Review the cron job configuration to ensure that model overrides are properly handled and do not trigger unnecessary LiveSessionModelSwitchError instances.
  • Monitor the gateway logs to verify that the adjusted timeout and corrected error classification reduce the occurrence of "timed out" and "overloaded" errors.

Example

No explicit code example is provided, as the necessary changes are related to configuration adjustments and error handling modifications, which depend on the specific implementation details of the gateway and its components.

Notes

The provided guidance assumes that the issue is primarily caused by the aggressive timeout and incorrect error classification. However, other factors, such as the agents.defaults.maxConcurrent setting, might also contribute to the problem. Further investigation and testing may be necessary to fully resolve the issue.

Recommendation

Apply the workaround by adjusting the agents.defaults.timeoutSeconds value and correcting the error classification, as this approach addresses the identified root causes and can help mitigate the issue without requiring a full version upgrade or extensive code changes.
