openclaw - Bug: Gateway falsely marks healthy local vLLM endpoints as timed out/overloaded, causing 1–23 min fallback cascades

Error Message

The gateway logs reason: "overloaded" with errorPreview: "Live session model switch requested: novita/zai-org/glm-4.7", so a session model mismatch is being reported as a provider overload.

Summary

The gateway's model-fallback/routing subsystem incorrectly marks healthy, responsive local vLLM endpoints as "timed out" or "overloaded", causing cascading fallback chains that take 1–23 minutes to resolve. The endpoints themselves respond in 0.27–0.93s when tested directly via curl.

Environment

OpenClaw: 2026.4.5 (container, Linux 6.12.63 x64)
Gateway: loopback bind, port 18789
Local vLLM endpoints:
- vllm-8001 (gemma4, 27B) on jupiter.wg.local:8001 — dedicated GPU
- vllm-7002 (qwen3.5-27b) on jupiter.wg.local:7002 — dedicated GPU
Remote providers: Novita (GLM-5, Kimi), DeepInfra (Kimi), Anthropic (Sonnet)
Config: agents.defaults.timeoutSeconds: 1200, agents.defaults.llm.idleTimeoutSeconds: 300

Observed Behaviour

1. Endpoints are fast (direct curl, concurrent)

Both GPUs idle, tested concurrently:

Endpoint	Avg latency (5 reqs)
vllm-8001/gemma4	0.28s
vllm-7002/qwen3.5-27b	0.91s
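
The measurement above could be repeated without curl using a small script along these lines. It is only a sketch: it assumes the vLLM servers expose the OpenAI-compatible `/v1/models` route over plain HTTP, which the report does not state, and probing that route measures API responsiveness rather than generation latency. The names `http_get`, `average_latency`, and `probe_all` are illustrative.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Assumption: both servers answer GET /v1/models over plain HTTP.
ENDPOINTS = {
    "vllm-8001": "http://jupiter.wg.local:8001/v1/models",
    "vllm-7002": "http://jupiter.wg.local:7002/v1/models",
}

def http_get(url, timeout=10.0):
    """Fetch the URL and discard the body; raises on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()

def average_latency(url, fetch=http_get, n=5):
    """Return the mean wall-clock time in seconds of n sequential requests."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fetch(url)
        samples.append(time.perf_counter() - start)
    return mean(samples)

def probe_all(endpoints, fetch=http_get, n=5):
    """Probe every endpoint concurrently, one worker thread per endpoint."""
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        futures = {name: pool.submit(average_latency, url, fetch, n)
                   for name, url in endpoints.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

Calling `probe_all(ENDPOINTS)` from the gateway host returns a dict of endpoint name to average latency in seconds. Passing a different `fetch` function (for example one that posts a short completion request) would bring the probe closer to what the gateway actually sends.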

2. Gateway marks them as timed out or overloaded

From gateway logs, model fallback decisions for today:

Failure reasons:

timeout: 17 occurrences
unknown: 8
overloaded: 2

Error previews:

LLM request timed out.: 12
Gateway is draining for restart; new tasks are not accepted: 8
cron: job execution timed out: 4
Live session model switch requested: <model>: 2
Request was aborted.: 1
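
Counts like these can be reproduced with a short tally script. The sketch below assumes the gateway writes one JSON object per log line with `reason` and `errorPreview` fields; those field names appear in the report, but the line format is an assumption, and the sample lines are illustrative rather than real log output.

```python
import json
from collections import Counter

def tally_fallbacks(lines):
    """Count fallback decisions by reason and by error preview.

    Assumes one JSON object per line with "reason" and "errorPreview" keys;
    lines that are not JSON objects or lack a "reason" are skipped.
    """
    reasons, previews = Counter(), Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict) or "reason" not in entry:
            continue
        reasons[entry["reason"]] += 1
        previews[entry.get("errorPreview", "")] += 1
    return reasons, previews

# Illustrative input, not copied from the real gateway log.
sample = [
    '{"reason": "timeout", "errorPreview": "LLM request timed out."}',
    '{"reason": "timeout", "errorPreview": "cron: job execution timed out"}',
    '{"reason": "overloaded", "errorPreview": "Live session model switch requested: novita/zai-org/glm-4.7"}',
    'not a JSON line',
]
reasons, previews = tally_fallbacks(sample)
print(reasons.most_common())  # [('timeout', 2), ('overloaded', 1)]
```

Keeping the two counters separate makes mismatches visible, such as an "overloaded" reason paired with a session-switch preview.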

3. Fallback chains take minutes

Example fallback chains from today's logs:

Run ID	Chain	Total time
`7c914aae`	qwen→timeout → Kimi→timeout → gemma→timeout → Sonnet✓	23.4 min
`21ca97c0`	gemma→timeout → qwen✓	4.1 min
`0cb06206`	gemma→timeout → Kimi✓	56.6s
`66f5e9e5`	gemma→timeout → GLM-5✓	80.6s
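
As a rough check on these numbers, dividing each chain's total time by its number of timed-out hops gives an upper bound on the average duration of a failed attempt. The sketch below uses only the four chains listed above; the helper names are illustrative, and the bound assumes the final successful hop took no time.

```python
# Chains copied from the table above; totals converted to seconds
# (23.4 min = 1404 s, 4.1 min = 246 s).
CHAINS = {
    "7c914aae": ("qwen→timeout → Kimi→timeout → gemma→timeout → Sonnet✓", 1404.0),
    "21ca97c0": ("gemma→timeout → qwen✓", 246.0),
    "0cb06206": ("gemma→timeout → Kimi✓", 56.6),
    "66f5e9e5": ("gemma→timeout → GLM-5✓", 80.6),
}

def failed_hops(chain):
    """Return the models whose attempt ended in a timeout."""
    hops = [hop.strip() for hop in chain.split(" → ")]
    return [hop.split("→")[0] for hop in hops if hop.endswith("→timeout")]

def avg_failed_hop_bound(chain, total_seconds):
    """Upper bound on the mean seconds per failed hop (successful hop counted as 0 s)."""
    return total_seconds / len(failed_hops(chain))

for run_id, (chain, total) in CHAINS.items():
    print(run_id, failed_hops(chain), f"{avg_failed_hop_bound(chain, total):.1f}s")
```

For runs 0cb06206 and 66f5e9e5 the whole chain finished in 56.6s and 80.6s, which is shorter than both idleTimeoutSeconds (300) and timeoutSeconds (1200). That suggests the timeouts in those runs were not either configured limit running to completion.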

4. Gateway can't even spawn subagents

Attempting sessions_spawn returns:

gateway timeout after 10000ms
Gateway target: ws://127.0.0.1:18789

Meanwhile, direct curl to the same endpoints returns in <1s.

5. "Overloaded" misclassification

The gateway logs show reason: "overloaded" with errorPreview: "Live session model switch requested: novita/zai-org/glm-4.7". A session model mismatch is being classified as a provider overload — the gateway is conflating an internal session state error with provider unavailability.

Expected Behaviour

- Requests to healthy, sub-second local endpoints should not time out
- Session model switch errors should not be classified as "overloaded"
- Fallback chains should not take minutes when all providers are responsive
- sessions_spawn should not time out when the gateway is under normal load

Root Cause Hypothesis

Two distinct bugs:

1. Internal timeout too aggressive or misapplied: The gateway's LLM request timeout fires before the endpoint responds, or the timeout is applied to an internal queue wait rather than to the actual HTTP request. Endpoints respond in <1s, yet the gateway recorded 17 timeout failures today, 12 of them with the preview "LLM request timed out."
2. LiveSessionModelSwitchError misclassified as "overloaded": When a cron job or isolated session requests a model different from the live session's current model, the gateway throws LiveSessionModelSwitchError and classifies it as reason: "overloaded" in the fallback system. This is semantically wrong and triggers unnecessary fallback cascading.
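
The first hypothesis (a timeout that also covers queue wait) can be illustrated in isolation. The sketch below is not the gateway's code: `fast_endpoint`, the semaphore standing in for a concurrency limit, and the 50 ms timeout are all hypothetical. It only shows how a healthy endpoint can be reported as timed out when the timer starts before a slot is available.

```python
import asyncio

async def fast_endpoint():
    """Stand-in for a healthy endpoint that answers in 10 ms."""
    await asyncio.sleep(0.01)
    return "ok"

async def call_with_queue_in_timeout(slots, timeout):
    """Timeout covers waiting for a slot AND the request (the suspected bug)."""
    async def run():
        async with slots:
            return await fast_endpoint()
    return await asyncio.wait_for(run(), timeout)

async def call_with_request_only_timeout(slots, timeout):
    """Timeout covers only the request; queue wait is not charged to the endpoint."""
    async with slots:
        return await asyncio.wait_for(fast_endpoint(), timeout)

async def demo(call):
    slots = asyncio.Semaphore(1)
    await slots.acquire()            # something else holds the only slot
    task = asyncio.create_task(call(slots, timeout=0.05))
    await asyncio.sleep(0.1)         # hold the slot longer than the timeout
    slots.release()
    try:
        return await task
    except asyncio.TimeoutError:
        return "timeout"

print(asyncio.run(demo(call_with_queue_in_timeout)))      # timeout
print(asyncio.run(demo(call_with_request_only_timeout)))  # ok
```

In both cases the endpoint itself answers in 10 ms; only the placement of the timer differs. If the gateway behaves like the first variant, the failure belongs to the queue rather than to the provider.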

Reproduction

1. Configure two local vLLM providers with fast endpoints (<1s response)
2. Configure 3+ agents with cron jobs using different model overrides
3. Observe gateway logs: endpoints will be marked as "timed out" despite being healthy
4. Run curl directly against the same endpoints to confirm sub-second response

Impact

- Interactive sessions experience minutes-long delays for responses that should take seconds
- Cron jobs time out and fail unnecessarily
- Session continuity breaks (related to #63195)
- Gateway becomes unresponsive to sessions_spawn and CLI commands
- Users lose trust in the interface as a reliable work surface

Additional Context

- Related: #63195 (sessions disappearing during normal use)
- LiveSessionModelSwitchError appears 17 times in today's logs, although only 2 fallback decisions carry its error preview, suggesting it may be the dominant failure mode
- agents.defaults.maxConcurrent: 8 with 5 agents may amplify the issue but is not the root cause, because the endpoints are idle when failures occur

extent analysis

TL;DR

Adjust the agents.defaults.timeoutSeconds configuration to a higher value to prevent premature timeouts, and correct the classification of LiveSessionModelSwitchError to prevent unnecessary fallback cascading.

Guidance

- Consider increasing agents.defaults.timeoutSeconds, for example to 3000 (50 minutes), in case the current value of 1200 (20 minutes) is too aggressive. Because the endpoints respond in under a second, this alone is unlikely to remove the timeouts.
- Modify the gateway's error handling to classify LiveSessionModelSwitchError as a session state error rather than as an "overloaded" provider, preventing unnecessary fallback chains.
- Review the cron job configuration to ensure that model overrides are handled properly and do not trigger unnecessary LiveSessionModelSwitchError instances.
- Monitor the gateway logs to verify that the adjusted timeout and corrected error classification reduce the occurrence of "timed out" and "overloaded" errors.

Example

No explicit code example is provided, as the necessary changes are related to configuration adjustments and error handling modifications, which depend on the specific implementation details of the gateway and its components.
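
That said, the reclassification itself can be illustrated independently of the gateway's internals. The following is a hypothetical sketch only: the `FailureReason` enum, the `session_state` category, `ProviderOverloadedError`, and the rule that only provider-side failures trigger fallback are assumptions for illustration. `LiveSessionModelSwitchError` here is a local stand-in for the gateway error of the same name.

```python
from enum import Enum

class FailureReason(Enum):
    TIMEOUT = "timeout"
    OVERLOADED = "overloaded"
    SESSION_STATE = "session_state"  # hypothetical new category
    UNKNOWN = "unknown"

class LiveSessionModelSwitchError(Exception):
    """Local stand-in for the gateway's error of the same name."""

class ProviderOverloadedError(Exception):
    """Hypothetical error for a provider that reports it is overloaded."""

def classify(error):
    """Map an exception to a failure reason, checking session errors first."""
    if isinstance(error, LiveSessionModelSwitchError):
        return FailureReason.SESSION_STATE
    if isinstance(error, ProviderOverloadedError):
        return FailureReason.OVERLOADED
    if isinstance(error, TimeoutError):
        return FailureReason.TIMEOUT
    return FailureReason.UNKNOWN

# Design choice in this sketch: only provider-side failures advance the chain.
FALLBACK_REASONS = {FailureReason.TIMEOUT, FailureReason.OVERLOADED}

def should_fall_back(error):
    """Return True when the error justifies trying the next provider."""
    return classify(error) in FALLBACK_REASONS

err = LiveSessionModelSwitchError(
    "Live session model switch requested: novita/zai-org/glm-4.7")
print(classify(err).value, should_fall_back(err))  # session_state False
```

With a separation like this, a model switch request would be handled by the session layer and would not mark a healthy provider as unavailable.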

Notes

The provided guidance assumes that the issue is primarily caused by the aggressive timeout and incorrect error classification. However, other factors, such as the agents.defaults.maxConcurrent setting, might also contribute to the problem. Further investigation and testing may be necessary to fully resolve the issue.

Recommendation

Apply the workaround by adjusting the agents.defaults.timeoutSeconds value and correcting the error classification, as this approach addresses the identified root causes and can help mitigate the issue without requiring a full version upgrade or extensive code changes.
