openclaw - ✅(Solved) Fix [Bug]: Agent timeout does not surface error to UI, UI hangs indefinitely [2 pull requests, 2 comments, 2 participants]

GitHub stats
openclaw/openclaw#64793 (fetched 2026-04-12 13:26:47)
Comments: 2 · Participants: 2 · Timeline events: 8 · Reactions: 0
Timeline (top): commented ×2, cross-referenced ×2, labeled ×2, closed ×1

When an LLM request times out during agent execution, the agent correctly logs decision=surface_error reason=timeout but the Web UI hangs indefinitely showing a loading spinner instead of displaying a timeout error.

Error Message

Gateway logs showing the timeout and connection abort (translated from Chinese):
[LLM-Gateway][INFO] Streaming attempt 1/4
[LLM-Gateway][INFO] Streaming response status code: 200
[LLM-Gateway][ERROR] ⚠️ Streaming request exception RemoteProtocolError (attempt 1/4) - Server disconnected without sending a response.
[LLM-Gateway][INFO] Waiting 1s before retrying...
...
[LLM-Gateway][INFO] Streaming attempt 2/4
[LLM-Gateway][INFO] Streaming response status code: 200
...
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine.

Agent logs showing timeout detection:
[agent] embedded run failover decision: runId=... stage=assistant decision=surface_error reason=timeout provider=sweetmido/minimax-m2.5

Root Cause

  • When the agent aborts a run because of a timeout, the final event carrying the error status never reaches the Web UI, possibly because the WebSocket connection is terminated before the event can be sent.
  • Workaround: refresh the page, although this loses the conversation context.
  • The custom gateway (ai_router.py) functions correctly and retries on network errors, but it cannot compensate for the agent aborting the connection.
  • Suggestion: ensure that a final event with status: "timeout" is always sent to the Web UI before or during connection teardown when a timeout occurs.
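
The suggestion in the last bullet comes down to ordering: emit the terminal event first, tear the connection down second. A minimal TypeScript sketch of that ordering follows. The `Connection` interface, the `FinalEvent` shape, and `closeRunWithTimeout` are hypothetical names chosen for illustration; they are not taken from the OpenClaw codebase, and the real event schema may differ.

```typescript
// Hypothetical event shape (assumption); the real OpenClaw schema may differ.
interface FinalEvent {
  runId: string;
  state: "final";
  status: "ok" | "timeout" | "error";
  message?: string;
}

// Minimal stand-in for a WebSocket-like connection (assumption).
interface Connection {
  isOpen(): boolean;
  send(payload: string): void;
  close(): void;
}

// Send the terminal event before tearing the connection down.
// Returns true if the event was handed to the transport.
function closeRunWithTimeout(conn: Connection, runId: string): boolean {
  let delivered = false;
  try {
    if (conn.isOpen()) {
      const event: FinalEvent = {
        runId,
        state: "final",
        status: "timeout",
        message: "LLM request timed out",
      };
      conn.send(JSON.stringify(event));
      delivered = true;
    }
  } finally {
    // Teardown always happens last, even if send() throws.
    conn.close();
  }
  return delivered;
}
```

The return value lets the caller log or retry through another channel when the connection was already closed by the time the timeout fired.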

Fix Action

PR fix notes

PR #64809: fix(webchat): map clientRunId to real runId on agent run start so timeout events reach UI

Description (problem / solution / changelog)

Closes #64793

Changed files

  • extensions/memory-core/src/dreaming-narrative.ts (modified, +2/-0)
  • src/gateway/server-methods/chat.directive-tags.test.ts (modified, +31/-0)
  • src/gateway/server-methods/chat.ts (modified, +3/-0)
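
The PR description is empty, so only the title indicates the approach: associate the client-side run id with the real run id once the agent run starts. The following TypeScript sketch shows one way such a mapping could work. `RunIdRegistry` and its methods are hypothetical, and the lookup direction (real runId back to clientRunId) is an assumption rather than something the PR confirms.

```typescript
// Hypothetical registry; names are illustrative and not taken from chat.ts.
class RunIdRegistry {
  // Keyed by the real runId assigned by the agent.
  private readonly realToClient = new Map<string, string>();

  // Called when the agent run starts and the real runId becomes known.
  register(clientRunId: string, runId: string): void {
    this.realToClient.set(runId, clientRunId);
  }

  // Resolve the id the UI is listening on; fall back to the id given
  // when no mapping was registered.
  resolveClientRunId(runId: string): string {
    return this.realToClient.get(runId) ?? runId;
  }

  // Drop the mapping once the run has emitted its terminal event.
  release(runId: string): void {
    this.realToClient.delete(runId);
  }
}
```

With a mapping like this, a timeout event tagged with the agent's runId can be re-addressed to the id the Web UI subscribed with instead of being dropped as unknown.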

PR #64817: fix: surface_error failover now throws FailoverError to prevent UI hang

Description (problem / solution / changelog)

Summary

Fixes #64793

When an LLM request times out and the failover policy decides to surface_error, the handleAssistantFailover function previously fell through to return { action: "continue_normal" }, silently swallowing the error. This caused the WebSocket connection to abort before any final or error event could be broadcast to the UI, leaving the client spinner hanging indefinitely.

  • assistant-failover.ts: surface_error decisions now return { action: "throw" } with a properly constructed FailoverError, ensuring the error propagates through the promise chain to broadcastChatError
  • chat.ts: Added a terminalEventSent safety net in the .finally() handler — if neither .then() nor .catch() managed to broadcast a terminal event (e.g. due to a connection abort race), a fallback error event is emitted
  • Tests: Added 9 regression tests covering timeout, billing, rate-limit, auth, generic failure, idle-timeout retry, and continue_normal preservation scenarios; extended failover-policy.test.ts with 2 timeout surface_error policy assertions
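
The two code changes above can be sketched roughly as follows. This is not the PR's code: `toFailoverAction`, `createTerminalGuard`, `runChat`, and the `TerminalEvent` shape are hypothetical, and only `FailoverError`, `terminalEventSent`, and the `{ action: "throw" }` / `{ action: "continue_normal" }` results are names mentioned in the PR summary.

```typescript
class FailoverError extends Error {
  reason: string;
  constructor(message: string, reason: string) {
    super(message);
    this.name = "FailoverError";
    this.reason = reason;
  }
}

type FailoverAction =
  | { action: "continue_normal" }
  | { action: "throw"; error: FailoverError };

// surface_error now yields a throw action instead of falling through.
function toFailoverAction(decision: string, reason: string): FailoverAction {
  if (decision === "surface_error") {
    return { action: "throw", error: new FailoverError(`LLM request failed: ${reason}`, reason) };
  }
  return { action: "continue_normal" };
}

// Hypothetical terminal event shape (assumption).
type TerminalEvent = { state: "final" } | { state: "error"; message: string };

function createTerminalGuard(broadcast: (e: TerminalEvent) => void) {
  let terminalEventSent = false;
  return {
    // Broadcast a terminal event at most once.
    send(e: TerminalEvent): void {
      if (terminalEventSent) return;
      broadcast(e);
      terminalEventSent = true; // only set once broadcast() returned
    },
    // Safety net: emit a fallback error if nothing terminal went out.
    ensureSent(): void {
      this.send({ state: "error", message: "Run ended without a terminal event" });
    },
  };
}

function runChat(
  run: () => Promise<FailoverAction>,
  broadcast: (e: TerminalEvent) => void,
): Promise<void> {
  const guard = createTerminalGuard(broadcast);
  return run()
    .then((result) => {
      if (result.action === "throw") throw result.error;
      guard.send({ state: "final" });
    })
    .catch((err: unknown) => {
      guard.send({ state: "error", message: err instanceof Error ? err.message : String(err) });
    })
    .finally(() => guard.ensureSent());
}
```

Setting the flag only after `broadcast` returns means a broadcast that throws in `.then()` or `.catch()` still leaves the `.finally()` fallback armed, which is the race the PR summary describes.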

Test plan

  • assistant-failover.surface-error-throws.test.ts — 9 tests pass
  • failover-policy.test.ts — all assertions pass with new timeout cases
  • Manual: deploy with a slow LLM provider (e.g. minimax-m2.5 via NVIDIA API), trigger timeout, verify UI displays error instead of hanging
  • Verify existing retry/fallback behavior is preserved (idle timeout retry, profile rotation, model fallback)

🤖 Generated with Claude Code

Changed files

  • src/agents/pi-embedded-runner/run/assistant-failover.surface-error-throws.test.ts (added, +185/-0)
  • src/agents/pi-embedded-runner/run/assistant-failover.ts (modified, +38/-0)
  • src/agents/pi-embedded-runner/run/failover-policy.test.ts (modified, +34/-0)
  • src/gateway/server-methods/chat.ts (modified, +25/-0)

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

When an LLM request times out during agent execution, the agent correctly logs decision=surface_error reason=timeout but the Web UI hangs indefinitely showing a loading spinner instead of displaying a timeout error.

Steps to reproduce

  1. Deploy OpenClaw with Docker (or any deployment method)
  2. Configure a slow LLM provider (e.g., minimax-m2.5 via NVIDIA API) that occasionally takes >60 seconds to respond
  3. Send a message through the Web UI
  4. Observe that when the LLM response exceeds the agent timeout threshold, the connection is aborted (ConnectionAbortedError in gateway logs)
  5. Observe that the Web UI continues to show a loading spinner indefinitely, never displaying an error message or recovering

Expected behavior

When an agent run times out, the agent should return a status: "timeout" via agent.wait and the Web UI should receive a final event with error information, then display an appropriate timeout error message to the user and allow retry.
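
On the UI side, the expected behavior amounts to a state transition from the loading state to a retryable error state when the final event arrives. A minimal TypeScript sketch, assuming hypothetical `ChatUiState` and `FinalEvent` shapes (neither is taken from the OpenClaw Web UI):

```typescript
// Hypothetical UI state and event shapes (assumptions).
type ChatUiState =
  | { phase: "idle" }
  | { phase: "loading"; runId: string }
  | { phase: "failed"; runId: string; message: string; canRetry: boolean };

interface FinalEvent {
  runId: string;
  status: "ok" | "timeout" | "error";
  message?: string;
}

// Pure reducer: given the current state and a final event, return the next state.
function applyFinalEvent(state: ChatUiState, event: FinalEvent): ChatUiState {
  // Ignore events for runs the UI is not currently waiting on.
  if (state.phase !== "loading" || state.runId !== event.runId) return state;
  if (event.status === "ok") return { phase: "idle" };
  const message =
    event.status === "timeout"
      ? "The request timed out. Please try again."
      : (event.message ?? "The request failed.");
  // Leaving the loading phase is what stops the spinner; canRetry enables the retry action.
  return { phase: "failed", runId: event.runId, message, canRetry: true };
}
```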

Actual behavior

  • Gateway logs show ConnectionAbortedError: [WinError 10053] indicating the client (OpenClaw) aborted the connection
  • Agent logs show decision=surface_error reason=timeout
  • The Web UI spinner continues indefinitely
  • No error message is displayed
  • The user cannot retry without refreshing the page

OpenClaw version

2026.4.9

Operating system

Linux (Docker deployment: Ubuntu-based container, host OS Windows 11)

Install method

docker

Model

minimax-m2.5 (via NVIDIA API)

Provider / routing chain

openclaw -> ai_router.py (custom gateway) -> NVIDIA API -> minimax-m2.5

Additional provider/model setup details

Custom gateway implementation (ai_router.py) handles streaming requests to NVIDIA API with retry logic. The gateway correctly streams responses and handles timeouts, but the agent's timeout mechanism cuts the connection before the gateway can return a proper error response to the UI.

Logs, screenshots, and evidence

Gateway logs showing the timeout and connection abort (translated from Chinese):
[LLM-Gateway][INFO] Streaming attempt 1/4
[LLM-Gateway][INFO] Streaming response status code: 200
[LLM-Gateway][ERROR] ⚠️ Streaming request exception RemoteProtocolError (attempt 1/4) - Server disconnected without sending a response.
[LLM-Gateway][INFO] Waiting 1s before retrying...
...
[LLM-Gateway][INFO] Streaming attempt 2/4
[LLM-Gateway][INFO] Streaming response status code: 200
...
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine.

========================================================================

Agent logs showing timeout detection:
[agent] embedded run failover decision: runId=... stage=assistant decision=surface_error reason=timeout provider=sweetmido/minimax-m2.5

Impact and severity

  • Affected users/systems/channels: All Web UI users when using slow LLM providers
  • Severity: High (blocks workflow, makes UI unusable until page refresh)
  • Frequency: Intermittent (occurs when LLM response exceeds agent timeout threshold)
  • Consequence: Users see indefinite loading spinner, cannot retry, must refresh page losing conversation context

Additional information

  • The issue appears to be that when the agent aborts a run due to timeout, the final event with error status never reaches the Web UI, possibly because the WebSocket connection is terminated before the event can be sent.
  • Workaround: Refresh the page, though this loses conversation context.
  • The custom gateway (ai_router.py) functions correctly and retries on network errors, but cannot compensate for the agent aborting the connection.
  • Suggest: Ensure final event with status: "timeout" is always sent to the Web UI before or during connection teardown when a timeout occurs.

extent analysis

TL;DR

The Web UI hangs indefinitely when an LLM request times out during agent execution, likely due to the agent aborting the connection before sending a final event with error status.

Guidance

  • Verify that the final event with status: "timeout" is being generated by the agent when a timeout occurs, but not being sent to the Web UI due to the aborted connection.
  • Modify the agent to ensure the final event is sent to the Web UI before or during connection teardown when a timeout occurs, allowing the UI to display a timeout error message and recover.
  • Investigate the custom gateway (ai_router.py) to see if it can be modified to handle the agent's timeout mechanism and send a proper error response to the UI.
  • Consider increasing the agent timeout threshold or implementing a retry mechanism in the Web UI to mitigate the issue.
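
As a defensive measure for the last point, the Web UI could run a client-side watchdog so that a missing terminal event cannot keep the spinner alive forever. A minimal TypeScript sketch follows. `PendingRun`, `checkWatchdog`, and the threshold parameters are hypothetical and not part of OpenClaw; suitable limits would depend on the provider's latency.

```typescript
// Hypothetical bookkeeping for a run the UI is waiting on (assumption).
interface PendingRun {
  runId: string;
  lastActivityAt: number; // ms timestamp of the last event received
  attempts: number;       // how many times the message has been sent so far
}

type WatchdogVerdict = "keep_waiting" | "retry" | "give_up";

// Decide what the UI should do when a timer tick fires.
function checkWatchdog(
  run: PendingRun,
  now: number,
  stallLimitMs: number,
  maxAttempts: number,
): WatchdogVerdict {
  // Still within the allowed silence window: leave the spinner alone.
  if (now - run.lastActivityAt < stallLimitMs) return "keep_waiting";
  // Stalled: retry while attempts remain, otherwise surface an error.
  return run.attempts < maxAttempts ? "retry" : "give_up";
}
```

The UI would call this from a periodic timer and refresh `lastActivityAt` on every streamed event, so only runs that have gone silent are affected. This mitigates the symptom only; the server still needs to emit the terminal event.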

Example

No code snippet is provided as the issue is more related to the interaction between the agent, gateway, and Web UI, and the exact implementation details are not specified.

Notes

The issue appears to be specific to the interaction between the agent, custom gateway, and Web UI, and may require modifications to the agent and/or gateway to resolve. The workaround of refreshing the page is not ideal as it loses conversation context.

Recommendation

Apply a workaround by modifying the agent to send the final event with status: "timeout" before or during connection teardown, allowing the Web UI to display a timeout error message and recover. This should mitigate the issue until a more permanent fix can be implemented.
