openclaw - 💡(How to fix) Fix [Bug]: Multiple Agents Stalled - Session Recovery Not Triggered on LLM Timeout [1 comments, 2 participants]

Q: Expected behavior

1. Retry the primary, then fail over a few more times, and notify the user that there is an issue.
2. The user needs to understand what is happening, not just wait for a response that never comes.

I did not experience the same thing with version 2026.4.23; it seems older versions handle primary and secondary failures differently.

openclaw · 2026-05-05 17:19:34

GitHub stats

openclaw/openclaw#77964 • Fetched 2026-05-06 06:18:41

Author

najef1979-code

Participants

clawsweeper[bot]

najef1979-code

Timeline (top)

labeled ×2, closed ×1, commented ×1

Stuck session: sessionId=d939c79e... reason=queued_work_without_active_run classification=stale_session_state recovery=checking

stuck session recovery skipped: reason=active_reply_work action=keep_lane sessionId=d939c79e... activeSessionId=d939c79e...

long-running session: classification=long_running recovery=none

stalled session: classification=stalled_agent_run recovery=none


**Key points:**
1. Gateway detected stuck session (`recovery=checking`)
2. Recovery was **SKIPPED** because `reason=active_reply_work` - kept the lane
3. Session later marked `stalled_agent_run` with `recovery=none`

### 3. Why Recovery=None?

The logs show the progression:
- `recovery=checking` - Gateway identifies a potential issue
- `recovery=skipped` - Can't recover because `active_reply_work` is in progress
- `recovery=none` - No recovery path implemented for this failure mode
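
The progression above can be pulled out of the log lines mechanically. The following is a minimal sketch, assuming the gateway log uses space-separated `key=value` fields exactly as in the excerpt quoted in this report; the helper names `parse_fields` and `recovery_trace` are hypothetical and are not part of OpenClaw.

```python
import re

# Log lines copied from this report (session IDs are truncated in the source).
LOG_LINES = [
    "Stuck session: sessionId=d939c79e... reason=queued_work_without_active_run "
    "classification=stale_session_state recovery=checking",
    "stuck session recovery skipped: reason=active_reply_work action=keep_lane "
    "sessionId=d939c79e... activeSessionId=d939c79e...",
    "long-running session: classification=long_running recovery=none",
    "stalled session: classification=stalled_agent_run recovery=none",
]

# Assumption: fields are written as key=value with no spaces inside the value.
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_fields(line: str) -> dict[str, str]:
    """Return the key=value pairs found in one log line."""
    return dict(FIELD_RE.findall(line))


def recovery_trace(lines: list[str]) -> list[str]:
    """Reduce each line to its recovery state.

    The "recovery skipped" message carries no recovery= field,
    so it is detected from the message text instead.
    """
    trace = []
    for line in lines:
        fields = parse_fields(line)
        if "recovery skipped" in line:
            trace.append("skipped")
        elif "recovery" in fields:
            trace.append(fields["recovery"])
    return trace


if __name__ == "__main__":
    print(recovery_trace(LOG_LINES))
    print(parse_fields(LOG_LINES[1])["reason"])
```

Run against a filtered gateway log, a trace that ends in `none` after a `skipped` entry would match the pattern described in this report.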

---

## Diagnosis

**What caused the HTTP 408?**
- HTTP 408 = "Request Timeout" - MiniMax took too long to respond
- This means the API **WAS reached** but MiniMax didn't respond within their timeout window
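- The 408 is MiniMax telling OpenClaw to abort the request because it timed out (strictly, HTTP defines 408 as the server timing out while waiting for the complete request, so the delay is not necessarily on MiniMax's side)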

I downgraded to version 2026.4.23 and the behavior was different: after a few hours of work I had 2 "similar" API errors, but OpenClaw retried them automatically and was never stuck or frozen. I did not see the `recovery=none` status message.

### Steps to reproduce

Both primary and secondary LLM providers need to fail.

### Expected behavior

1. Retry the primary, then fail over a few more times, and notify the user there is an issue. 
2. User needs to understand what is happening and not just wait for a response that never comes. 

I did not experience the same thing with version 2026.4.23; it seems older versions handle primary and secondary failures differently.
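
The expected behavior can be sketched as a small retry-then-failover loop. This is an illustration under stated assumptions, not OpenClaw's implementation: `call_with_failover`, `AllProvidersFailed`, the `notify` callback, and the retry count are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has exhausted its retries."""


def call_with_failover(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
    retries_per_provider: int = 2,
    notify: Callable[[str], None] = print,
) -> str:
    """Try each provider in order, retrying it before moving to the next.

    Every failure is reported through notify, so the user is never left
    waiting without feedback. If the whole chain fails, a final message
    is sent and AllProvidersFailed is raised.
    """
    errors = []
    for name, call in providers:
        for attempt in range(1, retries_per_provider + 1):
            try:
                return call(prompt)
            except Exception as exc:  # broad on purpose: any failure triggers retry
                errors.append(f"{name} attempt {attempt}: {exc}")
                notify(f"{name} failed (attempt {attempt}/{retries_per_provider}): {exc}")
    notify("All providers failed; giving up instead of waiting silently.")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def primary(prompt: str) -> str:
        # Stand-in for a provider that times out.
        raise TimeoutError("HTTP 408")

    def secondary(prompt: str) -> str:
        # Stand-in for a healthy fallback provider.
        return f"reply to: {prompt}"

    print(call_with_failover([("primary", primary), ("secondary", secondary)], "status?"))
```

The point of the design is that the terminal case is explicit: when both providers fail, the caller gets an exception and the user gets a message, rather than a session that stays queued.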

### Actual behavior

1. Retry the primary, then fail over a few more times, and notify the user there is an issue. 
2. User needs to understand what is happening and not just wait for a response that never comes. 

### OpenClaw version

2026.5.4

### Operating system

Ubuntu

### Install method

Script

### Model

Minimax m2.7

### Provider / routing chain

Minimax m2.7

### Additional provider/model setup details

_No response_

### Logs, screenshots, and evidence

Bug type

Crash (process/app exits or hangs)

Beta release blocker

Impact and severity

The agent stops mid-work without a clear error. In other cases I can ask for status and it will continue, but when both the primary and secondary LLM fail, I need to restart the gateway for the agent to start working again.
Version 2026.4.23 and earlier had a more graceful way of dealing with an agent freezing like this and did not require a gateway restart.
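
One way to avoid a silent freeze is to put an overall deadline around the agent run and tell the user when it expires. The sketch below uses Python's standard `asyncio.wait_for`; the function name `run_with_deadline`, the `notify` callback, and the timeout values are assumptions made for illustration and do not reflect how the OpenClaw gateway is built.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_with_deadline(
    run_factory: Callable[[], Awaitable[str]],
    deadline_s: float,
    notify: Callable[[str], None] = print,
) -> Optional[str]:
    """Await one agent run, but never longer than deadline_s seconds.

    On expiry the run is cancelled by asyncio.wait_for, the user is told
    what happened, and None is returned so the caller can free the lane.
    """
    try:
        return await asyncio.wait_for(run_factory(), timeout=deadline_s)
    except asyncio.TimeoutError:
        notify(f"No reply within {deadline_s}s; the run was cancelled.")
        return None


if __name__ == "__main__":
    async def stalled_run() -> str:
        # Stand-in for an LLM call that never comes back in time.
        await asyncio.sleep(1)
        return "late reply"

    result = asyncio.run(run_with_deadline(stalled_run, deadline_s=0.1))
    print(result)
```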

Additional information

system-info.txt version.txt plugins.json

gateway-status.txt

logs-filtered.txt

extent analysis

TL;DR

Downgrade to OpenClaw version 2026.4.23 to potentially resolve the issue with stuck sessions and recovery failures.

Guidance

The issue seems to be related to the handling of primary and secondary LLM provider failures in OpenClaw version 2026.5.4, which can cause the agent to freeze and require a gateway restart.
The logs indicate that the recovery process is skipped due to active_reply_work, leading to a recovery=none status, which may be a contributing factor to the issue.
Downgrading to version 2026.4.23, as mentioned in the issue, may provide a more graceful handling of agent freezing and failure scenarios.
Reviewing the provided logs and system information files (e.g., logs-filtered.txt, system-info.txt) may help identify additional factors contributing to the issue.

Notes

The exact cause of the issue is unclear, and downgrading may not be a permanent solution. Further investigation into the differences between OpenClaw versions 2026.4.23 and 2026.5.4 may be necessary to resolve the issue.

Recommendation

Apply the workaround by downgrading to OpenClaw version 2026.4.23, which the reporter found to handle primary and secondary LLM provider failures more gracefully.
