openclaw - 💡(How to fix) Fix [Bug]: Multiple Agents Stalled - Session Recovery Not Triggered on LLM Timeout [1 comments, 2 participants]

GitHub stats
openclaw/openclaw#77964 · Fetched 2026-05-06 06:18:41
Comments: 1 · Participants: 2 · Timeline: 4 · Reactions: 2
Timeline (top): labeled ×2, closed ×1, commented ×1

Stuck session: sessionId=d939c79e... reason=queued_work_without_active_run classification=stale_session_state recovery=checking

stuck session recovery skipped: reason=active_reply_work action=keep_lane sessionId=d939c79e... activeSessionId=d939c79e...

long-running session: classification=long_running recovery=none

stalled session: classification=stalled_agent_run recovery=none


**Key points:**
1. Gateway detected stuck session (`recovery=checking`)
2. Recovery was **SKIPPED** because `reason=active_reply_work` - kept the lane
3. Session later marked `stalled_agent_run` with `recovery=none`

Error Message

The agent stops mid-work without a clear error. In other cases I can ask for status and it will continue, but when both the primary and secondary LLM fail, I need to restart the gateway for the agent to start working again.

Root Cause

Key points:

  1. Gateway detected stuck session (recovery=checking)
  2. Recovery was SKIPPED because reason=active_reply_work - kept the lane
  3. Session later marked stalled_agent_run with recovery=none

### 3. Why Recovery=None?

The logs show the progression:
- `recovery=checking` - Gateway identifies potential issue
- `recovery=skipped` - Can't recover because `active_reply_work` in progress
- `recovery=none` - No recovery path implemented for this failure mode
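The progression above can be pulled out of the gateway log mechanically. The following is a minimal sketch, assuming the log lines keep the `key=value` layout shown in the excerpt; `parse_fields` and `recovery_timeline` are hypothetical helper names, not part of OpenClaw. Because the excerpt reports the skipped state in prose ("recovery skipped") rather than as a literal `recovery=skipped` field, the sketch maps that phrase to `skipped`.

```python
import re

# Matches key=value tokens such as "recovery=checking" or "sessionId=d939c79e...".
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_fields(line):
    """Return the key=value pairs found in one gateway log line."""
    return dict(FIELD_RE.findall(line))


def recovery_timeline(lines):
    """Collect, in order, the recovery state each log line reports.

    Lines containing "recovery skipped" are recorded as "skipped" together
    with their reason; other lines contribute their recovery= value together
    with their classification.
    """
    timeline = []
    for line in lines:
        fields = parse_fields(line)
        if "recovery skipped" in line:
            timeline.append(("skipped", fields.get("reason")))
        elif "recovery" in fields:
            timeline.append((fields["recovery"], fields.get("classification")))
    return timeline


# Log lines copied from the issue summary.
LOG = [
    "Stuck session: sessionId=d939c79e... reason=queued_work_without_active_run "
    "classification=stale_session_state recovery=checking",
    "stuck session recovery skipped: reason=active_reply_work action=keep_lane "
    "sessionId=d939c79e... activeSessionId=d939c79e...",
    "long-running session: classification=long_running recovery=none",
    "stalled session: classification=stalled_agent_run recovery=none",
]

for state, detail in recovery_timeline(LOG):
    print(state, detail)
```

Running this over a filtered log such as the attached logs-filtered.txt would show whether a session ever leaves the `none` state after being classified as `stalled_agent_run`.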

---

## Diagnosis

**What caused the HTTP 408?**
- HTTP 408 = "Request Timeout" - MiniMax took too long to respond
- This means the API **WAS reached** but MiniMax didn't respond within their timeout window
- The 408 is MiniMax telling OpenClaw "we're too slow, abort"

I downgraded and tried version 2026.4.23. The behavior was different: after a few hours of work I had 2 "similar" API errors, but OpenClaw retried them automatically and was never stuck or frozen. I did not see the recovery=none status message.

### Steps to reproduce

Both primary and secondary LLM providers need to fail.

### Expected behavior

1. Retry the primary, then fail over a few more times, and notify the user there is an issue. 
2. User needs to understand what is happening and not just wait for a response that never comes. 

I did not experience this with version 2026.4.23; it seems older versions handle primary and failover failures differently.
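The expected behavior described here (retry, fail over, then tell the user) could look roughly like the sketch below. It is an illustration under stated assumptions, not OpenClaw's implementation: `ProviderError`, `complete_with_failover`, the `notify` callback, and the set of status codes treated as transient are all hypothetical.

```python
import time

# Assumption: these HTTP statuses are treated as transient and worth retrying.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


class ProviderError(Exception):
    """Hypothetical error carrying the HTTP status a provider returned."""

    def __init__(self, status):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


def complete_with_failover(providers, prompt, notify, retries=2, backoff_s=0.0):
    """Try each provider in order; retry transient errors, then fail over.

    providers: list of (name, callable) pairs; each callable takes the prompt
    and returns a reply or raises ProviderError.
    notify: callable that receives a human-readable status message.
    """
    for name, call in providers:
        # One initial attempt plus `retries` retries per provider.
        for attempt in range(1, retries + 2):
            try:
                return call(prompt)
            except ProviderError as err:
                if err.status not in RETRYABLE_STATUS:
                    notify(f"{name}: non-retryable error ({err}); failing over")
                    break
                notify(f"{name}: attempt {attempt} failed ({err})")
                time.sleep(backoff_s * attempt)
    # Every provider is exhausted: surface the failure instead of hanging.
    notify("all providers failed; giving up instead of waiting silently")
    raise RuntimeError("all providers failed")


def always_timeout(prompt):
    """Stand-in for a provider that keeps answering HTTP 408."""
    raise ProviderError(408)


def ok(prompt):
    """Stand-in for a healthy secondary provider."""
    return "reply to: " + prompt


messages = []
chain = [("primary", always_timeout), ("secondary", ok)]
print(complete_with_failover(chain, "hi", messages.append))
print(messages)
```

The point of the design is the final `notify` and `raise`: when both the primary and the secondary fail, the caller gets an explicit terminal error rather than a session that stays queued until the gateway is restarted.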

### Actual behavior

1. The agent stops mid-work without a clear error.
2. When both the primary and secondary LLM fail, the gateway has to be restarted before the agent starts working again.

### OpenClaw version

2026.5.4

### Operating system

Ubuntu

### Install method

Script

### Model

Minimax m2.7

### Provider / routing chain

Minimax m2.7

### Additional provider/model setup details

_No response_


Bug type

Crash (process/app exits or hangs)

Beta release blocker

No

Impact and severity

The agent stops mid-work without a clear error. In other cases I can ask for status and it will continue, but when both the primary and secondary LLM fail, I need to restart the gateway for the agent to start working again.
Version 2026.4.23 and earlier handled agent freezes like this more gracefully and did not require a gateway restart.

Additional information

Attachments: system-info.txt, version.txt, plugins.json, gateway-status.txt, logs-filtered.txt (report.md failed to upload)

extent analysis

TL;DR

Downgrade to OpenClaw version 2026.4.23 to potentially resolve the issue with stuck sessions and recovery failures.

Guidance

  • The issue seems to be related to the handling of primary and secondary LLM provider failures in OpenClaw version 2026.5.4, which can cause the agent to freeze and require a gateway restart.
  • The logs indicate that the recovery process is skipped due to active_reply_work, leading to a recovery=none status, which may be a contributing factor to the issue.
  • Downgrading to version 2026.4.23, as mentioned in the issue, may provide a more graceful handling of agent freezing and failure scenarios.
  • Reviewing the provided logs and system information files (e.g., logs-filtered.txt, system-info.txt) may help identify additional factors contributing to the issue.

Notes

The exact cause of the issue is unclear, and downgrading may not be a permanent solution. Further investigation into the differences between OpenClaw versions 2026.4.23 and 2026.5.4 may be necessary to resolve the issue.

Recommendation

Apply the workaround by downgrading to OpenClaw version 2026.4.23, which the reporter found retried primary and secondary LLM provider failures automatically without freezing.
