openclaw - 💡(How to fix) Fix diagnostic: 'stuck session' false positive floods log for long-running embedded agent runs [1 pull requests]

openclaw2026-05-09 20:48:04

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Error Message

diagnostic subsystem emits stuck session warn-level events every 30s for any agent session that runs longer than 120s, even when the session is healthy and actively producing tool calls. 2026-05-09T10:03:23.234Z warn diagnostic stuck session: sessionId=blane sessionKey=agent:blane:cron:67563352-...:run:63c24af6-... state=processing age=121s queueDepth=0 reason=processing_without_queue recovery=checking 2026-05-09T10:03:23.240Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only sessionId=blane ... age=121s ... 2026-05-09T10:03:53.241Z warn diagnostic stuck session: ... age=151s ... 2026-05-09T10:03:53.243Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only ... age=151s ... Each "stuck" warning is followed almost immediately (within milliseconds) by recovery skipped: reason=active_embedded_run action=observe_only — i.e. the diagnostic itself recognizes the session is fine and takes no action. It's a benign condition logged at warn level. For long-running agent sessions (cron-triggered article-writing flows that take 5-10 minutes), the gateway log accumulates dozens of pairs of warn lines per session. In our setup with two agents on 30-min crons, this generates ~70% of the warn-level log volume and drowns out the signals that actually matter (real LLM timeouts, network errors, etc.).

Demote to debug level when recovery=skipped due to active_embedded_run — only fire warn when recovery actually runs or fails.

Fix Action

Fixed

Fixed by PR: fix(diagnostic): demote active_embedded_run recovery skips to debug (https://github.com/openclaw/openclaw/pull/79979)

Code Example

2026-05-09T10:03:23.234Z warn diagnostic stuck session: sessionId=blane sessionKey=agent:blane:cron:67563352-...:run:63c24af6-... state=processing age=121s queueDepth=0 reason=processing_without_queue recovery=checking
2026-05-09T10:03:23.240Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only sessionId=blane ... age=121s ...
2026-05-09T10:03:53.241Z warn diagnostic stuck session: ... age=151s ...
2026-05-09T10:03:53.243Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only ... age=151s ...
[and so on, every 30s, until the session ends]

RAW_BUFFERClick to expand / collapse

Bug

diagnostic subsystem emits stuck session warn-level events every 30s for any agent session that runs longer than 120s, even when the session is healthy and actively producing tool calls.

What I see in the log

2026-05-09T10:03:23.234Z warn diagnostic stuck session: sessionId=blane sessionKey=agent:blane:cron:67563352-...:run:63c24af6-... state=processing age=121s queueDepth=0 reason=processing_without_queue recovery=checking
2026-05-09T10:03:23.240Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only sessionId=blane ... age=121s ...
2026-05-09T10:03:53.241Z warn diagnostic stuck session: ... age=151s ...
2026-05-09T10:03:53.243Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only ... age=151s ...
[and so on, every 30s, until the session ends]

Each "stuck" warning is followed almost immediately (within milliseconds) by recovery skipped: reason=active_embedded_run action=observe_only — i.e. the diagnostic itself recognizes the session is fine and takes no action. It's a benign condition logged at warn level.

Why this is a problem

For long-running agent sessions (cron-triggered article-writing flows that take 5-10 minutes), the gateway log accumulates dozens of pairs of warn lines per session. In our setup with two agents on 30-min crons, this generates ~70% of the warn-level log volume and drowns out the signals that actually matter (real LLM timeouts, network errors, etc.).

Suggested fix (any of)

Demote to debug level when recovery=skipped due to active_embedded_run — only fire warn when recovery actually runs or fails.
Don't log the "stuck" detection at all when reason=processing_without_queue AND there's an active embedded run — the diagnostic is by its own admission a false positive in that case.
Increase the threshold for processing_without_queue to something like 600s (above the cron timeout), so legitimate long sessions never get flagged.

Environment

OpenClaw 2026.4.29 (a448042)
Windows 11, Node 24
Two agents (brockman, blane), running on 30-minute cron schedules
Sessions naturally run 4-9 minutes when there's editorial work to do

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#task chaining #parallel task #integration issue #index setup #retrieval issue

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix diagnostic: 'stuck session' false positive floods log for long-running embedded agent runs [1 pull requests]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fixed

Code Example

Bug

What I see in the log

Why this is a problem

Suggested fix (any of)

Environment

Still need to ship something?

TRENDING

openclaw - 💡(How to fix) Fix diagnostic: 'stuck session' false positive floods log for long-running embedded agent runs [1 pull requests]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fixed

Code Example

Bug

What I see in the log

Why this is a problem

Suggested fix (any of)

Environment

Still need to ship something?

RELATED_DISCOVERY

TRENDING