openclaw - 💡(How to fix) Fix diagnostic: 'stuck session' false positive floods log for long-running embedded agent runs [1 pull requests]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

diagnostic subsystem emits stuck session warn-level events every 30s for any agent session that runs longer than 120s, even when the session is healthy and actively producing tool calls. 2026-05-09T10:03:23.234Z warn diagnostic stuck session: sessionId=blane sessionKey=agent:blane:cron:67563352-...:run:63c24af6-... state=processing age=121s queueDepth=0 reason=processing_without_queue recovery=checking 2026-05-09T10:03:23.240Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only sessionId=blane ... age=121s ... 2026-05-09T10:03:53.241Z warn diagnostic stuck session: ... age=151s ... 2026-05-09T10:03:53.243Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only ... age=151s ... Each "stuck" warning is followed almost immediately (within milliseconds) by recovery skipped: reason=active_embedded_run action=observe_only — i.e. the diagnostic itself recognizes the session is fine and takes no action. It's a benign condition logged at warn level. For long-running agent sessions (cron-triggered article-writing flows that take 5-10 minutes), the gateway log accumulates dozens of pairs of warn lines per session. In our setup with two agents on 30-min crons, this generates ~70% of the warn-level log volume and drowns out the signals that actually matter (real LLM timeouts, network errors, etc.).

  1. Demote to debug level when recovery=skipped due to active_embedded_run — only fire warn when recovery actually runs or fails.

Fix Action

Fixed

Code Example

2026-05-09T10:03:23.234Z warn diagnostic stuck session: sessionId=blane sessionKey=agent:blane:cron:67563352-...:run:63c24af6-... state=processing age=121s queueDepth=0 reason=processing_without_queue recovery=checking
2026-05-09T10:03:23.240Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only sessionId=blane ... age=121s ...
2026-05-09T10:03:53.241Z warn diagnostic stuck session: ... age=151s ...
2026-05-09T10:03:53.243Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only ... age=151s ...
[and so on, every 30s, until the session ends]
RAW_BUFFERClick to expand / collapse

Bug

diagnostic subsystem emits stuck session warn-level events every 30s for any agent session that runs longer than 120s, even when the session is healthy and actively producing tool calls.

What I see in the log

2026-05-09T10:03:23.234Z warn diagnostic stuck session: sessionId=blane sessionKey=agent:blane:cron:67563352-...:run:63c24af6-... state=processing age=121s queueDepth=0 reason=processing_without_queue recovery=checking
2026-05-09T10:03:23.240Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only sessionId=blane ... age=121s ...
2026-05-09T10:03:53.241Z warn diagnostic stuck session: ... age=151s ...
2026-05-09T10:03:53.243Z warn diagnostic stuck session recovery skipped: reason=active_embedded_run action=observe_only ... age=151s ...
[and so on, every 30s, until the session ends]

Each "stuck" warning is followed almost immediately (within milliseconds) by recovery skipped: reason=active_embedded_run action=observe_only — i.e. the diagnostic itself recognizes the session is fine and takes no action. It's a benign condition logged at warn level.

Why this is a problem

For long-running agent sessions (cron-triggered article-writing flows that take 5-10 minutes), the gateway log accumulates dozens of pairs of warn lines per session. In our setup with two agents on 30-min crons, this generates ~70% of the warn-level log volume and drowns out the signals that actually matter (real LLM timeouts, network errors, etc.).

Suggested fix (any of)

  1. Demote to debug level when recovery=skipped due to active_embedded_run — only fire warn when recovery actually runs or fails.
  2. Don't log the "stuck" detection at all when reason=processing_without_queue AND there's an active embedded run — the diagnostic is by its own admission a false positive in that case.
  3. Increase the threshold for processing_without_queue to something like 600s (above the cron timeout), so legitimate long sessions never get flagged.

Environment

  • OpenClaw 2026.4.29 (a448042)
  • Windows 11, Node 24
  • Two agents (brockman, blane), running on 30-minute cron schedules
  • Sessions naturally run 4-9 minutes when there's editorial work to do

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix diagnostic: 'stuck session' false positive floods log for long-running embedded agent runs [1 pull requests]