openclaw: Diagnostic recovery aborts active embedded runs after terminal-looking app-server progress

openclaw, 2026-05-22 22:59:01

OpenClaw core diagnostics repeatedly classify embedded Codex app-server runs as stalled after terminal-looking progress events such as codex_app_server:notification:rawResponseItem/completed, then run the core recovery action abort_embedded_run. The abort releases the session lane, but it can also cut off long-running direct/group work that was still producing useful navigation, browser, tool, or assistant progress.

This is not Taskdash, raw-Codex watch, or our custom outcome-supervisor making a decision. The exact warning and recovery strings are emitted by the installed OpenClaw runtime:

/usr/local/lib/node_modules/openclaw/dist/diagnostic-CgdFvhDv.js
/usr/local/lib/node_modules/openclaw/dist/diagnostic-stuck-session-recovery.runtime-C6DQkhmb.js

The user-visible failure is severe: direct work can appear "done" or simply stop responding after CAPTCHA/browser/navigation work, while the task outcome is incomplete and there is no clear terminal delivery. The operator then has to manually reconcile whether the work finished, was blocked, or was system-aborted.

Root Cause

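Diagnostic classification treats an item-level, terminal-looking app-server notification such as rawResponseItem/completed as evidence that the active embedded run is stale whenever queued work exists, and recovery then aborts the run once it passes the abort threshold. OpenClaw should not abort a direct/group embedded run solely because the last low-level app-server event looks terminal while queued work exists.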

Fix Action

Fixed

Fixed by PR: fix(diagnostic): recover terminal-progress orphan embedded runs at 60s (https://github.com/openclaw/openclaw/pull/85571)

OpenClaw Diagnostic Recovery Aborts Active Embedded Runs

Date: 2026-05-23
Reporter environment: macOS 26.4.1 arm64, OpenClaw 2026.5.20 (e510042), Node v24.13.0, OpenAI Codex embedded runner.

Summary

This is not Taskdash, raw-Codex watch, or our custom outcome-supervisor making a decision. The exact warning and recovery strings are emitted by the installed OpenClaw runtime:

/usr/local/lib/node_modules/openclaw/dist/diagnostic-CgdFvhDv.js
/usr/local/lib/node_modules/openclaw/dist/diagnostic-stuck-session-recovery.runtime-C6DQkhmb.js

Impact

Long direct/group tasks are aborted by core recovery even when the app-server stream still alternates between terminal-looking events and real progress events.
Queued user turns can resume, but the original task outcome is left uncertain.
The session/run can look terminal in dashboards even when task-level work did not finish.
Operators must manually inspect transcripts, browser state, logs, and local files to determine whether work succeeded.

Observed Evidence

From local gateway diagnostics, sanitized:

14 stuck session recovery: ... action=abort_embedded_run aborted=true events in the retained gateway diagnostic log.
11 of those were direct Telegram sessions.
3 were group Telegram sessions.
59 queued_behind_terminal_active_work stall warnings in the same retained diagnostic log.

Representative sanitized sequence:

2026-05-19T20:07:58.897+03:00 [diagnostic] stalled session: sessionId=<redacted-session-id> sessionKey=agent:main:telegram:default:direct:<redacted-chat> state=processing age=389s queueDepth=1 reason=queued_behind_terminal_active_work classification=stalled_agent_run activeWorkKind=embedded_run lastProgress=codex_app_server:notification:rawResponseItem/completed lastProgressAge=3s terminalProgressStale=true recovery=checking
2026-05-19T20:07:58.940+03:00 [diagnostic] stuck session recovery: sessionId=<redacted-session-id> sessionKey=agent:main:telegram:default:direct:<redacted-chat> age=389s action=abort_embedded_run aborted=true drained=true released=0
2026-05-19T20:07:58.942+03:00 [diagnostic] stuck session recovery outcome: status=aborted action=abort_embedded_run sessionId=<redacted-session-id> sessionKey=agent:main:telegram:default:direct:<redacted-chat> activeSessionId=<redacted-session-id> activeWorkKind=embedded_run lane=session:agent:main:telegram:default:direct:<redacted-chat> aborted=true drained=true forceCleared=false released=0

Later examples show the same pattern:

2026-05-21T19:49:35.284+03:00 [diagnostic] stalled session: sessionId=<redacted-session-id> sessionKey=agent:main:telegram:default:direct:<redacted-chat> state=processing age=435s queueDepth=1 reason=queued_behind_terminal_active_work classification=stalled_agent_run activeWorkKind=embedded_run lastProgress=codex_app_server:notification:rawResponseItem/completed lastProgressAge=1s terminalProgressStale=true recovery=checking
2026-05-21T19:49:35.331+03:00 [diagnostic] stuck session recovery: sessionId=<redacted-session-id> sessionKey=agent:main:telegram:default:direct:<redacted-chat> age=435s action=abort_embedded_run aborted=true drained=true released=0

2026-05-21T23:54:05.945+03:00 [diagnostic] stalled session: sessionId=<redacted-session-id> sessionKey=agent:main:telegram:default:direct:<redacted-chat> state=processing age=417s queueDepth=1 reason=queued_behind_terminal_active_work classification=stalled_agent_run activeWorkKind=embedded_run lastProgress=codex_app_server:notification:rawResponseItem/completed lastProgressAge=2s terminalProgressStale=true recovery=checking
2026-05-21T23:54:05.987+03:00 [diagnostic] stuck session recovery: sessionId=<redacted-session-id> sessionKey=agent:main:telegram:default:direct:<redacted-chat> age=417s action=abort_embedded_run aborted=true drained=true released=0
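For anyone wanting to reproduce counts like the ones above from their own retained log, here is a minimal sketch of a tally script. It assumes the line layout seen in the sanitized excerpts (timestamp, [diagnostic], a label, then key=value pairs); this is inferred from the excerpts, not from a documented OpenClaw log format. The function names and the rule for reading the chat type out of sessionKey are hypothetical.

```javascript
// Sketch only: the field layout is inferred from the sanitized excerpts,
// not from a documented OpenClaw log format.
function parseDiagnosticLine(line) {
  const match = /^(\S+) \[diagnostic\] ([^:]+): (.*)$/.exec(line);
  if (!match) return null;
  const [, timestamp, kind, rest] = match;
  const fields = {};
  for (const token of rest.split(/\s+/)) {
    const eq = token.indexOf("=");
    // Tokens without "=" (for example an elided "...") are skipped.
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  return { timestamp, kind, fields };
}

// Tally stall warnings and recovery aborts across a list of log lines.
function summarizeDiagnostics(lines) {
  const summary = { stallWarnings: 0, aborts: 0, abortsByChatType: {} };
  for (const line of lines) {
    const entry = parseDiagnosticLine(line);
    if (!entry) continue;
    const { kind, fields } = entry;
    if (kind === "stalled session" && fields.reason === "queued_behind_terminal_active_work") {
      summary.stallWarnings += 1;
    }
    if (
      kind === "stuck session recovery" &&
      fields.action === "abort_embedded_run" &&
      fields.aborted === "true"
    ) {
      summary.aborts += 1;
      // Assumption: the chat type is the second-to-last segment of sessionKey,
      // as in agent:main:telegram:default:direct:<redacted-chat>.
      const segments = (fields.sessionKey ?? "").split(":");
      const chatType = segments.length >= 2 ? segments[segments.length - 2] : "unknown";
      summary.abortsByChatType[chatType] = (summary.abortsByChatType[chatType] ?? 0) + 1;
    }
  }
  return summary;
}
```

Feeding it the lines of a gateway diagnostic log (for example, the file contents split on newlines) gives one object with the warning count, the abort count, and a per-chat-type breakdown. The "stuck session recovery outcome" lines are deliberately not counted as aborts, so each recovery is tallied once.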

An operator-provided fresh excerpt from 2026-05-23 showed the same session repeatedly alternating between:

long-running session ... lastProgress=codex_app_server:notification:item/agentMessage/delta
long-running session ... lastProgress=codex_app_server:notification:turn/diff/updated
stalled session ... lastProgress=codex_app_server:notification:rawResponseItem/completed ... recovery=checking
stuck session recovery ... action=abort_embedded_run aborted=true drained=true

That run had recently solved a CAPTCHA and navigated pages, then stopped without a visible final direct response. Live state later showed no active direct run, which is consistent with core recovery having ended the embedded run while task-level work remained unresolved.

Source-Level Suspect

In the installed build, classification treats terminal-looking Codex app-server notifications as stale/terminal active work when queued work exists:

if (params.queueDepth > 0 && params.activity.activeWorkKind === "embedded_run" && isTerminalDiagnosticProgressReason(params.activity.lastProgressReason)) return {
  eventType: "session.stalled",
  reason: "queued_behind_terminal_active_work",
  classification: "stalled_agent_run",
  activeWorkKind: params.activity.activeWorkKind,
  recoveryEligible: false
};

Then separate recovery eligibility permits active abort for stalled embedded runs after the abort threshold:

return params.classification?.eventType === "session.stalled" &&
  params.classification.classification === "stalled_agent_run" &&
  params.classification.activeWorkKind === "embedded_run" &&
  params.ageMs >= params.stuckSessionAbortMs;

The recovery runtime then calls abortAndDrainEmbeddedPiRun and emits:

action=abort_embedded_run aborted=true drained=true

This means a notification such as rawResponseItem/completed can become a recovery trigger even when the larger app-server turn/session still has useful later progress or task-level obligations.
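The interaction between the two excerpts can be sketched end to end. The following is an illustration under stated assumptions, not the installed code: the body of isTerminalDiagnosticProgressReason is a hypothetical stand-in (the real predicate is not shown in this report), the wrapper function names classify and isAbortEligible are made up, and the 360-second threshold is an illustrative value only.

```javascript
// Hypothetical stand-in: the real predicate's body is not shown in this report.
function isTerminalDiagnosticProgressReason(reason) {
  return typeof reason === "string" && reason.endsWith("/completed");
}

// Mirrors the classification excerpt quoted above.
function classify(params) {
  if (
    params.queueDepth > 0 &&
    params.activity.activeWorkKind === "embedded_run" &&
    isTerminalDiagnosticProgressReason(params.activity.lastProgressReason)
  ) {
    return {
      eventType: "session.stalled",
      reason: "queued_behind_terminal_active_work",
      classification: "stalled_agent_run",
      activeWorkKind: params.activity.activeWorkKind,
      recoveryEligible: false,
    };
  }
  return undefined;
}

// Mirrors the eligibility excerpt quoted above; the function name is made up.
function isAbortEligible(params) {
  return (
    params.classification?.eventType === "session.stalled" &&
    params.classification.classification === "stalled_agent_run" &&
    params.classification.activeWorkKind === "embedded_run" &&
    params.ageMs >= params.stuckSessionAbortMs
  );
}

const classification = classify({
  queueDepth: 1,
  activity: {
    activeWorkKind: "embedded_run",
    lastProgressReason: "codex_app_server:notification:rawResponseItem/completed",
    lastProgressAgeMs: 3_000, // progress 3s ago, as in the first log excerpt
  },
});

const abortEligible = isAbortEligible({
  classification,
  ageMs: 389_000, // session age from the first log excerpt
  stuckSessionAbortMs: 360_000, // illustrative value; the real threshold is not stated here
});

console.log(classification.recoveryEligible, abortEligible); // prints: false true
```

Two things stand out in this sketch. Neither check consults lastProgressAgeMs, so progress from three seconds ago does not protect the run. And the eligibility check never reads recoveryEligible, so a classification that says false can still lead to an abort.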

Expected Behavior

OpenClaw should not abort a direct/group embedded run solely because the last low-level app-server event looks terminal while queued work exists.

Safer behavior:

Distinguish "terminal response item" from "terminal run/session/task".
Require a durable run/session terminal event, or a stronger no-progress invariant, before aborting active embedded work.
If recovery is necessary, mark the session/run outcome distinctly as system_aborted or equivalent, with enough evidence for UI/API consumers to avoid showing normal done.
Preserve and surface whether a final assistant response was delivered to the original channel.
Prefer lane release or queue backpressure mechanisms that do not interrupt active browser/tool/model work unless active work is proven orphaned.
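A rough sketch of the first two points, to make the proposal concrete. Everything here is hypothetical: the event lists, the function names, the parameter names, and the reason strings in `why` are illustrations, and turn/completed is only an assumed name for a durable run-level event. Only rawResponseItem/completed is taken from the logs in this report.

```javascript
// Hypothetical event lists; only rawResponseItem/completed appears in the logs above.
const ITEM_LEVEL_TERMINAL = ["rawResponseItem/completed"];
const RUN_LEVEL_TERMINAL = ["turn/completed"]; // assumed name for a durable run-level event

// Classify how much of the work a progress reason actually terminates.
function terminalScope(reason) {
  const name = String(reason ?? "");
  if (RUN_LEVEL_TERMINAL.some((suffix) => name.endsWith(suffix))) return "run";
  if (ITEM_LEVEL_TERMINAL.some((suffix) => name.endsWith(suffix))) return "item";
  return "none";
}

// Abort only on a run-level terminal event or a sustained no-progress window.
function decideRecovery({ lastProgressReason, lastProgressAgeMs, ageMs, abortAfterMs, noProgressMs }) {
  if (ageMs < abortAfterMs) {
    return { action: "none", why: "below_abort_threshold" };
  }
  if (terminalScope(lastProgressReason) === "run") {
    // The run reported completion but the lane is still held: likely orphaned.
    return { action: "abort_embedded_run", why: "run_terminal_event_but_lane_held" };
  }
  if (lastProgressAgeMs >= noProgressMs) {
    return { action: "abort_embedded_run", why: "no_progress_window_exceeded" };
  }
  // An item-level "completed" with recent progress is not evidence of a stall.
  return { action: "none", why: "recent_progress" };
}
```

With the values from the first log excerpt (age 389s, last progress 3s ago, last reason rawResponseItem/completed), this decision would leave the run alone as long as the no-progress window is longer than a few seconds.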

Actual Behavior

OpenClaw core emits recovery=checking, calls abort_embedded_run, reports aborted=true drained=true, and the original user task can become outcome-ambiguous. Downstream tools that reconstruct task state from session rows can then flatten the row to done/completed because they see terminal timestamps or clean model completion fragments without the diagnostic recovery context.
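To illustrate the flattening problem, here is a sketch of how a downstream consumer could avoid it if the recovery context were carried on the session row. All field names on `row` (recovery, endedAt, finalResponseDelivered) and the status strings other than system_aborted are hypothetical; this report does not describe the actual session row schema.

```javascript
// All field names on `row` are hypothetical; the report does not describe
// the real session row schema.
function deriveTaskStatus(row) {
  // Core recovery ended the run: never report this as a normal completion.
  if (row.recovery?.action === "abort_embedded_run" && row.recovery.aborted === true) {
    return "system_aborted";
  }
  // No terminal timestamp yet: the run is still in flight.
  if (row.endedAt == null) {
    return "running";
  }
  // The run ended, but no final assistant response reached the original channel.
  if (!row.finalResponseDelivered) {
    return "ended_without_delivery";
  }
  return "done";
}
```

The ordering matters in this sketch: the recovery check comes first, so a terminal timestamp written by the abort itself cannot be mistaken for a clean finish.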

Related Issues

Potentially related but not identical:

This report is specifically about diagnostic recovery using terminal-looking app-server notification reasons to abort embedded direct/group runs, causing task outcome loss or ambiguity.

Suggested Fix Shape

Treat rawResponseItem/completed, response.completed, output_item.done, and similar item-level events as terminal only for the item/span they describe, not for the whole embedded run.
Reset or downgrade terminalProgressStale when newer non-terminal progress follows, including item/started, item/agentMessage/delta, turn/diff/updated, tool activity, browser activity, or assistant delta.
Add a "system aborted" terminal classification to session/run state when core recovery does abort, so dashboard/API consumers can distinguish core recovery from normal completion and operator abort.
Add regression coverage for a sequence that alternates:
- rawResponseItem/completed
- later assistant/tool/progress events
- queued follow-up work
- no durable run/session completion

The expected result for that sequence should not be abort_embedded_run unless the embedded run is independently proven orphaned.
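The regression sequence could look roughly like the sketch below. It uses a hypothetical in-memory tracker, not OpenClaw's diagnostic state, and it defines terminalProgressStale in its own way: stale only when the most recent event is an item-level terminal notification and nothing newer has arrived for a full window. The 60-second window and all function names are illustrative assumptions.

```javascript
// Hypothetical predicate covering only the notification seen in the logs above.
function isItemLevelTerminal(reason) {
  return reason.endsWith("rawResponseItem/completed");
}

// Hypothetical tracker: any newer event replaces the previous one, so
// non-terminal progress resets the stale flag.
function createProgressTracker(staleAfterMs) {
  let lastReason = null;
  let lastAtMs = null;
  return {
    record(reason, atMs) {
      lastReason = reason;
      lastAtMs = atMs;
    },
    snapshot(nowMs) {
      const lastProgressAgeMs = lastAtMs === null ? null : nowMs - lastAtMs;
      const terminalProgressStale =
        lastReason !== null &&
        isItemLevelTerminal(lastReason) &&
        lastProgressAgeMs >= staleAfterMs;
      return { lastProgressReason: lastReason, lastProgressAgeMs, terminalProgressStale };
    },
  };
}

// Replays the alternating sequence and returns the stale flag at each check.
function runAlternatingSequence() {
  const prefix = "codex_app_server:notification:";
  const tracker = createProgressTracker(60_000); // illustrative window
  const checks = [];

  tracker.record(prefix + "item/started", 0);
  tracker.record(prefix + "rawResponseItem/completed", 5_000);
  checks.push(tracker.snapshot(8_000).terminalProgressStale); // 3s after a terminal item

  tracker.record(prefix + "item/agentMessage/delta", 9_000);
  tracker.record(prefix + "turn/diff/updated", 12_000);
  tracker.record(prefix + "rawResponseItem/completed", 15_000);
  checks.push(tracker.snapshot(18_000).terminalProgressStale); // again 3s after a terminal item

  checks.push(tracker.snapshot(80_000).terminalProgressStale); // 65s of silence
  return checks;
}

console.log(runAlternatingSequence()); // prints: [ false, false, true ]
```

Only the last check, taken after a full window of silence, reports stale. A recovery decision built on this flag would not reach abort_embedded_run during the alternating phase, which is the behavior the regression test should pin down.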
