openclaw - 💡(How to fix) Fix [Bug]: Direct Telegram session can stall on orphaned diagnostic tool activity, recovery aborts only after ~6 min and may close Codex app-server

Root Cause

A direct Telegram session stopped replying. Gateway diagnostics detected the stuck state, but user-visible recovery came only after about six minutes, and the recovery path appears to abort the embedded run in a way that also caused Codex app-server client failures.

This does not look like a config reload. The OpenClaw config file was unchanged and there was no config-change/reload event around the incident.

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Steps to reproduce

Environment

OpenClaw: 2026.5.26 (10ad3aa)
Runtime: OpenAI Codex / gpt-5.5
Channel: Telegram direct
Host: macOS, Node 22.22.2

Observed behavior

At 14:02-14:06 Europe/Paris, the direct session was logged repeatedly as stalled:

stalled session: sessionId=<redacted>
sessionKey=agent:dalva:telegram:direct:<redacted>
state=processing
age=130s
queueDepth=1
reason=blocked_tool_call
classification=blocked_tool_call
activeWorkKind=tool_call
lastProgress=codex_app_server:notification:rawResponseItem/completed
lastProgressAge=128s
activeTool=bash
activeToolCallId=exec-847c8ce1-0c30-4a5c-aeed-7619305d0f01
activeToolAge=70201s
terminalProgressStale=true
recovery=none

The key invariant violation is that activeToolAge=70201s (about 19.5 hours) while the current session's processing age was only 130s. A tool call cannot be older than the turn that issued it, which suggests the active tool entry was orphaned from an older run and remained attached to the same session key.

Recovery only became eligible around 370s:

stalled session ... age=370s ... activeToolAge=70441s ... recovery=checking
stuck session recovery: sessionId=<redacted> sessionKey=<redacted> age=370s action=abort_embedded_run aborted=true drained=true released=0
stuck session recovery outcome: status=aborted action=abort_embedded_run ... activeWorkKind=embedded_run ... aborted=true drained=true forceCleared=false released=0

Immediately after that, other work hit app-server failures:

codex app-server connection closed during startup; restarting app-server and retrying
Codex agent harness failed; not falling back to embedded PI backend
lane task error: lane=main durationMs=665 error="Error: codex app-server client is closed"
Embedded agent failed before reply: codex app-server client is closed

Expected behavior

A new turn/session should not inherit stale active tool activity from a previous embedded run.

If a session is blocked behind orphaned tool activity, recovery should clear the stale activity or release the lane without requiring the user to wait several minutes and without cascading into Codex app-server client closure.

Actual behavior

Code reading notes

The relevant paths in the installed bundle appear to map to:

src/logging/diagnostic-run-activity.ts
src/logging/diagnostic-session-attention.ts
src/logging/diagnostic-stuck-session-recovery.runtime.ts

In diagnostic-run-activity.ts, activity is keyed by sessionId and sessionKey and can be merged across refs. recordRunCompleted and markDiagnosticEmbeddedRunEnded clear activeTools, but markDiagnosticEmbeddedRunStarted does not clear inherited activeTools / activeModelCalls.

In getDiagnosticSessionActivitySnapshot, any leftover activeTools entry wins over activeEmbeddedRuns:

if (activity.activeTools.size > 0) activeWorkKind = "tool_call";
else if (activity.activeModelCalls.size > 0) activeWorkKind = "model_call";
else if (activity.activeEmbeddedRuns.size > 0) activeWorkKind = "embedded_run";
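To make the masking effect concrete, here is a minimal sketch of that priority order as a standalone function. The type names, the "none" fallback, and the sample IDs are assumptions for illustration only; they are not taken from the OpenClaw source.

```typescript
// Hypothetical shapes; the real activity record likely carries more fields.
type ActiveWorkKind = "tool_call" | "model_call" | "embedded_run" | "none";

interface ActivitySets {
  activeTools: Map<string, number>; // toolCallId -> startedAt (ms)
  activeModelCalls: Set<string>;
  activeEmbeddedRuns: Set<string>;
}

// Same priority order as the quoted snippet: tools, then model calls, then runs.
function resolveActiveWorkKind(activity: ActivitySets): ActiveWorkKind {
  if (activity.activeTools.size > 0) return "tool_call";
  if (activity.activeModelCalls.size > 0) return "model_call";
  if (activity.activeEmbeddedRuns.size > 0) return "embedded_run";
  return "none"; // assumed fallback, not shown in the quoted snippet
}

// One leftover tool entry next to a live embedded run.
const leaked: ActivitySets = {
  activeTools: new Map<string, number>([["exec-stale", 0]]),
  activeModelCalls: new Set<string>(),
  activeEmbeddedRuns: new Set<string>(["run-current"]),
};

// The stale tool wins, so the live run is reported as a tool call.
const maskedKind = resolveActiveWorkKind(leaked);
```

With this ordering, a single stale entry in activeTools is enough to make the current run look like a blocked tool call, which matches the activeWorkKind=tool_call seen in the stalled-session log.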

In diagnostic-session-attention.ts, blocked_tool_call is classified once both activeToolAgeMs and lastProgressAgeMs exceed the warn threshold, but recovery waits for both to exceed stuckSessionAbortMs:

toolAgeMs >= params.stuckSessionAbortMs &&
lastProgressAgeMs >= params.stuckSessionAbortMs

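That explains why the session logged recovery=none for several minutes even though the tool age was already many hours old: the tool-age condition was satisfied from the start, so the gate was lastProgressAgeMs, which was only 128s at the first warning and had to reach stuckSessionAbortMs on its own.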
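The gate can be sketched as a small pure function to show which term holds recovery back. The function name, the parameter interface, and the 300000 ms threshold are assumptions for illustration; the issue does not state the configured value of stuckSessionAbortMs.

```typescript
// Hypothetical parameter shape; only stuckSessionAbortMs appears in the quoted code.
interface RecoveryGateParams {
  stuckSessionAbortMs: number;
}

// Both ages must pass the abort threshold before recovery is attempted.
function isRecoveryEligible(
  toolAgeMs: number,
  lastProgressAgeMs: number,
  params: RecoveryGateParams,
): boolean {
  return (
    toolAgeMs >= params.stuckSessionAbortMs &&
    lastProgressAgeMs >= params.stuckSessionAbortMs
  );
}

// Assumed threshold, chosen only to make the example concrete.
const assumedParams: RecoveryGateParams = { stuckSessionAbortMs: 300_000 };

// Values from the first stalled-session log line: tool age 70201s, last progress 128s.
const eligibleAtFirstWarning = isRecoveryEligible(70_201_000, 128_000, assumedParams);

// Once last progress is as old as the threshold, the gate opens.
const eligibleOnceProgressStale = isRecoveryEligible(
  70_441_000,
  assumedParams.stuckSessionAbortMs,
  assumedParams,
);
```

Under these assumptions the first call is false and the second is true: the hours-old tool age contributes nothing to the wait, and the delay is set entirely by how long it takes lastProgressAgeMs to cross the threshold.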

Patch direction, not applied locally

I would consider a small fix in diagnostic-run-activity.ts:

On markDiagnosticEmbeddedRunStarted, clear inherited activeTools and activeModelCalls before adding the new embedded run work key.
Alternatively, track a generation/run owner for active tools and ignore active tools whose startedAt predates the current embedded run start.
Add a regression test where an old active tool remains in activity for a session key, a new embedded run starts for the same session key, and the snapshot must not report the old tool as current active work.
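The first option and the regression scenario could look roughly like the sketch below. ActivityTracker and its method names are hypothetical stand-ins, not the real diagnostic-run-activity.ts API, and the sketch assumes at most one embedded run per session key at a time.

```typescript
// Hypothetical per-session record: IDs mapped to startedAt timestamps (ms).
interface SessionActivity {
  activeTools: Map<string, number>;
  activeModelCalls: Map<string, number>;
  activeEmbeddedRuns: Map<string, number>;
}

type WorkKind = "tool_call" | "model_call" | "embedded_run" | "none";

class ActivityTracker {
  private readonly bySessionKey = new Map<string, SessionActivity>();

  private activityFor(sessionKey: string): SessionActivity {
    let activity = this.bySessionKey.get(sessionKey);
    if (!activity) {
      activity = {
        activeTools: new Map<string, number>(),
        activeModelCalls: new Map<string, number>(),
        activeEmbeddedRuns: new Map<string, number>(),
      };
      this.bySessionKey.set(sessionKey, activity);
    }
    return activity;
  }

  markToolStarted(sessionKey: string, toolCallId: string, nowMs: number): void {
    this.activityFor(sessionKey).activeTools.set(toolCallId, nowMs);
  }

  // Proposed behavior: a new run drops whatever the previous run left behind.
  markEmbeddedRunStarted(sessionKey: string, runId: string, nowMs: number): void {
    const activity = this.activityFor(sessionKey);
    activity.activeTools.clear();
    activity.activeModelCalls.clear();
    activity.activeEmbeddedRuns.set(runId, nowMs);
  }

  snapshotWorkKind(sessionKey: string): WorkKind {
    const activity = this.activityFor(sessionKey);
    if (activity.activeTools.size > 0) return "tool_call";
    if (activity.activeModelCalls.size > 0) return "model_call";
    if (activity.activeEmbeddedRuns.size > 0) return "embedded_run";
    return "none";
  }
}

// Regression scenario: an old tool never ended, then a new run starts on the same key.
const tracker = new ActivityTracker();
tracker.markToolStarted("session-a", "exec-old", 0);
tracker.markEmbeddedRunStarted("session-a", "run-new", 70_000_000);
const kindAfterNewRun = tracker.snapshotWorkKind("session-a");
```

In this sketch the snapshot reports embedded_run after the new run starts. Clearing on start is only safe if runs for a session key never overlap; if they can, the generation/run-owner alternative avoids discarding tools that belong to a concurrent run.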

A second defensive improvement could live in isBlockedToolCallRecoveryEligible: if the active tool age is much greater than the current session processing age or predates the active embedded run, classify it as orphaned diagnostic activity and recover earlier. The cleaner fix is probably ownership/generation in the activity tracker.
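A minimal sketch of that orphan check follows, assuming the snapshot can supply the session processing age and, optionally, the age of the active embedded run. The function name, field names, and the 5000 ms slack are hypothetical and would need tuning against real clock and scheduling behavior.

```typescript
// Hypothetical input; all ages are in milliseconds.
interface ToolAgeSample {
  sessionAgeMs: number; // how long the session has been in "processing"
  activeToolAgeMs: number; // age of the oldest active tool entry
  embeddedRunAgeMs?: number; // age of the active embedded run, if known
}

// A tool older than its owning run (or, failing that, the current processing
// window) cannot belong to the current work, so treat it as orphaned.
function looksOrphaned(sample: ToolAgeSample, slackMs: number = 5_000): boolean {
  const ownerAgeMs = sample.embeddedRunAgeMs ?? sample.sessionAgeMs;
  return sample.activeToolAgeMs > ownerAgeMs + slackMs;
}

// Values from the first stalled-session log line in this report.
const incidentSample: ToolAgeSample = { sessionAgeMs: 130_000, activeToolAgeMs: 70_201_000 };
const incidentIsOrphaned = looksOrphaned(incidentSample);

// A tool that started 60s into a 130s-old turn is not flagged.
const healthySample: ToolAgeSample = { sessionAgeMs: 130_000, activeToolAgeMs: 70_000 };
const healthyIsOrphaned = looksOrphaned(healthySample);
```

With the logged values, the incident sample is flagged as orphaned at the first warning (age=130s) rather than several minutes later, while a tool started within the current turn is left alone. This is a heuristic layered on top of the tracker; it does not replace fixing ownership at the source.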

Why I think this is not config

~/.openclaw/openclaw.json mtime: 2026-05-28 12:59:45 +0200
~/.openclaw/openclaw.json.last-good mtime: 2026-05-28 13:00:06 +0200
No Gateway config reload event around the direct-session stall window.

OpenClaw version

2026.5.26

Operating system

macOS 26.4

Install method

npm global

Model

OpenAI Codex / gpt-5.5

Provider / routing chain

openclaw -> gateway -> Telegram

Additional provider/model setup details

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response
