openclaw [Bug]: Direct Telegram session can stall on orphaned diagnostic tool activity; recovery aborts only after ~6 min and may close Codex app-server

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

A direct Telegram session stopped replying. Gateway diagnostics detected the stuck state, but the user-visible recovery happened too late and the recovery path appears to abort the embedded run in a way that also caused Codex app-server client failures.

This does not look like a config reload. The OpenClaw config file was unchanged and there was no config-change/reload event around the incident.

Steps to reproduce

Environment

  • OpenClaw: 2026.5.26 (10ad3aa)
  • Runtime: OpenAI Codex / gpt-5.5
  • Channel: Telegram direct
  • Host: macOS, Node 22.22.2

Observed behavior

At 14:02-14:06 Europe/Paris, the direct session was logged repeatedly as stalled:

stalled session: sessionId=<redacted>
sessionKey=agent:dalva:telegram:direct:<redacted>
state=processing
age=130s
queueDepth=1
reason=blocked_tool_call
classification=blocked_tool_call
activeWorkKind=tool_call
lastProgress=codex_app_server:notification:rawResponseItem/completed
lastProgressAge=128s
activeTool=bash
activeToolCallId=exec-847c8ce1-0c30-4a5c-aeed-7619305d0f01
activeToolAge=70201s
terminalProgressStale=true
recovery=none

The important invariant breach is that activeToolAge=70201s (about 19.5 hours) while the current session processing age was only 130s. That suggests the active tool activity was orphaned from an older run and was still attached to the same session key.
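The age comparison above can be checked mechanically when triaging gateway logs. The following is a minimal sketch, assuming the diagnostic line is a whitespace-separated list of key=value tokens as in the excerpt; `parseStalledLine`, `parseSeconds`, and `looksOrphaned` are hypothetical helper names, not OpenClaw functions, and the "tool older than the session" rule is only a heuristic.

```typescript
// Hypothetical triage helpers; none of these names come from OpenClaw.
type StalledFields = Record<string, string>;

// Split a diagnostic line into key=value tokens; tokens without "=" are skipped.
function parseStalledLine(line: string): StalledFields {
  const fields: StalledFields = {};
  for (const token of line.split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  return fields;
}

// Convert a value such as "130s" to 130; returns undefined for anything else.
function parseSeconds(value: string | undefined): number | undefined {
  const match = value === undefined ? null : /^(\d+)s$/.exec(value);
  return match ? Number(match[1]) : undefined;
}

// Heuristic: a tool that started before the current processing window
// cannot belong to the work that window is waiting on.
function looksOrphaned(fields: StalledFields): boolean {
  const sessionAge = parseSeconds(fields["age"]);
  const toolAge = parseSeconds(fields["activeToolAge"]);
  return sessionAge !== undefined && toolAge !== undefined && toolAge > sessionAge;
}

const sampleLine =
  "stalled session: state=processing age=130s activeTool=bash activeToolAge=70201s recovery=none";
console.log(looksOrphaned(parseStalledLine(sampleLine))); // true, since 70201s > 130s
```

Running this over a log file would surface every stalled-session line where the tool age exceeds the session age, which is the signature described in this report.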

Recovery only became eligible around 370s:

stalled session ... age=370s ... activeToolAge=70441s ... recovery=checking
stuck session recovery: sessionId=<redacted> sessionKey=<redacted> age=370s action=abort_embedded_run aborted=true drained=true released=0
stuck session recovery outcome: status=aborted action=abort_embedded_run ... activeWorkKind=embedded_run ... aborted=true drained=true forceCleared=false released=0

Immediately after that, other work hit app-server failures:

codex app-server connection closed during startup; restarting app-server and retrying
Codex agent harness failed; not falling back to embedded PI backend
lane task error: lane=main durationMs=665 error="Error: codex app-server client is closed"
Embedded agent failed before reply: codex app-server client is closed

Expected behavior

A new turn/session should not inherit stale active tool activity from a previous embedded run.

If a session is blocked behind orphaned tool activity, recovery should clear the stale activity or release the lane without requiring the user to wait several minutes and without cascading into Codex app-server client closure.

Actual behavior

Code reading notes

The relevant paths in the installed bundle appear to map to:

  • src/logging/diagnostic-run-activity.ts
  • src/logging/diagnostic-session-attention.ts
  • src/logging/diagnostic-stuck-session-recovery.runtime.ts

In diagnostic-run-activity.ts, activity is keyed by sessionId and sessionKey and can be merged across refs. recordRunCompleted and markDiagnosticEmbeddedRunEnded clear activeTools, but markDiagnosticEmbeddedRunStarted does not clear inherited activeTools / activeModelCalls.

In getDiagnosticSessionActivitySnapshot, any leftover activeTools entry wins over activeEmbeddedRuns:

if (activity.activeTools.size > 0) activeWorkKind = "tool_call";
else if (activity.activeModelCalls.size > 0) activeWorkKind = "model_call";
else if (activity.activeEmbeddedRuns.size > 0) activeWorkKind = "embedded_run";

In diagnostic-session-attention.ts, blocked_tool_call is classified once both activeToolAgeMs and lastProgressAgeMs exceed the warn threshold, but recovery waits for both to exceed stuckSessionAbortMs:

toolAgeMs >= params.stuckSessionAbortMs &&
lastProgressAgeMs >= params.stuckSessionAbortMs

That explains why the session logged recovery=none for several minutes even though the tool age was already many hours old.
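To make the gating concrete, here is a small sketch of the two-condition check applied to the logged values. The 300000 ms threshold is an assumption chosen for illustration (the report does not state the actual value of stuckSessionAbortMs), and the second call assumes lastProgressAge had grown to roughly 368s by the time age reached 370s, which the elided log line does not show.

```typescript
// Simplified stand-in for the recovery gate quoted above.
interface GateParams {
  stuckSessionAbortMs: number;
}

function isRecoveryEligible(
  toolAgeMs: number,
  lastProgressAgeMs: number,
  params: GateParams,
): boolean {
  // Both ages must cross the abort threshold, so the younger one decides.
  return (
    toolAgeMs >= params.stuckSessionAbortMs &&
    lastProgressAgeMs >= params.stuckSessionAbortMs
  );
}

// Hypothetical threshold; the real default is not given in the report.
const gate: GateParams = { stuckSessionAbortMs: 300000 };

// First log line: tool age is 70201s, but progress was seen 128s ago.
console.log(isRecoveryEligible(70201000, 128000, gate)); // false

// Later line: assumed lastProgressAge of about 368s (not shown in the log).
console.log(isRecoveryEligible(70441000, 368000, gate)); // true
```

Under these assumptions the stale tool age never matters on its own: recovery is delayed until lastProgressAgeMs also crosses the threshold.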

Patch direction, not applied locally

I would consider a small fix in diagnostic-run-activity.ts:

  • On markDiagnosticEmbeddedRunStarted, clear inherited activeTools and activeModelCalls before adding the new embedded run work key.
  • Alternatively, track a generation/run owner for active tools and ignore active tools whose startedAt predates the current embedded run start.
  • Add a regression test where an old active tool remains in activity for a session key, a new embedded run starts for the same session key, and the snapshot must not report the old tool as current active work.
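The first option and the regression scenario can be sketched as follows. This is a simplified model, not the actual contents of diagnostic-run-activity.ts: the field names follow the report, but the Map shapes, `createActivity`, `markEmbeddedRunStarted`, and `activeWorkKindOf` are assumptions made for illustration.

```typescript
// Simplified model of the activity tracker (shapes are assumed).
interface RunActivity {
  activeTools: Map<string, number>; // toolCallId -> startedAt (ms)
  activeModelCalls: Map<string, number>; // callId -> startedAt (ms)
  activeEmbeddedRuns: Map<string, number>; // runId -> startedAt (ms)
}

type ActiveWorkKind = "tool_call" | "model_call" | "embedded_run" | undefined;

function createActivity(): RunActivity {
  return {
    activeTools: new Map(),
    activeModelCalls: new Map(),
    activeEmbeddedRuns: new Map(),
  };
}

// Proposed behavior: a new embedded run drops tool and model-call
// entries left behind by earlier runs before registering itself.
function markEmbeddedRunStarted(activity: RunActivity, runId: string, nowMs: number): void {
  activity.activeTools.clear();
  activity.activeModelCalls.clear();
  activity.activeEmbeddedRuns.set(runId, nowMs);
}

// Same priority order as the snapshot snippet quoted in the report.
function activeWorkKindOf(activity: RunActivity): ActiveWorkKind {
  if (activity.activeTools.size > 0) return "tool_call";
  if (activity.activeModelCalls.size > 0) return "model_call";
  if (activity.activeEmbeddedRuns.size > 0) return "embedded_run";
  return undefined;
}

// Regression scenario: an old tool entry survives, then a new run starts.
const tracked = createActivity();
tracked.activeTools.set("exec-old", 0);
markEmbeddedRunStarted(tracked, "run-2", 70000000);
console.log(activeWorkKindOf(tracked)); // "embedded_run"
```

Clearing on start is the smaller change, but it would also discard entries from any run that legitimately overlaps on the same session key. If overlapping runs are possible, the generation/owner variant is likely the safer choice.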

A second defensive improvement could live in isBlockedToolCallRecoveryEligible: if the active tool age is much greater than the current session processing age or predates the active embedded run, classify it as orphaned diagnostic activity and recover earlier. The cleaner fix is probably ownership/generation in the activity tracker.
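A rough sketch of that defensive check is shown below. Everything here is hypothetical: `OrphanCheckInput`, `isOrphanedToolActivity`, and the factor of 10 used for "much greater" are illustrative choices, not part of isBlockedToolCallRecoveryEligible as it exists today.

```typescript
// Inputs the eligibility check would need; all names are assumed.
interface OrphanCheckInput {
  toolStartedAtMs: number;
  embeddedRunStartedAtMs: number | undefined; // undefined if no run is active
  sessionProcessingAgeMs: number;
  nowMs: number;
}

// Hypothetical multiplier for "tool age is much greater than session age".
const ORPHAN_AGE_FACTOR = 10;

function isOrphanedToolActivity(input: OrphanCheckInput): boolean {
  // A tool that started before the active embedded run cannot belong to it.
  if (
    input.embeddedRunStartedAtMs !== undefined &&
    input.toolStartedAtMs < input.embeddedRunStartedAtMs
  ) {
    return true;
  }
  // Fallback when no run start is known: compare ages directly.
  const toolAgeMs = input.nowMs - input.toolStartedAtMs;
  return toolAgeMs > input.sessionProcessingAgeMs * ORPHAN_AGE_FACTOR;
}

// Values shaped like the incident: tool is 70201s old, session is 130s old.
const incident: OrphanCheckInput = {
  toolStartedAtMs: 0,
  embeddedRunStartedAtMs: 70071000,
  sessionProcessingAgeMs: 130000,
  nowMs: 70201000,
};
console.log(isOrphanedToolActivity(incident)); // true

// A tool started 70s ago inside the current run is not flagged.
const healthy: OrphanCheckInput = { ...incident, toolStartedAtMs: 70131000 };
console.log(isOrphanedToolActivity(healthy)); // false
```

A caller could treat a true result as grounds to clear the stale entry and recover without waiting for the full abort threshold, though the exact recovery action is left open here.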

Why I think this is not config

  • ~/.openclaw/openclaw.json mtime: 2026-05-28 12:59:45 +0200
  • ~/.openclaw/openclaw.json.last-good mtime: 2026-05-28 13:00:06 +0200
  • No Gateway config reload event around the direct-session stall window.

OpenClaw version

2026.5.26

Operating system

macOS 26.4

Install method

npm global

Model

OpenAI Codex / gpt-5.5

Provider / routing chain

openclaw -> gateway -> Telegram

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response
