openclaw - 💡(How to fix) Fix Stale diagnostic tool_call activity can survive recovery/reset and re-block sessions as blocked_tool_call [1 pull requests]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

OpenClaw diagnostics can retain stale native tool-call activity for a session after stuck-session recovery, session reset, or session replacement. Future work on the same sessionKey can then be classified as blocked_tool_call even when the original tool process is no longer running and the affected session transcript has already been reset/archived.

This is narrower than the existing stuck-session umbrella issues: the suspected defect is not just that a session stalls, but that the diagnostic activity tracker can keep an orphaned activeWorkKind=tool_call / activeTool=bash entry long enough to poison later recovery decisions for the same session key.

Error Message

A local gateway repeatedly emitted stalled-session diagnostics with this shape, redacted:

Root Cause

OpenClaw diagnostics can retain stale native tool-call activity for a session after stuck-session recovery, session reset, or session replacement. Future work on the same sessionKey can then be classified as blocked_tool_call even when the original tool process is no longer running and the affected session transcript has already been reset/archived.

This is narrower than the existing stuck-session umbrella issues: the suspected defect is not just that a session stalls, but that the diagnostic activity tracker can keep an orphaned activeWorkKind=tool_call / activeTool=bash entry long enough to poison later recovery decisions for the same session key.

Fix Action

Fixed

Code Example

stalled session: sessionId=<redacted> sessionKey=<redacted>
  state=processing
  reason=blocked_tool_call
  classification=blocked_tool_call
  activeWorkKind=tool_call
  activeTool=bash
  activeToolAgeMs=<very large>
  lastProgress=<old tool start>
  recovery=checking

---

stuck session recovery outcome: status=aborted action=abort_embedded_run
  activeWorkKind=embedded_run
  aborted=true drained=true forceCleared=false released=0
RAW_BUFFERClick to expand / collapse

Summary

OpenClaw diagnostics can retain stale native tool-call activity for a session after stuck-session recovery, session reset, or session replacement. Future work on the same sessionKey can then be classified as blocked_tool_call even when the original tool process is no longer running and the affected session transcript has already been reset/archived.

This is narrower than the existing stuck-session umbrella issues: the suspected defect is not just that a session stalls, but that the diagnostic activity tracker can keep an orphaned activeWorkKind=tool_call / activeTool=bash entry long enough to poison later recovery decisions for the same session key.

Environment

  • OpenClaw: 2026.5.20 (e510042)
  • OS: Linux 7.0.10-arch1-1 x86_64
  • Node: v24.11.1
  • Gateway: systemd user service, loopback gateway
  • Runtime observed: embedded Codex app-server / native tool execution

Observed behavior

A local gateway repeatedly emitted stalled-session diagnostics with this shape, redacted:

stalled session: sessionId=<redacted> sessionKey=<redacted>
  state=processing
  reason=blocked_tool_call
  classification=blocked_tool_call
  activeWorkKind=tool_call
  activeTool=bash
  activeToolAgeMs=<very large>
  lastProgress=<old tool start>
  recovery=checking

The active tool ages were on the order of many hours. The two stale tool ids found in local state resolved only to reset/trajectory archives, not live active sessions. One stale tool was a long foreground bash command that had started a dev server; another was an rg command. At investigation time, there was no matching live process for the dev-server case, and the current gateway stability view showed no active work after recovery.

Recovery later emitted an abort/drain outcome for an embedded run, for example:

stuck session recovery outcome: status=aborted action=abort_embedded_run
  activeWorkKind=embedded_run
  aborted=true drained=true forceCleared=false released=0

The concerning part is that stale native tool activity and embedded-run/reply-run recovery appear to be tracked by separate state paths. If the native tool never emits the matching completion event, or if recovery/reset replaces the session before the completion event is reconciled, the diagnostic tracker can continue to report tool_call as active and drive blocked_tool_call classification for future turns.

Expected behavior

Stuck-session recovery, session reset, and session replacement should clear or reconcile diagnostic active-work state for the affected sessionId/sessionKey, including native tool calls.

After a recovery aborts/drains an embedded run or a session is reset/replaced:

  1. stale activeTools entries for the old session/run should not continue to classify later turns as blocked_tool_call;
  2. blocked_tool_call should mean there is still an owned active native tool, not just an orphaned diagnostic record;
  3. if the original tool cannot be cancelled or observed, the diagnostic state should be explicitly evicted/quarantined with a structured recovery event.

Actual behavior

A stale activeTool=bash record can remain associated with a session key for many hours, repeatedly producing session.stalled with classification=blocked_tool_call. Recovery can abort/drain the embedded run but does not clearly guarantee that stale native tool activity for the session key is cleared.

Source-level suspect

Relevant current source paths:

  • src/logging/diagnostic-run-activity.ts
    • recordToolStarted adds native tools to activeTools.
    • recordToolEnded removes them.
    • recordRunCompleted clears active tools/model calls/embedded runs.
    • markDiagnosticEmbeddedRunEnded can clear run activity, but callers can opt out.
  • src/auto-reply/reply/reply-run-registry.ts
    • markReplyRunDiagnosticWorkEnded calls markDiagnosticEmbeddedRunEnded(..., clearRunActivity: false).
  • src/logging/diagnostic-session-attention.ts
    • stale tool_call activity is classified as blocked_tool_call.
  • src/logging/diagnostic.ts
    • isBlockedToolCallRecoveryEligible allows recovery once the blocked tool call crosses the abort threshold.
  • src/logging/diagnostic-stuck-session-recovery.runtime.ts
    • recovery can abort active embedded work, but the cleanup contract for orphaned native tool activity is not obvious from the observed outcome.

The missing primitive may be something like a targeted diagnostic cleanup/reconciliation path, for example clearDiagnosticSessionActivity({ sessionId, sessionKey, reason }), called when recovery aborts/drains a run, when a session is reset/replaced, and when a diagnostic tool-call owner is no longer present.

Suggested fix shape

  1. Add a diagnostic activity cleanup primitive that removes or quarantines all active tool/model/embedded-run state for a given sessionId and/or sessionKey.
  2. Call it from stuck-session recovery when reason=blocked_tool_call recovery aborts/drains a run or determines the owner is gone.
  3. Call it from session reset/replacement paths after the old session is archived or superseded.
  4. Add regression coverage for:
    • native tool_call starts and never emits completion;
    • session is reset/replaced or embedded recovery aborts/drains;
    • later work on the same sessionKey is not classified as blocked_tool_call from the old tool;
    • genuine active native tools still classify as blocked until completion/abort.
  5. Emit a structured diagnostic event when stale tool activity is evicted, so operators can distinguish a real running tool from recovered stale state.

Related issues

This is a narrow follow-up to prior stuck-session/recovery work rather than a duplicate:

  • #71127 - canonical stuck-session recovery issue, now closed after base recovery landed.
  • #77115 - ghost session / stale activity family; comments discuss stale model_call and tool_call recovery gaps.
  • #76038 - stuck-session recovery skip/no-op cases and active-work recovery policy.
  • #84903 - stalled session isolation failure and stale diagnostic activity affecting gateway health.
  • #76562 - broad gateway pressure issue; includes reports of blocked_tool_call / hung tool calls contributing to gateway instability.
  • #85251 and #85532 - related embedded-run abort/recovery behavior, but focused more on app-server/embedded progress semantics than orphaned native tool-call state.

Impact

A single orphaned native tool diagnostic record can keep a session lane looking blocked long after the original command is gone. The user-visible effect is delayed or lost replies, repeated stalled-session logs, and confusing recovery outcomes that make the gateway appear to still be blocked by bash when there is no corresponding live tool process.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

Stuck-session recovery, session reset, and session replacement should clear or reconcile diagnostic active-work state for the affected sessionId/sessionKey, including native tool calls.

After a recovery aborts/drains an embedded run or a session is reset/replaced:

  1. stale activeTools entries for the old session/run should not continue to classify later turns as blocked_tool_call;
  2. blocked_tool_call should mean there is still an owned active native tool, not just an orphaned diagnostic record;
  3. if the original tool cannot be cancelled or observed, the diagnostic state should be explicitly evicted/quarantined with a structured recovery event.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix Stale diagnostic tool_call activity can survive recovery/reset and re-block sessions as blocked_tool_call [1 pull requests]