
openclaw - 💡(How to fix) Fix WhatsApp lane remains idle/embedded_run after abort_embedded_run recovery

StepCodex · 2026-05-31T14:22:18Z



After stuck-session recovery aborts a WhatsApp channel lane embedded_run, the session can transition to idle while still retaining stale embedded-run state and queue depth. The result is a loop where the lane appears idle/healthy at a high level, but diagnostics continue to report idle/embedded_run,q=1 and then re-emit session.stalled for active_work_without_progress.

This is related to #85251, but this issue is narrower: it is not just that an embedded run wedges before recovery; it is that recovery can complete with residual active-work/reply-operation state still attached to an idle channel session.





Environment

- OpenClaw: 2026.5.28 (e932160)
- OS: Debian 12 / Linux systemd user gateway
- Channel: WhatsApp direct session
- Runtime/profile: Codex embedded harness with tools.profile: "coding"
- Local diagnostic mitigation in use:

"diagnostics": {
  "stuckSessionWarnMs": 60000,
  "stuckSessionAbortMs": 120000
}

WhatsApp transport itself was verified separately as healthy: direct WhatsApp send and full agent --deliver --reply-channel whatsapp --reply-account default --reply-to ... succeeded via the gateway after repairing local CLI operator.write scope. The remaining failure is the session lifecycle state after embedded-run recovery.
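To make the relationship between the two diagnostic thresholds and the ages reported in the diagnostic sequence easier to follow, here is a minimal TypeScript sketch. Only the keys `stuckSessionWarnMs` and `stuckSessionAbortMs` come from the config fragment above. The function `classifyStall`, its types, and the assumed "warn first, abort later" semantics are hypothetical and are not taken from OpenClaw's source.

```typescript
// Hypothetical model: only the two key names come from the issue's config.
interface StuckSessionConfig {
  stuckSessionWarnMs: number;
  stuckSessionAbortMs: number;
}

type StallAction = "none" | "warn" | "abort";

// Assumed semantics: warn once the age reaches the warn threshold,
// request an abort once it reaches the abort threshold.
function classifyStall(ageMs: number, cfg: StuckSessionConfig): StallAction {
  if (ageMs >= cfg.stuckSessionAbortMs) return "abort";
  if (ageMs >= cfg.stuckSessionWarnMs) return "warn";
  return "none";
}

const cfg: StuckSessionConfig = {
  stuckSessionWarnMs: 60000,
  stuckSessionAbortMs: 120000,
};

// Ages copied from the session.stalled events in the log excerpt.
const observedAges = [79014, 109014, 139014];
const actions = observedAges.map((ageMs) => classifyStall(ageMs, cfg));
console.log(actions);
```

Under these assumptions the first two samples fall in the warn window and the third crosses the abort threshold, which matches the point where `session.recovery.requested action=abort` appears in the log.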

User-visible symptom

- Inbound WhatsApp/app messages are accepted and queued.
- Stuck-session recovery eventually aborts a stale embedded run instead of requiring a manual restart.
- However, the channel can immediately remain or return to an idle/embedded_run state with queued work, causing another stall window and delayed/lost replies.
- From an operator perspective, the gateway and WhatsApp connection look healthy while the direct session lane is not actually clean.

Observed diagnostic sequence

Sanitized stability log excerpt:

session.stalled
  outcome=processing
  reason=active_work_without_progress
  ageMs=79014
  queueDepth=2
  activeWorkKind=embedded_run

session.stalled
  outcome=processing
  ageMs=109014
  queueDepth=2
  activeWorkKind=embedded_run

session.stalled
  outcome=processing
  ageMs=139014
  queueDepth=2
  activeWorkKind=embedded_run

session.recovery.requested
  action=abort
  ageMs=139014
  queueDepth=2
  activeWorkKind=embedded_run

session.state
  outcome=idle
  reason=stuck_recovery:aborted
  queueDepth=1

session.recovery.completed
  outcome=aborted
  action=abort_embedded_run
  ageMs=139014
  queueDepth=2
  activeWorkKind=embedded_run

A later sample showed the residual/inconsistent state:

diagnostic.liveness.warning
  source=agent:main:whatsapp:direct:<redacted>(idle/embedded_run,q=1,age=60s last=embedded_run:started)

session.stalled
  outcome=idle
  reason=active_work_without_progress
  ageMs=89994
  queueDepth=1
  activeWorkKind=embedded_run

The key inconsistency is outcome=idle plus activeWorkKind=embedded_run / last=embedded_run:started / queueDepth=1 after abort_embedded_run already completed.
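The inconsistency described above can be expressed as a small check over a diagnostic sample. The following TypeScript sketch is illustrative only: the field names mirror the log keys in the excerpt, while the `SessionSample` type and the `findStaleIdleState` function are hypothetical. It also assumes that any queue depth on an idle session is suspicious, whereas a real check may need to tolerate a brief window before the next item starts.

```typescript
// Field names mirror the log keys; the type itself is a hypothetical model.
interface SessionSample {
  outcome: "idle" | "processing";
  queueDepth: number;
  activeWorkKind?: string; // e.g. "embedded_run"; absent when no work is attached
}

// Returns human-readable violations; an empty array means the sample is consistent.
function findStaleIdleState(sample: SessionSample): string[] {
  const problems: string[] = [];
  if (sample.outcome !== "idle") return problems; // only idle samples are checked
  if (sample.activeWorkKind !== undefined) {
    problems.push(`idle session still reports activeWorkKind=${sample.activeWorkKind}`);
  }
  if (sample.queueDepth > 0) {
    problems.push(`idle session still reports queueDepth=${sample.queueDepth}`);
  }
  return problems;
}

// Values taken from the later session.stalled sample in the log excerpt.
const staleSample: SessionSample = {
  outcome: "idle",
  queueDepth: 1,
  activeWorkKind: "embedded_run",
};

// Values taken from the first session.stalled sample, before recovery.
const processingSample: SessionSample = {
  outcome: "processing",
  queueDepth: 2,
  activeWorkKind: "embedded_run",
};

console.log(findStaleIdleState(staleSample));
console.log(findStaleIdleState(processingSample));
```

The post-recovery sample yields two violations, while the pre-recovery sample yields none because active work on a processing session is expected.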

Expected behavior

After abort_embedded_run completes, the channel session should not be classifiable as idle/embedded_run and should not immediately emit another active_work_without_progress stall unless a new run has actually started.

More concretely:

- The active embedded-run marker should be cleared before or atomically with transition to idle.
- Any reply-operation/session lock associated with the aborted run should be released.
- If queueDepth > 0 remains, the next queued item should either be explicitly resumed/drained or safely dropped/failed in a way that cannot leave stale active-work state attached to an idle session.
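One way to picture the first two expectations is to model the lane state so that an idle lane has no field in which an active run could survive. The TypeScript sketch below is a hypothetical illustration: `LaneState`, `replyLocks`, and `finishAbort` are invented names and do not describe OpenClaw internals.

```typescript
// All names here are hypothetical illustrations, not OpenClaw internals.
type LaneState =
  | { kind: "idle"; queue: string[] }
  | {
      kind: "processing";
      queue: string[];
      activeRun: { id: string; kind: "embedded_run" };
    };

// Stand-in for a reply-operation/session lock registry, keyed by run id.
const replyLocks = new Set<string>();

function finishAbort(state: LaneState): LaneState {
  if (state.kind !== "processing") return state; // nothing to abort
  // Release the lock tied to the aborted run.
  replyLocks.delete(state.activeRun.id);
  // Build the idle state from scratch so no activeRun field can carry over.
  return { kind: "idle", queue: [...state.queue] };
}

replyLocks.add("run-1");
const before: LaneState = {
  kind: "processing",
  queue: ["msg-2"],
  activeRun: { id: "run-1", kind: "embedded_run" },
};
const after = finishAbort(before);
console.log(after.kind, "activeRun" in after, replyLocks.has("run-1"));
```

Because the `idle` variant has no `activeRun` property, a state equivalent to idle/embedded_run cannot be constructed through this function, and the lock release happens in the same synchronous step as the transition.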

Suggested fix shape

1. In the abort_embedded_run recovery path, clear the active embedded-run handle/marker and reply-operation registry entry atomically before emitting session.state outcome=idle.
2. Reconcile queue state after abort: if the aborted item is still counted in queue depth, remove or terminally mark it; if another item remains, start it through the normal run-start path instead of leaving the session as idle with active work.
3. Add a regression test for a channel lane where recovery completes and post-recovery diagnostics must not report idle/embedded_run.
4. Include a reused channel session key case such as WhatsApp/Slack, not only direct openclaw agent, because direct invocations are much less likely to reproduce this stale lane state.
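The steps above can be sketched against a small in-memory lane model. This is a hypothetical TypeScript illustration, not OpenClaw's implementation: `Lane`, `QueuedItem`, `startNext`, and `abortEmbeddedRun` are invented names, and the sketch assumes that the running item is counted in queue depth, which is consistent with the log excerpt but not confirmed by it.

```typescript
// Hypothetical in-memory model of a channel lane; none of these names come from OpenClaw.
interface QueuedItem {
  id: string;
  status: "queued" | "running";
}

interface Lane {
  state: "idle" | "processing";
  activeRunId: string | null;
  replyOperations: Map<string, string>; // run id -> reply operation id
  queue: QueuedItem[]; // assumption: the running item is counted in depth
  events: string[];
}

// Stand-in for the normal run-start path.
function startNext(lane: Lane): void {
  const next = lane.queue.find((item) => item.status === "queued");
  if (!next) return;
  next.status = "running";
  lane.activeRunId = next.id;
  lane.state = "processing";
  lane.events.push(`session.state outcome=processing queueDepth=${lane.queue.length}`);
}

function abortEmbeddedRun(lane: Lane): void {
  const runId = lane.activeRunId;
  if (runId === null) return;
  // Step 1: clear the marker and reply-operation entry before announcing idle.
  lane.activeRunId = null;
  lane.replyOperations.delete(runId);
  // Step 2: drop the aborted item so it is no longer counted in queue depth.
  lane.queue = lane.queue.filter((item) => item.id !== runId);
  lane.state = "idle";
  lane.events.push(
    `session.state outcome=idle reason=stuck_recovery:aborted queueDepth=${lane.queue.length}`,
  );
  // Remaining work goes through the normal start path, not a leftover marker.
  startNext(lane);
}

// Regression-style scenario: one stuck run plus one queued message.
const lane: Lane = {
  state: "processing",
  activeRunId: "m1",
  replyOperations: new Map([["m1", "reply-op-1"]]),
  queue: [
    { id: "m1", status: "running" },
    { id: "m2", status: "queued" },
  ],
  events: [],
};
abortEmbeddedRun(lane);
console.log(lane.state, lane.activeRunId, lane.queue.length);
```

In this single-threaded model, "atomic" simply means that the marker, the registry entry, and the queue are all updated in one synchronous step before the idle event is recorded. A real regression test would additionally assert that no diagnostic sample taken after recovery pairs an idle outcome with an active work kind.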

Related

- #85251 tracks the broader embedded-run handoff/stall family.
- This issue isolates the post-recovery cleanup invariant: stuck_recovery:aborted must leave the lane clean, not idle/embedded_run.


