openclaw: WhatsApp lane remains idle/embedded_run after abort_embedded_run recovery

Summary

After stuck-session recovery aborts a WhatsApp channel lane embedded_run, the session can transition to idle while still retaining stale embedded-run state and queue depth. The result is a loop where the lane appears idle/healthy at a high level, but diagnostics continue to report idle/embedded_run,q=1 and then re-emit session.stalled for active_work_without_progress.

This is related to #85251, but this issue is narrower: it is not just that an embedded run wedges before recovery; it is that recovery can complete with residual active-work/reply-operation state still attached to an idle channel session.

Environment

  • OpenClaw: 2026.5.28 (e932160)
  • OS: Debian 12 / Linux systemd user gateway
  • Channel: WhatsApp direct session
  • Runtime/profile: Codex embedded harness with tools.profile: "coding"
  • Local diagnostic mitigation in use:
"diagnostics": {
  "stuckSessionWarnMs": 60000,
  "stuckSessionAbortMs": 120000
}
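
To make the two thresholds concrete, here is a minimal TypeScript sketch that classifies a no-progress age against them. The function name `classifyStall`, the `StallAction` labels, and the `>=` comparison are assumptions made for illustration, not OpenClaw's actual implementation. With the values above, the three `session.stalled` ages from the log in this report come out as warn, warn, abort, which matches recovery being requested at 139014 ms, the first sample past 120000 ms.

```typescript
// Hypothetical sketch: how stuckSessionWarnMs / stuckSessionAbortMs could map
// onto the ages reported in this issue. Names and semantics are assumptions.
interface StuckSessionThresholds {
  stuckSessionWarnMs: number;
  stuckSessionAbortMs: number;
}

type StallAction = "none" | "warn" | "abort";

function classifyStall(ageMs: number, t: StuckSessionThresholds): StallAction {
  // Check the abort threshold first because it is the larger of the two.
  if (ageMs >= t.stuckSessionAbortMs) return "abort";
  if (ageMs >= t.stuckSessionWarnMs) return "warn";
  return "none";
}

const thresholds: StuckSessionThresholds = {
  stuckSessionWarnMs: 60000,
  stuckSessionAbortMs: 120000,
};

// Ages copied from the session.stalled samples in this report.
const observedAges: number[] = [79014, 109014, 139014];
const actions: StallAction[] = observedAges.map((age) => classifyStall(age, thresholds));
console.log(actions);
```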

WhatsApp transport itself was verified separately as healthy: both a direct WhatsApp send and a full agent --deliver --reply-channel whatsapp --reply-account default --reply-to ... run succeeded via the gateway once the local CLI operator.write scope had been repaired. The remaining failure is in the session lifecycle state after embedded-run recovery.

User-visible symptom

  • Inbound WhatsApp/app messages are accepted and queued.
  • Stuck-session recovery eventually aborts a stale embedded run instead of requiring a manual restart.
  • However, the channel can remain in, or immediately return to, an idle/embedded_run state with queued work, causing another stall window and delayed or lost replies.
  • From an operator perspective, the gateway and WhatsApp connection look healthy while the direct session lane is not actually clean.

Observed diagnostic sequence

Sanitized stability log excerpt:

session.stalled
  outcome=processing
  reason=active_work_without_progress
  ageMs=79014
  queueDepth=2
  activeWorkKind=embedded_run

session.stalled
  outcome=processing
  ageMs=109014
  queueDepth=2
  activeWorkKind=embedded_run

session.stalled
  outcome=processing
  ageMs=139014
  queueDepth=2
  activeWorkKind=embedded_run

session.recovery.requested
  action=abort
  ageMs=139014
  queueDepth=2
  activeWorkKind=embedded_run

session.state
  outcome=idle
  reason=stuck_recovery:aborted
  queueDepth=1

session.recovery.completed
  outcome=aborted
  action=abort_embedded_run
  ageMs=139014
  queueDepth=2
  activeWorkKind=embedded_run

A later sample showed the residual/inconsistent state:

diagnostic.liveness.warning
  source=agent:main:whatsapp:direct:<redacted>(idle/embedded_run,q=1,age=60s last=embedded_run:started)

session.stalled
  outcome=idle
  reason=active_work_without_progress
  ageMs=89994
  queueDepth=1
  activeWorkKind=embedded_run

The key inconsistency is outcome=idle plus activeWorkKind=embedded_run / last=embedded_run:started / queueDepth=1 after abort_embedded_run already completed.
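
The inconsistency can be expressed as a small check over one diagnostic sample. The sketch below is illustrative only: the `SessionSample` type and `findIdleViolations` function are hypothetical, and the field names simply mirror the log excerpt. Treating `queueDepth > 0` on an idle session as a violation is an assumption based on the expected behavior described in this report.

```typescript
// Hypothetical shape of one diagnostic sample; field names mirror the log
// excerpt, but the type itself is an assumption.
interface SessionSample {
  outcome: "idle" | "processing";
  queueDepth: number;
  activeWorkKind: string | null;
}

// Returns human-readable violations; an empty list means the sample is clean.
function findIdleViolations(sample: SessionSample): string[] {
  const violations: string[] = [];
  // A processing session is allowed to carry active work and queued items.
  if (sample.outcome !== "idle") return violations;
  if (sample.activeWorkKind !== null) {
    violations.push(`idle session still reports activeWorkKind=${sample.activeWorkKind}`);
  }
  if (sample.queueDepth > 0) {
    violations.push(`idle session still reports queueDepth=${sample.queueDepth}`);
  }
  return violations;
}

// Values copied from the samples in this report.
const beforeRecovery: SessionSample = {
  outcome: "processing",
  queueDepth: 2,
  activeWorkKind: "embedded_run",
};
const afterRecovery: SessionSample = {
  outcome: "idle",
  queueDepth: 1,
  activeWorkKind: "embedded_run",
};

console.log(findIdleViolations(beforeRecovery));
console.log(findIdleViolations(afterRecovery));
```

A check of this kind could back the regression test proposed in this report, asserting that no post-recovery sample produces a violation.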

Expected behavior

After abort_embedded_run completes, the channel session should not be classifiable as idle/embedded_run and should not immediately emit another active_work_without_progress stall unless a new run has actually started.

More concretely:

  • The active embedded-run marker should be cleared before or atomically with transition to idle.
  • Any reply-operation/session lock associated with the aborted run should be released.
  • If queueDepth > 0 remains, the next queued item should either be explicitly resumed/drained or safely dropped/failed in a way that cannot leave stale active-work state attached to an idle session.

Suggested fix shape

  1. In the abort_embedded_run recovery path, clear the active embedded-run handle/marker and reply-operation registry entry atomically before emitting session.state outcome=idle.
  2. Reconcile queue state after abort: if the aborted item is still counted in queue depth, remove or terminally mark it; if another item remains, start it through the normal run-start path instead of leaving the session as idle with active work.
  3. Add a regression test for a channel lane where recovery completes and post-recovery diagnostics must not report idle/embedded_run.
  4. Include a reused channel session key case such as WhatsApp/Slack, not only direct openclaw agent, because direct invocations are much less likely to reproduce this stale lane state.
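
The first two steps can be sketched against a toy in-memory lane model. Everything here is hypothetical (`Lane`, `startRun`, `abortEmbeddedRun`, `isLaneClean`, and the assumption that the in-flight item is still counted in the queue); it shows the ordering and the invariant, not OpenClaw's internals.

```typescript
// Toy model of a channel lane. All names are assumptions for illustration.
interface QueueItem {
  id: string;
}

interface Lane {
  outcome: "idle" | "processing";
  activeRunId: string | null;    // stands in for the embedded-run handle/marker
  replyOperations: Set<string>;  // stands in for the reply-operation registry
  queue: QueueItem[];            // assumed to still count the in-flight item
}

// Normal run-start path: sets marker, registry entry, and state together.
function startRun(lane: Lane, item: QueueItem): void {
  lane.activeRunId = item.id;
  lane.replyOperations.add(item.id);
  lane.outcome = "processing";
}

function abortEmbeddedRun(lane: Lane): void {
  const abortedId = lane.activeRunId;
  if (abortedId === null) return;

  // Step 1: clear marker and registry entry before any transition to idle.
  lane.activeRunId = null;
  lane.replyOperations.delete(abortedId);

  // Step 2: reconcile the queue so the aborted item is no longer counted.
  lane.queue = lane.queue.filter((item) => item.id !== abortedId);

  const next = lane.queue[0];
  if (next !== undefined) {
    // Another item remains: start it through the normal path, never idle.
    startRun(lane, next);
  } else {
    lane.outcome = "idle";
  }
}

// Invariant from this report: an idle lane carries no active-work state.
function isLaneClean(lane: Lane): boolean {
  if (lane.outcome !== "idle") return lane.activeRunId !== null;
  return lane.activeRunId === null && lane.replyOperations.size === 0 && lane.queue.length === 0;
}

// Mirrors the reported shape: one in-flight run plus one queued message.
const lane: Lane = {
  outcome: "processing",
  activeRunId: "run-1",
  replyOperations: new Set(["run-1"]),
  queue: [{ id: "run-1" }, { id: "run-2" }],
};
abortEmbeddedRun(lane);
console.log(lane.outcome, lane.activeRunId, lane.queue.length, isLaneClean(lane));
```

In this model the lane never passes through idle while work remains queued, so a diagnostic sampler cannot observe idle/embedded_run. A real implementation would also need to cover concurrent message arrival on a reused channel session key, which this sketch does not attempt.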

Related

  • #85251 tracks the broader embedded-run handoff/stall family.
  • This issue isolates the post-recovery cleanup invariant: stuck_recovery:aborted must leave the lane clean, not idle/embedded_run.
