openclaw - 💡(How to fix) Fix [Bug]: subagent-orphan-recovery resurrects a wedged subagent that immediately re-wedges the gateway, creating a restart-loop death spiral [2 comments, 3 participants]

openclaw · 2026-04-30 05:02:59

openclaw/openclaw#74864 • Fetched 2026-05-01 05:40:35

Timeline: commented ×2, cross-referenced ×2, closed ×1, mentioned ×1

On gateway restart, `subagent-orphan-recovery` resumes orphaned subagent sessions. If one of those sessions was wedged at the time the previous gateway died (e.g. event-loop-saturating prompt that never completes), the recovery resurrects it into the fresh gateway — which immediately re-wedges under the same load. Manual restart (`launchctl kickstart -k`) re-resurrects the same wedge. Net: gateway goes into a death spiral where each restart cycle survives only seconds before re-dying, requiring manual surgery on `sessions.json` to break the cycle.

This is distinct from #69778 (resurrection of old tasks executing stale work) — my failure mode is resurrection of a wedged task immediately killing the gateway, regardless of task age. It's also distinct from #44687 (closed; session-resume blocking lane=main): that issue is about the session-resume subsystem; this is about the separate `subagent-orphan-recovery` subsystem.

Why this matters

Personal-assistant deployments restart their gateway frequently (config edits, updates, manual debugging). Today, any wedged subagent becomes a permanent restart-loop trigger that can only be broken by manual `jq` surgery on `sessions.json`. For a non-technical user, that is unrecoverable without external help. The bug also compounds with #74818 (plugin-runtime-deps races): a flaky restart can both miss plugin loads AND trigger the resurrection cycle, multiplying outage time.

Environment

OpenClaw: 2026.4.27 (cbc2ba0)
Node: 25.5.0 (Homebrew)
Host: macOS, LaunchAgent gateway under `gui/$UID/ai.openclaw.gateway`
Gateway is the principal Telegram + BlueBubbles bot host

Symptom timeline (real incident)

A subagent session `agent:dev-manager:subagent:08202797-0b3b-4f37-a818-c04a5b570585` was working on a non-trivial source-code patching task (CP-114 internal). It went into `state=processing queueDepth=0` for 8+ minutes, a classic deadlock signature. `stuck session` warnings fired every 30s but there's no automatic remediation:

```
[diagnostic] stuck session: sessionId=dev-manager sessionKey=agent:dev-manager:subagent:08202797-... state=processing age=499s queueDepth=0
```

I cancelled the corresponding `task_runs` via `openclaw tasks cancel` (which the CLI confirmed worked), but the session itself stayed wedged in the gateway's in-memory state. `stuck session` warnings continued.

I then restarted the gateway via `launchctl kickstart -k`. On the first fresh startup:

```
[gateway] ready
[telegram] [default] starting provider
[telegram] [forge] starting provider
[telegram] [talos] starting provider
[subagent-orphan-recovery] found orphaned subagent session: agent:dev-manager:subagent:08202797-...
[subagent-orphan-recovery] resumed orphaned session: agent:dev-manager:subagent:08202797-...
[subagent-orphan-recovery] orphan recovery complete: recovered=1 failed=0 skipped=7
```

39 seconds later, orphan-recovery found the same session orphaned again (the resurrected run had immediately wedged):

```
[subagent-orphan-recovery] found orphaned subagent session: agent:dev-manager:subagent:08202797-... (run=914dbad1-...)
[telegram] fetch fallback: enabling sticky IPv4-only dispatcher (codes=none, reason=request-timeout)
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization eventLoopDelayP99Ms=3563.1 eventLoopDelayMaxMs=34829.5 eventLoopUtilization=0.985
```

That is a 34.8-second event-loop block on a fresh gateway, caused entirely by resuming the resurrected wedge. The gateway died shortly after. Each subsequent kickstart re-resurrected the same session, producing a restart loop.
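The loop shows up in the gateway log as repeated `resumed orphaned session` lines for the same session key. As a rough illustration, the sketch below counts those lines per key so the repeatedly resurrected session stands out. The regular expression follows the log format quoted in this report; the function names and the `threshold` default are hypothetical and not part of OpenClaw. The quoted excerpts carry no timestamps, so no time window is applied.

```python
import re
from collections import Counter

# Matches the "resumed orphaned session" line format quoted in this report.
RESUMED = re.compile(r"\[subagent-orphan-recovery\] resumed orphaned session: (\S+)")


def count_resumptions(log_lines):
    """Return a Counter mapping session key -> number of times it was resumed."""
    counts = Counter()
    for line in log_lines:
        match = RESUMED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def loop_suspects(log_lines, threshold=2):
    """Session keys resumed at least `threshold` times (assumed default)."""
    counts = count_resumptions(log_lines)
    return sorted(key for key, n in counts.items() if n >= threshold)
```

Feeding it the concatenated logs from several restarts should surface the session that needs removing, without reading the logs by hand.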

What broke the cycle

```bash
# 1. Stop the gateway (or just bootout)
launchctl bootout gui/$UID/ai.openclaw.gateway

# 2. Surgically remove the cursed session from the store BEFORE bootstrap
jq 'del(."agent:dev-manager:subagent:08202797-0b3b-4f37-a818-c04a5b570585")' \
  ~/.openclaw/agents/dev-manager/sessions/sessions.json > /tmp/s.json
mv /tmp/s.json ~/.openclaw/agents/dev-manager/sessions/sessions.json

# 3. Bootstrap clean: orphan-recovery now has nothing to resurrect for that key
launchctl bootstrap gui/$UID ~/Library/LaunchAgents/ai.openclaw.gateway.plist
```

After the surgical removal, the gateway came up clean and stayed up. The two `task_runs` that pointed to the deleted session auto-transitioned to `status=lost` — that part of the system handled the missing-backing-session case gracefully.
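For anyone repeating the removal step, a slightly more defensive variant may be worth having: it keeps a backup and replaces the file atomically. The sketch below assumes, as the `jq` filter above implies, that `sessions.json` is a single JSON object keyed by session key. The function name `remove_session` and the `.bak` suffix are illustrative choices, not an OpenClaw interface. As with the `jq` version, run it only while the gateway is stopped.

```python
import json
import os
import shutil
import tempfile


def remove_session(store_path, session_key):
    """Delete one key from a sessions.json-style object.

    Returns True if the key was present and removed, False otherwise.
    """
    with open(store_path, encoding="utf-8") as fh:
        sessions = json.load(fh)
    if session_key not in sessions:
        return False

    # Keep a copy of the untouched store next to the original.
    shutil.copy2(store_path, store_path + ".bak")
    del sessions[session_key]

    # Write to a temp file in the same directory, then atomically swap it in,
    # so a crash mid-write cannot leave a truncated store behind.
    directory = os.path.dirname(os.path.abspath(store_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(sessions, fh, indent=2)
    os.replace(tmp_path, store_path)
    return True
```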

Root cause hypothesis

`subagent-orphan-recovery` treats any orphaned session record as a candidate for resumption, with no signal about why the session became orphaned. Two cases the system can't distinguish today:

1. Benign orphan: the session was healthy when the gateway died (gateway hard-killed mid-run). Resumption is the right call.
2. Pathological orphan: the session was already wedged/stuck when the gateway died (or the gateway died BECAUSE of the wedge). Resumption resurrects the cause of death.

Without an "unhealthy at orphan time" signal — e.g. a `stuck session` mark recorded in the session metadata when the diagnostic fires, or a recovery-attempt counter that bounds re-resurrection — the recovery system happily reanimates the same wedge each restart.
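To make the two signals concrete, here is a minimal sketch of a recovery decision that consults both a persisted wedge mark and a bounded attempt counter. It is not OpenClaw's implementation. The record fields, the constants (standing in for N and M), and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MAX_RECOVERY_ATTEMPTS = 2        # "N": assumed value
REORPHAN_WINDOW_SECONDS = 60.0   # "M": assumed value


@dataclass
class SessionRecord:
    """Hypothetical slice of session metadata relevant to recovery."""
    key: str
    probably_wedged: bool = False          # set when `stuck session` fires repeatedly
    recovery_attempts: int = 0             # consecutive failed recoveries
    last_resumed_at: Optional[float] = None


def note_orphaned(record, now):
    """Update the counter when a session is found orphaned at time `now`."""
    resumed = record.last_resumed_at
    if resumed is not None and now - resumed <= REORPHAN_WINDOW_SECONDS:
        # Re-orphaned shortly after a resumption: the last recovery failed.
        record.recovery_attempts += 1
    else:
        # Only consecutive failures count.
        record.recovery_attempts = 0


def decide(record, now):
    """Return "resume" or "lost" for an orphaned session record."""
    note_orphaned(record, now)
    if record.probably_wedged or record.recovery_attempts >= MAX_RECOVERY_ATTEMPTS:
        return "lost"
    record.last_resumed_at = now
    return "resume"
```

With these assumed values, a session re-orphaned 39 seconds after each resumption would be resumed twice and then marked `lost` on the third restart, instead of being resurrected indefinitely.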

Suggested fix scope

1. Tombstone wedged sessions. When the `stuck session` warning fires repeatedly for a session (e.g. age > 5 min, queueDepth=0), persist a `probablyWedged: true` flag in the session record. `subagent-orphan-recovery` skips records with that flag and either (a) marks them `lost` directly or (b) requires explicit user action via `openclaw tasks maintenance --apply` to retry.
2. Bound recovery attempts per session. Track a `recoveryAttempts` count in session metadata. After N (e.g. 2) consecutive failed recoveries, where a failed recovery means "session became orphaned again within M seconds of the last resumption", mark the session `lost` and stop resurrecting it.
3. Health gate around resumption. Before resuming, run a tiny health probe, e.g. check that the underlying transcript is parseable and that the last assistant message ended with a clean stopReason. Sessions with abnormal terminal states (truncation, mid-tool-call cutoff, last action older than X min with no completion marker) get marked `lost` instead of resumed.
4. `openclaw doctor` should detect the cycle and offer to break it. Right now there's no automated way to recognize that the gateway is in a resurrection loop; you have to read logs and figure out which session is the cursed one. Doctor could detect sessions that `subagent-orphan-recovery` has restored N+ times within the last hour and offer cleanup.
5. Cross-reference with #69778. That issue's proposed `agents.defaults.taskRecovery.maxAgeMinutes` is a related axis (age-bound resurrection). My case shows that age alone is insufficient: even fresh orphans (resurrected within 30s) can be pathological if the underlying session was already wedged. The two fixes are complementary.
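The health gate in the list above could look roughly like the following. This is a sketch under loud assumptions: it treats the transcript as JSON Lines whose entries carry `role` and `stopReason` fields, and the set of "clean" stop reasons is invented for illustration. None of these details are confirmed by the issue.

```python
import json

# Assumed set of stop reasons that indicate a cleanly finished turn.
CLEAN_STOP_REASONS = {"stop", "end_turn"}


def transcript_is_healthy(jsonl_text):
    """Return True only if the transcript parses and ends on a clean assistant turn."""
    last_assistant = None
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # A truncated or corrupt line means the transcript is not parseable.
            return False
        if isinstance(entry, dict) and entry.get("role") == "assistant":
            last_assistant = entry
    if last_assistant is None:
        # Conservative choice: no assistant turn at all is treated as unhealthy.
        return False
    return last_assistant.get("stopReason") in CLEAN_STOP_REASONS
```

The probe is deliberately conservative: a session cut off mid-tool-call fails it even if it was otherwise healthy, so a real implementation would likely pair it with the explicit retry path from item 1 rather than discarding work silently.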

Cross-references

#69778 (OPEN): age-bound resurrection — related, different axis (age vs. wedge-state)
#44687 (CLOSED): session-resume on startup blocking lane=main — different subsystem (`session-resume` vs `subagent-orphan-recovery`) but similar bigger-picture pathology
#67902 (CLOSED): subagent sessions left as "running" in sessions.json after crash — direct precursor (the absence of cleanup is what feeds my bug; would benefit from being reopened or this issue's fix)
#74818 (OPEN): plugin-runtime-deps races — compounds with this bug; both can fire on the same restart

extent analysis

TL;DR

To prevent the gateway from entering a death spiral due to resurrected wedged subagent sessions, implement a mechanism to detect and skip resuming sessions that were likely wedged when the gateway died.

Guidance

- Tombstone wedged sessions: introduce a `probablyWedged` flag in session metadata when a `stuck session` warning fires repeatedly, and have `subagent-orphan-recovery` skip such sessions.
- Bound recovery attempts: track `recoveryAttempts` in session metadata and mark a session as `lost` after a set number of consecutive failed recoveries.
- Health gate around resumption: run a health probe before resuming a session to check for abnormal terminal states, and mark the session as `lost` if the probe fails.
- Automate detection and cleanup: enhance `openclaw doctor` to detect resurrection loops and offer to break the cycle.
- Review related issues: consider the proposed fixes together with #69778 and #74818 so the solution covers all three.

Example

The issue calls for a design change rather than a one-line code fix; the author's suggested fix scope lists the concrete steps.

Notes

The provided suggestions aim to prevent the gateway from resurrecting wedged sessions, which can cause a death spiral. Implementing these changes will require careful consideration of the session metadata and the subagent-orphan-recovery mechanism.

Recommendation

Apply the suggested fixes, starting with tombstoning wedged sessions and bounding recovery attempts, to prevent the gateway from entering a death spiral. This approach addresses the root cause of the issue and provides a more robust solution than relying on manual intervention.
