openclaw - 💡(How to fix) Fix Bug: Subagent child sessions receive no termination signal when parent dies mid-run [1 comments, 2 participants]

GitHub stats: openclaw/openclaw#77720 (fetched 2026-05-06 06:22:27) · Comments: 1 · Participants: 2 · Timeline: 1 (commented ×1) · Reactions: 3


Summary

When a subagent session dies mid-run (OOM kill, SIGUSR1 restart, or abrupt failed exit), its active child sub-subagent sessions receive no termination signal. Children sit in `stale_running` indefinitely — the gateway marks them "running" but no agent polls them and they never complete. Only manual cancellation or `openclaw maintenance --apply` cleans them up.

Steps to Reproduce

  1. Spawn a depth-1 subagent via `sessions_spawn`
  2. From the depth-1 session, spawn a depth-2 sub-subagent via `sessions_spawn`
  3. Kill the depth-1 parent session abruptly (e.g., `SIGUSR1` to gateway, OOM kill, or `openclaw sessions kill`)
  4. Observe: the depth-2 child remains in `stale_running` forever

Expected Behavior

When a parent session dies, all its descendant sessions should either:

  • Receive a termination signal and clean up gracefully, OR
  • Be listed as orphaned and eligible for automatic cleanup
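The second outcome, listing descendants as orphaned, can be sketched as a lineage walk over a snapshot of sessions. This is a minimal illustration only: `SessionRecord`, its fields, and `find_orphans` are hypothetical names, not OpenClaw APIs, and the sketch assumes the gateway can tell which sessions are still alive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionRecord:
    """Hypothetical snapshot of one session; not an OpenClaw type."""
    key: str                # session key
    parent: Optional[str]   # key of the spawning session, None for a root
    alive: bool             # assumption: liveness is known to the gateway


def find_orphans(records: list[SessionRecord]) -> set[str]:
    """Return keys of live sessions that have a dead or missing ancestor."""
    by_key = {r.key: r for r in records}
    orphans: set[str] = set()
    for record in records:
        if not record.alive:
            continue
        seen = {record.key}          # guards against accidental cycles
        parent = record.parent
        while parent is not None and parent not in seen:
            seen.add(parent)
            ancestor = by_key.get(parent)
            if ancestor is None or not ancestor.alive:
                orphans.add(record.key)
                break
            parent = ancestor.parent
    return orphans


snapshot = [
    SessionRecord("main", None, True),
    SessionRecord("depth1", "main", False),    # parent that died mid-run
    SessionRecord("depth2", "depth1", True),   # child left behind
    SessionRecord("depth3", "depth2", True),   # grandchild left behind
    SessionRecord("sibling", "main", True),    # unaffected branch
]
orphaned = find_orphans(snapshot)
```

Walking the full ancestor chain rather than only the direct parent means a depth-3 session is also flagged when its depth-1 ancestor dies.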

Actual Behavior

Depth-2 children become orphaned with no lineage connection to a living agent. Gateway shows them as `running` but they are never polled again. `openclaw tasks audit` reports `stale_running` indefinitely.

Version

```
OpenClaw 2026.5.3-1 (2eae30e)
```

OS: Linux 6.8.0-110-generic (x64)

Evidence

Live task IDs from a production instance (anonymized session keys):

| Task ID | Parent Session | Label | Trigger |
| --- | --- | --- | --- |
| `109245ed-0305-452c-b953-86507b39e568` | `agent:main:subagent:79b6b6a8` | gap-analysis | Gateway restart |
| `6737bbde-f126-45cf-a060-eac36f53d87c` | `agent:main:subagent:c2d1dcfc` | gap-analysis | Gateway restart |
| `ddb62685-f353-4ca1-a250-e70682b2df77` | `agent:main:subagent:44af2dfa` | se-gap-analysis | Parent failed |
| `899ca059-14f5-4ddb-b185-0377b99cc96a` | `agent:main:subagent:04ed3568` | sre-gap-analysis | Parent failed |
| `62f3c65e-ee01-4b9a-8c7d-15cc4045398d` | `agent:main:subagent:a0e1bd52` | rca-fnm-gateway-path | Parent failed |
| `fffe6df3-b729-4801-b805-036160dcf92b` | `agent:main:subagent:5f3a750f` | rca-sessions-transcripts | Parent failed |

`openclaw tasks audit` output:

Task  error  stale_running  109245ed-...  running  1d17h
Task  error  stale_running  6737bbde-...  running  1d17h
[...4 more...]
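For alerting or cleanup scripts, the audit lines above can be filtered for `stale_running` entries. The following is a rough sketch that assumes the whitespace-separated column layout seen in this output (kind, severity, code, task ID, status, age); that layout is inferred from the sample, not from documented CLI behavior, and `stale_task_ids` is a hypothetical helper.

```python
import re

# Assumed column layout, inferred from the sample output in this issue:
# "Task  <severity>  <code>  <task id>  <status>  <age>"
AUDIT_LINE = re.compile(
    r"^Task\s+(?P<severity>\S+)\s+(?P<code>\S+)\s+"
    r"(?P<task_id>\S+)\s+(?P<status>\S+)\s+(?P<age>\S+)\s*$"
)


def stale_task_ids(audit_output: str) -> list[str]:
    """Collect task IDs from lines whose code column is 'stale_running'."""
    ids = []
    for line in audit_output.splitlines():
        match = AUDIT_LINE.match(line.strip())
        # Lines that do not fit the layout (such as "[...4 more...]") are skipped.
        if match and match.group("code") == "stale_running":
            ids.append(match.group("task_id"))
    return ids


sample = """\
Task  error  stale_running  109245ed-...  running  1d17h
Task  error  stale_running  6737bbde-...  running  1d17h
[...4 more...]
"""
stale = stale_task_ids(sample)
```

The task IDs stay truncated here because the sample output truncates them; a real script would read the full IDs from the untruncated command output.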

Impact

  • Orphaned tasks accumulate in gateway task registry
  • `stale_running` errors trigger automated alerting noise
  • Manual cleanup required to recover
  • Makes depth-2 subagent spawning unsafe in production

Severity

High — reliability/usability of multi-agent orchestration.

Proposed Fix (direction, not prescriptive)

Option A: Gateway tracks parent→child registry; on parent death, iterate children and send termination signal or mark them `orphaned`.
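A minimal sketch of what Option A could look like, assuming the gateway records each spawn and is notified when a session dies. `LineageRegistry`, its methods, and the state strings other than `running` and `orphaned` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


class LineageRegistry:
    """Hypothetical parent -> child registry; not OpenClaw's actual gateway code."""

    def __init__(self):
        self._children = defaultdict(set)  # parent key -> set of child keys
        self._state = {}                   # session key -> lifecycle state

    def register(self, key, parent=None):
        """Record a newly spawned session and its parent, if any."""
        self._state[key] = "running"
        if parent is not None:
            self._children[parent].add(key)

    def state(self, key):
        return self._state[key]

    def on_death(self, key, final_state="failed"):
        """Mark `key` dead and every running descendant 'orphaned'.

        Returns the orphaned keys in breadth-first order.
        """
        self._state[key] = final_state
        orphaned = []
        visited = {key}
        queue = deque(sorted(self._children.get(key, ())))
        while queue:
            child = queue.popleft()
            if child in visited:
                continue
            visited.add(child)
            if self._state.get(child) == "running":
                self._state[child] = "orphaned"
                orphaned.append(child)
            queue.extend(sorted(self._children.get(child, ())))
        return orphaned


registry = LineageRegistry()
registry.register("main")
registry.register("depth1", parent="main")
registry.register("depth2-a", parent="depth1")
registry.register("depth2-b", parent="depth1")
registry.register("depth3", parent="depth2-a")
cascade = registry.on_death("depth1")
```

Where this sketch marks a descendant `orphaned`, a real gateway could instead send the termination signal; the traversal would be the same.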

Option B: Subagent children monitor parent session heartbeat; if parent goes dark for N consecutive heartbeats, child self-terminates with a defined exit state.
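Option B can be sketched as a small counter that the child updates once per heartbeat interval. How a child would actually observe its parent's heartbeat is not specified in this issue, so the sketch takes that observation as a boolean input; `ParentWatchdog` and the `parent_lost` exit state are hypothetical names.

```python
class ParentWatchdog:
    """Hypothetical child-side monitor: gives up after N consecutive misses."""

    def __init__(self, max_missed: int):
        if max_missed < 1:
            raise ValueError("max_missed must be at least 1")
        self.max_missed = max_missed
        self.missed = 0
        self.state = "running"

    def observe(self, parent_heartbeat_seen: bool) -> str:
        """Record one heartbeat interval and return the child's state."""
        if self.state != "running":
            return self.state          # terminal state is sticky
        if parent_heartbeat_seen:
            self.missed = 0            # any heartbeat resets the streak
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.state = "parent_lost"
        return self.state


watchdog = ParentWatchdog(max_missed=3)
history = [watchdog.observe(seen) for seen in
           [False, False, True, False, False, False]]
```

Resetting the counter on every observed heartbeat makes the rule "N consecutive misses" rather than "N misses in total", so a briefly slow parent does not cause its children to exit.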

Either approach closes the lifecycle management gap without changing the spawn permission model.

Related Issues

See also:

  • #11040 (Feature: First-class session/task chain tracking) — related to lineage tracking, does not address death-signal propagation
  • #67943 (Normal sessions can inherit stale subagent lineage) — lineage staleness, different root cause

extent analysis

TL;DR

Implement a mechanism for the gateway to track and terminate orphaned subagent sessions when their parent session dies, or have subagent children monitor their parent's heartbeat to self-terminate if the parent goes dark.

Guidance

  1. Gateway-based solution: Modify the gateway to maintain a parent→child registry, allowing it to iterate through child sessions upon a parent's death and send a termination signal or mark them as orphaned.
  2. Subagent-based solution: Implement a heartbeat mechanism where subagent children monitor their parent session's heartbeat; if the parent's heartbeat fails for a consecutive number of checks, the child subagent self-terminates with a defined exit state.
  3. Verify the fix: After implementing either solution, reproduce the scenario described in the issue to ensure that orphaned subagent sessions are properly terminated or marked as orphaned upon their parent's death.

Example

An exact patch depends on OpenClaw implementation details that this issue does not show. Conceptually, the fix adds either a parent→child tracking mechanism in the gateway or parent-heartbeat monitoring in the subagents.

Notes

The choice between the gateway-based and subagent-based solutions depends on the system's architecture and requirements. Both approaches aim to address the lifecycle management gap without altering the spawn permission model.

Recommendation

Prefer Option A (gateway tracks a parent→child registry). It is centralized, likely easier to implement, and addresses orphaned sessions directly: the gateway decides whether each descendant is terminated or marked `orphaned`. It also avoids the added complexity and failure modes of distributed heartbeat monitoring.
