openclaw: [Bug]: Subagent runs can be stranded or mis-finalized after wait timeout/pending edges, and cleanup can remove the main transcript

openclaw/openclaw#73064 (fetched 2026-04-28 06:27:58)

Across multiple recent 2026.4.x releases, subagent runs can fail in two related ways: the child session may be created but remain empty/idle without a useful parent notification, and runs that did make progress can still be finalized as timeout and later cleaned up as if they had completed.

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Steps to reproduce

NOT_ENOUGH_INFO

Expected behavior

If the parent-side wait reaches a timeout or pending edge, the run should remain non-terminal until the child is actually known ended. A created child session should either progress normally or produce a truthful terminal/reportable failure back to the parent. Cleanup should not remove the primary transcript while the run still needs recovery or postmortem inspection.
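
The expected behavior can be read as a rule for mapping a parent-side wait result to a run state. The following is a minimal sketch of that rule, not the OpenClaw implementation: `WaitStatus`, `RunState`, `WaitResult`, `childReportedEnd`, and `classifyWaitResult` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical types for illustration; none of these names come from the OpenClaw source.
type WaitStatus = "ok" | "error" | "timeout" | "pending";

type RunState =
  | { kind: "running" } // non-terminal: the child may still be working
  | { kind: "ended"; outcome: "ok" | "error" }; // terminal: the child is known to have ended

interface WaitResult {
  status: WaitStatus;
  // Assumption: true only when the child itself reported that it ended.
  childReportedEnd: boolean;
}

// A timeout or pending result describes the wait, not the child,
// so it must not move the run into a terminal state.
function classifyWaitResult(result: WaitResult): RunState {
  const status = result.status;
  if (status === "timeout" || status === "pending") {
    return { kind: "running" };
  }
  // An "ok" or "error" without child confirmation (for example an aborted
  // request on the parent side) is also treated as non-terminal.
  if (!result.childReportedEnd) {
    return { kind: "running" };
  }
  return { kind: "ended", outcome: status };
}
```

Under this rule, only a result confirmed by the child can finalize the run, so a requester-side timeout alone would never trigger cleanup.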

Actual behavior

Observed behavior across multiple recent 2026.4.x versions includes two recurring classes:

  1. Empty/idle child session class

    • a subagent session is created
    • little or no transcript content ever appears
    • the parent does not receive a useful completion/failure notification
    • the parent can remain waiting for a callback that never arrives

  2. False-terminal timeout class

    • the child does real work and may even reach terminal trajectory events
    • the visible parent-side/session-store outcome still becomes timeout, missing-status, or an equivalent terminal-looking failure state
    • later cleanup/archiving can remove the main transcript, leaving only secondary artifacts or no useful artifact at all

OpenClaw version

Observed across multiple recent 2026.4.x releases, including at least 2026.4.5 and 2026.4.25. Exact first-bad / last-good boundary is not known.

Operating system

Linux Mint 22.3 (Linux 6.17.0-22-generic x86_64 GNU/Linux).

Install method

npm global

Model

Mixed; the observed anomaly class is not isolated to a single model.

Provider / routing chain

Mixed; the observed anomaly class is not isolated to a single provider/routing chain.

Additional provider/model setup details

The behavior was observed across multiple subagent workloads and more than one provider/model path, which suggests the bug is more likely in subagent lifecycle/runtime handling than in one provider integration.

Logs, screenshots, and evidence

Representative redaction-safe error strings observed across the affected runs:
- gateway timeout after 10000ms
- This operation was aborted | This operation was aborted
- request timed out | request timed out

Representative symptom classes observed across multiple recent 2026.4.x versions:
- requester-side timeout while the child session is still created
- child session with little or no transcript content and no useful parent notification
- parent-visible timeout/missing-state even though child-side artifacts show substantive progress
- primary transcript later deleted or only preserved indirectly

Current source trace against openclaw/openclaw main:
- src/agents/run-wait.ts:155-160
  agent.wait returning status "pending" is surfaced as pending
- src/agents/subagent-registry-run-manager.ts:116-118
  pending wait result returns immediately without immediate re-arm
- src/agents/subagent-registry-run-manager.ts:173-181
  timeout path still completes the run and triggers cleanup
- src/agents/subagent-registry.ts:242-247
  session status "timeout" maps to outcome timeout + reason COMPLETE
- src/agents/subagent-registry.ts:433-459
  pending lifecycle timeout also finalizes as COMPLETE with cleanup
- src/agents/subagent-registry.ts:826-833
  sweeper later deletes the transcript with sessions.delete(deleteTranscript: true)
- src/agents/subagent-spawn.ts:759, 803-817, 1149-1164
  child session key/store entry exists before the overall spawn/notification path is fully settled

Historical cross-version context:
- earlier 2026.4.x observations included requester-side `gateway timeout after 10000ms`
  even when higher timeout settings were configured
- current main source suggests at least part of that earlier startup-timeout line was changed,
  but the registry/wait/cleanup problem above still appears to explain the remaining
  false-terminal / stranded-session classes
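
The source trace notes that a pending wait result returns without an immediate re-arm and that the timeout path still completes the run. One possible shape for a re-arm decision is sketched below. It is an assumption about how such logic could look, not the code in subagent-registry-run-manager.ts: `ChildWaitStatus`, `NextStep`, `RearmPolicy`, and `nextStepAfterWait` are hypothetical names.

```typescript
// Hypothetical sketch; names are illustrative and not part of the OpenClaw API.
type ChildWaitStatus = "ok" | "error" | "timeout" | "pending";

type NextStep =
  | { action: "finalize"; outcome: "ok" | "error" }
  | { action: "rearm"; delayMs: number }
  | { action: "report-unknown" }; // notify the parent that the state is unknown; no cleanup

interface RearmPolicy {
  baseDelayMs: number;
  maxDelayMs: number;
  maxAttempts: number;
}

// Decide what to do after each wait result. `attempt` counts the
// re-arms already performed for this run, starting at 0.
function nextStepAfterWait(
  status: ChildWaitStatus,
  attempt: number,
  policy: RearmPolicy,
): NextStep {
  if (status === "ok" || status === "error") {
    return { action: "finalize", outcome: status };
  }
  // "pending" and "timeout" are not terminal: wait again, up to a limit.
  if (attempt >= policy.maxAttempts) {
    return { action: "report-unknown" };
  }
  // Exponential backoff, capped at maxDelayMs.
  const delayMs = Math.min(policy.baseDelayMs * 2 ** attempt, policy.maxDelayMs);
  return { action: "rearm", delayMs };
}
```

The `report-unknown` step is the important design choice here: when the re-arm budget runs out, the parent is told that the child's state is unknown rather than being left waiting or shown a completed run.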

Impact and severity

Affected: any workflow that relies on subagent spawn, wait, and completion notification
Severity: High
Frequency: Intermittent but repeatedly observed across multiple recent 2026.4.x versions
Consequence:

  • parent workflows can stall waiting for a callback that never arrives
  • operators can see false failure/timeout states for work that actually progressed
  • transcript cleanup can remove the easiest artifact needed for debugging and recovery

Additional information

This report intentionally avoids installation-local identifiers, session IDs, agent names, hostnames, and local file paths because they are not portable to upstream triage.

The strongest current-source hypothesis is that more than one related issue may be involved:

  1. a non-rearmed pending wait path that can leave created child sessions stranded without useful parent notification
  2. timeout paths that are still treated as terminal completion
  3. later cleanup/archiving that can delete the main transcript after those terminalization decisions
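
For the third point, a cleanup guard could refuse to delete anything unless the run is verifiably finished. The sketch below illustrates the idea under stated assumptions: `RunRecord`, its fields, `CleanupDecision`, and `decideCleanup` are hypothetical and do not mirror the registry's actual data model.

```typescript
// Hypothetical record shape; field names are assumptions for illustration only.
interface RunRecord {
  endedReason: "complete" | "timeout" | "unknown";
  childConfirmedEnded: boolean;
  parentNotified: boolean;
  endedAtMs: number | null;
}

interface CleanupDecision {
  deleteSession: boolean;
  deleteTranscript: boolean;
}

// Allow deletion only for runs that completed, were confirmed by the child,
// were reported to the parent, and have passed a retention window.
function decideCleanup(run: RunRecord, nowMs: number, retentionMs: number): CleanupDecision {
  const verified =
    run.endedReason === "complete" && run.childConfirmedEnded && run.parentNotified;
  if (!verified || run.endedAtMs === null) {
    // Timeout or unknown runs keep their transcript for recovery and postmortem.
    return { deleteSession: false, deleteTranscript: false };
  }
  const expired = nowMs - run.endedAtMs >= retentionMs;
  return { deleteSession: expired, deleteTranscript: expired };
}
```

With a guard of this kind, a run finalized through a timeout path would keep its primary transcript even if the sweeper visits it later.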

Extent analysis

TL;DR

The most likely fix involves modifying the subagent lifecycle handling to properly rearm pending waits and prevent premature timeout completions.

Guidance

  • In subagent-registry-run-manager.ts (lines 116-118), a pending wait result returns immediately without re-arming the wait. Re-arm it so that a created child session is not left stranded without a useful parent notification.
  • In subagent-registry-run-manager.ts (lines 173-181) and subagent-registry.ts (lines 242-247 and 433-459), the timeout paths finalize the run as COMPLETE and trigger cleanup. Change them so that a timeout does not finalize the run or trigger cleanup until the child is known to have ended.
  • In subagent-spawn.ts (lines 759, 803-817, and 1149-1164), the child session key/store entry exists before the spawn/notification path has fully settled. Check that a failure or timeout after that point is still reported to the parent.
  • Consider adding logging or debugging statements to track the state of child sessions and parent notifications to better understand the issue.
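
The logging suggestion could start as a small in-memory record of lifecycle transitions that makes stranded runs easy to list. This is a sketch only: `LifecycleEvent`, `LifecycleLog`, `record`, and `strandedRuns` are hypothetical names, and the event set is an assumption rather than OpenClaw's actual lifecycle.

```typescript
// Hypothetical event names; the real lifecycle may differ.
type LifecycleEvent =
  | "spawned"
  | "wait-pending"
  | "wait-timeout"
  | "child-ended"
  | "parent-notified"
  | "cleanup";

interface LifecycleEntry {
  runRef: string; // opaque run reference chosen by the caller
  event: LifecycleEvent;
  atMs: number;
}

class LifecycleLog {
  private readonly entries: LifecycleEntry[] = [];

  record(runRef: string, event: LifecycleEvent, atMs: number = Date.now()): void {
    this.entries.push({ runRef, event, atMs });
  }

  // Runs that were spawned but never produced a parent notification.
  // Assumes "spawned" is recorded before "parent-notified" for the same run.
  strandedRuns(): string[] {
    const open = new Set<string>();
    for (const entry of this.entries) {
      if (entry.event === "spawned") open.add(entry.runRef);
      if (entry.event === "parent-notified") open.delete(entry.runRef);
    }
    return Array.from(open);
  }
}
```

Recording `wait-pending` and `wait-timeout` alongside `cleanup` would also show whether cleanup ran for a run that never reached `child-ended`.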

Example

No verified patch is provided here: the issue spans several files, and a fix requires a thorough review of the codebase.

Notes

The issue appears to be related to multiple factors, including pending wait paths, timeout handling, and cleanup mechanisms. A thorough review of the codebase and testing of different scenarios will be necessary to fully resolve the issue.

Recommendation

Modify the subagent lifecycle handling so that a timeout is not treated as completion and the parent is always notified of the child session's outcome. This will likely involve changes to subagent-registry-run-manager.ts and subagent-registry.ts.
