openclaw - 💡(How to fix) Fix Background subagent completion notifications dropped silently — 4 reproductions in 8h session [1 comments, 2 participants]

GitHub stats
openclaw/openclaw#84053 · Fetched 2026-05-20 03:44:42
Comments: 1 · Participants: 2 · Timeline: 7 · Reactions: 1
Timeline (top): labeled ×3, cross-referenced ×2, closed ×1, commented ×1


Subagent completion notifications dropped silently on socket close — 3 reproductions in single session

OpenClaw version: 2026.5.18 (50a2481)
Date: 2026-05-18 to 2026-05-19
Severity: High (silent failures cause downstream agents to report false progress)

Summary

Background subagent (Agent tool with run_in_background: true, subagent_type: claude, model: opus) completion notifications are not delivered reliably. In a single 8-hour session, 3 independent subagents finished work (verified by file mtime + git status + build pass) but the parent agent never received a completion notification. The pattern:

  1. Subagent runs for N minutes writing tool calls + edits to disk
  2. JSONL last-write stops abruptly
  3. No completion notification fires on the parent
  4. Work is visible on disk (file edits, git rename history, build passes) — agent reached the work-done phase
  5. Parent agent reports "still running" indefinitely until a human asks "is it actually running?"

Reproductions today

| Agent | Spawned | Last JSONL write | Notification? | Work landed? |
| --- | --- | --- | --- | --- |
| Stage 2 (PROMETHEUS template), first try | 19:45 | ~3 min later | Yes (with socket error) | Partial |
| A2a (real_estate template) | 20:31 | 20:38 | No | Yes: full template + dispatcher + tests + build green |
| Stage 3 (TCPA fix + Gate 2 verify) | 21:50 | 21:57 | No | Partial: TCPA fix landed, Gate 2 verify never started |
| Nexus rename (full codebase sweep) | 22:55 | 23:00 | No | Partial: file renames done, env files + tests left broken |

Inconsistency: same model, same harness, same workload class, same hour — sometimes notification fires, sometimes silent.

Hypothesized root cause

Likely an Anthropic API socket close during the subagent's streaming response. The harness appears to handle some close events (Stage 2 first try → notification fired with error) but not all (A2a, Stage 3, rename → silent). The agent process itself reaches a clean stop state — work is on disk — but the parent never observes completion.

Possibly related to the doctor --fix strips agentRuntime: claude-cli mapping bug also seen in 2026.5.18 (Claude Code patched the config side this morning, but the subagent runtime behavior may still be affected).

Mitigation (already implemented in our project)

While waiting for upstream fix:

  1. Every subagent brief includes "save your verdict to /path/to/file.md FIRST" — work survives notification loss
  2. Parent polls subagent JSONL file mtime every 15-20 min; stale > 10 min = suspected dead → investigate disk state
  3. Never mark a task in_progress without verifying the Agent tool call returned an agentId
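
The polling step in this mitigation can be sketched in a few lines of Python. This is a minimal illustration, not part of OpenClaw: the function names, the status strings, and the idea of passing the transcript and verdict paths explicitly are all assumptions made for the example. Only the 10-minute staleness threshold and the "verdict file first" convention come from the list above.

```python
import os
import time
from typing import Optional

STALE_AFTER_SECONDS = 10 * 60  # threshold from the mitigation: stale > 10 min


def seconds_since_last_write(jsonl_path: str, now: Optional[float] = None) -> float:
    """Return how long ago the transcript file was last modified."""
    current = time.time() if now is None else now
    return current - os.path.getmtime(jsonl_path)


def classify_subagent(jsonl_path: str, verdict_path: str,
                      now: Optional[float] = None) -> str:
    """Classify a background subagent from disk state alone (hypothetical helper).

    Returns one of:
      "finished"       - the verdict file exists, so the outcome is on disk
      "suspected-dead" - no verdict and the transcript has been stale too long
      "running"        - no verdict, transcript written recently
      "unknown"        - no transcript file at all (the spawn may have failed)
    """
    if os.path.exists(verdict_path):
        return "finished"
    if not os.path.exists(jsonl_path):
        return "unknown"
    if seconds_since_last_write(jsonl_path, now) > STALE_AFTER_SECONDS:
        return "suspected-dead"
    return "running"
```

A parent would call `classify_subagent` on its 15-20 minute polling interval and treat "suspected-dead" as a prompt to inspect git status and build state rather than as proof of failure, since a subagent can legitimately go quiet during a long tool call.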

Suggested upstream fixes

  1. Subagent process exit handler should fire notification even on abnormal exit — capture whatever transcript exists + send a "ended-with-partial-state" notification rather than going silent
  2. Notification delivery acks — parent acks receipt; subagent retries if not acked
  3. Heartbeat-based liveness — subagent writes a heartbeat every N seconds while idle; parent detects stale heartbeat and notifies
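
Suggestions 1 and 2 can be combined in one small sketch: wrap the subagent's work so that a terminal notification is always built, then retry delivery until the parent acknowledges it. Everything named here (`Notification`, `run_and_always_notify`, the `send` callback and its boolean ack) is hypothetical and does not reflect OpenClaw's real harness or transport; backoff between retries and the heartbeat from suggestion 3 are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Notification:
    agent_id: str
    status: str   # "completed" or "ended-with-partial-state"
    detail: str


def run_and_always_notify(
    agent_id: str,
    work: Callable[[], str],
    send: Callable[[Notification], bool],
    max_attempts: int = 3,
) -> Notification:
    """Run `work`, then deliver one terminal notification with ack + retry.

    `send` is an assumed transport: it returns True when the parent
    acknowledged receipt, and returns False or raises OSError otherwise.
    """
    try:
        note = Notification(agent_id, "completed", work())
    except Exception as exc:  # e.g. a socket closed mid-stream
        # Abnormal exit still produces a notification instead of silence.
        note = Notification(agent_id, "ended-with-partial-state", repr(exc))

    for _ in range(max_attempts):
        try:
            if send(note):
                return note
        except OSError:
            pass  # treat a transport error as "not acked" and try again
    raise RuntimeError(f"parent never acked notification for {agent_id}")
```

The design choice worth noting is that the failure path and the success path share the same delivery loop, so a socket error during the work cannot bypass notification the way the silent cases in this report appear to.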

Reproducibility

Trivial in current environment — 3 of 4 subagent spawns today exhibited this. Not narrowed to a specific task type; happens across template-generation, refactoring, and file-rename workloads.

Related issue worth filing separately

doctor --fix silently strips agentRuntime: claude-cli mapping — patched by Claude Code 2026-05-18 morning. Diagnosis in user's project memory at project_openclaw_2026_5_18_doctor_strips_runtime.md. Worth filing as a parallel issue so the fix lands upstream and survives future OpenClaw releases.
