openclaw: Subagent runs can be falsely marked failed/lost after clean gateway close or pending wait [1 comment, 2 participants]

openclaw/openclaw#74363 (fetched 2026-04-30 06:24:57)

Summary

Subagent lifecycle recovery can falsely mark a run as failed/lost when the gateway connection closes cleanly, agent.wait returns pending, or the in-memory run context is missing while the backing session is still running.

This showed up as repeated Mission Control/Workshop subagent failures such as:

  • gateway closed (1000 normal closure): no close reason
  • subagent run lost active execution context
  • backing session missing

The underlying child work sometimes continued or produced useful output, but the registry/runtime marked the parent task as failed or lost.

Observed behavior

On OpenClaw 2026.4.26, with Gateway bound to loopback on macOS:

  • Gateway: 127.0.0.1:18789
  • Runtime: Mac mini LaunchAgent
  • Subagent tasks launched from Mission Control / Workshop

Failure modes observed:

  1. gateway closed (1000 normal closure) was treated as terminal instead of recoverable.
  2. waitForSubagentCompletion() returned for good when agent.wait returned pending, leaving no follow-up waiter attached.
  3. The subagent sweeper marked a run as "subagent run lost active execution context" even when the backing session store still had status: "running".

Expected behavior

  • Clean websocket close 1000 during gateway reconnect/restart windows should be treated as recoverable, similar to transport close / connection loss.
  • A pending wait result should schedule a follow-up wait rather than dropping lifecycle supervision.
  • If the sweeper loses in-memory run context but the backing session entry still says running, it should reattach/re-wait rather than terminally fail the run.

Local hotfix applied

I applied a local hotfix to the bundled dist as a temporary guard:

dist/run-wait-*.js

Add gateway closed (1000...) to recoverable wait errors:

const RECOVERABLE_AGENT_WAIT_ERROR_PATTERNS = [
  /gateway closed \(1000/i, // added: clean close during gateway reconnect/restart
  /gateway closed \(1006/i,
  /transport close/i,
  // ...
]
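
As a rough illustration of how such a list might be consulted, the sketch below classifies an error as recoverable or terminal. The helper name isRecoverableWaitError is hypothetical and not taken from the OpenClaw source, and only the three patterns quoted above are included; the real list has further entries that this report elides.

```javascript
// Only the patterns quoted in the report; the real list has more entries.
const RECOVERABLE_AGENT_WAIT_ERROR_PATTERNS = [
  /gateway closed \(1000/i,
  /gateway closed \(1006/i,
  /transport close/i,
];

// Hypothetical helper: true when the error text matches any recoverable pattern.
function isRecoverableWaitError(error) {
  const message = error instanceof Error ? error.message : String(error);
  return RECOVERABLE_AGENT_WAIT_ERROR_PATTERNS.some((pattern) => pattern.test(message));
}

console.log(
  isRecoverableWaitError(new Error("gateway closed (1000 normal closure): no close reason"))
); // true
console.log(isRecoverableWaitError("backing session missing")); // false
```

The patterns carry no g flag, so repeated test() calls stay stateless and the helper can be called from any retry loop.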

dist/subagent-registry-*.js

When agent.wait returns pending, schedule another wait:

if (wait.status === "pending") {
  log.info("subagent wait still pending; scheduling follow-up wait", { /* ... */ })
  // Remember which registry entry this timer belongs to.
  const scheduledEntry = entry
  setTimeout(() => {
    const current = params.runs.get(runId)
    // Skip if the run was removed, replaced, or has already ended.
    if (!current || current !== scheduledEntry || typeof current.endedAt === "number") return
    waitForSubagentCompletion(runId, waitTimeoutMs, scheduledEntry)
  }, RECOVERABLE_WAIT_RETRY_DELAY_MS).unref?.() // unref: the timer must not keep the process alive
  return
}
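
The same re-wait pattern can be tried in isolation. This is a minimal sketch under stated assumptions, not the registry's actual code: createSupervisor, waitOnce, and schedule are hypothetical names, waitOnce stands in for agent.wait and is synchronous for brevity, and the delay value is a guess because the report does not give the value of RECOVERABLE_WAIT_RETRY_DELAY_MS.

```javascript
// Assumed delay; the report names RECOVERABLE_WAIT_RETRY_DELAY_MS but not its value.
const RETRY_DELAY_MS = 1000;

// Default scheduler: an unref'd timer so a pending retry never keeps Node alive.
const defaultSchedule = (fn, ms) => {
  setTimeout(fn, ms).unref?.();
};

// waitOnce(runId) returns { status: "pending" } or a terminal status such as { status: "ok" }.
function createSupervisor({ runs, waitOnce, schedule = defaultSchedule }) {
  return function supervise(runId, expectedEntry) {
    const current = runs.get(runId);
    // Stale guard: the run was removed, replaced, or has already finished.
    if (!current || current !== expectedEntry || typeof current.endedAt === "number") return;

    const result = waitOnce(runId);
    if (result.status === "pending") {
      // Keep supervising instead of returning for good.
      schedule(() => supervise(runId, expectedEntry), RETRY_DELAY_MS);
      return;
    }
    current.endedAt = Date.now();
    current.outcome = result.status;
  };
}

// Usage: two pending results, then completion, driven by a manual queue.
const runs = new Map();
const entry = { startedAt: Date.now() };
runs.set("run-1", entry);
const queue = [];
let calls = 0;
const supervise = createSupervisor({
  runs,
  waitOnce: () => (++calls < 3 ? { status: "pending" } : { status: "ok" }),
  schedule: (fn) => queue.push(fn),
});
supervise("run-1", entry);
while (queue.length > 0) queue.shift()();
console.log(calls, entry.outcome); // 3 ok
```

Injecting the scheduler keeps the sketch deterministic; with the default scheduler the follow-up runs on an unref'd timer, as in the hotfix.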

In the sweeper, if the backing session is still running, reattach instead of failing:

if (sessionEntry?.status === "running") {
  // Backing session is still running: reattach rather than marking the run lost.
  resumedRuns.delete(runId)
  resumeSubagentRun(runId)
  continue
}

Local verification

After applying the hotfix:

  • node --check passed for the patched runtime files.
  • Gateway restarted cleanly.
  • A controlled subagent lifecycle smoke test completed successfully.
  • Mission Control/Workshop tasks stopped failing with the previous lost-context pattern.

Suggested upstream fix

Implement the above behavior in source, with tests covering:

  1. waitForAgentRun() treats gateway closed (1000...) as recoverable.
  2. waitForSubagentCompletion() continues supervising after pending.
  3. Sweeper reattaches when session store entry is still running even if in-memory run context is missing.
  4. Sweeper still fails closed when the session entry is missing, terminal, stale, or drifted.
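
Items 3 and 4 could be exercised through one decision function. The sketch below is only an assumption about shape: decideSweepAction, the updatedAt field, and the STALE_AFTER_MS threshold are hypothetical, and the "drifted" case is not modelled because this report does not define it.

```javascript
// Hypothetical threshold; the report does not define what counts as "stale".
const STALE_AFTER_MS = 5 * 60 * 1000;

// Decide what the sweeper does for a run whose in-memory context is missing.
// sessionEntry is assumed to look like { status, updatedAt }, or to be undefined.
function decideSweepAction(sessionEntry, now = Date.now()) {
  if (!sessionEntry) return "fail"; // backing session missing
  if (sessionEntry.status !== "running") return "fail"; // terminal or unknown status
  if (typeof sessionEntry.updatedAt !== "number") return "fail"; // freshness cannot be proven
  if (now - sessionEntry.updatedAt > STALE_AFTER_MS) return "fail"; // stale
  return "reattach"; // still running and fresh: re-wait instead of failing
}

const now = 1_000_000_000;
console.log(decideSweepAction({ status: "running", updatedAt: now - 1000 }, now)); // reattach
console.log(decideSweepAction(undefined, now)); // fail
console.log(decideSweepAction({ status: "completed", updatedAt: now }, now)); // fail
console.log(decideSweepAction({ status: "running", updatedAt: now - 2 * STALE_AFTER_MS }, now)); // fail
```

Every branch other than a fresh running entry returns "fail", so an unexpected entry shape fails closed by default.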

Why this matters

Without this, a clean gateway reconnect/restart or transient wait state can cause false failed/lost task records, noisy user-facing completion events, and stale recovery tasks in Mission Control.

extent analysis

TL;DR

Apply a hotfix to treat clean websocket closes as recoverable, schedule follow-up waits for pending subagent completions, and reattach to running sessions when in-memory context is missing.

Guidance

  • Update RECOVERABLE_AGENT_WAIT_ERROR_PATTERNS to include the gateway closed (1000...) pattern so that a clean close is treated as a recoverable wait error.
  • Modify waitForSubagentCompletion() to schedule another wait when agent.wait returns pending.
  • Update the sweeper to reattach to a subagent run when the backing session is still running, even if the in-memory run context is missing.
  • Verify the fix by running a controlled subagent lifecycle smoke test and checking for successful completion.

Example

const RECOVERABLE_AGENT_WAIT_ERROR_PATTERNS = [
  /gateway closed \(1000/i,
  /gateway closed \(1006/i,
  /transport close/i,
  // ...
]

// ...

if (wait.status === "pending") {
  // Schedule follow-up wait
  setTimeout(() => {
    // ...
  }, RECOVERABLE_WAIT_RETRY_DELAY_MS).unref?.()
}

// ...

if (sessionEntry?.status === "running") {
  // Reattach to running session
  resumedRuns.delete(runId)
  resumeSubagentRun(runId)
  continue
}

Notes

The provided hotfix is a temporary solution and should be replaced with a proper upstream fix that includes tests for the desired behavior.

Recommendation

Apply the suggested hotfix as a temporary workaround until an upstream fix lands. In the reporter's local verification it stopped Mission Control/Workshop tasks from failing with the lost-context pattern.

