openclaw - 💡(How to fix) Fix Subagent runs can be falsely marked failed/lost after clean gateway close or pending wait [1 comments, 2 participants]

openclaw, 2026-04-29 13:41:56

GitHub stats

openclaw/openclaw#74363•Fetched 2026-04-30 06:24:57

Author

nicolasdmolina

Participants

clawsweeper[bot]

nicolasdmolina

Timeline (top)

commented ×1

Subagent lifecycle recovery can falsely mark a run as failed/lost when the gateway connection closes cleanly, agent.wait returns pending, or the in-memory run context is missing while the backing session is still running.

Summary

This showed up as repeated Mission Control/Workshop subagent failures such as:

gateway closed (1000 normal closure): no close reason
subagent run lost active execution context
backing session missing

The underlying child work sometimes continued or produced useful output, but the registry/runtime marked the parent task as failed or lost.

Observed behavior

On OpenClaw 2026.4.26, with Gateway bound to loopback on macOS:

- Gateway: `127.0.0.1:18789`
- Runtime: Mac mini LaunchAgent
- Subagent tasks launched from Mission Control / Workshop

Failure modes observed:

- `gateway closed (1000 normal closure)` was treated as terminal instead of recoverable.
- `waitForSubagentCompletion()` returned permanently when `agent.wait` returned `pending`, leaving no follow-up waiter attached.
- The subagent sweeper marked a run as `subagent run lost active execution context` even when the backing session store still had `status: "running"`.

Expected behavior

- Clean websocket close `1000` during gateway reconnect/restart windows should be treated as recoverable, similar to transport close / connection loss.
- A `pending` wait result should schedule a follow-up wait rather than dropping lifecycle supervision.
- If the sweeper loses in-memory run context but the backing session entry still says `running`, it should reattach/re-wait rather than terminally fail the run.

Local hotfix applied

I applied a local hotfix to the bundled dist as a temporary guard:

`dist/run-wait-*.js`

Add `gateway closed (1000...)` to the recoverable wait errors:

const RECOVERABLE_AGENT_WAIT_ERROR_PATTERNS = [
  /gateway closed \(1000/i,
  /gateway closed \(1006/i,
  /transport close/i,
  ...
]
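A minimal sketch of how such a pattern list might be consulted. The helper name `isRecoverableAgentWaitError` is hypothetical and not taken from the OpenClaw source, and only the three patterns shown in the issue are included because the rest of the original list is elided.

```javascript
// Patterns copied from the hotfix in the issue; the remaining entries of the
// original list are elided there, so they are not reproduced here.
const RECOVERABLE_AGENT_WAIT_ERROR_PATTERNS = [
  /gateway closed \(1000/i,
  /gateway closed \(1006/i,
  /transport close/i,
];

// Hypothetical helper (name is an assumption): returns true when the error
// message matches any recoverable pattern, so the caller can retry the wait
// instead of marking the run as failed.
function isRecoverableAgentWaitError(error) {
  const message = error instanceof Error ? error.message : String(error ?? "");
  return RECOVERABLE_AGENT_WAIT_ERROR_PATTERNS.some((pattern) => pattern.test(message));
}
```

With this shape, the message `gateway closed (1000 normal closure): no close reason` is classified as recoverable, while an unrelated message such as `backing session missing` is not.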

`dist/subagent-registry-*.js`

When `agent.wait` returns `pending`, schedule another wait:

if (wait.status === "pending") {
  log.info("subagent wait still pending; scheduling follow-up wait", {...})
  const scheduledEntry = entry
  setTimeout(() => {
    const current = params.runs.get(runId)
    if (!current || current !== scheduledEntry || typeof current.endedAt === "number") return
    waitForSubagentCompletion(runId, waitTimeoutMs, scheduledEntry)
  }, RECOVERABLE_WAIT_RETRY_DELAY_MS).unref?.()
  return
}
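The same guard can be isolated into a small function that is easier to test. This is a sketch under stated assumptions: `handleWaitResult`, `defaultSchedule`, and the injectable `schedule` and `retry` parameters are hypothetical names introduced for illustration, and the delay value is assumed because the issue does not state it.

```javascript
// Assumed value; the issue does not give the real delay for this constant.
const RECOVERABLE_WAIT_RETRY_DELAY_MS = 1000;

// Hypothetical helper: handles one wait result for a run.
// `runs` is a Map of runId -> entry, `retry` re-issues the wait, and
// `schedule` defaults to a setTimeout wrapper but can be replaced in tests.
function handleWaitResult({ runs, runId, entry, wait, retry, schedule = defaultSchedule }) {
  if (wait.status !== "pending") return "settled";
  schedule(() => {
    const current = runs.get(runId);
    // Skip the retry if the run was removed, replaced, or has already ended.
    if (!current || current !== entry || typeof current.endedAt === "number") return;
    retry(runId, entry);
  }, RECOVERABLE_WAIT_RETRY_DELAY_MS);
  return "rescheduled";
}

function defaultSchedule(callback, delayMs) {
  // unref() keeps a pending retry from holding the Node.js process open.
  setTimeout(callback, delayMs).unref?.();
}
```

Comparing `current` with the captured `entry` by identity means a retry scheduled for an older registration of the same `runId` does nothing once that entry has been replaced.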

In the sweeper, if the backing session is still running, reattach instead of failing:

if (sessionEntry?.status === "running") {
  resumedRuns.delete(runId)
  resumeSubagentRun(runId)
  continue
}

Local verification

After applying the hotfix:

- `node --check` passed for the patched runtime files.
- Gateway restarted cleanly.
- A controlled subagent lifecycle smoke test completed successfully.
- Mission Control/Workshop tasks stopped failing with the previous lost-context pattern.

Suggested upstream fix

Implement the above behavior in source, with tests covering:

- `waitForAgentRun()` treats `gateway closed (1000...)` as recoverable.
- `waitForSubagentCompletion()` continues supervising after `pending`.
- The sweeper reattaches when the session store entry is still `running`, even if the in-memory run context is missing.
- The sweeper still fails closed when the session entry is missing, terminal, stale, or drifted.
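One way to make the last two cases testable is to pull the sweeper's decision into a pure function. The sketch below is an assumption throughout: `decideSweepAction`, the `runId` and `updatedAt` fields, and the staleness threshold are hypothetical, and "drifted" is read here as "the stored entry now belongs to a different run", which the issue does not define.

```javascript
// Assumed threshold; the issue does not say when an entry counts as stale.
const STALE_AFTER_MS = 5 * 60 * 1000;

// Hypothetical decision helper for the sweeper. `sessionEntry` is the stored
// session record (or undefined) and `expectedRunId` is the run being swept.
// Field names `runId` and `updatedAt` are assumptions for illustration.
function decideSweepAction(sessionEntry, expectedRunId, now = Date.now()) {
  if (!sessionEntry) return "fail"; // missing
  if (sessionEntry.status !== "running") return "fail"; // terminal or unknown
  if (sessionEntry.runId !== expectedRunId) return "fail"; // drifted
  if (typeof sessionEntry.updatedAt !== "number") return "fail"; // freshness unknown
  if (now - sessionEntry.updatedAt > STALE_AFTER_MS) return "fail"; // stale
  return "reattach";
}
```

Every check defaults to `"fail"`, so the sweeper only reattaches when the stored entry positively shows a fresh, matching, running session.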

Why this matters

Without this, a clean gateway reconnect/restart or transient wait state can cause false failed/lost task records, noisy user-facing completion events, and stale recovery tasks in Mission Control.

extent analysis

TL;DR

Apply a hotfix to treat clean websocket closes as recoverable, schedule follow-up waits for pending subagent completions, and reattach to running sessions when in-memory context is missing.

Guidance

- Update `RECOVERABLE_AGENT_WAIT_ERROR_PATTERNS` to include `gateway closed (1000...)` among the recoverable wait errors.
- Modify `waitForSubagentCompletion()` to schedule another wait when `agent.wait` returns `pending`.
- Update the sweeper to reattach to a subagent run when the backing session is still `running`, even if the in-memory run context is missing.
- Verify the fix by running a controlled subagent lifecycle smoke test and checking for successful completion.

Example

const RECOVERABLE_AGENT_WAIT_ERROR_PATTERNS = [
  /gateway closed \(1000/i,
  /gateway closed \(1006/i,
  /transport close/i,
  // ...
]

// ...

if (wait.status === "pending") {
  // Schedule follow-up wait
  setTimeout(() => {
    // ...
  }, RECOVERABLE_WAIT_RETRY_DELAY_MS).unref?.()
}

// ...

if (sessionEntry?.status === "running") {
  // Reattach to running session
  resumedRuns.delete(runId)
  resumeSubagentRun(runId)
  continue
}

Notes

The provided hotfix is a temporary solution and should be replaced with a proper upstream fix that includes tests for the desired behavior.

Recommendation

Apply the suggested hotfix as a temporary workaround until a proper upstream fix is implemented, as it addresses the identified issues and prevents false failed/lost task records.
