openclaw - [Bug]: ACP runTurn can remain pending forever when child process exits before terminal event


Error Message

No error is surfaced. The expected path is that runTurn rejects with a structured runtime error, the catch arm in agent-command-DCy1nGSb.js:1113 runs and calls emitAcpLifecycleError, and the gateway emits a phase: "error" event. Instead, when the child exits before producing a terminal ACP done or error message, runTurn neither resolves nor rejects; it just sits pending. Neither emitAcpLifecycleEnd nor emitAcpLifecycleError fires. Searching the full gateway log for any subsequent matching session-key / agent-runId terminating event (phase: "end", phase: "error", exit, error, completion, abort) returns zero results. The dispatch was accepted; nothing ever closed it out.



Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

  • When OpenClaw dispatches an agent and the spawned child process dies before reporting that it has finished, the dispatch promise hangs forever: it neither resolves nor rejects.

  • Side effect: the session stays at 'running' in the registry permanently, and no log file is ever written. Operators can't tell the agent failed.

  • It doesn't break new dispatches (we tested); it just leaves stale state and silent failures.

  • We added a watchdog on our side to alert. The real fix should be upstream: wire the child-process-exit handler to also fail the in-flight promise.

  • We think auth failures (401s) trigger this, but we're flagging that as a guess, not proof.

Steps to reproduce

1. Configure an OpenClaw agent whose --agent command exits with a non-zero status immediately on its first stdin write (e.g. bash -c 'cat > /dev/null; exit <non-zero>', or any binary that closes stdout without producing an ACP done message). The claude CLI subprocess hitting an upstream 401 from its model provider produces the same shape: it exits before any ACP message is written.

2. Dispatch a turn against that agent via the gateway's /hooks/... endpoint or directly through acpManager.runTurn(...).

3. Watch agents/<id>/sessions/sessions.json and agents/<id>/sessions/<sessionId>.jsonl.

Expected behavior

When the agent's child process exits or its stdout closes — for any reason (auth failure, missing binary, OOM, sandbox kill) — acpManager.runTurn(...) rejects with a structured runtime error. The catch arm in agent-command-DCy1nGSb.js:1113 runs, calls emitAcpLifecycleError, the gateway emits a phase: "error" event, persistGatewaySessionLifecycleEvent patches the session entry to status: "failed", and an operator looking at sessions.json sees the session ended.

Actual behavior

When the child exits before producing a terminal ACP done or error message, runTurn neither resolves nor rejects — it just sits pending. Neither emitAcpLifecycleEnd nor emitAcpLifecycleError fires. The session stays at status: "running" forever. No <sessionId>.jsonl is ever written. The dispatch was accepted by the gateway (operator sees dispatch_accepted in logs), and from the gateway's view nothing has gone wrong, but the agent never actually ran. There is no signal that anything failed.

The diff in one sentence: the try/catch around runTurn only fires the catch arm if the promise rejects; if the promise hangs, neither arm runs.
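To make the gap concrete, here is a minimal sketch of the proposed wiring, written against an in-memory stand-in rather than the real runtime. Every name in it (ChildLike, FakeChild, trackTurn, failOnExit) is hypothetical and not taken from the OpenClaw or acpx source; it only illustrates that a turn promise stays pending unless the child's exit is also allowed to settle it.

```typescript
// Sketch only. ChildLike, FakeChild and trackTurn are hypothetical names,
// not identifiers from the OpenClaw or acpx source.
type TurnState = "pending" | "resolved" | "rejected";
type TerminalKind = "done" | "error";

interface ChildLike {
  onTerminalEvent(cb: (kind: TerminalKind) => void): void;
  onExit(cb: (code: number | null) => void): void;
}

// In-memory stand-in for a spawned agent, so the sketch runs without a real process.
class FakeChild implements ChildLike {
  private terminalCbs: Array<(kind: TerminalKind) => void> = [];
  private exitCbs: Array<(code: number | null) => void> = [];
  onTerminalEvent(cb: (kind: TerminalKind) => void): void { this.terminalCbs.push(cb); }
  onExit(cb: (code: number | null) => void): void { this.exitCbs.push(cb); }
  emitTerminal(kind: TerminalKind): void { this.terminalCbs.forEach((cb) => cb(kind)); }
  emitExit(code: number | null): void { this.exitCbs.forEach((cb) => cb(code)); }
}

function trackTurn(
  child: ChildLike,
  failOnExit: boolean,
): { promise: Promise<void>; state: () => TurnState } {
  let state: TurnState = "pending";
  const promise = new Promise<void>((resolve, reject) => {
    child.onTerminalEvent((kind) => {
      if (state !== "pending") return;
      if (kind === "done") {
        state = "resolved";
        resolve();
      } else {
        state = "rejected";
        reject(new Error("agent reported a terminal error"));
      }
    });
    if (failOnExit) {
      child.onExit((code) => {
        if (state !== "pending") return; // a terminal event already settled the turn
        state = "rejected";
        reject(new Error(`child exited (code ${code}) before a terminal ACP event`));
      });
    }
  });
  // Swallow the rejection so the demo stays quiet; a real caller would await
  // the promise inside its try/catch so the catch arm can run.
  promise.catch(() => {});
  return { promise, state: () => state };
}

// Behaviour described in this report: exit is not wired to the turn.
const hangingChild = new FakeChild();
const hangingTurn = trackTurn(hangingChild, false);
hangingChild.emitExit(1);

// Proposed behaviour: exit before a terminal event rejects the turn.
const fixedChild = new FakeChild();
const fixedTurn = trackTurn(fixedChild, true);
fixedChild.emitExit(1);

console.log(hangingTurn.state(), fixedTurn.state()); // prints: pending rejected
```

The guard on state keeps the exit handler from overriding a turn that already ended normally, so a clean exit after a done message is left alone.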

OpenClaw version

2026.4.5 (@openclaw/acpx extension at the same version)

Operating system

Ubuntu 24.04.4 LTS, kernel 6.8.0-110-generic

Install method

Docker (Compose-managed). Custom fellowship-openclaw:latest image built from a Dockerfile extending the OpenClaw base, container

Model

claude-sonnet-4-6 & claude-opus-4-7

Provider / routing chain

claude-sonnet-4-6 & claude-opus-4-7

Additional provider/model setup details

No response

Logs, screenshots, and evidence

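1. Registry entries for the two orphan sessions, both still at status: "running" (agents/<id>/sessions/sessions.json):
"agent:gollum:hook:<test-key-redacted>-20260505-103742": {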
  "abortedLastRun": false,
  "model": "claude-sonnet-4-6",
  "modelProvider": "corporate-claude-proxy",
  "sessionId": "f2fc280a-de06-4be7-ad69-80f0f83daa72",
  "skillsSnapshot": { /* dict, omitted */ },
  "startedAt": 1777977516415,
  "status": "running",
  "systemSent": true,
  "updatedAt": 1777977516469
}

"agent:bilbo:hook:<dispatch-key-redacted>-20260508-082910": {
  "abortedLastRun": false,
  "model": "claude-opus-4-7",
  "modelProvider": "corporate-claude-proxy",
  "sessionId": "9e77164b-a44d-4109-a02e-b1462a335aba",
  "skillsSnapshot": { /* dict, omitted */ },
  "startedAt": 1778228951771,
  "status": "running",
  "systemSent": true,
  "updatedAt": 1778228951850
}

The gollum orphan was stuck for ~3 days (startedAt 2026-05-05 until manual cleanup 2026-05-08). The bilbo orphan was stuck ~33+ minutes from dispatch until detection. Both have systemSent: true, normal model / modelProvider, and no corresponding <sessionId>.jsonl file in the agent's sessions directory, i.e. registration completed, but the lifecycle-end half never fired.

2. Gateway log line for the bilbo dispatch: dispatch_accepted only, no terminating event (gateway / fellowship-transform plugin log):

2026-05-08T08:29:10.472+00:00 {"time":"2026-05-08T08:29:10.471Z","level":"info","component":"fellowship-transform","event":"dispatch_accepted","agentId":"bilbo","sessionKey":"hook:<dispatch-key-redacted>-20260508-082910","hasDelivery":false,"timeoutSeconds":1500}

Searching the full gateway log for any subsequent matching session-key / agent-runId terminating event (phase: "end", phase: "error", exit, error, completion, abort) returns zero results. The dispatch was accepted; nothing ever closed it out.

3. Empirical non-blocking proof: fresh sessions completed against the same agent during the orphan window.

Listing agents/gollum/sessions/*.jsonl after the gollum orphan f2fc280a's startedAt:

f8157ed9-33e1-4d0e-9e38-7ebd4318acd8.jsonl
f809b447-0434-4796-8034-16a153e6ee22.jsonl
5c4c7e4c-94b7-4891-8446-fdae27dcf038.jsonl
7cd7bc68-69d8-4286-9d2a-7e016cb0e46b.jsonl
b6ade4b8-ac12-4256-95a0-0ea8cf814167.jsonl
a54a5bde-0d46-4829-b626-5da2d4c51162.jsonl
212dfa30-6b2d-4400-af7b-179156b14c67.jsonl
246a7c8e-7363-4821-9be0-686c01cf83b1.jsonl
9ffe93d9-7c58-41ec-a3de-5e24aa4853c1.jsonl
41844a48-bad6-4bd7-850d-03cd6aa58d59.jsonl

7 of these were created after the orphan's startedAt (rest are unrelated). Each is a complete agent session jsonl. So OpenClaw's lane mechanism did not refuse new dispatches against gollum despite f2fc280a being marked running. Same observation for bilbo: a fresh dispatch (sessionId 184acebb...) ran cleanly while 9e77164b was still running.

4. Externally-detectable signature from our boundary watchdog:

Our PocketBase cron checks every 2 minutes for status === "running" + startedAt > 90s ago + no <sessionId>.jsonl file present. Detection log:

2026/05/08 09:08:08 WATCHDOG agent_session_orphan: bilbo 9e77164b (age=39min, no jsonl) — alerted
2026/05/08 09:08:10 WATCHDOG agent_session_orphan: gollum f2fc280a (age=4229min, no jsonl) — alerted
2026/05/08 09:08:11 CRON agent_session_orphan_watchdog: detected 2 new orphan(s) this tick

This watchdog is an external workaround in our orchestration; it confirms that orphans are reliably distinguishable from real running sessions by the artifacts on disk, but does not replace the upstream lifecycle fix.

5. Recovery procedure (verified): stop the OpenClaw container, edit each affected agents/<id>/sessions/sessions.json to mark the orphan entry's status: "failed" (with a failureReason field for audit trail), restart the container. Plain restart without the edit does NOT cull stale running entries; this was verified by restarting OpenClaw twice with the orphans still present in the registry both times.
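
The registry edit in the recovery procedure can be sketched as a pure function over the parsed sessions.json contents. This is a sketch under assumptions: the SessionEntry shape only models the fields visible in the entries above plus the failureReason field mentioned in the procedure, the function name markOrphansFailed and the reason string are hypothetical, and reading and writing the file while the container is stopped is left to the operator.

```typescript
// Sketch only: markOrphansFailed and the example reason string are hypothetical.
// Only sessionId, status and failureReason are modelled; other fields pass through.
interface SessionEntry {
  sessionId: string;
  status: string;
  failureReason?: string;
  [other: string]: unknown;
}
type SessionRegistry = { [sessionKey: string]: SessionEntry };

// Returns a patched copy of the registry; the input object is not mutated.
function markOrphansFailed(
  registry: SessionRegistry,
  orphanSessionIds: string[],
  reason: string,
): SessionRegistry {
  const patched: SessionRegistry = {};
  for (const key of Object.keys(registry)) {
    const entry = registry[key];
    const isTarget =
      orphanSessionIds.indexOf(entry.sessionId) !== -1 && entry.status === "running";
    patched[key] = isTarget
      ? { ...entry, status: "failed", failureReason: reason }
      : entry;
  }
  return patched;
}

// Example: one orphan from this report plus a made-up live session.
const registryBefore: SessionRegistry = {
  "agent:gollum:hook:orphan-example": {
    sessionId: "f2fc280a-de06-4be7-ad69-80f0f83daa72",
    status: "running",
    startedAt: 1777977516415,
  },
  "agent:gollum:hook:live-example": {
    sessionId: "live-session-example",
    status: "running",
    startedAt: 1778228000000,
  },
};

const registryAfter = markOrphansFailed(
  registryBefore,
  ["f2fc280a-de06-4be7-ad69-80f0f83daa72"],
  "child exited before terminal ACP event (manual cleanup)",
);
console.log(JSON.stringify(registryAfter, null, 2));
```

Matching on sessionId rather than on status alone avoids failing sessions that are genuinely still running at the time of the edit.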

Impact and severity

Affected users / systems / channels

Operators of OpenClaw deployments who: (a) dispatch agents via the gateway hooks endpoint, (b) run agent backends that spawn subprocesses speaking ACP over stdio, and (c) inspect sessions.json to reason about agent state. Specifically observed on: a single Docker-Compose-managed install of OpenClaw 2026.4.5 with agents configured against a custom OpenAI-completions provider that wraps the Anthropic claude CLI (the corporate-claude-proxy described in the routing chain above). Whether this affects deployments using different model providers, different agent backends, or different install methods: NOT_ENOUGH_INFO (we have only this deployment's data).

Severity

Low–medium. Observability / lifecycle-correctness, not operational deadlock.

Not a blocker for new dispatches: verified empirically. A gollum orphan sat at status: "running" for 3 days; in that window 7 fresh gollum jsonl sessions ran successfully against the same agent. A bilbo orphan was running when a subsequent bilbo dispatch ran cleanly minutes later. OpenClaw's lane mechanism does not appear to consult the registry status field for new-dispatch admission.

Severity for downstream tools that consult sessions.json to answer "is agent X busy?" or "what's currently running?": NOT_ENOUGH_INFO. Depends on the consumer; we do not consume sessions.json for that purpose, so we cannot speak to that case.

Frequency

Observed: 2 orphan sessions in our deployment over a multi-day window — gollum f2fc280a stuck for ~3 days, bilbo 9e77164b stuck ~33+ minutes before manual cleanup. Both arose at or shortly after windows where the local model-provider proxy was returning 401 because the upstream OAuth access token had expired. True rate (per dispatch, per day, per agent, etc.): NOT_ENOUGH_INFO. 2 data points isn't enough to extrapolate. Whether the bug fires in the absence of an upstream auth failure: NOT_ENOUGH_INFO (correlation with auth failure observed; lifecycle-invariant gap is independent of trigger but we have not provoked it via other triggers).
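
The startedAt and updatedAt values in the two registry entries are epoch milliseconds, so the timeline behind these observations can be cross-checked with a few lines of arithmetic. The sketch below uses only the values shown in the evidence section; the helper name toIso and the field name msUntilLastUpdate are illustrative.

```typescript
// Decode the registry timestamps quoted in the evidence section.
// toIso and msUntilLastUpdate are illustrative names, not OpenClaw fields.
function toIso(epochMs: number): string {
  return new Date(epochMs).toISOString();
}

const orphanTimestamps = [
  { agent: "gollum", startedAt: 1777977516415, updatedAt: 1777977516469 },
  { agent: "bilbo", startedAt: 1778228951771, updatedAt: 1778228951850 },
];

const orphanTimeline = orphanTimestamps.map((e) => ({
  agent: e.agent,
  startedAtIso: toIso(e.startedAt),
  msUntilLastUpdate: e.updatedAt - e.startedAt,
}));

console.log(orphanTimeline);
// gollum: startedAtIso 2026-05-05T10:38:36.415Z, msUntilLastUpdate 54
// bilbo:  startedAtIso 2026-05-08T08:29:11.771Z, msUntilLastUpdate 79
```

Both entries were last touched well under 100 ms after they started, which fits the picture of registration completing and nothing writing to the entry afterwards. The bilbo startedAt also lands about 1.3 s after the dispatch_accepted log line (08:29:10.471Z).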

Practical consequence

Stale running registry entries that never transition to terminal status without operator action. No failure signal back to the dispatcher when an agent backend exits before producing a terminal ACP event. The dispatcher sees dispatch_accepted in the gateway log and then nothing further. Whatever orchestration depends on "this dispatch will eventually produce a result" is left waiting silently.

Recovery in our deployment: (1) stop the OpenClaw container, (2) edit the affected agents/<id>/sessions/sessions.json to mark the orphan entry terminal (status: "failed" + failureReason), (3) restart. We verified that a plain container restart does NOT cull stale running entries; they persist.

Operational consequence for our specific orchestration: an external PocketBase cron now detects the externally-visible signature (status === "running" + age > 90s + no <sessionId>.jsonl) and alerts. Without that watchdog, dispatches that hit this bug would be invisible to the orchestrator: agents would appear "still working" forever.

Whether this causes missed user-facing messages, failed onboarding, extra cost, etc., in other deployments: NOT_ENOUGH_INFO. For our deployment specifically, the only practical cost so far has been operator time for cleanup.
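
The watchdog signature described above reduces to a three-part predicate. The sketch below is not the PocketBase cron itself, which is not shown in this report; the names isOrphan and findOrphans, the RegistryEntry shape, and the hasJsonl callback are assumptions made for illustration, with the file-existence check injected so the logic can be exercised without touching disk.

```typescript
// Sketch of the orphan signature: status "running" + older than 90 s + no jsonl.
// isOrphan, findOrphans and the hasJsonl callback are hypothetical names.
interface RegistryEntry {
  sessionId: string;
  status: string;
  startedAt: number; // epoch milliseconds, as in sessions.json
}

const MIN_ORPHAN_AGE_MS = 90 * 1000; // the 90 s threshold quoted in the report

function isOrphan(
  entry: RegistryEntry,
  nowMs: number,
  hasJsonl: (sessionId: string) => boolean,
): boolean {
  return (
    entry.status === "running" &&
    nowMs - entry.startedAt > MIN_ORPHAN_AGE_MS &&
    !hasJsonl(entry.sessionId)
  );
}

function findOrphans(
  entries: RegistryEntry[],
  nowMs: number,
  hasJsonl: (sessionId: string) => boolean,
): RegistryEntry[] {
  return entries.filter((entry) => isOrphan(entry, nowMs, hasJsonl));
}

// Example tick: the bilbo orphan from this report plus two made-up sessions.
const tickNowMs = 1778228951771 + 39 * 60 * 1000;
const sessionsWithJsonl = ["healthy-session-example"];
const jsonlExists = (id: string): boolean => sessionsWithJsonl.indexOf(id) !== -1;

const tickEntries: RegistryEntry[] = [
  { sessionId: "9e77164b-a44d-4109-a02e-b1462a335aba", status: "running", startedAt: 1778228951771 },
  { sessionId: "healthy-session-example", status: "running", startedAt: tickNowMs - 10 * 60 * 1000 },
  { sessionId: "just-started-example", status: "running", startedAt: tickNowMs - 30 * 1000 },
];

const detectedOrphans = findOrphans(tickEntries, tickNowMs, jsonlExists);
console.log(detectedOrphans.map((e) => e.sessionId));
```

The age threshold keeps a session that has only just been registered, and has not yet had a chance to write its jsonl file, from being flagged.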

Additional information

No response
