openclaw: MissingAgentHarnessError after idle + concurrent CLI spawn pile-up (claude-cli harness)


MissingAgentHarnessError after idle period + concurrent CLI spawn pile-up

Environment

  • OpenClaw: 2026.5.22 (a374c3a)
  • Harness: claude-cli (Claude Code, model claude-opus-4-7)
  • Node: v22.22.1
  • OS: Linux 7.0.0-15-generic (Ubuntu/Debian-based)
  • Setup: 4 agents (main, aptagente, odooagente, prospector) running under a single gateway, each bound to its own Telegram bot via the Telegram provider.

Summary

After an agent session sits idle (~10 min with no inbound message), the next dispatch to that agent fails with:

MissingAgentHarnessError: harness "claude-cli" is not registered for agent <name>

A second, compounding problem then makes the gateway visibly degrade:

When messages arrive while the harness is in this broken state, OpenClaw still spawns claude subprocesses, but they never produce output and never complete. They sit there for minutes (one stayed alive 273s before being terminated). The gateway then refuses subsequent dispatches because zombies are holding slots/state, so a burst of user messages turns one failure into a cascade.

Reproduction

  1. Run OpenClaw 2026.5.22 with at least one agent backed by claude-cli.
  2. Have the agent answer a message, then leave it idle for ~10 minutes (no inbound).
  3. Send a new message via the bound Telegram bot.
  4. Observe: the message is acknowledged by the provider, but the harness errors with MissingAgentHarnessError. No reply is delivered.
  5. Send 2–3 more messages quickly. Each spawns a new claude --resume <sessionId> process that hangs indefinitely (visible in ps).

Observed evidence (2026-05-27 session)

Cadence of failures across a single day on this host:

Time (CEST)   Event
03:31         First harness failure after idle
04:01         Nightly gateway restart → recovered
07:10         Failure again
07:11         Watchdog auto-heal restart → recovered
07:19         Failure 8 min after auto-heal
08:55–09:14   5 restarts in 19 min while user was actively chatting
09:08–09:19   15 harness errors in 11 min
09:03         Claude CLI subprocess killed after 180s with no stdout

Snapshot of hung ("zombie") claude subprocesses, all children of the same gateway PID: multiple --resume invocations hanging concurrently across distinct session IDs, including a duplicate spawn for one session:

PID 66525  09:18  claude --resume ccb9f2bf...
PID 66594  09:19  claude --resume ccb9f2bf...   ← duplicate for same session
PID 65771  09:09  claude --resume 3b684074...   ← 10+ min, no output
PID 65817  09:09  claude --resume ddbf4b04...   ← 10+ min, no output
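A snapshot like the one above can be checked mechanically for sessions that have more than one live --resume child. The following is a minimal sketch, assuming each line has the shape shown in the listing (PID, start time, command). `parseSnapshot` and `findDuplicateResumes` are hypothetical helpers written for illustration and are not part of OpenClaw.

```typescript
// Hypothetical helpers: group `claude --resume <sessionId>` processes by session
// and report sessions that have more than one live child.
interface ResumeProc {
  pid: number;
  started: string;   // HH:MM as printed in the snapshot
  sessionId: string; // kept exactly as printed (may be a truncated prefix)
}

// Assumed line shape: "PID 66525  09:18  claude --resume ccb9f2bf..."
const SNAPSHOT_LINE = /^PID\s+(\d+)\s+(\d{2}:\d{2})\s+claude\s+--resume\s+([0-9a-f]+)/;

function parseSnapshot(text: string): ResumeProc[] {
  const procs: ResumeProc[] = [];
  for (const raw of text.split("\n")) {
    const m = SNAPSHOT_LINE.exec(raw.trim());
    if (m) {
      procs.push({ pid: Number(m[1]), started: m[2], sessionId: m[3] });
    }
  }
  return procs;
}

function findDuplicateResumes(procs: ResumeProc[]): Map<string, number[]> {
  const bySession = new Map<string, number[]>();
  for (const p of procs) {
    const pids = bySession.get(p.sessionId) ?? [];
    pids.push(p.pid);
    bySession.set(p.sessionId, pids);
  }
  // Keep only sessions with more than one child.
  const duplicates = new Map<string, number[]>();
  bySession.forEach((pids, sessionId) => {
    if (pids.length > 1) duplicates.set(sessionId, pids);
  });
  return duplicates;
}
```

Fed the four lines above, this would flag the session starting with ccb9f2bf as having two children, which is the duplicate-spawn symptom described in this report.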

claude --version and claude -p "ok" invoked directly from the same shell as the gateway return instantly. The hang only manifests for processes spawned by OpenClaw.
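Since the spawned children stay silent rather than exit, one way to bound the damage is to track the last stdout activity per child and flag anything silent past a deadline. The sketch below is an illustration under assumptions: `OutputDeadlineTracker` is a hypothetical class, timestamps are injected by the caller, and the 180-second figure in the usage note is taken from the kill observed in the table above rather than from any documented OpenClaw setting.

```typescript
// Hypothetical tracker, not OpenClaw code: flags children that have been
// silent on stdout for at least `deadlineMs`.
interface ChildState {
  pid: number;
  lastActivityMs: number; // spawn time until the first output arrives
}

class OutputDeadlineTracker {
  private children = new Map<number, ChildState>();
  private deadlineMs: number;

  constructor(deadlineMs: number) {
    this.deadlineMs = deadlineMs;
  }

  onSpawn(pid: number, nowMs: number): void {
    this.children.set(pid, { pid, lastActivityMs: nowMs });
  }

  onOutput(pid: number, nowMs: number): void {
    const child = this.children.get(pid);
    if (child) child.lastActivityMs = nowMs;
  }

  onExit(pid: number): void {
    this.children.delete(pid);
  }

  // PIDs that have exceeded the deadline, in spawn order.
  stalled(nowMs: number): number[] {
    const out: number[] = [];
    this.children.forEach((child) => {
      if (nowMs - child.lastActivityMs >= this.deadlineMs) out.push(child.pid);
    });
    return out;
  }
}
```

A caller would construct it with `new OutputDeadlineTracker(180_000)`, call `onSpawn`, `onOutput`, and `onExit` from the child-process event handlers, and poll `stalled(Date.now())` to decide which children to terminate and report.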

Memory under load reached 1.45 GB; after a restart with heartbeats disabled it dropped to 643 MB, which suggests the spawn pile-up is leaking memory or holding references.

Suspected cause

Two interacting bugs:

  1. Harness deregistration on idle. The claude-cli harness appears to be torn down per-agent after an idle window, but the agent itself remains "alive" from the dispatcher's perspective. The next dispatch finds an agent without a harness instead of either (a) re-registering it lazily or (b) returning a clean error that prevents subprocess spawn.

  2. Spawn without a registered harness. Even when the harness is missing, OpenClaw still launches claude --resume <sessionId> for the dispatch. These children never wire up to a working pipe and never exit, holding gateway state hostage. Concurrent inbound messages multiply the orphans.
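The two suspected bugs point at one missing check: resolve the harness before anything is spawned. The following is a minimal sketch of that guard, assuming a per-agent registry with a factory for lazy re-registration. Every name here (`HarnessRegistry`, `setFactory`, `evictIdle`, `resolve`, `dispatch`) is hypothetical and chosen for illustration; only the error message format is taken from the report, and none of this reflects OpenClaw's actual internals.

```typescript
// All names are hypothetical; this is not OpenClaw's actual API.
class MissingAgentHarnessError extends Error {
  constructor(harness: string, agent: string) {
    super(`harness "${harness}" is not registered for agent ${agent}`);
    this.name = "MissingAgentHarnessError";
  }
}

type Harness = { spawn: (sessionId: string) => string };
type HarnessFactory = () => Harness | undefined;

class HarnessRegistry {
  private live = new Map<string, Harness>();
  private factories = new Map<string, HarnessFactory>();

  private key(agent: string, harness: string): string {
    return `${agent}/${harness}`;
  }

  setFactory(agent: string, harness: string, factory: HarnessFactory): void {
    this.factories.set(this.key(agent, harness), factory);
  }

  // Simulates the idle teardown described in (1).
  evictIdle(agent: string, harness: string): void {
    this.live.delete(this.key(agent, harness));
  }

  resolve(agent: string, harness: string): Harness {
    const k = this.key(agent, harness);
    let h = this.live.get(k);
    if (!h) {
      h = this.factories.get(k)?.(); // (a) lazy re-registration
      if (!h) throw new MissingAgentHarnessError(harness, agent); // (b) clean failure
      this.live.set(k, h);
    }
    return h;
  }
}

function dispatch(reg: HarnessRegistry, agent: string, harness: string, sessionId: string): string {
  // Resolution happens first, so a missing harness can never reach spawn().
  return reg.resolve(agent, harness).spawn(sessionId);
}
```

The design point is ordering: because `resolve` either returns a usable harness or throws, there is no code path in which a subprocess is launched for an agent whose harness is missing.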

A secondary factor aggravates the issue: heartbeats configured at 30-minute intervals across 4 agents produce a constant trickle of spawns (about 8 heartbeat dispatches per hour in total) that compete with user messages, increasing the chance of hitting (1) and (2) at the same time.

Expected behavior

  • The harness should be lazily re-registered (or kept warm) so that an idle agent answers the first message after a quiet period exactly like the second.
  • If the harness genuinely cannot be registered, dispatch should fail loudly without spawning an orphan claude subprocess.
  • Concurrent dispatches to the same agent/session should be serialized, not spawn parallel --resume children.
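The serialization requirement in the last bullet can be illustrated with a per-session promise chain: each dispatch for a session starts only after the previous one for that session has settled, while different sessions still run in parallel. This is a sketch under assumptions; `SessionQueue` is a hypothetical name and says nothing about how OpenClaw schedules dispatches today.

```typescript
// Hypothetical per-session queue: tasks sharing a session ID run one at a time,
// in submission order. Tasks for different session IDs are independent.
class SessionQueue {
  private tails = new Map<string, Promise<void>>();

  run<T>(sessionId: string, task: () => Promise<T>): Promise<T> {
    // `prev` never rejects because every stored tail swallows errors.
    const prev = this.tails.get(sessionId) ?? Promise.resolve();
    const result = prev.then(task);
    const tail = result.then(
      () => undefined,
      () => undefined,
    );
    this.tails.set(sessionId, tail);
    // Drop the entry once this session's queue has drained.
    tail.then(() => {
      if (this.tails.get(sessionId) === tail) this.tails.delete(sessionId);
    });
    return result;
  }
}
```

With this shape, a burst of messages for one session would queue behind the in-flight dispatch instead of producing parallel --resume children, and a failed task would not block the ones queued after it.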

Workarounds applied locally

  • Disabled all per-agent HEARTBEAT.md (reduces baseline spawn rate).
  • Nightly gateway restart + a watchdog that detects MissingAgentHarnessError and triggers a restart with auto-heal.
  • After ~20 min on a clean gateway with heartbeats off: 0 harness errors. With heartbeats on and active chat: ~1 failure every 30–60 min.
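For anyone replicating the watchdog workaround, the core decision is small: match the error in a log line and restart, but not more often than some cooldown. The cooldown is an assumption added here, motivated by the 5 restarts in 19 minutes recorded above; it is not a description of the watchdog actually used. `HarnessWatchdog` and its parameters are hypothetical, and the restart action is injected by the caller.

```typescript
// Hypothetical watchdog policy: decides whether a log line should trigger a restart.
const HARNESS_ERROR = /MissingAgentHarnessError/;

class HarnessWatchdog {
  private lastRestartMs = Number.NEGATIVE_INFINITY;
  private cooldownMs: number;
  private restart: () => void;

  constructor(cooldownMs: number, restart: () => void) {
    this.cooldownMs = cooldownMs;
    this.restart = restart;
  }

  // Feed each log line with its timestamp in ms.
  // Returns true if this line triggered a restart.
  onLogLine(line: string, nowMs: number): boolean {
    if (!HARNESS_ERROR.test(line)) return false;
    if (nowMs - this.lastRestartMs < this.cooldownMs) return false; // still cooling down
    this.lastRestartMs = nowMs;
    this.restart();
    return true;
  }
}
```

A restart only masks the underlying bugs, so a policy like this is a stopgap; the cooldown mainly keeps a burst of errors from turning into a restart loop while a user is actively chatting.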

Additional notes

  • A benign-looking but recurring warning that appears alongside the issue: [diag] isolated polling spool drain failed: ENOENT — looks like a race between dispatch and spool cleanup. Not a functional failure on its own, but it shows up clustered around the harness errors.
  • Logs for sendMessage stopped appearing in journalctl after one of the isolated-polling architecture changes, which made these failures harder to diagnose from the user side (deliveries still happen, but the audit trail is thinner).

Happy to provide more logs, configs, or a minimal repro setup if it helps.
