openclaw: [Bug]: MissingAgentHarnessError on inbound dispatch under event-loop starvation; harness IS registered (2026.5.22, distinct root cause from #86227)

openclaw, 2026-05-25 00:06:03

Error Message

[diagnostic] message dispatch completed:
  channel=telegram sessionId=unknown
  sessionKey=agent:main:telegram:direct:<user>
  source=replyResolver outcome=error duration=17547ms
  error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[diagnostic] message processed:
  channel=telegram chatId=<chat> messageId=<id>
  sessionId=unknown sessionKey=agent:main:telegram:direct:<user>
  outcome=error duration=19617ms
  error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[telegram] dispatch failed: MissingAgentHarnessError:
  Requested agent harness "claude-cli" is not registered.

Root Cause

This is distinct from #86227 (lazy cold-boot registration). Same surface error, different root cause. Cross-ref below.

Fix / Workaround

Three consecutive inbound Telegram messages over a 3-minute window failed with MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered., yet the harness was registered the entire time: cron-triggered cli exec calls fired in the same minutes used provider=claude-cli successfully. The failures coincide with event-loop starvation (eventLoopUtilization=0.999, eventLoopDelayP99Ms=7159.7, and a fetch-timeout log line explicitly annotated "likely event-loop starvation"). The dispatch resolver's harness lookup appears to be racing or timing out under load, and the failure is then reported with a misleading error message.



Environment

OpenClaw: 2026.5.22 (a374c3a) — gateway uptime 7h 3m at incident start
Backend: cliBackends.claude-cli pointing at Anthropic's Claude Code CLI binary
Channel: Telegram (polling mode), single DM lane
Agent: single agents.list[main] with model claude-cli/claude-opus-4-7, fallbacks claude-cli/claude-sonnet-4-6, claude-cli/claude-haiku-4-5
Host: single-VPS Linux deployment, no container
Concurrent activity: multiple parallel cron-triggered cli exec calls + a long-running resume-session opus turn

TL;DR

This is distinct from #86227 (lazy cold-boot registration). Same surface error, different root cause. Cross-ref below.

Symptom

After 7h of healthy operation, three consecutive inbound messages within a 3-minute window all failed identically:

[diagnostic] message dispatch completed:
  channel=telegram sessionId=unknown
  sessionKey=agent:main:telegram:direct:<user>
  source=replyResolver outcome=error duration=17547ms
  error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[diagnostic] message processed:
  channel=telegram chatId=<chat> messageId=<id>
  sessionId=unknown sessionKey=agent:main:telegram:direct:<user>
  outcome=error duration=19617ms
  error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[telegram] dispatch failed: MissingAgentHarnessError:
  Requested agent harness "claude-cli" is not registered.

Per-failure dispatch time was ~17–28 seconds before erroring — the resolver was clearly waiting on something before giving up. User received the canned "Something went wrong" reply for each.

Timeline (UTC)

T-7h        gateway boot (2026.5.22), healthy
T+0:00:00   inbound msg A → 19.6s wait → MissingAgentHarnessError, canned reply
T+0:01:13   inbound msg B → 20.7s wait → MissingAgentHarnessError, canned reply
T+0:02:24   inbound msg C → 27.9s wait → MissingAgentHarnessError, canned reply
            ↳ same second: [fetch-timeout] fetch timeout after 10000ms
              (elapsed 16117ms) timer delayed 6117ms, likely event-loop starvation
              operation=fetchWithTimeout url=https://api.telegram.org/.../getMe
T+0:03:25   inbound msg D → SUCCESS (next turn drained, normal dispatch)

Total user-visible failures: 3. Self-healed without intervention.
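The fetch-timeout line in the timeline is internally consistent: a 10000ms timer that fired after 16117ms was delayed by 6117ms. The following is a minimal sketch of how such an annotation could be derived. The function name `describeTimerDelay` and the 1000ms starvation threshold are hypothetical and are not taken from OpenClaw's code.

```typescript
// Hypothetical helper, not OpenClaw's actual fetch-timeout implementation.
interface TimerDelayReport {
  delayMs: number;        // how much later than scheduled the timer fired
  likelyStarved: boolean; // true when the delay exceeds the assumed threshold
}

function describeTimerDelay(
  timeoutMs: number,
  elapsedMs: number,
  starvationThresholdMs = 1000, // assumed cut-off for "likely event-loop starvation"
): TimerDelayReport {
  // A timer can only fire late, never early, so clamp at zero.
  const delayMs = Math.max(0, elapsedMs - timeoutMs);
  return { delayMs, likelyStarved: delayMs >= starvationThresholdMs };
}

// Values from the timeline above: timeout 10000ms, elapsed 16117ms.
const timerReport = describeTimerDelay(10000, 16117);
console.log(timerReport.delayMs, timerReport.likelyStarved); // 6117 true
```

A delay of this size means the loop could not run a due timer callback for about six seconds, which matches the multi-second p99 loop delay reported for the same window.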

Smoking gun: harness was registered throughout

Parallel cron-triggered cli exec calls during the same failure window succeeded with provider=claude-cli:

T+0:01:30   [agent/cli-backend] cli exec: provider=claude-cli model=claude-haiku-4-5
            promptChars=1690 trigger=cron useResume=false ...
T+0:01:47   [agent/cli-backend] claude live session turn: provider=claude-cli
            model=claude-haiku-4-5 durationMs=16553 rawLines=76
T+0:03:12   [agent/cli-backend] cli exec: provider=claude-cli model=sonnet
            promptChars=404 trigger=cron useResume=false ...
T+0:03:19   [agent/cli-backend] claude live session turn: provider=claude-cli
            model=claude-sonnet-4-6 durationMs=6843 rawLines=54

If the registry were actually empty, these would also fail. They did not. The harness was present; only the inbound-message dispatch resolver path could not see it.

Concurrent load (what was starving the loop)

Just prior to the failure window, a long resume-session opus turn returned:

[agent/cli-backend] claude live session turn: provider=claude-cli
  model=claude-opus-4-7 durationMs=205933 rawLines=2079

That turn ran for 3.4 minutes and emitted 2,079 raw event lines. Combined with simultaneous cron-driven cli execs (model-prewarm + scheduled scans) and the resolver's own work, the loop was at 99.9% utilization with 7s tail latencies. The dispatch resolver's harness lookup, whatever form it takes, lost the race.

What's wrong

1. The error message is structurally misleading

MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered. is identical to #86227's lazy-cold-boot case where the registry was actually empty. Here the registry is fine — but the operator and the autonomous agent triaging this incident have to reverse-engineer that distinction from interleaved log timestamps. The resolver should distinguish:

"harness is genuinely absent from the registry"
"harness lookup timed out / threw / lost a race"

These are different bugs with different fixes. Conflating them in the error string makes monitoring and alerting useless (same error → wildly different remediation).
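One way the two cases could be separated is with distinct error classes that monitoring can route on. This is a sketch under assumptions: only `MissingAgentHarnessError` appears in the logs, so `HarnessNotRegisteredError`, `HarnessLookupTimeoutError`, and `alertRoute` are hypothetical names for illustration.

```typescript
// Hypothetical error classes; the real OpenClaw error hierarchy is not shown in this issue.
class HarnessNotRegisteredError extends Error {
  readonly harnessId: string;
  constructor(harnessId: string) {
    super(`Agent harness "${harnessId}" is absent from the registry.`);
    this.name = "HarnessNotRegisteredError";
    this.harnessId = harnessId;
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class HarnessLookupTimeoutError extends Error {
  readonly harnessId: string;
  readonly waitedMs: number;
  constructor(harnessId: string, waitedMs: number) {
    super(`Lookup of agent harness "${harnessId}" did not complete within ${waitedMs}ms.`);
    this.name = "HarnessLookupTimeoutError";
    this.harnessId = harnessId;
    this.waitedMs = waitedMs;
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Different error classes lead to different remediation paths.
function alertRoute(err: Error): "fix-config" | "check-load" | "unknown" {
  if (err instanceof HarnessNotRegisteredError) return "fix-config";
  if (err instanceof HarnessLookupTimeoutError) return "check-load";
  return "unknown";
}

console.log(alertRoute(new HarnessNotRegisteredError("claude-cli")));      // fix-config
console.log(alertRoute(new HarnessLookupTimeoutError("claude-cli", 500))); // check-load
```

With this split, an alert rule keyed on the class name no longer has to infer the cause from interleaved log timestamps.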

2. The resolver should not be losable under event-loop pressure

Whatever mechanism the dispatch resolver uses to look up a declared agent harness (Map lookup, promise await, registry RPC), this incident shows it is not robust to a saturated event loop. A registered harness should not become temporarily unfindable just because something else is keeping the loop busy. If the path is await someAsyncRegistryRead() with an implicit timeout, that's the bug; if it's racing the harness reference against a concurrent rebuild, that's the bug.
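For contrast, a lookup that cannot lose a race looks roughly like the sketch below: a plain in-memory Map read with no await between the request and the answer. The `HarnessRegistry` class and the `AgentHarness` shape are hypothetical, since the issue does not show the real registry.

```typescript
// Hypothetical registry; the actual OpenClaw registry shape is not shown in this issue.
interface AgentHarness {
  id: string;
}

class HarnessRegistry {
  private readonly harnesses = new Map<string, AgentHarness>();

  register(harness: AgentHarness): void {
    this.harnesses.set(harness.id, harness);
  }

  // Synchronous read: there is no await, so no timer can expire and no other
  // task can interleave between the request and the result.
  resolve(id: string): AgentHarness | undefined {
    return this.harnesses.get(id);
  }
}

const registry = new HarnessRegistry();
registry.register({ id: "claude-cli" });
console.log(registry.resolve("claude-cli")?.id); // claude-cli
console.log(registry.resolve("not-declared"));   // undefined
```

A busy loop can delay when this lookup runs, but it cannot change its answer: `undefined` here only ever means the harness was never registered.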

3. 17–28s of wall-clock time per failure is itself a problem

Each failing dispatch consumed 17–28 seconds of wall-clock before erroring. During an already-starved loop, three of those stacked on top of the existing pressure — actively delaying recovery rather than fast-failing and letting the loop drain. Whatever the resolver is awaiting, it needs a much tighter timeout (sub-second) with a structured retry, not a 20s blind wait followed by a misleading error.
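A tight deadline with a bounded retry could be sketched as follows. The names `withDeadline`, `lookupWithRetry`, and `LookupDeadlineError`, the 500ms deadline, and the three-attempt budget are all assumptions chosen for illustration.

```typescript
// Hypothetical sketch: names, deadline, and retry policy are assumptions.
class LookupDeadlineError extends Error {
  constructor(deadlineMs: number) {
    super(`harness lookup exceeded ${deadlineMs}ms`);
    this.name = "LookupDeadlineError";
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Rejects with LookupDeadlineError if `work` has not settled within deadlineMs.
function withDeadline<T>(work: Promise<T>, deadlineMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new LookupDeadlineError(deadlineMs)), deadlineMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err: unknown) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Retries only on deadline expiry; any other failure is rethrown immediately.
async function lookupWithRetry<T>(
  lookup: () => Promise<T>,
  attempts = 3,
  deadlineMs = 500,
): Promise<T> {
  let lastError: unknown = new Error("no lookup attempts were made");
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await withDeadline(lookup(), deadlineMs);
    } catch (err) {
      if (!(err instanceof LookupDeadlineError)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

With these defaults the nominal worst case is about 1.5s (3 attempts of 500ms) instead of 17–28s. One caveat: under event-loop starvation the `setTimeout` callback itself fires late, as the fetch-timeout line in the timeline shows, so a timer bounds the wait only as well as the loop allows.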

4. sessions.delete may be participating in the race

Three sessions.delete events fired in the minutes leading up to the failures (cleanup after each completed turn). After tonight's incident, the persistent claude-cli resumeSession <uuid> was gone — the next turn cold-started. Possibly the resolver is reading session state mid-cleanup. Worth investigating whether deletes can transiently invalidate a harness reference held by the resolver, even though the harness itself is registered.

Ask

1. Differentiate the error. Split MissingAgentHarnessError into at minimum two cases: "harness declared but absent from registry" (the #86227 case) and "harness lookup failed under load / timed out" (this case). Distinct error classes, distinct log prefixes, distinct user-facing handling.
2. Tighten the resolver's lookup path. If the dispatch resolver awaits anything, give it a sub-second timeout and a clear "retry vs fail" decision tree. Spending 28s blocked while reporting "not registered" is the worst of both worlds.
3. Make the resolver loop-pressure-tolerant. A registered harness should resolve from a synchronous in-memory reference. If today's code introduces async hops (RPC, file read, secret refresh) in the per-message dispatch hot path, lift those out of the hot path.
4. Audit sessions.delete for resolver races. If cleanup can invalidate a reference the resolver is mid-lookup against, fix the lifetime invariant.
5. Surface event-loop starvation to operators in real time. The [diagnostic] liveness warning log line already exists. When the loop is at >0.99 utilization with multi-second p99, downstream dispatch errors during that window should be tagged with that context (or correlated automatically in the operator alert).
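The last ask, tagging dispatch errors with loop context, could look roughly like this. The sketch is pure so it runs anywhere. In Node.js the snapshot values could come from `performance.eventLoopUtilization()` and `monitorEventLoopDelay()` in `node:perf_hooks` (the latter reports nanoseconds, so it would need converting). The field names mirror the metrics quoted in this issue, while `isStarved`, `tagDispatchError`, and both thresholds are hypothetical.

```typescript
// Hypothetical shapes and thresholds; only the metric names come from the issue.
interface LoopSnapshot {
  eventLoopUtilization: number; // fraction of time the loop was busy, 0..1
  eventLoopDelayP99Ms: number;  // 99th percentile loop delay in milliseconds
}

function isStarved(
  snapshot: LoopSnapshot,
  utilizationThreshold = 0.99, // ">0.99 utilization" from the ask above
  p99ThresholdMs = 1000,       // assumed meaning of "multi-second p99"
): boolean {
  return (
    snapshot.eventLoopUtilization > utilizationThreshold &&
    snapshot.eventLoopDelayP99Ms >= p99ThresholdMs
  );
}

// Appends loop context to an error message only when the loop was starved.
function tagDispatchError(message: string, snapshot: LoopSnapshot): string {
  if (!isStarved(snapshot)) return message;
  const util = snapshot.eventLoopUtilization.toFixed(3);
  const p99 = snapshot.eventLoopDelayP99Ms.toFixed(1);
  return `${message} [loop-starved util=${util} p99=${p99}ms]`;
}

// Values reported for this incident.
const tagged = tagDispatchError("dispatch failed", {
  eventLoopUtilization: 0.999,
  eventLoopDelayP99Ms: 7159.7,
});
console.log(tagged); // dispatch failed [loop-starved util=0.999 p99=7159.7ms]
```

An operator reading the tagged line can tell at a glance that the dispatch error occurred during starvation, without cross-referencing liveness warnings by timestamp.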

Cross-references

#86227: same error string, different root cause (cold-boot lazy registration). This issue is the second known way to produce MissingAgentHarnessError. Together they argue strongly for ask #1: the current error class is overloaded.
#86226: gateway-shutdown SyntaxError, co-filed with #86227 as the upstream cause of that 96-minute outage. Not related to tonight's incident (this gateway had been up for 7h at failure).
