openclaw - 💡(How to fix) Fix [Bug]: claude-cli harness registers lazily after gateway boot — silently drops inbound traffic with MissingAgentHarnessError for 96 minutes

Error Message

[diagnostic] message dispatch completed:
  channel=telegram sessionId=unknown
  sessionKey=agent:main:telegram:direct:<user>
  source=replyResolver outcome=error duration=17696ms
  error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[diagnostic] message processed:
  channel=telegram chatId=<chat> messageId=<id>
  sessionId=unknown sessionKey=agent:main:telegram:direct:<user>
  outcome=error duration=19833ms
  error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[telegram] dispatch failed: MissingAgentHarnessError:
  Requested agent harness "claude-cli" is not registered.

Environment

  • OpenClaw: 2026.5.4 (gateway process that exhibited the bug)
  • Backend: cliBackends.claude-cli pointing at Anthropic's Claude Code CLI binary
  • Channel: Telegram (polling mode), single DM lane
  • Agent: single agents.list[main] with model claude-cli/claude-opus-4-7, fallbacks claude-cli/claude-sonnet-4-6, claude-cli/claude-haiku-4-5
  • Plugins enabled: anthropic, browser, canvas, device-pair, file-transfer, memory-core, phone-control, slack, talk-voice, telegram (10 total)
  • Host: single-VPS Linux deployment, no container

TL;DR

After a gateway restart, the claude-cli agent harness was absent from the dispatch registry for ~96 minutes, even though the gateway reported all plugins healthy and listening. Every inbound user message during that window failed with MissingAgentHarnessError, and the user got a generic "Something went wrong" reply. The state self-healed on the fifth inbound message, which arrived 32 seconds after the fourth. This strongly suggests that the harness registers lazily on a dispatch attempt rather than synchronously at boot, and that registration lost the race at boot (16:57) and only won it at 18:33, for a reason not yet identified. Either way, no log line warned that a critical harness was missing, and the gateway considered itself healthy throughout.

Symptom

After gateway restart at T=00:00:08 (process up and listening), every inbound Telegram message for the next 96 minutes failed identically:

[diagnostic] message dispatch completed:
  channel=telegram sessionId=unknown
  sessionKey=agent:main:telegram:direct:<user>
  source=replyResolver outcome=error duration=17696ms
  error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[diagnostic] message processed:
  channel=telegram chatId=<chat> messageId=<id>
  sessionId=unknown sessionKey=agent:main:telegram:direct:<user>
  outcome=error duration=19833ms
  error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[telegram] dispatch failed: MissingAgentHarnessError:
  Requested agent harness "claude-cli" is not registered.

Each failure took ~17–19 s of elapsed time before erroring (this is the duration field in the log, not measured CPU time), which suggests the dispatch path was retrying or waiting on something internal before giving up. Telegram still sent the canned "Something went wrong while processing your request. Please try again." back to the user each time, so from the user's perspective the bot looked online and responsive, just broken.

The gateway, meanwhile, was reporting itself healthy:

[gateway] http server listening (10 plugins: anthropic, browser, canvas,
  device-pair, file-transfer, memory-core, phone-control, slack, talk-voice,
  telegram; 6.5s)

No warn/error line was emitted about the missing harness.
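
Until the gateway warns about this itself, an operator could scan the journal for the failure signature. The following is a minimal sketch, assuming the key=value layout seen in the excerpt above is stable; parseDiagnostic and isMissingHarness are hypothetical helpers, not part of OpenClaw.

```typescript
// Hypothetical helper: the field names come from the log excerpt in this
// report, not from any documented OpenClaw log schema.
interface DiagnosticEvent {
  outcome: string | undefined;
  durationMs: number | undefined;
  error: string | undefined;
}

function parseDiagnostic(entry: string): DiagnosticEvent {
  // Join wrapped continuation lines so the entry parses as one record.
  const flat = entry.replace(/\s*\n\s*/g, " ");
  const outcome = /\boutcome=(\S+)/.exec(flat)?.[1];
  const duration = /\bduration=(\d+)ms/.exec(flat)?.[1];
  // The error value contains nested quotes, so match greedily to the last quote.
  const error = /\berror="(.*)"/.exec(flat)?.[1];
  return {
    outcome,
    durationMs: duration === undefined ? undefined : Number(duration),
    error,
  };
}

function isMissingHarness(event: DiagnosticEvent): boolean {
  return (
    event.outcome === "error" &&
    (event.error ?? "").startsWith("MissingAgentHarnessError")
  );
}

const sample = `[diagnostic] message dispatch completed:
  channel=telegram sessionId=unknown
  sessionKey=agent:main:telegram:direct:<user>
  source=replyResolver outcome=error duration=17696ms
  error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."`;

const parsed = parseDiagnostic(sample);
```

Feeding each journal entry through isMissingHarness and alerting on the first match would have surfaced this outage at the first failed message instead of 96 minutes later.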

Timeline (UTC, anonymized; 4 messages affected)

T=00:00:00  prior gateway shutdown errored (separate issue, filed as peer)
T=00:00:08  new gateway PID listening, 10 plugins reported, healthy
T=00:00:36  inbound message #1  → MissingAgentHarnessError, 19s, canned reply sent
T=00:02:36  inbound message #2  → MissingAgentHarnessError, 19s, canned reply sent
T=00:24:02  inbound message #3  → MissingAgentHarnessError, 19s, canned reply sent
T=01:36:15  inbound message #4  → MissingAgentHarnessError, 19s, canned reply sent
T=01:36:47  inbound message #5  → SUCCESS, cli-backend cold-started
T=01:36:56  [agent/cli-backend] cli exec: provider=claude-cli model=opus
            promptChars=447 trigger=user useResume=false session=none
            resumeSession=none reuse=none historyPrompt=none
T=01:37:16  [agent/cli-backend] claude live session turn: provider=claude-cli
            model=claude-opus-4-7 durationMs=19782 rawLines=53

Total user-visible outage: 96 minutes. 4 user messages got the canned error before the gateway healed itself.
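
The 96-minute figure can be checked against the timeline offsets. This sketch assumes the outage window runs from the moment the new gateway was listening to the arrival of the first successful message; offsetSeconds is a hypothetical helper.

```typescript
// Convert a relative "HH:MM:SS" timeline offset into seconds.
function offsetSeconds(offset: string): number {
  const parts = offset.split(":").map(Number);
  if (parts.length !== 3 || parts.some((p) => Number.isNaN(p))) {
    throw new Error(`not an HH:MM:SS offset: ${offset}`);
  }
  const [h = 0, m = 0, s = 0] = parts;
  return h * 3600 + m * 60 + s;
}

const listening = offsetSeconds("00:00:08"); // new gateway PID listening
const firstSuccess = offsetSeconds("01:36:47"); // inbound message #5
const outageSeconds = firstSuccess - listening; // 5799 seconds
const outageMinutes = Math.floor(outageSeconds / 60); // 96 whole minutes
```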

What went wrong

Two distinct problems, both worth fixing:

1. Harness registration is lazy / racy, not synchronous-at-boot

The gateway reports http server listening (10 plugins: ..., 6.5s) before the claude-cli agent harness has actually registered into the dispatch registry. Dispatch then races: if the first inbound arrives before registration completes, dispatch hits the "harness not registered" branch and the user sees an error. After 96 minutes of dispatches failing this way on this host, the next inbound succeeded — which means something about that 5th dispatch (or the slot of time it ran in) finally caused the registration to complete. The cli exec ... session=none resumeSession=none reuse=none log line on the first successful exec confirms it was a cold cli-backend start, consistent with the registry having been empty until that moment.

The exact race condition needs investigation — possible candidates: harness register() waiting on an async secret-resolution that never completed on the first 4 dispatches, or a one-shot retry that exhausted retries on each dispatch without registering. But the fix is the same regardless: the gateway should not declare itself ready to dispatch until the registry is populated with every declared agent harness.
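
One way to express that ordering is to await every harness loader before announcing readiness. This is only a sketch of the idea under stated assumptions: Harness, HarnessRegistry, bootGateway, and the loader callbacks are hypothetical names, and the real OpenClaw plugin init may be structured quite differently.

```typescript
// Hypothetical sketch: Harness, HarnessRegistry, and bootGateway are
// illustrative names, not OpenClaw APIs.
interface Harness {
  id: string;
}

class HarnessRegistry {
  private readonly harnesses = new Map<string, Harness>();

  register(harness: Harness): void {
    this.harnesses.set(harness.id, harness);
  }

  has(id: string): boolean {
    return this.harnesses.has(id);
  }
}

async function bootGateway(
  declared: readonly string[],
  loaders: ReadonlyArray<() => Promise<Harness>>,
  announceListening: () => void,
): Promise<HarnessRegistry> {
  const registry = new HarnessRegistry();

  // Await every loader, so no registration is still pending at announce time.
  const loaded = await Promise.all(loaders.map((load) => load()));
  for (const harness of loaded) {
    registry.register(harness);
  }

  // Treat a declared-but-unregistered harness as a fatal boot error.
  const missing = declared.filter((id) => !registry.has(id));
  if (missing.length > 0) {
    throw new Error(`declared harness not registered: ${missing.join(", ")}`);
  }

  announceListening(); // readiness is announced only after the check passes
  return registry;
}
```

With this shape, a loader that never resolves stalls the boot visibly instead of leaving a gateway that listens but cannot dispatch; a production version would probably add a timeout around the loaders.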

2. Missing-harness condition is silent

The dispatch path knows perfectly well, on each failed attempt, that an expected harness (declared in agents.list[main]) is missing. But it only logs the per-message dispatch error — there is no:

  • boot-time invariant check that fails the gateway start if a declared harness fails to register
  • periodic health-line that reports "registered harnesses: [...]" so operators can spot a missing one
  • distinct warn/error message on missing-harness vs other dispatch errors

The user-facing canned reply also doesn't distinguish "transient network glitch" from "gateway is structurally broken." Both look identical to the user, so the operator has to be reading journals to discover the outage exists.
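
The periodic health line from the list above could look roughly like this. The sketch assumes the gateway can enumerate both the declared and the registered harness ids; harnessHealth and the exact line format are hypothetical.

```typescript
// Hypothetical sketch of a periodic health line; the format is illustrative.
interface HarnessHealth {
  level: "info" | "error";
  line: string;
  missing: string[];
}

function harnessHealth(
  declared: readonly string[],
  registered: ReadonlySet<string>,
): HarnessHealth {
  // Declared ids that the registry cannot resolve right now.
  const missing = declared.filter((id) => !registered.has(id));
  const present = Array.from(registered).sort().join(", ");
  const suffix = missing.length > 0 ? ` missing: [${missing.join(", ")}]` : "";
  return {
    level: missing.length > 0 ? "error" : "info",
    line: `[gateway] registered harnesses: [${present}]${suffix}`,
    missing,
  };
}

// State during the outage described here: one harness declared, none registered.
const duringOutage = harnessHealth(["claude-cli"], new Set<string>());
// State after recovery.
const afterRecovery = harnessHealth(["claude-cli"], new Set(["claude-cli"]));
```

Emitting the line at error level whenever the missing list is non-empty gives operators something to alert on without reading per-message dispatch errors.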

Ask

  1. Fix lazy registration. Harness registration should be synchronous (or properly awaited) during plugin init, so that http server listening only fires after the dispatch registry can resolve every declared agent. Treat a declared-but-unregistered harness as a fatal boot error.

  2. Boot-time invariant check. Even after #1, add a startup assertion that compares declared agents (from agents.list) against the registry — fail fast at boot if any declared harness failed to register.

  3. Loud logging on missing-harness dispatch. When dispatch hits a MissingAgentHarnessError for a harness that's in agents.list but not in the registry, escalate the log to a warn/error level with a distinctive prefix (e.g. [gateway] CRITICAL: declared harness "claude-cli" missing from registry on dispatch). This is a different class of failure from a user typo'ing an agent name.

  4. Distinguish user-facing error. Either keep the canned "Something went wrong" message but include a follow-up operator-facing alert (Telegram DM to the deployment owner / system-bus event / etc.), or upgrade the message to "The gateway is unhealthy; check journals" so a human gets a clear signal vs. just retrying into the same wall.
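
Items 3 and 4 amount to classifying the failure before logging and replying. A minimal sketch, assuming the dispatch path has the declared agent list at hand; reportMissingHarness and the report shape are hypothetical, and the sketch picks the "upgrade the message" option from item 4 while also flagging an operator alert.

```typescript
// Hypothetical sketch: separates a structural failure (declared harness is
// missing) from a bad agent name. Names and shapes are illustrative.
interface DispatchFailureReport {
  level: "warn" | "error";
  logLine: string;
  userReply: string;
  alertOperator: boolean;
}

const CANNED_REPLY =
  "Something went wrong while processing your request. Please try again.";

function reportMissingHarness(
  harnessId: string,
  declared: readonly string[],
): DispatchFailureReport {
  if (declared.indexOf(harnessId) !== -1) {
    // Declared in config but absent from the registry: the gateway is broken.
    return {
      level: "error",
      logLine: `[gateway] CRITICAL: declared harness "${harnessId}" missing from registry on dispatch`,
      userReply: "The gateway is unhealthy; check journals",
      alertOperator: true,
    };
  }
  // Not declared anywhere: most likely a typo in an agent name.
  return {
    level: "warn",
    logLine: `[gateway] unknown agent harness "${harnessId}" requested`,
    userReply: CANNED_REPLY,
    alertOperator: false,
  };
}

const structural = reportMissingHarness("claude-cli", ["claude-cli"]);
const typo = reportMissingHarness("claud-cli", ["claude-cli"]);
```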

Related

  • Co-occurred 2026-05-24 with #86226 (synthetic-auth.runtime.js shutdown SyntaxError). Together they produced the 96-minute outage: the SyntaxError fired during the prior gateway's shutdown, and the new gateway then came up with this lazy-registration bug, so by the time the dispatch path saw user traffic the registry was empty.
