openclaw - 💡(How to fix) Fix [Bug]: Isolated agent runs stall in runtime-plugins phase before execution start — recurring across hourly crons on 2026.5.22, no named-plugin diagnostic

openclaw2026-05-27 15:09:00

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Error Message

This is in the same family as #86239 (pre-turn dispatch failures) but a distinct error surface — that one throws MissingAgentHarnessError, this one names runtime-plugins as the stalled phase with no further detail. "status": "error", "error": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)", "severity": "error", Counted by grepping "status":"error" lines mentioning runtime-plugins / stalled before execution in each cron's run log under <workspace>/cron/runs/:

Name the failing plugin. Add per-plugin diagnostics to the runtime-plugins phase so the in-flight plugin at stall time appears in the error. This is the single highest-value fix.
Confirm relationship to #86239. Same family (failures before agent turn starts) but a distinct error surface. Likely separate root causes but worth a sanity check from someone with the gateway internals in head.

#86239 — MissingAgentHarnessError on inbound dispatch under event-loop starvation. Same family (pre-turn failure), different error string, different code path. Open.

Root Cause

Crons A and B each see roughly one failure every other hour. Because the daily health-audit cron flags only "last run failed" (not "any run in the last 24h failed"), the operator-visible signal under-reports: the morning digest shows 5 failing crons, but the underlying run logs show ~180 total failed runs across the fleet in 5 days.

Fix Action

Fix / Workaround

#86239 — MissingAgentHarnessError on inbound dispatch under event-loop starvation. Same family (pre-turn failure), different error string, different code path. Open.
#86227 — claude-cli harness lazy registration silent drop. Closed as fixed in 2026.5.22. This issue is on 2026.5.22 (a374c3a), so the harness fix is in — this is a different pre-turn failure mode.

Code Example

{
  "ts": <ts>,
  "action": "finished",
  "status": "error",
  "error": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)",
  "diagnostics": {
    "summary": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)",
    "entries": [
      {
        "ts": <ts>,
        "source": "cron-setup",
        "severity": "error",
        "message": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)"
      }
    ]
  },
  "runAtMs": <ts>,
  "durationMs": 78628,
  "nextRunAtMs": <ts>
}

RAW_BUFFERClick to expand / collapse

Environment

OpenClaw: 2026.5.22 (a374c3a)
Backend: cliBackends.claude-cli
Host: single-VPS Linux deployment, no container
Affected runs: all use sessionTarget: "isolated" and payload.kind: "agentTurn"
Models seen across affected runs: claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-7 (via anthropic/ and claude-cli/ prefixes) — failure is not model-specific

TL;DR

Cron-triggered isolated agent runs intermittently stall in the gateway's runtime-plugins phase, before the model is invoked. The failure terminates the run with a generic diagnostic that names only the phase, not the plugin that hung. Affected runs have no model, provider, or usage recorded — confirming the agent turn never started.

Pattern is recurring (not one-off) and post-2026.5.22-install: 4 distinct hourly crons have racked up 11–77 failed runs each over the 5 days since 2026-05-22. Other crons are hit 1–2 times. The same crons also have intermixed green runs in the same window, suggesting a race or external-dependency timeout rather than a deterministic config bug.

Symptom

Failed isolated-session runs end with this diagnostic in the cron run log:

{
  "ts": <ts>,
  "action": "finished",
  "status": "error",
  "error": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)",
  "diagnostics": {
    "summary": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)",
    "entries": [
      {
        "ts": <ts>,
        "source": "cron-setup",
        "severity": "error",
        "message": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)"
      }
    ]
  },
  "runAtMs": <ts>,
  "durationMs": 78628,
  "nextRunAtMs": <ts>
}

Notable: no model, provider, usage, or sessionId fields on failed runs — these are present on every successful run. The agent turn was never reached.

durationMs on failures clusters around 78–80s, suggesting an internal phase timeout (single observed value to date — would appreciate confirmation of what that timeout is).

Frequency (5-day window: 2026-05-22 → 2026-05-27)

Counted by grepping "status":"error" lines mentioning runtime-plugins / stalled before execution in each cron's run log under <workspace>/cron/runs/:

Cron	Cadence	Failed runs	Most recent failure
Hourly cron A	hourly	77	2026-05-27 12:23 UTC
Hourly cron B	hourly	74	2026-05-27 14:15 UTC
Hourly cron C	hourly	12	recent
Hourly cron D	event-driven	11	recent
Misc daily/weekly crons	—	1–2 each	various

Affected crons span unrelated payload types, tool surfaces (some have payload.toolsAllow locks, some don't), and models — the only common factor is sessionTarget: "isolated".

Hypothesis (low-confidence, leave for upstream)

The runtime-plugins phase appears to load MCP plugins / runtime adapters for the isolated child session. Either:

A specific plugin's init is hanging (network call without timeout, file handle contention, mutex), and the phase has no per-plugin timeout — so the whole phase times out generically after ~78s with no indication of which plugin stalled.
There's a race in the plugin-loader's startup ordering when multiple isolated sessions spin up concurrently (hourly crons firing at :00 of every hour would line up several at once).

Both are consistent with the observed mix of green and red runs on the same cron. I can't narrow further from the operator side because the diagnostic doesn't name a plugin.

What's wrong

1. The diagnostic is unactionable

Knowing the phase failed isn't enough to fix anything. There are presumably N plugins/adapters loaded per isolated session — one of them is the culprit, but the operator (and downstream tooling) can't tell which. The minimum useful diagnostic would include:

The list of plugins the phase was attempting to load.
Which ones completed before the stall.
The plugin that was in-flight when the phase timed out.

2. No per-plugin timeout (apparent)

If the phase aggregate timeout is ~78s and individual plugins have no timeouts of their own, a single slow plugin (e.g. an MCP server doing a network call without a timeout of its own) takes down the whole phase. A bounded per-plugin init budget would isolate the failure to the offending plugin and let the rest of the phase succeed.

3. No retry-on-phase-stall

A transient stall in plugin init kills the entire cron run — the operator's only recourse is to wait for the next scheduled run. A bounded retry on phase-stall (configurable, default off) would significantly reduce user-visible failures for transient races without papering over real bugs.

4. Under-reporting in cron status

The cron state machine records lastRunStatus only. A cron that fails 3 of 5 runs in an hour and succeeds on the 4th shows lastRunStatus: ok — and the daily health audit treats it as healthy. Adding a recentRunFailureRate (or similar) over a rolling window would surface chronic-but-self-recovering failures like this one without requiring operators to grep run logs by hand.

Asks

Name the failing plugin. Add per-plugin diagnostics to the runtime-plugins phase so the in-flight plugin at stall time appears in the error. This is the single highest-value fix.
Add per-plugin init timeouts. Bound each plugin's init independently so a single slow plugin can't take down the whole phase.
Consider retry-on-phase-stall. Configurable, default off. Helps transient-race scenarios without masking real bugs.
Expose a failure-rate metric on cron state. lastRunStatus alone hides chronic flaky crons that intermittently self-recover.
Confirm relationship to #86239. Same family (failures before agent turn starts) but a distinct error surface. Likely separate root causes but worth a sanity check from someone with the gateway internals in head.

#86239 — MissingAgentHarnessError on inbound dispatch under event-loop starvation. Same family (pre-turn failure), different error string, different code path. Open.
#86227 — claude-cli harness lazy registration silent drop. Closed as fixed in 2026.5.22. This issue is on 2026.5.22 (a374c3a), so the harness fix is in — this is a different pre-turn failure mode.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering