[Bug]: openclaw CLI subprocess startup ~25s on idle macOS; worker-tick fan-out under launchd compounds to 5+ min ticks [1 comment, 2 participants]

openclaw/openclaw#73743, fetched 2026-04-29 06:15:42

Summary

openclaw <subcommand> CLI subprocesses take ~25 s of wall time even on an idle macOS host with no concurrent activity. When the same machine fans out 21 launchd-driven workflow ticks every 60 s (recipes workflows worker-tick × 19 + runner-tick × 2), the per-tick runtime stretches from ~25 s in isolation to 5–6 min under fan-out, well beyond the 60 s StartInterval. ClawKitchen's 120 s runOpenClaw timeout then fires on its own recipes list calls and caches {ok:true, stdout: "<deprecation warning only>"} for 30 min, breaking the recipes/team editor with Recipe not found: <teamId>.

The fan-out shape is a classic thundering herd, but the underlying single-call startup time is what makes it tip over: at <5 s per CLI subprocess, 21 parallel ticks would complete inside the 60 s budget. At ~25 s minimum it cannot, and contention only makes it worse.

The user reports this behavior began after upgrading to 2026.4.26.

Environment

  • OpenClaw CLI: 2026.4.26
  • Host OS: macOS Darwin 25.4.0 arm64 (Apple Silicon, residential network)
  • Node.js: v25.9.0
  • Homebrew install at /opt/homebrew/lib/node_modules/openclaw
  • Plugins: 5 enabled (kitchen, llm-task, memory-core, recipes, telegram)
  • Configured providers: openai-codex/gpt-5.4 (primary) + anthropic/claude-opus-4-7 (fallback). No OpenRouter or LiteLLM as providers.
  • Workflow plists: 21 launchd StartInterval: 60 jobs under ~/Library/LaunchAgents/com.hairmx.workflow-*.plist, each shlock-guarded per label (no same-agent pile-up).

What we observed

Single-call CLI cold start (idle system, all workflow plists booted out)

$ time openclaw recipes list
plugin runtime config.loadConfig() is deprecated (runtime-config-load-write); use config.current().
[ ... 17 recipes ... ]
openclaw recipes list  13.66s user 13.14s system 105% cpu 25.336 total

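stderr is empty, which means the deprecation banner above is written to stdout. recipes list only walks the builtin and workspace markdown directories, so there is no I/O work that should take 25 s. The 105% CPU figure (13.66 s user + 13.14 s system against 25.3 s wall) shows the process is CPU-bound for essentially the whole run rather than waiting on disk or network.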
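The timing output above also allows a back-of-envelope CPU budget for the fan-out. The sketch below is illustrative only: it assumes every tick costs roughly the same CPU time as the measured recipes list call, and the function and variable names are made up for this example.

```typescript
// Average number of fully busy cores needed to finish every tick
// inside one scheduling interval.
function coresNeeded(
  ticks: number,
  cpuSecondsPerTick: number,
  intervalSeconds: number,
): number {
  return (ticks * cpuSecondsPerTick) / intervalSeconds;
}

// user + system seconds reported by `time openclaw recipes list`.
const measuredCpuSeconds = 13.66 + 13.14;

// 21 ticks per 60 s StartInterval at the measured cost.
const current = coresNeeded(21, measuredCpuSeconds, 60);

// Hypothetical target: a 5 s start-up at the same 105% CPU-to-wall ratio.
const target = coresNeeded(21, 5 * 1.05, 60);

console.log(current.toFixed(1), target.toFixed(1)); // prints "9.4 1.8"
```

Under these assumptions the current start-up cost needs more than nine cores busy continuously just to keep up with the schedule, while a 5 s start-up would need fewer than two. This matches the claim in the summary that the per-call cost, not the schedule, is what tips the system over.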

Same call under launchd fan-out

With 21 workflow plists active, each minute every label fires (recipes workflows worker-tick × 19, runner-tick × 2, approval-repair × 2). Snapshot from ps -ef:

PID    ELAPSED  RSS_KB    COMMAND
24811  06:18    61760     openclaw
24812  06:18    774944    openclaw-recipes
24839  05:27    61696     openclaw
24840  05:27    771376    openclaw-recipes
24889  05:16    61696     openclaw
24890  05:16    729424    openclaw-recipes
24897  05:12    61744     openclaw
24898  05:12    785632    openclaw-recipes
24917  05:05    61712     openclaw
...

17 openclaw-recipes workers alive simultaneously, each running 2–6 min for a single worker-tick invocation that should claim at most one task. Aggregate ~13 GB resident.

A separate openclaw recipes list invocation while this fan-out is active:

$ time openclaw recipes list
... 2:39 wall ...

That is 6.3× slower than the idle-system measurement (159 s vs. 25.3 s), purely due to contention.

Downstream poisoning of ClawKitchen's subprocess cache

ClawKitchen's cachedRunOpenClaw invokes runtime.system.runCommandWithTimeout(["openclaw", ...args], { timeoutMs: 120000 }). When the 120 s timeout fires:

  • The runtime returns a partial stdout (just the deprecation banner emitted at the top of CLI startup — no JSON)
  • t.exitCode appears to be falsy/undefined, so kitchen's ("number" === typeof exitCode ? exitCode : "number" === typeof code ? code : "number" === typeof status ? status : 0) chain defaults to 0
  • Kitchen treats the result as ok: true and caches it with a 30 min TTL

After that, any kitchen route that calls findRecipeById (e.g. the recipes/team editor under /teams/<teamId>) sees JSON.parse(stdout) throw, falls back to [], and returns Recipe not found: <id>. The poisoned entry persists for 30 min (or until a /api/recipes/clone or PUT /api/recipes/[id] invalidates it — both of which require findRecipeById to succeed first, so the cache cannot heal itself once poisoned).
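The failure chain described above can be reproduced in isolation. The following is a minimal sketch, not ClawKitchen's actual source: the CommandResult interface and the function name are assumptions, and only the fallback order is taken from the expression quoted in the list.

```typescript
// Assumed shape of the runtime's result; only the three field names
// come from the quoted expression.
interface CommandResult {
  stdout: string;
  exitCode?: unknown;
  code?: unknown;
  status?: unknown;
}

// Same fallback order as the quoted kitchen expression.
function resolveExitCode(r: CommandResult): number {
  if (typeof r.exitCode === "number") return r.exitCode;
  if (typeof r.code === "number") return r.code;
  if (typeof r.status === "number") return r.status;
  return 0; // no numeric field present: indistinguishable from success
}

// What a timed-out call is reported to look like: banner only, no exit code.
const timedOutResult: CommandResult = {
  stdout:
    "plugin runtime config.loadConfig() is deprecated (runtime-config-load-write); use config.current().",
};

// The chain yields 0, so the result counts as ok and gets cached.
const treatedAsOk = resolveExitCode(timedOutResult) === 0;

// Later readers fail to parse the banner and fall back to an empty list,
// which is what surfaces as "Recipe not found".
let recipes: unknown[] = [];
try {
  recipes = JSON.parse(timedOutResult.stdout);
} catch {
  recipes = [];
}
```

The point of the sketch is that a missing exit code and a zero exit code are indistinguishable after the chain runs, so the timeout has to be signalled some other way.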

Why this looks like a regression

The user reports — and it's consistent with the project's history of running this same fleet of 21 launchd plists for weeks — that workflow ticks completed comfortably inside the 60 s window before the upgrade to 2026.4.26. After the upgrade, ticks visibly stack and the system becomes unresponsive for minutes on every gateway restart.

We don't have a clean before/after measurement of the bare openclaw recipes list time prior to upgrade. But:

  • 25 s for a no-op markdown directory walk is a lot of work to attribute to plugin loading alone.
  • Even if 25 s is the new baseline, every recipes workflows worker-tick spawns an openclaw-recipes child, and the ~770 MB RSS of each child (against ~60 MB for its parent) suggests the child performs a full plugin-runtime initialization of its own.
  • 17–21 such workers all booting on the same minute boundary is the likely tipping point, but the launchd schedule didn't change — only the CLI startup cost did.

Reproduction

  1. Install OpenClaw 2026.4.26.
  2. Configure 19+ workflow agents under launchd StartInterval: 60 plists, each invoking openclaw recipes workflows worker-tick --team-id ... --agent-id .... (shlock guards optional; they prevent same-label pile-up but not the cross-agent fan-out problem.)
  3. Run time openclaw recipes list in two contexts: with launchd plists booted out (idle baseline) and active (fan-out). Compare.
  4. Refresh ClawKitchen /teams/<teamId> while fan-out is active. Observe the kitchen subprocess cache at ~/.openclaw/.kitchen-subprocess-cache/ — entries for ["recipes","list"] end up with stdout containing only the runtime-config-load-write deprecation banner.

Suggested investigation areas

  1. CLI startup cost. Trace what the openclaw <subcommand> process is doing during its 25 s pre-output window on an idle host. Plugin discovery? Plugin initialization? Heartbeat probes? Adding --no-plugins for read-only commands like recipes list (which only needs the recipes plugin) could turn 25 s into something single-digit.
  2. Worker-tick lightweight path. openclaw recipes workflows worker-tick spawns an openclaw-recipes child at ~770 MB RSS just to claim and possibly run one task. Could the worker-tick read the queue directly from the gateway via a thin RPC instead of full plugin-runtime initialization?
  3. launchd jitter. Stagger StartInterval by label hash to avoid the per-minute thundering herd, or document a recommended deployment pattern for users with N>10 workflow plists.
  4. Kitchen runOpenClaw resilience. When runtime.system.runCommandWithTimeout returns due to timeout, surface that as ok: false (e.g. exitCode: -1 + a timedOut: true flag) so callers don't cache partial output as success. Optionally add a "must-be-JSON" hint that cachedRunOpenClaw can use to refuse to cache unparseable stdout.
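Item 4 above could look roughly like the guard below. This is a sketch under stated assumptions: the timedOut flag and the isCacheable helper are hypothetical (neither exists in the runtime as far as this report shows), and it assumes the command's successful stdout is pure JSON, so any banner stripping would have to happen before the check.

```typescript
// Hypothetical result shape with an explicit timeout flag, as proposed in item 4.
interface RunResult {
  stdout: string;
  exitCode: number;
  timedOut: boolean;
}

// Cache only clean exits; when the caller expects JSON, also require
// that stdout parses.
function isCacheable(r: RunResult, mustBeJson: boolean): boolean {
  if (r.timedOut || r.exitCode !== 0) return false;
  if (!mustBeJson) return true;
  try {
    JSON.parse(r.stdout);
    return true;
  } catch {
    return false;
  }
}

const banner =
  "plugin runtime config.loadConfig() is deprecated (runtime-config-load-write); use config.current().";

const good = isCacheable({ stdout: "[]", exitCode: 0, timedOut: false }, true);
const partial = isCacheable({ stdout: banner, exitCode: 0, timedOut: false }, true);
const timeout = isCacheable({ stdout: banner, exitCode: -1, timedOut: true }, true);
```

Either check alone would have prevented the poisoning described earlier: the timeout flag rejects the result up front, and the JSON check rejects banner-only output even if the flag is lost.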

Workaround applied locally

  • launchctl bootout gui/$UID/<label> for each of the 21 com.hairmx.workflow-* labels. Drained the 17 in-flight workers; idle-system recipes list came back to 25 s.
  • Pre-wrote a fresh recipes list cache file to ClawKitchen's subprocess cache so the in-memory poisoned entry self-heals to good data on TTL expiry.

Workflows are paused while we wait for the team editor to recover. Once the in-memory cache flushes we'll re-bootstrap launchd plists, but with N≥19 agents on StartInterval: 60 we expect the fan-out to recur until the underlying CLI startup cost is addressed.

extent analysis

TL;DR

The most likely fix is to reduce the CLI startup cost by optimizing plugin discovery and initialization, and implementing a lightweight path for worker-tick commands.

Guidance

  • Investigate the CLI startup cost by tracing what the openclaw <subcommand> process is doing during its 25 s pre-output window on an idle host.
  • Consider adding --no-plugins for read-only commands like recipes list to reduce the startup time.
  • Implement a lightweight path for worker-tick commands, such as reading the queue directly from the gateway via a thin RPC instead of full plugin-runtime initialization.
  • Stagger StartInterval by label hash to avoid the per-minute thundering herd.
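One way to stagger by label hash is to derive a fixed per-label delay and apply it in a wrapper before the CLI is invoked. The sketch below is an assumption about how that could be done, not an existing OpenClaw or launchd feature; the labels are placeholders and FNV-1a stands in for any stable string hash.

```typescript
// FNV-1a 32-bit hash: deterministic across runs and machines.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value an unsigned 32-bit int
  }
  return h;
}

// Delay in [0, intervalSeconds) that a wrapper would sleep before the tick.
function startOffsetSeconds(label: string, intervalSeconds: number): number {
  return fnv1a(label) % intervalSeconds;
}

// Placeholder labels for illustration only.
const labels = ["com.example.workflow-a", "com.example.workflow-b", "com.example.workflow-c"];
const offsets = labels.map((label) => startOffsetSeconds(label, 60));
```

Because the offset depends only on the label, each job keeps the same slot across restarts. Hashing does not guarantee an even spread for a small number of labels, so a deployment with about 20 jobs might prefer to assign offsets explicitly.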


Notes

The issue is likely caused by the increased CLI startup cost introduced in version 2026.4.26, which is exacerbated by the thundering herd problem when 21 workflow plists are launched simultaneously.

Recommendation

Apply a workaround by reducing the number of workflow plists or staggering their launch intervals until the underlying CLI startup cost is addressed.
