openclaw - ✅(Solved) Fix Plugin loader hot loop: prepareBundledPluginRuntimeDistMirror + JSON5 manifest parsing saturate gateway and starve event loop [1 pull requests, 6 comments, 6 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#73532Fetched 2026-04-29 06:18:37
View on GitHub
Comments
6
Participants
6
Timeline
13
Reactions
0
Author
Timeline (top)
commented ×6cross-referenced ×5mentioned ×1subscribed ×1

The gateway enters a hot loop in the plugin loader / bundled-runtime-deps mirror that pegs a single core at ~100% indefinitely after startup. The loop starves the Node event loop, which in turn breaks Telegram long-polling (getUpdates HTTPS reads time out before the response can be parsed) and prevents the bot from processing any incoming messages. WS RPC paths (TUI connect, openclaw channels status --deep) also time out at 10 s, but openclaw status --json (which doesn't go through the WS handler) still responds.

This looks distinct from #72338 (which describes the WS self-loopback timeout pattern). Disabling memory search, removing legacy tools.web keys, individually disabling each enabled plugin, and quarantining a runaway agent session (the visible loop-detection warnings) all left the spin in place. Profiling showed the spin is in the plugin loader, not in WS handling.

Private identifiers (hostnames, tokens, chat IDs, absolute user paths, peer Tailnet) omitted.

Error Message

  • Telegram polling logs Polling stall detected (no completed getUpdates for 148.49s); forcing restart. error=Network request for 'getUpdates' failed. Telegram updates queued on the upstream API are never drained, even after a clean gateway restart and a 90 s warmup window

Root Cause

The gateway enters a hot loop in the plugin loader / bundled-runtime-deps mirror that pegs a single core at ~100% indefinitely after startup. The loop starves the Node event loop, which in turn breaks Telegram long-polling (getUpdates HTTPS reads time out before the response can be parsed) and prevents the bot from processing any incoming messages. WS RPC paths (TUI connect, openclaw channels status --deep) also time out at 10 s, but openclaw status --json (which doesn't go through the WS handler) still responds.

This looks distinct from #72338 (which describes the WS self-loopback timeout pattern). Disabling memory search, removing legacy tools.web keys, individually disabling each enabled plugin, and quarantining a runaway agent session (the visible loop-detection warnings) all left the spin in place. Profiling showed the spin is in the plugin loader, not in WS handling.

Private identifiers (hostnames, tokens, chat IDs, absolute user paths, peer Tailnet) omitted.

Fix Action

Fixed

PR fix notes

PR #73678: fix: cache plugin manifest loads by file signature

Description (problem / solution / changelog)

Summary

  • cache loadPluginManifest results by manifest path, hardlink policy, optional real root, and file signature
  • reuse the normalized manifest result when the manifest file is unchanged
  • add regression coverage for repeated unchanged manifest loads

Context

This is the smallest upstreamable slice from a v2026.4.26 CPU investigation on a Docker/Synology deployment. The gateway was repeatedly spending CPU in plugin manifest/runtime discovery during channel startup; caching normalized manifest loads reduced repeated strict JSON/JSON5 parse and normalization work in bursty startup paths.

Related issue: #73647

Validation

  • node --import tsx <manifest-cache-smoke> => manifest cache smoke ok
  • corepack pnpm exec oxfmt --check --threads=1 src/plugins/manifest.ts src/plugins/manifest.json5-tolerance.test.ts
  • corepack pnpm run tsgo:test:src

Note: direct vitest run src/plugins/manifest.json5-tolerance.test.ts was not used as final validation on the NAS because the default Vitest invocation spent several minutes in project bundling. The source typecheck and a direct runtime smoke both passed.

Changed files

  • src/plugins/manifest.json5-tolerance.test.ts (modified, +35/-1)
  • src/plugins/manifest.ts (modified, +68/-7)

Code Example

8.9%  Builtin: KeyedLoadIC_Megamorphic
 7.5%  Builtin: StrictEqual
 7.4%  json5/lib/parse.js  *default
 4.7%  json5/lib/parse.js  *string
 4.5%  json5/lib/parse.js  *parse
 3.2%  node:path  *resolve
 2.9%  Builtin: StringEqual
 2.4%  dist/manifest-*loadPluginManifest
 1.4%  Builtin: CallFunction_ReceiverIsNotNullOrUndefined

---

withBundledRuntimeDepsFilesystemLock        (dist/bundled-runtime-deps-L258)
└─ prepareBundledPluginRuntimeDistMirror   (dist/loader-L2109)
   └─ mirrorBundledRuntimeDistRootEntries  (dist/loader-L2149)
      └─ shouldMaterializeBundledRuntimeMirrorDistFile  (dist/bundled-runtime-deps-L81)11.0% of total ticks

---

loadPluginManifestRegistry        (dist/manifest-registry-L317)
└─ loadPluginManifest             (dist/manifest-L1013)
   └─ json5/lib/parse.js *parse   (3 330 ticks → ~5% of total)
       └─ string / value / read / beforePropertyValue / afterPropertyValue …
RAW_BUFFERClick to expand / collapse

Summary

The gateway enters a hot loop in the plugin loader / bundled-runtime-deps mirror that pegs a single core at ~100% indefinitely after startup. The loop starves the Node event loop, which in turn breaks Telegram long-polling (getUpdates HTTPS reads time out before the response can be parsed) and prevents the bot from processing any incoming messages. WS RPC paths (TUI connect, openclaw channels status --deep) also time out at 10 s, but openclaw status --json (which doesn't go through the WS handler) still responds.

This looks distinct from #72338 (which describes the WS self-loopback timeout pattern). Disabling memory search, removing legacy tools.web keys, individually disabling each enabled plugin, and quarantining a runaway agent session (the visible loop-detection warnings) all left the spin in place. Profiling showed the spin is in the plugin loader, not in WS handling.

Private identifiers (hostnames, tokens, chat IDs, absolute user paths, peer Tailnet) omitted.

Environment

  • OpenClaw: 2026.4.25 (aa36ee6) — also reproduced on 2026.4.26 before downgrade
  • Node: v22.22.0 (LTS)
  • OS: Linux 5.15 x86_64
  • Gateway mode: local loopback systemd user service, port 18789

Symptoms

  • Gateway process is in R (running) state with wchan=0 — pure user-space spin, never blocks on a syscall
  • ~100–115% CPU on a single core indefinitely after restart; doesn't decay after warmup
  • RSS climbs ~120 MB/min from a clean restart; we observed up to 1.4 GB after ~5 min
  • VIRT reaches ~32 GB (V8 reservation pressure)
  • openclaw doctor, openclaw channels status --deep, openclaw tui all time out with gateway timeout after 10000ms against ws://127.0.0.1:18789
  • openclaw status --json (no WS round-trip) still responds correctly
  • Telegram polling logs Polling stall detected (no completed getUpdates for 148.49s); forcing restart. error=Network request for 'getUpdates' failed. Telegram updates queued on the upstream API are never drained, even after a clean gateway restart and a 90 s warmup window
  • The Telegram REST API is reachable and the bot token is valid (verified with curl: getMe, deleteWebhook, getUpdates all succeed within ms when called directly from the same host)

Triggers ruled out

  • Channel logins: never invoked in this session (rules out the trigger hypothesized in #72338)
  • Plugins: 11 enabled / 102 disabled bundled, also tested with each user plugin disabled individually
  • User plugin entries: only 12 in plugins.entries; 11 enabled
  • Agents: 1 entry in agents.list; runaway sessions on disk quarantined; orphan dirs flagged by doctor
  • Cron jobs: clean queue, spin observed at minute boundaries with zero scheduled work
  • Memory search: agents.defaults.memorySearch.enabled = false
  • Legacy keys: tools.web / tools.alsoAllow removed
  • Telegram native commands: channels.telegram.commands.native = false (this did eliminate a separate retry storm but did not affect the CPU spin)
  • Delivery queue: emptied prior to restart
  • openclaw doctor --fix: only archived an orphan transcript; CPU spin returns within ~30 s

CPU profile (V8 --prof over 60 s, 74 301 ticks captured)

Top JS leaves (tick share):

 8.9%  Builtin: KeyedLoadIC_Megamorphic
 7.5%  Builtin: StrictEqual
 7.4%  json5/lib/parse.js  *default
 4.7%  json5/lib/parse.js  *string
 4.5%  json5/lib/parse.js  *parse
 3.2%  node:path  *resolve
 2.9%  Builtin: StringEqual
 2.4%  dist/manifest-…  *loadPluginManifest
 1.4%  Builtin: CallFunction_ReceiverIsNotNullOrUndefined

Bottom-up (heavy parents) — the dominant call stack:

withBundledRuntimeDepsFilesystemLock        (dist/bundled-runtime-deps-…  L258)
└─ prepareBundledPluginRuntimeDistMirror   (dist/loader-…  L2109)
   └─ mirrorBundledRuntimeDistRootEntries  (dist/loader-…  L2149)
      └─ shouldMaterializeBundledRuntimeMirrorDistFile  (dist/bundled-runtime-deps-…  L81)   ← 11.0% of total ticks

Parallel hot stack (JSON5 manifest parsing):

loadPluginManifestRegistry        (dist/manifest-registry-…  L317)
└─ loadPluginManifest             (dist/manifest-…  L1013)
   └─ json5/lib/parse.js *parse   (3 330 ticks → ~5% of total)
       └─ string / value / read / beforePropertyValue / afterPropertyValue …

Other contributors visible in the bottom-up:

  • compileSourceTextModule / getModuleJobForImport / package_json_reader — ~7% of ticks. Suggests dynamic ESM imports happening on the hot path.
  • collectRuntimePackageWildcardImportTargetsregisterBundledRuntimeDependencyJitiAliases
  • buildInstalledPluginIndex / safeHashFile → repeated hashing of plugin index files
  • ensureOpenClawPluginSdkAlias → re-entering prepareBundledPluginRuntimeDistMirror

Interpretation

The bundled-runtime-deps mirror at ~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/dist/ is well-formed (2667 entries vs 2666 in the source dist/; the only diff is the mirror's own package.json; no broken symlinks). The mirror does not need rebuilding, yet prepareBundledPluginRuntimeDistMirror is invoked repeatedly enough to dominate CPU. Combined with loadPluginManifestRegistry re-parsing manifests with JSON5 on the same hot path, this looks like the plugin-registry cache is missing on every call and re-doing the full load.

loadOpenClawPlugins does have a cache (getCachedPluginRegistry(cacheKey)), so the suspicion is one of:

  1. cacheKey varies between callers (runtimeSubagentMode, workspaceDir, etc.), so different code paths each get their own cache and each thrashes
  2. The cache is invalidated/cleared too aggressively (e.g., on every WS connection, every health probe, every tools-invoke)
  3. cacheEnabled === false is set somewhere on the hot path

Whatever the trigger, the inner work (filesystem lock + per-file shouldMaterialize… decisions + JSON5 re-parse) runs frequently enough to saturate one core and starve the event loop.

Reproducer

  1. Linux x86_64, Node 22 LTS, OpenClaw 2026.4.25 from npm, default install layout
  2. ~10 enabled plugins, ~100 disabled bundled, Telegram channel enabled
  3. systemctl --user restart openclaw-gateway
  4. Within 30 s the gateway is at ~100% CPU, RSS climbing, openclaw doctor health check times out at 10 s, Telegram updates begin to queue without being drained

Suggested next steps

  • Instrument getCachedPluginRegistry / setCachedPluginRegistry to log cache key + hit/miss for a few seconds and confirm whether the cache is missing on every call
  • If yes, normalize the cacheKey derivation (or share the cache across runtimeSubagentMode/workspaceDir permutations) so steady-state callers hit the cache instead of re-mirroring + re-parsing
  • Rate-limit / debounce prepareBundledPluginRuntimeDistMirror — the mirror is a steady-state directory, it should never need to run more than once per process lifetime unless the source dist/ changes

Happy to capture additional traces (heap snapshot, longer profile, instrumented log) if helpful.

extent analysis

TL;DR

The plugin loader's repeated invocation of prepareBundledPluginRuntimeDistMirror and re-parsing of manifests with JSON5 is likely causing the CPU spin, suggesting a missing or inefficiently used cache.

Guidance

  • Instrument getCachedPluginRegistry and setCachedPluginRegistry to log cache key and hit/miss rates to confirm if the cache is missing on every call.
  • Normalize the cacheKey derivation to ensure steady-state callers hit the cache instead of re-mirroring and re-parsing.
  • Consider rate-limiting or debouncing prepareBundledPluginRuntimeDistMirror to prevent excessive invocations.
  • Review the cache invalidation strategy to prevent aggressive clearing of the cache.

Example

No code snippet is provided as the issue is more related to the understanding of the plugin loader's behavior and cache usage.

Notes

The provided CPU profile and reproducer steps suggest that the issue is related to the plugin loader and cache usage. However, without further instrumentation and logging, it's difficult to pinpoint the exact cause. The suggested next steps should help in identifying the root cause and potential solutions.

Recommendation

Apply a workaround by instrumenting the cache and normalizing the cacheKey derivation to improve cache hits and reduce the CPU spin. This should help in mitigating the issue until a more permanent fix can be implemented.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - ✅(Solved) Fix Plugin loader hot loop: prepareBundledPluginRuntimeDistMirror + JSON5 manifest parsing saturate gateway and starve event loop [1 pull requests, 6 comments, 6 participants]