openclaw - 💡(How to fix) Fix loadPluginMetadataSnapshot rebuilds the full installed-plugin index synchronously on every call — 25–140s event-loop freezes on multi-agent gateways

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

On a long-running gateway serving multiple agents, the Node event loop freezes for 25–140 seconds at a time, repeatedly. A CPU profile traced it to loadPluginMetadataSnapshot() running a full, synchronous installed-plugin filesystem crawl on every call — it has no memoization, and the one intended cache (getCurrentPluginMetadataSnapshot, a single slot) structurally misses on multi-agent gateways.

Root Cause

Dominant CPU-profile stack:

loadCatalog
└ normalizeProviderModelIdWithManifest      (manifest-model-id-normalization.ts)
 └ resolveMetadataSnapshotForPolicies
  └ loadPluginMetadataSnapshot               (plugin-metadata-snapshot.ts)
   └ loadPluginRegistrySnapshotWithMetadata
    └ loadInstalledPluginIndex → buildInstalledPluginIndex
     └ resolveInstalledPluginIndexRegistry → discoverOpenClawPlugins → discoverInDirectory

loadPluginMetadataSnapshot() has no memoization — every call runs loadPluginMetadataSnapshotImplloadInstalledPluginIndex, and loadInstalledPluginIndex() is literally return buildInstalledPluginIndex(params) (a full rebuild despite the name). buildInstalledPluginIndex synchronously discovers every installed plugin from disk: realpathSync/statSync/lstatSync/existsSync (~170s cumulative in a 41-minute profile), openSync+readSync of every plugin manifest, hashing each. ~65% of all non-idle CPU in the profile was this single path.

There is an intended cache: resolveMetadataSnapshotForPolicies first tries getCurrentPluginMetadataSnapshot() (a single-slot snapshot the gateway sets at startup) and only falls back to loadPluginMetadataSnapshot() on a miss. But getCurrentPluginMetadataSnapshot() only returns the slot when workspaceDir + config fingerprint + policyHash all match. A gateway serving N agents with distinct workspaces/configs can match at most one of them, so for every other agent the lookup misses and falls through to the uncached full crawl — on every turn. (loadCatalog normalizes several model ids per catalog load, so one turn crawls repeatedly.)

loadPluginMetadataSnapshot is imported at 30+ call sites; none memoizes.

Code Example

loadCatalog
normalizeProviderModelIdWithManifest      (manifest-model-id-normalization.ts)
 └ resolveMetadataSnapshotForPolicies
loadPluginMetadataSnapshot               (plugin-metadata-snapshot.ts)
   └ loadPluginRegistrySnapshotWithMetadata
    └ loadInstalledPluginIndex → buildInstalledPluginIndex
     └ resolveInstalledPluginIndexRegistry → discoverOpenClawPlugins → discoverInDirectory
RAW_BUFFERClick to expand / collapse

Summary

On a long-running gateway serving multiple agents, the Node event loop freezes for 25–140 seconds at a time, repeatedly. A CPU profile traced it to loadPluginMetadataSnapshot() running a full, synchronous installed-plugin filesystem crawl on every call — it has no memoization, and the one intended cache (getCurrentPluginMetadataSnapshot, a single slot) structurally misses on multi-agent gateways.

Environment

  • openclaw 2026.5.12, Node v25.5.0, Linux x64
  • gateway serving 8 agents with distinct workspaces; ~150 bundled plugins

Symptom

  • Event loop blocked 25–140s (built-in liveness monitor: eventLoopDelayMaxMs up to ~138000, eventLoopUtilization ≈ 1, cpuCoreRatio ≈ 1). One instance logged 18 such freezes in 88 minutes of normal use.
  • While frozen the whole gateway is unresponsive: streaming model responses are not read, so embedded runs fail with LLM idle timeout (120s): no response from model; outbound HTTP (e.g. a feishu ping) hits axios timeouts.

Root cause

Dominant CPU-profile stack:

loadCatalog
└ normalizeProviderModelIdWithManifest      (manifest-model-id-normalization.ts)
 └ resolveMetadataSnapshotForPolicies
  └ loadPluginMetadataSnapshot               (plugin-metadata-snapshot.ts)
   └ loadPluginRegistrySnapshotWithMetadata
    └ loadInstalledPluginIndex → buildInstalledPluginIndex
     └ resolveInstalledPluginIndexRegistry → discoverOpenClawPlugins → discoverInDirectory

loadPluginMetadataSnapshot() has no memoization — every call runs loadPluginMetadataSnapshotImplloadInstalledPluginIndex, and loadInstalledPluginIndex() is literally return buildInstalledPluginIndex(params) (a full rebuild despite the name). buildInstalledPluginIndex synchronously discovers every installed plugin from disk: realpathSync/statSync/lstatSync/existsSync (~170s cumulative in a 41-minute profile), openSync+readSync of every plugin manifest, hashing each. ~65% of all non-idle CPU in the profile was this single path.

There is an intended cache: resolveMetadataSnapshotForPolicies first tries getCurrentPluginMetadataSnapshot() (a single-slot snapshot the gateway sets at startup) and only falls back to loadPluginMetadataSnapshot() on a miss. But getCurrentPluginMetadataSnapshot() only returns the slot when workspaceDir + config fingerprint + policyHash all match. A gateway serving N agents with distinct workspaces/configs can match at most one of them, so for every other agent the lookup misses and falls through to the uncached full crawl — on every turn. (loadCatalog normalizes several model ids per catalog load, so one turn crawls repeatedly.)

loadPluginMetadataSnapshot is imported at 30+ call sites; none memoizes.

Impact

Any multi-agent / multi-workspace gateway: the event loop is periodically frozen for tens of seconds to over two minutes, making the whole process unresponsive and causing spurious model idle-timeouts and dropped/late outbound requests.

Proposed fix

Memoize loadPluginMetadataSnapshot with a small, process-lifetime, multi-entry cache keyed on a cheap in-memory fingerprint (resolvePluginMetadataControlPlaneFingerprint + workspaceDir/stateDir/preferPersisted + installed-index fingerprint — no disk I/O), invalidated wherever the installed-plugin index changes (the existing clearCurrentPluginMetadataSnapshotState() sites in installed-plugin-index-store.ts, plus clearCurrentPluginMetadataSnapshot()).

This collapses the repeated crawls to one per distinct config/workspace and benefits all 30+ call sites. Tested on the affected deployment: recurring freezes eliminated (one ~26s crawl after a restart, then none).

A PR with this fix follows.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix loadPluginMetadataSnapshot rebuilds the full installed-plugin index synchronously on every call — 25–140s event-loop freezes on multi-agent gateways