openclaw - 💡(How to fix) Fix [Perf] Cascade: loadInstalledPluginIndex re-walks plugin manifests sync per RPC, blocks event loop 5-30 s under concurrent load [3 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#80345Fetched 2026-05-11 03:15:51
View on GitHub
Comments
3
Participants
3
Timeline
9
Reactions
2
Timeline (top)
commented ×3mentioned ×3subscribed ×3

Under concurrent-RPC load (Control UI session-switch fan-out + an active agent turn + a few channel probes), the OpenClaw gateway main thread can spend 5-30 s synchronously walking plugin manifest paths and ancestor boundary directories. Wall-clock cascades of 17-33 s for sessions.list, chat.history, node.list, device.pair.list, and other RPCs that transitively call loadInstalledPluginIndex are reproducible.

Root Cause

loadInstalledPluginIndex(params) (visible in dist as installed-plugin-index-store-DH9sPamj.js) is a bare wrapper that synchronously calls buildInstalledPluginIndex on every invocation:

function loadInstalledPluginIndex(params = {}) {
    return buildInstalledPluginIndex(params);
}

buildInstalledPluginIndex calls resolveInstalledPluginIndexRegistry, which always forwards candidates, diagnostics, and installRecords into loadPluginManifestRegistry. In manifest-registry-BiAsJcRZ.js, that combination structurally bypasses the existing manifest-registry result cache:

if (params.candidates || params.diagnostics ||
    params.bundledChannelConfigCollector || params.installRecords) {
    return null; // bypasses cache, walks all plugin manifests
}

The bypass guard's intent — don't cache when the caller wants side-effect collection callbacks or controls the discovery surface — is sound. The defect is that resolveInstalledPluginIndexRegistry always supplies candidates, diagnostics, and installRecords, so every loadInstalledPluginIndex call structurally bypasses the cache. Hot per-RPC callers that consequently re-walk the full manifest tree on the main thread:

  • codex-route-warnings-DgzADONp.js:797isCodexPluginInstalledAndEnabled (per Codex-routed turn, plus chat.history / sessions.list / models.list paths)
  • installed-plugin-index-store-DH9sPamj.js:1499inspectPersistedInstalledPluginIndex (status / doctor / Control-UI snapshots)
  • plugin-registry-Cut-MFnk.js:191snapshot: loadInstalledPluginIndex(...)

sample(1) of the gateway during a cascade window shows ~88 % of main-thread CPU samples in fs.{lstat,stat,open,fstat,exists}Sync under Builtins_FlattenIntoArrayWithMapFn, with v8::Isolate::CaptureAndSetErrorStackOptimizedJSFrame::SummarizeSafepointTable::FindEntry dominating the ENOENT paths (each missing candidate throws → V8 walks optimised frames; cost amplifies materially on Node ≥ 22). An fs-syscall counter recorded 802,041 sync fs syscalls in a 10 second window during one cascade, dominated by resolveBoundaryPathLexicalSync (~138k), safe-open-sync.openVerifiedFileSync (~68k), and discovery.resolvePluginManifestPath / discovery.safeStatSync (~33k each).

Fix Action

Fix / Workaround

Mitigation we shipped locally

We patched loadInstalledPluginIndex with a TTL result cache (default 30 s, OPENCLAW_INSTALLED_INDEX_CACHE_TTL_MS=0 disables) keyed on workspaceDir | stateDir | indexFilePath | usesGlobalEnv | HOME | OPENCLAW_STATE_DIR | installs.json mtime+size | installRecords-fingerprint | JSON(plugins config). Skips cache for refreshReason / now (test injection) / bundledChannelConfigCollector callers. FIFO eviction at 16 entries.

Happy to share on request: the local reference patch (~110 lines), a 164 KB sample(1) capture from a 10.5 s startup and 5.2 s in-turn block, the fs-syscall counter shim, and before/after trace log captures.

Code Example

function loadInstalledPluginIndex(params = {}) {
    return buildInstalledPluginIndex(params);
}

---

if (params.candidates || params.diagnostics ||
    params.bundledChannelConfigCollector || params.installRecords) {
    return null; // bypasses cache, walks all plugin manifests
}
RAW_BUFFERClick to expand / collapse

Summary

Under concurrent-RPC load (Control UI session-switch fan-out + an active agent turn + a few channel probes), the OpenClaw gateway main thread can spend 5-30 s synchronously walking plugin manifest paths and ancestor boundary directories. Wall-clock cascades of 17-33 s for sessions.list, chat.history, node.list, device.pair.list, and other RPCs that transitively call loadInstalledPluginIndex are reproducible.

Symptoms

  • [trace:event-loop-block] fires repeatedly with 5-30 s lag.
  • sessions.list, chat.history, node.list, device.pair.list, models.list, agents.list all stretch together — the common factor is loadInstalledPluginIndex.
  • WS handshakes can time out during the cascade; Control UI feels frozen for 10-30 s while a single agent turn is in flight.
  • Severity scales with installed-plugin count, concurrent in-flight RPCs, and (visibly) Node version.

Reproduction

  • OpenClaw 2026.5.7, Node v25.5.0, macOS APFS, 8 installed plugins (codex + 7 channel/extension plugins), Codex plugin enabled.
  • Open Control UI tab + a large per-agent session turn in flight + the standard Control-UI session-switch RPC bundle (commands.list, chat.history, sessions.list, node.list, models.list, device.pair.list, agents.list) plus 3 concurrent probes.

This reliably reproduces 17-33 s cascades. A single-channel single-plugin CLI user likely never sees it; a Control-UI-fan-out + multi-channel + Codex-enabled deployment will.

Root cause

loadInstalledPluginIndex(params) (visible in dist as installed-plugin-index-store-DH9sPamj.js) is a bare wrapper that synchronously calls buildInstalledPluginIndex on every invocation:

function loadInstalledPluginIndex(params = {}) {
    return buildInstalledPluginIndex(params);
}

buildInstalledPluginIndex calls resolveInstalledPluginIndexRegistry, which always forwards candidates, diagnostics, and installRecords into loadPluginManifestRegistry. In manifest-registry-BiAsJcRZ.js, that combination structurally bypasses the existing manifest-registry result cache:

if (params.candidates || params.diagnostics ||
    params.bundledChannelConfigCollector || params.installRecords) {
    return null; // bypasses cache, walks all plugin manifests
}

The bypass guard's intent — don't cache when the caller wants side-effect collection callbacks or controls the discovery surface — is sound. The defect is that resolveInstalledPluginIndexRegistry always supplies candidates, diagnostics, and installRecords, so every loadInstalledPluginIndex call structurally bypasses the cache. Hot per-RPC callers that consequently re-walk the full manifest tree on the main thread:

  • codex-route-warnings-DgzADONp.js:797isCodexPluginInstalledAndEnabled (per Codex-routed turn, plus chat.history / sessions.list / models.list paths)
  • installed-plugin-index-store-DH9sPamj.js:1499inspectPersistedInstalledPluginIndex (status / doctor / Control-UI snapshots)
  • plugin-registry-Cut-MFnk.js:191snapshot: loadInstalledPluginIndex(...)

sample(1) of the gateway during a cascade window shows ~88 % of main-thread CPU samples in fs.{lstat,stat,open,fstat,exists}Sync under Builtins_FlattenIntoArrayWithMapFn, with v8::Isolate::CaptureAndSetErrorStackOptimizedJSFrame::SummarizeSafepointTable::FindEntry dominating the ENOENT paths (each missing candidate throws → V8 walks optimised frames; cost amplifies materially on Node ≥ 22). An fs-syscall counter recorded 802,041 sync fs syscalls in a 10 second window during one cascade, dominated by resolveBoundaryPathLexicalSync (~138k), safe-open-sync.openVerifiedFileSync (~68k), and discovery.resolvePluginManifestPath / discovery.safeStatSync (~33k each).

Mitigation we shipped locally

We patched loadInstalledPluginIndex with a TTL result cache (default 30 s, OPENCLAW_INSTALLED_INDEX_CACHE_TTL_MS=0 disables) keyed on workspaceDir | stateDir | indexFilePath | usesGlobalEnv | HOME | OPENCLAW_STATE_DIR | installs.json mtime+size | installRecords-fingerprint | JSON(plugins config). Skips cache for refreshReason / now (test injection) / bundledChannelConfigCollector callers. FIFO eviction at 16 entries.

Verification, steady state with a large per-agent turn active:

  • Before: sessions.list totalMs=31606, chat.history 17665, node.list 17659; repeated 5-30 s event-loop blocks during the turn.
  • After: [trace:installed-index-cache] hits=2070-3700 misses=3-7 bypass=0 per 30 s; probe max < 200 ms; no [trace:slow-rpc] warnings.

Happy to PR this if it matches the maintainers' preferred direction, but flagging it as a choice between (a) caching loadInstalledPluginIndex (our approach), or (b) the cleaner architectural fix of changing resolveInstalledPluginIndexRegistry so it doesn't trip the existing bypass guards — e.g. distinguish "caller supplied installRecords for staleness control" → cacheable with installRecords hash in the key, from "caller supplied bundledChannelConfigCollector" → genuinely uncacheable.

Environment

  • openclaw 2026.5.7, node v25.5.0, macOS 26.4 (APFS).
  • Reproducible on Node ≥ 18; per-cascade duration roughly doubles between Node 18 and 25 because of deeper V8 stack-capture walks on ENOENT, but the structural sync-fs storm is the same.

Related context

The active maintainer of the recent "manifest registry pass" perf line (recent CHANGELOG entries routing cold manifest/capability lookups through the installed plugin index, the PluginLookUpTable startup pass, and the models list --all --provider <id> speedup) is best placed to choose the fix shape. This feels like the next step in that arc: callers were correctly routed into loadInstalledPluginIndex to avoid manifest scans, but the index loader itself was never memoised — the cost just moved up one layer.

Offer

Happy to share on request: the local reference patch (~110 lines), a 164 KB sample(1) capture from a 10.5 s startup and 5.2 s in-turn block, the fs-syscall counter shim, and before/after trace log captures.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix [Perf] Cascade: loadInstalledPluginIndex re-walks plugin manifests sync per RPC, blocks event loop 5-30 s under concurrent load [3 comments, 3 participants]