openclaw - ✅(Solved) Fix Gateway sync-FS plugin discovery blocks event loop on every channels.status / channel restart [2 pull requests, 1 comment, 2 participants]

GitHub stats: openclaw/openclaw#78100, fetched 2026-05-06 06:16:54
Comments: 1 · Participants: 2 · Timeline: 3 (cross-referenced ×2, commented ×1) · Reactions: 2

Root Cause

loadPluginRegistrySnapshotWithMetadata (src/plugins/plugin-registry-snapshot.ts) is invoked from channels.status polls and channel-restart paths. Each invocation re-walks the filesystem synchronously via realpathSync, statSync, readFileSync to verify manifest hashes, package.json signatures, and bundled-root paths.

23-minute CPU profile (node --cpu-prof on the gateway process):

Total CPU time: 1380s
  661s  native::(idle)         48%  (waiting on sync syscalls)
  380s  native::lstat          27.6%
  105s  native::stat            7.6%
   46s  native::fstat           3.3%
   46s  native::close           3.3%

Top JS callers of those syscalls:

65s   resolveBoundaryPathSync       (boundary-path resolver)
47s   safeRealpathSync via discovery checkSourceEscapesRoot
41s   installed-plugin-index resolvePackageJsonPath
38s   installed-plugin-index resolvePackageJsonRelativePath
37s   discoverInDirectory (plugin walker)
17s   getPathKindSync
15s   readLexicalStat
30s   openVerifiedFileSync (3 sites)
11s   isFilesystemMountPoint
13s   safeFileSignature / safeHashFile (manifest re-hashing)
4s    resolveComparablePath in loadPluginRegistrySnapshotWithMetadata

Symptom

On slow filesystems (WSL2 9P, NFS, container overlays, encrypted volumes) the gateway exhibits long event-loop blocks and inbound dispatch latencies of 17–25s. WhatsApp Web channels also disconnect repeatedly with 408 timeouts, leading to a self-reinforcing reconnect loop.

Repro

  • WSL2 with locally-built pnpm gateway:watch
  • Linked WhatsApp account
  • Send a self DM via pnpm openclaw message send --channel whatsapp --target +<self> --message "hi"
  • Observe an inbound web message → cli exec gap of 17–25s in /tmp/openclaw/openclaw-*.log
  • Observe eventLoopDelayMaxMs=6446ms in liveness warnings during dispatch
  • Observe WhatsApp Web connection closed (status 408) shortly after
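
The eventLoopDelayMaxMs warning reflects a general Node.js property: timers and socket callbacks cannot run while a synchronous call is executing. The sketch below shows that effect in isolation. busyWait and measureTimerLag are hypothetical helpers standing in for a synchronous filesystem walk; they are not gateway code.

```typescript
// Hypothetical helpers for illustration only; none of this is openclaw code.

// Spin synchronously for roughly `ms` milliseconds, standing in for a burst
// of realpathSync/statSync/readFileSync calls on a slow filesystem.
function busyWait(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // intentionally empty: nothing else can run while this loop spins
  }
}

// Schedule a timer that is due immediately, then block synchronously.
// Resolves with how late the timer actually fired, in milliseconds.
function measureTimerLag(blockMs: number): Promise<number> {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 0);
    busyWait(blockMs);
  });
}

measureTimerLag(200).then((lagMs) => {
  console.log(`timer due at 0ms fired after about ${lagMs}ms`);
});
```

A WebSocket keepalive is scheduled the same way as this timer, so it is delayed by the same amount whenever the loop is blocked.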

PR fix notes

PR #78101: fix(plugins): coalesce loadPluginRegistrySnapshotWithMetadata within 2s

Description (problem / solution / changelog)

Closes #78100

Summary

The persisted plugin registry rebuild walks the filesystem synchronously on every call (realpathSync, statSync, readFileSync for manifest hashes, package.json signatures, bundled-root paths). On slow filesystems (WSL2 9P, NFS, container overlays) one pass costs 5–10s. The gateway hammers it from channels.status polls and channel-restart paths several times per minute, blocking the event loop in tight bursts.

This PR coalesces repeated calls with the same params within a 2-second window. The first caller pays the full cost; subsequent callers within 2s reuse the cached PluginRegistrySnapshotResult. Real plugin changes are still picked up within a couple of seconds.

Cache key includes stateDir, filePath, pluginIndexFilePath, preferPersisted, and the policy hash of the active config so distinct callers do not cross-pollute. resetPluginRegistrySnapshotCache() is exported for tests.
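
As a rough illustration of the coalescing idea described above, the sketch below caches a loader result per key for a fixed window. It is not the code from PR #78101: createCoalescer, SnapshotParams, and the JSON key layout are assumptions made for the example, and only the listed key fields come from the PR description.

```typescript
// A sketch only: createCoalescer, SnapshotParams and the key layout are
// hypothetical and do not reproduce the code in PR #78101.

interface SnapshotParams {
  stateDir: string;
  filePath: string;
  pluginIndexFilePath: string;
  preferPersisted: boolean;
  policyHash: string;
}

interface Entry<T> {
  value: T;
  expiresAt: number;
}

function createCoalescer<T>(
  load: (params: SnapshotParams) => T,
  ttlMs: number,
  now: () => number = Date.now,
) {
  const cache = new Map<string, Entry<T>>();

  // Every field that can change the result is part of the key, so callers
  // with different parameters never share an entry.
  const keyOf = (p: SnapshotParams): string =>
    JSON.stringify([p.stateDir, p.filePath, p.pluginIndexFilePath, p.preferPersisted, p.policyHash]);

  return {
    get(params: SnapshotParams): T {
      const key = keyOf(params);
      const hit = cache.get(key);
      if (hit !== undefined && now() < hit.expiresAt) {
        return hit.value; // a caller inside the window reuses the result
      }
      const value = load(params); // the first caller pays the full cost
      // Start the window after the load finishes; a load that takes longer
      // than the TTL would otherwise store an entry that is already expired.
      cache.set(key, { value, expiresAt: now() + ttlMs });
      return value;
    },
    // Counterpart of the exported reset helper, intended for tests.
    reset(): void {
      cache.clear();
    },
  };
}
```

Injecting the clock keeps the window testable without real delays. A call made after the window closes runs the loader again, which bounds staleness to the TTL.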

Why this approach

The repo's src/plugins/CLAUDE.md says "Do not add persistent metadata caches for discovery, manifest registries, installed-index reconstruction". This 2-second TTL is intentionally short — it coalesces redundant calls within the same operation (e.g., a single channels.status request that re-resolves the snapshot multiple times via different code paths) without holding stale data across user-driven plugin changes. The deeper fix in #78100 is async-ifying the underlying sync I/O, but this is a low-risk near-term win.

Real behavior proof

  • behavior: Inbound WhatsApp DM dispatch + channels.status polls re-walk plugin discovery synchronously, blocking the event loop and causing WhatsApp Web 408 disconnects.
  • environment: Linux 6.6.87 (WSL2), Node 22.22.2, openclaw 2026.5.5 from local pnpm gateway:watch. Gateway running a linked WhatsApp account, agent default claude-cli/claude-sonnet-4-6.
  • steps:
    1. Fresh gateway: OPENCLAW_GATEWAY_WATCH_ATTACH=0 pnpm gateway:watch
    2. Wait for WhatsApp default: enabled, configured, linked, running, connected
    3. Time five sequential channels.status calls
    4. Send self-DM via pnpm openclaw message send --channel whatsapp --target +<self> --message "warm test"
    5. Tail /tmp/openclaw/openclaw-*.log and measure the inbound web message → cli exec gap
  • evidence: gateway logs /tmp/openclaw/openclaw-2026-05-05.log, before/after measurements captured below.
  • observedResult:
    • Gateway-side channels.status durations parsed from log:
      • before: [5992, 6014, 7234, 8235, 7517, 8221, 6110, 5416, 7966, 5591, 7855, 5549, 5938, 5387, 5911, 5542, 5341, 5536, 5477] (reported avg 6964ms; the 19 listed samples average about 6360ms)
      • after: [3471, 3130, 3401, 3431, 3115, 3281, 3475] (avg 3329ms)
    • Inbound dispatch gap (warm gateway):
      • before: 17.3s (run 1), 25.3s (run 2), with eventLoopDelayMaxMs=6446ms
      • after: 9.9s (warm DM), 11.5s (manual DM), with eventLoopDelayMaxMs=3940ms
    • Auto-reply delivery: previously raced socket reconnect and returned auto-reply delivery failed; after patch, consistently delivers (auto-reply sent (text))
  • notTested: native Linux/macOS impact (cache should be a no-op cost; FS is fast enough that the cache hit/miss difference is sub-millisecond)

Test plan

  • pnpm test src/plugins/plugin-registry-snapshot.test.ts — 6 tests pass
  • pnpm test:changed — all targeted lanes pass on rerun
  • Live gateway dogfood on WSL2 with linked WhatsApp account, multiple DM round-trips

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/plugins/plugin-registry-snapshot.ts (modified, +49/-2)

PR #78134: fix: cache plugin registry snapshot loads

Description (problem / solution / changelog)

Fixes #78100.

Summary

  • add a 1s process-local TTL cache for equivalent loadPluginRegistrySnapshotWithMetadata calls
  • key the cache by state/index paths, persisted-read mode, relevant plugin env flags, bundled dir/version, and policy hash
  • keep explicit caller-provided index / installRecords exact by bypassing the cache
  • clear the cache before refreshPluginRegistry
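
The bullets above describe three behaviours: a keyed short-TTL cache, a bypass when the caller passes explicit data, and invalidation before a refresh. A minimal sketch of how those might fit together follows. LoadParams, cacheKeyFor, and createSnapshotCache are hypothetical names, and the key covers only a subset of the fields the PR lists; this is not the PR's implementation.

```typescript
// Hypothetical sketch; field names and helpers are assumptions, not the
// code from PR #78134.

interface LoadParams {
  stateDir: string;
  indexPath: string;
  preferPersisted: boolean;
  policyHash: string;
  index?: object; // explicit caller-provided index
  installRecords?: object; // explicit caller-provided install records
}

// Returns null when the call must bypass the cache.
function cacheKeyFor(p: LoadParams): string | null {
  if (p.index !== undefined || p.installRecords !== undefined) {
    return null; // explicit data must be honoured exactly
  }
  return JSON.stringify([p.stateDir, p.indexPath, p.preferPersisted, p.policyHash]);
}

function createSnapshotCache<T>(ttlMs: number, now: () => number = Date.now) {
  const entries = new Map<string, { value: T; storedAt: number }>();
  return {
    load(p: LoadParams, build: (p: LoadParams) => T): T {
      const key = cacheKeyFor(p);
      if (key === null) {
        return build(p); // uncached path for explicit inputs
      }
      const hit = entries.get(key);
      if (hit !== undefined && now() - hit.storedAt < ttlMs) {
        return hit.value;
      }
      const value = build(p);
      entries.set(key, { value, storedAt: now() });
      return value;
    },
    // Intended to be called before a registry refresh so the refresh never
    // observes a stale snapshot.
    clear(): void {
      entries.clear();
    },
  };
}
```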

Tests

  • PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/run-vitest.mjs run --config test/vitest/vitest.plugins.config.ts src/plugins/plugin-registry-snapshot.test.ts
  • PATH="/tmp/openclaw-pnpm-shim:$PATH" pnpm exec oxfmt --check src/plugins/plugin-registry-snapshot.ts src/plugins/plugin-registry-snapshot.test.ts
  • git diff --check
  • PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/check-changed.mjs

Notes

The cache deliberately permits at most a short stale window for implicit registry loads. Explicit data paths and registry refresh remain uncached/invalidating.

Changed files

  • src/plugins/plugin-registry-snapshot.test.ts (modified, +63/-1)
  • src/plugins/plugin-registry-snapshot.ts (modified, +64/-4)

Cascading WhatsApp disconnects

Baileys' keepAliveIntervalMs is 25s. When the event loop blocks for 30s+ during channels.whatsapp.start-account or other discovery-heavy phases, the WS keepalive misses its window and WA returns 408. The disconnect triggers a channel restart, which re-runs start-account, which re-runs discovery, which blocks again, and so on. Block-to-disconnect correlation in one session:

disconnect   preceding event-loop block
17:19:04     57s
17:41:52     67s
18:07:31     56s
16:08:03     107s
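
To make the correlation concrete, the sketch below compares each observed block with the 25s keepalive interval. The durations are copied from the session data above. The model is an assumption: it only counts whole keepalive intervals elapsed during a block and does not reproduce Baileys' actual timeout logic.

```typescript
// Values copied from the issue; the counting model is a simplification.
const KEEPALIVE_INTERVAL_S = 25;

const observedBlocks = [
  { disconnectAt: "17:19:04", blockS: 57 },
  { disconnectAt: "17:41:52", blockS: 67 },
  { disconnectAt: "18:07:31", blockS: 56 },
  { disconnectAt: "16:08:03", blockS: 107 },
];

// Number of whole keepalive intervals that elapse while the loop is blocked.
function intervalsMissed(blockS: number, intervalS: number): number {
  return Math.floor(blockS / intervalS);
}

for (const { disconnectAt, blockS } of observedBlocks) {
  const missed = intervalsMissed(blockS, KEEPALIVE_INTERVAL_S);
  console.log(`${disconnectAt}: ${blockS}s block spans ${missed} keepalive interval(s)`);
}
```

Every block in the session spans at least two full intervals, which is consistent with the keepalive missing its window each time.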

Proposed fixes

  1. (this PR/follow-up) Coalesce loadPluginRegistrySnapshotWithMetadata calls within a short TTL window
  2. Async-ify the sync-FS work in plugin discovery + boundary-path resolution + installed-plugin-index rebuild so the event loop is not blocked
  3. Process-lifetime memoization for safeRealpathSync / boundary-path resolution where path topology is stable
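
For item 2, a minimal sketch of what an async directory probe could look like with node:fs/promises. discoverManifests and the package.json default are hypothetical. The real discovery code also performs boundary and hash checks that are not shown here.

```typescript
import { readdir, stat } from "node:fs/promises";
import { join } from "node:path";

// Hypothetical async counterpart of a sync plugin walker. Each await hands
// control back to the event loop, so timers and socket I/O keep running
// while the filesystem is slow.
async function discoverManifests(root: string, manifestName = "package.json"): Promise<string[]> {
  const entries = await readdir(root, { withFileTypes: true });

  // Probe all candidate directories concurrently instead of one by one.
  const checks = entries
    .filter((entry) => entry.isDirectory())
    .map(async (entry): Promise<string | null> => {
      const candidate = join(root, entry.name, manifestName);
      try {
        const info = await stat(candidate);
        return info.isFile() ? candidate : null;
      } catch {
        return null; // a missing manifest means the directory is not a plugin
      }
    });

  const found = await Promise.all(checks);
  return found.filter((path): path is string => path !== null);
}
```

The total wall-clock time on a slow filesystem may not drop much, but the work no longer monopolises the event loop, which is the property the keepalive depends on.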

The repo's src/plugins/CLAUDE.md currently says "Do not add persistent metadata caches for discovery, manifest registries, installed-index reconstruction" — this guidance keeps the design freshness-first, but on slow filesystems the literal cost of "always fresh" is the event loop. Worth an architectural conversation about mtime-gated caching vs. no caching.
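
As input to that conversation, a minimal sketch of what mtime-gated caching could mean in practice: keep the parsed result and re-read the file only when its mtime or size changes. FileProbe and createMtimeGatedReader are hypothetical. The probe is injected so the sketch stays self-contained, and a real version would presumably wrap statSync and readFileSync.

```typescript
// Hypothetical sketch of mtime-gated caching; not an openclaw API.

interface FileProbe {
  stat(path: string): { mtimeMs: number; size: number };
  read(path: string): string;
}

interface GatedEntry<T> {
  mtimeMs: number;
  size: number;
  value: T;
}

function createMtimeGatedReader<T>(probe: FileProbe, parse: (text: string) => T) {
  const cache = new Map<string, GatedEntry<T>>();
  return function read(path: string): T {
    // One stat per call replaces the read plus re-hash of the file.
    const info = probe.stat(path);
    const hit = cache.get(path);
    if (hit !== undefined && hit.mtimeMs === info.mtimeMs && hit.size === info.size) {
      return hit.value; // unchanged on disk as far as the metadata can tell
    }
    const value = parse(probe.read(path));
    cache.set(path, { mtimeMs: info.mtimeMs, size: info.size, value });
    return value;
  };
}
```

Two caveats apply. The stat is still a synchronous syscall per path, so this reduces the blocking work rather than removing it. An edit that preserves both size and mtime (for example within the timestamp granularity) would go unnoticed.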

Workarounds

  • Run on native Linux/macOS (FS calls 10–50× faster than WSL2 9P)
  • Use Telegram or other push-based channels (no WS keepalive sensitivity)

Affects

All channels and CLI commands on slow filesystems. WhatsApp Web is the most visible symptom because of its keepalive sensitivity.

Extent analysis

TL;DR

The most likely fix is to coalesce loadPluginRegistrySnapshotWithMetadata calls and async-ify the sync-FS work in plugin discovery to prevent event loop blocks.

Guidance

  • Identify and coalesce loadPluginRegistrySnapshotWithMetadata calls within a short TTL window to reduce the number of synchronous filesystem calls.
  • Async-ify the sync-FS work in plugin discovery, boundary-path resolution, and installed-plugin-index rebuild to prevent event loop blocks.
  • Consider implementing process-lifetime memoization for safeRealpathSync and boundary-path resolution where path topology is stable.
  • Review the trade-offs of introducing caching mechanisms, such as mtime-gated caching, to improve performance on slow filesystems.

Notes

The proposed fixes require careful consideration of the trade-offs between performance and design freshness. The current guidance in src/plugins/CLAUDE.md emphasizes a freshness-first approach, but this may need to be revisited to accommodate slow filesystems.

Recommendation

Apply the workaround until a permanent fix ships: run the gateway on native Linux/macOS, or use Telegram or another push-based channel. The current implementation is not optimized for slow filesystems, so a faster filesystem or a channel without WS keepalive sensitivity alleviates the symptoms.
