openclaw - 💡(How to fix) Fix Gateway 7-second event-loop stalls on 2026.4.27 — createOpenClawTools synchronously rebuilds plugin registry + provider-auth env-var resolution per tool-creation pass [3 comments, 3 participants]

Source: openclaw/openclaw#74860 (fetched 2026-05-01 05:40:40)

Summary

After upgrading from 2026.4.15 to 2026.4.27, the gateway exhibits ~7-second synchronous event-loop stalls during normal agent activity, blocking Telegram polling and forcing watchdog-triggered container restarts. In the 24h post-upgrade window there were 27 gateway crashes, 25 of them Tier 2 (full container restart). Before the upgrade, the gateway had been operating without this pattern.

.cpuprofile evidence (6 stalls aggregated, 230s combined CPU) shows the dominant hot path is createOpenClawTools / createOpenClawCodingTools invoking loadPluginRegistrySnapshotWithMetadata and resolveProviderAuthEnvVarCandidates synchronously on every tool-creation pass. Top self-time frames are lstat, readFileSync, path.resolve — classic uncached fs scan.

This may share root cause with #73532 (plugin loader hot loop on 4.25/4.26) but differs in entry point and version: that report's hot stack is prepareBundledPluginRuntimeDistMirror (mirror-init); mine is createOpenClawTools (tool-creation per agent invocation). Filing separately so the maintainer can dedup or split as appropriate.

This is distinct from #74345 (ACP session leak / 480s task-registry-maintenance event-loop holds, also on 4.27) — different fingerprint (~7s stalls, no ACP cleanup loop, no session-write-lock warnings).

Environment

  • OpenClaw 2026.4.27 (c27ae31043c7) — installed via global npm
  • Node v22.22.1
  • Linux x86_64, host-mode container (Docker), gateway port 54401
  • Channels enabled: Telegram (multi-bot), WhatsApp
  • ~10 user agents bound to lanes via bindings[]; many tool-call-heavy

Symptom fingerprint

/data/.openclaw/logs/gateway-eventloop.jsonl — continuous ~7s stalls during normal load:

| Timestamp (UTC) | Stall (ms) |
|---|---|
| 04:17:28Z | 6937 |
| 04:17:43Z | 7130 |
| 04:18:49Z | 7151 |
| 04:19:19Z | 7365 |
| 04:20:34Z | 6942 |
| 04:20:49Z | 7038 |
| 04:23:04Z | 7650 |

Event-loop utilization (ELU) peaked at 0.81 across two consecutive samples, with maxMs=6941–7650ms. A CPU profile is auto-saved for each stall.
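For triage, stall samples like these can be summarized with a few lines of TypeScript. This is a minimal sketch, not the gateway's own tooling: the field names `ts` and `maxMs` are assumptions about the JSONL schema (the issue only mentions `maxMs`), and the 5000 ms threshold is arbitrary.

```typescript
// Hypothetical sample shape; the real gateway-eventloop.jsonl schema may differ.
interface StallSample {
  ts: string;
  maxMs: number;
}

interface StallSummary {
  count: number;
  minMs: number;
  maxMs: number;
  meanMs: number;
}

// Parse JSONL text and keep only samples at or above the threshold.
function parseStalls(jsonl: string, thresholdMs: number): StallSample[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as StallSample)
    .filter((s) => typeof s.maxMs === "number" && s.maxMs >= thresholdMs);
}

// Reduce the filtered samples to count, min, max, and mean.
function summarizeStalls(samples: StallSample[]): StallSummary | null {
  if (samples.length === 0) return null;
  const values = samples.map((s) => s.maxMs);
  const total = values.reduce((sum, v) => sum + v, 0);
  return {
    count: values.length,
    minMs: Math.min(...values),
    maxMs: Math.max(...values),
    meanMs: total / values.length,
  };
}

// The seven samples reported in this issue, re-encoded as JSONL.
const exampleLog = [
  '{"ts":"04:17:28Z","maxMs":6937}',
  '{"ts":"04:17:43Z","maxMs":7130}',
  '{"ts":"04:18:49Z","maxMs":7151}',
  '{"ts":"04:19:19Z","maxMs":7365}',
  '{"ts":"04:20:34Z","maxMs":6942}',
  '{"ts":"04:20:49Z","maxMs":7038}',
  '{"ts":"04:23:04Z","maxMs":7650}',
].join("\n");

const summary = summarizeStalls(parseStalls(exampleLog, 5000));
console.log(summary);
```

For the seven reported samples this gives a minimum of 6937 ms, a maximum of 7650 ms, and a mean of roughly 7173 ms, which matches the "~7s per pass" characterization.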

gateway-monitor.log — 27 gateway crashes / 24h, 25 Tier 2 (full container restart), concentrated in the post-upgrade window. Watchdog kills gateway, container restarts, load resumes, stalls return.

While stalled: Telegram long-poll stays alive (update-offset-*.json mtime advances), getChat/getChatMember confirm bot-side state is fine, but channel-post-routed agent invocations don't dispatch — the agent isn't getting CPU.

CPU profile analysis (6 stalls aggregated, 230s combined)

Top 5 by inclusive time (caller + descendants):

| Inclusive ms | % | Function | Source |
|---|---|---|---|
| 51,637 | 3.2% | runEmbeddedAttempt | selection-8xKkwZC_.js:5792 |
| 46,688 | 2.9% | createOpenClawCodingTools | pi-tools-DvRlK7Mt.js:805 |
| 46,668 | 2.9% | createOpenClawTools | openclaw-tools-CHlPDJlv.js:8767 |
| 32,160 | 2.0% | loadPluginRegistrySnapshotWithMetadata | plugin-registry-Bjn9rPwy.js:115 |
| 24,240 | 1.5% | resolveProviderAuthEnvVarCandidates | provider-env-vars-DOy0Czuc.js:60 |

Top 5 by self time (where CPU actually burns):

| Self ms | % | Function |
|---|---|---|
| 13,343 | 5.8% | lstat |
| 4,884 | 2.1% | post (inspector) |
| 4,090 | 1.8% | (garbage collector) |
| 3,241 | 1.4% | path.resolve |
| 3,219 | 1.4% | readFileSync |

The hot-path pattern is unambiguous: createOpenClawTools + createOpenClawCodingTools running synchronous plugin-registry rebuild + provider-auth-env-var resolution on every tool-creation pass, dominated by lstat / readFileSync / existsSync / readdir.

loadInstalledPluginIndex (22,638ms inclusive) and loadPluginManifest (905ms self) confirm the plugin index is being rebuilt, not cache-served.

The frames resolveProviderAuthEnvVarCandidates, resolveManifestProviderAuthEnvVarCandidates, isProviderApiKeyConfigured, and hasAuthForProvider appear together in one call chain, which indicates per-tool-call provider-auth resolution that walks process.env against every model provider's possible env var names.
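One way to avoid repeating that env walk is to resolve each provider once and cache the answer. The sketch below is illustrative only: `ProviderManifest`, `createAuthResolver`, and `isConfigured` are hypothetical names, not OpenClaw's API, and the env is passed in as a plain object so the example stays self-contained.

```typescript
// Hypothetical shapes for illustration; these are not OpenClaw's real types.
interface ProviderManifest {
  id: string;
  envVarCandidates: string[];
}

type Env = Record<string, string | undefined>;

// Build a resolver that checks each provider's env-var candidates at most once.
function createAuthResolver(providers: ProviderManifest[], env: Env) {
  const byId = new Map<string, ProviderManifest>();
  for (const p of providers) byId.set(p.id, p);

  const cache = new Map<string, boolean>();
  let envLookups = 0;

  function isConfigured(id: string): boolean {
    const cached = cache.get(id);
    if (cached !== undefined) return cached;

    let configured = false;
    const manifest = byId.get(id);
    if (manifest) {
      for (const name of manifest.envVarCandidates) {
        envLookups++;
        const value = env[name];
        if (value !== undefined && value !== "") {
          configured = true;
          break;
        }
      }
    }
    cache.set(id, configured);
    return configured;
  }

  return { isConfigured, lookupCount: () => envLookups };
}

const resolver = createAuthResolver(
  [
    { id: "alpha", envVarCandidates: ["ALPHA_API_KEY", "ALPHA_KEY"] },
    { id: "beta", envVarCandidates: ["BETA_API_KEY"] },
  ],
  { ALPHA_KEY: "secret" },
);

// Simulate 100 tool-creation passes that each ask about both providers.
for (let pass = 0; pass < 100; pass++) {
  resolver.isConfigured("alpha");
  resolver.isConfigured("beta");
}
console.log(resolver.lookupCount()); // 3
```

A real implementation would also need an invalidation path for when auth profiles or the environment change at runtime; this sketch assumes they are fixed for the life of the process.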

Inferred bug class

In 2026.4.27, the tool-creation path appears to:

  1. Reload the entire plugin registry from disk synchronously on every invocation rather than serving from the cache that loadOpenClawPlugins is supposed to provide.
  2. Re-resolve provider-auth env-var candidates per tool synchronously without caching — this scales with (number of tools × number of providers × number of env-var aliases per provider).

Each pass takes ~7s of CPU time, blocking the event loop. Under any non-trivial agent activity (multiple tool calls per minute), the gateway never recovers and watchdog kills it.
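The expected behavior (load once, then serve from cache) can be sketched as follows. This is a simplified model under stated assumptions: `RegistrySnapshot`, `memoizeSnapshot`, and `createTools` are hypothetical stand-ins, and the disk scan is replaced by a counter so the effect is visible.

```typescript
// Hypothetical snapshot type; the real registry snapshot is richer.
interface RegistrySnapshot {
  plugins: string[];
  loadedAt: number;
}

// Wrap an expensive synchronous loader so it runs at most once until invalidated.
function memoizeSnapshot(load: () => RegistrySnapshot) {
  let cached: RegistrySnapshot | undefined;
  return {
    get(): RegistrySnapshot {
      if (cached === undefined) cached = load();
      return cached;
    },
    invalidate(): void {
      cached = undefined;
    },
  };
}

// Stand-in for the disk scan; counts how often it runs.
let diskScans = 0;
const registry = memoizeSnapshot(() => {
  diskScans++;
  return { plugins: ["plugin-a", "plugin-b"], loadedAt: Date.now() };
});

// Hypothetical tool factory: one registry read per tool-creation pass.
function createTools(): string[] {
  return registry.get().plugins.map((name) => `${name}:tool`);
}

for (let pass = 0; pass < 50; pass++) createTools();
console.log(diskScans); // 1
```

With the cache in place, 50 tool-creation passes trigger a single scan. The regression described here behaves as if `load()` ran on every pass instead.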

This blocks: Telegram long-poll dispatch, agent response generation, all PM2 worker IPC.

Likely overlaps with #73532's "cache key varies between callers" / "cache invalidated too aggressively" hypotheses, but expressed via the tool-creation entry point rather than the mirror-prep one.
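If the "cache key varies between callers" hypothesis holds, one possible mitigation is to normalize the key before lookup. The sketch below assumes the key is derived from plugin directories and enabled plugin names; those fields, and the name `stableRegistryKey`, are hypothetical, since the issue does not show the real key.

```typescript
// Hypothetical cache-key inputs; the real key fields are not known from the issue.
interface RegistryKeyInput {
  pluginDirs: string[];
  enabledPlugins: string[];
  workspaceDir?: string;
}

// Build a key that ignores array order and duplicate entries, so equivalent
// inputs from different callers map to the same cache slot.
function stableRegistryKey(input: RegistryKeyInput): string {
  const normalize = (items: string[]): string[] => Array.from(new Set(items)).sort();
  return JSON.stringify({
    pluginDirs: normalize(input.pluginDirs),
    enabledPlugins: normalize(input.enabledPlugins),
    workspaceDir: input.workspaceDir ?? null,
  });
}

const fromGateway = stableRegistryKey({
  pluginDirs: ["/a", "/b"],
  enabledPlugins: ["x", "y"],
});
const fromToolFactory = stableRegistryKey({
  enabledPlugins: ["y", "x"],
  pluginDirs: ["/b", "/a", "/a"],
});
console.log(fromGateway === fromToolFactory); // true
```

Sorting is only safe if directory order carries no precedence meaning; if plugin load order matters, the key should preserve it and only strip fields that are irrelevant to the snapshot.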

Reproduction

  1. Install OpenClaw 2026.4.27 via global npm (or pull c27ae31043c7)
  2. Configure ≥2 enabled user plugins and ≥2 model providers with auth profiles
  3. Bind ≥1 Telegram bot to an agent and send the agent any tool-call-heavy prompt under steady load (~1–2 invocations/minute)
  4. Within minutes, gateway-eventloop.jsonl will show 6000–7700ms stalls per sample window
  5. Within ~1h, watchdog Tier 2 restart will fire

CPU profiles auto-write to <DATA>/.openclaw/logs/gateway-stall-profiles/<pid>-<ts>/*.cpuprofile. Loading any in Chrome DevTools → Performance shows the flame graph matching the table above.
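The self-time ranking can also be reproduced outside DevTools. A `.cpuprofile` file is JSON containing `nodes`, `samples`, and `timeDeltas` (microseconds), so a rough aggregation looks like the sketch below. Attributing each delta to the sample at the same index is an approximation, and `selfTimeByFunction` is an illustrative helper, not part of any OpenClaw tooling.

```typescript
// Subset of the V8 .cpuprofile format (Chrome DevTools Protocol Profiler.Profile).
interface ProfileNode {
  id: number;
  callFrame: { functionName: string; url: string; lineNumber: number };
  children?: number[];
}

interface CpuProfile {
  nodes: ProfileNode[];
  samples: number[]; // node id observed at each sample
  timeDeltas: number[]; // microseconds between consecutive samples
}

// Sum approximate self time per function name, in milliseconds, sorted descending.
function selfTimeByFunction(profile: CpuProfile): Array<[string, number]> {
  const nameById = new Map<number, string>();
  for (const node of profile.nodes) {
    nameById.set(node.id, node.callFrame.functionName || "(anonymous)");
  }

  const totals = new Map<string, number>();
  profile.samples.forEach((nodeId, i) => {
    const name = nameById.get(nodeId) ?? "(unknown)";
    const deltaMs = (profile.timeDeltas[i] ?? 0) / 1000;
    totals.set(name, (totals.get(name) ?? 0) + deltaMs);
  });

  return Array.from(totals.entries()).sort((a, b) => b[1] - a[1]);
}

// Tiny synthetic profile (not data from the issue) to show the output shape.
const exampleProfile: CpuProfile = {
  nodes: [
    { id: 1, callFrame: { functionName: "(root)", url: "", lineNumber: 0 }, children: [2, 3] },
    { id: 2, callFrame: { functionName: "lstat", url: "node:fs", lineNumber: 0 } },
    { id: 3, callFrame: { functionName: "readFileSync", url: "node:fs", lineNumber: 0 } },
  ],
  samples: [2, 2, 3, 2],
  timeDeltas: [1000, 2000, 500, 1500],
};

const top = selfTimeByFunction(exampleProfile);
console.log(top);
```

To run it against a real profile, parse the file contents with `JSON.parse` and pass the result in. Summing the per-file results across all six profiles would give an aggregate view comparable to the self-time table in this report.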

Suggested next steps

  • Instrument getCachedPluginRegistry / setCachedPluginRegistry (per #73532's suggestion) to log cache key + hit/miss; verify whether createOpenClawTools is cache-missing every call.
  • Audit resolveProviderAuthEnvVarCandidates for memoization — if there's no per-process or per-tool cache, the per-invocation env walk is the easy half of this regression to fix.
  • Consider whether the 4.15 → 4.27 plugin-loader refactor moved registry-load from startup-time to tool-factory-time, and whether memoization survived the refactor.
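The first suggestion (logging cache key plus hit/miss) could look roughly like this. The signatures of getCachedPluginRegistry and setCachedPluginRegistry are not shown in the issue, so this sketch wraps a plain `Map` instead; `createInstrumentedCache` and the key strings are hypothetical.

```typescript
// Counters collected by the instrumented cache.
interface CacheStats {
  hits: number;
  misses: number;
  missKeys: string[];
}

// Generic keyed cache that records every hit and miss, with an optional logger.
function createInstrumentedCache<V>(log: (line: string) => void = () => {}) {
  const store = new Map<string, V>();
  const stats: CacheStats = { hits: 0, misses: 0, missKeys: [] };
  return {
    get(key: string): V | undefined {
      if (store.has(key)) {
        stats.hits++;
        log(`registry cache hit key=${key}`);
        return store.get(key);
      }
      stats.misses++;
      stats.missKeys.push(key);
      log(`registry cache miss key=${key}`);
      return undefined;
    },
    set(key: string, value: V): void {
      store.set(key, value);
    },
    stats,
  };
}

const cache = createInstrumentedCache<string[]>();
cache.get("gateway|/plugins"); // miss: nothing stored yet
cache.set("gateway|/plugins", ["plugin-a"]);
cache.get("gateway|/plugins"); // hit
cache.get("tool-factory|/plugins"); // miss: different key for the same data
console.log(cache.stats);
```

If the tool-creation path really misses on every call, `missKeys` should show either a key that changes between calls or a key that was never written, which narrows the regression to key construction or to invalidation.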

Artifacts available

  • 6 .cpuprofile files (10MB total) — happy to attach a redacted bundle on request
  • gateway-eventloop.jsonl excerpt
  • gateway-monitor.log crash log
  • Internal finding doc with full top-10 inclusive/self frame analysis

Related

  • #73532 — plugin loader hot loop (prepareBundledPluginRuntimeDistMirror + JSON5 parse). Adjacent code path, possibly same root cause.
  • #74345 — ACP session leak / 480s task-registry-maintenance holds on 4.27. Distinct fingerprint.
  • #73467 — main thread stalls under orchestration load on 4.26. Possibly upstream of the same regression.

extent analysis

TL;DR

The most likely fix is to implement caching for the plugin registry and provider auth env-var candidates to prevent synchronous reloads and re-resolutions on every tool creation.

Guidance

  • Instrument getCachedPluginRegistry and setCachedPluginRegistry to log cache key and hit/miss metrics to verify if the cache is being utilized correctly.
  • Audit resolveProviderAuthEnvVarCandidates for memoization opportunities to prevent per-invocation env walks.
  • Review the plugin-loader refactor from 4.15 to 4.27 to ensure memoization was preserved and registry-load was not moved from startup-time to tool-factory-time.
  • Consider implementing a cache for the plugin registry and provider auth env-var candidates to reduce synchronous reloads and re-resolutions.

Notes

The provided information suggests that the issue is related to the plugin-loader refactor and the lack of memoization for the plugin registry and provider auth env-var candidates. However, without more context or code, it is difficult to provide a definitive solution.

Recommendation

Apply a workaround by implementing caching for the plugin registry and provider auth env-var candidates to prevent synchronous reloads and re-resolutions on every tool creation. This should help mitigate the issue until a more permanent fix can be implemented.
