openclaw - ✅(Solved) Fix 60s startup hang in sidecars.channels — synchronous plugin manifest re-discovery on every cold start (v2026.4.26) [1 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#73353, fetched 2026-04-29 06:20:37

Root Cause

Config that triggers it: channels.telegram.enabled = true (though telegram is not the root cause — see bisect below).

Fix Action

Workaround

Add to the systemd unit:

Environment=OPENCLAW_SKIP_CHANNELS=1

This drops sidecars.channels from 54 829ms to 1.9ms and total cold start from ~68s to ~14s. The trade-off is that all channels, including telegram, are disabled.

PR fix notes

PR #73420: fix(gateway): avoid blocking channels on model prewarm

Description (problem / solution / changelog)

Summary

  • schedule primary model prewarm in the background instead of awaiting it before channel startup
  • keep the existing bounded prewarm timeout and startup trace span
  • add coverage that model prewarm scheduling does not wait for completion

Why

On my VPS, startup tracing showed Slack was not the slow part. The gateway logged ready, then spent 134s in sidecars.channels before Slack provider startup because primary model prewarm was still ahead of startChannels().

Current main already scopes and bounds that prewarm to 5s, which is a good improvement. This PR takes the next step and makes channel startup independent from prewarm latency.

Expected behavior: chat channels can come online promptly, while model prewarm still runs as best-effort startup work.
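The ordering change described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `schedulePrewarm`, `startSidecarsSketch`, and the `StartupTask` type are hypothetical names, and the 5000ms default only mirrors the 5s bound mentioned in the PR text.

```typescript
// Hypothetical stand-in type; the real startup code lives in
// src/gateway/server-startup-post-attach.ts and may be shaped differently.
type StartupTask = () => Promise<void>;

// Start the prewarm without awaiting it, bounded by a timeout.
// The returned promise never rejects: prewarm errors go to onError.
function schedulePrewarm(
  prewarm: StartupTask,
  timeoutMs: number = 5000,
  onError: (err: unknown) => void = () => {},
): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, timeoutMs);
  });
  const guarded = prewarm().catch(onError);
  return Promise.race([guarded, timeout]).then(() => {
    // Clear the timer so a fast prewarm does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Channel startup no longer waits for the prewarm to settle.
async function startSidecarsSketch(
  prewarm: StartupTask,
  startChannels: StartupTask,
  prewarmTimeoutMs: number = 5000,
): Promise<{ prewarmSettled: Promise<void> }> {
  const prewarmSettled = schedulePrewarm(prewarm, prewarmTimeoutMs); // intentionally not awaited
  await startChannels();
  return { prewarmSettled };
}
```

Returning `prewarmSettled` lets a caller (or a test) observe when the best-effort warmup finishes without putting it on the channel startup path.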

Related work

  • #60027 adds an opt-in env escape hatch to skip startup model warmup. This PR is different: it keeps warmup enabled by default but stops it from blocking channels.
  • #71203 refreshes configured agent models.json caches during startup. That is a separate cache correctness problem and should still be reviewable on its own.
  • #73276, #73353, #73411, and #72846 describe the same broad startup stall class that recent main fixes reduced. This PR closes the remaining ordering gap where channel startup still waits for model prewarm.
  • #73298 is still open as a broader slowdown report. This PR may help if its stall includes the same prewarm-before-channel ordering.

Tests

  • pnpm exec oxfmt --check src/gateway/server-startup-post-attach.ts src/gateway/server-startup-post-attach.test.ts CHANGELOG.md
  • pnpm exec vitest run --config test/vitest/vitest.gateway.config.ts src/gateway/server-startup-post-attach.test.ts src/gateway/server-startup.test.ts
  • pnpm check:test-types

Real-world validation

Before the local production mitigation, startup trace on my OpenClaw VPS showed sidecars.channels at 134006ms and Slack socket connected around 189s after gateway ready. After skipping blocking prewarm locally, sidecars.channels dropped to 571.8ms and Slack connected about 3.8s after gateway ready.

AI-assisted: yes. I reviewed the code and tests before opening this PR.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/gateway/server-startup-post-attach.test.ts (modified, +44/-1)
  • src/gateway/server-startup-post-attach.ts (modified, +28/-7)

Code Example

time systemctl --user restart openclaw-gateway
# watch journalctl; ~55s gap between "[hooks] loaded" and "[gateway] ready"

---

2877 ticks (4.3%)  json5/lib/parse.js *parse
2670 ticks (4.0%)  json5/lib/parse.js *beforePropertyValue
2328 ticks (3.5%)  json5/lib/parse.js *string

---

loadPluginManifest             dist/manifest-DkU_xlZi.js:1166
  ← loadPluginManifestRegistry dist/manifest-registry-CXpW6f0a.js:341  (57.7%)
  ← discoverInDirectory        dist/discovery-CRcfnviq.js:481
      ← loadOpenClawPlugins    dist/loader--FR-1ZCZ.js:2903

---

Environment=OPENCLAW_SKIP_CHANNELS=1

Environment

  • OpenClaw: 2026.4.26 (be8c246)
  • Node: v24.15.0
  • OS: Ubuntu 24.04 (6.8.0-110-generic, x86_64)
  • Deployment: openclaw-gateway.service via systemd user unit

Symptom

openclaw gateway start takes ~67s from systemd start to [gateway] ready. A 55-60s window within that span is silent (no logs), and the app is unusable until it ends.

Reproduction

Any cold start of the gateway with channels configured:

time systemctl --user restart openclaw-gateway
# watch journalctl; ~55s gap between "[hooks] loaded" and "[gateway] ready"

Config that triggers it: channels.telegram.enabled = true (though telegram is not the root cause — see bisect below).

Instrumentation

Built-in startup trace (OPENCLAW_GATEWAY_STARTUP_TRACE=1)

Stage                    | Duration  | eventLoopMax
plugins.bootstrap        | 2721ms    | -
sidecars.session-locks   | 4.5ms     | 0ms
sidecars.gmail-watch     | 0.1ms     | 0ms
sidecars.gmail-model     | 0.2ms     | 0ms
sidecars.internal-hooks  | 1882ms    | 36ms
sidecars.channels        | 54 829ms  | 22 029ms
sidecars.plugin-services | 379ms     | 372ms
sidecars.memory          | 0.1ms     | 0ms
sidecars.total           | 57 128ms  | -
ready                    | 1.3ms     | 0ms

eventLoopMax = 22 029ms means the JS event loop was synchronously blocked for 22 seconds at one point — not a network timeout.
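To make the meaning of a metric like eventLoopMax concrete, here is a minimal sketch of one way to measure it: fire a periodic timer and record how late each tick arrives. How OpenClaw actually computes eventLoopMax is not stated in this report, so `maxEventLoopDelay` and `startEventLoopSampler` are hypothetical helpers. Node's built-in `monitorEventLoopDelay` from `node:perf_hooks` offers a similar histogram-based measurement.

```typescript
// Given the times (in ms) at which a periodic timer actually fired,
// return the worst lateness relative to the requested interval.
function maxEventLoopDelay(fireTimes: number[], intervalMs: number): number {
  let worst = 0;
  for (let i = 1; i < fireTimes.length; i++) {
    const lateness = fireTimes[i] - fireTimes[i - 1] - intervalMs;
    if (lateness > worst) worst = lateness;
  }
  return worst;
}

// Record when an interval timer fires; stop() returns the max delay seen.
// A long synchronous block shows up as one very late tick.
function startEventLoopSampler(intervalMs: number = 10) {
  const fireTimes: number[] = [Date.now()];
  const timer = setInterval(() => fireTimes.push(Date.now()), intervalMs);
  return {
    stop(): number {
      clearInterval(timer);
      return maxEventLoopDelay(fireTimes, intervalMs);
    },
  };
}
```

A network timeout would leave the timer ticking on schedule while the stage waits, so the max delay would stay near zero. A 22s value can only come from code that never yields back to the loop.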

Bisect

Run                                                                                           | sidecars.channels | eventLoopMax
baseline                                                                                      | 54 829ms          | 22 029ms
OPENCLAW_SKIP_CHANNELS=1                                                                      | 1.9ms             | 0ms
OPENCLAW_TELEGRAM_DISABLE_AUTO_SELECT_FAMILY=1 + OPENCLAW_TELEGRAM_DNS_RESULT_ORDER=ipv4first | 54 456ms          | 22 029ms
channels.telegram.enabled = false (in config)                                                 | 54 406ms          | 21 760ms

Telegram is not the cause. Disabling telegram or applying the DNS hardening settings changes sidecars.channels by less than 1% (54.4s vs 54.8s). Skipping the entire channels block (OPENCLAW_SKIP_CHANNELS=1) eliminates the hang.

V8 CPU profile (node --prof)

Top JavaScript hot frames (ticks = share of 74s profiling window):

2877 ticks (4.3%)  json5/lib/parse.js *parse
2670 ticks (4.0%)  json5/lib/parse.js *beforePropertyValue
2328 ticks (3.5%)  json5/lib/parse.js *string

Bottom-up callchain through the hot json5 frames:

loadPluginManifest             dist/manifest-DkU_xlZi.js:1166
  ← loadPluginManifestRegistry dist/manifest-registry-CXpW6f0a.js:341  (57.7%)
  ← discoverInDirectory        dist/discovery-CRcfnviq.js:481
      ← loadOpenClawPlugins    dist/loader--FR-1ZCZ.js:2903

Also significant:

  • collectRuntimePackageWildcardImportTargets / isPathInside / boundary-path — synchronous path resolution inside the discovery loop
  • 2275 ticks in node:path resolve driven by boundary checks

Top C++ (syscall view)

Syscall    | Ticks | % of C++
syscall    | 5639  | 12.0%
__open     | 2189  | 4.7%
access     | 2126  | 4.5%
__read     | 1805  | 3.8%
getdents64 | 198   | 0.4%

Heavy synchronous filesystem walk — opening, statting, and reading many files on the critical path.

Root Cause Hypothesis

sidecars.channels calls prewarmConfiguredPrimaryModel before startChannels(). prewarmConfiguredPrimaryModel calls ensureOpenClawModelsJson → getCurrentPluginMetadataSnapshot, which triggers a full plugin manifest discovery walk (the same work plugins.bootstrap already did 50s earlier). Discovery synchronously opens every plugin's package.json/manifest, json5-parses it, and canonicalizes paths, blocking the event loop for ~22s and taking ~55s wall time.
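One way to avoid repeating the walk is to memoize the discovery result in a process-wide holder. The sketch below is illustrative only: `PluginMetadataSnapshotSketch` and `createSnapshotCache` are hypothetical, and the real snapshot type and invalidation rules in OpenClaw are not shown in this report.

```typescript
// Hypothetical snapshot shape; the real one is not shown in the issue.
type PluginMetadataSnapshotSketch = { manifestPaths: string[] };

// Wrap an expensive discovery function so it runs at most once
// until invalidate() is called (for example after a plugin install).
function createSnapshotCache(discover: () => PluginMetadataSnapshotSketch) {
  let cached: PluginMetadataSnapshotSketch | undefined;
  return {
    get(): PluginMetadataSnapshotSketch {
      if (cached === undefined) cached = discover();
      return cached;
    },
    invalidate(): void {
      cached = undefined;
    },
  };
}
```

With a holder like this, the first caller (bootstrap) pays for discovery and later callers (prewarm) read the stored result. The open design question is when the cache must be invalidated, which this sketch leaves to the caller.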

The prewarm is also active even when the primary model (google/gemini-3.1-flash-lite-preview) passes through a non-pi harness. The three early-exit guards (isConfiguredCliBackendPrimary, isCliProvider, selectAgentHarness().id !== "pi") are checked after the 7-module Promise.all import and the discovery-triggering ensureOpenClawModelsJson, so non-pi models still pay the full cost.
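The guard ordering can be sketched as below. This assumes the three guards can be evaluated without the dynamically imported modules, which the report does not confirm. `PrewarmDepsSketch`, `shouldSkipPrewarm`, and `prewarmSketch` are hypothetical stand-ins, not the signatures in the OpenClaw source.

```typescript
// Hypothetical dependency bag standing in for the real guards and loaders.
type PrewarmDepsSketch = {
  isConfiguredCliBackendPrimary: () => boolean;
  isCliProvider: () => boolean;
  harnessId: () => string; // stands in for selectAgentHarness().id
  loadModules: () => Promise<void>; // stands in for the 7-module Promise.all import
  ensureModelsJson: () => Promise<void>; // stands in for ensureOpenClawModelsJson
};

// Cheap, synchronous checks only: no imports, no discovery.
function shouldSkipPrewarm(deps: PrewarmDepsSketch): boolean {
  return (
    deps.isConfiguredCliBackendPrimary() ||
    deps.isCliProvider() ||
    deps.harnessId() !== "pi"
  );
}

async function prewarmSketch(deps: PrewarmDepsSketch): Promise<"skipped" | "warmed"> {
  // Guards run first, so non-pi setups return before any expensive work.
  if (shouldSkipPrewarm(deps)) return "skipped";
  await deps.loadModules();
  await deps.ensureModelsJson();
  return "warmed";
}
```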

Things That Didn't Help

  • OPENCLAW_TELEGRAM_DISABLE_AUTO_SELECT_FAMILY=1 / OPENCLAW_TELEGRAM_DNS_RESULT_ORDER=ipv4first — no effect
  • Disabling the telegram channel in config — no effect
  • Node 24 (upgraded from v22) — no effect

Workaround

Add to the systemd unit:

Environment=OPENCLAW_SKIP_CHANNELS=1

This drops sidecars.channels from 54 829ms to 1.9ms and total cold start from ~68s to ~14s. The trade-off is that all channels, including telegram, are disabled.

Suggested Fixes

  1. Re-order gates in prewarmConfiguredPrimaryModel (server.impl-*:8428): check isConfiguredCliBackendPrimary / isCliProvider / selectAgentHarness().id !== "pi" before the Promise.all import block and before calling ensureOpenClawModelsJson. Non-pi providers (google, openai, custom) should return immediately with zero discovery work.

  2. Reuse the plugins.bootstrap snapshot in getCurrentPluginMetadataSnapshot: the full discovery already ran once (2.7s at plugins.bootstrap). The result should be cached in a process-singleton that ensureOpenClawModelsJson reads rather than re-discovering. The MODELS_JSON_STATE.readyCache fingerprint cache is keyed per targetPath, but the underlying plugin metadata scan runs unconditionally on a cache miss.

  3. Break the sync discovery loop: discoverInDirectory + loadPluginManifest pin the event loop for 22s in a tight synchronous loop. Inserting await new Promise(r => setImmediate(r)) between manifest reads, or moving discovery to a worker thread, would allow the rest of startup to interleave and would prevent starving incoming WS connections.
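The third suggestion can be sketched as a loop that yields to the event loop after each batch of manifest reads. `loadManifestsCooperatively` and its `batchSize` parameter are hypothetical; `loadOne` stands in for a synchronous per-manifest loader such as loadPluginManifest, whose real signature is not shown here.

```typescript
// Load items with a synchronous loader, yielding to the event loop
// after every batchSize items so timers and I/O callbacks can run.
async function loadManifestsCooperatively<T>(
  paths: string[],
  loadOne: (path: string) => T,
  batchSize: number = 25,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < paths.length; i++) {
    results.push(loadOne(paths[i]));
    if ((i + 1) % batchSize === 0) {
      // setImmediate schedules the continuation after pending I/O callbacks.
      await new Promise<void>((resolve) => setImmediate(resolve));
    }
  }
  return results;
}
```

Yielding does not reduce the total CPU work, so wall time stays roughly the same. What changes is the longest uninterrupted block, which is what starves incoming connections. Moving discovery to a worker thread, as the issue also suggests, would remove the work from the main loop entirely.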

extent analysis

TL;DR

Reorder the gates in prewarmConfiguredPrimaryModel to check for non-pi providers before triggering the discovery walk.

Guidance

  1. Verify the discovery walk: Confirm that the full plugin manifest discovery walk is indeed the cause of the 55s delay by checking the sidecars.channels duration with and without the OPENCLAW_SKIP_CHANNELS=1 environment variable.
  2. Check the event loop blockage: Use the OPENCLAW_GATEWAY_STARTUP_TRACE=1 flag to monitor the event loop blockage and verify that reordering the gates in prewarmConfiguredPrimaryModel reduces the blockage time.
  3. Test the suggested fixes: Apply the suggested fixes (reordering gates, reusing the plugins.bootstrap snapshot, or breaking the sync discovery loop) and measure their impact on the cold start time.
  4. Monitor the startup time: Use the time systemctl --user restart openclaw-gateway command to measure the startup time and verify that the fixes have reduced the delay.

Example

No code snippet is provided as the issue is more related to the logic and sequence of events rather than a specific code block.

Notes

The provided information suggests that the issue is related to the discovery walk and the event loop blockage. However, without more context or code, it's difficult to provide a more detailed solution. The suggested fixes should be tested and verified to ensure they resolve the issue.

Recommendation

Apply the first suggested fix: Re-order gates in prewarmConfiguredPrimaryModel. This fix seems to be the most straightforward and has the potential to significantly reduce the cold start time.
