openclaw - ✅(Solved) Fix 60s startup hang in sidecars.channels — synchronous plugin manifest re-discovery on every cold start (v2026.4.26) [1 pull requests, 1 comments, 2 participants]

openclaw · 2026-04-28 07:00:30


GitHub stats

openclaw/openclaw#73353•Fetched 2026-04-29 06:20:37


Author

chsbusch-dot

Participants

chsbusch-dot

clawsweeper[bot]

Timeline (top)

closed ×1, commented ×1, cross-referenced ×1, subscribed ×1

Root Cause

Config that triggers it: channels.telegram.enabled = true (though telegram is not the root cause — see bisect below).

Fix Action

Workaround

Add to the systemd unit:

Environment=OPENCLAW_SKIP_CHANNELS=1

This drops sidecars.channels from 54 829ms to 1.9ms, and total cold start goes from ~68s to ~14s. The cost is that all channels, including telegram, are disabled.

PR fix notes

PR #73420: fix(gateway): avoid blocking channels on model prewarm

Repository: openclaw/openclaw
Author: dorukardahan
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/73420

Description (problem / solution / changelog)

Summary

- Schedule primary model prewarm in the background instead of awaiting it before channel startup.
- Keep the existing bounded prewarm timeout and startup trace span.
- Add coverage that model prewarm scheduling does not wait for completion.

Why

On my VPS, startup tracing showed Slack was not the slow part. The gateway logged ready, then spent 134s in sidecars.channels before Slack provider startup because primary model prewarm was still ahead of startChannels().

Current main already scopes and bounds that prewarm to 5s, which is a good improvement. This PR takes the next step and makes channel startup independent from prewarm latency.

Expected behavior: chat channels can come online promptly, while model prewarm still runs as best-effort startup work.
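
The ordering change described above can be sketched roughly as follows. This is an illustration only, not the code from PR #73420: `schedulePrewarm`, `PrewarmOutcome`, and `startupLog` are hypothetical names, and the 5000 ms bound simply mirrors the 5s timeout mentioned in the PR description.

```typescript
type PrewarmOutcome = "ok" | "timeout" | "error";

// Hypothetical helper: starts `prewarm` without making the caller wait for it.
function schedulePrewarm(
  prewarm: () => Promise<void>,
  timeoutMs: number,
  onSettled: (outcome: PrewarmOutcome) => void = () => {},
): void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<PrewarmOutcome>((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });

  // Deferring the call to a microtask lets the caller continue first and
  // turns a synchronous throw inside `prewarm` into a rejection.
  const work = Promise.resolve()
    .then(prewarm)
    .then((): PrewarmOutcome => "ok");

  // The timeout only bounds how long we wait; it does not cancel `prewarm`.
  void Promise.race([work, timeout])
    .catch((): PrewarmOutcome => "error")
    .then((outcome) => {
      clearTimeout(timer);
      onSettled(outcome);
    });
}

// Usage: channel startup is recorded before the prewarm body runs.
const startupLog: string[] = [];
schedulePrewarm(async () => {
  startupLog.push("prewarm ran");
}, 5000);
startupLog.push("channels started");
```

Note that scheduling in the background only helps when the prewarm work is asynchronous. If the work blocks the event loop synchronously, as the issue's trace suggests for manifest discovery, the block still happens once the deferred call runs.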

Related work

- #60027 adds an opt-in env escape hatch to skip startup model warmup. This PR is different: it keeps warmup enabled by default but stops it from blocking channels.
- #71203 refreshes configured agent models.json caches during startup. That is a separate cache correctness problem and should still be reviewable on its own.
- #73276, #73353, #73411, and #72846 describe the same broad startup stall class that recent main fixes reduced. This PR closes the remaining ordering gap where channel startup still waits for model prewarm.
- #73298 is still open as a broader slowdown report. This PR may help if its stall includes the same prewarm-before-channel ordering.

Tests

pnpm exec oxfmt --check src/gateway/server-startup-post-attach.ts src/gateway/server-startup-post-attach.test.ts CHANGELOG.md
pnpm exec vitest run --config test/vitest/vitest.gateway.config.ts src/gateway/server-startup-post-attach.test.ts src/gateway/server-startup.test.ts
pnpm check:test-types

Real-world validation

Before the local production mitigation, startup trace on my OpenClaw VPS showed sidecars.channels at 134006ms and Slack socket connected around 189s after gateway ready. After skipping blocking prewarm locally, sidecars.channels dropped to 571.8ms and Slack connected about 3.8s after gateway ready.

AI-assisted: yes. I reviewed the code and tests before opening this PR.

Changed files

CHANGELOG.md (modified, +1/-0)
src/gateway/server-startup-post-attach.test.ts (modified, +44/-1)
src/gateway/server-startup-post-attach.ts (modified, +28/-7)



Environment

OpenClaw: 2026.4.26 (be8c246)
Node: v24.15.0
OS: Ubuntu 24.04 (6.8.0-110-generic, x86_64)
Deployment: openclaw-gateway.service via systemd user unit

Symptom

openclaw gateway start takes ~67s from systemd start to [gateway] ready. For 55-60s of that window nothing is logged, and the app is unusable throughout.

Reproduction

Any cold start of the gateway with channels configured:

time systemctl --user restart openclaw-gateway
# watch journalctl; ~55s gap between "[hooks] loaded" and "[gateway] ready"

Config that triggers it: channels.telegram.enabled = true (though telegram is not the root cause — see bisect below).

Instrumentation

Built-in startup trace (`OPENCLAW_GATEWAY_STARTUP_TRACE=1`)

| Stage | Duration | eventLoopMax |
| --- | --- | --- |
| `plugins.bootstrap` | 2721ms | - |
| `sidecars.session-locks` | 4.5ms | 0ms |
| `sidecars.gmail-watch` | 0.1ms | 0ms |
| `sidecars.gmail-model` | 0.2ms | 0ms |
| `sidecars.internal-hooks` | 1882ms | 36ms |
| `sidecars.channels` | 54 829ms | 22 029ms |
| `sidecars.plugin-services` | 379ms | 372ms |
| `sidecars.memory` | 0.1ms | 0ms |
| `sidecars.total` | 57 128ms | - |
| `ready` | 1.3ms | 0ms |

eventLoopMax = 22 029ms means the JS event loop was synchronously blocked for 22 seconds at one point — not a network timeout.
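
To make the meaning of such a figure concrete, here is a minimal sketch of one way to measure event-loop stalls: schedule a repeating timer and record how late each tick arrives. This is an assumption about the general technique, not a description of how OpenClaw computes eventLoopMax. `LoopStallTracker` and `watchEventLoop` are hypothetical names.

```typescript
// Hypothetical tracker: records the largest delay between timer ticks
// beyond the interval that was requested.
class LoopStallTracker {
  maxStallMs = 0;
  private lastTickMs: number | undefined = undefined;
  private readonly intervalMs: number;

  constructor(intervalMs: number) {
    this.intervalMs = intervalMs;
  }

  tick(nowMs: number): void {
    if (this.lastTickMs !== undefined) {
      // A tick that arrives on time contributes a stall of 0.
      const stall = nowMs - this.lastTickMs - this.intervalMs;
      if (stall > this.maxStallMs) this.maxStallMs = stall;
    }
    this.lastTickMs = nowMs;
  }
}

// Live use: synchronous work that blocks the loop delays the next tick.
function watchEventLoop(intervalMs = 20): { stop(): number } {
  const tracker = new LoopStallTracker(intervalMs);
  const handle = setInterval(() => tracker.tick(Date.now()), intervalMs);
  return {
    stop(): number {
      clearInterval(handle);
      return tracker.maxStallMs;
    },
  };
}

// Replay with fixed timestamps: the second tick is on time, the third is
// 22 029ms late, matching the size of the stall reported above.
const replay = new LoopStallTracker(20);
for (const t of [0, 20, 22069]) replay.tick(t);
console.log(replay.maxStallMs); // 22029
```

A network timeout would not produce a late tick, because the loop stays free while waiting on I/O. Only synchronous work delays the timer.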

Bisect

| Run | `sidecars.channels` | eventLoopMax |
| --- | --- | --- |
| baseline | 54 829ms | 22 029ms |
| `OPENCLAW_SKIP_CHANNELS=1` | 1.9ms | 0ms |
| `OPENCLAW_TELEGRAM_DISABLE_AUTO_SELECT_FAMILY=1` + `OPENCLAW_TELEGRAM_DNS_RESULT_ORDER=ipv4first` | 54 456ms | 22 029ms |
| `channels.telegram.enabled = false` (in config) | 54 406ms | 21 760ms |

Telegram is not the cause. Neither disabling telegram nor applying the DNS hardening flags has any meaningful effect. Skipping the entire channels block (OPENCLAW_SKIP_CHANNELS=1) eliminates the hang.

V8 CPU profile (`node --prof`)

Top JavaScript hot frames (ticks = share of 74s profiling window):

2877 ticks (4.3%)  json5/lib/parse.js *parse
2670 ticks (4.0%)  json5/lib/parse.js *beforePropertyValue
2328 ticks (3.5%)  json5/lib/parse.js *string

Bottom-up callchain through the hot json5 frames:

loadPluginManifest             dist/manifest-DkU_xlZi.js:1166
  ← loadPluginManifestRegistry dist/manifest-registry-CXpW6f0a.js:341  (57.7%)
  ← discoverInDirectory        dist/discovery-CRcfnviq.js:481
      ← loadOpenClawPlugins    dist/loader--FR-1ZCZ.js:2903

Also significant:

- collectRuntimePackageWildcardImportTargets / isPathInside / boundary-path: synchronous path resolution inside the discovery loop
- 2275 ticks in node:path resolve driven by boundary checks

Top C++ (syscall view)

| Syscall | Ticks | % of C++ |
| --- | --- | --- |
| `syscall` | 5639 | 12.0% |
| `__open` | 2189 | 4.7% |
| `access` | 2126 | 4.5% |
| `__read` | 1805 | 3.8% |
| `getdents64` | 198 | 0.4% |

Heavy synchronous filesystem walk — opening, statting, and reading many files on the critical path.

Root Cause Hypothesis

sidecars.channels calls prewarmConfiguredPrimaryModel before startChannels(). prewarmConfiguredPrimaryModel calls ensureOpenClawModelsJson → getCurrentPluginMetadataSnapshot → triggers a full plugin manifest discovery walk (the same work plugins.bootstrap already did 50s earlier). Discovery synchronously opens every plugin's package.json/manifest, json5-parses it, and canonicalizes paths — blocking the event loop for ~22s and taking ~55s wall time.
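
If this hypothesis holds, the repeated walk could be avoided by running discovery once and sharing the result. The sketch below shows the general memoization pattern only. `PluginMetadataSnapshot`, `createSnapshotCache`, and the `discover` callback are hypothetical stand-ins, not OpenClaw's actual types or functions.

```typescript
// Hypothetical shape of a discovery result.
interface PluginMetadataSnapshot {
  manifests: ReadonlyArray<{ id: string }>;
}

// Hypothetical process-wide cache: the expensive walk runs at most once
// until something explicitly invalidates it.
function createSnapshotCache(discover: () => PluginMetadataSnapshot) {
  let cached: PluginMetadataSnapshot | undefined;
  return {
    get(): PluginMetadataSnapshot {
      if (cached === undefined) cached = discover();
      return cached;
    },
    invalidate(): void {
      cached = undefined;
    },
  };
}

let discoveryWalks = 0;
const snapshotCache = createSnapshotCache(() => {
  discoveryWalks += 1; // stand-in for the filesystem walk and json5 parsing
  return { manifests: [{ id: "telegram" }] };
});

snapshotCache.get(); // first caller (for example, bootstrap) pays for the walk
snapshotCache.get(); // later caller (for example, prewarm) reuses the result
console.log(discoveryWalks); // 1
```

A real cache would also need an invalidation rule for plugins installed or changed while the process is running, which this sketch leaves to the caller.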

The prewarm is also active even when the primary model (google/gemini-3.1-flash-lite-preview) passes through a non-pi harness. The three early-exit guards (isConfiguredCliBackendPrimary, isCliProvider, selectAgentHarness().id !== "pi") are checked after the 7-module Promise.all import and the discovery-triggering ensureOpenClawModelsJson, so non-pi models still pay the full cost.
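
The guard ordering problem can be illustrated with a small sketch in which the cheap checks run before any expensive work. The names `PrewarmGates`, `PrewarmWork`, `shouldPrewarm`, and `prewarmPrimaryModel` are hypothetical, and the three gate fields only mirror the guards named above. The real function's signature is not shown in the issue.

```typescript
// Hypothetical inputs: results of the three cheap guards.
interface PrewarmGates {
  cliBackendPrimary: boolean;
  cliProvider: boolean;
  harnessId: string;
}

// Hypothetical expensive steps: the module imports and the models.json check.
interface PrewarmWork {
  loadHeavyModules(): Promise<void>;
  ensureModelsJson(): Promise<void>;
}

function shouldPrewarm(gates: PrewarmGates): boolean {
  return !gates.cliBackendPrimary && !gates.cliProvider && gates.harnessId === "pi";
}

async function prewarmPrimaryModel(
  gates: PrewarmGates,
  work: PrewarmWork,
): Promise<"skipped" | "prewarmed"> {
  // Guards first: a non-pi model returns before any import or discovery.
  if (!shouldPrewarm(gates)) return "skipped";
  await work.loadHeavyModules();
  await work.ensureModelsJson();
  return "prewarmed";
}
```

With this ordering, a configuration whose harness id is not "pi" never reaches the discovery-triggering step, so it pays none of the cost described above.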

Things That Didn't Help

- OPENCLAW_TELEGRAM_DISABLE_AUTO_SELECT_FAMILY=1 / OPENCLAW_TELEGRAM_DNS_RESULT_ORDER=ipv4first: no effect
- Disabling the telegram channel in config: no effect
- Node 24 (upgraded from v22): no effect

Workaround

Add to the systemd unit:

Environment=OPENCLAW_SKIP_CHANNELS=1

This drops sidecars.channels from 54 829ms to 1.9ms, and total cold start goes from ~68s to ~14s. The cost is that all channels, including telegram, are disabled.

Suggested Fixes

1. Re-order gates in prewarmConfiguredPrimaryModel (server.impl-*:8428): check isConfiguredCliBackendPrimary / isCliProvider / selectAgentHarness().id !== "pi" before the Promise.all import block and before calling ensureOpenClawModelsJson. Non-pi providers (google, openai, custom) should return immediately with zero discovery work.
2. Reuse the plugins.bootstrap snapshot in getCurrentPluginMetadataSnapshot: the full discovery already ran once (2.7s at plugins.bootstrap). The result should be cached in a process-singleton that ensureOpenClawModelsJson reads rather than re-discovering. The MODELS_JSON_STATE.readyCache fingerprint cache is keyed per targetPath, but the underlying plugin metadata scan runs unconditionally on a cache miss.
3. Break the sync discovery loop: discoverInDirectory + loadPluginManifest pin the event loop for 22s in a tight synchronous loop. Inserting `await new Promise(r => setImmediate(r))` between manifest reads, or moving discovery to a worker thread, would allow the rest of startup to interleave and would prevent starving incoming WS connections.
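
The yielding approach in the last suggestion can be sketched as below. This is a minimal illustration under stated assumptions: `yieldToEventLoop` and `loadManifestsCooperatively` are hypothetical names, `loadOne` stands in for a synchronous manifest read, and the batch size of 8 is arbitrary.

```typescript
// Resolves on a later turn of the event loop so pending I/O callbacks
// (for example, incoming connections) get a chance to run.
function yieldToEventLoop(): Promise<void> {
  return new Promise((resolve) => setImmediate(resolve));
}

// Hypothetical loader: runs the synchronous `loadOne` in small batches and
// yields between batches instead of holding the loop for the whole walk.
async function loadManifestsCooperatively<T>(
  paths: readonly string[],
  loadOne: (path: string) => T,
  batchSize = 8,
): Promise<T[]> {
  const results: T[] = [];
  let sinceYield = 0;
  for (const path of paths) {
    results.push(loadOne(path));
    sinceYield += 1;
    if (sinceYield >= batchSize) {
      sinceYield = 0;
      await yieldToEventLoop();
    }
  }
  return results;
}
```

Yielding shortens the longest single stall but does not reduce the total work, so it complements the first two suggestions rather than replacing them.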

extent analysis

TL;DR

Reorder the gates in prewarmConfiguredPrimaryModel to check for non-pi providers before triggering the discovery walk.

Guidance

- Verify the discovery walk: Confirm that the full plugin manifest discovery walk is indeed the cause of the 55s delay by checking the sidecars.channels duration with and without the OPENCLAW_SKIP_CHANNELS=1 environment variable.
- Check the event loop blockage: Use the OPENCLAW_GATEWAY_STARTUP_TRACE=1 flag to monitor the event loop blockage and verify that reordering the gates in prewarmConfiguredPrimaryModel reduces the blockage time.
- Test the suggested fixes: Apply the suggested fixes (reordering gates, reusing the plugins.bootstrap snapshot, or breaking the sync discovery loop) and measure their impact on the cold start time.
- Monitor the startup time: Use the time systemctl --user restart openclaw-gateway command to measure the startup time and verify that the fixes have reduced the delay.

Example

No code snippet is provided as the issue is more related to the logic and sequence of events rather than a specific code block.

Notes

The startup trace and CPU profile point to the repeated discovery walk and the resulting event-loop blockage as the cause. The suggested fixes have not been verified here, so each one should be tested and measured against the startup trace before being relied on.

Recommendation

Apply the first suggested fix: Re-order gates in prewarmConfiguredPrimaryModel. This fix seems to be the most straightforward and has the potential to significantly reduce the cold start time.
