openclaw - ✅(Solved) Fix Cold-boot freeze (~3 min) on K8s when only one channel plugin is allowlisted; gateway /readyz returns 504 throughout [1 pull requests, 3 comments, 2 participants]

rudy-ihealth · 2026-04-28T04:46:59Z



openclaw/openclaw#73276•Fetched 2026-04-29 06:21:32

Author: rudy-ihealth

Participants: rudy-ihealth, steipete

Timeline (top): commented ×3, closed ×2, cross-referenced ×2, subscribed ×2

On openclaw 2026.4.26 running in a Kubernetes pod with a single custom channel plugin and a single proxy-style model provider configured (plugins.allow = ["my-channel", "my-proxy-provider"]), the gateway takes ~2m42s between [gateway] starting channels and sidecars... and the gateway calling the channel's gateway.startAccount callback.

During this window, /readyz and /healthz time out. The Node.js event loop is synchronously blocked for stretches of up to ~37 s at a time; an nginx sidecar in front of the gateway returns 504 Gateway Timeout to kubelet probes throughout.

Boot timing on 2026.4.21 for the same deployment was ~32 s, roughly 5× faster than the ~2m45s boot shown below. The regression appeared on the first bump to 2026.4.26 (I haven't bisected the intermediate releases yet).

Error Message

Sanitized agent log excerpt from a typical boot (timestamps relative to gateway start):

+0s     [gateway] loading configuration…
+1s     [gateway] starting...
+1s     Config warnings:
        - plugins.entries.codex: plugin disabled (bundled (disabled by default)) but config is present
+1s     [gateway] failed to persist plugin auto-enable changes: Error: EBUSY
+2s     [gateway] starting HTTP server...
+2s     [gateway] http server listening (...; 2.4 s)
+3s     [gateway] starting channels and sidecars...
                                           ← 2m42 s of complete log silence
+2m45s  [<my-channel>] channel ready
+2m45s  [gateway] ready
+2m45s  [heartbeat] started

With OPENCLAW_GATEWAY_STARTUP_TRACE=1:

+760ms  startup trace: sidecars.session-locks       1.5 ms
+760ms  startup trace: sidecars.gmail-watch         0.0 ms
+760ms  startup trace: sidecars.gmail-model         0.0 ms
+760ms  startup trace: sidecars.internal-hooks      0.0 ms
                                          ← prewarmConfiguredPrimaryModel runs here, 2m42 s
+2m43s  startup trace: sidecars.channels           157,007 ms   eventLoopMax=37,212 ms
+2m43s  startup trace: sidecars.plugin-services       0.6 ms
+2m43s  startup trace: sidecars.memory                0.0 ms
+2m43s  startup trace: sidecars.restart-sentinel     78 ms
+2m43s  startup trace: sidecars.subagent-recovery     3.7 ms
+2m43s  startup trace: sidecars.main-session-recovery 2.7 ms
+2m43s  startup trace: sidecars.total              160,053 ms
+2m43s  startup trace: ready                          0.5 ms
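
The trace lines above can be ranked mechanically to spot the slow span. The following is a minimal sketch, assuming the line format is exactly what the excerpt shows; `parseTraceLine` and `slowestSpans` are hypothetical helper names, not part of OpenClaw.

```typescript
// Hypothetical helpers for ranking "startup trace:" lines by duration.
// The line format is assumed from the excerpt above, not from OpenClaw docs.
interface TraceSpan {
  name: string;
  durationMs: number;
  eventLoopMaxMs: number | undefined;
}

const TRACE_LINE =
  /startup trace:\s+(\S+)\s+([\d,.]+)\s*ms(?:\s+eventLoopMax\s*=\s*([\d,.]+)\s*ms)?/;

// Converts "157,007" style numbers to 157007.
const toNumber = (text: string): number => Number(text.replace(/,/g, ""));

function parseTraceLine(line: string): TraceSpan | undefined {
  const match = TRACE_LINE.exec(line);
  if (!match) return undefined; // not a trace line
  return {
    name: match[1],
    durationMs: toNumber(match[2]),
    eventLoopMaxMs: match[3] !== undefined ? toNumber(match[3]) : undefined,
  };
}

// Returns the `limit` slowest spans, ignoring the aggregate "sidecars.total".
function slowestSpans(lines: string[], limit: number): TraceSpan[] {
  return lines
    .map(parseTraceLine)
    .filter((span): span is TraceSpan => span !== undefined)
    .filter((span) => span.name !== "sidecars.total")
    .sort((a, b) => b.durationMs - a.durationMs)
    .slice(0, limit);
}

const sample = [
  "+760ms  startup trace: sidecars.session-locks       1.5 ms",
  "+2m43s  startup trace: sidecars.channels           157,007 ms   eventLoopMax=37,212 ms",
  "+2m43s  startup trace: sidecars.restart-sentinel     78 ms",
  "+2m43s  startup trace: sidecars.total              160,053 ms",
];
console.log(slowestSpans(sample, 2));
```

On the sample lines, `sidecars.channels` ranks first, which matches the reading of the trace given in this report.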

Happy to share verbose / instrumented runs or the diagnostic patch on request.


PR fix notes

PR #73420: fix(gateway): avoid blocking channels on model prewarm

Repository: openclaw/openclaw
Author: dorukardahan
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/73420

Description (problem / solution / changelog)

Summary

schedule primary model prewarm in the background instead of awaiting it before channel startup
keep the existing bounded prewarm timeout and startup trace span
add coverage that model prewarm scheduling does not wait for completion

Why

On my VPS, startup tracing showed Slack was not the slow part. The gateway logged ready, then spent 134s in sidecars.channels before Slack provider startup because primary model prewarm was still ahead of startChannels().

Current main already scopes and bounds that prewarm to 5s, which is a good improvement. This PR takes the next step and makes channel startup independent from prewarm latency.

Expected behavior: chat channels can come online promptly, while model prewarm still runs as best-effort startup work.
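
The ordering change described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's actual code: `withTimeout`, `startSidecars`, and the `Task` type are hypothetical names, and the 5000 ms bound simply mirrors the 5s figure mentioned in the PR text.

```typescript
// Hypothetical stand-ins; these are NOT the real OpenClaw functions.
type Task = () => Promise<void>;

// Rejects if `task` has not settled within `timeoutMs`.
function withTimeout(task: Task, timeoutMs: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
    task().then(
      () => { clearTimeout(timer); resolve(); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Kicks off prewarm without awaiting it, then starts channels right away.
async function startSidecars(
  prewarm: Task,
  startChannels: Task,
  log: (message: string) => void,
): Promise<void> {
  void withTimeout(prewarm, 5000).then(
    () => log("prewarm done"),
    (error) => log(`prewarm skipped: ${String(error)}`),
  );
  await startChannels();
  log("channels started");
}

// Demo: a 50 ms prewarm no longer delays channel startup.
const events: string[] = [];
const slowPrewarm: Task = () => new Promise<void>((resolve) => setTimeout(resolve, 50));
const fastChannels: Task = async () => { events.push("startChannels ran"); };
const done = startSidecars(slowPrewarm, fastChannels, (message) => events.push(message));
```

Note that this only removes the ordering dependency. If the prewarm work itself runs synchronously, as the issue reports for `resolveModel`, it would still block the event loop while it executes.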

Related work

#60027 adds an opt-in env escape hatch to skip startup model warmup. This PR is different: it keeps warmup enabled by default but stops it from blocking channels.
#71203 refreshes configured agent models.json caches during startup. That is a separate cache correctness problem and should still be reviewable on its own.
#73276, #73353, #73411, and #72846 describe the same broad startup stall class that recent main fixes reduced. This PR closes the remaining ordering gap where channel startup still waits for model prewarm.
#73298 is still open as a broader slowdown report. This PR may help if its stall includes the same prewarm-before-channel ordering.

Tests

pnpm exec oxfmt --check src/gateway/server-startup-post-attach.ts src/gateway/server-startup-post-attach.test.ts CHANGELOG.md
pnpm exec vitest run --config test/vitest/vitest.gateway.config.ts src/gateway/server-startup-post-attach.test.ts src/gateway/server-startup.test.ts
pnpm check:test-types

Real-world validation

Before the local production mitigation, startup trace on my OpenClaw VPS showed sidecars.channels at 134006ms and Slack socket connected around 189s after gateway ready. After skipping blocking prewarm locally, sidecars.channels dropped to 571.8ms and Slack connected about 3.8s after gateway ready.

AI-assisted: yes. I reviewed the code and tests before opening this PR.

Changed files

CHANGELOG.md (modified, +1/-0)
src/gateway/server-startup-post-attach.test.ts (modified, +44/-1)
src/gateway/server-startup-post-attach.ts (modified, +28/-7)



Reproduction

K8s StatefulSet with:

1 custom HTTP-webhook channel plugin (no persistent connection)
A single proxy-style model provider (the model is a <proxy-provider>/<vendor>/<model> triplet)
nginx sidecar in front of the gateway, proxying /readyz and /healthz
workspace PVC mounted at /home/node/.openclaw
StartupProbe with 5s × 36 = 180 s ceiling (anything < 300 s misfires on the first cold boot)
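
The probe arithmetic in the last item can be checked with a short sketch. It assumes the usual Kubernetes rule of thumb that a startup probe tolerates roughly periodSeconds × failureThreshold seconds, and it ignores initialDelaySeconds; the function names are illustrative only.

```typescript
// Approximate time a startup probe tolerates before the container is restarted:
// periodSeconds × failureThreshold (initialDelaySeconds is ignored in this sketch).
function startupProbeCeilingSeconds(periodSeconds: number, failureThreshold: number): number {
  return periodSeconds * failureThreshold;
}

// Smallest failureThreshold that covers a target boot time at a given period.
function failureThresholdFor(targetSeconds: number, periodSeconds: number): number {
  return Math.ceil(targetSeconds / periodSeconds);
}

const ceilingSeconds = startupProbeCeilingSeconds(5, 36); // the 5s × 36 probe above
const observedBootSeconds = 2 * 60 + 45;                  // "+2m45s [gateway] ready" in the log excerpt
const headroomSeconds = ceilingSeconds - observedBootSeconds;
const thresholdFor300s = failureThresholdFor(300, 5);     // threshold needed for a 300 s ceiling

console.log({ ceilingSeconds, headroomSeconds, thresholdFor300s });
```

With these inputs the typical boot leaves only 15 s of headroom under the 180 s ceiling, so a slightly slower first cold boot would plausibly exceed it.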

Expected: ~30–60 s gateway warm-up.

Actual: ~2m42 s of silent gateway work between [gateway] starting channels and sidecars... and the first channel-ready log line.

Root cause (instrumented)

I patched the gateway dist with console.time-style markers to bisect the silent sidecars.channels step. With OPENCLAW_GATEWAY_STARTUP_TRACE=1 plus the patch:

sidecars.channels                   157,007 ms   eventLoopMax = 37,212 ms
├── prewarmConfiguredPrimaryModel   162,517 ms       ← THE BOTTLENECK
    ├── ensureOpenClawModelsJson     39,757 ms  (~40 s — matches eventLoopMax)
    └── resolveModel (sync)         122,399 ms  (~2 m to resolve ONE configured model, resolved=true)
└── params.startChannels()               85 ms  (channel start itself is healthy)

resolveModel is the synchronous path in model-CkUlgtmi.js:

const authStorage   = options?.authStorage   ?? discoverAuthStorage(resolvedAgentDir);
const modelRegistry = options?.modelRegistry ?? discoverModels(authStorage, resolvedAgentDir);

discoverModels instantiates the Pi model registry (createOpenClawModelRegistry → instantiatePiModelRegistry). Both discoverAuthStorage and discoverModels are fully synchronous, so they block the Node event loop for the duration. With an nginx sidecar in front of the gateway, every probe hit during this window returns 504.
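
The effect of synchronous work on the event loop can be reproduced in isolation. The sketch below is not OpenClaw code: `blockFor` is a hypothetical busy-wait standing in for a long synchronous call, and a 0 ms timer plays the role of a pending probe request.

```typescript
// Busy-waits synchronously, standing in for a long synchronous call
// such as the resolveModel path described above.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin: nothing else on the event loop can run meanwhile
  }
}

// Measures how late a 0 ms timer fires when synchronous work runs right after
// it is scheduled. The timer stands in for a probe waiting to be served.
function measureTimerLag(blockMs: number): Promise<number> {
  return new Promise<number>((resolve) => {
    const scheduled = Date.now();
    setTimeout(() => resolve(Date.now() - scheduled), 0);
    blockFor(blockMs);
  });
}

measureTimerLag(200).then((lagMs) => {
  console.log(`0 ms timer fired after ${lagMs} ms`);
});
```

The timer cannot fire until the busy-wait returns, so the reported lag is at least the blocking time. Node's `perf_hooks.monitorEventLoopDelay` offers a sampled view of the same delay in a running process.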

Investigations ruled out

I have logs available for each of these.

❌ Codex live model discovery — disabled via plugins.entries.codex.config.discovery.enabled = false (saved ~30 s and eliminated codex_app_server stderr, but not the bulk).
❌ Node.js compile-cache — wiped the cache on the PVC; cold boot was only ~7 s slower than warm.
❌ Plugin runtime-deps materialization — log shows installed bundled runtime deps in 908 ms.
❌ EBUSY on plugin auto-enable — single warning per boot, no retry loop.
❌ Provider-discovery live catalog calls — only codex (disabled) and anthropic-vertex (returns null fast without GCP creds) actually run live catalogs.
❌ Channel start itself — the channel plugin's startAccount callback runs in 85 ms; the freeze is upstream of any plugin-side work.

Suggested fixes (rough order of cleanliness)

1. Cache the Pi registry instantiation. After first cold boot, subsequent in-process resolves should be sub-ms; even better, bake a serialized snapshot into the workspace and skip instantiatePiModelRegistry entirely on cache-hit.
2. Skip prewarmConfiguredPrimaryModel for proxy-style providers where Pi's registry isn't actually consulted at chat time. The configured primary model in the failing deployment is a proxy passthrough; Pi resolves it as resolved=true after 2 minutes of work that produces no useful state.
3. Yield the event loop inside instantiatePiModelRegistry. Chunk by setImmediate every N providers so /readyz can answer 503 quickly during cold boot instead of blocking long enough for upstream proxies (nginx, Envoy, etc.) to time out.
4. Add OPENCLAW_SKIP_MODEL_PREWARM=1 (or a config flag like agents.defaults.skipPrewarm = true). Currently the only way to skip prewarm is to set OPENCLAW_AGENT_RUNTIME to a non-auto/non-pi value, which has chat-time side effects.
5. Have plugins.allow also gate provider/channel discovery enumeration for bundled plugins. isProviderPluginEligibleForSetupDiscovery (in providers-CjE0WvyR.js) returns true unconditionally for non-workspace bundled plugins regardless of plugins.allow / plugins.deny / plugins.entries.<id>.enabled. A tightly scoped deployment shouldn't pay enumeration cost for ~120 unused bundled plugins.
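
The "yield the event loop" suggestion could be sketched as follows. This is a generic illustration, not a patch to instantiatePiModelRegistry: `mapInChunks` and `yieldToEventLoop` are hypothetical names, and the yield uses `setTimeout(…, 0)` only to keep the sketch free of Node-specific typings (in Node.js, `setImmediate` would be the more usual choice, as the suggestion says).

```typescript
// Gives pending callbacks (for example an HTTP probe handler) a chance to run.
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

// Applies synchronous `work` to every item, yielding after each chunk.
async function mapInChunks<T, R>(
  items: readonly T[],
  chunkSize: number,
  work: (item: T) => R,
): Promise<R[]> {
  if (chunkSize < 1) throw new Error("chunkSize must be at least 1");
  const results: R[] = [];
  for (let start = 0; start < items.length; start += chunkSize) {
    for (const item of items.slice(start, start + chunkSize)) {
      results.push(work(item));
    }
    // Yield only between chunks, not after the last one.
    if (start + chunkSize < items.length) await yieldToEventLoop();
  }
  return results;
}

// Demo: an interval (standing in for probe handling) gets to run between chunks.
let probesAnswered = 0;
const probe = setInterval(() => { probesAnswered += 1; }, 0);
const providers = Array.from({ length: 10 }, (_, i) => `provider-${i}`);
const loaded = mapInChunks(providers, 2, (id) => id.toUpperCase())
  .finally(() => clearInterval(probe));
```

The chunk size trades total load time against responsiveness: smaller chunks yield more often, at the cost of more scheduling overhead.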


extent analysis

TL;DR

The most likely fix is to cache the Pi registry instantiation or skip prewarmConfiguredPrimaryModel for proxy-style providers to reduce the gateway warm-up time.

Guidance

Investigate caching the Pi registry instantiation to reduce the time spent in instantiatePiModelRegistry.
Consider skipping prewarmConfiguredPrimaryModel for proxy-style providers, as it does not produce useful state and blocks the event loop.
Review the discoverModels and discoverAuthStorage functions to identify opportunities for asynchronous execution or optimization.
Evaluate the effectiveness of adding OPENCLAW_SKIP_MODEL_PREWARM=1 or a similar config flag to skip prewarm for certain providers.


Notes

The provided information suggests that the bottleneck is in the prewarmConfiguredPrimaryModel step, specifically in the instantiatePiModelRegistry function. However, without more context or code, it is difficult to provide a more specific solution.

Recommendation

Apply a workaround by caching the Pi registry instantiation or skipping prewarmConfiguredPrimaryModel for proxy-style providers, as these changes are likely to have the most significant impact on reducing the gateway warm-up time.
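
As a rough illustration of the caching idea, an expensive synchronous build can be memoized per key so that only the first call pays the cost. Everything here is hypothetical: `createCachedLoader`, `buildRegistry`, and the `ModelRegistry` shape are stand-ins, not the Pi registry API.

```typescript
// Hypothetical stand-in for the result of an expensive, synchronous registry build.
interface ModelRegistry {
  readonly models: ReadonlyMap<string, string>;
}

// Wraps `build` so each distinct key is built at most once per process.
function createCachedLoader<K, V>(build: (key: K) => V): (key: K) => V {
  const cache = new Map<K, V>();
  return (key: K): V => {
    if (cache.has(key)) return cache.get(key) as V;
    const value = build(key);
    cache.set(key, value);
    return value;
  };
}

let buildCount = 0;
function buildRegistry(agentDir: string): ModelRegistry {
  buildCount += 1; // counts how often the expensive path actually runs
  return { models: new Map([["example-model", agentDir]]) };
}

const getRegistry = createCachedLoader(buildRegistry);
const first = getRegistry("/home/node/.openclaw");
const second = getRegistry("/home/node/.openclaw");
console.log(first === second, buildCount);
```

An in-process cache like this only speeds up repeated resolves after the first one. Avoiding the cost on the first cold boot would need a persisted snapshot, as suggested in the issue.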
