openclaw - ✅(Solved) Fix Gateway cold-start: bundled-plugin loading dominates startup; plugins.deny not honored at import-time [1 pull requests, 1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#71690Fetched 2026-04-26 05:09:45
View on GitHub
Comments
1
Participants
2
Timeline
4
Reactions
0
Timeline (top)
cross-referenced ×2commented ×1referenced ×1

OpenClaw gateway cold-start time is dominated by loading bundled plugins that the operator doesn't actually use. On a fast workstation with warm disk cache, openclaw gateway reaches TCP-listening in ~4.3s; cold cache pushes that to ~10.8s. In containerized environments with slower per-vCPU and remote-mounted filesystems (Cloud Run + GCS FUSE in our case), the same boot routinely exceeds the 60–120s window we budget for it, causing user-visible failures.

This issue documents three separate cold-start cost drivers I found while investigating a real SandboxReadinessTimeoutError: Sandbox failed to become ready within 180s in our production environment, and proposes targeted fixes that should be additive (no behavior change for users who don't opt in).

Error Message

  1. Either way, the retry loop deserves investigation. One failure, one error log, no retry.

Root Cause

For operators who don't use Bedrock, this is dead weight on every cold start: ~2s of repeated module-resolution failures and ~22 MB of JS that we're forced to keep on the disk image. For operators who do use Bedrock with a working node_modules/@aws-sdk/client-bedrock symlink, the bundling layout works by accident.

Fix Action

Fixed

PR fix notes

PR #71746: perf(cli): skip plugin load on agents list --json (#71739)

Description (problem / solution / changelog)

Closes #71739.

Bug

Reporter measured `agents list --json` at ~7-9s on a fast host (~11s in container) on 2026.4.23, while peer `--json` commands stay sub-second:

commandwarmcold
`cron list --all --json`1.0s
`skills list --json`0.9s
`sessions --all-agents --json`1.0s
`channels list --json`0.9s
`agents list --json`8.6s~9s

The reporter's `agent-dash` endpoint cold path dropped from 27s → ~2s after a local dist patch — they could even retire the 5-min cache TTL workaround they shipped to dodge this.

Root cause

`agents list` inherits `loadPlugins: 'always'` from the parent `agents` policy in `command-catalog.ts`, then `agentsListCommand` calls `buildProviderStatusIndex(cfg)` unconditionally. Both paths trigger the bundled-extension import waterfall (~60+ extension `index.js` modules) — but `providerStatus` is only rendered into human text output, never used in JSON.

`channels list` already uses `loadPlugins: 'never'` and proves the shape is right; this PR matches that shape via the safer `text-only` variant so human invocations are unchanged.

Fix (two-line, per reporter's diagnosis)

  1. `src/cli/command-catalog.ts` — opt `agents list` into the existing `text-only` plugin-preload policy. Plugin preload runs for human text output, skips for `--json`.

  2. `src/commands/agents.commands.list.ts` — skip `buildProviderStatusIndex` (and the per-summary provider enrichment loop) when `opts.json`. Provider info is only rendered in human text output via `formatSummary`, so dropping it from JSON has no observable effect on callers that consume `id`, `name`, `model`, `bindings`, `isDefault`, `identity*`, `workspace`, or `agentDir`. `routes` is config-derived and continues to be set in both modes.

Tests

  • new assertion in `command-startup-policy.test.ts`: `agents list` with `jsonOutputMode: true` now resolves to `loadPlugins: false` (was effectively `true` via the parent `agents` "always" policy).
  • existing assertion that human (`jsonOutputMode: false`) still triggers plugin load is preserved verbatim — no behavior change for text mode.

6/6 tests pass. Lint clean (`pnpm oxlint` — 0 warnings, 0 errors).

Out of scope

  • `--bindings` flag opt-in for restoring `providers` in JSON output: worth adding later if any consumer needs it. Reporter said dashboard consumers don't, and `--bindings` already gates `bindingDetails` enrichment, so the precedent is there if needed.
  • Broader plugin-discovery cache work (#67040, #71690) addresses the same family of cold-start cost.

🤖 generated with assistance from Claude Code Co-authored-by: HCL [email protected]

Changed files

  • src/cli/command-catalog.ts (modified, +6/-0)
  • src/cli/command-startup-policy.test.ts (modified, +9/-0)
  • src/commands/agents.commands.list.ts (modified, +18/-9)

Code Example

// dist/discovery-Bwev8dJL.js (top of file)
import { BedrockClient, ListFoundationModelsCommand } from "@aws-sdk/client-bedrock";

---

17:43:51.569 [plugins] amazon-bedrock failed to load ... Cannot find module '@aws-sdk/client-bedrock'
17:43:53.562 [plugins] amazon-bedrock failed to load ... (retry)
17:43:53.909 [plugins] amazon-bedrock failed to load ... (retry)
17:43:54.311 [plugins] amazon-bedrock failed to load ... (retry)
17:43:55.219 [plugins] amazon-bedrock failed to load ... (retry)
17:43:55.502 [plugins] amazon-bedrock failed to load ... (retry)

---

// dist/cache-controls-DMmVdSiO.js
const DEFAULT_PLUGIN_DISCOVERY_CACHE_MS = 1e3;
const DEFAULT_PLUGIN_MANIFEST_CACHE_MS = 1e3;

---

mkdir -p /tmp/oc-state
cat > /tmp/oc-state/openclaw.json << 'JSON'
{ "gateway": { "mode": "local", "auth": { "mode": "none" }, "controlUi": { "enabled": false } },
  "models": { "providers": {
    "anthropic": { "baseUrl": "https://api.anthropic.com", "apiKey": "fake", "api": "anthropic-messages", "models": [{ "id": "claude-sonnet-4-6", "name": "Claude Sonnet 4.6" }] }
  } },
  "channels": { "slack": { "enabled": true, "webhookPath": "/slack/events", "botToken": "xoxb-fake", "signingSecret": "fake" } },
  "discovery": { "mdns": { "mode": "off" } },
  "agents": { "defaults": { "model": { "primary": "anthropic/claude-sonnet-4-6" } } } }
JSON
npm install -g openclaw@2026.3.31
OPENCLAW_STATE_DIR=/tmp/oc-state OPENCLAW_NO_RESPAWN=1 openclaw gateway --port 18091 --bind loopback --verbose 2>&1 | grep -E "READY|amazon-bedrock|listening"
RAW_BUFFERClick to expand / collapse

Summary

OpenClaw gateway cold-start time is dominated by loading bundled plugins that the operator doesn't actually use. On a fast workstation with warm disk cache, openclaw gateway reaches TCP-listening in ~4.3s; cold cache pushes that to ~10.8s. In containerized environments with slower per-vCPU and remote-mounted filesystems (Cloud Run + GCS FUSE in our case), the same boot routinely exceeds the 60–120s window we budget for it, causing user-visible failures.

This issue documents three separate cold-start cost drivers I found while investigating a real SandboxReadinessTimeoutError: Sandbox failed to become ready within 180s in our production environment, and proposes targeted fixes that should be additive (no behavior change for users who don't opt in).

Background — how I got here

Investigating a production cold-start timeout, I traced it to openclaw gateway itself eating most of the 180s budget. Repro environment for these numbers: workstation, Node v25.5, [email protected] from npm, npm install --omit=dev. Sandbox container in production is node:22-bookworm-slim on Cloud Run with a GCS-FUSE-mounted workspace volume.

Finding 1 — plugins.deny does not prevent bundled plugins from being imported

The plugin loader in dist/loader-BLFpy85U.js correctly checks resolveEnableState and skips the register(api) call for denied plugins. Good.

But several dist/*.js modules carry static top-level imports of plugin internals that re-import the plugin's transitive dependencies. The clearest example is dist/discovery-Bwev8dJL.js:

// dist/discovery-Bwev8dJL.js (top of file)
import { BedrockClient, ListFoundationModelsCommand } from "@aws-sdk/client-bedrock";

discovery-Bwev8dJL.js is imported by dist/api-Q8C6_PdG.js, which is imported by other gateway code that runs unconditionally on every boot. So even when amazon-bedrock is in plugins.deny, @aws-sdk/client-bedrock is still resolved at module-load time on every cold start.

To make matters worse, the bundling layout is broken in a way that turns this into a repeated failure:

  • @aws-sdk/client-bedrock lives in dist/extensions/amazon-bedrock/node_modules/@aws-sdk/client-bedrock (22 MB bundled inside the extension).
  • But the importer is dist/discovery-Bwev8dJL.js, at the top of dist/, not under the extension.
  • Node's module resolution walks up from dist/, finds nothing matching, and throws Cannot find module '@aws-sdk/client-bedrock'.
  • Something then retries this load 5–6 times within ~2 seconds, all post-listen:
17:43:51.569 [plugins] amazon-bedrock failed to load ... Cannot find module '@aws-sdk/client-bedrock'
17:43:53.562 [plugins] amazon-bedrock failed to load ... (retry)
17:43:53.909 [plugins] amazon-bedrock failed to load ... (retry)
17:43:54.311 [plugins] amazon-bedrock failed to load ... (retry)
17:43:55.219 [plugins] amazon-bedrock failed to load ... (retry)
17:43:55.502 [plugins] amazon-bedrock failed to load ... (retry)

Adding amazon-bedrock to plugins.deny in openclaw.json does not silence these.

Why this matters

For operators who don't use Bedrock, this is dead weight on every cold start: ~2s of repeated module-resolution failures and ~22 MB of JS that we're forced to keep on the disk image. For operators who do use Bedrock with a working node_modules/@aws-sdk/client-bedrock symlink, the bundling layout works by accident.

Suggested fix

  1. Stop static-importing plugin internals from non-extension dist modules. Make discovery-Bwev8dJL.js's @aws-sdk/client-bedrock import lazy (dynamic await import() inside the discovery functions, gated by cfg.plugins?.deny?.includes("amazon-bedrock")).
  2. Honor plugins.deny at the module-load layer, not just the registration layer. A simple guard at the top of discoverBedrockModels etc. that bails out when bedrock is denied would short-circuit the loop.
  3. Either way, the retry loop deserves investigation. One failure, one error log, no retry.

Finding 2 — 41 of 87 bundled plugins are enabledByDefault: true

For the use-case where an operator builds a gateway image with a known small subset of providers and channels (e.g. our case: anthropic, google, openai, slack, googlechat, browser), there's no clean way to short-circuit the load of the 30-plus other "enabled-by-default" provider plugins (byteplus, chutes, cloudflare-ai-gateway, copilot-proxy, deepseek, huggingface, kilocode, kimi, litellm, microsoft-foundry, minimax, mistral, modelstudio, moonshot, nvidia, ollama, openrouter, phone-control, qianfan, sglang, synthetic, talk-voice, together, venice, vercel-ai-gateway, vllm, volcengine, xai, xiaomi, zai, etc.).

Every one of those gets manifest-read, validated, and (per Finding 1) often module-imported even when the operator never reaches their auth methods.

Suggested fix

Either:

  • (a) Make enabledByDefault opt-in. Default to false for everything except a tiny core (anthropic, openai, google, browser, device-pair?). Operators who want others enable them via plugins.allow or plugins.entries.<id>.enabled = true. This is a breaking change but probably the right one.
  • (b) Add plugins.disableBundled: true operator-level flag. Under that flag, only plugins explicitly in plugins.allow or plugins.entries.<id>.enabled = true are loaded. Bundled-but-not-listed = skipped.
  • (c) Add a build-time OPENCLAW_PRESET=lean switch that ships an image with only the most-commonly-needed plugins.

(b) is the most surgical and least disruptive. It also solves Finding 1 indirectly because the implicit imports never happen if the modules aren't loaded.

Finding 3 — Plugin-discovery cache TTL of 1 s defeats its own purpose for cold starts

// dist/cache-controls-DMmVdSiO.js
const DEFAULT_PLUGIN_DISCOVERY_CACHE_MS = 1e3;
const DEFAULT_PLUGIN_MANIFEST_CACHE_MS = 1e3;

The 1 second TTL means the cache is purely an in-process micro-optimization for repeated lookups within the same boot — it can't help across cold starts because each cold start is a brand new process. There's no on-disk persisted snapshot that could survive container restart.

Suggested fix

Add an on-disk plugin-discovery cache keyed by (packageRoot, packageVersion), persisted to OPENCLAW_STATE_DIR. Boot reads it first, validates the package version still matches, then short-circuits the manifest scan + validation for all 87 extensions. Cache invalidates automatically when the OpenClaw image is rebuilt (since packageRoot path or packageVersion will change).

For a typical Cloud Run container that boots, serves, scales to zero, then boots again with the same image, this would skip ~87 manifest reads + JSON Schema validations on every subsequent cold start.

Numbers

Local timings, Node v25.5 on a workstation SSD, [email protected]:

ConfigDisk cacheTCP-ready time
Marvin's prod-ish config (no deny list)cold10.8 s
Marvin's prod-ish config (no deny list)warm4.3 s
Marvin's prod-ish config + 36-entry deny listwarm4.3 s

The warm-cache case shows that the deny list does not measurably improve cold start time today, confirming the static-import problem of Finding 1.

The cold-cache case shows the I/O cost of touching all 87 extension dirs and their node_modules/ alone is ~6.5 s on fast local hardware. In production on Cloud Run with GCS FUSE this gap is much larger.

What I'd love your guidance on

I'd like to send small PRs for #1 (lazy-load bedrock + maybe other heavy SDKs) and #2(b) (plugins.disableBundled). #3 is a bigger change and I'd want a sanity check before going that route. Happy to write any/all of them — just want to confirm direction before committing time. The team's preferred shape for #2 is the most ambiguous.

Repro recipe for the bedrock retry loop:

mkdir -p /tmp/oc-state
cat > /tmp/oc-state/openclaw.json << 'JSON'
{ "gateway": { "mode": "local", "auth": { "mode": "none" }, "controlUi": { "enabled": false } },
  "models": { "providers": {
    "anthropic": { "baseUrl": "https://api.anthropic.com", "apiKey": "fake", "api": "anthropic-messages", "models": [{ "id": "claude-sonnet-4-6", "name": "Claude Sonnet 4.6" }] }
  } },
  "channels": { "slack": { "enabled": true, "webhookPath": "/slack/events", "botToken": "xoxb-fake", "signingSecret": "fake" } },
  "discovery": { "mdns": { "mode": "off" } },
  "agents": { "defaults": { "model": { "primary": "anthropic/claude-sonnet-4-6" } } } }
JSON
npm install -g [email protected]
OPENCLAW_STATE_DIR=/tmp/oc-state OPENCLAW_NO_RESPAWN=1 openclaw gateway --port 18091 --bind loopback --verbose 2>&1 | grep -E "READY|amazon-bedrock|listening"

You'll see the gateway listen at ~1.2s and amazon-bedrock continue retrying for several seconds afterward.

extent analysis

TL;DR

To improve OpenClaw gateway's cold-start time, apply lazy loading for heavy SDKs like Bedrock and consider implementing a plugins.disableBundled flag to skip unused plugins.

Guidance

  • Identify and lazy-load heavy SDKs like @aws-sdk/client-bedrock to prevent unnecessary imports and module resolution failures.
  • Implement a plugins.disableBundled flag to allow operators to opt-out of loading unused plugins, reducing cold-start time.
  • Consider increasing the plugin-discovery cache TTL or implementing an on-disk cache to persist across cold starts.
  • Review the repro recipe provided to understand the bedrock retry loop issue and its implications on cold-start time.

Example

// Lazy-load @aws-sdk/client-bedrock example
const loadBedrock = async () => {
  if (cfg.plugins?.deny?.includes("amazon-bedrock")) return;
  const { BedrockClient, ListFoundationModelsCommand } = await import("@aws-sdk/client-bedrock");
  // Use BedrockClient and ListFoundationModelsCommand
};

Notes

The provided issue lacks information on the exact Node.js and OpenClaw versions used in production, which might affect the applicability of the suggested fixes. Additionally, the plugins.disableBundled flag implementation details are not specified, requiring further discussion.

Recommendation

Apply the plugins.disableBundled flag workaround to reduce cold-start time by skipping unused plugins, as it is a less invasive change compared to lazy-loading SDKs or modifying the plugin-discovery cache.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - ✅(Solved) Fix Gateway cold-start: bundled-plugin loading dominates startup; plugins.deny not honored at import-time [1 pull requests, 1 comments, 2 participants]