openclaw - ✅(Solved) Fix Gateway intermittently stalls: WebSocket preauth handshakes time out late during model catalog/provider discovery [2 pull requests, 5 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#74135Fetched 2026-04-30 06:28:07
View on GitHub
Comments
5
Participants
3
Timeline
12
Reactions
0
Author
Timeline (top)
cross-referenced ×6commented ×5closed ×1

The OpenClaw Gateway intermittently becomes unresponsive for tens of seconds. During these windows, WebSocket preauth handshakes time out, but the timeout callbacks fire much later than the configured 10s timeout. Observed values range from ~18s to ~72s.

This strongly suggests the Gateway Node.js event loop is being blocked or starved by synchronous / CPU-heavy work inside the gateway process, rather than a pure networking or client-side timeout issue.

The strongest current suspect is expensive model catalog regeneration / provider discovery after model/config hot reloads, especially around:

  • ensureOpenClawModelsJson()
  • planOpenClawModelsJson()
  • resolveProvidersForModelsJsonWithDeps()
  • resolveImplicitProviders()
  • resolveRuntimePluginDiscoveryProviders()
  • dynamic import of plugins/provider-discovery.runtime.js
  • resolvePluginDiscoveryProvidersRuntime() / resolvePluginProviders()

This is an initial issue with current findings. I can add deeper traces if needed.

Root Cause

Because the timer callback itself fires late, this looks like event-loop starvation/blocking.

Fix Action

Fix / Workaround

Proposed fixes / mitigations

Short-term mitigations

PR fix notes

PR #74275: fix(gateway): keep model catalog refresh off request path

Description (problem / solution / changelog)

Summary

Describe the problem and fix in 2–5 bullets:

If this PR fixes a plugin beta-release blocker, title it fix(<plugin-id>): beta blocker - <summary> and link the matching Beta blocker: <plugin-name> - <summary> issue labeled beta-blocker. Contributors cannot label PRs, so the title is the PR-side signal for maintainers and automation.

  • Problem: after model-related hot reload clears the core model catalog cache, the next Gateway request that needs model catalog data can synchronously re-enter model catalog/provider discovery.
  • Why it matters: latency-sensitive Gateway paths such as models.list can stall after reload, matching the issue #74135 responsiveness risk.
  • What changed: added Gateway-level stale-while-revalidate caching for the last successful model catalog, wired model hot reload to mark that Gateway cache stale, and added race/retry coverage.
  • What did NOT change (scope boundary): no provider-discovery rewrite, no worker-thread migration, no timeout/default changes, and no plugin ownership changes.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #74135
  • Related #
  • This PR fixes a bug or regression

Root Cause (if applicable)

For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write N/A. If the cause is unclear, write Unknown.

  • Root cause: Gateway model catalog calls delegated directly to the core catalog loader, so after hot reload invalidated the core cache, the next request-path caller could pay the cold catalog rebuild synchronously.
  • Missing detection / guardrail: there was no Gateway wrapper test for serving the previous successful catalog while a stale refresh is in flight.
  • Contributing context (if known): model-related hot reload intentionally reset the core catalog cache, but Gateway request paths did not have a stale-while-revalidate boundary.

Regression Test Plan (if applicable)

For bug fixes or regressions, name the smallest reliable test coverage that should catch this. Otherwise write N/A.

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
    • Target test or file: src/gateway/server-model-catalog.test.ts, src/gateway/ server.reload.test.ts, src/agents/model-catalog.test.ts
    • Scenario the test should lock in: after one successful Gateway catalog load, stale reload returns the previous catalog immediately, refreshes in the background, retries empty/error results, and does not let obsolete loads overwrite newer results.
    • Why this is the smallest reliable guardrail: the bug is at the Gateway catalog wrapper and hot-reload invalidation boundary; unit/seam tests cover that boundary without broad provider discovery or full Gateway E2E cost.
    • Existing test that already covers this (if any): none for Gateway stale-while-revalidate behavior.
    • If no new test is added, why not: N/A

User-visible / Behavior Changes

Gateway model catalog consumers remain responsive after model-related hot reload by serving the last successful catalog while refreshing stale data in the background.

Diagram (if applicable)

For UI changes or non-trivial logic flows, include a small ASCII diagram reviewers can scan quickly. Otherwise write N/A.

  Before:
  [model hot reload] -> [core catalog cache reset] -> [next Gateway models request blocks on cold catalog rebuild]

  After:
  [model hot reload] -> [Gateway catalog marked stale] -> [next Gateway models request returns last successful catalog] -> [background refresh updates cache]

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local Node/pnpm workspace
  • Model/provider: model catalog/provider discovery path
  • Integration/channel (if any): Gateway
  • Relevant config (redacted): model-related Gateway hot reload

Steps

  1. Start with a Gateway process that has successfully loaded a model catalog.
  2. Trigger model-related hot reload, such as a models.providers.* config change.
  3. Call a Gateway model catalog consumer such as models.list.

Expected

  • Gateway returns the last successful catalog immediately and refreshes stale catalog data in the background.

Actual

  • Before this fix, the next request could synchronously rebuild the catalog on the Gateway request path.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: focused Gateway catalog stale refresh, retryable empty/error catalog results, obsolete-load race handling, hot-reload stale marking, Plugin SDK API baseline check, build, and changed gate.
  • Edge cases checked: first load still blocks when no catalog exists; stale refresh failure keeps previous catalog; shared core in-flight retryable result reports correctly; obsolete first load does not overwrite newer catalog.
  • What you did not verify: live production Gateway under real provider latency.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: Gateway may briefly serve a stale model catalog immediately after hot reload.
    • Mitigation: stale serving only starts after a successful prior catalog load; background refresh updates the cache, and retryable empty/error results keep the cache stale for later retry.

Built with Codex

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • docs/.generated/plugin-sdk-api-baseline.sha256 (modified, +2/-2)
  • src/agents/model-catalog.test.ts (modified, +48/-1)
  • src/agents/model-catalog.ts (modified, +29/-10)
  • src/gateway/server-model-catalog.test.ts (added, +332/-0)
  • src/gateway/server-model-catalog.ts (modified, +106/-2)
  • src/gateway/server-reload-handlers.ts (modified, +2/-0)
  • src/gateway/server.reload.test.ts (modified, +14/-0)

PR #74762: fix: gateway model catalog cache regression

Description (problem / solution / changelog)

Summary

Found one regression in the new gateway model catalog cache: it treats an empty catalog as a successful cached catalog, which breaks the underlying retry-on-empty contract.

What ClawSweeper Is Fixing

  • Medium: Gateway caches transient empty model catalogs until reload/restart (regression)
    • File: src/gateway/server-model-catalog.ts:49
    • Evidence: startGatewayModelCatalogRefresh() assigns lastSuccessfulCatalog = catalog for every resolved array, including []. Later, loadGatewayModelCatalog() returns lastSuccessfulCatalog whenever it is truthy, and empty arrays are truthy in JS. The underlying loader explicitly avoids caching empty results at src/agents/model-catalog.ts:215 because an empty catalog can come from transient dependency/filesystem/provider issues and should be retried.
    • Impact: if the first gateway catalog load returns [], models.list, TUI model surfaces, session/model metadata helpers, and related gateway callers keep seeing no models until a model config reload or process restart. This is worse than the prior behavior, where the next request retried immediately.
    • Suggested fix: preserve the underlying no-cache-on-empty behavior in the gateway wrapper. Do not mark an empty result as fresh; keep the cache stale or clear it so the next call retries. Add a regression test where the injected loader returns [] once and a non-empty catalog on the second call.
    • Confidence: high

Expected Repair Surface

  • src/gateway/server-model-catalog.ts
  • src/gateway/server-model-catalog.test.ts
  • src/gateway/server-reload-handlers.ts

Source And Review Context

Expected validation

  • pnpm check:changed

ClawSweeper already ran:

  • pnpm docs:list
  • pnpm install after the first targeted test failed because node_modules was missing
  • pnpm test src/gateway/server-model-catalog.test.ts -- --reporter=verbose passed
  • Injected smoke with first loader call returning [] and second returning a model produced {"first":[],"second":[],"calls":1}, confirming the retry is suppressed
  • git diff --check 57a3d7f6e897f25073e313d5c24b6fb6f60575ae..6421e1f36a3cfdf3ab1b4502b36fe718e0d662d3

Known review limits:

  • Full suite and live gateway smoke were not run; review used focused gateway tests and an injected runtime proof.

ClawSweeper Guardrails

  • Re-check the finding against latest main before changing code.
  • Keep the patch to the narrowest behavior change and matching regression coverage.
  • Do not merge automatically; this PR stays for maintainer review.

ClawSweeper 🐠 replacement reef notes:

  • Cluster: clawsweeper-commit-openclaw-openclaw-6421e1f36a3c
  • Source PRs: none
  • Credit: Detected by ClawSweeper commit review for 6421e1f36a3cfdf3ab1b4502b36fe718e0d662d3.; Original commit author: Peter Steinberger.
  • Validation: pnpm check:changed

fish notes: model gpt-5.5, reasoning medium; reviewed against da5e171ffab1.

Changed files

  • src/gateway/server-model-catalog.test.ts (modified, +18/-0)
  • src/gateway/server-model-catalog.ts (modified, +1/-1)

Code Example

resolvePluginDiscoveryProvidersRuntime({ onlyPluginIds: ['deepseek'] })

---

resolveProvidersForModelsJsonWithDeps()
  -> resolveImplicitProviders()
  -> resolveRuntimePluginDiscoveryProviders()
  -> dynamic import of plugins/provider-discovery.runtime.js
  -> resolvePluginDiscoveryProvidersRuntime()
  -> resolvePluginProviders()
RAW_BUFFERClick to expand / collapse

Gateway intermittently stalls: WebSocket preauth handshakes time out late during model catalog/provider discovery

Summary

The OpenClaw Gateway intermittently becomes unresponsive for tens of seconds. During these windows, WebSocket preauth handshakes time out, but the timeout callbacks fire much later than the configured 10s timeout. Observed values range from ~18s to ~72s.

This strongly suggests the Gateway Node.js event loop is being blocked or starved by synchronous / CPU-heavy work inside the gateway process, rather than a pure networking or client-side timeout issue.

The strongest current suspect is expensive model catalog regeneration / provider discovery after model/config hot reloads, especially around:

  • ensureOpenClawModelsJson()
  • planOpenClawModelsJson()
  • resolveProvidersForModelsJsonWithDeps()
  • resolveImplicitProviders()
  • resolveRuntimePluginDiscoveryProviders()
  • dynamic import of plugins/provider-discovery.runtime.js
  • resolvePluginDiscoveryProvidersRuntime() / resolvePluginProviders()

This is an initial issue with current findings. I can add deeper traces if needed.

Environment

  • OS: Ubuntu 22.04.5 LTS, arm64
  • Node.js: v24.14.1
  • Gateway bound on LAN, local port 18789
  • Gateway process remains alive during incidents
  • No obvious FD leak observed during checks
  • Host not thermally throttled during later checks:
    • 4 cores
    • temperature around ~41°C
    • vcgencmd get_throttled = 0x0

Symptoms

Observed in Gateway logs:

  • WebSocket connections close before full connect/preauth.
  • Important class: cause=handshake-timeout.
  • Configured preauth handshake timeout is 10s.
  • Actual logged handshakeMs / durationMs values were much larger:

Examples:

  • handshakeMs=47133
  • handshakeMs=54995
  • handshakeMs=49755
  • handshakeMs=71389
  • handshakeMs=18201
  • handshakeMs=34302

Because the timer callback itself fires late, this looks like event-loop starvation/blocking.

Additional symptoms seen during affected windows:

  • Some Gateway RPCs occasionally slow:
    • config.get ~2.1–2.4s
    • channels.status ~2.0s
    • config.schema.lookup ~3.2s
    • agent.wait reached expected long wait timeout
  • Telegram polling stalls were also seen near some incident windows.
  • openclaw status --json --timeout 1000 had ~14s wall time despite reported gateway connect latency around ~160ms.
  • openclaw logs --limit 5 --timeout 5000 had ~12s wall time, while the underlying logs.tail RPC later measured fast after warmup.

Findings

1. Not likely to be a simple network / socket issue

The Gateway was alive and listening on 0.0.0.0:18789.

HTTP/dashboard probes to 127.0.0.1:18789 were healthy when idle:

  • first probe around ~285ms
  • later probes around ~30ms

No obvious FD leak was observed:

  • FD count around mid/high 30s
  • threads around 11

2. Preauth timer firing late is the key evidence

Code inspection found:

  • DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS = 10000
  • server uses getPreauthHandshakeTimeoutMsFromEnv() for the preauth timer

But logs showed timeout callbacks firing after 18–72s. That strongly indicates the event loop was unable to run the timer callback on time.

3. Session/transcript sync reads are a smell, but probably not the sole root cause

Several gateway-adjacent paths use synchronous file reads/parsing:

  • session store reads via readSessionStoreReadOnly
  • transcript reads via readSessionMessages()
  • some status/history/session preview paths

This is concerning, especially for large transcripts/checkpoints. However, microbenchmarks on the largest local artifacts showed individual parse costs mostly in the ~100–500ms range, not enough alone to explain 70s stalls.

So this remains a contributor/smell, but not the strongest single root cause.

4. Late handshake windows correlate with state churn

The worst late-handshake windows clustered around:

  • config/model/default alias hot reloads
  • provider/model catalog related changes
  • pairing/device-state churn
  • stuck session / embedded-run lifecycle events

This points toward bursts of gateway-side state recomputation or plugin/runtime work.

5. Stronger suspect: model catalog / provider discovery cold path

Local probes outside the gateway showed ensureOpenClawModelsJson() can be extremely expensive:

  • importing model-catalog-*.js in a fresh Node process took ~6.5s

  • loadModelCatalog() / ensureOpenClawModelsJson() without an outer provider-discovery timeout did not complete within test timeouts in some probes

  • with:

    • OPENCLAW_LIVE_GATEWAY=1
    • OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS=3000
    • OPENCLAW_LIVE_GATEWAY_PROVIDERS=codex

    it completed in ~20.7s and logged that Codex provider catalog discovery timed out after 3000ms.

Filtered single-provider probes still showed high wall times:

  • anthropic-vertex: ~13.9s
  • deepseek: ~38.6s
  • codex with 3s provider timeout: ~20.7s

This suggests the problem is not only one slow live provider catalog call. There appears to be significant plugin/runtime/provider setup overhead.

6. Provider filtering appears insufficient for the full path

Direct call:

resolvePluginDiscoveryProvidersRuntime({ onlyPluginIds: ['deepseek'] })

behaved mostly as expected:

  • entries-only returned only deepseek in ~81ms
  • normal only-plugin run returned only deepseek in ~7.6s

But when running through ensureOpenClawModelsJson() with provider filtering, profiling still showed broad provider plugin loading, including:

  • anthropic
  • byteplus
  • deepseek
  • moonshot
  • openai
  • tencent
  • volcengine
  • xai

Even a minimized in-memory cfg.models.providers = { deepseek: ... } still showed broad provider plugin load/profile activity.

A later source-path refinement makes provider-specific policy helpers less likely to be the cold-load trigger: normalizeProviderConfigPolicy(), applyProviderNativeStreamingUsagePolicy(), and resolveProviderConfigApiKeyPolicy() call their plugin helpers with allowRuntimePluginLoad: false.

So the stronger suspect is the runtime discovery stage before/around normalization:

resolveProvidersForModelsJsonWithDeps()
  -> resolveImplicitProviders()
  -> resolveRuntimePluginDiscoveryProviders()
  -> dynamic import of plugins/provider-discovery.runtime.js
  -> resolvePluginDiscoveryProvidersRuntime()
  -> resolvePluginProviders()

In particular, resolvePluginDiscoveryProvidersRuntime() has a selective full-plugin fallback when onlyPluginIds is undefined. If the provider filter is not passed through correctly from config/hot-reload context, or if source snapshot projection reintroduces providers, this can cause broad provider loading.

7. Warm cache is much faster, cold path is the problem

Within one process:

  • first full provider discovery call: ~17.5s
  • second call: ~1.3s
  • third call: ~0.7s

So caching helps, but config/model hot reloads or cache invalidation can re-enter the expensive cold path.

Current diagnosis

The Gateway can become temporarily unresponsive because model catalog regeneration / provider discovery performs expensive cold plugin/runtime work in or adjacent to latency-sensitive Gateway paths.

This can delay timers, WebSocket handshakes, CLI/RPC responses, Telegram polling, and possibly status/log commands.

The issue is likely not one single slow transcript read, but a combination of:

  1. expensive cold provider/plugin runtime loading,
  2. insufficient provider filtering in the full model-catalog regeneration path,
  3. potentially unbounded or poorly bounded live provider discovery,
  4. cache invalidation after model/config hot reloads,
  5. work running on the Gateway event loop instead of being deferred/offloaded.

Proposed fixes / mitigations

Short-term mitigations

  1. Add a bounded default timeout for live provider discovery in Gateway contexts.

    • Example: default OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS to a small value when running inside gateway.
    • Avoid unbounded provider catalog resolution during hot reload.
  2. Avoid running full ensureOpenClawModelsJson() synchronously after config/model hot reload.

    • Defer regeneration.
    • Serve stale cached catalog while refresh happens in background.
    • Emit a warning if refresh fails or times out.
  3. Ensure provider discovery filters are passed through the full path.

    • In particular, check what onlyPluginIds is passed into resolveRuntimePluginDiscoveryProviders() during ensureOpenClawModelsJson() after config/model hot reloads.
    • Avoid broad resolvePluginProviders() when a narrowed provider set is known.
  4. Add method-level timing logs around:

    • applyHotReload()
    • resetModelCatalogCache()
    • ensureOpenClawModelsJson()
    • planOpenClawModelsJson()
    • resolveProvidersForModelsJsonWithDeps()
    • resolveImplicitProviders()
    • resolveRuntimePluginDiscoveryProviders()
    • resolvePluginDiscoveryProvidersRuntime()
    • resolvePluginProviders()
    • readSessionMessages()
  5. Add runtime event-loop-delay instrumentation in the Gateway.

    • For example, log when event-loop delay exceeds 1s / 5s / 10s, including the current high-level operation if known.

Longer-term fixes

  1. Move expensive model catalog/provider discovery work out of the Gateway request/event-loop path.

    • Use a worker thread or child process.
    • Keep Gateway responsive while catalog refresh happens.
  2. Make provider discovery fully cache-aware and cancellation-aware.

    • Timeouts should abort work cleanly, not merely stop waiting while leaving expensive tasks/resources alive.
  3. Separate static provider config normalization from live provider discovery.

    • Normalization should not require broad plugin loading if static config is already known.
  4. Improve CLI command decomposition.

    • Commands like openclaw status / openclaw logs should avoid unnecessary status/bootstrap/provider/model work when the requested RPC itself is cheap.

Reproduction / diagnostic hints

Useful probes:

  • Compare logs.tail RPC time vs full openclaw logs command wall time.
  • Trigger model/default alias config hot reload, then watch for late WS handshake timers.
  • Run ensureOpenClawModelsJson() in a fresh Node process with plugin profiling enabled.
  • Test with and without provider discovery timeout:
    • OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS=3000
    • OPENCLAW_LIVE_GATEWAY_PROVIDERS=codex
    • OPENCLAW_PLUGIN_LOAD_PROFILE=1

Key expected signal:

  • A 10s WebSocket preauth timer firing after much more than 10s is the clearest symptom of event-loop delay.

extent analysis

TL;DR

Implement a bounded default timeout for live provider discovery and defer expensive model catalog regeneration to prevent event-loop blocking.

Guidance

  1. Add a default timeout for live provider discovery: Set a small value for OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS when running inside the gateway to avoid unbounded provider catalog resolution.
  2. Defer model catalog regeneration: Serve a stale cached catalog while regenerating the model catalog in the background to prevent synchronous blocking of the event loop.
  3. Pass provider filters through the full path: Ensure onlyPluginIds is correctly passed into resolveRuntimePluginDiscoveryProviders() during ensureOpenClawModelsJson() to avoid broad provider loading.
  4. Add timing logs and event-loop delay instrumentation: Log method-level timings and event-loop delays to better understand the performance bottlenecks.
  5. Consider moving expensive work out of the event loop: Use a worker thread or child process to perform expensive model catalog and provider discovery work.

Example

// Example of adding a default timeout for live provider discovery
const OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS = 3000;

Notes

The provided issue description suggests that the problem is complex and multifaceted. The proposed fixes and mitigations are intended to address the symptoms and underlying causes, but may require further refinement and testing.

Recommendation

Apply the proposed short-term mitigations, such as adding a bounded default timeout for live provider discovery and deferring expensive model catalog regeneration, to alleviate the event-loop blocking issue.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - ✅(Solved) Fix Gateway intermittently stalls: WebSocket preauth handshakes time out late during model catalog/provider discovery [2 pull requests, 5 comments, 3 participants]