openclaw - Provider auth prewarm can starve gateway event loop and cause sessions.list timeouts after restart

Summary

After a gateway restart on OpenClaw 2026.5.22 (a374c3a), local gateway RPCs can time out even though the service is active and eventually becomes healthy. The visible symptom was sessions.list / sessions-list tooling timing out against ws://127.0.0.1:18789 during a heartbeat/check path.

This did not appear to be a dead gateway. The gateway process was alive, but startup provider-auth warmup starved the Node event loop long enough for gateway clients using the default 10s timeout to fail.

Evidence

Sanitized journal evidence from a local loopback deployment:

[fetch-timeout] fetch timeout after 10000ms (elapsed 43203ms) timer delayed 33203ms, likely event-loop starvation operation=fetchWithTimeout url=https://api.telegram.org/.../getMe
[ws] closed before connect ... code=1006
[gateway] provider auth state pre-warmed in 72203ms eventLoopMax=42983.2ms

Around the same window, sessions.list calls from clients timed out at the default 10s budget. After the provider-auth prewarm completed, sessions.list itself was fast again, with server-side log lines in the low hundreds of milliseconds.
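
The first log line can be read as simple arithmetic: a 10000ms timer that only fired after 43203ms was delayed by 33203ms, which points at a blocked event loop rather than a slow network. A minimal sketch of that check follows; `classifyTimeout` and the 1000ms tolerance are hypothetical and are not taken from OpenClaw's `fetchWithTimeout`.

```typescript
// Hypothetical helper for illustration; not OpenClaw's actual implementation.
interface TimeoutReport {
  timerDelayMs: number;   // how late the timeout timer fired
  likelyStarved: boolean; // true when the delay exceeds the tolerance
}

function classifyTimeout(
  timeoutMs: number,
  elapsedMs: number,
  toleranceMs: number = 1000, // assumed slack for normal scheduling jitter
): TimeoutReport {
  // A timer cannot fire early, so clamp negative differences to zero.
  const timerDelayMs = Math.max(0, elapsedMs - timeoutMs);
  return { timerDelayMs, likelyStarved: timerDelayMs > toleranceMs };
}

// Values from the journal line above: timeout 10000ms, elapsed 43203ms.
const report = classifyTimeout(10000, 43203);
console.log(report); // { timerDelayMs: 33203, likelyStarved: true }
```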

A local isolated check also showed that loading the model catalog for provider auth prewarm can be very expensive on a configured host:

catalogCount: 971
providerCount: 45
agentCount: 5

The configured runtime only needed a much smaller subset of providers, but startup prewarm still considered the full catalog.

Expected

Once the gateway reports ready, basic local RPCs such as health, cron.list, and sessions.list should remain responsive within the default 10s client timeout.

Provider auth prewarm should be best-effort and must not starve the event loop. It should be bounded, idle-scheduled, chunked/yielding, cancellable, and scoped to providers actually referenced by config unless a full scan is explicitly requested.
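
One way the bounded, chunked/yielding, and cancellable properties could fit together is sketched below. Every name here (`prewarmInChunks`, `PrewarmOptions`, the `provider-*` ids) is hypothetical, and the sketch assumes each per-provider warm step is short and synchronous; it is not OpenClaw's prewarm code.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not OpenClaw APIs.
type WarmFn = (providerId: string) => void;

interface PrewarmOptions {
  chunkSize: number;    // providers processed between yields
  budgetMs: number;     // upper bound for the whole best-effort pass
  signal?: AbortSignal; // lets the caller cancel, e.g. on shutdown
}

interface PrewarmResult {
  warmed: string[];
  truncated: boolean; // true if the pass stopped early (budget or cancel)
}

// Hand control back to the event loop so queued timers and I/O can run.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, 0));
}

async function prewarmInChunks(
  providerIds: string[],
  warm: WarmFn,
  opts: PrewarmOptions,
): Promise<PrewarmResult> {
  const startedAt = Date.now();
  const warmed: string[] = [];
  for (let i = 0; i < providerIds.length; i += opts.chunkSize) {
    // Check cancellation and the time budget before starting each chunk.
    if (opts.signal?.aborted || Date.now() - startedAt > opts.budgetMs) {
      return { warmed, truncated: true };
    }
    for (const id of providerIds.slice(i, i + opts.chunkSize)) {
      warm(id);
      warmed.push(id);
    }
    await yieldToEventLoop();
  }
  return { warmed, truncated: false };
}

// Usage with placeholder provider ids.
prewarmInChunks(
  ["provider-a", "provider-b", "provider-c"],
  (id) => console.log("warmed", id),
  { chunkSize: 2, budgetMs: 2000 },
).then((result) => console.log(result));
```

Yielding between chunks only helps if no single warm step blocks for long; a step that itself takes seconds would still stall the loop. Idle scheduling could be approximated by delaying the first call with a timer after readiness.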

Actual

Provider auth prewarm ran immediately after startup and caused event-loop stalls of up to ~43s. During that period, local WebSocket clients could fail before connect or report a gateway timeout even though the gateway later responded.

Local Mitigation Used

A local hotfix mitigated the incident by:

  • skipping provider auth prewarm at startup via an environment flag;
  • increasing the sessions-list tool gateway timeout to 30s;
  • changing provider-auth warmup to infer configured providers first and only fall back to the full model catalog if no providers can be inferred.
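
The first and third bullets can be sketched as a single selection step. The report does not name the environment flag or show the config schema, so `EXAMPLE_SKIP_PROVIDER_AUTH_PREWARM`, `GatewayConfig`, and the `provider/model` string format below are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the real OpenClaw config schema is not shown in the report.
interface AgentConfig {
  model?: string; // assumed "provider/model" format
}

interface GatewayConfig {
  agents: AgentConfig[];
}

// Collect the distinct providers that the configured agents actually reference.
function inferConfiguredProviders(config: GatewayConfig): string[] {
  const ids = new Set<string>();
  for (const agent of config.agents) {
    const provider = agent.model?.split("/")[0];
    if (provider) ids.add(provider);
  }
  return Array.from(ids);
}

function selectPrewarmProviders(
  config: GatewayConfig,
  fullCatalogProviders: string[],
  env: Record<string, string | undefined>,
): string[] {
  // Placeholder flag name: skip startup prewarm entirely when set.
  if (env["EXAMPLE_SKIP_PROVIDER_AUTH_PREWARM"] === "1") return [];
  const inferred = inferConfiguredProviders(config);
  // Fall back to the full catalog only when nothing can be inferred.
  return inferred.length > 0 ? inferred : fullCatalogProviders;
}

const exampleConfig: GatewayConfig = {
  agents: [{ model: "alpha/m1" }, { model: "beta/m2" }, { model: "alpha/m3" }],
};
console.log(selectPrewarmProviders(exampleConfig, ["alpha", "beta", "gamma"], {}));
// [ 'alpha', 'beta' ]
```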

After the mitigation, validation showed stable local calls:

health:       ~2.9s CLI wall time
sessions.list ~2.8s CLI wall time, server-side ~150ms
cron.list     ~2.4s CLI wall time, server-side ~60-90ms

No new event-loop starvation or fetch-timeout warnings appeared after readiness in the validation window.

Suggested Fix Direction

  1. Add an official config/env toggle for startup provider-auth prewarm.
  2. Scope provider-auth prewarm to configured providers instead of the full model catalog by default.
  3. If full discovery is needed, run it after an idle delay and chunk the work with event-loop yields.
  4. Avoid caching negative auth results if external discovery was skipped or truncated.
  5. Consider a longer timeout for the agent sessions_list tool or make it resilient to startup warmup delays.
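
Item 4 could look roughly like the cache below, which records a negative result only when discovery ran to completion. `ProviderAuthCache` and `DiscoveryResult` are hypothetical names for this sketch; the report does not describe how OpenClaw stores auth state.

```typescript
// Hypothetical sketch of item 4; not OpenClaw's cache implementation.
type AuthState = "available" | "missing";

interface DiscoveryResult {
  found: boolean;    // credentials were discovered for the provider
  complete: boolean; // false if discovery was skipped or truncated
}

class ProviderAuthCache {
  private readonly states = new Map<string, AuthState>();

  record(providerId: string, result: DiscoveryResult): void {
    if (result.found) {
      this.states.set(providerId, "available");
      return;
    }
    // Only trust "not found" when discovery actually finished.
    if (result.complete) {
      this.states.set(providerId, "missing");
    }
    // Incomplete and not found: leave unset so a later lookup retries.
  }

  get(providerId: string): AuthState | undefined {
    return this.states.get(providerId);
  }
}

const cache = new ProviderAuthCache();
cache.record("provider-a", { found: false, complete: false });
console.log(cache.get("provider-a")); // undefined, so discovery is retried later
```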
