openclaw: Provider auth prewarm can starve gateway event loop and cause sessions.list timeouts after restart

StepCodex · 2026-05-26T06:09:49Z



After a gateway restart on OpenClaw `2026.5.22 (a374c3a)`, local gateway RPCs can time out even though the service is active and eventually becomes healthy. The visible symptom was `sessions.list` / sessions-list tooling timing out against `ws://127.0.0.1:18789` during a heartbeat/check path.

This did not appear to be a dead gateway. The gateway process was alive, but startup/provider warmup starved the Node event loop long enough for normal 10s gateway clients to fail.


Evidence

Sanitized journal evidence from a local loopback deployment:

[fetch-timeout] fetch timeout after 10000ms (elapsed 43203ms) timer delayed 33203ms, likely event-loop starvation operation=fetchWithTimeout url=https://api.telegram.org/.../getMe
[ws] closed before connect ... code=1006
[gateway] provider auth state pre-warmed in 72203ms eventLoopMax=42983.2ms

Around the same window, `sessions.list` calls from clients timed out at the default 10s budget. After the provider-auth prewarm completed, `sessions.list` itself was fast again, with server-side logs reporting durations in the low hundreds of milliseconds.
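The three numbers in the fetch-timeout line are related by simple arithmetic: a 10000ms timer that only fires 43203ms after it was armed has been delayed by 33203ms. The sketch below shows that relationship and how a timer can expose starvation on a blocked Node event loop. `timerDelayMs`, `measureTimerDelay`, and `blockEventLoop` are hypothetical names used for illustration, not OpenClaw functions.

```typescript
// Hypothetical helper: how late a timeout timer fired relative to its budget.
function timerDelayMs(budgetMs: number, elapsedMs: number): number {
  return Math.max(0, elapsedMs - budgetMs);
}

// Arms a timer with the given budget and reports how late it actually fired.
// A large delay means something held the event loop while the timer was due.
function measureTimerDelay(
  budgetMs: number,
): Promise<{ elapsedMs: number; delayMs: number }> {
  const startedAt = Date.now();
  return new Promise((resolve) => {
    setTimeout(() => {
      const elapsedMs = Date.now() - startedAt;
      resolve({ elapsedMs, delayMs: timerDelayMs(budgetMs, elapsedMs) });
    }, budgetMs);
  });
}

// Stand-in for synchronous work that never yields, such as an expensive
// catalog load. It busy-waits on purpose.
function blockEventLoop(durationMs: number): void {
  const until = Date.now() + durationMs;
  while (Date.now() < until) {
    // intentionally empty
  }
}
```

Arming `measureTimerDelay(10)` and then calling `blockEventLoop(100)` before awaiting the result reproduces the same signature at small scale: the timer cannot fire until the synchronous work returns.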

A local isolated check also showed that loading the model catalog for provider auth prewarm can be very expensive on a configured host:

catalogCount: 971
providerCount: 45
agentCount: 5

The configured runtime only needed a much smaller subset of providers, but startup prewarm still considered the full catalog.

Expected

Once the gateway reports ready, basic local RPCs such as `health`, `cron.list`, and `sessions.list` should remain responsive within the default 10s client timeout.

Provider auth prewarm should be best-effort and must not starve the event loop. It should be bounded, idle-scheduled, chunked/yielding, cancellable, and scoped to providers actually referenced by config unless a full scan is explicitly requested.
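One possible shape for a bounded, chunked, cancellable prewarm pass is sketched below. Everything in it is an assumption for illustration: `prewarmProviders`, `PrewarmOptions`, and the `warmOne` callback are hypothetical and do not describe OpenClaw's actual prewarm code. The yield uses `setTimeout(..., 0)`; `setImmediate` would be a reasonable alternative in Node.

```typescript
// All names here are hypothetical; this is not OpenClaw's actual prewarm code.
type PrewarmResult = { warmed: string[]; truncated: boolean };

interface PrewarmOptions {
  chunkSize: number; // providers handled per slice before yielding
  maxDurationMs: number; // overall budget for the whole pass
  signal?: AbortSignal; // lets shutdown or a config reload cancel the pass
}

// Yield to the event loop so pending timers and socket I/O can run.
const yieldToEventLoop = (): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, 0));

async function prewarmProviders(
  providerIds: string[],
  warmOne: (id: string) => Promise<void>,
  opts: PrewarmOptions,
): Promise<PrewarmResult> {
  const startedAt = Date.now();
  const warmed: string[] = [];
  for (let i = 0; i < providerIds.length; i += opts.chunkSize) {
    // Check cancellation and the time budget before starting each chunk.
    const overBudget = Date.now() - startedAt > opts.maxDurationMs;
    if (opts.signal?.aborted || overBudget) {
      return { warmed, truncated: true };
    }
    for (const id of providerIds.slice(i, i + opts.chunkSize)) {
      try {
        await warmOne(id);
        warmed.push(id);
      } catch {
        // Best-effort: a failing provider must not abort the rest of the pass.
      }
    }
    await yieldToEventLoop();
  }
  return { warmed, truncated: false };
}
```

Chunking only bounds each stall to the cost of one chunk. If a single `warmOne` call does tens of seconds of synchronous work, it still blocks the loop for that long, so the expensive step itself would also need to be split or moved off the main thread. The `truncated` flag lets callers tell a complete pass from a partial one.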

Actual

Provider auth prewarm ran immediately after startup and caused event-loop stalls of up to ~43s (the journal reports `eventLoopMax=42983.2ms`). During that period, local WebSocket clients could fail before connect or report a gateway timeout even though the gateway later responded.
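Because the gateway does recover, a client that retries with a per-attempt timeout can ride out a short stall instead of failing on the first attempt. The following is a minimal client-side sketch under that assumption. `callWithRetry` is a hypothetical helper, and `call` stands in for any gateway RPC; it is not part of the OpenClaw client.

```typescript
// Hypothetical client-side helper; `call` stands in for any gateway RPC
// such as sessions.list. Not part of the OpenClaw client.
async function callWithRetry<T>(
  call: () => Promise<T>,
  opts: { attempts: number; timeoutMs: number; backoffMs: number },
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    // Rejects once the per-attempt budget is spent.
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error(`timed out after ${opts.timeoutMs}ms`)),
        opts.timeoutMs,
      );
    });
    try {
      return await Promise.race([call(), timeout]);
    } catch (err) {
      lastError = err;
    } finally {
      clearTimeout(timer);
    }
    if (attempt < opts.attempts) {
      // Linear backoff between attempts.
      await new Promise((resolve) => setTimeout(resolve, opts.backoffMs * attempt));
    }
  }
  throw lastError;
}
```

This does not cancel the underlying call when an attempt times out, and it only masks the symptom on the client side. It is not a substitute for keeping the gateway's event loop responsive.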

Local Mitigation Used

A local hotfix mitigated the incident by:

- skipping provider auth prewarm at startup via an environment flag;
- increasing the sessions-list tool gateway timeout to 30s;
- changing provider-auth warmup to infer configured providers first and only fall back to the full model catalog if no providers can be inferred.
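The first and third mitigations can be sketched roughly as follows. The config shape, the `provider/model` reference format, and the flag name `EXAMPLE_SKIP_PROVIDER_AUTH_PREWARM` are all assumptions made for illustration. The issue does not state the real flag name or config schema.

```typescript
// Hypothetical shapes; the real OpenClaw config schema may differ.
interface AgentConfig {
  model?: string;
  fallbackModels?: string[];
}
interface GatewayConfig {
  agents: Record<string, AgentConfig>;
}

// Assumes model references look like "provider/model-name".
function providerOf(modelRef: string): string | undefined {
  const slash = modelRef.indexOf("/");
  return slash > 0 ? modelRef.slice(0, slash) : undefined;
}

// Collects the distinct providers referenced by any configured agent.
function inferConfiguredProviders(config: GatewayConfig): string[] {
  const providers = new Set<string>();
  for (const agent of Object.values(config.agents)) {
    for (const ref of [agent.model, ...(agent.fallbackModels ?? [])]) {
      const provider = ref ? providerOf(ref) : undefined;
      if (provider) providers.add(provider);
    }
  }
  return Array.from(providers).sort();
}

function selectPrewarmProviders(
  config: GatewayConfig,
  env: Record<string, string | undefined>,
  loadFullCatalogProviders: () => string[],
): string[] {
  // EXAMPLE_SKIP_PROVIDER_AUTH_PREWARM is a made-up flag name for illustration.
  if (env["EXAMPLE_SKIP_PROVIDER_AUTH_PREWARM"] === "1") return [];
  const configured = inferConfiguredProviders(config);
  // Only pay for the full catalog when nothing can be inferred from config.
  return configured.length > 0 ? configured : loadFullCatalogProviders();
}
```

Passing the catalog loader as a callback keeps the expensive path lazy: with at least one provider inferred from config, the full catalog is never loaded.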

After the mitigation, validation showed stable local calls:

health:       ~2.9s CLI wall time
sessions.list ~2.8s CLI wall time, server-side ~150ms
cron.list     ~2.4s CLI wall time, server-side ~60-90ms

No new event-loop starvation or fetch-timeout warnings appeared after readiness in the validation window.

Suggested Fix Direction

1. Add an official config/env toggle for startup provider-auth prewarm.
2. Scope provider-auth prewarm to configured providers instead of the full model catalog by default.
3. If full discovery is needed, run it after an idle delay and chunk the work with event-loop yields.
4. Avoid caching negative auth results if external discovery was skipped or truncated.
5. Consider a longer timeout for the agent `sessions_list` tool, or make it resilient to startup warmup delays.
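Item 4 could look roughly like the sketch below: a cache that accepts positive results from any scan but refuses to pin a negative result unless discovery ran to completion. `ProviderAuthCache`, `AuthState`, and `DiscoveryOutcome` are hypothetical names. The issue does not describe how OpenClaw's auth cache is actually structured.

```typescript
// Hypothetical types; this is not OpenClaw's actual auth cache.
type AuthState = "authenticated" | "missing";

interface DiscoveryOutcome {
  state: AuthState;
  complete: boolean; // false when external discovery was skipped or truncated
}

class ProviderAuthCache {
  private readonly entries = new Map<string, AuthState>();

  // Records an outcome, but never pins "missing" from an incomplete scan.
  // Returns true when the outcome was stored.
  record(providerId: string, outcome: DiscoveryOutcome): boolean {
    if (outcome.state === "missing" && !outcome.complete) {
      return false;
    }
    this.entries.set(providerId, outcome.state);
    return true;
  }

  // undefined means "unknown": the caller should run discovery again.
  get(providerId: string): AuthState | undefined {
    return this.entries.get(providerId);
  }
}
```

With this rule, a prewarm pass that was cut short by a time budget or a skip flag leaves unresolved providers as unknown rather than marking them unauthenticated, so a later lookup can still discover their credentials.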
