openclaw - [Bug]: 2026.5.22 gateway pre-warm (warmCurrentProviderAuthState) blocks event loop ~60s on startup, breaks channel handshakes

openclaw, 2026-05-24 09:29:56


After upgrading from 2026.5.19 to 2026.5.22, gateway startup blocks the Node event loop for ~60 seconds inside warmCurrentProviderAuthState, causing channel handshakes (Discord READY, Feishu bot info, Telegram deleteWebhook) to time out, and leaving inbound messages stalled for ~1 minute on every restart.

Error Message

09:17:21.482 ERROR discord: gateway READY wait timed out after 15000ms; reconnecting with backoff (attempt 1)
09:17:21.496 WARN fetch timeout after 10000ms (elapsed 45211ms) timer delayed 35211ms, likely event-loop starvation operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
09:18:29.214 WARN liveness warning: ... work=[active=agent:main:discord:direct:763966741445083166(processing/embedded_run, age=28s)]

Root Cause

The blocked work is provider-auth pre-warm, not the model call itself, so the model/provider path is largely incidental. The config has multiple providers enabled across the agent catalog (github-copilot, openai, anthropic, openrouter, etc.), which appears to amplify the cost (see Root cause below).

Fix / Workaround

On 2026.5.19 the same restart on the same host completed channel startup in ~5–10 s, and inbound messages received within the first minute were dispatched in <3 s.

Last known good version: 2026.5.19 (we upgraded directly 5.19 → 5.22, skipping 5.20). First known bad version: 2026.5.22. No workaround has been attempted yet beyond systemctl restart; we are planning to roll back to 5.19.


Bug type

Regression (worked before, now fails)

Beta release blocker

Summary

Steps to reproduce

1. On a 2 vCPU Linux host (Azure B2als_v2), install openclaw@2026.5.22 and start the gateway as a systemd user service.
2. Confirm that at least one configured agent has multiple model providers enabled (in our case: github-copilot, openai, anthropic, openrouter, plus the default catalog providers).
3. Restart the gateway (systemctl --user restart openclaw-gateway) and watch /tmp/openclaw/openclaw-YYYY-MM-DD.log and journalctl _PID=<pid>.
4. From the moment gateway ready is logged, send a Discord DM (or any inbound message) to the bot within the first ~90 seconds.

Expected behavior

On 2026.5.19 the same restart on the same host completed channel startup in ~5–10 s, and inbound messages received within the first minute were dispatched in <3 s.

Actual behavior

Two consecutive restarts on 2026.5.22 (PIDs 712063 and 721897, ~30 minutes apart, same config) both reproduced:

- provider auth state pre-warmed in 58655ms eventLoopMax=36540.8ms (first restart)
- provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms (second restart)
- Liveness warnings during the same window: event_loop_delay,cpu interval=37s eventLoopDelayP99Ms=14445.2 eventLoopDelayMaxMs=21776.8 eventLoopUtilization=1 cpuCoreRatio=1.041
- Channel-side fallout, e.g.:
  - [fetch-timeout] fetch timeout after 10000ms (elapsed 45211ms) timer delayed 35211ms, likely event-loop starvation operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
  - [discord] gateway READY wait timed out after 15000ms; reconnecting with backoff (attempt 1)
  - [feishu] bot info probe timed out after 30000ms; continuing startup
  - [telegram] deleteWebhook failed: Network request failed.
- End-to-end Discord inbound latency: first user DM after restart took ~60 s before session.started showed up in the trajectory; the model call itself (github-copilot/gpt-5.5) took only ~3.9 s. The ~60 s delay is entirely on the inbound/gateway side, dominated by the pre-warm stall + the Discord WS reconnect it triggers.

External network from this host to discord.com/api, gateway.discord.gg, and api.telegram.org is healthy (curl latency 40–680 ms with 200/302), so this is not a transit issue.

OpenClaw version

2026.5.22 (a374c3a)

Operating system

Ubuntu 24.04.4 LTS (Azure VM, Standard_B2als_v2, 2 vCPU, 4 GB RAM, japaneast)

Install method

npm global (npm i -g openclaw@2026.5.22)

Model

github-copilot/gpt-5.5

Provider / routing chain

openclaw -> github-copilot

Additional provider/model setup details

Logs, screenshots, and evidence

Two independent restarts of the same gateway both logged a single-line marker showing the pre-warm wall time and the worst single event-loop block during it:

2026-05-24T08:46:04+00:00 [gateway] provider auth state pre-warmed in 58655ms eventLoopMax=36540.8ms
2026-05-24T09:17:43+00:00 [gateway] provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms
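
For anyone who wants to check their own restarts for the same marker, the two lines above can be parsed with a small script. This is a minimal sketch, not part of OpenClaw: `parsePrewarmLine` and `isStalled` are hypothetical helper names, the regex is derived only from the two log lines quoted above, and the 15000 ms budget is an assumption borrowed from the Discord READY timeout seen in this report.

```javascript
// Hypothetical helper: the pattern mirrors the marker line quoted in this report.
const PREWARM_RE = /provider auth state pre-warmed in (\d+)ms eventLoopMax=([\d.]+)ms/;

// Extract pre-warm wall time and the worst single event-loop block, or null if no match.
function parsePrewarmLine(line) {
  const m = PREWARM_RE.exec(line);
  if (!m) return null;
  return { wallMs: Number(m[1]), eventLoopMaxMs: Number(m[2]) };
}

// Assumption: a single block longer than the budget is enough to break a channel handshake.
function isStalled(metrics, budgetMs = 15000) {
  return metrics !== null && metrics.eventLoopMaxMs > budgetMs;
}

const sample =
  "2026-05-24T09:17:43+00:00 [gateway] provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms";
const metrics = parsePrewarmLine(sample);
console.log(metrics, isStalled(metrics));
```

Feeding it each line of the gateway log makes it easy to compare restarts across versions.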

Liveness warning during the same window:

2026-05-24T08:40:17+00:00 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=37s eventLoopDelayP99Ms=14445.2 eventLoopDelayMaxMs=14445.2 eventLoopUtilization=1 cpuCoreRatio=1.041 active=2 waiting=0 queued=0 recentPhases=...,sidecars.session-locks:50912ms work=[active=agent:main:telegram:direct:...(processing/model_call,q=1,age=14s last=model_call:started)|agent:main:discord:direct:...(processing/embedded_run,q=1,age=6s last=embedded_run:started)]

Discord-side fallout (second restart):

09:16:44.618 INFO  discord client initialized; awaiting gateway readiness
09:17:21.482 ERROR discord: gateway READY wait timed out after 15000ms; reconnecting with backoff (attempt 1)
09:17:21.496 WARN  fetch timeout after 10000ms (elapsed 45211ms) timer delayed 35211ms, likely event-loop starvation operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
09:17:43.612 INFO  provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms
09:18:29.214 WARN  liveness warning: ... work=[active=agent:main:discord:direct:763966741445083166(processing/embedded_run, age=28s)]
09:18:49.853 (trajectory) session.started   <- first inbound DM finally picked up
09:18:53.231 (trajectory) model.completed   <- ~3.4s model call

Root cause (best-effort, from reading the installed npm package on disk)

In /usr/lib/node_modules/openclaw/dist/:

server-startup-post-attach-ezNyN6B3.js calls warmCurrentProviderAuthState(cfg, { isCancelled }) once per gateway post-attach pass and awaits it; the wall time + worst per-tick stall are then logged via formatProviderAuthWarmMetrics.
model-provider-auth-DAG1ddFR.js:91 warmCurrentProviderAuthState is structured as a double for loop:
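```
// Simplified excerpt as read from the installed package; surrounding declarations are omitted.
for (const agentId of listAgentIds(cfg)) {
  // Runs once per agent; external CLI discovery appears to be repeated on each call.
  ensureAuthProfileStore(agentDir, {
    externalCli: externalCliDiscoveryForProviders({ cfg, providers: providerList })
  });
  // Providers are checked serially: each await completes before the next check starts.
  for (const provider of providers) {
    await hasAuthForModelProvider({ provider, cfg, workspaceDir, agentId, store, runtimeAuthLookup });
  }
}
```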
Each ensureAuthProfileStore invokes externalCliDiscoveryForProviders, which on Linux can synchronously fan out to external CLI binaries (codex, gemini, claude, gh, etc.) to probe for cached auth. On a 2 vCPU box that combination is hot enough to monopolize the event loop for 30+ s at a time (eventLoopMax=36876.3ms) and ~60 s end-to-end.

During that window the Discord channel's 15 s gateway-READY timer fires, forcing a reconnect; the first inbound DM after restart then waits for the reconnect + RESUME, so user-visible latency is roughly pre-warm wall time + reconnect.
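
The "timer delayed 35211ms" figure in the Discord log is the gap between when a timer was due and when it actually ran (45211 ms elapsed minus the 10000 ms timeout). The sketch below reproduces that effect at small scale in plain Node.js; `busyWait` and `measureTimerDelay` are illustrative names, and the busy loop is only a stand-in for whatever synchronous work the pre-warm performs.

```javascript
// Stand-in for synchronous work that never yields to the event loop.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

// Arm a timer, then block the loop; report how late the timer callback ran.
function measureTimerDelay(timeoutMs, blockMs) {
  return new Promise((resolve) => {
    const armedAt = Date.now();
    setTimeout(() => {
      const elapsed = Date.now() - armedAt;
      resolve({ timeoutMs, elapsed, delayed: elapsed - timeoutMs });
    }, timeoutMs);
    busyWait(blockMs); // the timer cannot fire until this returns
  });
}

const starvation = measureTimerDelay(20, 200);
starvation.then((r) => {
  console.log(`timeout after ${r.timeoutMs}ms (elapsed ${r.elapsed}ms) timer delayed ${r.delayed}ms`);
});
```

A 20 ms timer armed before a 200 ms synchronous block cannot fire until the block ends, which is the same shape as a 15 s READY timer sitting behind a 36 s stall.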

Proposed fix shape (not a patch)

- Run warmCurrentProviderAuthState after gateway ready in the background instead of inside the post-attach awaited path, or at least yield (setImmediate/await scheduler.yield()) between providers so other handlers run.
- Cache externalCliDiscoveryForProviders results across the agent loop (today it appears to re-discover per ensureAuthProfileStore call).
- Run the per-provider hasAuthForModelProvider checks concurrently (Promise.allSettled style) rather than awaiting them serially, so a slow probe (for example, a codex login status style call) does not stall the rest.
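
The three ideas above could be combined roughly as follows. This is a sketch under stated assumptions, not OpenClaw code: `warmAuthState`, `startWarmInBackground`, `discover`, and `probeProvider` are hypothetical stand-ins for the real functions, and it has not been checked against the source repository. Yielding and `Promise.allSettled` only help if the probes are genuinely asynchronous; synchronous subprocess calls inside a probe would still block the loop.

```javascript
// Hand control back to the event loop so pending timers and I/O callbacks can run.
const yieldToEventLoop = () => new Promise((resolve) => setImmediate(resolve));

// Hypothetical replacement shape for the double loop; all callbacks are injected stand-ins.
async function warmAuthState({ agentIds, providers, discover, probeProvider, isCancelled = () => false }) {
  // Discover external CLI state once and reuse it for every agent.
  const discovery = await discover(providers);
  const results = new Map();
  for (const agentId of agentIds) {
    if (isCancelled()) break;
    // Probe providers concurrently; a slow or failing probe does not hold up the others.
    const settled = await Promise.allSettled(
      providers.map((provider) => probeProvider({ agentId, provider, discovery }))
    );
    results.set(
      agentId,
      settled.map((s, i) => ({ provider: providers[i], ok: s.status === "fulfilled" && s.value === true }))
    );
    await yieldToEventLoop();
  }
  return results;
}

// Start the warm-up without awaiting it on the startup path; failures are only logged.
function startWarmInBackground(args, log = console) {
  return warmAuthState(args).catch((err) => log.warn("pre-warm failed", err));
}

// Demo with fake probes: one provider succeeds, one throws.
let discoverCalls = 0;
const demo = startWarmInBackground({
  agentIds: ["main", "helper"],
  providers: ["openai", "anthropic"],
  discover: async () => {
    discoverCalls += 1;
    return {};
  },
  probeProvider: async ({ provider }) => {
    if (provider === "anthropic") throw new Error("probe failed");
    return true;
  },
});
demo.then((results) => console.log(discoverCalls, results.get("main")));
```

In this shape the startup path would call `startWarmInBackground` after logging gateway ready and carry on, so channel handshakes are not queued behind the warm-up.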

Impact and severity

- Affected: every restart of the gateway and every inbound message in the first ~60–90 s window after restart, across all channels (Discord/Telegram/Feishu/Slack were all observed timing out their startup probes simultaneously).
- Severity: Frustrating but recoverable (the gateway eventually catches up).
- Frequency: 100% reproducible on restart on this host.
- Consequence: Any user message sent during the stall window is lost or lands minutes late; the Discord WS is forced into a reconnect on every startup.

Additional information

This is not a duplicate of #85975 / PR #85978 (Codex app-server thread_bootstrap native-thread rotation): that path requires the openai-codex provider and triggers per-turn, while this stall happens deterministically on every startup with github-copilot/gpt-5.5 and is gone after the pre-warm finishes. The shared symptom is event-loop starvation, but the source files and trigger are different (warmCurrentProviderAuthState here, rotateOversizedCodexAppServerStartupBinding there).

Report drafted by an AI agent (Hermes / claude-opus-4.7), reviewed by the human reporter before filing. Evidence above was collected by the agent from the affected host's logs and the installed npm package; the proposed fix shape is the agent's best read of the on-disk code and has not been validated against the source repository.
