openclaw: [Bug]: 2026.5.22 gateway pre-warm (warmCurrentProviderAuthState) blocks event loop ~60s on startup, breaks channel handshakes



Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

After upgrading from 2026.5.19 to 2026.5.22, gateway startup blocks the Node event loop for ~60 seconds inside warmCurrentProviderAuthState, causing channel handshakes (Discord READY, Feishu bot info, Telegram deleteWebhook) to time out, and leaving inbound messages stalled for ~1 minute on every restart.

Steps to reproduce

  1. On a 2 vCPU Linux host (Azure B2als_v2), install openclaw@2026.5.22 and start the gateway as a systemd user service.
  2. Make sure at least one configured agent has multiple model providers (in our case: github-copilot, openai, anthropic, openrouter, plus the default catalog providers).
  3. Restart the gateway (systemctl --user restart openclaw-gateway) and watch /tmp/openclaw/openclaw-YYYY-MM-DD.log and journalctl _PID=<pid>.
  4. From the moment gateway ready is logged, send a Discord DM (or any inbound message) to the bot within the first ~90 seconds.

Expected behavior

On 2026.5.19 the same restart on the same host completed channel startup in ~5–10 s, and inbound messages received within the first minute were dispatched in <3 s.

Actual behavior

Two consecutive restarts on 2026.5.22 (PIDs 712063 and 721897, ~30 minutes apart, same config) both reproduced:

  • provider auth state pre-warmed in 58655ms eventLoopMax=36540.8ms (first restart)
  • provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms (second restart)
  • Liveness warnings during the same window: event_loop_delay,cpu interval=37s eventLoopDelayP99Ms=14445.2 eventLoopDelayMaxMs=21776.8 eventLoopUtilization=1 cpuCoreRatio=1.041
  • Channel-side fallout, e.g.:
    [fetch-timeout] fetch timeout after 10000ms (elapsed 45211ms) timer delayed 35211ms, likely event-loop starvation operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
    [discord] gateway READY wait timed out after 15000ms; reconnecting with backoff (attempt 1)
    [feishu] bot info probe timed out after 30000ms; continuing startup
    [telegram] deleteWebhook failed: Network request failed.
  • End-to-end Discord inbound latency: first user DM after restart took ~60 s before session.started showed up in the trajectory; the model call itself (github-copilot/gpt-5.5) took only ~3.9 s. The ~60 s delay is entirely on the inbound/gateway side, dominated by the pre-warm stall + the Discord WS reconnect it triggers.

External network from this host to discord.com/api, gateway.discord.gg, and api.telegram.org is healthy (curl latency 40–680 ms with 200/302), so this is not a transit issue.

OpenClaw version

2026.5.22 (a374c3a)

Operating system

Ubuntu 24.04.4 LTS (Azure VM, Standard_B2als_v2, 2 vCPU, 4 GB RAM, japaneast)

Install method

npm global (npm i -g openclaw@2026.5.22)

Model

github-copilot/gpt-5.5

Provider / routing chain

openclaw -> github-copilot

Additional provider/model setup details

The blocked work is provider-auth pre-warm, not the model call itself, so the model/provider path is largely incidental. The config has multiple providers enabled across the agent catalog (github-copilot, openai, anthropic, openrouter, etc.), which appears to amplify the cost (see Root cause below).

Logs, screenshots, and evidence

Two independent restarts of the same gateway both logged a single-line marker showing the pre-warm wall time and the worst single event-loop block during it:

2026-05-24T08:46:04+00:00 [gateway] provider auth state pre-warmed in 58655ms eventLoopMax=36540.8ms
2026-05-24T09:17:43+00:00 [gateway] provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms
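
To compare restarts, the marker can be turned into numbers. The following is a minimal sketch, not OpenClaw code: `parsePrewarmMarker` and the `worstBlockShare` field are hypothetical names, and the regular expression assumes the marker format seen in the two lines above.

```javascript
// Hypothetical helper (not part of OpenClaw): parse the pre-warm marker line.
// Assumes the format "... pre-warmed in <N>ms eventLoopMax=<X>ms" seen above.
function parsePrewarmMarker(line) {
  const match = /pre-warmed in (\d+)ms eventLoopMax=(\d+(?:\.\d+)?)ms/.exec(line);
  if (match === null) return null; // not a pre-warm marker
  const wallMs = Number(match[1]);
  const eventLoopMaxMs = Number(match[2]);
  return {
    wallMs,
    eventLoopMaxMs,
    // Fraction of the pre-warm wall time spent in the single worst block.
    worstBlockShare: eventLoopMaxMs / wallMs,
  };
}

const markers = [
  '2026-05-24T08:46:04+00:00 [gateway] provider auth state pre-warmed in 58655ms eventLoopMax=36540.8ms',
  '2026-05-24T09:17:43+00:00 [gateway] provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms',
];
for (const line of markers) {
  const parsed = parsePrewarmMarker(line);
  console.log(parsed.wallMs, parsed.eventLoopMaxMs, parsed.worstBlockShare.toFixed(2));
}
```

On these two restarts the single worst block accounts for roughly 62% and 54% of the pre-warm wall time.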

Liveness warning during the same window:

2026-05-24T08:40:17+00:00 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=37s eventLoopDelayP99Ms=14445.2 eventLoopDelayMaxMs=14445.2 eventLoopUtilization=1 cpuCoreRatio=1.041 active=2 waiting=0 queued=0 recentPhases=...,sidecars.session-locks:50912ms work=[active=agent:main:telegram:direct:...(processing/model_call,q=1,age=14s last=model_call:started)|agent:main:discord:direct:...(processing/embedded_run,q=1,age=6s last=embedded_run:started)]
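
How OpenClaw derives eventLoopDelayMaxMs is not stated in this report. As a rough illustration of how such a figure can be obtained in Node, the sketch below uses the built-in `monitorEventLoopDelay` histogram from `node:perf_hooks`. `blockFor` and `measureMaxDelayMs` are hypothetical names, and the busy-wait is only a stand-in for synchronous work that never yields.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');
const { setTimeout: sleep } = require('node:timers/promises');

// Busy-wait: a stand-in for synchronous work that never yields to the loop.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* spin */ }
}

// Runs `work` while sampling event-loop delay; returns the worst delay in ms.
async function measureMaxDelayMs(work) {
  const histogram = monitorEventLoopDelay({ resolution: 10 }); // sample every 10 ms
  histogram.enable();
  await sleep(20); // let the sampler take a few baseline samples
  work();          // synchronous work: sampling is stalled while this runs
  await sleep(50); // return to the loop so the late sample gets recorded
  histogram.disable();
  return histogram.max / 1e6; // the histogram reports nanoseconds
}

measureMaxDelayMs(() => blockFor(200)).then((maxMs) => {
  console.log(`worst event-loop delay: ${maxMs.toFixed(1)} ms`);
});
```

With a 200 ms busy-wait the reported maximum should be on the order of the block length; the exact value depends on the host.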

Discord-side fallout (second restart):

09:16:44.618 INFO  discord client initialized; awaiting gateway readiness
09:17:21.482 ERROR discord: gateway READY wait timed out after 15000ms; reconnecting with backoff (attempt 1)
09:17:21.496 WARN  fetch timeout after 10000ms (elapsed 45211ms) timer delayed 35211ms, likely event-loop starvation operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
09:17:43.612 INFO  provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms
09:18:29.214 WARN  liveness warning: ... work=[active=agent:main:discord:direct:763966741445083166(processing/embedded_run, age=28s)]
09:18:49.853 (trajectory) session.started   <- first inbound DM finally picked up
09:18:53.231 (trajectory) model.completed   <- ~3.4s model call
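
The figures in this excerpt can be cross-checked with simple arithmetic. A minimal sketch follows; `clockToMs` and `msBetween` are hypothetical helper names, and the code assumes same-day `HH:MM:SS.mmm` timestamps as in the lines above.

```javascript
// Hypothetical helpers for same-day "HH:MM:SS.mmm" timestamps.
function clockToMs(clock) {
  const [hh, mm, rest] = clock.split(':');
  const [ss, ms] = rest.split('.');
  return ((Number(hh) * 60 + Number(mm)) * 60 + Number(ss)) * 1000 + Number(ms);
}

function msBetween(start, end) {
  return clockToMs(end) - clockToMs(start);
}

// Discord client initialized -> READY timeout reported.
console.log(msBetween('09:16:44.618', '09:17:21.482')); // 36864
// session.started -> model.completed.
console.log(msBetween('09:18:49.853', '09:18:53.231')); // 3378
// A 10000 ms fetch timeout that was only observed after 45211 ms.
console.log(45211 - 10000); // 35211
```

The 15000 ms READY timeout was reported 36864 ms after client initialization (the log does not confirm exactly when that timer was armed), and the 35211 ms timer delay is close to the logged eventLoopMax of 36876.3 ms.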

Root cause (best-effort, from reading the installed npm package on disk)

In /usr/lib/node_modules/openclaw/dist/:

  • server-startup-post-attach-ezNyN6B3.js calls warmCurrentProviderAuthState(cfg, { isCancelled }) once per gateway post-attach pass and awaits it; the wall time + worst per-tick stall are then logged via formatProviderAuthWarmMetrics.
  • model-provider-auth-DAG1ddFR.js:91 warmCurrentProviderAuthState is structured as a double for loop:
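    for (const agentId of listAgentIds(cfg)) {
      // Outer loop: runs once per configured agent. Per this report, external CLI
      // discovery appears to be re-run on every pass and may probe CLIs synchronously.
      ensureAuthProfileStore(agentDir, {
        externalCli: externalCliDiscoveryForProviders({ cfg, providers: providerList })
      });
      for (const provider of providers) {
        // Inner loop: providers are checked one at a time (serial await).
        await hasAuthForModelProvider({ provider, cfg, workspaceDir, agentId, store, runtimeAuthLookup });
      }
    }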
    Each ensureAuthProfileStore invokes externalCliDiscoveryForProviders, which on Linux can synchronously fan out to external CLI binaries (codex, gemini, claude, gh, etc.) to probe for cached auth. On a 2 vCPU box that combination is hot enough to monopolize the event loop for 30+ s at a time (eventLoopMax=36876.3ms) and ~60 s end-to-end.

During that window the Discord channel's 15 s gateway-READY timer fires, forcing a reconnect; the first inbound DM after restart then waits for the reconnect + RESUME, so user-visible latency is roughly pre-warm wall time + reconnect.
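
The difference between a synchronous and an asynchronous external probe can be illustrated in isolation. This sketch does not reproduce OpenClaw's code, and whether externalCliDiscoveryForProviders really uses synchronous child-process calls has not been confirmed against the source. The slow CLI is simulated by a child Node process that sleeps for about 300 ms; `probeSync`, `probeAsync`, and `timerLatenessDuring` are hypothetical names.

```javascript
const { spawn, spawnSync } = require('node:child_process');

// Stand-in for a slow external CLI: a child Node process that sleeps ~300 ms.
const SLOW_CLI_ARGS = ['-e', 'setTimeout(() => {}, 300)'];

// Synchronous probe: the calling thread waits until the child exits.
function probeSync() {
  spawnSync(process.execPath, SLOW_CLI_ARGS, { stdio: 'ignore' });
}

// Asynchronous probe: resolves when the child exits; the loop stays free.
function probeAsync() {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, SLOW_CLI_ARGS, { stdio: 'ignore' });
    child.once('error', reject);
    child.once('exit', () => resolve());
  });
}

// Returns how many ms late a 10 ms timer fires while `probe` is in progress.
async function timerLatenessDuring(probe) {
  const scheduledAt = performance.now();
  let firedAfterMs = Infinity;
  setTimeout(() => { firedAfterMs = performance.now() - scheduledAt; }, 10);
  await probe(); // a synchronous probe blocks right here, before any timer can run
  await new Promise((resolve) => setTimeout(resolve, 20)); // give the timer a turn
  return Math.max(0, firedAfterMs - 10);
}
```

Calling `timerLatenessDuring(probeSync)` should report a lateness on the order of the child's runtime, while `timerLatenessDuring(probeAsync)` should report only a small delay. Channel timers such as the 15 s READY wait would be affected in the same way, only at a larger scale.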

Proposed fix shape (not a patch)

  • Run warmCurrentProviderAuthState after gateway ready in the background instead of inside the post-attach awaited path, or at least yield (setImmediate/await scheduler.yield()) between providers so other handlers run.
  • Cache externalCliDiscoveryForProviders results across the agent loop (today it appears to re-discover per ensureAuthProfileStore call).
  • Run the per-provider hasAuthForModelProvider checks concurrently (Promise.allSettled style) instead of awaiting them serially, so that one slow probe (for example a codex login status style call) does not stall the rest.
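
The three ideas above could look roughly like the following. This is a sketch under stated assumptions, not a patch: `discover` and `checkAuth` are hypothetical stand-ins for externalCliDiscoveryForProviders and hasAuthForModelProvider, whose real signatures have not been checked against the source repository, and `memoizeDiscovery`, `warmSerialWithYield`, and `warmConcurrently` are illustrative names.

```javascript
const { setImmediate: yieldToEventLoop } = require('node:timers/promises');

// Hypothetical: cache one discovery result per distinct provider list.
function memoizeDiscovery(discover) {
  const cache = new Map();
  return (providers) => {
    const key = [...providers].sort().join(',');
    if (!cache.has(key)) cache.set(key, discover(providers));
    return cache.get(key);
  };
}

// Variant 1: keep the serial order, but yield after every provider check.
async function warmSerialWithYield({ agentIds, providers, discover, checkAuth }) {
  const discoverOnce = memoizeDiscovery(discover);
  const results = [];
  for (const agentId of agentIds) {
    const externalCli = discoverOnce(providers); // cached after the first agent
    for (const provider of providers) {
      results.push(await checkAuth({ agentId, provider, externalCli }));
      await yieldToEventLoop(); // let pending timers and I/O callbacks run
    }
  }
  return results;
}

// Variant 2: start every check up front and collect all outcomes.
// A failing or slow check does not prevent the others from settling.
async function warmConcurrently({ agentIds, providers, discover, checkAuth }) {
  const externalCli = discover(providers); // discovered once for all agents
  const jobs = agentIds.flatMap((agentId) =>
    providers.map(async (provider) => checkAuth({ agentId, provider, externalCli })));
  return Promise.allSettled(jobs);
}
```

To keep the warm-up off the awaited startup path, either variant could be started without `await` after the gateway reports ready, with a `.catch` handler that logs failures. Note that concurrency only helps when the checks are genuinely asynchronous: any synchronous work inside `discover` or `checkAuth` still blocks the event loop for its full duration.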

Impact and severity

  • Affected: every restart of the gateway, and every inbound message in the first ~60–90 s after restart, across all channels (Discord, Telegram, Feishu, and Slack were all observed timing out their startup probes simultaneously).
  • Severity: frustrating but recoverable (the gateway eventually catches up).
  • Frequency: 100% reproducible on restart on this host.
  • Consequence: any user message sent during the stall window is lost or lands minutes late; the Discord WS is forced into a reconnect on every startup.

Additional information

Last known good version: 2026.5.19 (we upgraded directly 5.19 → 5.22, skipping 5.20). First known bad version: 2026.5.22. No workaround attempted yet beyond systemctl restart; planning to roll back to 5.19.

This is not a duplicate of #85975 / PR #85978 (Codex app-server thread_bootstrap native-thread rotation): that path requires the openai-codex provider and triggers per-turn, while this stall happens deterministically on every startup with github-copilot/gpt-5.5 and is gone after the pre-warm finishes. The shared symptom is event-loop starvation, but the source files and trigger are different (warmCurrentProviderAuthState here, rotateOversizedCodexAppServerStartupBinding there).


Report drafted by an AI agent (Hermes / claude-opus-4.7), reviewed by the human reporter before filing. Evidence above was collected by the agent from the affected host's logs and the installed npm package; the proposed fix shape is the agent's best read of the on-disk code and has not been validated against the source repository.
