openclaw - ✅ (Solved) Fix Gateway event-loop saturation and very slow sessions.list/models.list on all tested versions after 2026.4.23; rollback restores stability [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#75297 · Fetched 2026-05-01 05:35:37
Comments: 1 · Participants: 2 · Timeline: 7 · Reactions: 4
Timeline (top): cross-referenced ×4, commented ×1, renamed ×1, subscribed ×1

Fix Action

Fixed

PR fix notes

PR #75326: perf: keep models list responsive during catalog discovery

Description (problem / solution / changelog)

Summary

  • Problem: models.list default/configured views wait for full gateway model catalog discovery on cold or slow provider/plugin catalog paths.
  • Why it matters: slow model discovery makes common CLI/UI control-plane calls look stalled during gateway CPU/event-loop starvation incidents.
  • What changed: default/configured models.list now falls back to configured/synthetic rows after a short wait, while discovery continues through the existing catalog cache path.
  • What did NOT change (scope boundary): models.list --all remains exact and waits for the full discovered catalog; no provider auth, catalog schema, or model execution behavior changed.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Related #72338
  • Related #74404
  • Related #75297
  • Related #75287
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: models.list always awaited context.loadGatewayModelCatalog(), even for views that can answer from configured model rows when provider discovery is slow.
  • Missing detection / guardrail: sessions.list had slow-catalog timeout coverage, but models.list did not.
  • Contributing context (if known): provider plugin augmentation/runtime-deps can make model catalog discovery slow on packaged gateway installs.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/server-methods/models.test.ts
  • Scenario the test should lock in: configured view returns configured model rows after a slow-catalog timeout; view=all still waits for exact catalog data.
  • Why this is the smallest reliable guardrail: it exercises the request handler directly with fake timers and a deferred catalog promise.
  • Existing test that already covers this (if any): existing server model tests cover catalog contents, not slow discovery behavior.
  • If no new test is added, why not: N/A.

User-visible / Behavior Changes

models list default/configured views can return configured model rows quickly while provider discovery is still slow. models list --all remains the exact wait-for-discovery command.

Diagram (if applicable)

Before:
models.list default/configured -> await full provider catalog -> delayed response

After:
models.list default/configured -> short wait -> configured/synthetic rows -> responsive control plane
models.list --all -> await full provider catalog -> exact response
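The before/after flow above can be sketched as a race between catalog discovery and a short timer. This is a minimal illustration, not the code from the PR: `ModelRow`, `withFallback`, `listModels`, and the 50 ms default are assumptions made for the example (the PR only says "a short wait").

```typescript
// Hypothetical row shape; the real catalog schema is not shown in the PR text.
type ModelRow = { id: string; source: "configured" | "discovered" };
type View = "default" | "configured" | "all";

// Resolves to `fallback` if `work` has not settled within `ms` milliseconds.
// `work` is never cancelled, so discovery keeps running in the background.
function withFallback<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

async function listModels(
  view: View,
  configured: ModelRow[],
  loadCatalog: () => Promise<ModelRow[]>,
  shortWaitMs = 50, // assumed value, chosen only for illustration
): Promise<ModelRow[]> {
  const discovery = loadCatalog();
  // view=all stays exact: it always waits for the full discovered catalog.
  if (view === "all") return discovery;
  // default/configured: answer from configured rows if discovery is slow.
  return withFallback(discovery, shortWaitMs, configured);
}
```

The trade-off matches the risk noted in the PR: a fast answer may temporarily omit discovered rows, while `all` remains the exact path.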

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS local plus Blacksmith Linux Testbox
  • Runtime/container: Node/pnpm via repo wrappers
  • Model/provider: configured model fixture
  • Integration/channel (if any): Gateway RPC models.list
  • Relevant config (redacted): configured models.providers.openai.models fixture

Steps

  1. Call models.list configured view with loadGatewayModelCatalog() deferred.
  2. Advance fake timers past the short timeout.
  3. Assert the handler returns configured model rows.
  4. Repeat with view=all and assert it waits for the deferred catalog.
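The four steps above can be approximated without a test framework by using a deferred promise and a hand-rolled clock. This is a sketch under stated assumptions: `ManualClock`, `deferred`, `handler`, and `runScenario` are hypothetical stand-ins, and the real test in `src/gateway/server-methods/models.test.ts` is described as using fake timers rather than this helper.

```typescript
// Minimal manual clock standing in for a test framework's fake timers.
class ManualClock {
  private now = 0;
  private pending: { at: number; fire: () => void }[] = [];

  sleep(ms: number): Promise<void> {
    return new Promise<void>((resolve) => {
      this.pending.push({ at: this.now + ms, fire: resolve });
    });
  }

  advance(ms: number): void {
    this.now += ms;
    const due = this.pending.filter((t) => t.at <= this.now);
    this.pending = this.pending.filter((t) => t.at > this.now);
    due.forEach((t) => t.fire());
  }
}

// A promise whose resolution is controlled by the caller.
function deferred<T>(): { promise: Promise<T>; resolve: (value: T) => void } {
  let resolve!: (value: T) => void;
  const promise = new Promise<T>((r) => {
    resolve = r;
  });
  return { promise, resolve };
}

// Hypothetical stand-in for the models.list handler under test.
async function handler(
  view: "configured" | "all",
  configured: string[],
  catalog: Promise<string[]>,
  clock: ManualClock,
  waitMs: number,
): Promise<string[]> {
  if (view === "all") return catalog;
  return Promise.race([catalog, clock.sleep(waitMs).then(() => configured)]);
}

async function runScenario() {
  const clock = new ManualClock();
  const catalog = deferred<string[]>();
  const configured = ["configured-model"];

  // Steps 1-3: the catalog is still pending, the timeout elapses, configured rows return.
  const configuredCall = handler("configured", configured, catalog.promise, clock, 100);
  clock.advance(101);
  const configuredView = await configuredCall;

  // Step 4: view=all must not settle until the catalog itself resolves.
  let settled = false;
  const allCall = handler("all", configured, catalog.promise, clock, 100).then((rows) => {
    settled = true;
    return rows;
  });
  clock.advance(1000);
  await Promise.resolve(); // give pending microtasks a chance to run
  const allSettledEarly = settled;
  catalog.resolve(["configured-model", "discovered-model"]);
  const allView = await allCall;

  return { configuredView, allSettledEarly, allView };
}
```

Injecting the clock keeps the scenario deterministic: nothing depends on wall-clock time, so the check cannot flake on a loaded CI host.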

Expected

  • Default/configured views stay responsive; all remains exact.

Actual

  • Before this patch, all views waited on the slow catalog promise. After this patch, only all does.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: focused models handler test passed locally; combined changed gate passed on Blacksmith Testbox before PR split.
  • Edge cases checked: view=all exact behavior is covered by a separate test.
  • What you did not verify: live Linux npm-global Telegram repro/profile.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: default/configured views may temporarily omit auth-backed discovered provider rows during slow discovery.
    • Mitigation: configured/synthetic rows still return, existing cache refresh continues, and --all remains available for exact discovery.
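The mitigation relies on a cache refresh that keeps running after an early answer. One common way to get that behavior is a single-flight cache, sketched below. `CatalogCache` is a hypothetical class written for this note and is not OpenClaw's catalog cache, whose implementation is not shown in the PR text.

```typescript
// Single-flight cache: at most one discovery runs at a time, and its result
// is kept so later callers can read it without waiting.
class CatalogCache<T> {
  private value: T | undefined = undefined;
  private inflight: Promise<T> | undefined = undefined;
  private discover: () => Promise<T>;

  constructor(discover: () => Promise<T>) {
    this.discover = discover;
  }

  // Last successfully discovered value, or undefined if none yet.
  peek(): T | undefined {
    return this.value;
  }

  // Starts discovery unless one is already running; concurrent callers share it.
  refresh(): Promise<T> {
    if (!this.inflight) {
      this.inflight = this.discover().then(
        (v) => {
          this.value = v;
          this.inflight = undefined;
          return v;
        },
        (err) => {
          this.inflight = undefined;
          throw err;
        },
      );
    }
    return this.inflight;
  }
}
```

With this shape, a timed-out request can return configured rows immediately while the shared `refresh()` promise completes in the background, and the next request sees discovered rows through `peek()`.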

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • docs/cli/models.md (modified, +5/-0)
  • src/gateway/server-methods/models.test.ts (added, +156/-0)
  • src/gateway/server-methods/models.ts (modified, +40/-1)

Summary

We hit a production-impacting regression after upgrading OpenClaw beyond 2026.4.23. In our environment, every tested version after 2026.4.23 showed instability or severe degradation in real use, including 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, beta attempts, and 2026.4.29. The system was usable again only after rolling back to 2026.4.23.

The visible symptom is that the Gateway / Control UI appears to hang while loading session-related surfaces. Locally, the strongest evidence came from sessions.list, models.list, node.list, event-loop diagnostics, and CPU saturation.

This looks related to existing reports about event-loop saturation, slow/unbounded sessions.list, Control UI polling, stuck sessions, and runtime-deps issues around 2026.4.26 / 2026.4.27 / current builds.

Environment

  • Host OS: Linux 6.8.0-110-generic x64
  • Node: v24.14.1
  • Gateway: systemd user service
  • Gateway bind: loopback 127.0.0.1:18789
  • Current stable rollback version: OpenClaw 2026.4.23 (a979721)
  • Affected versions observed across the incident sequence: every tested version after 2026.4.23, including 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, beta attempts, and 2026.4.29
  • Channels/plugins in use include Telegram, Control UI/webchat, ACP/Codex-related tooling, browser/device-pair/talk-voice, etc.

What happened

We had 2026.4.27 apparently working after addressing runtime dependency issues around memory-core, chokidar, and sqlite-vec. After a config/model correction and a gateway/server restart, the instance became heavily degraded. We then tried newer builds including 2026.4.29, but the symptoms remained. Rolling back to 2026.4.23 restored practical stability.

I am not claiming the config/model correction is the root cause; it may simply have been the restart that exposed the regression. The observed failure pattern points more strongly to gateway event-loop/session-list/provider-loading behavior.

Local evidence from the last affected attempt (2026.4.29)

From journalctl --user -u openclaw-gateway.service around 2026-04-30 19:07-19:10 ART:

19:07:55 [ws] res sessions.list 38151ms
19:07:55 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=64s eventLoopDelayP99Ms=24679.3 eventLoopDelayMaxMs=24679.3 eventLoopUtilization=1 cpuCoreRatio=1.085 active=0 waiting=0 queued=0
19:08:16 [ws] res models.list 59604ms
19:08:16 [ws] handshake timeout
19:08:33 gateway SIGUSR1 restart
19:09:21 [ws] res sessions.list 13437ms
19:09:21 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=39s eventLoopDelayP99Ms=14327.7 eventLoopDelayMaxMs=14327.7 eventLoopUtilization=0.993 cpuCoreRatio=1.035 active=0 waiting=0 queued=0
19:10:02 Stopping openclaw-gateway.service - OpenClaw Gateway (v2026.4.29)
19:10:05 Stopped openclaw-gateway.service - OpenClaw Gateway (v2026.4.29), CPU 2min44s, memory peak 1.6G

After rollback to 2026.4.23:

19:15:34 Started openclaw-gateway.service - OpenClaw Gateway (v2026.4.23)
19:16:46 gateway ready (6 plugins; 66.8s)
19:19:12 sessions.list 518ms
19:19:32 sessions.list 454ms, chat.history 521ms, models.list 316ms
19:23:52 sessions.list 694ms, chat.history 1103ms, health 1131ms
19:30-19:33 sessions.list roughly 330-690ms, chat.history roughly 50-500ms

Current verification after rollback:

OpenClaw 2026.4.23 (a979721)
Gateway probe: ok
Gateway status: systemd active, connectivity probe ok
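For anyone comparing their own journal output against these excerpts, the `[ws] res <method> <n>ms` lines and the `key=value` fields of the liveness warning are easy to extract mechanically. The sketch below assumes only the line shapes visible in the excerpt above; the function names and the 10-second threshold are illustrative choices, not part of OpenClaw.

```typescript
type RpcTiming = { time: string; method: string; ms: number };

// Matches lines such as "19:07:55 [ws] res sessions.list 38151ms".
const RPC_LINE = /^(\d{2}:\d{2}:\d{2}) \[ws\] res ([\w.]+) (\d+)ms$/;

function parseRpcTimings(log: string): RpcTiming[] {
  const timings: RpcTiming[] = [];
  for (const raw of log.split("\n")) {
    const m = RPC_LINE.exec(raw.trim());
    if (m) timings.push({ time: m[1], method: m[2], ms: Number(m[3]) });
  }
  return timings;
}

// Extracts key=value pairs from a "[diagnostic] liveness warning: ..." line.
function parseLivenessFields(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const pair = /(\w+)=(\S+)/g;
  let m: RegExpExecArray | null;
  while ((m = pair.exec(line)) !== null) fields[m[1]] = m[2];
  return fields;
}

// Lines copied from the 2026.4.29 excerpt in this report.
const excerpt = [
  "19:07:55 [ws] res sessions.list 38151ms",
  "19:08:16 [ws] res models.list 59604ms",
  "19:08:16 [ws] handshake timeout",
  "19:09:21 [ws] res sessions.list 13437ms",
].join("\n");

// Keep only calls slower than 10 seconds (threshold chosen for illustration).
const slowCalls = parseRpcTimings(excerpt).filter((t) => t.ms > 10000);
```

Lines that do not match the RPC pattern, such as the handshake timeout, are skipped rather than treated as errors.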

Local state that may amplify the bug

This instance has a relatively large session/transcript footprint. Current session directory sizes:

2.5G  /root/.openclaw/agents/scout-localidades/sessions
379M  /root/.openclaw/agents/main/sessions
349M  /root/.openclaw/agents/scout-artistas/sessions
312M  /root/.openclaw/agents/bruno/sessions
266M  /root/.openclaw/agents/validator/sessions
151M  /root/.openclaw/agents/research/sessions
150M  /root/.openclaw/agents/frankie/sessions

There is also a previously archived checkpoint bundle outside the hot sessions path:

4.5G /root/.openclaw/archive/session-checkpoints-2026-04-27-incident

This likely amplifies sessions.list / transcript scanning / Control UI behavior, but it does not seem to be the sole cause: the same local state is usable again on 2026.4.23.
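To see how concentrated the footprint is, the directory listing above can be parsed and ranked. This is a small sketch that assumes `du -h`-style lines with K/M/G suffixes; `parseDuLine` and `rankBySize` are names invented for the example, and the sizes are approximate because the listing is already rounded.

```typescript
const UNIT_BYTES: Record<string, number> = { K: 1024, M: 1024 ** 2, G: 1024 ** 3 };

type DirSize = { path: string; bytes: number };

// Parses one "<size><unit>  <path>" line; returns null for anything else.
function parseDuLine(line: string): DirSize | null {
  const m = /^\s*([\d.]+)([KMG])\s+(\S+)\s*$/.exec(line);
  if (!m) return null;
  return { path: m[3], bytes: Number(m[1]) * UNIT_BYTES[m[2]] };
}

// Sorts directories by size and attaches each one's share of the total.
function rankBySize(duOutput: string) {
  const dirs = duOutput
    .split("\n")
    .map((line) => parseDuLine(line))
    .filter((d): d is DirSize => d !== null);
  const total = dirs.reduce((sum, d) => sum + d.bytes, 0);
  const ranked = dirs
    .sort((a, b) => b.bytes - a.bytes)
    .map((d) => ({ path: d.path, bytes: d.bytes, share: total === 0 ? 0 : d.bytes / total }));
  return { total, ranked };
}

// Listing copied from this report.
const sessionsListing = `
2.5G  /root/.openclaw/agents/scout-localidades/sessions
379M  /root/.openclaw/agents/main/sessions
349M  /root/.openclaw/agents/scout-artistas/sessions
312M  /root/.openclaw/agents/bruno/sessions
266M  /root/.openclaw/agents/validator/sessions
151M  /root/.openclaw/agents/research/sessions
150M  /root/.openclaw/agents/frankie/sessions
`;

const footprint = rankBySize(sessionsListing);
```

On these numbers a single agent directory accounts for a little over 60% of the hot session data, which is useful context when judging how much any per-session scan would have to read.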

Related issues that look relevant

These existing issues seem strongly related:

  • #74345 — Event-loop saturation and ACP session leak on 2026.4.27
  • #74328 — Gateway main thread CPU-bound at ~100% on v2026.4.26/current main
  • #75287 — Gateway reloads provider plugins repeatedly and saturates event loop under Control UI polling
  • #64004 — Control UI remains slow although sessions.list returns quickly
  • #57715 — sessions.list slow: N+1 transcript fallback + full row build before limit
  • #75236 — Cursor pagination for sessions.list
  • #73510 — Stuck sessions cause permanent gateway hang with no auto-recovery
  • #74692 / #74883 / #75109 — 2026.4.27 runtime-deps issues around sqlite-vec, chokidar, and auto-loaded memory-core
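Two of the titles above (#57715 and #75236) name general techniques: applying the limit before building full rows, and cursor pagination. The sketch below illustrates those ideas in isolation. It is not OpenClaw's sessions.list code; `SessionMeta`, `SessionRow`, `listSessionsPage`, and the id-based cursor are assumptions made for the example.

```typescript
type SessionMeta = { id: string; updatedAt: number };
type SessionRow = SessionMeta & { title: string };
type Page = { rows: SessionRow[]; nextCursor: string | null };

function listSessionsPage(
  metas: SessionMeta[],
  buildRow: (meta: SessionMeta) => SessionRow, // the expensive step, e.g. reading a transcript
  limit: number,
  cursor: string | null, // id of the last row on the previous page, or null for the first page
): Page {
  // Order by recency using cheap metadata only; ids break ties for stable paging.
  const sorted = metas
    .slice()
    .sort((a, b) => b.updatedAt - a.updatedAt || a.id.localeCompare(b.id));

  // An unknown cursor falls back to the first page (findIndex returns -1).
  const start = cursor === null ? 0 : sorted.findIndex((m) => m.id === cursor) + 1;
  const slice = sorted.slice(start, start + limit);

  // Build full rows only for the page being returned, not for every session.
  const rows = slice.map((meta) => buildRow(meta));

  const hasMore = start + slice.length < sorted.length;
  const nextCursor = hasMore && slice.length > 0 ? slice[slice.length - 1].id : null;
  return { rows, nextCursor };
}
```

The point of the ordering is cost: with a limit of 50 and thousands of sessions, the expensive `buildRow` call runs 50 times per request instead of once per session.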

Expected behavior

A stable release newer than 2026.4.23 should not saturate the gateway event loop or make Control UI/session/model surfaces take 10-60 seconds on the same local state that remains usable on 2026.4.23.

Actual behavior

On tested versions after 2026.4.23, especially 2026.4.27/2026.4.29, the gateway becomes heavily degraded:

  • sessions.list: 13-38s
  • models.list: ~59s
  • node.list/device.pair.list: can also take ~14s+
  • WebSocket handshake timeouts
  • event-loop delay warnings with p99/max >14-24s
  • eventLoopUtilization ~1
  • CPU around one saturated core

Rollback to 2026.4.23 restores usable behavior.
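The event-loop figures above correspond to metrics that Node exposes through `perf_hooks`, so a similar measurement can be taken in any Node process. The sketch below uses the real `monitorEventLoopDelay` and `performance.eventLoopUtilization` APIs. The `LoopSample` type, the helper names, and the thresholds are assumptions, and the reason strings are borrowed from the log lines rather than from OpenClaw's diagnostic code.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

type LoopSample = { delayP99Ms: number; delayMaxMs: number; utilization: number };

// Samples event-loop delay and utilization over `windowMs`, then resolves.
function sampleEventLoop(windowMs: number): Promise<LoopSample> {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  const before = performance.eventLoopUtilization();
  histogram.enable();
  return new Promise<LoopSample>((resolve) => {
    setTimeout(() => {
      histogram.disable();
      // Passing the earlier reading yields utilization for this window only.
      const elu = performance.eventLoopUtilization(before);
      resolve({
        delayP99Ms: histogram.percentile(99) / 1e6, // histogram values are nanoseconds
        delayMaxMs: histogram.max / 1e6,
        utilization: elu.utilization,
      });
    }, windowMs);
  });
}

// Hypothetical thresholds; OpenClaw's actual limits are not stated in the issue.
function livenessReasons(
  sample: LoopSample,
  maxDelayMs = 1000,
  maxUtilization = 0.95,
): string[] {
  const reasons: string[] = [];
  if (sample.delayP99Ms > maxDelayMs) reasons.push("event_loop_delay");
  if (sample.utilization > maxUtilization) reasons.push("event_loop_utilization");
  return reasons;
}
```

A utilization near 1 together with multi-second p99 delay, as in the 2026.4.29 excerpt, means the main thread was busy for almost the whole window, which is consistent with the timeouts on otherwise cheap RPCs.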

Question

Is this expected to be covered by the fixes for the issues above, or should this be tracked as a separate regression? I can provide more logs/details if useful, but I wanted to report the concrete version-to-version behavior and timings from a real production-like OpenClaw state.

extent analysis

TL;DR

The most likely fix is to wait for a newer version of OpenClaw that addresses the event-loop saturation and session listing issues, or apply a workaround to mitigate the performance degradation.

Guidance

  • Review the related issues (#74345, #74328, #75287, #64004, #57715, #75236, #73510, #74692, #74883, #75109) to understand the potential causes of the regression.
  • Consider applying a workaround to reduce the load on the gateway event loop, such as optimizing the session listing query or reducing the frequency of Control UI polling.
  • Monitor the performance of the gateway and Control UI after applying any workarounds or updates to ensure that the issue is resolved.
  • Provide additional logs and details to the OpenClaw developers if requested to help resolve the issue.

Example

No code snippet is provided as the issue is related to a specific version of OpenClaw and its configuration, and the solution requires waiting for an update or applying a workaround.

Notes

The issue is likely related to the event-loop saturation and session listing issues reported in the related issues. The fact that rolling back to version 2026.4.23 resolves the issue suggests that the regression was introduced in a later version.

Recommendation

Apply a workaround to mitigate the performance degradation, such as optimizing the session listing query or reducing the frequency of Control UI polling, until a newer version of OpenClaw that addresses the issue is released.
