openclaw - ✅(Solved) Fix Gateway event-loop saturation and very slow sessions.list/models.list on all tested versions after 2026.4.23; rollback restores stability [1 pull requests, 1 comments, 2 participants]

Q: Expected behavior

A stable release newer than `2026.4.23` should not saturate the gateway event loop or make Control UI/session/model surfaces take 10-60 seconds on the same local state that remains usable on `2026.4.23`.

lisandromachado · 2026-04-30T22:54:32Z


GitHub stats

Issue: openclaw/openclaw#75297 (fetched 2026-05-01 05:35:37)
Author: lisandromachado
Participants: clawsweeper[bot], lisandromachado
Timeline (top): cross-referenced ×4, commented ×1, renamed ×1, subscribed ×1

We hit a production-impacting regression after upgrading OpenClaw beyond 2026.4.23. In our environment, every tested version after 2026.4.23 showed instability or severe degradation in real use, including 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, beta attempts, and 2026.4.29. The system was usable again only after rolling back to 2026.4.23.

The visible symptom is that the Gateway / Control UI appears to hang while loading session-related surfaces. Locally, the strongest evidence came from sessions.list, models.list, node.list, event-loop diagnostics, and CPU saturation.

This looks related to existing reports about event-loop saturation, slow/unbounded sessions.list, Control UI polling, stuck sessions, and runtime-deps issues around 2026.4.26 / 2026.4.27 / current builds.

Root Cause

I am not claiming the config/model correction is the root cause; it may simply have been the restart that exposed the regression. The observed failure pattern points more strongly to gateway event-loop/session-list/provider-loading behavior.

Fix Action

Fixed

Fixed by PR: perf: keep models list responsive during catalog discovery (https://github.com/openclaw/openclaw/pull/75326)

PR fix notes

PR #75326: perf: keep models list responsive during catalog discovery

Repository: openclaw/openclaw
Author: steipete
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/75326

Description (problem / solution / changelog)

Summary

Problem: models.list default/configured views wait for full gateway model catalog discovery on cold or slow provider/plugin catalog paths.
Why it matters: slow model discovery makes common CLI/UI control-plane calls look stalled during gateway CPU/event-loop starvation incidents.
What changed: default/configured models.list now falls back to configured/synthetic rows after a short wait, while discovery continues through the existing catalog cache path.
What did NOT change (scope boundary): models.list --all remains exact and waits for the full discovered catalog; no provider auth, catalog schema, or model execution behavior changed.

Change Type (select all)

Bug fix
Docs

Scope (select all touched areas)

Gateway / orchestration
Integrations
API / contracts

Linked Issue/PR

Related #72338
Related #74404
Related #75297
Related #75287
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: models.list always awaited context.loadGatewayModelCatalog(), even for views that can answer from configured model rows when provider discovery is slow.
Missing detection / guardrail: sessions.list had slow-catalog timeout coverage, but models.list did not.
Contributing context (if known): provider plugin augmentation/runtime-deps can make model catalog discovery slow on packaged gateway installs.

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- [x] Unit test
- [ ] Seam / integration test
- [ ] End-to-end test
- [ ] Existing coverage already sufficient
Target test or file: src/gateway/server-methods/models.test.ts
Scenario the test should lock in: configured view returns configured model rows after a slow-catalog timeout; view=all still waits for exact catalog data.
Why this is the smallest reliable guardrail: it exercises the request handler directly with fake timers and a deferred catalog promise.
Existing test that already covers this (if any): existing server model tests cover catalog contents, not slow discovery behavior.
If no new test is added, why not: N/A.

User-visible / Behavior Changes

models list default/configured views can return configured model rows quickly while provider discovery is still slow. models list --all remains the exact wait-for-discovery command.

Diagram (if applicable)

Before:
models.list default/configured -> await full provider catalog -> delayed response

After:
models.list default/configured -> short wait -> configured/synthetic rows -> responsive control plane
models.list --all -> await full provider catalog -> exact response
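
The before/after flow in the diagram can be sketched as follows. This is a minimal illustration, not the code from PR #75326: `ModelRow`, `withFallback`, `listModels`, and the 250 ms wait are assumptions made for the example, since the PR only says that configured/synthetic rows are returned after "a short wait".

```typescript
// Hypothetical row shape; the real catalog schema is not shown in the source.
type ModelRow = { id: string; source: "configured" | "discovered" };
type View = "default" | "configured" | "all";

// Resolve with `fallback` if `work` has not settled within `ms` milliseconds.
// `work` is not cancelled, so a cache fed by it can still be filled later.
function withFallback<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

async function listModels(
  view: View,
  loadCatalog: () => Promise<ModelRow[]>,
  configuredRows: ModelRow[],
  shortWaitMs = 250, // assumed value, not taken from the PR
): Promise<ModelRow[]> {
  const catalog = loadCatalog();
  // view=all stays exact: always wait for full discovery.
  if (view === "all") return catalog;
  // default/configured: answer from configured rows if discovery is slow.
  return withFallback(catalog, shortWaitMs, configuredRows);
}
```

With a catalog loader that takes longer than `shortWaitMs`, the `default` and `configured` views return only the configured rows, while `all` resolves once discovery finishes.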

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No
If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

OS: macOS local plus Blacksmith Linux Testbox
Runtime/container: Node/pnpm via repo wrappers
Model/provider: configured model fixture
Integration/channel (if any): Gateway RPC models.list
Relevant config (redacted): configured models.providers.openai.models fixture

Steps

Call models.list configured view with loadGatewayModelCatalog() deferred.
Advance fake timers past the short timeout.
Assert the handler returns configured model rows.
Repeat with view=all and assert it waits for the deferred catalog.
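
The four steps above could look roughly like this in a self-contained form. The PR's test relies on fake timers and a deferred catalog promise; in this sketch a small hand-written `FakeClock` and a hypothetical `handleModelsList` stand in for the test runner's timers and the real handler, so none of these names should be read as the repository's actual API.

```typescript
// A promise whose resolution is controlled by the test.
function deferred<T>() {
  let resolve!: (value: T) => void;
  const promise = new Promise<T>((res) => {
    resolve = res;
  });
  return { promise, resolve };
}

// Minimal manual clock standing in for a test framework's fake timers.
class FakeClock {
  private now = 0;
  private pending: { at: number; fire: () => void }[] = [];

  // Returns a promise that settles only once advance() moves time past `ms`.
  sleep = (ms: number): Promise<void> =>
    new Promise<void>((resolve) => {
      this.pending.push({ at: this.now + ms, fire: () => resolve() });
    });

  advance(ms: number): void {
    this.now += ms;
    const due = this.pending.filter((t) => t.at <= this.now);
    this.pending = this.pending.filter((t) => t.at > this.now);
    due.forEach((t) => t.fire());
  }
}

// Hypothetical stand-in for the handler under test.
async function handleModelsList(
  view: "configured" | "all",
  catalog: Promise<string[]>,
  configured: string[],
  sleep: (ms: number) => Promise<void>,
  waitMs = 100, // assumed timeout
): Promise<string[]> {
  if (view === "all") return catalog;
  return Promise.race([catalog, sleep(waitMs).then(() => configured)]);
}

async function runSteps() {
  const clock = new FakeClock();
  const configured = ["openai/configured-model"];
  // Step 1: call the configured view with the catalog deferred.
  const catalog = deferred<string[]>();
  const configuredCall = handleModelsList("configured", catalog.promise, configured, clock.sleep);
  // Step 2: advance the fake clock past the short timeout.
  clock.advance(101);
  // Step 3: the handler answers with configured rows.
  const rows = await configuredCall;
  // Step 4: view=all must keep waiting until the catalog resolves.
  let allSettled = false;
  const allCall = handleModelsList("all", catalog.promise, configured, clock.sleep).then((r) => {
    allSettled = true;
    return r;
  });
  clock.advance(10000);
  await Promise.resolve(); // let any ready microtasks run
  const stillWaiting = !allSettled;
  catalog.resolve([...configured, "openai/discovered-model"]);
  const allRows = await allCall;
  return { rows, stillWaiting, allRows };
}
```

Injecting `sleep` keeps the timeout deterministic: no real time passes, and the test decides exactly when the short wait expires.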

Expected

Default/configured views stay responsive; view=all remains exact.

Actual

Before this patch, every view waited on the slow catalog promise. After this patch, only view=all does.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

Verified scenarios: focused models handler test passed locally; combined changed gate passed on Blacksmith Testbox before PR split.
Edge cases checked: view=all exact behavior is covered by a separate test.
What you did not verify: live Linux npm-global Telegram repro/profile.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No
If yes, exact upgrade steps: N/A

Risks and Mitigations

Risk: default/configured views may temporarily omit auth-backed discovered provider rows during slow discovery.
- Mitigation: configured/synthetic rows still return, existing cache refresh continues, and --all remains available for exact discovery.

Changed files

CHANGELOG.md (modified, +1/-0)
docs/cli/models.md (modified, +5/-0)
src/gateway/server-methods/models.test.ts (added, +156/-0)
src/gateway/server-methods/models.ts (modified, +40/-1)

Issue report

Environment

Host OS: Linux 6.8.0-110-generic x64
Node: v24.14.1
Gateway: systemd user service
Gateway bind: loopback 127.0.0.1:18789
Current stable rollback version: OpenClaw 2026.4.23 (a979721)
Affected versions observed across the incident sequence: every tested version after 2026.4.23, including 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, beta attempts, and 2026.4.29
Channels/plugins in use include Telegram, Control UI/webchat, ACP/Codex-related tooling, browser/device-pair/talk-voice, etc.

What happened

We had 2026.4.27 apparently working after addressing runtime dependency issues around memory-core, chokidar, and sqlite-vec. After a config/model correction and a gateway/server restart, the instance became heavily degraded. We then tried newer builds including 2026.4.29, but the symptoms remained. Rolling back to 2026.4.23 restored practical stability.

Local evidence from the last affected attempt (`2026.4.29`)

From journalctl --user -u openclaw-gateway.service around 2026-04-30 19:07-19:10 ART:

19:07:55 [ws] res sessions.list 38151ms
19:07:55 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=64s eventLoopDelayP99Ms=24679.3 eventLoopDelayMaxMs=24679.3 eventLoopUtilization=1 cpuCoreRatio=1.085 active=0 waiting=0 queued=0
19:08:16 [ws] res models.list 59604ms
19:08:16 [ws] handshake timeout
19:08:33 gateway SIGUSR1 restart
19:09:21 [ws] res sessions.list 13437ms
19:09:21 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=39s eventLoopDelayP99Ms=14327.7 eventLoopDelayMaxMs=14327.7 eventLoopUtilization=0.993 cpuCoreRatio=1.035 active=0 waiting=0 queued=0
19:10:02 Stopping openclaw-gateway.service - OpenClaw Gateway (v2026.4.29)
19:10:05 Stopped openclaw-gateway.service - OpenClaw Gateway (v2026.4.29), CPU 2min44s, memory peak 1.6G
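
When comparing several incidents, the `liveness warning` lines above can be turned into structured values. The following is a small sketch that assumes only the `key=value` layout visible in these log lines; `parseLivenessWarning` and the `LivenessWarning` type are names invented for the example, not part of OpenClaw.

```typescript
type LivenessWarning = { reasons: string[]; metrics: Record<string, number> };

// Parse one "[diagnostic] liveness warning: ..." journal line.
// Returns null for lines that are not liveness warnings.
function parseLivenessWarning(line: string): LivenessWarning | null {
  const marker = "liveness warning:";
  const start = line.indexOf(marker);
  if (start === -1) return null;

  const reasons: string[] = [];
  const metrics: Record<string, number> = {};
  const tokens = line.slice(start + marker.length).trim().split(/\s+/);
  for (const token of tokens) {
    const eq = token.indexOf("=");
    if (eq === -1) continue;
    const key = token.slice(0, eq);
    const value = token.slice(eq + 1);
    if (key === "reasons") {
      reasons.push(...value.split(","));
    } else {
      // parseFloat ignores a trailing unit, so "64s" becomes 64.
      metrics[key] = parseFloat(value);
    }
  }
  return { reasons, metrics };
}
```

For the 19:07:55 warning this yields the reasons `event_loop_delay`, `event_loop_utilization`, and `cpu`, with `eventLoopDelayP99Ms` equal to 24679.3, that is, a p99 event-loop delay of about 24.7 seconds.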

After rollback to 2026.4.23:

19:15:34 Started openclaw-gateway.service - OpenClaw Gateway (v2026.4.23)
19:16:46 gateway ready (6 plugins; 66.8s)
19:19:12 sessions.list 518ms
19:19:32 sessions.list 454ms, chat.history 521ms, models.list 316ms
19:23:52 sessions.list 694ms, chat.history 1103ms, health 1131ms
19:30-19:33 sessions.list roughly 330-690ms, chat.history roughly 50-500ms
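
To put the two log excerpts side by side, the per-method timings can be extracted and compared. This is an illustrative sketch only: `parseTimings` and `slowdown` are hypothetical helpers, and the regular expression assumes the `<method> <n>ms` shape seen in the lines above (approximate ranges such as "roughly 330-690ms" are deliberately skipped).

```typescript
// Collect "<dotted.method> <n>ms" timings from a block of log text.
function parseTimings(log: string): Record<string, number[]> {
  const timings: Record<string, number[]> = {};
  const pattern = /([a-z]+(?:\.[a-z]+)+) (\d+)ms/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(log)) !== null) {
    const method = match[1];
    if (!timings[method]) timings[method] = [];
    timings[method].push(Number(match[2]));
  }
  return timings;
}

// Ratio of the worst affected timing to the worst baseline timing.
function slowdown(affected: number[], baseline: number[]): number {
  return Math.max(...affected) / Math.max(...baseline);
}

const affectedLog = [
  "19:07:55 [ws] res sessions.list 38151ms",
  "19:08:16 [ws] res models.list 59604ms",
  "19:09:21 [ws] res sessions.list 13437ms",
].join("\n");

const rollbackLog = [
  "19:19:12 sessions.list 518ms",
  "19:19:32 sessions.list 454ms, chat.history 521ms, models.list 316ms",
  "19:23:52 sessions.list 694ms, chat.history 1103ms, health 1131ms",
].join("\n");

const affected = parseTimings(affectedLog);
const rollback = parseTimings(rollbackLog);
const sessionsSlowdown = slowdown(affected["sessions.list"], rollback["sessions.list"]);
const modelsSlowdown = slowdown(affected["models.list"], rollback["models.list"]);
```

Comparing worst case to worst case, `sessions.list` is about 55 times slower on 2026.4.29 (38151 ms against 694 ms) and `models.list` about 189 times slower (59604 ms against 316 ms).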

Current verification after rollback:

OpenClaw 2026.4.23 (a979721)
Gateway probe: ok
Gateway status: systemd active, connectivity probe ok

Local state that may amplify the bug

This instance has a relatively large session/transcript footprint. Current session directory sizes:

2.5G  /root/.openclaw/agents/scout-localidades/sessions
379M  /root/.openclaw/agents/main/sessions
349M  /root/.openclaw/agents/scout-artistas/sessions
312M  /root/.openclaw/agents/bruno/sessions
266M  /root/.openclaw/agents/validator/sessions
151M  /root/.openclaw/agents/research/sessions
150M  /root/.openclaw/agents/frankie/sessions

There is also a previously archived checkpoint bundle outside the hot sessions path:

4.5G /root/.openclaw/archive/session-checkpoints-2026-04-27-incident

This likely amplifies sessions.list / transcript scanning / Control UI behavior, but it does not seem to be the sole cause: the same local state is usable again on 2026.4.23.
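
For a sense of scale, the hot session directories listed above can be summed. A minimal sketch, assuming the sizes are `du -h` style values with binary K/M/G suffixes (so the result is approximate, since the listed sizes are already rounded); `totalMiB` is a name made up for this example.

```typescript
// Conversion factors to MiB for human-readable size suffixes.
const UNIT_MIB: Record<string, number> = { K: 1 / 1024, M: 1, G: 1024 };

// Sum "<size><unit>  <path>" lines and return the total in MiB.
function totalMiB(duOutput: string): number {
  let total = 0;
  for (const line of duOutput.split("\n")) {
    const match = /^\s*([\d.]+)([KMG])\s+\S+/.exec(line);
    if (!match) continue; // skip blank or unrecognized lines
    total += parseFloat(match[1]) * UNIT_MIB[match[2]];
  }
  return total;
}

const sessionDirs = [
  "2.5G  /root/.openclaw/agents/scout-localidades/sessions",
  "379M  /root/.openclaw/agents/main/sessions",
  "349M  /root/.openclaw/agents/scout-artistas/sessions",
  "312M  /root/.openclaw/agents/bruno/sessions",
  "266M  /root/.openclaw/agents/validator/sessions",
  "151M  /root/.openclaw/agents/research/sessions",
  "150M  /root/.openclaw/agents/frankie/sessions",
].join("\n");

const hotSessionsMiB = totalMiB(sessionDirs);
```

The seven directories add up to roughly 4167 MiB, or about 4.1G of session data in the hot path, most of it under `scout-localidades`.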

Related issues that look relevant

These existing issues seem strongly related:

#74345 — Event-loop saturation and ACP session leak on 2026.4.27
#74328 — Gateway main thread CPU-bound at ~100% on v2026.4.26/current main
#75287 — Gateway reloads provider plugins repeatedly and saturates event loop under Control UI polling
#64004 — Control UI remains slow although sessions.list returns quickly
#57715 — sessions.list slow: N+1 transcript fallback + full row build before limit
#75236 — Cursor pagination for sessions.list
#73510 — Stuck sessions cause permanent gateway hang with no auto-recovery
#74692 / #74883 / #75109 — 2026.4.27 runtime-deps issues around sqlite-vec, chokidar, and auto-loaded memory-core

Expected behavior

A stable release newer than 2026.4.23 should not saturate the gateway event loop or make Control UI/session/model surfaces take 10-60 seconds on the same local state that remains usable on 2026.4.23.

Actual behavior

On tested versions after 2026.4.23, especially 2026.4.27/2026.4.29, the gateway becomes heavily degraded:

sessions.list: 13-38s
models.list: ~59s
node.list/device.pair.list: can also take ~14s+
WebSocket handshake timeouts
event-loop delay warnings with p99/max >14-24s
eventLoopUtilization ~1
CPU around one saturated core

Rollback to 2026.4.23 restores usable behavior.
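
The event-loop delay and utilization figures above can be reproduced independently with Node's built-in `perf_hooks` module. The sketch below is not how OpenClaw computes its liveness diagnostics (that code is not shown in the report); `sampleLiveness` and its result field names are assumptions chosen to mirror the log fields.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

type LivenessSample = {
  delayP99Ms: number;
  delayMaxMs: number;
  utilization: number;
};

// Sample event-loop delay and utilization over a window of `windowMs`.
function sampleLiveness(windowMs: number): Promise<LivenessSample> {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  const startElu = performance.eventLoopUtilization();
  histogram.enable();
  return new Promise((resolve) => {
    setTimeout(() => {
      histogram.disable();
      // Passing the earlier reading returns the delta for this window.
      const elu = performance.eventLoopUtilization(startElu);
      resolve({
        // Histogram values are reported in nanoseconds.
        delayP99Ms: histogram.percentile(99) / 1e6,
        delayMaxMs: histogram.max / 1e6,
        utilization: elu.utilization,
      });
    }, windowMs);
  });
}
```

On a healthy process the delay values stay in the low milliseconds and utilization stays well below 1. A main thread blocked by synchronous work for many seconds would push the maximum delay toward that duration and utilization toward 1, which matches the pattern in the warnings above.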

Question

Is this expected to be covered by the fixes for the issues above, or should this be tracked as a separate regression? I can provide more logs/details if useful, but I wanted to report the concrete version-to-version behavior and timings from a real production-like OpenClaw state.

extent analysis

TL;DR

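PR #75326 (merged) keeps the default/configured models.list views responsive during slow catalog discovery, so the most likely fix is to upgrade to an OpenClaw release that includes it. The PR changes only models.list handling, so the slow sessions.list timings may still depend on the fixes tracked in the related issues. Until a newer release is confirmed stable on this state, staying on 2026.4.23 is the known-good workaround.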

Guidance

Review the related issues (#74345, #74328, #75287, #64004, #57715, #75236, #73510, #74692, #74883, #75109) to understand the potential causes of the regression.
Consider reducing load on the gateway event loop in the meantime, for example by lowering the frequency of Control UI polling or by shrinking the hot session/transcript footprint, which the report suspects amplifies sessions.list cost.
Monitor the performance of the gateway and Control UI after applying any workarounds or updates to ensure that the issue is resolved.
Provide additional logs and details to the OpenClaw developers if requested to help resolve the issue.

Example

No code snippet is provided as the issue is related to a specific version of OpenClaw and its configuration, and the solution requires waiting for an update or applying a workaround.

Notes

The issue is likely related to the event-loop saturation and session listing issues reported in the related issues. The fact that rolling back to version 2026.4.23 resolves the issue suggests that the regression was introduced in a later version.

Recommendation

Stay on 2026.4.23, or reduce Control UI polling frequency, until a release that includes PR #75326 is available and verified against the same local state.
