openclaw - ✅(Solved) Fix Meta: correlated regression cluster in 2026.4.24 to 2026.4.26 around gateway startup/runtime/control-plane stability [1 pull requests, 2 comments, 3 participants]

moltar-bot · 2026-04-29T21:34:29Z


GitHub stats

openclaw/openclaw#74630•Fetched 2026-04-30 06:21:59


This issue is a synthesis of the recent issue/comment corpus, not a claim that one exact root cause is already proven.

The evidence from issues created since `2026-04-17` strongly suggests that `2026.4.24` through `2026.4.26` should be treated as a correlated regression cluster rather than as a large set of unrelated bugs.

The recurring pattern across reports is:

1. upgrade into `2026.4.24`, `2026.4.25`, or `2026.4.26`
2. gateway startup becomes slow, inconsistent, or only partially healthy
3. bundled/plugin runtime-deps work, plugin bootstrap, startup discovery, or restart/reload paths appear to stall or churn
4. loopback/gateway probe/WebSocket handshake failures start showing up
5. long-lived channel connections then destabilize
6. user-visible symptoms show up as WebSocket `1006`, probe timeout, channels never connecting, delayed replies, stuck sessions, Telegram churn, Slack socket disconnects, or dropped/stale replies
7. downgrade to an earlier known-good version removes the instability

This issue is meant to map the cluster and connect related reports. It is not asserting that every report below has one identical root cause, only that the issue and comment stream repeatedly points to a shared control-plane/runtime/bootstrap failure family.


PR fix notes

PR #74762: fix: gateway model catalog cache regression

Repository: openclaw/openclaw
Author: clawsweeper[bot]
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/74762

Description (problem / solution / changelog)

Summary

Found one regression in the new gateway model catalog cache: it treats an empty catalog as a successful cached catalog, which breaks the underlying retry-on-empty contract.

What ClawSweeper Is Fixing

Medium: Gateway caches transient empty model catalogs until reload/restart (regression)
- File: `src/gateway/server-model-catalog.ts:49`
- Evidence: `startGatewayModelCatalogRefresh()` assigns `lastSuccessfulCatalog = catalog` for every resolved array, including `[]`. Later, `loadGatewayModelCatalog()` returns `lastSuccessfulCatalog` whenever it is truthy, and empty arrays are truthy in JS. The underlying loader explicitly avoids caching empty results at `src/agents/model-catalog.ts:215` because an empty catalog can come from transient dependency/filesystem/provider issues and should be retried.
- Impact: if the first gateway catalog load returns `[]`, `models.list`, TUI model surfaces, session/model metadata helpers, and related gateway callers keep seeing no models until a model config reload or process restart. This is worse than the prior behavior, where the next request retried immediately.
- Suggested fix: preserve the underlying no-cache-on-empty behavior in the gateway wrapper. Do not mark an empty result as fresh; keep the cache stale or clear it so the next call retries. Add a regression test where the injected loader returns `[]` once and a non-empty catalog on the second call.
- Confidence: high
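
The suggested fix can be sketched as follows. This is a minimal illustration and not the merged patch: `ModelEntry`, `CatalogLoader`, and `createGatewayCatalogCache` are hypothetical names, and only `lastSuccessfulCatalog` is borrowed from the report. It shows the cache guard testing for a non-empty array, since `[]` is truthy in JavaScript.

```typescript
// Hypothetical types and factory for illustration; not the openclaw source.
type ModelEntry = { id: string };
type CatalogLoader = () => Promise<ModelEntry[]>;

function createGatewayCatalogCache(loader: CatalogLoader) {
  let lastSuccessfulCatalog: ModelEntry[] | undefined;

  return {
    async load(): Promise<ModelEntry[]> {
      // Test the length explicitly: a bare `if (lastSuccessfulCatalog)` would
      // also accept `[]` and keep serving an empty catalog.
      if (lastSuccessfulCatalog && lastSuccessfulCatalog.length > 0) {
        return lastSuccessfulCatalog;
      }
      const catalog = await loader();
      // Only a non-empty result counts as fresh; an empty one is returned to
      // the caller but not cached, so the next call retries the loader.
      if (catalog.length > 0) {
        lastSuccessfulCatalog = catalog;
      }
      return catalog;
    },
    // Drop cached state, for example on a model config reload.
    invalidate(): void {
      lastSuccessfulCatalog = undefined;
    },
  };
}
```

With a loader that returns `[]` once and a model afterwards, the second `load()` call reaches the loader again instead of returning the cached empty array.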

Expected Repair Surface

- `src/gateway/server-model-catalog.ts`
- `src/gateway/server-model-catalog.test.ts`
- `src/gateway/server-reload-handlers.ts`

Source And Review Context

ClawSweeper report: https://github.com/openclaw/clawsweeper/blob/main/records/openclaw-openclaw/commits/6421e1f36a3cfdf3ab1b4502b36fe718e0d662d3.md
Commit under review: https://github.com/openclaw/openclaw/commit/6421e1f36a3cfdf3ab1b4502b36fe718e0d662d3
Latest main at intake: a6390efeba3ce19869c0d2d2eb53be2aa3092ae3
Original commit author: Peter Steinberger
GitHub author: @steipete
Highest severity: medium
Review confidence: high
Diff: `57a3d7f6e897f25073e313d5c24b6fb6f60575ae..6421e1f36a3cfdf3ab1b4502b36fe718e0d662d3`
Changed files: `CHANGELOG.md`, `src/gateway/server-model-catalog.ts`, `src/gateway/server-model-catalog.test.ts`, `src/gateway/server-reload-handlers.ts`
Code read: gateway model catalog wrapper, underlying `src/agents/model-catalog.ts`, reload handling, request context wiring, `models.list`, HTTP model override, session/model support call paths
GitHub refs read: https://github.com/openclaw/openclaw/issues/74135, https://github.com/openclaw/openclaw/issues/74630, https://github.com/openclaw/openclaw/issues/74633

Expected validation

- `pnpm check:changed`

ClawSweeper already ran:

- `pnpm docs:list`
- `pnpm install` after the first targeted test failed because `node_modules` was missing
- `pnpm test src/gateway/server-model-catalog.test.ts -- --reporter=verbose` passed
- Injected smoke with first loader call returning `[]` and second returning a model produced `{"first":[],"second":[],"calls":1}`, confirming the retry is suppressed
- `git diff --check 57a3d7f6e897f25073e313d5c24b6fb6f60575ae..6421e1f36a3cfdf3ab1b4502b36fe718e0d662d3`

Known review limits:

Full suite and live gateway smoke were not run; review used focused gateway tests and an injected runtime proof.
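
The injected smoke described above can be approximated with a small harness. This is a sketch under assumptions: `makeTruthyCachedLoad` is a hypothetical stand-in for the regressed wrapper, and `runSmoke` is an illustrative helper, not ClawSweeper's actual script.

```typescript
type Model = { id: string };
type Loader = () => Promise<Model[]>;

// Stand-in for the regressed behavior: caches every resolved array, including [].
function makeTruthyCachedLoad(loader: Loader): () => Promise<Model[]> {
  let lastSuccessfulCatalog: Model[] | undefined;
  return async () => {
    if (lastSuccessfulCatalog) return lastSuccessfulCatalog; // [] passes this check
    lastSuccessfulCatalog = await loader();
    return lastSuccessfulCatalog;
  };
}

// Injects a loader that returns [] on the first call and one model afterwards,
// then loads twice through the wrapper under test.
async function runSmoke(wrap: (loader: Loader) => () => Promise<Model[]>) {
  let calls = 0;
  const loader: Loader = async () => {
    calls += 1;
    return calls === 1 ? [] : [{ id: "example-model" }];
  };
  const load = wrap(loader);
  const first = await load();
  const second = await load();
  return { first, second, calls };
}

runSmoke(makeTruthyCachedLoad).then((result) => {
  console.log(JSON.stringify(result)); // {"first":[],"second":[],"calls":1}
});
```

A wrapper that preserves the no-cache-on-empty contract would be expected to report `calls` of 2 and a non-empty `second` under the same harness.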

ClawSweeper Guardrails

Re-check the finding against latest main before changing code.
Keep the patch to the narrowest behavior change and matching regression coverage.
Do not merge automatically; this PR stays for maintainer review.

ClawSweeper 🐠 replacement reef notes:

Cluster: clawsweeper-commit-openclaw-openclaw-6421e1f36a3c
Source PRs: none
Credit: Detected by ClawSweeper commit review for 6421e1f36a3cfdf3ab1b4502b36fe718e0d662d3.; Original commit author: Peter Steinberger.
Validation: pnpm check:changed

fish notes: model gpt-5.5, reasoning medium; reviewed against da5e171ffab1.

Changed files

src/gateway/server-model-catalog.test.ts (modified, +18/-0)
src/gateway/server-model-catalog.ts (modified, +1/-1)


Meta: correlated regression cluster in `2026.4.24` to `2026.4.26` around gateway startup/runtime/control-plane stability


Why this looks like one regression family

Across the issue/comment corpus, the same higher-level themes recur:

- bundled runtime deps / staged runtime-deps repair
- plugin loader / plugin registry / manifest load behavior
- startup-time gateway probe / readiness / handshake timing
- event-loop starvation during startup or restart
- restart / reload / stale runtime state after update
- stuck processing / running session state causing channel-visible outages

A strong signal is that many channel-facing and transport-facing reports are explicitly being consolidated into a smaller set of control-plane/runtime trackers rather than being treated as isolated Telegram, Slack, or WebSocket bugs.
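
Event-loop starvation, one of the themes listed above, can be observed from inside a Node process by measuring how late a timer fires. The following is a minimal sketch for triage purposes; `startEventLoopLagMonitor` is a hypothetical helper and is not part of openclaw.

```typescript
// Hypothetical helper: reports how many milliseconds late each interval tick
// fired. Sustained large values suggest synchronous work is blocking the loop.
function startEventLoopLagMonitor(
  intervalMs: number,
  onSample: (lagMs: number) => void,
): () => void {
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const now = Date.now();
    onSample(Math.max(0, now - expected));
    expected = now + intervalMs;
  }, intervalMs);
  // The returned function stops the monitor.
  return () => clearInterval(timer);
}

// Example wiring (commented out so the sketch has no side effects):
// const stop = startEventLoopLagMonitor(100, (lagMs) => {
//   if (lagMs > 500) console.warn(`event loop lag ${lagMs}ms`);
// });
```

Node's built-in `perf_hooks.monitorEventLoopDelay()` provides a histogram-based alternative; the timer-drift version is shown here only because it is short and dependency-free.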

Release arc visible in the corpus

`2026.4.24`

Recurring reports around:

- bonjour / CIAO crash-loop and hostname behavior
- migration/runtime breakage
- early update/restart instability

Relevant issues:

- #72366
- #72561
- #72355
- #72434
- #72526
- #72665
- #73044

`2026.4.25`

Recurring reports around:

- runtime-deps staging / plugin-loader / packaging fallout
- startup-sidecar stalls
- post-update unhealthy state
- missing staged deps such as `chokidar`

Relevant issues:

- #72846
- #72848
- #72882
- #72956
- #72992
- #73176
- #73140
- #73332
- #73524

`2026.4.26`

Recurring reports around:

- event-loop starvation
- CPU spin during or after startup
- probe timeout / channel startup failure
- loopback WebSocket handshake timeout / `1006`
- socket instability
- stuck sessions and delayed or dropped replies

Relevant issues:

- #72338
- #73532
- #73647
- #73655
- #73857
- #73874
- #74135
- #74153
- #74279
- #74281
- #74292
- #74307
- #74323
- #74325
- #74328
- #74346
- #74405
- #74568
- #74570

Symptom families that seem correlated

1. Gateway startup / readiness / probe / handshake instability

These issues repeatedly describe the gateway reaching ready or opening a listening socket, but remaining unhealthy for probes, channels, or loopback clients:

- #72338
- #73524
- #74135
- #74279
- #74281
- #74292
- #74323
- #74325
- #74568
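
The gap between "listening socket is open" and "gateway answers a probe in time" can be made explicit with a deadline-bound check. This is a minimal sketch with assumptions: `probeWithDeadline` and `ProbeResult` are hypothetical names, and the `check` callback stands in for whatever health request or handshake a deployment actually uses.

```typescript
type ProbeResult =
  | { ok: true; elapsedMs: number }
  | { ok: false; reason: "timeout" | "error"; elapsedMs: number };

// Runs `check` and reports whether it completed, failed, or exceeded the deadline.
async function probeWithDeadline(
  check: () => Promise<void>,
  deadlineMs: number,
): Promise<ProbeResult> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), deadlineMs);
  });
  try {
    const outcome = await Promise.race([check().then(() => "ok" as const), timeout]);
    const elapsedMs = Date.now() - started;
    return outcome === "ok"
      ? { ok: true, elapsedMs }
      : { ok: false, reason: "timeout", elapsedMs };
  } catch {
    // The check rejected before the deadline (for example, connection refused).
    return { ok: false, reason: "error", elapsedMs: Date.now() - started };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Recording `elapsedMs` alongside the outcome helps separate a refused connection from a handshake that was accepted but never serviced. If the probing process itself is starved, the deadline timer also fires late, so an `elapsedMs` far above `deadlineMs` is a signal in its own right.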

2. Runtime-deps / plugin bootstrap / staging / update fallout

These issues repeatedly connect startup breakage, plugin loss, stale runtime state, or broken update recovery to runtime-deps/bootstrap paths:

- #72665
- #72848
- #72882
- #72956
- #72992
- #73140
- #73176
- #73532
- #73647
- #74199
- #74307
- #74346
- #74405
- #74570
- #74597

3. WebSocket / `1006` / loopback failures

These issues repeatedly show loopback handshake starvation, timeout, or disconnect loops that appear to be symptoms of blocked startup or runtime load:

- #73044
- #73524
- #74135
- #74279
- #74292
- #74323
- #74449
- #74568
- #74583
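
For triage, it can help to separate `1006` from graceful closes when reading client logs. In the WebSocket protocol, `1006` is a reserved code that is never sent on the wire; it is reported locally when the connection ended without a close frame. The sketch below tallies observed close codes; `classifyCloseCode` and `tallyCloseCodes` are hypothetical helpers for illustration.

```typescript
type CloseClass = "graceful" | "abnormal-no-close-frame" | "other";

// 1000 (normal closure) and 1001 (going away) are sent in a close frame;
// 1006 means the connection ended without one.
function classifyCloseCode(code: number): CloseClass {
  if (code === 1000 || code === 1001) return "graceful";
  if (code === 1006) return "abnormal-no-close-frame";
  return "other";
}

function tallyCloseCodes(codes: number[]): Record<CloseClass, number> {
  const tally: Record<CloseClass, number> = {
    graceful: 0,
    "abnormal-no-close-frame": 0,
    other: 0,
  };
  for (const code of codes) {
    tally[classifyCloseCode(code)] += 1;
  }
  return tally;
}
```

A log dominated by the abnormal class is consistent with the reading above that the peer stopped servicing the socket, rather than closing it deliberately.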

4. Telegram-facing symptoms

Some Telegram issues look adapter-specific, but many severe incidents cluster around startup stalls, stuck sessions, runtime churn, or shared dispatch failure:

- #72338
- #73323
- #73647
- #74154
- #74299
- #74344
- #74540
- #74550
- #74581

5. Slack-facing symptoms

Slack has some isolated transport-specific issues, but comments also repeatedly intersect with the same startup/runtime load class:

- #72808
- #73857
- #74011
- #74358
- #74590

6. Session wedge / failover / compaction fallout

These issues repeatedly show processing or running sessions wedging the control plane and then surfacing as channel silence, stale output, or delayed replies:

- #71127
- #72903
- #73510
- #74153
- #74154
- #74550
- #74607
- #73204
- #74073
- #72676
- #72697
- #74239

Maintainer response pattern seen in comments

A recurring maintainer pattern across the corpus:

- broad reports get narrowed into a smaller number of runtime/control-plane trackers
- many issues are closed as fixed on current main with commit evidence
- published releases often remain broken while main has partial or full fixes
- only the broadest starvation/stall/socket-instability trackers remain open

That pattern makes the stream look fragmented at first glance, but in aggregate it supports treating this as a release-band regression cluster.

Strong candidate umbrella / canonical trackers

If maintainers think this meta issue should instead collapse into existing umbrella issues, the strongest candidates appear to be:

- #72338
- #73532
- #73655
- #74135

Working hypothesis, stated cautiously

The most evidence-backed reading of the current corpus is:

2026.4.24 through 2026.4.26 introduced an overlapping runtime/bootstrap/control-plane regression cluster. WebSocket instability, probe timeout, Telegram churn, Slack socket disconnects, stuck sessions, delayed replies, and startup/channel failures are often downstream manifestations of that shared instability rather than isolated adapter-only bugs.

This is a mapping / synthesis hypothesis, not a claim that one exact single defect has been proven.

What would be useful next

To confirm or falsify this cluster framing, the most useful maintainer follow-up would likely be:

- identify whether the broadest remaining failures all still reproduce on a current post-`2026.4.26` build
- determine how much is explained by:
  - runtime-deps staging/repair
  - plugin manifest/registry load churn
  - readiness/probe timing
  - startup event-loop starvation
  - stuck-session recovery gaps
- decide whether one existing umbrella issue should own this cluster, or whether a dedicated meta tracker like this is useful

Notes

This synthesis intentionally omits private infrastructure details, local paths, hostnames, IPs, org names, and copied local logs. It is based on the public issue/comment stream and tries to connect patterns without overclaiming certainty.

extent analysis

TL;DR

Downgrade to a version prior to 2026.4.24 or wait for a fixed version to be released, as the correlated regression cluster introduced in 2026.4.24 through 2026.4.26 causes gateway startup, runtime, and control-plane stability issues.

Guidance

- Identify if the issues still reproduce on a current post-`2026.4.26` build to confirm the regression cluster.
- Investigate the role of runtime-deps staging/repair, plugin manifest/registry load churn, readiness/probe timing, startup event-loop starvation, and stuck-session recovery gaps in the failures.
- Consider using a dedicated meta tracker like this issue to connect patterns and follow up on the regression cluster.
- Review the strongest candidate umbrella trackers (#72338, #73532, #73655, #74135) for potential fixes or workarounds.

Example

No specific code snippet can be provided without more context, but reviewing the runtime-deps staging and plugin manifest loading code may help identify the root cause of the regression cluster.

Notes

The provided information is based on the public issue/comment stream and may not reflect the full scope of the issue. Further investigation is needed to confirm the root cause and develop a comprehensive fix.

Recommendation

Apply a workaround by downgrading to a version prior to 2026.4.24 until a fixed version is released, as the regression cluster affects gateway startup, runtime, and control-plane stability.
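
When auditing a fleet, a small check can flag installs that sit inside the affected release band. This is a sketch under assumptions: it treats versions as plain `year.month.patch` strings with no suffix, and `parseVersion`, `compareVersions`, and `isInAffectedBand` are hypothetical helpers rather than openclaw APIs.

```typescript
type Version = [number, number, number];

// Band boundaries taken from the issue text: 2026.4.24 through 2026.4.26.
const FIRST_AFFECTED: Version = [2026, 4, 24];
const LAST_AFFECTED: Version = [2026, 4, 26];

// Assumes a plain three-part numeric version; anything else yields null.
function parseVersion(raw: string): Version | null {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(raw.trim());
  if (!match) return null;
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: Version, b: Version): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

function isInAffectedBand(raw: string): boolean {
  const version = parseVersion(raw);
  if (version === null) return false; // unparseable versions are not flagged
  return (
    compareVersions(version, FIRST_AFFECTED) >= 0 &&
    compareVersions(version, LAST_AFFECTED) <= 0
  );
}
```

Being inside the band only marks an install as a candidate for the symptoms described in this issue; it does not show that any particular defect is present.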
