openclaw - 💡(How to fix) Fix Gateway WS handler CPU-spin starvation reproduces on Raspberry Pi 5 / ARM64 across 4.26 → 4.29; clean rollback to 4.23 every time [4 comments, 4 participants]

openclaw 2026-05-01 15:08:07

GitHub stats

openclaw/openclaw#75703 • Fetched 2026-05-02 05:31:24

Timeline (top): cross-referenced ×10, commented ×4, subscribed ×4, mentioned ×2

Every release of OpenClaw since 2026.4.24 produces the same boot-time WebSocket handler starvation on a Raspberry Pi 5 / ARM64 native systemd install. The gateway's [gateway] ready line fires, plugin/channel providers register, but the gateway pegs ~100% on a single CPU core and the WS handler on 127.0.0.1:18789 either never responds or responds intermittently with degraded latency. Internal subsystems (memory-core ↔ cron-service) cannot reach each other while the event loop is starved. The 2026.4.23 rollback is clean every time — within ~30s the gateway reports Connect: ok ~30 ms with idle CPU.

This was previously reported via comments on #73655 and #74323 (the latter closed as a duplicate of #73655). I am filing a dedicated tracking issue because the symptom now reproduces unchanged across three consecutive upgrade attempts on this host (4.26; 4.27, skipped after reading the release notes; 4.29), including 4.29, which carries explicit Refs #73655 and Refs #72338 fixes intended to address it.


Summary

Environment

OpenClaw: tested 2026.4.26 (be8c246), 2026.4.29 (a448042). Last known-good: 2026.4.23 (a979721).
Hardware: Raspberry Pi 5 (8 GB), ARM64 (aarch64)
OS: Linux 6.12.62+rpt-rpi-2712 (64-bit, Debian-derived)
Node: tested both v22.22.2 and v24.15.0 (NodeSource apt). Symptom changes shape with Node version (see below) but does not go away.
Install: npm i -g openclaw into ~/.npm-global
Gateway: systemctl --user openclaw-gateway (loopback only, no remote, no reverse proxy)
Plugins: telegram (3 accounts), discord, memory-core, embedded acpx, browser-control. No Manifest plugin, no Lark, no Polymarket — different from #73655 reporter's stack.

Reproduction

# Pre-flight (every attempt)
cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.bak-pre-update
rm -rf ~/.openclaw/plugin-runtime-deps/openclaw-unknown-*
rm -rf ~/.openclaw/plugin-runtime-deps/openclaw-2026.4.<old>-*

# Update
openclaw update --yes
# → "Update Result: OK", doctor passes (~60-160s, varies)

# Health
openclaw gateway probe
# → "Connect: failed - timeout" repeatedly for 4-8+ minutes

The gateway log shows:

[gateway] ready                                           (within 8s)
[plugins] embedded acpx runtime backend registered        (+12s)
[browser/server] Browser control listening on http://127.0.0.1:18791/   (+15s, this works)
[telegram] [<account>] starting provider                  (+30-180s, varies wildly)
[discord] [default] starting provider
[plugins] memory-core: managed dreaming cron could not be reconciled (cron service unavailable).   ← starvation marker
[ws] closed before connect ... code=1006                  (repeating, while CPU pegged)
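
The two lines that recur during the peg can be counted mechanically. The following is a minimal sketch, assuming the gateway log is available as a plain file; `starvation_markers` is a hypothetical helper name, and the match strings are taken from the excerpt above.

```shell
# Hypothetical helper: count the starvation markers from the report in a log file.
# $1 is the path to the gateway log (wherever StandardOutput=append: writes it).
starvation_markers() {
  local log="$1"
  # grep -c prints 0 when nothing matches, so both lines are always emitted.
  printf 'cron_unavailable=%s\n' "$(grep -c 'cron service unavailable' "$log")"
  printf 'ws_1006=%s\n' "$(grep -c 'closed before connect.*code=1006' "$log")"
}
```

A ws_1006 count that keeps rising across repeated runs while CPU stays pegged would match the pattern described here; a count that stops growing would suggest the handler has recovered.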

ss -tln | grep :18789 shows the listener up with Recv-Q=1 (one queued connection unserviced).
top -bn1 -p <MainPID> (the gateway unit's main PID) shows 90-280% CPU sustained.
curl -m3 -H 'Connection:Upgrade' -H 'Upgrade:websocket' -H 'Sec-WebSocket-Key: dGVzdA==' -H 'Sec-WebSocket-Version: 13' http://127.0.0.1:18789/ returns 0 bytes after 3s.
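
For a listening socket, Recv-Q in ss output is the number of connections waiting to be accepted, so a non-zero value that persists means the process is not draining its accept queue. A small sketch for extracting that number follows; `recvq_for_port` is a hypothetical helper, and it assumes the usual `ss -tln` column order (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port).

```shell
# Hypothetical helper: print Recv-Q for a listening TCP port.
# Reads `ss -tln` output on stdin; $1 is the port number.
recvq_for_port() {
  # Field 4 is Local Address:Port, field 2 is Recv-Q (assumed column order).
  awk -v port="$1" '$4 ~ (":" port "$") { print $2 }'
}

# Usage on the affected host (18789 is the gateway port from the report):
#   ss -tln | recvq_for_port 18789
```

Sampling this every few seconds gives a cheap signal that does not itself open a connection to the starved handler.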

Symptom variations across attempts

Attempt	Versions	Symptom
1	Node 22.22.2 + 4.26	WS handler completely dead from boot; Recv-Q stuck; probe times out indefinitely (>5 min waited)
2	Node 24.15.0 + 4.26	WS handler responds intermittently after ~3 min; one probe succeeds with 435 ms RPC, the next probe times out; CPU pegged at 100%+
3	Node 24.15.0 + 4.29	Same as attempt 2; 8+ min after boot CPU still pegged at 100%, 6/6 fresh probes time out, internal cron-service unreachable
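
Figures like "6/6 fresh probes" are easier to compare across attempts when they come from a fixed loop. This is a sketch only: `probe_tally` and the `PROBE_CMD` override are hypothetical names introduced here, and the success string "Connect: ok" is taken from the probe output quoted in this report.

```shell
# Hypothetical helper: run the probe N times and report how many succeeded.
# PROBE_CMD is an illustrative override so the loop can be tried without a gateway.
probe_tally() {
  local n="$1" ok=0 i=1
  while [ "$i" -le "$n" ]; do
    # Success prints "Connect: ok"; a starved gateway prints "Connect: failed - timeout".
    if ${PROBE_CMD:-openclaw gateway probe} 2>&1 | grep -q 'Connect: ok'; then
      ok=$((ok + 1))
    fi
    i=$((i + 1))
  done
  echo "$ok/$n probes ok"
}
```

Running `probe_tally 6` at a fixed offset after each restart would make the three attempts in the table directly comparable.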

The 4.26 changelog fix for #72720 ("skip CLI startup self-respawn for foreground gateway runs so low-memory Linux/Node 24 hosts start through the same path … without hanging before logs") does help: on Node 22 the gateway never reaches a state where any probe can succeed, while on Node 24 it eventually serves a few. But that fix only improves observable liveness; it does not address the underlying event-loop starvation.

What 4.29 changes (and what it does not)

What worked as documented:

Fixes #74692 (sqlite-vec mirrored into bundled-plugin runtime-deps): verified after rollback that memory-graph stats returns the correct rows and the vector extension loads fine. The fix is good; it just lands inside a release whose other regressions block the upgrade on this host.

What did not change the symptom on this hardware:

Refs #73655 (conservative stuck-session recovery that releases only stale session lanes…) — no observable effect on the boot-time CPU peg. The starvation appears to begin during plugin/runtime-deps materialization, before any session lane could be marked stuck.
Refs #72338 (subagents/runs.json file-signature caching) — no observable effect; CPU stays pegged for >8 min post-restart with the new cache in place.

Possible root cause angle

This is one host's read, not a diagnosis: the symptom looks less like leak (1) of #73655 (the Manifest EADDRINUSE retry) and more like a pre-session starvation source (runtime-deps materialization, mirror staging, model-pricing fetch, or plugin-discovery work) that owns the event loop during and after [gateway] ready. Two specific observations:

This host has no 2099/2100 Manifest listener, so leak (1) of #73655 cannot apply.
The cron service unavailable log line fires during the CPU peg, not after a queued reply. That suggests internal RPC contention rather than session-write-lock starvation.

Whatever it is, it scales with something Pi 5-specific (slow disk? ARM64 JIT warm-up? lower core count than the M1/M2 reporters?), since the same 4.26 → 4.29 upgrade reportedly works on faster ARM64 Mac hosts (per #74323's macOS reporter, after they reverted to a working 4.26).

Workaround

npm i -g openclaw@2026.4.23
systemctl --user daemon-reload && systemctl --user restart openclaw-gateway
# Reachable in ~30s, RPC ~30ms, idle CPU 0%, RSS ~650MB. No data loss across the round-trip.
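
To confirm the "reachable in ~30s" figure after a rollback, the probe can be polled until it succeeds. The following is a sketch under stated assumptions: `wait_for_gateway` and `PROBE_CMD` are hypothetical names, and "Connect: ok" is the success string quoted in this report.

```shell
# Hypothetical helper: poll the probe until it succeeds or a deadline passes.
# $1 = deadline in seconds, $2 = polling interval in seconds (default 5).
wait_for_gateway() {
  local deadline="$1" interval="${2:-5}" waited=0
  while [ "$waited" -lt "$deadline" ]; do
    if ${PROBE_CMD:-openclaw gateway probe} 2>&1 | grep -q 'Connect: ok'; then
      echo "reachable after ~${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "still unreachable after ${deadline}s" >&2
  return 1
}
```

The reported time counts only the sleeps between polls, so it understates the real elapsed time by however long each probe takes to return.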

What I can capture next time

If a maintainer wants targeted evidence on the next attempt, I can capture any of:

OPENCLAW_GATEWAY_STARTUP_TRACE=1 per-phase timings and lookup-table counts
per-thread CPU samples on the gateway PID (top -H -p $PID)
signal-handler counts (grep Sig /proc/$PID/status, plus the MaxListenersExceededWarning mentioned in #73655 if it appears)
lsof -p $PID snapshot during the peg
strace -c -p $PID over a 30s sample to see which syscalls dominate
raw gateway log (StandardOutput=append: is configured)
any other capture you'd like
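
Several of the captures above can be bundled into one snapshot so they are taken at the same moment during the peg. This is a minimal sketch, not a maintainer-endorsed procedure: `capture_peg` and its output file names are hypothetical, it skips any tool that is not installed, and it does not cover OPENCLAW_GATEWAY_STARTUP_TRACE or the raw gateway log.

```shell
# Hypothetical helper: snapshot evidence for one PID into a directory.
# $1 = PID, $2 = output directory, $3 = strace sample length in seconds (default 30).
capture_peg() {
  local pid="$1" out="$2" secs="${3:-30}"
  mkdir -p "$out"
  # Signal masks (SigQ, SigPnd, SigBlk, SigIgn, SigCgt) from procfs.
  grep '^Sig' "/proc/$pid/status" > "$out/signals.txt" 2>/dev/null || true
  # Per-thread CPU, single batch-mode iteration.
  if command -v top >/dev/null 2>&1; then
    top -H -b -n1 -p "$pid" > "$out/threads.txt" 2>&1 || true
  fi
  # Open files and sockets at the time of the peg.
  if command -v lsof >/dev/null 2>&1; then
    lsof -p "$pid" > "$out/lsof.txt" 2>&1 || true
  fi
  # Syscall histogram across all threads; strace writes its summary on detach.
  if command -v strace >/dev/null 2>&1 && command -v timeout >/dev/null 2>&1; then
    timeout "$secs" strace -c -f -p "$pid" -o "$out/strace.txt" 2>/dev/null || true
  fi
}
```

Attaching strace to a running process may require elevated privileges or a relaxed ptrace policy on the host; when the attach is refused, the other three files are still written.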

Happy to retry the upgrade on this host with capture instrumentation as soon as someone is in a position to consume the evidence. Holding on 4.23 until then.

Cross-references

#73655 — Gateway leak triad, root-cause hypothesis (still open). My prior 4.26 reproduction comment: https://github.com/openclaw/openclaw/issues/73655#issuecomment-4343520517 . 4.29 reproduction comment: https://github.com/openclaw/openclaw/issues/73655#issuecomment-4359928329
#74323 — closed as duplicate of #73655.
#72338 — earlier CPU-spin pattern (filed against 4.24).

extent analysis

TL;DR

The most likely fix for the WebSocket handler starvation issue is to identify and address the root cause of the event-loop starvation, which may be related to runtime-deps materialization, mirror staging, model-pricing fetch, or plugin-discovery work.

Guidance

Verify that the issue is not related to the Manifest plugin or other plugins by testing with a minimal plugin setup.
Investigate the runtime-deps materialization process and plugin-discovery work to see if they are causing the event-loop starvation.
Capture per-thread CPU samples on the gateway PID and signal-handler counts to gather more information about the issue.
Test the upgrade on a different hardware setup to see if the issue is specific to the Raspberry Pi 5 / ARM64 configuration.

Example

No code snippet is provided as the issue is more related to system configuration and plugin interactions.

Notes

The issue seems to be specific to the Raspberry Pi 5 / ARM64 configuration, and the same upgrade reportedly works on faster-ARM64 mac hosts. The root cause of the event-loop starvation is still unknown and requires further investigation.

Recommendation

Apply the workaround by rolling back to version 2026.4.23 until the root cause of the issue is identified and addressed. This is because the workaround has been proven to work and allows the gateway to function normally, whereas the newer versions have consistently shown the WebSocket handler starvation issue.
