openclaw: Gateway WS handler CPU-spin starvation reproduces on Raspberry Pi 5 / ARM64 across 4.26 → 4.29; clean rollback to 4.23 every time [4 comments, 4 participants]

GitHub stats: openclaw/openclaw#75703 (fetched 2026-05-02 05:31:24)
Comments: 4 · Participants: 4 · Timeline events: 20 · Reactions: 2
Timeline (top): cross-referenced ×10, commented ×4, subscribed ×4, mentioned ×2


Summary

Every release of OpenClaw since 2026.4.24 produces the same boot-time WebSocket handler starvation on a Raspberry Pi 5 / ARM64 native systemd install. The gateway's [gateway] ready line fires, plugin/channel providers register, but the gateway pegs ~100% on a single CPU core and the WS handler on 127.0.0.1:18789 either never responds or responds intermittently with degraded latency. Internal subsystems (memory-core ↔ cron-service) cannot reach each other while the event loop is starved. The 2026.4.23 rollback is clean every time — within ~30s the gateway reports Connect: ok ~30 ms with idle CPU.

This was previously reported in comments on #73655 and #74323 (the latter closed as a duplicate of #73655). I am filing a dedicated tracking issue because the symptom now reproduces unchanged across three consecutive upgrade attempts on this host (4.26 on Node 22, 4.26 on Node 24, and 4.29 on Node 24; 4.27 was skipped after reading its release notes). That includes 4.29, which carries explicit Refs #73655 and Refs #72338 fixes intended to address it.

Environment

  • OpenClaw: tested 2026.4.26 (be8c246), 2026.4.29 (a448042). Last known-good: 2026.4.23 (a979721).
  • Hardware: Raspberry Pi 5 (8 GB), ARM64 (aarch64)
  • OS: Linux 6.12.62+rpt-rpi-2712 (64-bit, Debian-derived)
  • Node: tested both v22.22.2 and v24.15.0 (NodeSource apt). Symptom changes shape with Node version (see below) but does not go away.
  • Install: npm i -g openclaw into ~/.npm-global
  • Gateway: systemd user unit openclaw-gateway, managed with systemctl --user (loopback only, no remote, no reverse proxy)
  • Plugins: telegram (3 accounts), discord, memory-core, embedded acpx, browser-control. No Manifest plugin, no Lark, no Polymarket — different from #73655 reporter's stack.

Reproduction

# Pre-flight (every attempt)
cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.bak-pre-update
rm -rf ~/.openclaw/plugin-runtime-deps/openclaw-unknown-*
rm -rf ~/.openclaw/plugin-runtime-deps/openclaw-2026.4.<old>-*

# Update
openclaw update --yes
# → "Update Result: OK", doctor passes (~60-160s, varies)

# Health
openclaw gateway probe
# → "Connect: failed - timeout" repeatedly for 4-8+ minutes
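
To put a number on "repeatedly", the probe can be wrapped in a small loop that tallies results. This is a minimal sketch, not part of the original report: `probe_loop` is a hypothetical helper, and it matches on the "Connect: ok" text quoted in this issue because the probe's exit-status behavior is not described here.

```shell
#!/bin/sh
# probe_loop ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly and tallies how often its output contains "Connect: ok".
# The function body is a subshell "( ... )" so its variables stay local.
probe_loop() (
  attempts=$1
  interval=$2
  shift 2
  ok=0
  fail=0
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Match on the printed result rather than the exit status (assumption:
    # a successful probe prints a line containing "Connect: ok").
    if "$@" 2>&1 | grep -q 'Connect: ok'; then
      ok=$((ok + 1))
    else
      fail=$((fail + 1))
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "ok=$ok fail=$fail"
)
```

For example, `probe_loop 6 30 openclaw gateway probe` runs six probes 30 seconds apart and prints a single `ok=N fail=M` summary line that can be pasted into the issue.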

The gateway log shows:

[gateway] ready                                           (within 8s)
[plugins] embedded acpx runtime backend registered        (+12s)
[browser/server] Browser control listening on http://127.0.0.1:18791/   (+15s, this works)
[telegram] [<account>] starting provider                  (+30-180s, varies wildly)
[discord] [default] starting provider
[plugins] memory-core: managed dreaming cron could not be reconciled (cron service unavailable).   ← starvation marker
[ws] closed before connect ... code=1006                  (repeating, while CPU pegged)

Other observations during the hang:

  • ss -tln | grep :18789 shows the listener up with Recv-Q=1 (one queued connection unserviced).
  • top -bn1 -p $(MainPID) shows 90-280% CPU sustained.
  • curl -m3 -H 'Connection:Upgrade' -H 'Upgrade:websocket' -H 'Sec-WebSocket-Key: dGVzdA==' -H 'Sec-WebSocket-Version: 13' http://127.0.0.1:18789/ returns 0 bytes after 3s.
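
The Recv-Q check can be scripted so it is easy to sample over time. The sketch below is illustrative only: `recvq_for_port` is a hypothetical helper that assumes the usual Linux `ss -tln` column order (State, Recv-Q, Send-Q, Local Address:Port, ...). For a listening socket, Recv-Q is the number of connections waiting to be accepted, so a value that stays above 0 suggests the process is not calling accept.

```shell
#!/bin/sh
# recvq_for_port PORT
# Reads `ss -tln` output on stdin and prints the Recv-Q value of every
# LISTEN socket whose local address ends in ":PORT".
recvq_for_port() {
  awk -v port="$1" '$1 == "LISTEN" && $4 ~ (":" port "$") { print $2 }'
}

# Usage (18789 is the gateway WS port from this report):
#   ss -tln | recvq_for_port 18789
```

Running it every few seconds during the CPU peg would show whether the queued connection is ever drained or stays stuck.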

Symptom variations across attempts

Attempt | Versions            | Symptom
1       | Node 22.22.2 + 4.26 | WS handler completely dead from boot; recv-Q stuck; probe times out indefinitely (>5 min waited)
2       | Node 24.15.0 + 4.26 | WS handler intermittently responds after ~3 min; one probe succeeds with 435 ms RPC, next probe times out; CPU 100%+ pegged
3       | Node 24.15.0 + 4.29 | Same as #2; 8+ min after boot CPU still pegged 100%, 6/6 fresh probes timeout, internal cron-service unreachable

The 4.26 changelog #72720 fix ("skip CLI startup self-respawn for foreground gateway runs so low-memory Linux/Node 24 hosts start through the same path … without hanging before logs") does help — on Node 22 the gateway never reaches a state where any probe can succeed; on Node 24 it eventually serves a few. But that fix only addresses the observable liveness, not the underlying event-loop starvation.

What 4.29 changes (and what it does not)

What worked as documented:

  • Fixes #74692 (sqlite-vec mirrored into bundled-plugin runtime-deps) — verified after rollback that memory-graph stats returns the correct rows; vector extension loads fine. So the fix is good, just lands inside a release whose other regressions block the upgrade for this host.

What did not change the symptom on this hardware:

  • Refs #73655 (conservative stuck-session recovery that releases only stale session lanes…) — no observable effect on the boot-time CPU peg. The starvation appears to begin during plugin/runtime-deps materialization, before any session lane could be marked stuck.
  • Refs #72338 (subagents/runs.json file-signature caching) — no observable effect; CPU stays pegged for >8 min post-restart with the new cache in place.

Possible root cause angle

This is one host's read, not a diagnosis: the symptom looks less like leak (1) from #73655 (the Manifest EADDRINUSE retry) and more like a pre-session starvation source (runtime-deps materialization, mirror staging, model-pricing fetch, or plugin-discovery work) that owns the event loop during and after [gateway] ready. Two specific observations:

  • This host has no 2099/2100 Manifest listener, so leak (1) of #73655 cannot apply.
  • The cron service unavailable log line fires during the CPU peg, not after a queued reply. That suggests internal RPC contention rather than session-write-lock starvation.

Whatever it is, it seems to scale with something Pi 5-specific (slow disk? ARM64 JIT warm-up? lower core count than the M1/M2 reporters?), since the same 4.26 → 4.29 upgrade reportedly works on faster ARM64 Mac hosts (per #74323's macOS reporter, after they reverted to a working 4.26).

Workaround

npm i -g openclaw@2026.4.23
systemctl --user daemon-reload && systemctl --user restart openclaw-gateway
# Reachable in ~30s, RPC ~30ms, idle CPU 0%, RSS ~650MB. No data loss across the round-trip.

What I can capture next time

If a maintainer wants targeted evidence on the next attempt, I can capture any of:

  • OPENCLAW_GATEWAY_STARTUP_TRACE=1 per-phase timings and lookup-table counts
  • per-thread CPU samples on the gateway PID (top -H -p $PID)
  • signal-handler counts (grep Sig /proc/$PID/status, plus the MaxListenersExceededWarning mentioned in #73655 if it appears)
  • lsof -p $PID snapshot during the peg
  • a 30 s strace -c -p $PID sample to see which syscalls dominate
  • raw gateway log (StandardOutput=append: is configured)
  • any other capture you'd like

Happy to retry the upgrade on this host with capture instrumentation as soon as someone is in a position to consume the evidence. Holding on 4.23 until then.
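
Several of the captures listed above can be bundled into one helper so they are all taken during the same peg. This is a sketch under stated assumptions, not a tool from the project: `capture_peg` and its output file names are hypothetical, the `/proc` read is Linux-only, and attaching strace needs ptrace permission for the target process.

```shell
#!/bin/sh
# capture_peg PID OUTDIR [SECONDS]
# Snapshot diagnostics for a CPU-pegged process into OUTDIR.
# SECONDS is the strace sampling window (default 30; 0 skips strace).
# Missing tools are skipped and individual failures are ignored, so one
# failed capture does not abort the rest.
capture_peg() (
  pid=$1
  out=$2
  secs=${3:-30}
  mkdir -p "$out" || exit 1

  # Signal masks and queued-signal counters (Linux /proc only).
  grep '^Sig' "/proc/$pid/status" > "$out/signals.txt" 2>/dev/null || true

  # One batch-mode sample of per-thread CPU usage.
  if command -v top >/dev/null 2>&1; then
    top -H -b -n 1 -p "$pid" > "$out/threads.txt" 2>&1 || true
  fi

  # Open files and sockets at the time of the peg.
  if command -v lsof >/dev/null 2>&1; then
    lsof -p "$pid" > "$out/lsof.txt" 2>&1 || true
  fi

  # Per-syscall counts across all threads for the sampling window.
  # SIGINT makes strace detach and write its -c summary to the -o file.
  if [ "$secs" -gt 0 ] && command -v strace >/dev/null 2>&1 \
      && command -v timeout >/dev/null 2>&1; then
    timeout -s INT "$secs" strace -c -f -p "$pid" -o "$out/strace.txt" \
      > /dev/null 2>&1 || true
  fi

  echo "capture written to $out"
)

# Usage (unit name taken from the workaround commands in this report):
#   pid=$(systemctl --user show -p MainPID --value openclaw-gateway)
#   capture_peg "$pid" "$HOME/openclaw-peg-$(date +%s)" 30
```

Writing each capture to its own file keeps the results easy to attach to the issue individually.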


extent analysis

TL;DR

No fix is confirmed yet. The reporter suspects event-loop starvation caused by pre-session startup work (runtime-deps materialization, mirror staging, model-pricing fetch, or plugin discovery). Until that source is identified, rolling back to 2026.4.23 is the only remedy known to work on this host.

Guidance

  • Verify that the issue is not related to the Manifest plugin or other plugins by testing with a minimal plugin setup.
  • Investigate the runtime-deps materialization process and plugin-discovery work to see if they are causing the event-loop starvation.
  • Capture per-thread CPU samples on the gateway PID and signal-handler counts to gather more information about the issue.
  • Test the upgrade on a different hardware setup to see if the issue is specific to the Raspberry Pi 5 / ARM64 configuration.

Example

The issue does not include a code-level fix; the problem as reported concerns gateway startup behavior, system configuration, and plugin interactions.

Notes

The issue appears to be specific to the Raspberry Pi 5 / ARM64 configuration: the same upgrade reportedly works on faster ARM64 Mac hosts. The root cause of the event-loop starvation is still unknown and requires further investigation.

Recommendation

Roll back to 2026.4.23 until the root cause is identified and fixed. The rollback has worked on every attempt and restores normal gateway operation, whereas 4.26 and 4.29 have consistently reproduced the WebSocket handler starvation on this host.
