openclaw: `channels status` reports persistent false-positive `Gateway event loop degraded` on idle 5.6 (eventLoopDelayMaxMs=0 with utilization=1)



Summary

When openclaw channels status --json is invoked on a steady-state OpenClaw 5.6 installation (no real load, container CPU < 1%), the response payload's eventLoop block returns degraded: true with an internally inconsistent set of values. The formatter at dist/status-EujRAaez.js lines 92-93 then renders:

Gateway event loop degraded: reasons=event_loop_utilization,cpu eventLoopDelayMaxMs=0 eventLoopUtilization=1 cpuCoreRatio=<1.018-1.057>

The internal inconsistency: eventLoopDelayMaxMs=0 (no events were delayed) co-occurring with eventLoopUtilization=1 (loop was 100% busy) and cpuCoreRatio>1 (over one full core's capacity). Either:

  • The measurement window is too short to capture queueing delay (utilization is computed over current measurement interval; delayMaxMs requires a delay to be observed), OR
  • The utilization value is sticky/cached at peak from a prior interval, OR
  • The 5.4 sustained-window threshold fix was applied to the gateway's core event-loop alerting path but NOT to the channels.status payload-builder path

Repro

On a steady-state container with OpenClaw 2026.5.6 (boot 7.4s, idle Telegram polling, no active sessions):

docker exec <container> openclaw channels status

Expected on idle: clean output, no "Gateway event loop degraded" line.

Actual:

Gateway reachable.
Gateway event loop degraded: reasons=event_loop_utilization,cpu eventLoopDelayMaxMs=0 eventLoopUtilization=1 cpuCoreRatio=1.041
- Telegram default: enabled, configured, running, connected, in:48m ago, out:47m ago, mode:polling, bot:@AskSageAI_bot, ...

Production evidence

We sampled the 8 most recent hourly heartbeat session JSONL files from a running 5.6 container. Heartbeat polls channels status as part of its content gathering. Every one of the 8 fires, spanning roughly 6 hours, captured the same alert content:

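| File mtime UTC | eventLoopUtilization | cpuCoreRatio | eventLoopDelayMaxMs | Real container CPU (docker stats) |
|---|---|---|---|---|
| 07:14 | 1 | 1.057 | 0 | ~0.5% |
| 08:03 | 1 | 1.05 | 0 | ~0.5% |
| 09:03 | 1 | 1.018 | 0 | ~0.5% |
| 10:03 | 1 | 1.043 | 0 | ~0.5% |
| 11:03 | 1 | 1.045 | 0 | ~0.5% |
| 12:03 | 1 | 1.055 | 0 | ~0.5% |
| 12:15 | 1 | 1.05 | 0 | ~0.5% |
| 13:03 | 1 | 1.041 | 0 | ~0.5% |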

Container resource state during this window:

  • MEM 753.9MiB / 5GiB (15% of limit)
  • CPU 0.45%
  • PIDS 30
  • 1 idle telegram polling session, no active agent calls in the moments between the heartbeat fires

In contrast, the gateway's INTERNAL [diagnostic] liveness warning log entries during the same window report sane values: eventLoopP99Ms=39.6-96.7, eventLoopDelayMaxMs=1537-1773, eventLoopUtilization=0.189-0.227, cpuCoreRatio=0.215-0.257. The disparity is:

  • Gateway internal liveness monitor: 4 warnings in 15h, all with eventLoopUtilization=0.196-0.227 (low, normal)
  • channels status payload eventLoop block: 8 of 8 sampled fires show eventLoopUtilization=1.0 (saturated)

This suggests the 5.4 fix to the gateway's internal event-loop monitor (which appears to use a sustained-window threshold and is correctly reporting low values) was NOT applied to the payload.eventLoop builder used by channels.status. Two paths, two different sources of truth.

Hypothesised root cause (needs upstream confirmation)

The channels.status payload-builder appears to compute eventLoop.degraded, utilization, cpuCoreRatio from a different sampling window or different source than the internal liveness-monitor that the 5.4 fix touched. Likely paths to investigate (from grep on running container):

  • /usr/local/lib/node_modules/openclaw/dist/server-methods-BvQQUQsB.js (contains eventLoop: builder)
  • /usr/local/lib/node_modules/openclaw/dist/protocol-ByTcB0og.js (eventLoop schema)
  • /usr/local/lib/node_modules/openclaw/dist/status-EujRAaez.js line 92-93 (formatter consumer; not the builder)

The internal inconsistency eventLoopDelayMaxMs=0 + eventLoopUtilization=1 strongly suggests the degraded flag is being set without proper sustained-window filtering. A momentary 100% utilization spike (e.g., during boot-time module loading or during a telegram pollLoop tick) would set utilization=1 for the snapshot interval. Without a sustained-window threshold, degraded flips to true and the alert string is generated. The next call samples again, may catch the same ephemeral state, and the alert continues.
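The short-window hypothesis can be illustrated with plain arithmetic. The sketch below assumes utilization is derived as active / (idle + active) between two cumulative samples, which is how Node.js defines `performance.eventLoopUtilization()`. Whether the channels.status builder actually works this way is unconfirmed. `utilizationBetween` and all sample numbers are made up for illustration.

```javascript
// Illustrative only: utilization as a delta between two cumulative samples,
// each of the form { idle, active } in milliseconds.
function utilizationBetween(previous, current) {
  const idleDelta = current.idle - previous.idle;
  const activeDelta = current.active - previous.active;
  const elapsed = idleDelta + activeDelta;
  return elapsed > 0 ? activeDelta / elapsed : 0;
}

// Window spanning a full minute of a mostly idle loop: 12 s active, 48 s idle.
const longWindow = utilizationBetween(
  { idle: 0, active: 0 },
  { idle: 48000, active: 12000 }
);

// Window covering only a request handler's own synchronous work (3 ms):
// the loop never waits for events inside it, so idle time does not advance.
const shortWindow = utilizationBetween(
  { idle: 48000, active: 12000 },
  { idle: 48000, active: 12003 }
);

console.log(longWindow);  // 0.2
console.log(shortWindow); // 1
```

If the builder samples over a window like the second one, utilization=1 alongside delayMaxMs=0 would be the expected result on an idle gateway rather than a sign of saturation.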

Suggested patches

Fix A — apply 5.4 sustained-window threshold to channels.status payload-builder

If the 5.4 fix added a sustained-window check (e.g., "utilization must exceed threshold for N consecutive samples before degraded=true"), apply the same logic in payload.eventLoop construction so both paths use consistent semantics.
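A minimal sketch of what such a sustained-window check could look like, assuming the "N consecutive samples" semantics described above. `createSustainedDetector`, the 0.9 threshold and the 3-sample requirement are hypothetical; the actual 5.4 implementation is not shown in this issue.

```javascript
// Hypothetical sustained-window detector: reports degraded only after
// `requiredSamples` consecutive samples exceed `threshold`.
function createSustainedDetector({ threshold = 0.9, requiredSamples = 3 } = {}) {
  let consecutive = 0;
  return function observe(utilization) {
    // Any sample at or below the threshold resets the streak.
    consecutive = utilization > threshold ? consecutive + 1 : 0;
    return consecutive >= requiredSamples;
  };
}

const observe = createSustainedDetector({ threshold: 0.9, requiredSamples: 3 });
const results = [1, 0.2, 1, 1, 1, 0.3].map((u) => observe(u));
console.log(results); // [ false, false, false, false, true, false ]
```

A single utilization=1 snapshot never flips the flag here. Only the third consecutive saturated sample does, and one normal sample clears it again.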

Fix B — distinguish "instantaneous saturation" from "sustained degradation" in payload schema

 type EventLoopState = {
   degraded: boolean;
+  degradedSustained: boolean;  // true only after N consecutive samples over threshold
   reasons: string[];
   delayMaxMs: number;
   utilization: number;
   cpuCoreRatio: number;
+  sampleWindowMs: number;
 };

Have the formatter check degradedSustained before emitting the user-visible "Gateway event loop degraded" line. Keep degraded for ad-hoc instantaneous reporting if needed.
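The formatter-side gate for Fix B might look like the sketch below. `shouldShowDegradedLine` is a hypothetical helper, and the fallback to `degraded` for payloads without the new field is an assumption about how backward compatibility could be handled.

```javascript
// Sketch of a formatter-side gate for Fix B. `degradedSustained` is the
// proposed field; payloads that lack it fall back to `degraded`.
function shouldShowDegradedLine(state) {
  if (!state || typeof state !== "object") return false;
  if (typeof state.degradedSustained === "boolean") return state.degradedSustained;
  return state.degraded === true;
}

// Instantaneous spike only: no user-visible line.
console.log(shouldShowDegradedLine({ degraded: true, degradedSustained: false })); // false
// Sustained degradation: line is shown.
console.log(shouldShowDegradedLine({ degraded: true, degradedSustained: true })); // true
// Older payload without the new field: behaves as today.
console.log(shouldShowDegradedLine({ degraded: true })); // true
```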

Fix C — formatter-side filter

Cheapest patch (if internals can't be touched): in status-EujRAaez.js formatEventLoopBits, treat eventLoopDelayMaxMs=0 as a signal of insufficient sampling window and suppress the alert line entirely. Rationale: a real saturation event would queue at least one task, producing nonzero delayMaxMs. delayMaxMs=0 with utilization>=1 is structurally inconsistent and almost certainly a measurement artifact.

 function formatEventLoopBits(value) {
   if (!value || typeof value !== "object") return null;
   const record = value;
   if (record.degraded !== true) return null;
+  // Suppress measurement-artifact alerts: utilization near 1 (>= 0.95) with
+  // delayMaxMs=0 is structurally inconsistent (a saturated loop should produce
+  // queueing delay). Almost certainly a sub-window sampling artifact; don't
+  // surface it as a user alert.
+  if (record.delayMaxMs === 0 && record.utilization >= 0.95) return null;
   ...
 }

This is brittle (could mask real ultra-fast saturation) but immediate and low-risk.

Local mitigation (none ideal)

We don't have a clean local mitigation. Options considered:

  1. Configure heartbeat to filter the line from its content payload — possible via heartbeat content-filter config, but couples our config to upstream-bug shape; will break when fix lands. NOT applied.
  2. Edit node_modules/openclaw/dist/status-EujRAaez.js locally — banned per our Never edit node_modules rule (CLAUDE.md L46).
  3. Wait for 5.7+ — current path. Hourly noise tolerated; does not impact infrastructure correctness.
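For reference, option 1 could be kept narrow by matching only the artifact signature, so that an alert with a nonzero delay still gets through. This is a sketch of the filtering logic only. `stripArtifactAlerts` is a hypothetical function, and how it would be wired into the heartbeat content-filter config is not covered here.

```javascript
// Hypothetical content filter for mitigation option 1. It drops the alert line
// only when it carries the artifact signature (delay 0 with utilization 1).
function stripArtifactAlerts(statusText) {
  const isArtifact = (line) =>
    line.startsWith("Gateway event loop degraded:") &&
    /(^|\s)eventLoopDelayMaxMs=0(\s|$)/.test(line) &&
    /(^|\s)eventLoopUtilization=1(\s|$)/.test(line);
  return statusText
    .split("\n")
    .filter((line) => !isArtifact(line))
    .join("\n");
}

const idleOutput = [
  "Gateway reachable.",
  "Gateway event loop degraded: reasons=event_loop_utilization,cpu eventLoopDelayMaxMs=0 eventLoopUtilization=1 cpuCoreRatio=1.041",
].join("\n");
console.log(stripArtifactAlerts(idleOutput)); // Gateway reachable.
```

The coupling concern from option 1 still applies: once upstream fixes the payload, the filter becomes dead code and should be removed.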

Why this matters operationally

In our deployment, heartbeat fires hourly during 07-22 IST and forwards channels status output to Telegram as part of its content-gathering. The persistent Gateway event loop degraded line is included in the content the heartbeat agent summarizes. The agent often surfaces this as a user-visible alert ("HEARTBEAT ALERT: Gateway event loop degraded"). 9 such alerts in 24h is high noise for an operator and erodes trust in the heartbeat signal.

The infrastructure itself is fine — actual CPU/memory/event-loop health metrics are within normal range when measured externally (docker stats, gateway internal liveness logs). It's specifically the channels status payload-builder that's miscomputing.

Test cases to add upstream

  1. Unit test for formatEventLoopBits: input {degraded:true, delayMaxMs:0, utilization:1, cpuCoreRatio:1.04} → assert no alert (or assert it's flagged as measurement artifact)
  2. Integration test on idle gateway: start gateway, idle 60s, call channels status, assert no Gateway event loop degraded line
  3. Integration test on real load: start gateway, drive sustained load, assert Gateway event loop degraded DOES appear; stop load, assert it disappears within sustained-window threshold
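Test case 1 can be sketched without a test framework. The `formatEventLoopBits` below is a stand-in with the Fix C guard applied. The real function body in status-EujRAaez.js is not reproduced here, and the returned string is a simplified placeholder. `expectEqual` is a hypothetical helper.

```javascript
// Stand-in for the real formatter, with the Fix C guard applied.
function formatEventLoopBits(value) {
  if (!value || typeof value !== "object") return null;
  if (value.degraded !== true) return null;
  if (value.delayMaxMs === 0 && value.utilization >= 0.95) return null;
  return `Gateway event loop degraded: eventLoopDelayMaxMs=${value.delayMaxMs} eventLoopUtilization=${value.utilization}`;
}

// Minimal assertion helper so the sketch runs on its own.
function expectEqual(actual, expected, label) {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${expected}, got ${actual}`);
  }
}

// Test case 1: the artifact combination must not produce an alert.
expectEqual(
  formatEventLoopBits({ degraded: true, delayMaxMs: 0, utilization: 1, cpuCoreRatio: 1.04 }),
  null,
  "artifact suppressed"
);

// Counter-case: a degraded payload with observed delay must still alert.
const alertLine = formatEventLoopBits({ degraded: true, delayMaxMs: 1537, utilization: 0.97, cpuCoreRatio: 1.2 });
expectEqual(typeof alertLine, "string", "real degradation reported");
```

The counter-case matters as much as the suppression case: it guards against a filter that silences genuine degradation.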

Cross-reference

  • S31 close noted "I-12 channels status false-positive event-loop degraded" as a cosmetic upstream bug worth filing; S32 collected the empirical evidence above.
  • 5.4 changelog mentions "Fix A event-loop sustained-window threshold" — that's the gateway core path. This issue is about the channels.status payload-builder path, which appears not to have received the same fix.
  • Related but separate: S31 noted a stale phase=channels.telegram.start-account label persists in liveness warnings 90+ min post-boot; that's a different cosmetic logging bug and merits its own draft.
