openclaw - ✅(Solved) Fix [Bug]: health reports event-loop degradation on tiny zero-delay sample while live probes are healthy [1 pull requests, 1 comments, 2 participants]

Q: Expected behavior

`openclaw health` should not mark the Gateway event loop degraded purely from utilization/cpu on a sub-second sample when event-loop delay is `0ms`, especially when the forced live probe path and `status --deep` report healthy metrics. At minimum, util/cpu-only degradation should probably require a sufficiently long sampling window, or the default health path should reuse a stable event-loop sample rather than generating a destructive short-window sample.

murdawkmedia · 2026-05-05T17:00:09Z




GitHub stats

Issue: openclaw/openclaw#77955 (fetched 2026-05-06 06:18:51)
Author: murdawkmedia
Participants: clawsweeper[bot], murdawkmedia
Timeline (top): closed ×1, commented ×1, cross-referenced ×1

On OpenClaw 2026.5.4, openclaw health and openclaw health --json can report the Gateway event loop as degraded for event_loop_utilization and cpu even though the same Gateway is healthy through live probe paths.

The degraded snapshot is based on a very short interval (~246ms) with zero event-loop delay (delayP99Ms=0, delayMaxMs=0) but utilization=1 and cpuCoreRatio≈1. openclaw health --verbose and openclaw status --deep both report normal event-loop delay/utilization/cpu immediately around the same time.

This looks like a sampler/cache artifact rather than actual Gateway event-loop degradation.

Root Cause

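The degraded report appears to be a sampler/cache artifact rather than actual Gateway event-loop degradation. The event-loop sample behind the default health snapshot seems to be read and reset over too short an interval (~250ms), so utilization and CPU can reach 1 and trip thresholds even though longer live-probe windows are healthy.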

Fix Action

Fixed

Fixed by PR: fix(gateway): stabilize event-loop health sampling (https://github.com/openclaw/openclaw/pull/77028)

PR fix notes

PR #77028: fix(gateway): stabilize event-loop health sampling

Repository: openclaw/openclaw
Author: rubencu
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/77028

Description (problem / solution / changelog)

Summary

Keep rapid follow-up health/readiness/status probes from classifying sub-second event-loop utilization and CPU samples as degraded.
Preserve event-loop health baselines until a full 1s sample window is available, so frequent polling cannot hide sustained utilization/CPU saturation.
Preserve the previous event-loop health snapshot during rapid follow-up probes, so frequent status/readiness polling does not drop the last sampled event-loop state.
Add focused gateway tests for short-window classification and cached-snapshot baseline retention.
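The baseline and cached-snapshot behavior described in this summary can be sketched roughly as follows. This is a minimal illustration, not the code from PR #77028: `createMonitor`, `Counters`, `Snapshot`, and the injected `readCounters` function are hypothetical names, and the only value taken from the PR text is the 1s window.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not OpenClaw's API.
type Counters = { atMs: number; activeMs: number; cpuMs: number };
type Snapshot = { intervalMs: number; utilization: number; cpuCoreRatio: number };

// The "full 1s sample window" from the PR summary.
const MIN_WINDOW_MS = 1000;

function createMonitor(readCounters: () => Counters) {
  let baseline = readCounters();
  let last: Snapshot | undefined;

  return {
    // Returns undefined until the first full window has elapsed.
    snapshot(): Snapshot | undefined {
      const current = readCounters();
      const intervalMs = current.atMs - baseline.atMs;
      // Rapid follow-up probe: keep the baseline and reuse the previous snapshot.
      if (intervalMs < MIN_WINDOW_MS) return last;
      last = {
        intervalMs,
        utilization: (current.activeMs - baseline.activeMs) / intervalMs,
        cpuCoreRatio: (current.cpuMs - baseline.cpuMs) / intervalMs,
      };
      // Advance the baseline only after a full window, so frequent polling
      // cannot split sustained saturation into ignorable fragments.
      baseline = current;
      return last;
    },
  };
}

// Demo with fake counters so the behavior is deterministic.
let counters: Counters = { atMs: 0, activeMs: 0, cpuMs: 0 };
const monitor = createMonitor(() => counters);
counters = { atMs: 1100, activeMs: 11, cpuMs: 22 };
const first = monitor.snapshot(); // full window: new snapshot
counters = { atMs: 1200, activeMs: 111, cpuMs: 122 };
const rapid = monitor.snapshot(); // 100 ms later: cached snapshot
console.log(rapid === first); // true
```

Injecting the counter reader keeps the sketch testable without a live event loop. A real monitor would presumably read from Node's event-loop and CPU counters instead.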

Verification

pnpm test src/gateway/server/event-loop-health.test.ts
pnpm exec oxfmt --check --threads=1 src/gateway/server/event-loop-health.ts src/gateway/server/event-loop-health.test.ts
git diff --check origin/main...HEAD
codex review --base origin/main

Surface

No new config surface.
User manual verification: owner performs final manual verification outside this agent run.
Broad validation: GitHub PR CI.

Real behavior proof

Behavior or issue addressed: rapid openclaw health, readiness, and status polling no longer lets sub-second CPU/event-loop utilization samples mark the Gateway degraded or erase the last sampled event-loop health state.
Real environment tested: isolated macOS OpenClaw source worktree on this PR branch, Node 22, pnpm, with a separate dev Gateway on loopback port 18790 and channel startup skipped.
Exact steps or command run after this patch: started the PR-branch Gateway with `OPENCLAW_HOME=/tmp/openclaw-eventloop-proof-77028 OPENCLAW_PROFILE=eventloop-proof OPENCLAW_GATEWAY_PORT=18790 OPENCLAW_SKIP_CHANNELS=1 pnpm gateway:watch:raw --dev --auth none`, then ran `pnpm --silent openclaw gateway health --json | jq -c '{ok,eventLoop}'` against that isolated Gateway. I also ran a direct `node --import tsx` runtime probe against `createGatewayEventLoopHealthMonitor()` to verify rapid sub-second snapshots reuse the prior full-window sample.
Evidence after fix: terminal output from the isolated PR-branch Gateway and the runtime monitor probe:

{"ok":true,"eventLoop":{"degraded":false,"reasons":[],"intervalMs":54719,"delayP99Ms":22.5,"delayMaxMs":36.4,"utilization":0.017,"cpuCoreRatio":0.013}}
{"ok":true,"eventLoop":{"degraded":false,"reasons":[],"intervalMs":2329,"delayP99Ms":24.3,"delayMaxMs":28.5,"utilization":0.495,"cpuCoreRatio":0.561}}
{"ok":true,"eventLoop":{"degraded":false,"reasons":[],"intervalMs":2171,"delayP99Ms":22.1,"delayMaxMs":22.1,"utilization":0.442,"cpuCoreRatio":0.445}}

{
  "first": {
    "degraded": false,
    "reasons": [],
    "intervalMs": 1102,
    "delayP99Ms": 22.3,
    "delayMaxMs": 22.3,
    "utilization": 0.006,
    "cpuCoreRatio": 0.013
  },
  "rapidSameObject": true,
  "rapidAfter100msSameObject": true,
  "rapid": {
    "degraded": false,
    "reasons": [],
    "intervalMs": 1102,
    "delayP99Ms": 22.3,
    "delayMaxMs": 22.3,
    "utilization": 0.006,
    "cpuCoreRatio": 0.013
  },
  "rapidAfter100ms": {
    "degraded": false,
    "reasons": [],
    "intervalMs": 1102,
    "delayP99Ms": 22.3,
    "delayMaxMs": 22.3,
    "utilization": 0.006,
    "cpuCoreRatio": 0.013
  }
}

Observed result after fix: the isolated PR-branch Gateway reports healthy event-loop snapshots, and the real monitor returns the same previous snapshot for immediate and 100ms follow-up probes instead of recomputing CPU/utilization from a sub-second interval. Real event-loop delay is still covered by the focused regression test and remains classified immediately.
What was not tested: full Control UI polling against the isolated Gateway; broad validation is left to GitHub PR CI.

Changed files

CHANGELOG.md (modified, +1/-1)
src/gateway/server/event-loop-health.test.ts (modified, +63/-1)
src/gateway/server/event-loop-health.ts (modified, +30/-20)


Bug type

Behavior bug / false-positive health diagnostic

Summary

This looks like a sampler/cache artifact rather than actual Gateway event-loop degradation.

Environment

OpenClaw: 2026.5.4 (325df3e)
OS: macOS 14.8.5 (23J423), x86_64
Node: v25.9.0
Install method: npm global, /usr/local/bin/openclaw
Gateway: macOS LaunchAgent, loopback 127.0.0.1:18789
Channel observed: Slack configured and OK
Workload at time of repro: no queued/running OpenClaw tasks

Steps to reproduce

Start/run the Gateway normally as a LaunchAgent.
Run openclaw health.
Run openclaw health --json.
Run openclaw health --verbose.
Run openclaw status --deep.

Expected behavior

openclaw health should not mark the Gateway event loop degraded purely from utilization/cpu on a sub-second sample when event-loop delay is 0ms, especially when the forced live probe path and status --deep report healthy metrics.

At minimum, util/cpu-only degradation should probably require a sufficiently long sampling window, or the default health path should reuse a stable event-loop sample rather than generating a destructive short-window sample.

Actual behavior

Default human output reports degraded:

$ openclaw health
Slack: configured
Gateway event loop: degraded reasons=event_loop_utilization,cpu max=0ms p99=0ms util=1 cpu=1.024
Agents: main (default), card-sherpa

JSON output reports the same degraded state:

{
  "ok": true,
  "durationMs": 411,
  "eventLoop": {
    "degraded": true,
    "reasons": ["event_loop_utilization", "cpu"],
    "intervalMs": 246,
    "delayP99Ms": 0,
    "delayMaxMs": 0,
    "utilization": 1,
    "cpuCoreRatio": 1.035
  }
}

But a forced live health probe is healthy:

$ openclaw health --verbose
Gateway connection:
  Gateway target: ws://127.0.0.1:18789
  Source: local loopback
  Bind: loopback
Slack: configured
Gateway event loop: ok max=114ms p99=43ms util=0.076 cpu=0.037

And status --deep is also healthy:

Health
Gateway    reachable  362ms
Event loop OK         healthy · max 115ms · p99 30ms · util 0.032 · cpu 0.036
Slack      OK         configured

Why this appears distinct from existing event-loop reports

I found several related issues with real event-loop or CPU saturation, for example plugin discovery/stat loops, stuck sessions, high CPU, and long RPC latency. Those reports usually include large event-loop delay, timeouts, or sustained high CPU.

This case has:

Gateway reachable
Slack OK
no active/queued work
default health showing delayMaxMs=0
forced live probe showing healthy p99/max/util/cpu
status --deep showing healthy p99/max/util/cpu

So the symptom is specifically the default health/event-loop diagnostic declaring degradation from a tiny util/cpu sample with zero event-loop delay.
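For triage, the signature described above can be checked mechanically against the `eventLoop` object from `openclaw health --json`. The helper below is a hypothetical sketch: the field and reason names come from the output quoted in this report, while `looksLikeShortWindowArtifact` and the 1000 ms cutoff (mirroring the 1s window mentioned in the linked PR) are assumptions.

```typescript
// Field names follow the `openclaw health --json` output quoted in this report.
type EventLoopHealth = {
  degraded: boolean;
  reasons: string[];
  intervalMs: number;
  delayP99Ms: number;
  delayMaxMs: number;
  utilization: number;
  cpuCoreRatio: number;
};

// Assumption: anything under one second counts as a "tiny" sample.
const SHORT_WINDOW_MS = 1000;

// Hypothetical helper: true when a degraded report matches the false-positive
// signature (short window, zero delay, utilization/cpu-only reasons).
function looksLikeShortWindowArtifact(h: EventLoopHealth): boolean {
  const utilOrCpuOnly =
    h.reasons.length > 0 &&
    h.reasons.every((r) => r === "event_loop_utilization" || r === "cpu");
  return (
    h.degraded &&
    utilOrCpuOnly &&
    h.intervalMs < SHORT_WINDOW_MS &&
    h.delayP99Ms === 0 &&
    h.delayMaxMs === 0
  );
}

// The degraded sample from this report.
const reported: EventLoopHealth = {
  degraded: true,
  reasons: ["event_loop_utilization", "cpu"],
  intervalMs: 246,
  delayP99Ms: 0,
  delayMaxMs: 0,
  utilization: 1,
  cpuCoreRatio: 1.035,
};
console.log(looksLikeShortWindowArtifact(reported)); // true
```

A report that fails this check (nonzero delay, or a window of a second or more) is more likely to reflect the real saturation seen in the related issues.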

Suspected area

The docs say default openclaw health asks the running Gateway for its health snapshot and may return a cached payload while refreshing in the background, while --verbose forces a live probe. This mismatch suggests the event-loop health sample used by the default health snapshot may be read/reset over too short an interval, allowing eventLoopUtilization/CPU to hit 1 over a ~250ms window and trip thresholds even though longer live-probe windows are healthy.
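The arithmetic behind this suspicion can be illustrated with a small sketch. It assumes, without confirmation from the source, that utilization and CPU ratio are computed as deltas between two counter readings divided by the wall-clock interval. `Reading` and `ratiosBetween` are hypothetical names and the numbers are made up for illustration.

```typescript
// Hypothetical counter reading; the Gateway's real sampler may differ.
type Reading = { wallMs: number; loopActiveMs: number; cpuMs: number };

function ratiosBetween(prev: Reading, curr: Reading) {
  const intervalMs = curr.wallMs - prev.wallMs;
  if (intervalMs <= 0) throw new RangeError("readings must be in chronological order");
  return {
    intervalMs,
    utilization: (curr.loopActiveMs - prev.loopActiveMs) / intervalMs,
    cpuCoreRatio: (curr.cpuMs - prev.cpuMs) / intervalMs,
  };
}

// An almost idle minute that ends with a 246 ms burst of work.
const idleStart: Reading = { wallMs: 0, loopActiveMs: 0, cpuMs: 0 };
const beforeBurst: Reading = { wallMs: 59754, loopActiveMs: 600, cpuMs: 600 };
const afterBurst: Reading = { wallMs: 60000, loopActiveMs: 846, cpuMs: 846 };

// If the baseline was reset just before the burst, the window sees only the burst.
console.log(ratiosBetween(beforeBurst, afterBurst).utilization); // 1
// Over the whole minute the same burst is negligible.
console.log(ratiosBetween(idleStart, afterBurst).utilization.toFixed(4)); // "0.0141"
```

Under this assumption, any read that resets the baseline shortly before the next read can produce a saturated-looking ratio from a brief burst, which is consistent with the `intervalMs: 246`, `utilization: 1` sample in this report.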

A possible fix would be one of:

ignore utilization/cpu-only degradation below a minimum sample interval,
require event-loop delay to be nonzero/significant for tiny intervals,
make event-loop health snapshot reads non-destructive,
or keep a rolling/cached event-loop health sample for health snapshots separate from readiness/background refresh probes.
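The first two options in this list could look roughly like the classifier below. It is a sketch under stated assumptions: all thresholds in `LIMITS` are illustrative rather than OpenClaw's real values, `classify` is a hypothetical function, and the `event_loop_delay` reason name is invented for the example (only `event_loop_utilization` and `cpu` appear in the reported output).

```typescript
type Sample = {
  intervalMs: number;
  delayP99Ms: number;
  delayMaxMs: number;
  utilization: number;
  cpuCoreRatio: number;
};

// Illustrative thresholds only; none of these values come from OpenClaw.
const LIMITS = {
  minRatioWindowMs: 1000, // ratios below this window length are ignored
  delayP99Ms: 250,
  utilization: 0.95,
  cpuCoreRatio: 0.9,
};

function classify(sample: Sample): { degraded: boolean; reasons: string[] } {
  const reasons: string[] = [];
  // Delay is meaningful at any window length, so it is always checked.
  if (sample.delayP99Ms > LIMITS.delayP99Ms) reasons.push("event_loop_delay");
  // Ratios over a tiny window are noisy, so they only count after a full window.
  if (sample.intervalMs >= LIMITS.minRatioWindowMs) {
    if (sample.utilization > LIMITS.utilization) reasons.push("event_loop_utilization");
    if (sample.cpuCoreRatio > LIMITS.cpuCoreRatio) reasons.push("cpu");
  }
  return { degraded: reasons.length > 0, reasons };
}

// The sample from this report: 246 ms window, zero delay, saturated ratios.
const issueSample: Sample = {
  intervalMs: 246,
  delayP99Ms: 0,
  delayMaxMs: 0,
  utilization: 1,
  cpuCoreRatio: 1.035,
};
console.log(classify(issueSample).degraded); // false
```

With this guard the reported sample is no longer classified as degraded, while the same ratios over a window of one second or longer still are, and high delay is flagged regardless of window length.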

extent analysis

TL;DR

The issue can be resolved by modifying the openclaw health command to ignore utilization/cpu-only degradation below a minimum sample interval or require event-loop delay to be nonzero for tiny intervals.

Guidance

Review the openclaw health command's implementation to understand how it calculates event-loop utilization and CPU usage, and consider adding a minimum sample interval threshold.
Investigate the possibility of making event-loop health snapshot reads non-destructive to prevent interference with live probe results.
Consider implementing a rolling/cached event-loop health sample for health snapshots to reduce the impact of short-term utilization spikes.
Verify that the forced live probe results are consistent with the expected behavior, and use this as a reference point for adjusting the default health snapshot calculation.

Example

No code snippet is provided as the issue description does not include specific implementation details.

Notes

The root cause of the issue appears to be related to the sampling interval used by the openclaw health command, which may be too short to accurately reflect the event-loop's health. The suggested fixes aim to address this by introducing a minimum sample interval or requiring nonzero event-loop delay for tiny intervals.

Recommendation

Have the Gateway's event-loop health check ignore utilization/cpu-only degradation below a minimum sample interval. This is more targeted and less invasive than rewriting the event-loop health snapshot mechanism, and it is consistent with the approach described in PR #77028, which keeps sub-second utilization and CPU samples from being classified as degraded.
