openclaw - ✅(Solved) Fix [Bug]: health reports event-loop degradation on tiny zero-delay sample while live probes are healthy [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#77955 · Fetched 2026-05-06 06:18:51
Comments: 1 · Participants: 2 · Timeline: 3 · Reactions: 2
Timeline (top): closed ×1, commented ×1, cross-referenced ×1

On OpenClaw 2026.5.4, openclaw health and openclaw health --json can report the Gateway event loop as degraded for event_loop_utilization and cpu even though the same Gateway is healthy through live probe paths.

The degraded snapshot is based on a very short interval (~246ms) with zero event-loop delay (delayP99Ms=0, delayMaxMs=0) but utilization=1 and cpuCoreRatio≈1. openclaw health --verbose and openclaw status --deep both report normal event-loop delay/utilization/cpu immediately around the same time.

This looks like a sampler/cache artifact rather than actual Gateway event-loop degradation.

Root Cause

The degraded reading appears to be a sampler/cache artifact rather than actual Gateway event-loop degradation. The default openclaw health snapshot was computed over a very short interval (~246ms). Over that window utilization and cpuCoreRatio reached ≈1 and tripped the event_loop_utilization and cpu reasons, even though event-loop delay was zero (delayP99Ms=0, delayMaxMs=0) and both openclaw health --verbose and openclaw status --deep reported normal delay/utilization/cpu at about the same time.

Fix Action

Fixed

PR fix notes

PR #77028: fix(gateway): stabilize event-loop health sampling

Description (problem / solution / changelog)

Summary

  • Keep rapid follow-up health/readiness/status probes from classifying sub-second event-loop utilization and CPU samples as degraded.
  • Preserve event-loop health baselines until a full 1s sample window is available, so frequent polling cannot hide sustained utilization/CPU saturation.
  • Preserve the previous event-loop health snapshot during rapid follow-up probes, so frequent status/readiness polling does not drop the last sampled event-loop state.
  • Add focused gateway tests for short-window classification and cached-snapshot baseline retention.
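The behavior described in these bullets can be sketched as a small monitor that only recomputes utilization/CPU once a full window has elapsed, and otherwise returns the previous snapshot without moving its baseline. This is a minimal illustration under stated assumptions, not the code from PR #77028: LoopCounters, LoopSnapshot, createSketchMonitor, and the injected readCounters function are hypothetical names. Only the 1s window comes from the PR summary.

```typescript
// Hypothetical shapes for illustration; not the real event-loop-health.ts API.
interface LoopCounters {
  nowMs: number; // monotonic clock reading, in ms
  activeMs: number; // cumulative event-loop active time, in ms
  cpuMs: number; // cumulative process CPU time (user + system), in ms
}

interface LoopSnapshot {
  intervalMs: number;
  utilization: number;
  cpuCoreRatio: number;
}

// "Full 1s sample window" is taken from the PR summary.
const FULL_WINDOW_MS = 1000;

function createSketchMonitor(readCounters: () => LoopCounters) {
  let baseline = readCounters();
  let last: LoopSnapshot | undefined;

  return {
    snapshot(): LoopSnapshot | undefined {
      const current = readCounters();
      const intervalMs = current.nowMs - baseline.nowMs;
      if (intervalMs < FULL_WINDOW_MS) {
        // Rapid follow-up probe: keep the baseline and reuse the last snapshot.
        return last;
      }
      last = {
        intervalMs,
        utilization: (current.activeMs - baseline.activeMs) / intervalMs,
        cpuCoreRatio: (current.cpuMs - baseline.cpuMs) / intervalMs,
      };
      baseline = current; // advance the baseline only after a full window
      return last;
    },
  };
}

// Usage with simulated counters instead of real process metrics.
let simulated: LoopCounters = { nowMs: 0, activeMs: 0, cpuMs: 0 };
const sketch = createSketchMonitor(() => simulated);

simulated = { nowMs: 1100, activeMs: 11, cpuMs: 22 };
const firstSnapshot = sketch.snapshot(); // full window, mostly idle

simulated = { nowMs: 1200, activeMs: 111, cpuMs: 122 };
const rapidSnapshot = sketch.snapshot(); // 100ms later: same object as firstSnapshot

simulated = { nowMs: 2200, activeMs: 1111, cpuMs: 1122 };
const saturatedSnapshot = sketch.snapshot(); // busy for the whole 1100ms since the baseline
```

Because the rapid probe does not reset the baseline, the busy time it skipped is still counted in the next full-window sample, which is how frequent polling is kept from hiding sustained saturation.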

Verification

  • pnpm test src/gateway/server/event-loop-health.test.ts
  • pnpm exec oxfmt --check --threads=1 src/gateway/server/event-loop-health.ts src/gateway/server/event-loop-health.test.ts
  • git diff --check origin/main...HEAD
  • codex review --base origin/main

Surface

  • No new config surface.
  • User manual verification: owner performs final manual verification outside this agent run.
  • Broad validation: GitHub PR CI.

Real behavior proof

  • Behavior or issue addressed: rapid openclaw health, readiness, and status polling no longer lets sub-second CPU/event-loop utilization samples mark the Gateway degraded or erase the last sampled event-loop health state.
  • Real environment tested: isolated macOS OpenClaw source worktree on this PR branch, Node 22, pnpm, with a separate dev Gateway on loopback port 18790 and channel startup skipped.
  • Exact steps or command run after this patch: started the PR-branch Gateway with OPENCLAW_HOME=/tmp/openclaw-eventloop-proof-77028 OPENCLAW_PROFILE=eventloop-proof OPENCLAW_GATEWAY_PORT=18790 OPENCLAW_SKIP_CHANNELS=1 pnpm gateway:watch:raw --dev --auth none, then ran pnpm --silent openclaw gateway health --json | jq -c '{ok,eventLoop}' against that isolated Gateway. I also ran a direct node --import tsx runtime probe against createGatewayEventLoopHealthMonitor() to verify rapid sub-second snapshots reuse the prior full-window sample.
  • Evidence after fix: terminal output from the isolated PR-branch Gateway and the runtime monitor probe:
{"ok":true,"eventLoop":{"degraded":false,"reasons":[],"intervalMs":54719,"delayP99Ms":22.5,"delayMaxMs":36.4,"utilization":0.017,"cpuCoreRatio":0.013}}
{"ok":true,"eventLoop":{"degraded":false,"reasons":[],"intervalMs":2329,"delayP99Ms":24.3,"delayMaxMs":28.5,"utilization":0.495,"cpuCoreRatio":0.561}}
{"ok":true,"eventLoop":{"degraded":false,"reasons":[],"intervalMs":2171,"delayP99Ms":22.1,"delayMaxMs":22.1,"utilization":0.442,"cpuCoreRatio":0.445}}
{
  "first": {
    "degraded": false,
    "reasons": [],
    "intervalMs": 1102,
    "delayP99Ms": 22.3,
    "delayMaxMs": 22.3,
    "utilization": 0.006,
    "cpuCoreRatio": 0.013
  },
  "rapidSameObject": true,
  "rapidAfter100msSameObject": true,
  "rapid": {
    "degraded": false,
    "reasons": [],
    "intervalMs": 1102,
    "delayP99Ms": 22.3,
    "delayMaxMs": 22.3,
    "utilization": 0.006,
    "cpuCoreRatio": 0.013
  },
  "rapidAfter100ms": {
    "degraded": false,
    "reasons": [],
    "intervalMs": 1102,
    "delayP99Ms": 22.3,
    "delayMaxMs": 22.3,
    "utilization": 0.006,
    "cpuCoreRatio": 0.013
  }
}
  • Observed result after fix: the isolated PR-branch Gateway reports healthy event-loop snapshots, and the real monitor returns the same previous snapshot for immediate and 100ms follow-up probes instead of recomputing CPU/utilization from a sub-second interval. Real event-loop delay is still covered by the focused regression test and remains classified immediately.
  • What was not tested: full Control UI polling against the isolated Gateway; broad validation is left to GitHub PR CI.

Changed files

  • CHANGELOG.md (modified, +1/-1)
  • src/gateway/server/event-loop-health.test.ts (modified, +63/-1)
  • src/gateway/server/event-loop-health.ts (modified, +30/-20)


Bug type

Behavior bug / false-positive health diagnostic

Environment

  • OpenClaw: 2026.5.4 (325df3e)
  • OS: macOS 14.8.5 (23J423), x86_64
  • Node: v25.9.0
  • Install method: npm global, /usr/local/bin/openclaw
  • Gateway: macOS LaunchAgent, loopback 127.0.0.1:18789
  • Channel observed: Slack configured and OK
  • Workload at time of repro: no queued/running OpenClaw tasks

Steps to reproduce

  1. Start/run the Gateway normally as a LaunchAgent.
  2. Run openclaw health.
  3. Run openclaw health --json.
  4. Run openclaw health --verbose.
  5. Run openclaw status --deep.

Expected behavior

openclaw health should not mark the Gateway event loop degraded purely from utilization/cpu on a sub-second sample when event-loop delay is 0ms, especially when the forced live probe path and status --deep report healthy metrics.

At minimum, util/cpu-only degradation should probably require a sufficiently long sampling window, or the default health path should reuse a stable event-loop sample rather than generating a destructive short-window sample.
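One way to express this expectation is a classifier that always evaluates event-loop delay but only evaluates utilization and CPU once the sample window is long enough. The sketch below is illustrative only: the threshold values, the event_loop_delay reason string, and the names LoopSample and classifySample are assumptions, since the issue does not state OpenClaw's actual limits. The event_loop_utilization and cpu reason strings are taken from the reported output.

```typescript
// Field names mirror the eventLoop object in the reported JSON output.
interface LoopSample {
  intervalMs: number;
  delayP99Ms: number;
  delayMaxMs: number;
  utilization: number;
  cpuCoreRatio: number;
}

// "event_loop_delay" is a hypothetical reason name; the other two appear in the issue.
type SketchReason = "event_loop_delay" | "event_loop_utilization" | "cpu";

// All limits below are placeholders, not OpenClaw's real thresholds.
const UTIL_CPU_MIN_WINDOW_MS = 1000;
const UTILIZATION_LIMIT = 0.9;
const CPU_CORE_RATIO_LIMIT = 0.9;
const DELAY_P99_LIMIT_MS = 500;

function classifySample(sample: LoopSample): { degraded: boolean; reasons: SketchReason[] } {
  const reasons: SketchReason[] = [];

  // Real event-loop delay is meaningful even on a short window, so check it always.
  if (sample.delayP99Ms > DELAY_P99_LIMIT_MS) {
    reasons.push("event_loop_delay");
  }

  // Utilization and CPU ratios are noisy on tiny windows; require a full window.
  if (sample.intervalMs >= UTIL_CPU_MIN_WINDOW_MS) {
    if (sample.utilization > UTILIZATION_LIMIT) reasons.push("event_loop_utilization");
    if (sample.cpuCoreRatio > CPU_CORE_RATIO_LIMIT) reasons.push("cpu");
  }

  return { degraded: reasons.length > 0, reasons };
}

// The sample from this report: 246ms window, zero delay, util=1, cpu≈1.
const reportedSample: LoopSample = {
  intervalMs: 246,
  delayP99Ms: 0,
  delayMaxMs: 0,
  utilization: 1,
  cpuCoreRatio: 1.035,
};
const reportedVerdict = classifySample(reportedSample); // not degraded under this rule
```

The same utilization and CPU values over a window of one second or more would still be flagged, so the rule narrows the false positive without disabling the check.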

Actual behavior

Default human output reports degraded:

$ openclaw health
Slack: configured
Gateway event loop: degraded reasons=event_loop_utilization,cpu max=0ms p99=0ms util=1 cpu=1.024
Agents: main (default), card-sherpa

JSON output reports the same degraded state:

{
  "ok": true,
  "durationMs": 411,
  "eventLoop": {
    "degraded": true,
    "reasons": ["event_loop_utilization", "cpu"],
    "intervalMs": 246,
    "delayP99Ms": 0,
    "delayMaxMs": 0,
    "utilization": 1,
    "cpuCoreRatio": 1.035
  }
}

But a forced live health probe is healthy:

$ openclaw health --verbose
Gateway connection:
  Gateway target: ws://127.0.0.1:18789
  Source: local loopback
  Bind: loopback
Slack: configured
Gateway event loop: ok max=114ms p99=43ms util=0.076 cpu=0.037

And status --deep is also healthy:

Health
Gateway    reachable  362ms
Event loop OK         healthy · max 115ms · p99 30ms · util 0.032 · cpu 0.036
Slack      OK         configured

Why this appears distinct from existing event-loop reports

I found several related issues with real event-loop or CPU saturation, for example plugin discovery/stat loops, stuck sessions, high CPU, and long RPC latency. Those reports usually include large event-loop delay, timeouts, or sustained high CPU.

This case has:

  • Gateway reachable
  • Slack OK
  • no active/queued work
  • default health showing delayMaxMs=0
  • forced live probe showing healthy p99/max/util/cpu
  • status --deep showing healthy p99/max/util/cpu

So the symptom is specifically the default health/event-loop diagnostic declaring degradation from a tiny util/cpu sample with zero event-loop delay.
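For triage, the pattern described above can be checked mechanically against the output of openclaw health --json. The helper below is a sketch, not part of OpenClaw: the HealthPayloadSketch type is inferred from the JSON shown in this report, looksLikeShortWindowArtifact is a hypothetical name, and the 1000ms cutoff is an assumed default.

```typescript
// Shape inferred from the JSON in this report; the real payload may carry more fields.
interface HealthPayloadSketch {
  ok: boolean;
  durationMs?: number;
  eventLoop: {
    degraded: boolean;
    reasons: string[];
    intervalMs: number;
    delayP99Ms: number;
    delayMaxMs: number;
    utilization: number;
    cpuCoreRatio: number;
  };
}

// True when degradation is claimed only for utilization/cpu, on a short window,
// with no measured event-loop delay at all.
function looksLikeShortWindowArtifact(payload: HealthPayloadSketch, minWindowMs = 1000): boolean {
  const loop = payload.eventLoop;
  const loadOnly =
    loop.reasons.length > 0 &&
    loop.reasons.every((reason) => reason === "event_loop_utilization" || reason === "cpu");
  return loop.degraded && loadOnly && loop.intervalMs < minWindowMs && loop.delayMaxMs === 0;
}

// The degraded payload quoted in this report.
const reportedPayload: HealthPayloadSketch = JSON.parse(`{
  "ok": true,
  "durationMs": 411,
  "eventLoop": {
    "degraded": true,
    "reasons": ["event_loop_utilization", "cpu"],
    "intervalMs": 246,
    "delayP99Ms": 0,
    "delayMaxMs": 0,
    "utilization": 1,
    "cpuCoreRatio": 1.035
  }
}`);

const isArtifact = looksLikeShortWindowArtifact(reportedPayload); // true for this payload
```

A payload that matches this check is worth re-testing with openclaw health --verbose before treating the Gateway as saturated.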

Suspected area

The docs say default openclaw health asks the running Gateway for its health snapshot and may return a cached payload while refreshing in the background, while --verbose forces a live probe. This mismatch suggests the event-loop health sample used by the default health snapshot may be read/reset over too short an interval, allowing eventLoopUtilization/CPU to hit 1 over a ~250ms window and trip thresholds even though longer live-probe windows are healthy.
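The arithmetic behind this suspicion is simple enough to write out. The sketch below assumes cpuCoreRatio means CPU time divided by wall-clock window length, and that utilization follows the definition used by Node's performance.eventLoopUtilization(), active / (idle + active). Neither assumption is confirmed by the issue, and both function names are illustrative.

```typescript
// Assumed definition: CPU time consumed divided by the wall-clock window length.
function cpuRatioOverWindow(cpuTimeMs: number, windowMs: number): number {
  return cpuTimeMs / windowMs;
}

// Node's documented event-loop utilization formula: active / (idle + active).
function loopUtilization(activeMs: number, idleMs: number): number {
  const totalMs = activeMs + idleMs;
  return totalMs === 0 ? 0 : activeMs / totalMs;
}

// CPU time implied by the reported sample: cpuCoreRatio 1.035 over a 246ms window.
const burstCpuMs = 1.035 * 246; // about 254.6ms of CPU time

// Measured over the 246ms window alone, the burst looks like full saturation.
const shortWindowRatio = cpuRatioOverWindow(burstCpuMs, 246); // ≈1.035
const shortWindowUtil = loopUtilization(246, 0); // 1

// The same burst inside a 2329ms window (an interval seen in the PR evidence)
// is a small fraction of one core.
const longWindowRatio = cpuRatioOverWindow(burstCpuMs, 2329); // ≈0.109
const longWindowUtil = loopUtilization(246, 2329 - 246); // ≈0.106
```

Under these assumptions a quarter-second burst of ordinary work is enough to push both ratios to about 1 on a ~250ms window, while a window of a second or more dilutes it.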

A possible fix would be one of:

  • ignore utilization/cpu-only degradation below a minimum sample interval,
  • require event-loop delay to be nonzero/significant for tiny intervals,
  • make event-loop health snapshot reads non-destructive,
  • or keep a rolling/cached event-loop health sample for health snapshots separate from readiness/background refresh probes.

extent analysis

TL;DR

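The false positive goes away if utilization/cpu-only degradation is ignored below a minimum sample interval, or if tiny intervals must also show nonzero event-loop delay before being marked degraded. PR #77028 takes the first approach in the Gateway's event-loop health sampling: sub-second samples are no longer classified for utilization/CPU, and the previous snapshot is preserved until a full 1s window is available.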

Guidance

  • Review the Gateway's event-loop health sampling (src/gateway/server/event-loop-health.ts, the file changed by PR #77028) to understand how event-loop utilization and CPU usage are calculated, and consider adding a minimum sample interval threshold.
  • Investigate the possibility of making event-loop health snapshot reads non-destructive to prevent interference with live probe results.
  • Consider implementing a rolling/cached event-loop health sample for health snapshots to reduce the impact of short-term utilization spikes.
  • Verify that the forced live probe results are consistent with the expected behavior, and use this as a reference point for adjusting the default health snapshot calculation.

Notes

The root cause appears to be the sampling interval behind the default openclaw health snapshot: a window of ~246ms is too short for utilization and CPU ratios to reflect the event loop's health. The suggested fixes address this by introducing a minimum sample interval or by requiring nonzero event-loop delay for tiny intervals.

Recommendation

Prefer ignoring utilization/cpu-only degradation below a minimum sample interval. It is more targeted and less invasive than rewriting the event-loop health snapshot mechanism.
