openclaw: [Bug]: Gateway CPU stuck at 100% causing service degradation under high load

GitHub stats
openclaw/openclaw#63149 · Fetched 2026-04-09 07:57:49
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Bug type

Crash (process hangs under load)

Beta release blocker

No

Summary

Gateway process continuously occupies a single CPU core at 100% under high load, causing subagent notifications to fail and heartbeat queues to severely delay.

Steps to reproduce

  1. Configure 5 agents (orchestrator/monitor/coder/tester/researcher), each with scheduled heartbeat tasks.
  2. Stagger the heartbeat intervals (13m/14m/15m/16m/17m). Note that each heartbeat trigger still spawns multiple isolated sessions for distillation/introspection.
  3. Wait for heartbeats to fire nearly simultaneously (roughly an hourly peak), at which point a large number of concurrent subagent tasks start at once.
  4. Observe that a single Gateway core is fully saturated and response time degrades from the normal 10-100ms to 1000-3000ms.
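
The staggering described in the steps above can be explored with a small simulation. This is only a sketch: it assumes all five heartbeat timers start together at minute 0 and that the agent names map to the intervals in the order listed, neither of which the issue states. The helper names `fire_times` and `bursts` are hypothetical and are not part of OpenClaw.

```python
from typing import Dict, List, Tuple


def fire_times(interval_min: int, horizon_min: int) -> List[int]:
    """Minutes at which a heartbeat with this interval fires, up to the horizon."""
    return list(range(interval_min, horizon_min + 1, interval_min))


def bursts(
    intervals: Dict[str, int], horizon_min: int, window_min: int, min_agents: int
) -> List[Tuple[int, List[str]]]:
    """Return (start_minute, agents) for every window in which at least
    min_agents distinct agents fire within window_min minutes of each other."""
    events = sorted(
        (t, name)
        for name, interval in intervals.items()
        for t in fire_times(interval, horizon_min)
    )
    found = []
    for i, (start, _) in enumerate(events):
        # events is sorted, so everything from index i onward fires at or after start
        agents = {name for t, name in events[i:] if t < start + window_min}
        if len(agents) >= min_agents:
            found.append((start, sorted(agents)))
    return found


# Assumed mapping of agent names to the intervals from the reproduction steps.
INTERVALS = {"orchestrator": 13, "monitor": 14, "coder": 15, "tester": 16, "researcher": 17}

if __name__ == "__main__":
    # Scan one day for five-minute windows containing at least four agents.
    for start, agents in bursts(INTERVALS, horizon_min=24 * 60, window_min=5, min_agents=4):
        print(f"minute {start}: {', '.join(agents)}")
```

Under the shared-start assumption, the first round alone puts all five heartbeats inside a five-minute window (minutes 13 to 17), so staggering the intervals by one minute does not by itself prevent bursts.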

Expected behavior

Gateway should stably handle multi-agent concurrent heartbeats without service degradation.

Actual behavior

Gateway CPU stays at 100% on a single core, causing:

  • gateway timeout after 10000ms for subagent notifications
  • probe timeout 30000ms for bot channel startup
  • lane wait exceeded 26s for heartbeat queue
  • ~5 minutes for full drain after config restart

OpenClaw version

2026.4.5

Operating system

macOS (Gateway Local mode)

Install method

NOT_ENOUGH_INFO

Model

NOT_ENOUGH_INFO

Provider / routing chain

NOT_ENOUGH_INFO

Additional provider/model setup details

Gateway Local mode with Feishu WebSocket channel × 5 accounts

Logs, screenshots, and evidence

[2026-04-08T03:05:00+08:00] [WARN] [heartbeat] monitor heartbeat triggered (lane wait exceeded 26000ms)
[2026-04-08T00:03:57.498+08:00] ERROR: Subagent announce failed: Error: gateway timeout after 10000ms
[2026-04-08T09:29:45.552+08:00] ERROR: GatewayDrainingError: Gateway is draining for restart

Impact and severity

  • Affected: Multi-agent deployments with heartbeat tasks
  • Severity: High (blocks critical notifications and heartbeat processing)
  • Frequency: Hourly peaks when multiple heartbeats coincide
  • Consequence: Service degradation, missed notifications, delayed task processing

Additional information

Possible root causes:

  1. agents.defaults.maxConcurrent: 7 × 5 agents = theoretical max 35 concurrent sessions
  2. subagents.maxConcurrent: 14 with no special limits for heartbeat tasks
  3. 300s embedded run timeout blocks drain process
  4. Lack of Gateway-level resource isolation or rate limiting

Suggested fix: Add Gateway-level concurrency control, reduce timeout, and serialize heartbeat-triggered distillation tasks.
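
One way to picture the suggested fix is a global concurrency cap combined with a serial lane for heartbeat-triggered work. The sketch below uses Python's `asyncio` purely for illustration; `GatewayLimiter`, its methods, and the cap of 4 are hypothetical and say nothing about how the OpenClaw Gateway is actually implemented.

```python
import asyncio
from typing import Awaitable, Callable


class GatewayLimiter:
    """Hypothetical limiter: a global cap plus a serial lane for heartbeat work."""

    def __init__(self, max_concurrent: int) -> None:
        self._global = asyncio.Semaphore(max_concurrent)  # Gateway-wide cap
        self._heartbeat_lane = asyncio.Lock()  # one heartbeat task at a time
        self.running = 0
        self.peak = 0

    async def run(
        self, task: Callable[[], Awaitable[None]], heartbeat: bool = False
    ) -> None:
        if heartbeat:
            # Heartbeat-triggered tasks queue behind each other first,
            # then compete for a global slot like any other task.
            async with self._heartbeat_lane:
                await self._run_capped(task)
        else:
            await self._run_capped(task)

    async def _run_capped(self, task: Callable[[], Awaitable[None]]) -> None:
        async with self._global:
            self.running += 1
            self.peak = max(self.peak, self.running)
            try:
                await task()
            finally:
                self.running -= 1


async def demo() -> GatewayLimiter:
    # Created inside the running loop so it works on older Python versions too.
    limiter = GatewayLimiter(max_concurrent=4)

    async def fake_session() -> None:
        await asyncio.sleep(0.01)  # stands in for real session work

    # 35 matches the theoretical maximum quoted in the issue (7 x 5 agents).
    await asyncio.gather(*(limiter.run(fake_session) for _ in range(35)))
    return limiter


if __name__ == "__main__":
    result = asyncio.run(demo())
    print(f"peak concurrent sessions: {result.peak}")
```

The heartbeat lane is taken before the global slot so that queued heartbeat tasks do not each hold a slot while waiting their turn.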

Extent analysis

TL;DR

Implement Gateway-level concurrency control to prevent single-core saturation under high load.

Guidance

  • Review and adjust agents.defaults.maxConcurrent and subagents.maxConcurrent settings to prevent excessive concurrent sessions.
  • Consider implementing rate limiting for heartbeat tasks to prevent simultaneous triggering of multiple subagent tasks.
  • Evaluate the effectiveness of reducing the embedded run timeout to facilitate faster drain processes.
  • Investigate the feasibility of serializing heartbeat-triggered distillation tasks to prevent concurrent execution.
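
The rate-limiting item in the guidance above could take the form of a token bucket placed in front of heartbeat-triggered task spawns. This is a minimal sketch under assumptions: `TokenBucket`, the rate, and the capacity are illustrative names and values, and the issue does not say whether OpenClaw exposes a hook where such a limiter could be attached.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow short bursts up to `capacity`, then refill at `rate_per_sec`."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock  # injectable so the behaviour can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=1.0, capacity=2)
    # A burst of five spawn attempts: the first two pass, the rest are deferred.
    for attempt in range(5):
        verdict = "spawn" if bucket.try_acquire() else "defer"
        print(f"heartbeat task {attempt}: {verdict}")
```

A deferred task would be retried later rather than dropped, which spreads a coinciding set of heartbeats over time instead of starting them all at once.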

Example

The issue does not reference any specific code, so no snippet from the OpenClaw source is available.
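
Although the issue references no code, the drain problem can still be illustrated with a hypothetical sketch: wait for in-flight work up to a deadline, then cancel whatever remains instead of waiting out a 300s run. The `drain` function and the 0.2s deadline are assumptions for illustration. Whether OpenClaw's embedded runs can be cancelled safely is not stated in the issue.

```python
import asyncio
from typing import Iterable, Tuple


async def drain(tasks: Iterable["asyncio.Task"], deadline_s: float) -> Tuple[int, int]:
    """Wait up to deadline_s for in-flight tasks, then cancel the rest.

    Returns (finished, cancelled).
    """
    in_flight = list(tasks)
    if not in_flight:
        return (0, 0)  # asyncio.wait rejects an empty collection
    done, pending = await asyncio.wait(in_flight, timeout=deadline_s)
    for task in pending:
        task.cancel()
    # Let cancelled tasks unwind; collect exceptions instead of raising them.
    await asyncio.gather(*pending, return_exceptions=True)
    return (len(done), len(pending))


async def demo() -> Tuple[int, int]:
    quick = asyncio.create_task(asyncio.sleep(0.01))
    stuck = asyncio.create_task(asyncio.sleep(300))  # stands in for a 300s embedded run
    return await drain([quick, stuck], deadline_s=0.2)


if __name__ == "__main__":
    finished, cancelled = asyncio.run(demo())
    print(f"finished={finished} cancelled={cancelled}")
```

The trade-off is that cancelled runs lose their work, so a real deadline would need to be chosen against how expensive a run is to repeat.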

Notes

The suggested fix relies on the availability of configuration options for concurrency control and rate limiting. The effectiveness of these adjustments may depend on the specific Gateway and subagent configurations.

Recommendation

Apply workaround: Implement Gateway-level concurrency control and adjust relevant settings to prevent single-core saturation and service degradation. This approach is chosen due to the high severity and frequency of the issue, and the potential for concurrency control to mitigate the problem.
