openclaw - 💡(How to fix) Fix Event-loop saturation and ACP session leak on 2026.4.27 (9d8de70) [1 comments, 2 participants]

openclaw/openclaw#74345 (fetched 2026-04-30 06:25:14)


Gateway event-loop pegs and ACP session lifecycle leaks on 2026.4.27 post-tag drift (commit 9d8de70)

Summary

Between a8b64b7d52 (good) and 9d8de70c20 (bad) — both shipped under the 2026.4.27 tag, ~509 commits of post-tag drift — the gateway becomes unable to close ACP sessions during task-registry-maintenance runs. Sessions accumulate unbounded, the event loop is held for 480-490 seconds at a time by long-running synchronous work, every embedded model run surfaces decision=surface_error reason=timeout, Telegram polling stalls (getUpdates stuck for 700+s), and Discord disconnects with gateway was not ready after 15000ms. The bot becomes effectively unreachable across all transports.

A simple systemctl --user restart openclaw-gateway does not clear it: a fresh process reproduces the leak within seconds at idle (no user activity required).

Rolling back the worktree to a8b64b7d52 and rebuilding fully resolves the issue. Both versions report OpenClaw 2026.4.27 (<short-hash>).

Reproduction

# Bad commit — symptom appears immediately on a fresh process at idle
cd ~/openclaw
git checkout 9d8de70c20
rm -rf dist && pnpm install && pnpm build && pnpm ui:build
systemctl --user restart openclaw-gateway
journalctl --user -u openclaw-gateway -f   # watch for the symptoms below

Symptom fingerprint

Five log signals appear together within ~30 seconds of a fresh gateway start, with no user activity:

  1. ACP session-close maintenance failures looping

    [tasks/task-registry-maintenance] Failed to close orphaned parent-owned ACP session during task maintenance
    [tasks/task-registry-maintenance] Failed to close terminal ACP session during task maintenance

    ~10 per minute, observed ~2,278 in 24 hours.

  2. Session-write-lock holds far past max

    [session-write-lock] releasing lock held for 489034ms (max=15000ms): /home/<user>/.openclaw/agents/claude/sessions/sessions.json.lock
    [session-write-lock] releasing lock held for 76908ms  (max=15000ms): /home/<user>/.openclaw/agents/main/sessions/sessions.json.lock

    Holds of 29-489 seconds recur; the maximum allowed is 15 s.

  3. Event loop pegged

    [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=487s
        eventLoopDelayP99Ms=309.1 eventLoopDelayMaxMs=480767.9 eventLoopUtilization=0.993 cpuCoreRatio=1.002
        active=0 waiting=0 queued=0

    eventLoopDelayMaxMs is consistently 480000+ ms (≈8 min) per ~500s window. active=0 waiting=0 queued=0 rules out backed-up agent work, so something is holding the loop synchronously.

  4. Embedded model runs all time out

    [agent/embedded] embedded run failover decision: runId=… stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.5
    [agent/embedded] embedded run failover decision: runId=… stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4

    Every model call times out, on both gpt-5.5 and the gpt-5.4 fallback. The OpenAI status page is green and the access_token validates, so this is local event-loop saturation, not a provider issue.

  5. Transport flapping

    [telegram] Polling stall detected (active getUpdates stuck for 701.09s); forcing restart.
    [discord] gateway was not ready after 15000ms; restarting gateway

    Both inbound transports flap because their polling and heartbeat timers are scheduled on the same event loop, which is blocked and cannot run them on time.
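
Signals 2 and 3 carry numbers that are easier to compare once they are pulled out of the journal. The following is a minimal sketch, not part of OpenClaw: the function names are hypothetical, and the patterns assume the log formats stay exactly as in the excerpts above.

```shell
# Hypothetical helpers; log formats are copied from the excerpts in this report.

# Read log lines on stdin and print "<held seconds>s <lock path>" for every
# session-write-lock release whose hold time exceeded its logged max.
long_lock_holds() {
  sed -n -E 's/.*\[session-write-lock\] releasing lock held for ([0-9]+)ms +\(max=([0-9]+)ms\): (.*)$/\1 \2 \3/p' |
    awk '{
      held = $1; max = $2
      sub(/^[0-9]+ [0-9]+ /, "")   # leave only the lock path in $0
      if (held + 0 > max + 0) printf "%.1fs %s\n", held / 1000, $0
    }'
}

# Read log lines on stdin and print the largest eventLoopDelayMaxMs value
# found in liveness warnings (prints nothing when there is none).
max_loop_delay_ms() {
  grep -o -E 'eventLoopDelayMaxMs=[0-9.]+' |
    awk -F= '$2 + 0 > m + 0 { m = $2 + 0; v = $2 } END { if (v != "") print v }'
}

# Usage (not run here):
#   journalctl --user -u openclaw-gateway --no-pager | long_lock_holds
#   journalctl --user -u openclaw-gateway --no-pager | max_loop_delay_ms
```

Both functions only read stdin, so they can also be run against a saved or redacted journal excerpt.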

Process-level: the gateway node process sits at 30-77% CPU with no user activity, and the cgroup task count (Tasks) climbs unbounded, reaching 1138 before manual intervention. On a8b64b7d52, idle is 12% CPU and ~85 tasks, steady.
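
The task count can be checked without reading `systemctl status` by eye. This sketch assumes the gateway runs as the user unit openclaw-gateway (as in the commands in this report) and that systemd task accounting is enabled. The function names and the limit of 200 in the usage line are illustrative assumptions, chosen only because the report gives ~85 as the healthy value and 1138 as the leaking one.

```shell
# Hypothetical helpers. TasksCurrent is the systemd unit property behind the
# "Tasks:" line of `systemctl status`.

# Print the current task count of a systemd user unit.
sample_tasks() {
  systemctl --user show "$1" --property=TasksCurrent --value
}

# tasks_exceed UNIT LIMIT
# Exit 0 when the unit's task count is above LIMIT, 1 when it is not, and
# 2 when the count cannot be read as a number (for example "[not set]").
tasks_exceed() {
  local unit="$1" limit="$2" current
  current="$(sample_tasks "$unit")" || return 2
  case "$current" in
    '' | *[!0-9]*) return 2 ;;
  esac
  echo "$unit TasksCurrent=$current limit=$limit"
  [ "$current" -gt "$limit" ]
}

# Usage (not run here; 200 is an assumed threshold, not a project default):
#   tasks_exceed openclaw-gateway 200 && echo "task count looks leaked"
```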

Verified workaround

cd ~/openclaw
git reset --hard a8b64b7d523170ffdcabb538e601c6a871d8a7a7
rm -rf dist
pnpm install && pnpm build && pnpm ui:build
systemctl --user restart openclaw-gateway

After ~90 seconds, all five symptoms disappear (verified by 0 maintenance failures / 0 liveness warnings / 0 long lock holds in a 90s observation window).
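
The 90-second check can be repeated with a small tally over the journal. This is a sketch under the assumption that the log prefixes match the excerpts in this report; count_symptoms is a hypothetical name, and it counts only the three signals named in the verification above.

```shell
# Hypothetical helper: read gateway log lines on stdin and print how many
# maintenance failures, liveness warnings, and over-max lock holds they contain.
count_symptoms() {
  awk '
    /\[tasks\/task-registry-maintenance\] Failed to close/ { maint++ }
    /\[diagnostic\] liveness warning/ { live++ }
    /\[session-write-lock\] releasing lock held for/ {
      # Pull the held and max durations out of "held for <n>ms (max=<m>ms)".
      held = $0; sub(/.*held for /, "", held); sub(/ms.*/, "", held)
      max = $0;  sub(/.*\(max=/, "", max);     sub(/ms.*/, "", max)
      if (held + 0 > max + 0) lock++
    }
    END {
      printf "maintenance_failures=%d liveness_warnings=%d long_lock_holds=%d\n", maint, live, lock
    }'
}

# Usage (not run here):
#   journalctl --user -u openclaw-gateway --since "90s ago" --no-pager | count_symptoms
```

On a healthy build the expectation from this report is a line of zeros; any non-zero count within the window suggests the regression is still present.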

Likely culprits

git log --oneline a8b64b7..9d8de70 lists 509 commits. Highest-suspicion candidates based on the fingerprint (session-write-lock holds + ACP session lifecycle + gateway transport):

  • 023d3371a5 refactor(gateway): classify gateway transport failures
  • 2b811fe6d9 fix(memory): make qmd gateway startup lazy
  • afc4f06ca3 fix(memory): isolate qmd boot refresh
  • Any change to task-registry / ACP session close paths

A bisect across that range should land it quickly given how immediately the symptom reproduces.
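
Because the symptom shows up at idle, the bisect can probably be automated with git bisect run. The sketch below is an assumption-laden outline rather than a tested procedure: bisect_step and is_bad_log are hypothetical names, the 60-second wait is a guess that doubles the ~30 seconds reported above, and a commit that fails to build is skipped rather than judged.

```shell
# Hypothetical helpers for `git bisect run`.

# Exit 0 when stdin contains an ACP session-close maintenance failure.
is_bad_log() {
  grep -q -F '[tasks/task-registry-maintenance] Failed to close'
}

# bisect_step [WAIT_SECONDS]
# Build and restart the checked-out commit, wait, then classify it.
# Exit codes follow `git bisect run`: 0 = good, 1 = bad, 125 = skip.
bisect_step() {
  local wait_seconds="${1:-60}" started
  { rm -rf dist && pnpm install && pnpm build && pnpm ui:build; } || return 125
  started="$(date '+%Y-%m-%d %H:%M:%S')"
  systemctl --user restart openclaw-gateway || return 125
  sleep "$wait_seconds"
  if journalctl --user -u openclaw-gateway --since "$started" --no-pager | is_bad_log; then
    return 1
  fi
  return 0
}

# Usage (not run here), with this file saved outside the repo as
# ~/bisect-step.sh (a hypothetical path):
#   cd ~/openclaw
#   git bisect start 9d8de70c20 a8b64b7d52
#   git bisect run bash -c '. ~/bisect-step.sh && bisect_step 60'
#   git bisect reset
```

With 509 commits this needs roughly nine build-and-wait rounds. If a build leaves the worktree dirty, git bisect may refuse to check out the next commit, so the tree may need cleaning between steps.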

Environment

  • OpenClaw 2026.4.27 (both commits report this version)
  • Node 22.22.2, pnpm 10.33.0
  • Linux x86_64, systemd-managed user service
  • Channels enabled: Telegram, Discord, Signal
  • ACP plugin (@zed-industries/codex-acp) and claude-agent-acp wrappers active

Additional logs / artifacts

I have ~24 hours of journal output covering the broken build, plus a side-by-side comparison against the fresh post-rollback gateway. Happy to attach a redacted excerpt or run any specific diagnostic if it would help bisecting.

extent analysis

TL;DR

Reverting to commit a8b64b7d52 resolves the issue, which indicates a regression introduced somewhere between that commit and 9d8de70c20.

Guidance

  1. Bisect the commits: Perform a binary search between a8b64b7d52 and 9d8de70c20 to identify the specific commit causing the issue.
  2. Inspect session-write-lock and ACP session close paths: Review changes to these areas, as they are likely culprits based on the symptom fingerprint.
  3. Verify event loop utilization: Monitor eventLoopDelayMaxMs and eventLoopUtilization to ensure the event loop is not being held synchronously.
  4. Test transport flapping: Check for polling stalls and gateway readiness issues on Telegram and Discord after applying any fixes.

Example

No specific code snippet is provided, as the issue is related to a complex interaction between multiple components.

Notes

The rollback confirms that the regression lies in the a8b64b7d52..9d8de70c20 range but does not identify its cause. A bisect and a review of the changed code are still needed to find the root cause.

Recommendation

Apply the workaround by reverting to commit a8b64b7d52 until the root cause is identified and fixed, as it immediately resolves the symptoms.
