openclaw - Fix bug: Pi session event queue self-wait can hang Gateway at tool calls

Summary

OpenClaw 2026.5.22 can deadlock an embedded Pi agent turn at the tool-call boundary. In production this made the Gateway appear active/running while Feishu replies, /health, /status, and openclaw gateway status became intermittently or fully unresponsive. This is a catastrophic Gateway availability bug because one agent turn can effectively starve the main Gateway process.

Impact

  • Affected runtime: observed on 2026.5.22.
  • Last known stable runtime in this environment: 2026.5.19.
  • Channel impact: Feishu WebSocket remained connected at first, but replies and probes stalled.
  • Operational impact: systemd still reported active/running, so normal service checks were misleading.

Observed diagnostics

Representative liveness diagnostics from the affected runtime:

liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
  eventLoopDelayP99Ms=43050.3
  eventLoopUtilization=1
  active=... processing/model_call ... last=model_call:started | ... processing/embedded_run ...

liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
  eventLoopDelayP99Ms=53888.4
  eventLoopUtilization=1
  active=... processing/tool_call ... last=tool:exec:started
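
For readers who want to compare against their own process, metrics of this kind can be sampled with Node's perf_hooks module. This is only a sketch: the field names mirror the log above, but whether OpenClaw derives them this way is an assumption, and sampleLiveness is a hypothetical helper.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Samples event-loop health over a short window.
// Assumption: the Gateway's liveness fields are computed along similar lines.
async function sampleLiveness(
  windowMs: number,
): Promise<{ eventLoopDelayP99Ms: number; eventLoopUtilization: number }> {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  const start = performance.eventLoopUtilization();

  await new Promise((resolve) => setTimeout(resolve, windowMs));

  histogram.disable();
  // Passing the earlier reading returns the utilization for the window only.
  const elu = performance.eventLoopUtilization(start);
  return {
    // Histogram values are reported in nanoseconds.
    eventLoopDelayP99Ms: histogram.percentile(99) / 1e6,
    eventLoopUtilization: elu.utilization,
  };
}
```

A utilization of 1 together with a P99 delay in the tens of seconds, as in the log above, means timers and I/O callbacks were barely getting a turn on the loop.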

Additional symptoms:

per-chat task exceeded 300000ms cap
Feishu WebSocket reconnects
Gateway HTTP probes time out despite the process still listening on 127.0.0.1:18789
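
Since the process keeps listening while probes stall, an external check needs a hard timeout instead of a port or process check. A minimal sketch, assuming Node 18+ with global fetch; probeGateway is a hypothetical helper, and only the address and the /health path come from this report:

```typescript
type ProbeResult = "ok" | "unhealthy" | "unresponsive";

// Hypothetical external liveness probe with a hard deadline.
async function probeGateway(url: string, timeoutMs: number): Promise<ProbeResult> {
  try {
    // AbortSignal.timeout aborts the request if no response arrives in time.
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok ? "ok" : "unhealthy";
  } catch {
    // Timeouts and connection errors both mean the Gateway did not answer.
    return "unresponsive";
  }
}
```

For example, probeGateway("http://127.0.0.1:18789/health", 2000) would be expected to report "unresponsive" during the stall described here, even though the service manager still shows the unit as active.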

Root cause analysis

The issue appears to be in the 2026.5.22-era Pi embedded session write-lock changes around:

src/agents/pi-embedded-runner/run/attempt.session-lock.ts

The problematic interaction is:

  1. installAwaitableSessionEventQueue() wraps _handleAgentEvent() so that event handling is tracked by session._agentEventQueue, which callers can await.
  2. installSessionExternalHookWriteLock() wraps agent.beforeToolCall and calls waitForSessionEventQueue(session) before taking the write lock.
  3. During normal event processing, the current _agentEventQueue entry handles a tool_call event and invokes beforeToolCall.
  4. The wrapped beforeToolCall then waits for session._agentEventQueue, which is the same queue promise that contains the current event handler.

That creates a self-wait/deadlock:

_handleAgentEvent
  -> _agentEventQueue current entry
    -> _processAgentEvent(tool_call)
      -> agent.beforeToolCall()
        -> waitForSessionEventQueue(session)
          -> waits for current _agentEventQueue entry to complete

The current entry cannot complete because the hook it invoked is waiting for that same entry to finish.
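
The self-wait can be reproduced in isolation with a toy promise queue. This is a sketch of the pattern only: ToySession, enqueue, hookThatDrainsQueue, and settleWithin are illustrative names, not OpenClaw code.

```typescript
// Toy model of the self-wait; none of these names exist in OpenClaw.
type ToySession = { queue: Promise<void> };

const TIMED_OUT = Symbol("timed out");

// Resolves with the promise's value, or TIMED_OUT if it has not settled in time.
function settleWithin<T>(p: Promise<T>, ms: number): Promise<T | typeof TIMED_OUT> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), ms);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer));
}

// Chains a handler onto the session queue, as an awaitable event queue would.
function enqueue(session: ToySession, handler: () => Promise<void>): Promise<void> {
  const entry = session.queue.then(handler);
  session.queue = entry.catch(() => undefined);
  return entry;
}

// A hook that drains the queue before doing its work.
async function hookThatDrainsQueue(session: ToySession): Promise<void> {
  // When called from inside the current entry, this awaits the entry itself.
  await session.queue;
}

async function reproduceSelfWait(): Promise<boolean> {
  const session: ToySession = { queue: Promise.resolve() };
  const entry = enqueue(session, () => hookThatDrainsQueue(session));
  const result = await settleWithin(entry, 50);
  return result === TIMED_OUT; // true when the entry never settles
}
```

By the time the handler runs, session.queue already points at the entry that is running it, so the await can never resolve.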

Candidate fix

Track the currently executing session event queue entry with AsyncLocalStorage, and make waitForSessionEventQueue(session) return immediately only when it is called from inside that same active session event processing context.

External hook/cleanup/provider paths should still drain pending session events normally.

Minimal patch shape:

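import { AsyncLocalStorage } from "node:async_hooks";

// Holds the session whose queue entry is currently executing on this async call chain.
const activeSessionEventProcessing = new AsyncLocalStorage<unknown>();

async function waitForSessionEventQueue(session: unknown): Promise<void> {
  // Hooks invoked by the queue entry itself must not wait for that same entry to finish.
  if (activeSessionEventProcessing.getStore() === session) {
    return;
  }
  // existing queue-drain logic...
}

// `session`, `original`, `params`, and `eventMayReachTranscriptWriters` are assumed to be
// in scope from the surrounding code in attempt.session-lock.ts.
session["_processAgentEvent"] = async function lockedProcessAgentEvent(this: unknown, event: unknown) {
  // Mark this session as active for everything awaited inside the entry, including hooks.
  return await activeSessionEventProcessing.run(session, async () => {
    if (!eventMayReachTranscriptWriters(session, event)) {
      return await original.call(this, event);
    }
    return await params.withSessionWriteLock(async () => await original.call(this, event));
  });
};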
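
The guard depends on AsyncLocalStorage keeping its store across awaits inside run() and exposing nothing outside it. A small sketch that checks those properties; activeSession, isInsideActiveEntry, and probeContexts are hypothetical names, and only AsyncLocalStorage is a real Node.js API here:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical stand-in for the store used by the candidate fix.
const activeSession = new AsyncLocalStorage<object>();

function isInsideActiveEntry(session: object): boolean {
  return activeSession.getStore() === session;
}

type ContextProbe = {
  outside: boolean;
  inside: boolean;
  afterAwait: boolean;
  otherSession: boolean;
};

async function probeContexts(): Promise<ContextProbe> {
  const sessionA = {};
  const sessionB = {};

  // Outside run(): no store is set, so the guard does not match.
  const outside = isInsideActiveEntry(sessionA);

  const fromRun = await activeSession.run(sessionA, async () => {
    const inside = isInsideActiveEntry(sessionA);
    // The store is expected to survive an await on a timer.
    await new Promise((resolve) => setTimeout(resolve, 1));
    const afterAwait = isInsideActiveEntry(sessionA);
    // A different session must not be treated as re-entrant.
    const otherSession = isInsideActiveEntry(sessionB);
    return { inside, afterAwait, otherSession };
  });

  return { outside, ...fromRun };
}
```

Comparing the store with the specific session, instead of only checking that a store exists, keeps a hook running under one session's entry from skipping the drain for another session.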

Regression test

Add a test that simulates the actual self-wait path:

_handleAgentEvent -> _agentEventQueue -> _processAgentEvent -> beforeToolCall

The test should prove the hook completes and does not time out when invoked from inside the active queue entry, while existing tests should continue to prove external hooks still drain queued session events.
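
A rough shape for such a test, written against a toy queue and not against the real session object. ToySession, handleEvent, waitForQueue, and settlesWithin are hypothetical stand-ins, and the real test would use the project's vitest setup:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Toy stand-ins for the real session and hooks; names are hypothetical.
type ToySession = { queue: Promise<void>; log: string[] };
const activeEntry = new AsyncLocalStorage<ToySession>();

async function waitForQueue(session: ToySession): Promise<void> {
  if (activeEntry.getStore() === session) return; // re-entrant call: skip the drain
  await session.queue;
}

function handleEvent(session: ToySession, process: () => Promise<void>): Promise<void> {
  const entry = session.queue.then(() => activeEntry.run(session, process));
  session.queue = entry.catch(() => undefined);
  return entry;
}

// Resolves true if the promise fulfils in time, false if the deadline passes first.
function settlesWithin(p: Promise<unknown>, ms: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), ms);
  });
  return Promise.race([p.then(() => true), timeout]).finally(() => clearTimeout(timer));
}

async function runRegressionChecks(): Promise<{ innerHookCompleted: boolean; log: string[] }> {
  const session: ToySession = { queue: Promise.resolve(), log: [] };

  // Case 1: a hook invoked from inside the active queue entry must not hang.
  const entry = handleEvent(session, async () => {
    await waitForQueue(session);
    session.log.push("inner hook done");
  });
  const innerHookCompleted = await settlesWithin(entry, 200);

  // Case 2: an external caller must still wait for a pending entry.
  void handleEvent(session, async () => {
    await new Promise((resolve) => setTimeout(resolve, 10));
    session.log.push("queued event done");
  });
  await waitForQueue(session);
  session.log.push("external hook done");

  return { innerHookCompleted, log: session.log };
}
```

The ordering of the log entries is what distinguishes the two cases: the external hook should only record its line after the queued event has finished.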

A local candidate fix passed:

node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
# 2 files passed, 66 tests passed

git diff --check -- src/agents/pi-embedded-runner/run/attempt.session-lock.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
# OK

XDG_DATA_HOME=/home/liao/.hermes/tmp/pnpm-data pnpm build
# OK

Candidate fix commit in fork:

e1765a6dcd fix(agents): avoid session event queue self-wait

Workaround

Roll back the Gateway runtime to 2026.5.19 and restart it externally. In this environment the 2026.5.19 rollback restored:

openclaw gateway status: OK
/health: 200
/status: 200
Feishu WS: ready

If the config was written by 2026.5.22, 2026.5.19 may refuse to start and exit with code 78. For an intentional rollback/recovery window, the service needs:

OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1

This should be treated as a recovery-only workaround, not a long-term fix.
