openclaw - [Bug]: @openclaw/codex notification handlers (account/rateLimits/updated, mcpServer/startupStatus/updated) synchronously block Node event loop [1 pull request]

Summary

In @openclaw/codex 2026.5.7 (and still in 2026.5.12), the codex_app_server:notification handler appears to do synchronous work heavy enough to starve the Node event loop when invoked during an embedded_run cycle on an agentRuntime: { id: "codex" } agent. Two notification types are confirmed wedge triggers:

  1. account/rateLimits/updated — fired during a cron-triggered embedded_run. Hard wedge, instant: gateway becomes unreachable to AWS health checks within seconds.
  2. mcpServer/startupStatus/updated — fired when the codex plugin spawns MCP servers for a codex-runtime agent's tool-use chat turn. Wedge within seconds of the notification, even on a single chat turn with normal workspace-bootstrap reads + memory_search + tool use.

These are distinct from the bedrock-mantle model-ref claim bug fixed by #81511 in 2026.5.12. They're also distinct from the orphan-process leak tracked in #44790.

Affected versions

Reproduced on OpenClaw + @openclaw/codex 2026.5.7 and 2026.5.12 (stable). Not addressed by #81511.

Evidence

Variant #2: cron + account/rateLimits/updated

Trigger: a daily cron (0 9 * * *, Australia/Sydney) on an agentRuntime: { id: "codex" } agent (valvest, with openai/gpt-5.5). Within ~4 seconds of the cron firing:

2026-05-12 09:01:09 [fetch-timeout] fetch timeout after 10000ms operation=fetchWithTimeout
  url=https://api.telegram.org/bot.../getMe

2026-05-12 09:01:22 [diagnostic] liveness warning:
  reasons=event_loop_delay,cpu
  eventLoopDelayP99Ms=2831.2 eventLoopDelayMaxMs=8011.1
  eventLoopUtilization=0.934 cpuCoreRatio=0.99
  work=[active=agent:valvest:cron:fd0c69ed-...(processing/embedded_run,q=0,age=4s
    last=codex_app_server:notification:account/rateLimits/updated)]

Then zero further journal entries for the remaining ~8 hours of that boot. The OS was dead at the journal level from 09:01 onward; AWS StatusCheckFailed=5 per 5-min window continuously. Recovery required lightsail stop-instance --force + start-instance.
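When comparing several incidents, it can help to pull the numeric fields and the last notification name out of a liveness warning mechanically. The following is a minimal sketch, assuming the warning text has the key=value layout shown in the log above; `parseLivenessWarning` and `LivenessWarning` are hypothetical names, not part of OpenClaw.

```typescript
// Hypothetical helper (not an OpenClaw API): extracts numeric key=value
// metrics and the last notification name from a liveness warning.
interface LivenessWarning {
  metrics: Record<string, number>;
  lastNotification?: string;
}

function parseLivenessWarning(text: string): LivenessWarning {
  const metrics: Record<string, number> = {};
  // Matches tokens such as eventLoopDelayMaxMs=8011.1 that end at whitespace
  // or end of input, so q=0, and age=4s are deliberately skipped.
  const metricRe = /(\w+)=(\d+(?:\.\d+)?)(?=\s|$)/g;
  let m: RegExpExecArray | null;
  while ((m = metricRe.exec(text)) !== null) {
    metrics[m[1]] = Number(m[2]);
  }
  // Assumes the "last=codex_app_server:notification:<method>" shape seen above.
  const last = /last=codex_app_server:notification:([\w/]+)/.exec(text);
  return { metrics, lastNotification: last ? last[1] : undefined };
}
```

Fed the warning above, this would yield the delay and utilization figures as numbers plus `account/rateLimits/updated` as the last notification, which makes it easier to tabulate wedges by notification type.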

Variant #4: chat + MCP tool use + mcpServer/startupStatus/updated

Trigger: a routine chat DM ("What's today's workout?") to a @openclaw/codex-runtime agent that uses MCP (valpeak, with GitHub MCP). During the normal turn (memory_search + workspace bootstrap-file reads — MEMORY.md, SOUL.md, training-plan.md, daily notes — and the agent's first MCP tool call):

2026-05-14 06:10:39 [agent/embedded] workspace bootstrap file MEMORY.md is 14561 chars
  (limit 12000); truncating in injected context

2026-05-14 06:10:46 [diagnostic] liveness warning:
  reasons=event_loop_delay
  eventLoopDelayP99Ms=264.6 eventLoopDelayMaxMs=3147.8
  eventLoopUtilization=0.564
  work=[active=agent:valpeak:main(processing/embedded_run,q=1,age=2s
    last=codex_app_server:notification:mcpServer/startupStatus/updated)]

2026-05-14 06:11:11 [ws] ⇄ res ✓ sessions.list 70ms ...
[then 12 minutes of nothing — wedged hard]

A single notification dispatch produced eventLoopDelayMaxMs=3147.8. Combined with the heavy first-turn tool use, that was enough to cross the wedge threshold.
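For anyone trying to confirm these numbers independently of the gateway's own diagnostics, Node exposes comparable measurements through `perf_hooks`. This is a sketch only: the field names mirror the log above for readability, and it is an assumption (not confirmed by this report) that OpenClaw derives its figures the same way.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Sample event-loop delay every 10 ms; histogram values are in nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();
let lastElu = performance.eventLoopUtilization();

// Returns delay and utilization for the window since the previous call.
function snapshot() {
  const nowElu = performance.eventLoopUtilization();
  const delta = performance.eventLoopUtilization(nowElu, lastElu);
  lastElu = nowElu;
  const result = {
    eventLoopDelayP99Ms: histogram.percentile(99) / 1e6,
    eventLoopDelayMaxMs: histogram.max / 1e6,
    eventLoopUtilization: delta.utilization,
  };
  histogram.reset(); // start a fresh window
  return result;
}
```

Calling `snapshot()` on a timer and logging it just before and after a suspect handler runs would show whether a single dispatch accounts for a multi-second spike.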

Risk ladder observed

| Trigger | Notification | Outcome |
| --- | --- | --- |
| Active-memory embedded_run invoking amazon-bedrock/* with codex loaded (pre-2026.5.12) | n/a (resolver hijack → bedrock-mantle 404 loop) | Slow death spiral (~8 h); FIXED by #81511 in 2026.5.12 |
| Cron on codex-runtime agent | account/rateLimits/updated | Hard wedge, instant (this report, variant #2) |
| Chat on codex-runtime agent + MCP tool use | mcpServer/startupStatus/updated | Wedge within seconds (this report, variant #4) |
| Steady-state idle, codex+acpx loaded | (only rawResponseItem/completed observed) | Slow heap growth → OOM at ~2.0 GB after ~42 h uptime; separate, see #44790 |

Notification handlers observed to fire harmlessly (no wedge): rawResponseItem/completed. So it's not all codex_app_server:notification types — it appears to be specifically the ones that do non-trivial sync work in the dispatch path.

Reproduction

  1. Install @openclaw/codex and configure an agent with agentRuntime: { id: "codex" } and model: openai/gpt-5.5.
  2. For variant #2: schedule a cron that triggers an embedded_run on that agent (e.g. a daily report job). Wait for account/rateLimits/updated to fire (eventually inevitable — happens on every Codex auth state change).
  3. For variant #4: configure that agent to use an MCP server (e.g. GitHub MCP) for tool calls. Send a chat DM that triggers a tool-use turn. The first MCP server spawn fires mcpServer/startupStatus/updated.

The cron-driven path (variant #2) is unrecoverable because the gateway never re-enters the loop to drain Telegram polls and respond to AWS health checks. The chat-driven path (variant #4) presents identically in our deployment.

Hypothesis

The codex_app_server:notification dispatch path in @openclaw/codex does synchronous work (parse + dispatch + log + side-effects like cache updates) on the event-loop thread. For light notification types (rawResponseItem/completed) the work is small enough to fit in a normal loop tick; for account/rateLimits/updated and mcpServer/startupStatus/updated it's large enough to block for multiple seconds. Combined with a concurrent embedded_run (which is doing its own LLM-shaped work), the loop crosses the saturation threshold and never recovers.
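The mechanism in this hypothesis can be illustrated in isolation. The sketch below does not use any @openclaw/codex code; `blockFor` is a hypothetical stand-in for a handler that does heavy synchronous work, and `measureTimerLag` shows that a timer scheduled for 0 ms cannot fire until that work returns.

```typescript
// Hypothetical stand-in for a handler doing heavy synchronous work; the real
// handler body is not shown in this report.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin on the event-loop thread
  }
}

// Measures how late a 0 ms timer fires when synchronous work runs first.
function measureTimerLag(blockMs: number): Promise<number> {
  return new Promise<number>((resolve) => {
    const scheduled = Date.now();
    setTimeout(() => resolve(Date.now() - scheduled), 0);
    blockFor(blockMs); // the timer callback cannot run until this returns
  });
}

measureTimerLag(50).then((lag) => {
  console.log(`timer lag: ${lag} ms`);
});
```

Any health-check response or Telegram poll queued behind such a handler waits the same way, which is consistent with the gateway going silent while the process is still alive.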

Adjacent: openai/codex#17501 discusses exposing MCP startup notifications to JSONL — different surface, but suggests the notification path is rich enough on the OpenAI side to carry real payload.

Mitigation we're running

Until upstream resolves:

  • plugins.entries.codex.enabled: true (after 2026.5.12 upgrade — needed for active-memory's resolver to behave)
  • All 4 chat agents kept on amazon-bedrock/us.anthropic.claude-sonnet-4-6 (no agentRuntime) — codex plugin loaded but no agents codex-runtime-routed
  • All codex-routed crons in ~/.openclaw/cron/jobs.json disabled
  • One canary agent (oikos — no MCP, no Telegram traffic) currently on openai/gpt-5.5 for soak observation, but the canary can't actually exercise variant #4 because it has no MCP

Full forensic trail and four-variant decomposition: ValantisV/OpenClaw-Personal#57.

Suspected fix shape

Move the heavy work out of the notification dispatch path:

  • Either dispatch the notification to a worker thread or a deferred task (for example via setImmediate; a microtask alone would still run before the loop yields), returning from the handler synchronously
  • Or split each handler so only the minimum bookkeeping happens synchronously and the rest defers
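One possible shape for the deferral, as a sketch under stated assumptions: `DeferredDispatcher`, `Notification`, and `Handler` are hypothetical names, and the real dispatch path in @openclaw/codex is not shown in this report. The synchronous part only enqueues; the handler runs later via `setImmediate`, one notification per loop iteration, so pending I/O callbacks can interleave.

```typescript
type Notification = { method: string; params: unknown };
type Handler = (n: Notification) => void;

// Hypothetical dispatcher: not the plugin's actual implementation.
class DeferredDispatcher {
  private queue: Notification[] = [];
  private scheduled = false;
  private handle: Handler;

  constructor(handle: Handler) {
    this.handle = handle;
  }

  // Synchronous part: enqueue only, then return to the caller immediately.
  dispatch(n: Notification): void {
    this.queue.push(n);
    if (!this.scheduled) {
      this.scheduled = true;
      setImmediate(() => this.drainOne());
    }
  }

  get pending(): number {
    return this.queue.length;
  }

  // Process one notification per loop iteration so I/O can run in between.
  private drainOne(): void {
    const next = this.queue.shift();
    try {
      if (next) this.handle(next);
    } finally {
      if (this.queue.length > 0) {
        setImmediate(() => this.drainOne());
      } else {
        this.scheduled = false;
      }
    }
  }
}
```

This only spreads work across loop iterations. If a single handler is itself CPU-bound for seconds, it still blocks when it eventually runs, and moving that work to a worker thread would be the more complete option.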

Light-touch alternative: add an event-loop budget check around the synchronous payload-processing call and bail early if eventLoopUtilization is already elevated — better to drop a notification than wedge the gateway.
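The budget check could look roughly like the following. It is a sketch, not the plugin's code: `guardedHandle`, `shouldDrop`, and the 0.9 threshold are all assumptions introduced for illustration, while `performance.eventLoopUtilization` is the standard Node API.

```typescript
import { performance } from "node:perf_hooks";

// Hypothetical threshold; the report does not specify one.
const ELU_DROP_THRESHOLD = 0.9;

// Pure decision function so it can be tested without a loaded loop.
function shouldDrop(utilization: number, threshold: number = ELU_DROP_THRESHOLD): boolean {
  return utilization >= threshold;
}

let previousElu = performance.eventLoopUtilization();

// Runs the handler only if the loop was not saturated since the last check.
// Returns false when the notification was dropped.
function guardedHandle<T>(payload: T, run: (p: T) => void): boolean {
  const currentElu = performance.eventLoopUtilization();
  const delta = performance.eventLoopUtilization(currentElu, previousElu);
  previousElu = currentElu;
  if (shouldDrop(delta.utilization)) return false;
  run(payload);
  return true;
}
```

With this threshold the variant #2 reading (eventLoopUtilization=0.934) would be dropped, but the variant #4 reading (0.564) would not, so a utilization check alone may be insufficient and a delay-based budget might also be needed.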
