codex: multi_agent wait_agent/spawn_agent can block for ~7.5h; wait_agent timeout_ms not enforced during runtime stall

What version of the Codex App are you using (From “About Codex” dialog)?

26.519.81530

What subscription do you have?

ChatGPT Pro

What platform is your computer?

Microsoft Windows NT 10.0.26200.0 x64

What issue are you seeing?

multi_agent_v1.wait_agent and adjacent spawn_agent tool calls can block the parent thread for hours. In the strongest observed case, wait_agent(timeout_ms=300000) returned only after 27,017.696 seconds, and still returned {"status":{},"timed_out":true}. An adjacent spawn_agent call in another subagent session took 26,994.842 seconds before returning an agent_id.

This appears to violate the user-visible timeout contract: a wait_agent call with timeout_ms=300000 should not block the parent for ~7.5 hours before reporting that it timed out.

This does not look like a bug confined to wait_agent: spawn_agent calls stalled in the same quiet window and were released at the same time. The evidence points to a stall in a shared multi-agent control-plane, tool-dispatch, or runtime path, with no end-to-end whole-tool deadline protecting the parent turn.

What steps can reproduce the bug?

I do not have a small deterministic repro, but I have concrete Codex JSONL evidence from local sessions.

Observed evidence:

  1. wait_agent(timeout_ms=300000) exceeded its requested timeout by hours.

     file: C:/Users/<redacted>/.codex/sessions/2026/05/28/rollout-2026-05-28T05-46-57-019e6ce8-205f-7c50-9544-3ac26e049e68.jsonl
     call line: 497
     output line: 498
     call_id: call_FrWUUz6SfgcfCfH5vSeDkqsP
     args: {"targets":["019e6cff-7f1e-7803-a50c-b5f2eb72971f"],"timeout_ms":300000}
     start: 2026-05-28T05:18:20.235Z
     end:   2026-05-28T12:48:37.931Z
     elapsed: 27017.696s
     output: {"status":{},"timed_out":true}

  2. An adjacent spawn_agent call stalled in the same quiet window.

     file: C:/Users/<redacted>/.codex/sessions/2026/05/28/rollout-2026-05-28T05-47-02-019e6ce8-34c0-7a33-8367-69330797145b.jsonl
     call line: 567
     output line: 571
     call_id: call_KG8lMx4dpGHtXjaOLNacZ5RT
     args: {"fork_context":false,"model":"gpt-5.5","reasoning_effort":"xhigh","message":"<redacted len=589>"}
     start: 2026-05-28T05:18:50.645Z
     end:   2026-05-28T12:48:45.487Z
     elapsed: 26994.842s
     output: {"agent_id":"019e6ea1-3421-79c3-8eb5-432bc82c7fbe"}

  3. Other long calls in the same window:

     wait_agent timeout_ms=900000
     call_id: call_V3GOrqlj7zG9EgRKHTC31QkB
     elapsed: 27873.523s
     output: timed_out=true

     spawn_agent
     call_id: call_T9P4PVazFbWVXgKvFl398m5f
     elapsed: 26988.260s

Local scan summary:

  • Scanned Codex JSONL logs under .codex/sessions and .codex/archived_sessions.
  • Paired thousands of function_call / function_call_output records by call_id.
  • The long stall window had no local Codex JSONL events from 2026-05-28T05:19:23.943Z to 2026-05-28T12:48:37.894Z.
  • Several stalled multi-agent calls released together at ~12:48Z.
  • Local Windows System/Application logs did not show sleep/resume/reboot/network transition evidence during the window.
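
For anyone who wants to run a similar scan, here is a minimal sketch of the pairing step. The record shape is an assumption: the field names `timestamp`, `type`, and `call_id` are hypothetical stand-ins and may not match the real Codex JSONL schema. Only the timestamps and the call_id in the sample come from the evidence above.

```python
import json
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Rewrite a trailing "Z" so older Python versions can parse the timestamp.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def pair_calls(jsonl_lines):
    """Pair call/output records by call_id and return elapsed seconds per call.

    Assumed (hypothetical) record shape:
    {"timestamp": "...Z", "type": "function_call" | "function_call_output",
     "call_id": "..."}
    """
    started = {}
    elapsed = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        call_id = rec.get("call_id")
        if call_id is None or "timestamp" not in rec:
            continue
        ts = parse_ts(rec["timestamp"])
        if rec.get("type") == "function_call":
            started[call_id] = ts
        elif rec.get("type") == "function_call_output" and call_id in started:
            elapsed[call_id] = (ts - started.pop(call_id)).total_seconds()
    return elapsed


def largest_gap(timestamps):
    """Return (gap_seconds, gap_start, gap_end) for the longest quiet window."""
    ordered = sorted(parse_ts(t) for t in timestamps)
    if len(ordered) < 2:
        return None
    return max(
        ((b - a).total_seconds(), a, b) for a, b in zip(ordered, ordered[1:])
    )


if __name__ == "__main__":
    sample = [
        '{"timestamp": "2026-05-28T05:18:20.235Z", "type": "function_call", '
        '"call_id": "call_FrWUUz6SfgcfCfH5vSeDkqsP"}',
        '{"timestamp": "2026-05-28T12:48:37.931Z", "type": "function_call_output", '
        '"call_id": "call_FrWUUz6SfgcfCfH5vSeDkqsP"}',
    ]
    print(pair_calls(sample))
```

`largest_gap` takes every event timestamp in a session and surfaces a quiet window like the one reported between 05:19:23.943Z and 12:48:37.894Z.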

Likely workload shape:

  • Multiple Codex Desktop subagents were running.
  • Parent/subagent sessions used multi_agent_v1.wait_agent and spawn_agent.
  • At least one wait_agent call had a 5-minute timeout (timeout_ms=300000), but the parent remained blocked for ~7.5 hours.

What is the expected behavior?

- `wait_agent(timeout_ms=300000)` should return within approximately 300 seconds plus small overhead, either with a completed status or a timeout result.
- `spawn_agent` should return promptly with an `agent_id` or fail with a bounded startup/initialization error.
- No multi-agent control-plane tool should be able to monopolize the parent turn indefinitely.
- If a lower-level Desktop/runtime/transport/tool-dispatch path stalls, the tool call should still be bounded by an end-to-end hard deadline and return a structured error/timeout.
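
The first expectation can be written as a simple check against the observed calls. This is only a sketch: the 5-second overhead allowance is my assumption for "small overhead", and `WaitCall` and `overshoot_s` are hypothetical names, not part of Codex.

```python
from dataclasses import dataclass


@dataclass
class WaitCall:
    call_id: str
    timeout_ms: int
    elapsed_s: float


def overshoot_s(call: WaitCall, overhead_s: float = 5.0) -> float:
    """Seconds by which a call exceeded timeout_ms plus the allowed overhead.

    Returns 0.0 when the call stayed within its budget. The default
    overhead_s is an assumed allowance, not a documented Codex value.
    """
    budget_s = call.timeout_ms / 1000 + overhead_s
    return max(0.0, call.elapsed_s - budget_s)


# The two wait_agent calls reported in the evidence above.
observed = [
    WaitCall("call_FrWUUz6SfgcfCfH5vSeDkqsP", 300_000, 27017.696),
    WaitCall("call_V3GOrqlj7zG9EgRKHTC31QkB", 900_000, 27873.523),
]

if __name__ == "__main__":
    for call in observed:
        print(call.call_id, f"over budget by {overshoot_s(call):.3f}s")
```

Both observed calls exceed their budget by more than seven hours, so the result does not depend on the exact overhead allowance chosen.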

Additional information

This looks distinct from, but related to, close_agent hangs such as:

This report is specifically about:

  • wait_agent.timeout_ms not bounding the whole tool call;
  • adjacent spawn_agent calls stalling in the same quiet window;
  • likely missing end-to-end deadlines around shared multi-agent/tool-dispatch/control-plane paths.

Public-source observations from openai/codex:

  • In codex-rs/core/src/tools/handlers/multi_agents/wait.rs, timeout_ms appears to wrap only the inner final-status wait. Other awaits, including event dispatch/status subscription/final event dispatch and outer tool execution, are not covered by a whole-tool deadline.
  • In codex-rs/core/src/tools/handlers/multi_agents/spawn.rs, spawn_agent appears to have no outer deadline around event dispatch, config/model/role work, spawn_agent_with_metadata, metadata reads, or final event dispatch.
  • The shared tool path also goes through tool runtime/registry/lifecycle/hook/event/persistence surfaces that appear to have unbounded awaits.
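
To illustrate the scope problem described above, here is a small asyncio model, not the actual Rust handler. All function names are hypothetical, and the stalled setup phase is simulated with a sleep. It shows how a timeout that wraps only the inner wait lets the whole call run long, and how an outer deadline bounds it.

```python
import asyncio
import time


async def stalled_setup(stall_s: float) -> None:
    # Stand-in for event dispatch / status subscription that stalls.
    await asyncio.sleep(stall_s)


async def final_status_wait() -> dict:
    # Stand-in for the inner final-status wait; it never completes here.
    await asyncio.Event().wait()
    return {}


async def wait_agent_inner_timeout_only(timeout_s: float, stall_s: float) -> dict:
    await stalled_setup(stall_s)  # NOT covered by timeout_s
    try:
        status = await asyncio.wait_for(final_status_wait(), timeout=timeout_s)
        return {"status": status, "timed_out": False}
    except asyncio.TimeoutError:
        return {"status": {}, "timed_out": True}


async def wait_agent_with_outer_deadline(
    timeout_s: float, stall_s: float, grace_s: float = 0.05
) -> dict:
    # The outer deadline covers setup, the inner wait, and anything after it.
    try:
        return await asyncio.wait_for(
            wait_agent_inner_timeout_only(timeout_s, stall_s),
            timeout=timeout_s + grace_s,
        )
    except asyncio.TimeoutError:
        return {"status": {}, "timed_out": True, "error": "tool_deadline_exceeded"}


async def timed(fn, *args):
    start = time.monotonic()
    result = await fn(*args)
    return time.monotonic() - start, result


async def main() -> None:
    for fn in (wait_agent_inner_timeout_only, wait_agent_with_outer_deadline):
        elapsed, result = await timed(fn, 0.1, 0.5)
        print(f"{fn.__name__}: {elapsed:.2f}s {result}")


if __name__ == "__main__":
    asyncio.run(main())
```

With a 0.1 s timeout and a 0.5 s setup stall, the first variant returns only after the stall plus the timeout and still reports `timed_out`, which is the same shape as the 300000 ms call that returned after 27017.696 s. The second variant returns shortly after the timeout.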

I cannot prove from public source alone whether the ~7.5h quiet window originated in Codex Desktop, app-server/gateway, Rust runtime lock/backpressure, lifecycle hooks, event dispatch, or persistence. But the public code appears to lack the hard end-to-end deadline needed to keep the parent turn responsive when any of those shared paths stalls.

Suggested fix direction:

  • Add a dispatcher/control-plane whole-tool deadline for multi-agent tools from tool invocation to function_call_output.
  • Start wait_agent timeout accounting before setup/event dispatch, or add a separate outer deadline that includes those phases.
  • Add bounded startup/initialization deadlines to spawn_agent.
  • Make collaboration event dispatch, lifecycle hooks, post-hooks, notification, and persistence paths bounded or best-effort where appropriate.
  • Add timing instrumentation for multi-agent tool phases so future reports can identify which await path stalled.
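
The last two suggestions could be sketched roughly as follows, again as an asyncio model rather than the real handler code. `phase`, `best_effort`, and `spawn_agent_sketch` are hypothetical names, the budgets are arbitrary, and the agent id is a placeholder.

```python
import asyncio
import time
from contextlib import contextmanager


@contextmanager
def phase(timings: dict, name: str):
    # Record how long a named phase took, even if it raised or was cancelled.
    start = time.monotonic()
    try:
        yield
    finally:
        timings[name] = time.monotonic() - start


async def best_effort(coro, budget_s: float) -> bool:
    """Run a non-critical step within budget_s; report whether it finished."""
    try:
        await asyncio.wait_for(coro, timeout=budget_s)
        return True
    except asyncio.TimeoutError:
        return False


async def spawn_agent_sketch(dispatch_stall_s: float) -> dict:
    timings = {}
    with phase(timings, "begin_event"):
        # Event dispatch is treated as best-effort: a stall here is cut off
        # after its budget instead of blocking the spawn itself.
        delivered = await best_effort(
            asyncio.sleep(dispatch_stall_s), budget_s=0.05
        )
    with phase(timings, "spawn"):
        await asyncio.sleep(0.01)  # stand-in for the actual spawn work
        agent_id = "agent-placeholder"
    return {
        "agent_id": agent_id,
        "begin_event_delivered": delivered,
        "timings": timings,
    }


if __name__ == "__main__":
    print(asyncio.run(spawn_agent_sketch(dispatch_stall_s=1.0)))
```

Per-phase timings attached to the tool output (or logged next to it) would let a future report say which await stalled, which the current JSONL evidence cannot show.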

Related public issues that seem adjacent but not exact duplicates:
