codex: close_agent can hang forever after marking a subagent closed if the child thread never terminates

Summary

multi_agent_v1.close_agent can hang indefinitely when the target subagent thread has already been interrupted or otherwise cannot complete shutdown. The durable spawn edge can be marked closed, but the tool call then waits forever on the live thread termination future, leaving the in-memory agent registry slot counted and causing later spawn_agent calls to fail with agent thread limit reached.

Observed Locally

Threads:

  • Parent/root thread: 019e7946-6c26-7f63-b5a3-a8169d9a6d38
  • Stuck subagent: 019e7e3f-b65b-79e1-b364-755de66dae01
  • Nickname: Anscombe
  • Codex Desktop CLI version in the subagent rollout: 0.135.0-alpha.1
  • Parent/root source version in the same rollout: 0.133.0

Symptoms:

  • Repeated calls to close_agent({ target: "019e7e3f-b65b-79e1-b364-755de66dae01" }) did not return until the user interrupted the turn, after many minutes.
  • Later spawn_agent calls failed with collab spawn failed: agent thread limit reached.
  • The persistent SQLite state already showed the subagent edge as closed:

SELECT parent_thread_id, child_thread_id, status
FROM thread_spawn_edges
WHERE child_thread_id = '019e7e3f-b65b-79e1-b364-755de66dae01';

Result:

019e7946-6c26-7f63-b5a3-a8169d9a6d38 | 019e7e3f-b65b-79e1-b364-755de66dae01 | closed

The rollout file for the subagent ends with an interrupted turn, not a normal completion:

/Users/lume/.codex/sessions/2026/05/31/rollout-2026-05-31T20-36-05-019e7e3f-b65b-79e1-b364-755de66dae01.jsonl
...
event_msg task_started
turn_context
response_item user "<turn_aborted>..."
event_msg turn_aborted reason="interrupted" duration_ms=1227657

Code Path

The tool handler awaits AgentControl.close_agent(...) directly:

// codex-rs/core/src/tools/handlers/multi_agents/close_agent.rs
let result = Box::pin(session.services.agent_control.close_agent(agent_id))
    .await
    .map_err(|err| collab_agent_error(agent_id, err))
    .map(|_| ());

AgentControl.close_agent(...) persists the spawn edge as closed, then calls shutdown_agent_tree(...).

shutdown_agent_tree(...) calls shutdown_live_agent(...).

shutdown_live_agent(...) sends shutdown and then waits without a timeout:

// codex-rs/core/src/agent/control.rs
let result = if matches!(thread.agent_status().await, AgentStatus::Shutdown) {
    Ok(String::new())
} else {
    state.send_op(agent_id, Op::Shutdown {}).await
};
thread.wait_until_terminated().await;
...
self.state.release_spawned_thread(agent_id);

The registry slot is released only after wait_until_terminated() returns:

// codex-rs/core/src/agent/registry.rs
pub(crate) fn release_spawned_thread(&self, thread_id: ThreadId) {
    ...
    if removed_counted_agent {
        self.total_count.fetch_sub(1, Ordering::AcqRel);
    }
}

So if the child session loop never terminates, close_agent never returns and the counted live-agent slot is never released.
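
The ordering described above can be modeled with the standard library alone. This is a minimal sketch, not the codex-rs implementation: `Model`, `EdgeStatus`, `close_agent_model`, and `observe_divergence` are hypothetical names, a channel stands in for the child's termination future, and the child here does eventually terminate so that the example finishes.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the durable `thread_spawn_edges.status` value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EdgeStatus {
    Open,
    Closed,
}

/// Hypothetical model of the two pieces of state that can diverge.
struct Model {
    edge: Mutex<EdgeStatus>,
    live_count: AtomicUsize,
}

/// Same ordering as the excerpts: persist, wait without a timeout, then release.
fn close_agent_model(model: &Model, terminated: Receiver<()>, persisted: Sender<()>) {
    *model.edge.lock().unwrap() = EdgeStatus::Closed;
    persisted.send(()).unwrap();
    // Unbounded wait: blocks until the child signals termination
    // (or its sender is dropped).
    let _ = terminated.recv();
    // The counted slot is released only after the wait returns.
    model.live_count.fetch_sub(1, Ordering::AcqRel);
}

/// Returns (edge status during the wait, live count during the wait,
/// live count after the child finally terminates).
fn observe_divergence() -> (EdgeStatus, usize, usize) {
    let model = Arc::new(Model {
        edge: Mutex::new(EdgeStatus::Open),
        live_count: AtomicUsize::new(1),
    });
    let (term_tx, term_rx) = mpsc::channel();
    let (persist_tx, persist_rx) = mpsc::channel();

    let closer_model = Arc::clone(&model);
    let closer = thread::spawn(move || close_agent_model(&closer_model, term_rx, persist_tx));

    // Wait until the edge has been persisted; the closer cannot get past
    // `recv()` because nothing has been sent on `term_tx` yet.
    persist_rx.recv().unwrap();
    let edge_during = *model.edge.lock().unwrap();
    let count_during = model.live_count.load(Ordering::Acquire);

    // Only now does the modeled child terminate.
    term_tx.send(()).unwrap();
    closer.join().unwrap();
    let count_after = model.live_count.load(Ordering::Acquire);

    (edge_during, count_during, count_after)
}
```

In this model, the edge is already `Closed` while the live count is still 1. If the termination signal were never sent, the count would stay at 1 for the life of the process.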

Why This Looks Like A Bug

The durable state and live registry can diverge:

  • SQLite thread_spawn_edges.status becomes closed.
  • The tool never emits a successful response because it is still waiting for the thread loop.
  • The in-memory registry still counts the subagent, so new spawns hit the thread limit.
  • Retrying close_agent is not idempotent in practice; it re-enters the same unbounded wait.

This creates a bad recovery loop for long-running maintainer sessions: the recommended cleanup action (close_agent) becomes the operation that wedges the turn.

Expected Behavior

close_agent should be bounded and idempotent:

  • If durable spawn-edge status is already closed, a repeated close should not block indefinitely.
  • Shutdown should have a timeout or force-detach path.
  • A stuck or interrupted child should release its AgentRegistry slot after a bounded shutdown attempt, or at least return a structured partial result such as shutdown_timed_out.
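
One way to picture the idempotent contract is a small result type plus a close routine that checks the durable status first. This is only a sketch under stated assumptions: `CloseOutcome`, `EdgeStore`, and the `shutdown` closure are hypothetical, and the closure stands in for a bounded shutdown attempt that reports whether the child terminated in time.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the durable spawn-edge status.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EdgeStatus {
    Open,
    Closed,
}

/// Hypothetical structured result for a close request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CloseOutcome {
    /// The child terminated within the bounded shutdown attempt.
    Closed,
    /// The edge was already closed; no shutdown wait was repeated.
    AlreadyClosed,
    /// The edge was closed but the child did not terminate in time.
    ShutdownTimedOut,
}

/// Hypothetical in-memory stand-in for the persistent edge table.
struct EdgeStore {
    edges: HashMap<String, EdgeStatus>,
}

impl EdgeStore {
    fn new() -> Self {
        EdgeStore { edges: HashMap::new() }
    }

    fn open(&mut self, child: &str) {
        self.edges.insert(child.to_string(), EdgeStatus::Open);
    }

    /// Closes `child`. `shutdown` models a bounded shutdown attempt and
    /// returns true if the child terminated in time. Returns None for an
    /// unknown child.
    fn close(&mut self, child: &str, shutdown: impl FnOnce() -> bool) -> Option<CloseOutcome> {
        match self.edges.get(child).copied()? {
            // Repeated close: answer immediately instead of waiting again.
            EdgeStatus::Closed => Some(CloseOutcome::AlreadyClosed),
            EdgeStatus::Open => {
                self.edges.insert(child.to_string(), EdgeStatus::Closed);
                if shutdown() {
                    Some(CloseOutcome::Closed)
                } else {
                    Some(CloseOutcome::ShutdownTimedOut)
                }
            }
        }
    }
}
```

With this shape, a first close that times out reports `ShutdownTimedOut`, and any retry reports `AlreadyClosed` without invoking the shutdown attempt again.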

Possible Fix Direction

One conservative approach:

  1. Add a bounded timeout around thread.wait_until_terminated() in shutdown_live_agent().
  2. On timeout:
    • remove the thread from the live thread manager if safe,
    • release the spawned-thread registry slot,
    • return a distinguishable error or partial-success status.
  3. Make close_agent detect already-closed persistent edges and avoid redoing the blocking wait unless the caller explicitly asks for a force shutdown.
  4. Add regression coverage for:
    • child thread never terminates after Op::Shutdown,
    • durable edge already closed but live thread still present,
    • repeated close_agent returns promptly and does not keep the registry slot counted.
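
Steps 1 and 2 could look roughly like the following std-only sketch. It is not the codex-rs API: `Registry`, `ShutdownResult`, and `shutdown_with_timeout` are hypothetical, and a channel with `recv_timeout` stands in for a timeout around `thread.wait_until_terminated()`. Treating a dropped sender as termination is also an assumption of this model.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical distinguishable result of a bounded shutdown attempt.
#[derive(Debug, PartialEq, Eq)]
enum ShutdownResult {
    Terminated,
    TimedOut,
}

/// Hypothetical stand-in for the in-memory agent registry counter.
struct Registry {
    total_count: AtomicUsize,
}

impl Registry {
    /// Decrements the live count, never going below zero.
    fn release_slot(&self) {
        let _ = self
            .total_count
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1));
    }
}

/// Waits at most `timeout` for the child's termination signal, then releases
/// the registry slot whether or not the child terminated.
fn shutdown_with_timeout(
    registry: &Registry,
    terminated: &mpsc::Receiver<()>,
    timeout: Duration,
) -> ShutdownResult {
    let result = match terminated.recv_timeout(timeout) {
        Ok(()) => ShutdownResult::Terminated,
        // Assumption: a dropped sender means the child's loop has exited.
        Err(RecvTimeoutError::Disconnected) => ShutdownResult::Terminated,
        Err(RecvTimeoutError::Timeout) => ShutdownResult::TimedOut,
    };
    // Release on both paths so a stuck child cannot pin the slot.
    registry.release_slot();
    result
}
```

A regression test for the "child thread never terminates" case can keep the sender alive without ever sending, call `shutdown_with_timeout` with a short timeout, and assert that the result is `TimedOut` and the count is back to zero. Whether it is safe to release the slot while the child thread is still alive is the open design question that the "if safe" qualifier above points at.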

Local Mitigation Used

I did not patch the live process. I only cleaned up stale persistent spawn-edge rows for completed subagents after backing up the DB:

/Volumes/LEXAR/Codex/codex-state-backups/state_5-before-subagent-edge-cleanup-20260531-225132.sqlite

This only tidies the persistent state for future resumes; it does not guarantee that the current in-memory registry counter is released without a process restart or a bounded shutdown fix.
