codex: close_agent can hang forever after marking a subagent closed if the child thread never terminates

StepCodex · 2026-05-31T15:54:24Z




Summary

multi_agent_v1.close_agent can hang indefinitely when the target subagent thread has already been interrupted or otherwise cannot complete shutdown. The durable spawn edge can be marked closed, but the tool call then waits forever on the live thread termination future, leaving the in-memory agent registry slot counted and causing later spawn_agent calls to fail with agent thread limit reached.

Observed Locally

Thread:

- Parent/root thread: 019e7946-6c26-7f63-b5a3-a8169d9a6d38
- Stuck subagent: 019e7e3f-b65b-79e1-b364-755de66dae01
- Nickname: Anscombe
- Codex Desktop CLI version in the subagent rollout: 0.135.0-alpha.1
- Parent/root source version in the same rollout: 0.133.0

Symptoms:

- Repeated calls to close_agent({ target: "019e7e3f-b65b-79e1-b364-755de66dae01" }) did not return until the user interrupted the turn, after many minutes.
- Later spawn_agent calls failed with "collab spawn failed: agent thread limit reached".
- The persistent SQLite state already showed the subagent edge as closed:

SELECT parent_thread_id, child_thread_id, status
FROM thread_spawn_edges
WHERE child_thread_id = '019e7e3f-b65b-79e1-b364-755de66dae01';

Result:

019e7946-6c26-7f63-b5a3-a8169d9a6d38 | 019e7e3f-b65b-79e1-b364-755de66dae01 | closed

The rollout file for the subagent ends with an interrupted turn, not a normal completion:

/Users/lume/.codex/sessions/2026/05/31/rollout-2026-05-31T20-36-05-019e7e3f-b65b-79e1-b364-755de66dae01.jsonl
...
event_msg task_started
turn_context
response_item user "<turn_aborted>..."
event_msg turn_aborted reason="interrupted" duration_ms=1227657

Code Path

The tool handler awaits AgentControl.close_agent(...) directly:

// codex-rs/core/src/tools/handlers/multi_agents/close_agent.rs
let result = Box::pin(session.services.agent_control.close_agent(agent_id))
    .await
    .map_err(|err| collab_agent_error(agent_id, err))
    .map(|_| ());

AgentControl.close_agent(...) persists the spawn edge as closed, then calls shutdown_agent_tree(...).

shutdown_agent_tree(...) calls shutdown_live_agent(...).

shutdown_live_agent(...) sends shutdown and then waits without a timeout:

// codex-rs/core/src/agent/control.rs
let result = if matches!(thread.agent_status().await, AgentStatus::Shutdown) {
    Ok(String::new())
} else {
    state.send_op(agent_id, Op::Shutdown {}).await
};
thread.wait_until_terminated().await;
...
self.state.release_spawned_thread(agent_id);

The registry slot is released only after wait_until_terminated() returns:

// codex-rs/core/src/agent/registry.rs
pub(crate) fn release_spawned_thread(&self, thread_id: ThreadId) {
    ...
    if removed_counted_agent {
        self.total_count.fetch_sub(1, Ordering::AcqRel);
    }
}

So if the child session loop never terminates, close_agent never returns and the counted live-agent slot is never released.
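To make the consequence concrete, here is a minimal std-only model of a counted registry. It is a sketch, not the codex implementation: `Registry`, `max_threads`, `reserve_spawn_slot`, and `live_count` are hypothetical names introduced for illustration, and only `release_spawned_thread`, `total_count`, and the error text come from the report. It shows that a slot which is never released keeps every later reservation failing.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the in-memory agent registry.
pub struct Registry {
    total_count: AtomicUsize,
    max_threads: usize, // assumed limit; the real value is not stated in the report
}

impl Registry {
    pub fn new(max_threads: usize) -> Self {
        Registry { total_count: AtomicUsize::new(0), max_threads: max_threads }
    }

    /// Reserve a slot for a new agent, failing once the limit is reached.
    pub fn reserve_spawn_slot(&self) -> Result<(), &'static str> {
        let mut current = self.total_count.load(Ordering::Acquire);
        loop {
            if current >= self.max_threads {
                return Err("agent thread limit reached");
            }
            match self.total_count.compare_exchange(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Ok(()),
                Err(observed) => current = observed, // lost a race, retry
            }
        }
    }

    /// Release one counted slot; never underflows.
    pub fn release_spawned_thread(&self) {
        let _ = self
            .total_count
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1));
    }

    pub fn live_count(&self) -> usize {
        self.total_count.load(Ordering::Acquire)
    }
}
```

With a limit of one, a second `reserve_spawn_slot()` keeps returning the limit error until `release_spawned_thread()` runs, which in the reported code path only happens after the unbounded wait returns.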

Why This Looks Like A Bug

The durable state and live registry can diverge:

- SQLite thread_spawn_edges.status becomes closed.
- The tool never emits a successful response because it is still waiting for the thread loop.
- The in-memory registry still counts the subagent, so new spawns hit the thread limit.
- Retrying close_agent is not idempotent in practice; it re-enters the same unbounded wait.

This creates a bad recovery loop for long-running maintainer sessions: the recommended cleanup action (close_agent) becomes the operation that wedges the turn.
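The divergence can be stated as a simple check: a child is in the bad state when its durable edge says closed while the live registry still counts it. The sketch below assumes both views can be snapshotted into plain collections. `EdgeStatus`, its `Open` variant, and `closed_but_still_counted` are hypothetical names; the report only confirms that a `closed` status exists.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical mirror of thread_spawn_edges.status.
/// Only "closed" is confirmed by the report; `Open` is an assumed counterpart.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EdgeStatus {
    Open,
    Closed,
}

/// Return the child ids whose durable edge is closed but which the
/// in-memory registry still counts as live.
pub fn closed_but_still_counted<'a>(
    durable_edges: &'a BTreeMap<String, EdgeStatus>,
    live_counted: &BTreeSet<String>,
) -> Vec<&'a str> {
    durable_edges
        .iter()
        .filter(|(id, status)| {
            **status == EdgeStatus::Closed && live_counted.contains(id.as_str())
        })
        .map(|(id, _)| id.as_str())
        .collect()
}
```

A non-empty result is the state described above: the slot is spent even though the persistent record claims the subagent is gone.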

Expected Behavior

close_agent should be bounded and idempotent:

- If durable spawn-edge status is already closed, a repeated close should not block indefinitely.
- Shutdown should have a timeout or force-detach path.
- A stuck or interrupted child should release its AgentRegistry slot after a bounded shutdown attempt, or at least return a structured partial result such as shutdown_timed_out.
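One way to picture the structured result is a small outcome type that the tool could report instead of hanging. This is only an illustration of the expectation above: `CloseOutcome`, its variants, and `classify_close` are hypothetical, and only the `shutdown_timed_out` label is taken from the report.

```rust
/// Hypothetical structured result for a bounded, idempotent close.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CloseOutcome {
    /// The edge was open and the child terminated within the deadline.
    Closed,
    /// The edge was already closed and nothing was left to wait for.
    AlreadyClosed,
    /// A live child did not terminate before the deadline.
    ShutdownTimedOut,
}

impl CloseOutcome {
    /// Stable label a tool response could carry.
    pub fn as_str(self) -> &'static str {
        match self {
            CloseOutcome::Closed => "closed",
            CloseOutcome::AlreadyClosed => "already_closed",
            CloseOutcome::ShutdownTimedOut => "shutdown_timed_out",
        }
    }
}

/// Map what a bounded close attempt observed onto an outcome.
pub fn classify_close(
    edge_was_already_closed: bool,
    thread_was_live: bool,
    terminated_within_deadline: bool,
) -> CloseOutcome {
    if thread_was_live && !terminated_within_deadline {
        CloseOutcome::ShutdownTimedOut
    } else if edge_was_already_closed {
        CloseOutcome::AlreadyClosed
    } else {
        CloseOutcome::Closed
    }
}
```

The case from this report (edge already closed, thread still live, no termination) maps to `ShutdownTimedOut`, so a retry gets a prompt, distinguishable answer.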

Possible Fix Direction

One conservative approach:

1. Add a bounded timeout around thread.wait_until_terminated() in shutdown_live_agent().
2. On timeout:
   - remove the thread from the live thread manager if safe,
   - release the spawned-thread registry slot,
   - return a distinguishable error or partial-success status.
3. Make close_agent detect already-closed persistent edges and avoid redoing the blocking wait unless the caller explicitly asks for a force shutdown.
4. Add regression coverage for:
   - child thread never terminates after Op::Shutdown,
   - durable edge already closed but live thread still present,
   - repeated close_agent returns promptly and does not keep the registry slot counted.
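The bounded wait and the release-on-timeout step can be sketched with the standard library alone. This is a synchronous model under stated assumptions, not a patch: the real code is async, so an actual fix would wrap the await in whatever timeout facility the runtime provides. Here a channel stands in for the termination future, an `AtomicUsize` stands in for the registry counter, and `ShutdownResult` and `shutdown_with_deadline` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical result of a bounded shutdown attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum ShutdownResult {
    Terminated,
    ShutdownTimedOut,
}

/// Wait at most `deadline` for the child to report termination, then
/// release the counted slot whether or not it did.
///
/// `terminated` models the child's termination signal: the child sends
/// `()` (or drops its sender) when its loop exits.
pub fn shutdown_with_deadline(
    terminated: &Receiver<()>,
    live_count: &AtomicUsize,
    deadline: Duration,
) -> ShutdownResult {
    let result = match terminated.recv_timeout(deadline) {
        // A message or a dropped sender both mean the child loop is gone.
        Ok(()) | Err(RecvTimeoutError::Disconnected) => ShutdownResult::Terminated,
        // The child is stuck; report it instead of waiting forever.
        Err(RecvTimeoutError::Timeout) => ShutdownResult::ShutdownTimedOut,
    };
    // Release the slot on both paths so later spawns are not blocked.
    let _ = live_count.fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1));
    result
}
```

Keeping the sender alive without ever sending models the "child never terminates" regression case: the call returns `ShutdownTimedOut` after the deadline and the counter still drops. One trade-off to weigh is that releasing the slot while a stuck child may still be running lets the real number of threads exceed the nominal limit, which is why the list above qualifies detaching with "if safe".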

Local Mitigation Used

I did not patch the live process. I only cleaned up stale persistent spawn-edge rows for completed subagents after backing up the DB:

/Volumes/LEXAR/Codex/codex-state-backups/state_5-before-subagent-edge-cleanup-20260531-225132.sqlite

This is future-resume hygiene only; it does not guarantee the current in-memory registry counter is released without a process restart or a bounded shutdown fix.
