codex - 💡(How to fix) Fix Bug: Agent spawn slots leak across turns in persistent sessions (app-server / interactive CLI) [3 comments, 3 participants]

codex2026-04-17 13:09:33


openai/codex#18335•Fetched 2026-04-18 05:55:45


In persistent session modes (app-server transport, interactive CLI REPL), the AgentRegistry.total_count used to enforce agent_max_threads is only decremented when close_agent is explicitly called. If a spawned agent reaches a terminal state (Completed, Shutdown, etc.) but the model does not call close_agent before the turn ends, the spawn slot is permanently leaked. Over multiple turns, leaked slots accumulate until total_count >= max_threads, after which all spawn_agent calls fail with AgentLimitReached — even though zero agents are actually running.

This does not affect codex exec because each invocation starts a fresh process with a new AgentRegistry.
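
The leak can be reproduced in miniature. The sketch below is a toy model, not the real `AgentRegistry`: `ToyRegistry`, `Status`, and `two_turn_demo` are hypothetical names, and the only behaviour it copies from the report is that finishing an agent leaves `total_count` untouched while closing it decrements the count.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the real AgentRegistry.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Status {
    Running,
    Completed,
}

#[derive(Default)]
struct ToyRegistry {
    total_count: usize,
    agents: HashMap<u32, Status>,
    next_id: u32,
}

impl ToyRegistry {
    /// Reserve a slot; fails once total_count reaches max_threads.
    fn spawn(&mut self, max_threads: usize) -> Result<u32, String> {
        if self.total_count >= max_threads {
            return Err(format!("AgentLimitReached {{ max_threads: {max_threads} }}"));
        }
        self.total_count += 1;
        let id = self.next_id;
        self.next_id += 1;
        self.agents.insert(id, Status::Running);
        Ok(id)
    }

    /// The agent reaches a terminal state, but total_count is not touched.
    fn finish(&mut self, id: u32) {
        if let Some(status) = self.agents.get_mut(&id) {
            *status = Status::Completed;
        }
    }

    /// Only an explicit close gives the slot back.
    fn close(&mut self, id: u32) {
        if self.agents.remove(&id).is_some() {
            self.total_count -= 1;
        }
    }

    fn running(&self) -> usize {
        self.agents.values().filter(|s| **s == Status::Running).count()
    }
}

/// Turn 1 spawns up to the limit and lets every agent finish without closing it;
/// turn 2 then tries to spawn once more.
/// Returns (agents still running, result of the turn-2 spawn).
fn two_turn_demo(max_threads: usize) -> (usize, Result<u32, String>) {
    let mut registry = ToyRegistry::default();
    let ids: Vec<u32> = (0..max_threads)
        .map(|_| registry.spawn(max_threads).unwrap())
        .collect();
    for id in ids {
        registry.finish(id);
    }
    let running = registry.running();
    (running, registry.spawn(max_threads))
}
```

Calling `two_turn_demo(6)` reports zero running agents and an `Err` for the second-turn spawn, which is the symptom described above.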

Error Message

spawn_agent fails with CodexErr::AgentLimitReached { max_threads: N }, defined in codex-rs/protocol/src/error.rs.

Root Cause

1. `total_count` increment path

AgentRegistry::reserve_spawn_slot() increments total_count via try_increment_spawned():

// codex-rs/core/src/agent/registry.rs:80-97
pub(crate) fn reserve_spawn_slot(
    self: &Arc<Self>,
    max_threads: Option<usize>,
) -> Result<SpawnReservation> {
    if let Some(max_threads) = max_threads {
        if !self.try_increment_spawned(max_threads) {
            return Err(CodexErr::AgentLimitReached { max_threads });
        }
    } else {
        self.total_count.fetch_add(1, Ordering::AcqRel);
    }
    Ok(SpawnReservation { state: Arc::clone(self), active: true, ... })
}
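
The body of `try_increment_spawned()` is not shown in the issue. A bounded atomic increment of this kind is commonly written as a compare-and-swap loop; the sketch below shows one way to do that with `AtomicUsize::fetch_update`. `SlotCounter` and `fill_slots` are hypothetical, and the real method may be implemented differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical counter; the real AgentRegistry holds more state than this.
struct SlotCounter {
    total_count: AtomicUsize,
}

impl SlotCounter {
    /// Atomically bump the count only while it is below max_threads.
    /// Returns false and leaves the count unchanged once the limit is reached.
    fn try_increment_spawned(&self, max_threads: usize) -> bool {
        self.total_count
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
                if current < max_threads {
                    Some(current + 1)
                } else {
                    None
                }
            })
            .is_ok()
    }
}

/// Try to reserve `attempts` slots; returns (slots granted, final count).
fn fill_slots(max_threads: usize, attempts: usize) -> (usize, usize) {
    let counter = SlotCounter {
        total_count: AtomicUsize::new(0),
    };
    let granted = (0..attempts)
        .filter(|_| counter.try_increment_spawned(max_threads))
        .count();
    (granted, counter.total_count.load(Ordering::Acquire))
}
```

With a limit of 6 and 10 attempts, only 6 reservations succeed and the counter stops at 6. Nothing in this path ever lowers the count, so the limit check depends entirely on the decrement path described next.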

2. `total_count` decrement path — only via explicit shutdown

release_spawned_thread() is the only path that decrements total_count:

// codex-rs/core/src/agent/registry.rs:99-119
pub(crate) fn release_spawned_thread(&self, thread_id: ThreadId) {
    let removed_counted_agent = { /* remove from agent_tree */ };
    if removed_counted_agent {
        self.total_count.fetch_sub(1, Ordering::AcqRel);
    }
}

This is called from exactly two places:

- AgentControl::shutdown_live_agent() (control.rs:676), called by the close_agent tool handler
- AgentControl::handle_thread_request_result() (control.rs:655), only on InternalAgentDied
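
The `removed_counted_agent` guard in the excerpt suggests that releasing the same thread twice is meant to be harmless. The sketch below illustrates that property under assumptions: `agent_tree` is modelled as a plain `HashSet<u64>` behind a `Mutex`, which is almost certainly simpler than the real structure, and `ToyRegistry` and `release_twice` are hypothetical.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

struct ToyRegistry {
    total_count: AtomicUsize,
    /// Hypothetical: ids of the agents that currently hold a counted slot.
    agent_tree: Mutex<HashSet<u64>>,
}

impl ToyRegistry {
    /// Decrement only if this call actually removed a counted agent.
    fn release_spawned_thread(&self, thread_id: u64) {
        let removed_counted_agent = self.agent_tree.lock().unwrap().remove(&thread_id);
        if removed_counted_agent {
            self.total_count.fetch_sub(1, Ordering::AcqRel);
        }
    }
}

/// Release the same agent twice and return the resulting count.
fn release_twice() -> usize {
    let registry = ToyRegistry {
        total_count: AtomicUsize::new(2),
        agent_tree: Mutex::new(HashSet::from([1, 2])),
    };
    registry.release_spawned_thread(1);
    registry.release_spawned_thread(1); // second call finds nothing to remove
    registry.total_count.load(Ordering::Acquire)
}
```

If the real method behaves the same way, calling it from additional places (a watcher, a turn hook) should not risk a double decrement when `close_agent` is also called later.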

3. `wait_agent` does NOT release slots

Both v1 (multi_agents/wait.rs) and v2 (multi_agents_v2/wait.rs) wait_agent handlers only observe status — they never call release_spawned_thread or shutdown_live_agent.

4. Turn completion does NOT release slots

There is no turn-boundary cleanup hook. When a turn completes, the AgentRegistry state (including total_count and agent_tree) is carried forward unchanged into the next turn.
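
For illustration, a turn-boundary hook could look roughly like the sketch below. Everything here is hypothetical: the issue does not show how agent status is stored, so the sketch assumes `agent_tree` maps a thread id directly to a status, and it treats Completed, Shutdown, Errored, and NotFound as the final states.

```rust
use std::collections::HashMap;

/// Hypothetical status enum; the real one may have more variants.
#[derive(Clone, Copy, PartialEq, Eq)]
enum AgentStatus {
    Running,
    Completed,
    Shutdown,
    Errored,
    NotFound,
}

impl AgentStatus {
    fn is_final(self) -> bool {
        !matches!(self, AgentStatus::Running)
    }
}

struct ToyRegistry {
    total_count: usize,
    agent_tree: HashMap<u64, AgentStatus>,
}

impl ToyRegistry {
    /// Hypothetical hook to run when a turn completes: drop every agent in a
    /// final state and give its slot back. Returns the number released.
    fn on_turn_complete(&mut self) -> usize {
        let before = self.agent_tree.len();
        self.agent_tree.retain(|_, status| !status.is_final());
        let released = before - self.agent_tree.len();
        self.total_count -= released;
        released
    }
}

/// One running agent and four finished ones; returns (released, remaining count).
fn turn_demo() -> (usize, usize) {
    let mut registry = ToyRegistry {
        total_count: 5,
        agent_tree: HashMap::from([
            (1, AgentStatus::Running),
            (2, AgentStatus::Completed),
            (3, AgentStatus::Shutdown),
            (4, AgentStatus::Errored),
            (5, AgentStatus::NotFound),
        ]),
    };
    let released = registry.on_turn_complete();
    (released, registry.total_count)
}
```

In this toy run four slots are released and only the running agent keeps its slot.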

5. `maybe_start_completion_watcher` does NOT release slots

The completion watcher (control.rs:898-971) watches for agents reaching final status and notifies the parent thread, but it does not call release_spawned_thread.
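
A watcher that also releases the slot might be shaped like the sketch below. It is only an approximation: it uses `std::thread` and `std::sync::mpsc` rather than whatever async runtime the real watcher runs on, and `AgentStatus`, `watch_and_release`, and `watcher_demo` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

#[derive(Clone, Copy, Debug, PartialEq)]
enum AgentStatus {
    Running,
    Completed,
}

impl AgentStatus {
    fn is_final(self) -> bool {
        matches!(self, AgentStatus::Completed)
    }
}

/// Watch status updates for one agent. On the first final status, notify the
/// parent and (the hypothetical addition) release the agent's slot.
fn watch_and_release(
    updates: mpsc::Receiver<AgentStatus>,
    total_count: Arc<AtomicUsize>,
    parent: mpsc::Sender<AgentStatus>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for status in updates {
            if status.is_final() {
                let _ = parent.send(status);
                total_count.fetch_sub(1, Ordering::AcqRel);
                break;
            }
        }
    })
}

/// Returns (status delivered to the parent, slot count after the watcher exits).
fn watcher_demo() -> (AgentStatus, usize) {
    let total_count = Arc::new(AtomicUsize::new(1));
    let (status_tx, status_rx) = mpsc::channel();
    let (parent_tx, parent_rx) = mpsc::channel();
    let handle = watch_and_release(status_rx, Arc::clone(&total_count), parent_tx);
    status_tx.send(AgentStatus::Running).unwrap();
    status_tx.send(AgentStatus::Completed).unwrap();
    handle.join().unwrap();
    (parent_rx.recv().unwrap(), total_count.load(Ordering::Acquire))
}
```

A real change would presumably call `release_spawned_thread` instead of decrementing the counter directly, so that the guard against double release still applies.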

6. Why `codex exec` is immune

Each codex exec invocation starts a fresh OS process → fresh AgentRegistry → total_count starts at 0. Even with codex exec resume <thread_id>, the AgentRegistry is reconstructed from scratch (only agents with Open edge status are re-reserved via resume_agent_from_rollout).
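
The resume behaviour can be pictured as a filter over recorded edges. The sketch below is an assumption-heavy illustration: `EdgeStatus::Closed` and the `(thread_id, status)` tuple layout are invented for the example, and only the rule "re-reserve agents whose edge is Open" comes from the description above.

```rust
/// Hypothetical edge status; only `Open` is named in the issue.
#[derive(Clone, Copy, PartialEq, Eq)]
enum EdgeStatus {
    Open,
    Closed,
}

/// Number of slots a freshly rebuilt registry would hold after a resume:
/// only agents whose edge is still Open are re-reserved.
fn slots_after_resume(edges: &[(u64, EdgeStatus)]) -> usize {
    edges
        .iter()
        .filter(|(_, status)| *status == EdgeStatus::Open)
        .count()
}
```

For example, `slots_after_resume(&[(1, EdgeStatus::Open), (2, EdgeStatus::Closed), (3, EdgeStatus::Closed)])` yields 1, so the rebuilt count does not carry over slots from agents that are no longer open.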


What version of Codex CLI is running?

codex-cli 0.121.0

What subscription do you have?

pro

Which model were you using?

No response

What platform is your computer?

No response

What terminal emulator and version are you using (if applicable)?

No response

What issue are you seeing?

Bug: Agent spawn slots leak across turns in persistent sessions (app-server / interactive CLI)

Codex version: b0324f9f0569ebfc5534fd6844971d9ae029c791

Reproduction

1. Start a persistent session (app-server or interactive CLI).
2. In Turn 1, have the model spawn several agents (e.g. 6+). Let some of them complete without the model calling close_agent on each one (this happens naturally when the model runs out of tool-call budget, the turn times out, or the context window fills up before cleanup).
3. In Turn 2, attempt to spawn_agent. It fails with AgentLimitReached { max_threads: N } even though no agents are running.

Observed in production: Over 9 rounds in an app-server session, Round 1 spawned 24 agents but only closed 14. By Round 4, spawn_agent was permanently blocked (spawned=0, close=0) for the remainder of the session.

Round 1: spawned=24, close=14, live=7  → 10 slots leaked
Round 2: spawned=6,  close=4,  live=4  →  2 more leaked
Round 3: spawned=2,  close=2,  live=2  →  0 new leaked (but cumulative damage done)
Round 4–9: spawned=0, close=0          →  permanently stuck
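
The per-round leak figures in the log are simply spawned minus closed, accumulated across rounds. The small helper below (hypothetical, written only to check the arithmetic) makes that explicit using the numbers from rounds 1 to 3.

```rust
/// (spawned, closed) per round, copied from the log above.
const ROUNDS: [(u32, u32); 3] = [(24, 14), (6, 4), (2, 2)];

/// Running total of slots that were reserved but never released.
fn cumulative_leaks(rounds: &[(u32, u32)]) -> Vec<u32> {
    let mut total = 0;
    rounds
        .iter()
        .map(|(spawned, closed)| {
            total += spawned - closed;
            total
        })
        .collect()
}
```

`cumulative_leaks(&ROUNDS)` gives `[10, 12, 12]`: ten slots leaked in round 1, two more in round 2, and none in round 3, matching the annotations in the log.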


Why the model doesn't always `close_agent`

The model is expected to call close_agent after wait_agent, but in practice this doesn't always happen:

- The model may exhaust its tool-call budget before closing all agents
- The turn may time out mid-execution
- Context window compaction may drop the close_agent intent
- With many concurrent agents (12+), the model may not track all IDs

This is especially problematic in automated/agentic workflows where turns are driven programmatically and agent counts are high.

Proposed Fix

Release spawn slots for agents in terminal state before enforcing the limit.

Modify AgentRegistry::reserve_spawn_slot() to reap finalized agents before checking the count:

pub(crate) fn reserve_spawn_slot(
    self: &Arc<Self>,
    max_threads: Option<usize>,
) -> Result<SpawnReservation> {
    if let Some(max_threads) = max_threads {
        if !self.try_increment_spawned(max_threads) {
            // Before failing, try to reclaim slots from finalized agents
            self.reap_finalized_agents();  // new method
            if !self.try_increment_spawned(max_threads) {
                return Err(CodexErr::AgentLimitReached { max_threads });
            }
        }
    } else {
        self.total_count.fetch_add(1, Ordering::AcqRel);
    }
    Ok(SpawnReservation { ... })
}
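
The issue leaves `reap_finalized_agents()` unimplemented. One possible shape is sketched below on a toy registry. It assumes that `agent_tree` can be iterated with each agent's status available under the same lock, which may not hold in the real code; `ToyRegistry`, `AgentStatus`, and `reap_demo` are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

#[derive(Clone, Copy, PartialEq, Eq)]
enum AgentStatus {
    Running,
    Completed,
}

struct ToyRegistry {
    total_count: AtomicUsize,
    /// Hypothetical: thread id mapped straight to its last known status.
    agent_tree: Mutex<HashMap<u64, AgentStatus>>,
}

impl ToyRegistry {
    fn try_increment_spawned(&self, max_threads: usize) -> bool {
        self.total_count
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < max_threads { Some(n + 1) } else { None }
            })
            .is_ok()
    }

    /// Hypothetical: drop every agent that is no longer running and give
    /// its slot back. Returns the number of slots reclaimed.
    fn reap_finalized_agents(&self) -> usize {
        let mut tree = self.agent_tree.lock().unwrap();
        let before = tree.len();
        tree.retain(|_, status| *status == AgentStatus::Running);
        let reaped = before - tree.len();
        self.total_count.fetch_sub(reaped, Ordering::AcqRel);
        reaped
    }

    /// Same control flow as the proposed fix: reap lazily, then retry once.
    fn reserve_spawn_slot(&self, max_threads: usize) -> Result<(), String> {
        if !self.try_increment_spawned(max_threads) {
            self.reap_finalized_agents();
            if !self.try_increment_spawned(max_threads) {
                return Err(format!("AgentLimitReached {{ max_threads: {max_threads} }}"));
            }
        }
        Ok(())
    }
}

/// Limit of 2 with both slots held by completed agents.
/// Returns (whether the reservation succeeded, count afterwards).
fn reap_demo() -> (bool, usize) {
    let registry = ToyRegistry {
        total_count: AtomicUsize::new(2),
        agent_tree: Mutex::new(HashMap::from([
            (1, AgentStatus::Completed),
            (2, AgentStatus::Completed),
        ])),
    };
    let ok = registry.reserve_spawn_slot(2).is_ok();
    (ok, registry.total_count.load(Ordering::Acquire))
}
```

In this toy run the reservation succeeds and the count ends at 1. Because reaping happens only when the first increment fails, `total_count` can still overstate the number of live agents between spawns.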

Alternative approaches (non-exclusive):

1. Reap in reserve_spawn_slot (above): minimal change, lazy cleanup only when needed.
2. Auto-release on turn completion: add a turn-boundary hook that calls release_spawned_thread for all agents in agent_tree whose status is final (Completed, Shutdown, Errored, NotFound).
3. Auto-release in maybe_start_completion_watcher: when the watcher observes a final status, also call release_spawned_thread (not just notify the parent).

Option 2 or 3 would keep total_count accurate at all times rather than only at spawn time.

Impact

Severity: High for persistent sessions (app-server, interactive REPL)
Scope: Any multi-turn workflow that spawns agents across turns without perfectly closing each one
Workaround: None available to API consumers — the AgentRegistry is internal and not exposed. The only current workaround is to destroy and recreate the entire session, losing all conversation context.

Files involved

| File | Role |
| --- | --- |
| `codex-rs/core/src/agent/registry.rs` | `AgentRegistry`, `total_count`, `reserve_spawn_slot`, `release_spawned_thread` |
| `codex-rs/core/src/agent/control.rs` | `shutdown_live_agent`, `close_agent`, `maybe_start_completion_watcher` |
| `codex-rs/core/src/agent/status.rs` | `is_final()`, which defines terminal agent states |
| `codex-rs/core/src/tools/handlers/multi_agents/wait.rs` | v1 `wait_agent`, does not release slots |
| `codex-rs/core/src/tools/handlers/multi_agents/close_agent.rs` | v1 `close_agent`, the only user-facing slot release path |
| `codex-rs/core/src/tools/handlers/multi_agents_v2/wait.rs` | v2 `wait_agent`, does not release slots |
| `codex-rs/core/src/tools/handlers/multi_agents_v2/close_agent.rs` | v2 `close_agent`, same as v1 |
| `codex-rs/protocol/src/error.rs` | `CodexErr::AgentLimitReached` definition |
| `codex-rs/core/src/config/mod.rs` | `DEFAULT_AGENT_MAX_THREADS = Some(6)` |

What steps can reproduce the bug?

What is the expected behavior?

No response

Additional information

No response

extent analysis

TL;DR

The proposed fix involves modifying AgentRegistry::reserve_spawn_slot() to reap finalized agents before checking the count, preventing agent spawn slots from leaking across turns in persistent sessions.

Guidance

1. Identify the root cause of the issue: the total_count in AgentRegistry is not decremented when an agent reaches a terminal state without an explicit close_agent call.
2. Modify AgentRegistry::reserve_spawn_slot() to call a new reap_finalized_agents() method before checking the count, as shown in the proposed fix.
3. Consider alternative approaches, such as auto-releasing on turn completion or in maybe_start_completion_watcher, to keep total_count accurate at all times.
4. Review the files involved, including registry.rs, control.rs, and status.rs, to ensure a thorough understanding of the changes required.

Notes

The proposed fix assumes that the reap_finalized_agents() method will be implemented to correctly release spawn slots for agents in terminal states. Additionally, the alternative approaches mentioned may have different trade-offs and requirements, and should be carefully evaluated before implementation.

Recommendation

Apply the proposed fix by modifying AgentRegistry::reserve_spawn_slot() to reap finalized agents before checking the count, as this approach is minimal and targeted at the root cause of the issue.
