codex - 💡(How to fix) Fix Bug: Agent spawn slots leak across turns in persistent sessions (app-server / interactive CLI) [3 comments, 3 participants]

GitHub stats
openai/codex#18335 · Fetched 2026-04-18 05:55:45
Comments: 3
Participants: 3
Timeline: 9
Reactions: 0
Timeline (top): labeled ×4, commented ×3, unlabeled ×2

In persistent session modes (app-server transport, interactive CLI REPL), the AgentRegistry.total_count used to enforce agent_max_threads is only decremented when close_agent is explicitly called. If a spawned agent reaches a terminal state (Completed, Shutdown, etc.) but the model does not call close_agent before the turn ends, the spawn slot is permanently leaked. Over multiple turns, leaked slots accumulate until total_count >= max_threads, after which all spawn_agent calls fail with AgentLimitReached — even though zero agents are actually running.

This does not affect codex exec because each invocation starts a fresh process with a new AgentRegistry.

Error Message

spawn_agent fails with CodexErr::AgentLimitReached { max_threads: N }, which is defined in codex-rs/protocol/src/error.rs.

Root Cause

1. total_count increment path

AgentRegistry::reserve_spawn_slot() increments total_count via try_increment_spawned():

// codex-rs/core/src/agent/registry.rs:80-97
pub(crate) fn reserve_spawn_slot(
    self: &Arc<Self>,
    max_threads: Option<usize>,
) -> Result<SpawnReservation> {
    if let Some(max_threads) = max_threads {
        if !self.try_increment_spawned(max_threads) {
            return Err(CodexErr::AgentLimitReached { max_threads });
        }
    } else {
        self.total_count.fetch_add(1, Ordering::AcqRel);
    }
    Ok(SpawnReservation { state: Arc::clone(self), active: true, ... })
}
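
The issue does not show the body of try_increment_spawned(). As a rough sketch only, a capped increment of this kind is commonly written as an atomic read-modify-write that refuses to go past the limit. The SlotCounter type and its count() helper below are hypothetical stand-ins, not the actual Codex code:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the counter held by `AgentRegistry`.
pub struct SlotCounter {
    total_count: AtomicUsize,
}

impl SlotCounter {
    pub fn new() -> Self {
        Self { total_count: AtomicUsize::new(0) }
    }

    /// Increment only while the count is below `max_threads`.
    /// Returns false and leaves the counter untouched once the cap is reached.
    pub fn try_increment_spawned(&self, max_threads: usize) -> bool {
        self.total_count
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
                if current < max_threads {
                    Some(current + 1)
                } else {
                    None
                }
            })
            .is_ok()
    }

    /// Current number of reserved slots.
    pub fn count(&self) -> usize {
        self.total_count.load(Ordering::Acquire)
    }
}
```

With a cap of 2, the first two calls succeed and the third returns false. Nothing in this path ever lowers the counter, which is why the decrement path matters.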

2. total_count decrement path — only via explicit shutdown

release_spawned_thread() is the only path that decrements total_count:

// codex-rs/core/src/agent/registry.rs:99-119
pub(crate) fn release_spawned_thread(&self, thread_id: ThreadId) {
    let removed_counted_agent = { /* remove from agent_tree */ };
    if removed_counted_agent {
        self.total_count.fetch_sub(1, Ordering::AcqRel);
    }
}

This is called from exactly two places:

  • AgentControl::shutdown_live_agent() (control.rs:676) — called by close_agent tool handler
  • AgentControl::handle_thread_request_result() (control.rs:655) — only on InternalAgentDied
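
The effect of having only these release paths can be reproduced with a small model. The sketch below is a simplified, single-threaded illustration; LeakyRegistry, Status, and the method names are hypothetical, and only the counting behavior described above is modeled:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Running,
    Completed,
}

#[derive(Debug, PartialEq, Eq)]
pub struct AgentLimitReached {
    pub max_threads: usize,
}

/// Toy registry: slots are reserved on spawn and returned only on close.
#[derive(Default)]
pub struct LeakyRegistry {
    agents: HashMap<u32, Status>,
    total_count: usize,
    next_id: u32,
}

impl LeakyRegistry {
    pub fn spawn(&mut self, max_threads: usize) -> Result<u32, AgentLimitReached> {
        if self.total_count >= max_threads {
            return Err(AgentLimitReached { max_threads });
        }
        self.total_count += 1;
        let id = self.next_id;
        self.next_id += 1;
        self.agents.insert(id, Status::Running);
        Ok(id)
    }

    /// The agent finishes on its own: the status changes, the slot stays reserved.
    pub fn complete(&mut self, id: u32) {
        if let Some(status) = self.agents.get_mut(&id) {
            *status = Status::Completed;
        }
    }

    /// Explicit close: the only path in this model that returns the slot.
    pub fn close(&mut self, id: u32) {
        if self.agents.remove(&id).is_some() {
            self.total_count -= 1;
        }
    }

    pub fn running(&self) -> usize {
        self.agents.values().filter(|s| **s == Status::Running).count()
    }

    pub fn total_count(&self) -> usize {
        self.total_count
    }
}
```

Spawning three agents under a cap of 3, letting all of them complete, and then spawning again fails with AgentLimitReached while running() reports 0. Closing any one of them makes the next spawn succeed.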

3. wait_agent does NOT release slots

Both v1 (multi_agents/wait.rs) and v2 (multi_agents_v2/wait.rs) wait_agent handlers only observe status — they never call release_spawned_thread or shutdown_live_agent.

4. Turn completion does NOT release slots

There is no turn-boundary cleanup hook. When a turn completes, the AgentRegistry state (including total_count and agent_tree) is carried forward unchanged into the next turn.

5. maybe_start_completion_watcher does NOT release slots

The completion watcher (control.rs:898-971) watches for agents reaching final status and notifies the parent thread, but it does not call release_spawned_thread.

6. Why codex exec is immune

Each codex exec invocation starts a fresh OS process and therefore a fresh AgentRegistry, so total_count starts at 0. Even with codex exec resume <thread_id>, the AgentRegistry is reconstructed from scratch (only agents with Open edge status are re-reserved via resume_agent_from_rollout).

Fix / Workaround

  • Severity: High for persistent sessions (app-server, interactive REPL)
  • Scope: Any multi-turn workflow that spawns agents across turns without closing every one of them
  • Workaround: None that preserves the session. The AgentRegistry is internal and not exposed to API consumers, so the only option today is to destroy and recreate the entire session, losing all conversation context.


What version of Codex CLI is running?

codex-cli 0.121.0

What subscription do you have?

pro

What issue are you seeing?

Bug: Agent spawn slots leak across turns in persistent sessions (app-server / interactive CLI)

Codex version: b0324f9f0569ebfc5534fd6844971d9ae029c791

Reproduction

  1. Start a persistent session (app-server or interactive CLI).
  2. In Turn 1, have the model spawn several agents (e.g. 6+). Let some of them complete without the model calling close_agent on each one (this happens naturally when the model runs out of tool-call budget, the turn times out, or the context window fills up before cleanup).
  3. In Turn 2, attempt to spawn_agent — it fails with AgentLimitReached { max_threads: N } even though no agents are running.

Observed in production: Over 9 rounds in an app-server session, Round 1 spawned 24 agents but only closed 14. By Round 4, spawn_agent was permanently blocked (spawned=0, close=0) for the remainder of the session.

Round 1: spawned=24, close=14, live=7  → 10 slots leaked
Round 2: spawned=6,  close=4,  live=4  →  2 more leaked
Round 3: spawned=2,  close=2,  live=2  →  0 new leaked (but cumulative damage done)
Round 4–9: spawned=0, close=0          →  permanently stuck
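
The per-round leak figures in the table are consistent with leaked = spawned - closed. Under that assumption, the running total can be tallied as follows. The ROUNDS constant and cumulative_leaks function are illustrative names, and the session's actual max_threads value is not given in the issue, so it is not modeled here:

```rust
/// (round label, spawned, closed) as reported in the table above.
pub const ROUNDS: [(&str, u32, u32); 3] = [
    ("Round 1", 24, 14),
    ("Round 2", 6, 4),
    ("Round 3", 2, 2),
];

/// Running total of leaked slots after each round,
/// assuming every spawned-but-not-closed agent leaks one slot.
pub fn cumulative_leaks(rounds: &[(&str, u32, u32)]) -> Vec<u32> {
    let mut leaked = 0u32;
    rounds
        .iter()
        .map(|&(_, spawned, closed)| {
            leaked += spawned.saturating_sub(closed);
            leaked
        })
        .collect()
}
```

For the three rounds listed, the running totals are 10, 12, and 12. Round 3 leaks nothing new, but the 12 slots lost earlier are never returned.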

Why the model doesn't always close_agent

The model is expected to call close_agent after wait_agent, but in practice this doesn't always happen:

  • The model may exhaust its tool-call budget before closing all agents
  • The turn may time out mid-execution
  • Context window compaction may drop the close_agent intent
  • With many concurrent agents (12+), the model may not track all IDs

This is especially problematic in automated/agentic workflows where turns are driven programmatically and agent counts are high.

Proposed Fix

Release spawn slots for agents in terminal state before enforcing the limit.

Modify AgentRegistry::reserve_spawn_slot() to reap finalized agents before checking the count:

pub(crate) fn reserve_spawn_slot(
    self: &Arc<Self>,
    max_threads: Option<usize>,
) -> Result<SpawnReservation> {
    if let Some(max_threads) = max_threads {
        if !self.try_increment_spawned(max_threads) {
            // Before failing, try to reclaim slots from finalized agents
            self.reap_finalized_agents();  // new method
            if !self.try_increment_spawned(max_threads) {
                return Err(CodexErr::AgentLimitReached { max_threads });
            }
        }
    } else {
        self.total_count.fetch_add(1, Ordering::AcqRel);
    }
    Ok(SpawnReservation { ... })
}
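
The issue names reap_finalized_agents() as a new method but does not define it. One possible shape, sketched against a toy registry, is shown below. ReapingRegistry, the u64 agent IDs, the Mutex<HashMap> layout, and the AgentStatus variants are assumptions for illustration; the real agent_tree and status types may look quite different:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical status enum; the final states follow the list in this issue.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AgentStatus {
    Running,
    Completed,
    Shutdown,
    Errored,
    NotFound,
}

impl AgentStatus {
    pub fn is_final(self) -> bool {
        !matches!(self, AgentStatus::Running)
    }
}

/// Toy registry holding only what the reaper needs.
pub struct ReapingRegistry {
    agent_tree: Mutex<HashMap<u64, AgentStatus>>,
    total_count: AtomicUsize,
}

impl ReapingRegistry {
    pub fn with_agents(agents: &[(u64, AgentStatus)]) -> Self {
        let tree: HashMap<u64, AgentStatus> = agents.iter().copied().collect();
        let count = tree.len();
        Self {
            agent_tree: Mutex::new(tree),
            total_count: AtomicUsize::new(count),
        }
    }

    /// Remove every finalized agent and return its slot.
    /// Returns the number of slots reclaimed.
    pub fn reap_finalized_agents(&self) -> usize {
        let mut tree = self.agent_tree.lock().unwrap();
        let before = tree.len();
        tree.retain(|_, status| !status.is_final());
        let reaped = before - tree.len();
        self.total_count.fetch_sub(reaped, Ordering::AcqRel);
        reaped
    }

    pub fn total_count(&self) -> usize {
        self.total_count.load(Ordering::Acquire)
    }
}
```

In this sketch the counter is adjusted while the map lock is held, so the count and the tree cannot drift apart between the removal and the decrement. Whether the real registry can hold its lock across both steps is something to check against the actual code.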

Alternative approaches (non-exclusive):

  1. Reap in reserve_spawn_slot (above) — minimal change, lazy cleanup only when needed.
  2. Auto-release on turn completion — add a turn-boundary hook that calls release_spawned_thread for all agents in agent_tree whose status is final (Completed, Shutdown, Errored, NotFound).
  3. Auto-release in maybe_start_completion_watcher — when the watcher observes a final status, also call release_spawned_thread (not just notify the parent).

Option 2 or 3 would keep total_count accurate at all times rather than only at spawn time.
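
If option 2 or 3 is adopted, the same agent may be released twice: once automatically when it reaches a final status and once more if the model later calls close_agent. The release_spawned_thread() excerpt already guards the decrement with removed_counted_agent, and the sketch below models why that guard matters. CountedRegistry and on_final_status are hypothetical names, not existing Codex APIs:

```rust
use std::collections::HashSet;

/// Toy registry that tracks which agent IDs currently hold a slot.
#[derive(Default)]
pub struct CountedRegistry {
    counted: HashSet<u64>,
    total_count: usize,
}

impl CountedRegistry {
    pub fn reserve(&mut self, id: u64) {
        if self.counted.insert(id) {
            self.total_count += 1;
        }
    }

    /// Decrement only if the agent was still counted, so repeated
    /// releases of the same ID are harmless. Returns whether a slot was freed.
    pub fn release(&mut self, id: u64) -> bool {
        let removed = self.counted.remove(&id);
        if removed {
            self.total_count -= 1;
        }
        removed
    }

    pub fn total_count(&self) -> usize {
        self.total_count
    }
}

/// Hypothetical watcher callback: release the slot as soon as a final
/// status is observed, in addition to notifying the parent.
pub fn on_final_status(registry: &mut CountedRegistry, id: u64) {
    registry.release(id);
}
```

After on_final_status frees the slot, a later explicit release of the same ID returns false and leaves the count at 0, so automatic cleanup and close_agent can coexist without underflow.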

Files involved

| File | Role |
| --- | --- |
| codex-rs/core/src/agent/registry.rs | AgentRegistry, total_count, reserve_spawn_slot, release_spawned_thread |
| codex-rs/core/src/agent/control.rs | shutdown_live_agent, close_agent, maybe_start_completion_watcher |
| codex-rs/core/src/agent/status.rs | is_final(), which defines terminal agent states |
| codex-rs/core/src/tools/handlers/multi_agents/wait.rs | v1 wait_agent (does not release slots) |
| codex-rs/core/src/tools/handlers/multi_agents/close_agent.rs | v1 close_agent (the only user-facing slot release path) |
| codex-rs/core/src/tools/handlers/multi_agents_v2/wait.rs | v2 wait_agent (does not release slots) |
| codex-rs/core/src/tools/handlers/multi_agents_v2/close_agent.rs | v2 close_agent (same as v1) |
| codex-rs/protocol/src/error.rs | CodexErr::AgentLimitReached definition |
| codex-rs/core/src/config/mod.rs | DEFAULT_AGENT_MAX_THREADS = Some(6) |

extent analysis

TL;DR

The proposed fix involves modifying AgentRegistry::reserve_spawn_slot() to reap finalized agents before checking the count, preventing agent spawn slots from leaking across turns in persistent sessions.

Guidance

  • Identify the root cause of the issue: the total_count in AgentRegistry is not decremented when an agent reaches a terminal state without an explicit close_agent call.
  • Modify AgentRegistry::reserve_spawn_slot() to call a new reap_finalized_agents() method before checking the count, as shown in the proposed fix.
  • Consider alternative approaches, such as auto-releasing on turn completion or in maybe_start_completion_watcher, to keep total_count accurate at all times.
  • Review the files involved, including registry.rs, control.rs, and status.rs, to ensure a thorough understanding of the changes required.

Notes

The proposed fix depends on reap_finalized_agents(), a method that does not exist yet and must be implemented so that it releases the spawn slot of every agent in a terminal state. The alternative approaches carry different trade-offs: reaping inside reserve_spawn_slot() corrects total_count only when the limit is hit, while the turn-boundary and completion-watcher options keep it accurate at all times. Evaluate them before choosing one.

Recommendation

Apply the proposed fix by modifying AgentRegistry::reserve_spawn_slot() to reap finalized agents before checking the count, as this approach is minimal and targeted at the root cause of the issue.
