codex - 💡(How to fix) Fix First-turn API call blocks until pending MCP tools/list times out [2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openai/codex#19556Fetched 2026-04-26 05:15:08
View on GitHub
Comments
2
Participants
2
Timeline
8
Reactions
0
Timeline (top)
labeled ×4commented ×2cross-referenced ×1unlabeled ×1

Root Cause

Root cause is in codex-rs/codex-mcp/src/mcp_connection_manager.rs:

Fix Action

Fix / Workaround

The first turn dispatches the API call without waiting on tools/list from every server. A slow MCP's tools become available on the next turn once its startup completes or times out.

L547  task_started   2026-04-25T11:37:53.449Z
L548  user msg       2026-04-25T11:38:04.592Z   (+11.1s)
L549  turn_context   2026-04-25T11:38:04.593Z
L550  user prompt    2026-04-25T11:38:20.234Z   (+15.6s)
L551  user_message event
L552  token_count    2026-04-25T11:38:20.895Z   (API call dispatched, +27.4s)
L553  agent_message  2026-04-25T11:38:25.580Z
  • AsyncManagedClient::listed_tools() (around L630) awaits self.client() for any server without a startup_snapshot, blocking on the full per-server startup future.
  • McpConnectionManager::list_all_tools() (L925) iterates every client and awaits each in turn, so one stalled server gates the whole map.
  • Session::run_turn calls list_all_tools() (codex-rs/core/src/session/turn.rs:190) before the API call.
  • The existing test list_all_tools_blocks_while_client_is_pending_without_startup_snapshot (mcp_connection_manager_tests.rs:687) asserts the blocking behavior, so the fix is a contract change to that test rather than a one-line patch.

Code Example

import json, sys
for raw in sys.stdin:
    line = raw.strip()
    if not line:
        continue
    req = json.loads(line)
    method = req.get("method")
    if method == "initialize":
        print(json.dumps({
            "jsonrpc": "2.0",
            "id": req["id"],
            "result": {
                "protocolVersion": "2024-11-05",
                "capabilities": {"tools": {"listChanged": False}},
                "serverInfo": {"name": "hang", "version": "0.1.0"},
            },
        }), flush=True)
    # tools/list and everything else: no response

---

codex resume <session-id> \
  -c 'mcp_servers.hang.command="python3"' \
  -c 'mcp_servers.hang.args=["/abs/path/tools_list_hang_mcp_server.py"]'
> Reply with OK.

---

L547  task_started   2026-04-25T11:37:53.449Z
L548  user msg       2026-04-25T11:38:04.592Z   (+11.1s)
L549  turn_context   2026-04-25T11:38:04.593Z
L550  user prompt    2026-04-25T11:38:20.234Z   (+15.6s)
L551  user_message event
L552  token_count    2026-04-25T11:38:20.895Z   (API call dispatched, +27.4s)
L553  agent_message  2026-04-25T11:38:25.580Z

---

MCP client for `hang` failed to start: MCP startup failed:
  timed out awaiting tools/list after 30s
RAW_BUFFERClick to expand / collapse

What version of Codex CLI is running?

codex-cli 0.124.0 (also reproduced on 0.123.0; symptoms reported as far back as 0.114.0 in #14470, #14627)

What subscription do you have?

Pro

Which model were you using?

gpt-5.5 with reasoning xhigh

What platform is your computer?

macOS 26.5, arm64

What terminal emulator and version are you using (if applicable)?

Terminal CLI session

What issue are you seeing?

When any configured MCP server stalls on tools/list (or completes initialize slowly), the first turn after launch or session resume sits idle for the full startup_timeout_sec (default 30s) before any model API call goes out. The TUI shows Working (Xs) for the whole window without saying what it is waiting on. Ctrl+C during the wait aborts the turn with no assistant output.

In the original session that triggered this report (codex-cli 0.123.0, 4 newly added MCPs in ~/.codex/config.toml), the wait stretched to ~340s and ~470s on consecutive retries before I gave up and closed the session. The 0.124.0 wait is bounded by the 30s timeout but the same code path stalls every first turn.

What steps can reproduce the bug?

Save this as tools_list_hang_mcp_server.py (zero deps, stdlib only):

import json, sys
for raw in sys.stdin:
    line = raw.strip()
    if not line:
        continue
    req = json.loads(line)
    method = req.get("method")
    if method == "initialize":
        print(json.dumps({
            "jsonrpc": "2.0",
            "id": req["id"],
            "result": {
                "protocolVersion": "2024-11-05",
                "capabilities": {"tools": {"listChanged": False}},
                "serverInfo": {"name": "hang", "version": "0.1.0"},
            },
        }), flush=True)
    # tools/list and everything else: no response

Then resume any prior session with this server attached and submit a prompt:

codex resume <session-id> \
  -c 'mcp_servers.hang.command="python3"' \
  -c 'mcp_servers.hang.args=["/abs/path/tools_list_hang_mcp_server.py"]'
> Reply with OK.

A fresh session (codex with the same -c overrides) reproduces it too.

What is the expected behavior?

The first turn dispatches the API call without waiting on tools/list from every server. A slow MCP's tools become available on the next turn once its startup completes or times out.

Additional information

Rollout for the resumed turn shows the gap concretely:

L547  task_started   2026-04-25T11:37:53.449Z
L548  user msg       2026-04-25T11:38:04.592Z   (+11.1s)
L549  turn_context   2026-04-25T11:38:04.593Z
L550  user prompt    2026-04-25T11:38:20.234Z   (+15.6s)
L551  user_message event
L552  token_count    2026-04-25T11:38:20.895Z   (API call dispatched, +27.4s)
L553  agent_message  2026-04-25T11:38:25.580Z

A normal turn on the same session without the hanging MCP takes task_started to token_count under 700ms.

After the timeout fires the TUI shows:

⚠ MCP client for `hang` failed to start: MCP startup failed:
  timed out awaiting tools/list after 30s

That message arrives 30 seconds after the user submitted, with Working shown the whole time.

Root cause is in codex-rs/codex-mcp/src/mcp_connection_manager.rs:

  • AsyncManagedClient::listed_tools() (around L630) awaits self.client() for any server without a startup_snapshot, blocking on the full per-server startup future.
  • McpConnectionManager::list_all_tools() (L925) iterates every client and awaits each in turn, so one stalled server gates the whole map.
  • Session::run_turn calls list_all_tools() (codex-rs/core/src/session/turn.rs:190) before the API call.
  • The existing test list_all_tools_blocks_while_client_is_pending_without_startup_snapshot (mcp_connection_manager_tests.rs:687) asserts the blocking behavior, so the fix is a contract change to that test rather than a one-line patch.

The codex_apps server already has a startup_snapshot cache via load_startup_cached_codex_apps_tools_snapshot, so it is not affected. User-defined MCPs always go through the blocking path on first turn.

Possibly the same root cause behind older reports that lacked a deterministic repro: #14470, #14627, #11015. Each describes a silent first-turn hang with MCP servers active but never produced a reliable trigger.

I have a working patch (skip clients with startup_complete == false in listed_tools(), fall back to existing startup_snapshot) plus updated tests. Happy to open a PR if invited.

extent analysis

TL;DR

The issue can be fixed by modifying the listed_tools() function in mcp_connection_manager.rs to skip clients with startup_complete == false and fall back to the existing startup_snapshot.

Guidance

  • The root cause of the issue is the blocking behavior of AsyncManagedClient::listed_tools() and McpConnectionManager::list_all_tools() when a server is stalled on tools/list.
  • To fix the issue, modify the listed_tools() function to skip clients with startup_complete == false and use the existing startup_snapshot instead.
  • Update the test list_all_tools_blocks_while_client_is_pending_without_startup_snapshot to reflect the new behavior.
  • Verify the fix by running the test and checking that the first turn no longer hangs when a server is stalled on tools/list.

Example

No code snippet is provided as the fix involves modifying existing code and adding new tests.

Notes

The fix involves changing the contract of the listed_tools() function and updating the corresponding test. This change may have implications for other parts of the codebase that rely on the current behavior.

Recommendation

Apply the workaround by modifying the listed_tools() function and updating the test, as the issue is caused by a specific implementation detail and not a version-specific bug.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

codex - 💡(How to fix) Fix First-turn API call blocks until pending MCP tools/list times out [2 comments, 2 participants]