claude-code - 💡(How to fix) Fix Long-running synchronous MCP tool calls (>~10–12 min) trigger 'session stopped responding' UI watchdog and orphan MCP server processes

StepCodex · 2026-05-12T19:33:35Z

[claude-code] When a synchronous MCP tool call blocks Claude Code's worker process for roughly 10–15 minutes, the UI surfaces: × Something went wrong Try sendi… When a synchronous MCP tool call blocks Claude Code's worker process for roughly 10–15 minutes, the UI surfaces: > **× Something went wrong** > Try sending your message again. If it keeps happening, share feedback so we can investigate. > **The session stopped responding. Send your message again to resume with a fresh process.** > [Try again] This appears to be a UI health-check declaring the worker dead, NOT an MCP-level timeout, NOT an Anthropic API timeout (the `API_TIMEOUT_MS=900000` env var is for the Anthropic API only), and NOT user cancellation. The triggering MCP server process is left running indefinitely — orphaned — and repeat incidents accumulate stuck processes. The failure mode is reproducible whenever an MCP server legitimately needs more wall time than the watchdog tolerates, which is realistic for reasoning-heavy servers like Codex with high reasoning effort. ## Fix / Workaround ## Workaround (validated) ## Summary When a synchronous MCP tool call blocks Claude Code's worker process for roughly 10–15 minutes, the UI surfaces: > **× Something went wrong** > Try sending your message again. If it keeps happening, share feedback so we can investigate. > **The session stopped responding. Send your message again to resume with a fresh process.** > [Try again] This appears to be a UI health-check declaring the worker dead, NOT an MCP-level timeout, NOT an Anthropic API timeout (the `API_TIMEOUT_MS=900000` env var is for the Anthropic API only), and NOT user cancellation. The triggering MCP server process is left running indefinitely — orphaned — and repeat incidents accumulate stuck processes. The failure mode is reproducible whenever an MCP server legitimately needs more wall time than the watchdog tolerates, which is realistic for reasoning-heavy servers like Codex with high reasoning effort. ## Reproduce Environment used for the investigation: - Claude Code 2.1.128, Windows 11, entrypoint `claude-desktop` - Model: Opus 4.7 [1M] - Codex CLI 0.125.0 installed and registered as an MCP server (`codex mcp-server`) - `~/.codex/config.toml`: `model = "gpt-5.4"`, `model_reasoning_effort = "xhigh"` Steps: 1. Invoke `mcp__codex__codex` with a prompt that requires Codex to read 4–6 files via its sandbox and produce a structured review (e.g., adversarially review a ~500-line plan document against a design doc, cross-checking 3–4 other files for consistency). 2. Wait. Codex's own session log (`~/.codex/sessions/ /rollout-...jsonl`) shows continuous productive work — `agent_message` progress notes, `function_call_output` from sandbox reads, multiple turns each with `token_count` events. 3. At around the 10–13 minute mark, Codex enters a long internal API call (between its log events) and stops emitting fresh log lines. 4. Some time after that, Claude Code's UI displays the watchdog error above. The MCP `tool_use` never receives a `tool_result`. 5. The user's next message becomes a `last-prompt` entry in the transcript, treated by Claude as a new turn. The original tool result is permanently lost. 6. `codex.exe` continues running. Memory stays loaded (50–90 MB working set). CPU is near zero (process is waiting on its API call). Repeat incidents accumulate — we observed 16+ orphaned `codex.exe` after several days of these crashes. We reproduced the failure twice in the same Claude session today, plus once on a separate project, all with comparable Codex review workloads. ## Evidence (from our investigation) - **Claude transcript**: the MCP `tool_use` for `mcp__codex__codex` has no matching `tool_result`. Total `tool_use` count is 1 higher than `tool_result` count. No `is_error: true` tied to the tool_use. - **Codex session log**: ends without a `task_complete` event. The last entries are typically `response_item` events with `type: reasoning` and large encrypted_content payloads — Codex is mid-reasoning, not done. - **`API_TIMEOUT_MS=900000`** (15 min) is set in the runtime env, but does not apply: in one case the hung call's Codex log went silent at 10:18 of wall time, and the watchdog UI appeared shortly after — under the 15-min API timeout. In another case, 12:27 of wall time. Both well below 15 min. - **Process state**: `Get-Process codex` confirms the MCP server is alive and idle (CPU not growing), holding state in memory, with no signal that Claude Code told it to exit. - **No MCP-level cancellation or timeout event** is recorded anywhere we can see. ## Expected behavior (any of these would help) 1. **Increase the watchdog tolerance for MCP tool calls.** ~10–15 min is below what reasoning-heavy MCP servers realistically need on complex tasks. Even doubling it would meaningfully reduce the failure rate for high-reasoning-effort Codex use. 2. **Provide a heartbeat mechanism over MCP stdio** that an MCP server can use to signal "still working"

When a synchronous MCP tool call blocks Claude Code's worker process for roughly 10–15 minutes, the UI surfaces:

× Something went wrong Try sending your message again. If it keeps happening, share feedback so we can investigate. The session stopped responding. Send your message again to resume with a fresh process. [Try again]

This appears to be a UI health-check declaring the worker dead, NOT an MCP-level timeout, NOT an Anthropic API timeout (the API_TIMEOUT_MS=900000 env var is for the Anthropic API only), and NOT user cancellation. The triggering MCP server process is left running indefinitely — orphaned — and repeat incidents accumulate stuck processes.

The failure mode is reproducible whenever an MCP server legitimately needs more wall time than the watchdog tolerates, which is realistic for reasoning-heavy servers like Codex with high reasoning effort.

Error Message

Some time after that, Claude Code's UI displays the watchdog error above. The MCP tool_use never receives a tool_result.
Improve the error text and recovery. Current text suggests retry, but retry hits the same issue immediately. Surfacing "long-running MCP tool exceeded N seconds; the tool server has been killed and orphaned" would help users (and plugin authors) diagnose the actual cause.

Screenshots of the UI error text

Root Cause

Codex with model_reasoning_effort = "xhigh" is a power-user configuration that gives substantially deeper code review. Any plugin that wants to use Codex as a peer reviewer on complex prompts will hit this ceiling under that config. We expect the same shape of failure for any reasoning-heavy MCP server (research agents, large-context analyzers, etc.).

Summary

When a synchronous MCP tool call blocks Claude Code's worker process for roughly 10–15 minutes, the UI surfaces:

× Something went wrong Try sending your message again. If it keeps happening, share feedback so we can investigate. The session stopped responding. Send your message again to resume with a fresh process. [Try again]

Reproduce

Environment used for the investigation:

Claude Code 2.1.128, Windows 11, entrypoint claude-desktop
Model: Opus 4.7 [1M]
Codex CLI 0.125.0 installed and registered as an MCP server (codex mcp-server)
~/.codex/config.toml: model = "gpt-5.4", model_reasoning_effort = "xhigh"

Steps:

Invoke mcp__codex__codex with a prompt that requires Codex to read 4–6 files via its sandbox and produce a structured review (e.g., adversarially review a ~500-line plan document against a design doc, cross-checking 3–4 other files for consistency).
Wait. Codex's own session log (~/.codex/sessions/<date>/rollout-...jsonl) shows continuous productive work — agent_message progress notes, function_call_output from sandbox reads, multiple turns each with token_count events.
At around the 10–13 minute mark, Codex enters a long internal API call (between its log events) and stops emitting fresh log lines.
Some time after that, Claude Code's UI displays the watchdog error above. The MCP tool_use never receives a tool_result.
The user's next message becomes a last-prompt entry in the transcript, treated by Claude as a new turn. The original tool result is permanently lost.
codex.exe continues running. Memory stays loaded (50–90 MB working set). CPU is near zero (process is waiting on its API call). Repeat incidents accumulate — we observed 16+ orphaned codex.exe after several days of these crashes.

We reproduced the failure twice in the same Claude session today, plus once on a separate project, all with comparable Codex review workloads.

Evidence (from our investigation)

Claude transcript: the MCP tool_use for mcp__codex__codex has no matching tool_result. Total tool_use count is 1 higher than tool_result count. No is_error: true tied to the tool_use.
Codex session log: ends without a task_complete event. The last entries are typically response_item events with type: reasoning and large encrypted_content payloads — Codex is mid-reasoning, not done.
API_TIMEOUT_MS=900000 (15 min) is set in the runtime env, but does not apply: in one case the hung call's Codex log went silent at 10:18 of wall time, and the watchdog UI appeared shortly after — under the 15-min API timeout. In another case, 12:27 of wall time. Both well below 15 min.
Process state: Get-Process codex confirms the MCP server is alive and idle (CPU not growing), holding state in memory, with no signal that Claude Code told it to exit.
No MCP-level cancellation or timeout event is recorded anywhere we can see.

Expected behavior (any of these would help)

Increase the watchdog tolerance for MCP tool calls. ~10–15 min is below what reasoning-heavy MCP servers realistically need on complex tasks. Even doubling it would meaningfully reduce the failure rate for high-reasoning-effort Codex use.
Provide a heartbeat mechanism over MCP stdio that an MCP server can use to signal "still working" without producing a tool_result. This would let well-behaved long-running MCP servers stay alive past the watchdog.
Clean up MCP server processes when declaring sessions dead. The orphaned-process accumulation is its own bug; even if the watchdog itself can't be changed, killing the MCP child on watchdog-fire would prevent the leak.
Improve the error text and recovery. Current text suggests retry, but retry hits the same issue immediately. Surfacing "long-running MCP tool exceeded N seconds; the tool server has been killed and orphaned" would help users (and plugin authors) diagnose the actual cause.

Workaround (validated)

For plugin authors hitting this: invoke the underlying CLI directly via Bash with run_in_background: true, output to a file, parse the JSONL event stream for progress. This sidesteps the synchronous MCP path entirely because the worker is never blocked.

We validated this with the exact same Codex review workload that previously hung MCP (multi-file review at xhigh reasoning):

Wall time: 14 minutes 44 seconds (≥ 30% past the empirical watchdog ceiling)
Process exit: clean (exit code 0)
Result file: 5 KB of structured review with 7 categorized findings
Token usage on the final turn: 635,963 input (506,880 cached, 80% hit), 22,928 output, 18,102 reasoning
No UI errors, no orphaned process

This is a real solution path but requires plugin authors to step outside the official MCP interface, which is unfortunate. A first-class fix in Claude Code would be preferable.

Why this matters

Related artifacts

If useful, we have:

Full Codex session logs (~/.codex/sessions/...) showing the productive work right up to the watchdog firing
Claude transcript showing the unmatched tool_use
PowerShell process state output showing the orphaned codex.exe processes
Screenshots of the UI error text

Happy to share these if it would help diagnose; redacted them from this issue to keep it focused.

🤖 Filed from a debugging session with Claude Code.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

claude-code - 💡(How to fix) Fix Long-running synchronous MCP tool calls (>~10–12 min) trigger 'session stopped responding' UI watchdog and orphan MCP server processes

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Workaround (validated)

Summary

Reproduce

Evidence (from our investigation)

Expected behavior (any of these would help)

Workaround (validated)

Why this matters

Related artifacts

Still need to ship something?

TRENDING

claude-code - 💡(How to fix) Fix Long-running synchronous MCP tool calls (>~10–12 min) trigger 'session stopped responding' UI watchdog and orphan MCP server processes

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Workaround (validated)

Summary

Reproduce

Evidence (from our investigation)

Expected behavior (any of these would help)

Workaround (validated)

Why this matters

Related artifacts

Still need to ship something?

RELATED_DISCOVERY

TRENDING