claude-code - 💡(How to fix) Fix Long-running synchronous MCP tool calls (>~10–12 min) trigger 'session stopped responding' UI watchdog and orphan MCP server processes

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

When a synchronous MCP tool call blocks Claude Code's worker process for roughly 10–15 minutes, the UI surfaces:

× Something went wrong Try sending your message again. If it keeps happening, share feedback so we can investigate. The session stopped responding. Send your message again to resume with a fresh process. [Try again]

This appears to be a UI health-check declaring the worker dead, NOT an MCP-level timeout, NOT an Anthropic API timeout (the API_TIMEOUT_MS=900000 env var is for the Anthropic API only), and NOT user cancellation. The triggering MCP server process is left running indefinitely — orphaned — and repeat incidents accumulate stuck processes.

The failure mode is reproducible whenever an MCP server legitimately needs more wall time than the watchdog tolerates, which is realistic for reasoning-heavy servers like Codex with high reasoning effort.

Error Message

  1. Some time after that, Claude Code's UI displays the watchdog error above. The MCP tool_use never receives a tool_result.
  2. Improve the error text and recovery. Current text suggests retry, but retry hits the same issue immediately. Surfacing "long-running MCP tool exceeded N seconds; the tool server has been killed and orphaned" would help users (and plugin authors) diagnose the actual cause.
  • Screenshots of the UI error text

Root Cause

Codex with model_reasoning_effort = "xhigh" is a power-user configuration that gives substantially deeper code review. Any plugin that wants to use Codex as a peer reviewer on complex prompts will hit this ceiling under that config. We expect the same shape of failure for any reasoning-heavy MCP server (research agents, large-context analyzers, etc.).

Fix Action

Fix / Workaround

Workaround (validated)

RAW_BUFFERClick to expand / collapse

Summary

When a synchronous MCP tool call blocks Claude Code's worker process for roughly 10–15 minutes, the UI surfaces:

× Something went wrong Try sending your message again. If it keeps happening, share feedback so we can investigate. The session stopped responding. Send your message again to resume with a fresh process. [Try again]

This appears to be a UI health-check declaring the worker dead, NOT an MCP-level timeout, NOT an Anthropic API timeout (the API_TIMEOUT_MS=900000 env var is for the Anthropic API only), and NOT user cancellation. The triggering MCP server process is left running indefinitely — orphaned — and repeat incidents accumulate stuck processes.

The failure mode is reproducible whenever an MCP server legitimately needs more wall time than the watchdog tolerates, which is realistic for reasoning-heavy servers like Codex with high reasoning effort.

Reproduce

Environment used for the investigation:

  • Claude Code 2.1.128, Windows 11, entrypoint claude-desktop
  • Model: Opus 4.7 [1M]
  • Codex CLI 0.125.0 installed and registered as an MCP server (codex mcp-server)
  • ~/.codex/config.toml: model = "gpt-5.4", model_reasoning_effort = "xhigh"

Steps:

  1. Invoke mcp__codex__codex with a prompt that requires Codex to read 4–6 files via its sandbox and produce a structured review (e.g., adversarially review a ~500-line plan document against a design doc, cross-checking 3–4 other files for consistency).
  2. Wait. Codex's own session log (~/.codex/sessions/<date>/rollout-...jsonl) shows continuous productive work — agent_message progress notes, function_call_output from sandbox reads, multiple turns each with token_count events.
  3. At around the 10–13 minute mark, Codex enters a long internal API call (between its log events) and stops emitting fresh log lines.
  4. Some time after that, Claude Code's UI displays the watchdog error above. The MCP tool_use never receives a tool_result.
  5. The user's next message becomes a last-prompt entry in the transcript, treated by Claude as a new turn. The original tool result is permanently lost.
  6. codex.exe continues running. Memory stays loaded (50–90 MB working set). CPU is near zero (process is waiting on its API call). Repeat incidents accumulate — we observed 16+ orphaned codex.exe after several days of these crashes.

We reproduced the failure twice in the same Claude session today, plus once on a separate project, all with comparable Codex review workloads.

Evidence (from our investigation)

  • Claude transcript: the MCP tool_use for mcp__codex__codex has no matching tool_result. Total tool_use count is 1 higher than tool_result count. No is_error: true tied to the tool_use.
  • Codex session log: ends without a task_complete event. The last entries are typically response_item events with type: reasoning and large encrypted_content payloads — Codex is mid-reasoning, not done.
  • API_TIMEOUT_MS=900000 (15 min) is set in the runtime env, but does not apply: in one case the hung call's Codex log went silent at 10:18 of wall time, and the watchdog UI appeared shortly after — under the 15-min API timeout. In another case, 12:27 of wall time. Both well below 15 min.
  • Process state: Get-Process codex confirms the MCP server is alive and idle (CPU not growing), holding state in memory, with no signal that Claude Code told it to exit.
  • No MCP-level cancellation or timeout event is recorded anywhere we can see.

Expected behavior (any of these would help)

  1. Increase the watchdog tolerance for MCP tool calls. ~10–15 min is below what reasoning-heavy MCP servers realistically need on complex tasks. Even doubling it would meaningfully reduce the failure rate for high-reasoning-effort Codex use.
  2. Provide a heartbeat mechanism over MCP stdio that an MCP server can use to signal "still working" without producing a tool_result. This would let well-behaved long-running MCP servers stay alive past the watchdog.
  3. Clean up MCP server processes when declaring sessions dead. The orphaned-process accumulation is its own bug; even if the watchdog itself can't be changed, killing the MCP child on watchdog-fire would prevent the leak.
  4. Improve the error text and recovery. Current text suggests retry, but retry hits the same issue immediately. Surfacing "long-running MCP tool exceeded N seconds; the tool server has been killed and orphaned" would help users (and plugin authors) diagnose the actual cause.

Workaround (validated)

For plugin authors hitting this: invoke the underlying CLI directly via Bash with run_in_background: true, output to a file, parse the JSONL event stream for progress. This sidesteps the synchronous MCP path entirely because the worker is never blocked.

We validated this with the exact same Codex review workload that previously hung MCP (multi-file review at xhigh reasoning):

  • Wall time: 14 minutes 44 seconds (≥ 30% past the empirical watchdog ceiling)
  • Process exit: clean (exit code 0)
  • Result file: 5 KB of structured review with 7 categorized findings
  • Token usage on the final turn: 635,963 input (506,880 cached, 80% hit), 22,928 output, 18,102 reasoning
  • No UI errors, no orphaned process

This is a real solution path but requires plugin authors to step outside the official MCP interface, which is unfortunate. A first-class fix in Claude Code would be preferable.

Why this matters

Codex with model_reasoning_effort = "xhigh" is a power-user configuration that gives substantially deeper code review. Any plugin that wants to use Codex as a peer reviewer on complex prompts will hit this ceiling under that config. We expect the same shape of failure for any reasoning-heavy MCP server (research agents, large-context analyzers, etc.).

Related artifacts

If useful, we have:

  • Full Codex session logs (~/.codex/sessions/...) showing the productive work right up to the watchdog firing
  • Claude transcript showing the unmatched tool_use
  • PowerShell process state output showing the orphaned codex.exe processes
  • Screenshots of the UI error text

Happy to share these if it would help diagnose; redacted them from this issue to keep it focused.

🤖 Filed from a debugging session with Claude Code.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - 💡(How to fix) Fix Long-running synchronous MCP tool calls (>~10–12 min) trigger 'session stopped responding' UI watchdog and orphan MCP server processes