claude-code - 💡(How to fix) Fix [BUG] auto-memory resolver walks up to ancestor-encoded project directory instead of cwd-encoded path [2 comments, 3 participants]

GitHub stats
anthropics/claude-code#53734 · Fetched 2026-04-28 06:48:27
Comments: 2 · Participants: 3 · Timeline events: 9 · Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×3, commented ×2

Claude Code's auto-memory path resolution appears to walk up to an ancestor-encoded project directory instead of the cwd-encoded project directory in at least some multi-level project trees. When both ancestor-encoded and cwd-encoded project directories exist under ~/.claude/projects/ with populated memory/MEMORY.md files, the harness's # auto memory system-prompt section references the ancestor path for some cwds while correctly referencing the cwd-encoded path for sibling cwds at equal depth. This produces a silent cross-session memory contamination surface for multi-agent or multi-sub-project workflows.


Root Cause

In multi-agent or multi-sub-project workflows where sibling cwds share an ancestor, this produces cross-session memory contamination: agent A's fresh session can load agent B's memory (because both walk up to the same ancestor). Without a defensive application-layer gate, this is load-bearing for agent-identity integrity.

Fix Action

Workaround

  • REST API fallback to the MCP server works normally; only the MCP transport is affected.
  • Restarting the Claude Code session re-establishes MCP transport (no in-session re-connect path observed).

Code Example

<project-root>/
├── team-a/
│   └── agent-1/    ← cwd A
└── team-b/
    └── agent-2/    ← cwd B


Preflight Checklist

  • I have searched existing issues and this hasn't been reported yet
  • This is a single bug report (please file separate reports for different bugs)
  • I am using the latest version of Claude Code

What's Wrong?

Summary

Claude Code's auto-memory path resolution appears to walk up to an ancestor-encoded project directory instead of the cwd-encoded project directory in at least some multi-level project trees. When both ancestor-encoded and cwd-encoded project directories exist under ~/.claude/projects/ with populated memory/MEMORY.md files, the harness's # auto memory system-prompt section references the ancestor path for some cwds while correctly referencing the cwd-encoded path for sibling cwds at equal depth. This produces a silent cross-session memory contamination surface for multi-agent or multi-sub-project workflows.

Environment

  • Claude Code: VSCode native extension build, version 2.1.120 (also reproducible on CLI 2.1.120 per cross-agent reports)
  • Platform: Windows 11, PowerShell / Git Bash
  • Shell: bash (Git Bash for Windows)
  • Projects directory: ~/.claude/projects/ (resolved to %USERPROFILE%\.claude\projects\)

Directory-name encoding observed

The harness encodes an absolute cwd path into a project-directory name under ~/.claude/projects/ by replacing each path separator and delimiter with a dash (-). Empirically observed across 18 project directories on one machine:

cwd                                        encoded directory
c:/Projects/my-workspace                   c--Projects-my-workspace
c:/Projects/my-workspace/sub/team/agent    c--Projects-my-workspace-sub-team-agent
C:\Users\<user>\                           C--Users-<user>
c:/path/.hidden-subdir                     c--path--hidden-subdir

Rule: each of :, /, \, . in the cwd path becomes a single -. Consecutive delimiters (e.g., :/) produce consecutive dashes. Letter case is preserved.
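The rule above can be written down as a one-line substitution. This is a minimal sketch based only on the rows observed in the table; `encode_cwd` is a hypothetical helper name, not a Claude Code API, and how the harness treats trailing separators, spaces, or other characters is not covered by these observations.

```python
import re

# The four delimiters named in the rule; each occurrence becomes one dash.
_DELIMITERS = re.compile(r"[:/\\.]")


def encode_cwd(cwd: str) -> str:
    """Encode an absolute cwd into a project-directory name (hypothetical helper).

    Existing dashes and letter case are left untouched, as observed.
    """
    return _DELIMITERS.sub("-", cwd)


if __name__ == "__main__":
    for path in ("c:/Projects/my-workspace", "c:/path/.hidden-subdir"):
        print(path, "->", encode_cwd(path))
```

Because consecutive delimiters are replaced one by one, `:/` naturally yields the double dash seen after the drive letter.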

Steps to reproduce

In a multi-level project tree with two sibling subdirectories at equal depth under a shared ancestor:

<project-root>/
├── team-a/
│   └── agent-1/    ← cwd A
└── team-b/
    └── agent-2/    ← cwd B
  1. Open Claude Code at cwd A. Harness creates ~/.claude/projects/<project-root-encoded>-team-a-agent-1/. Populate ~/.claude/projects/<project-root-encoded>-team-a-agent-1/memory/MEMORY.md with distinct content (e.g., title line # Memory — Agent 1).
  2. Open Claude Code at cwd B. Harness creates ~/.claude/projects/<project-root-encoded>-team-b-agent-2/. Populate that directory's memory/MEMORY.md with distinct content (e.g., title # Memory — Agent 2).
  3. Create an ancestor-encoded project directory ~/.claude/projects/<project-root-encoded>/ and populate its memory/MEMORY.md with content distinct from both (e.g., title # Memory — Ancestor).
  4. Open a fresh Claude Code session at cwd A. Inspect the harness-injected # auto memory system-prompt section in the transcript: it may reference <project-root-encoded>-team-a-agent-1/memory/ (correct) OR <project-root-encoded>/memory/ (walk-up to ancestor).
  5. Open a fresh Claude Code session at cwd B. Observe whether the auto-memory reference matches cwd B's cwd-encoded path or walks up.
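The directory layout from steps 1 to 3 can be prepared with a short script. This is a sketch under stated assumptions: it writes into a scratch directory instead of the real ~/.claude/projects/, the helper names (`make_memory`, `build_fixture`) and the example root name are hypothetical, and it only pre-creates the files. It does not replace opening Claude Code at each cwd, which is what actually exercises the resolver.

```python
import tempfile
from pathlib import Path


def make_memory(projects_dir: Path, encoded_name: str, title: str) -> Path:
    """Create <projects_dir>/<encoded_name>/memory/MEMORY.md with a title line."""
    memory_file = projects_dir / encoded_name / "memory" / "MEMORY.md"
    memory_file.parent.mkdir(parents=True, exist_ok=True)
    memory_file.write_text(f"# {title}\n", encoding="utf-8")
    return memory_file


def build_fixture(projects_dir: Path, root_encoded: str) -> dict:
    """Lay out the cwd A, cwd B, and ancestor directories with distinct titles."""
    return {
        "agent-1": make_memory(projects_dir, f"{root_encoded}-team-a-agent-1", "Memory - Agent 1"),
        "agent-2": make_memory(projects_dir, f"{root_encoded}-team-b-agent-2", "Memory - Agent 2"),
        "ancestor": make_memory(projects_dir, root_encoded, "Memory - Ancestor"),
    }


if __name__ == "__main__":
    # A scratch directory stands in for ~/.claude/projects/ so nothing real is touched.
    with tempfile.TemporaryDirectory() as scratch:
        for label, path in build_fixture(Path(scratch), "c--Projects-demo").items():
            print(label, "->", path)
```

Distinct title lines make it easy to tell from a transcript which of the three files a session actually loaded.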

Expected behavior

For any cwd, the # auto memory template should reference ~/.claude/projects/<cwd-encoded>/memory/ — the project directory whose encoded name is derived from the current cwd. No walk-up fall-through to ancestor-encoded paths.

If a cwd-encoded memory directory does not yet exist, it should be created at the cwd-encoded location on first write (never at an ancestor).

Actual behavior

  • Some cwds: the template references the cwd-encoded path (correct).
  • Other cwds at equal depth in the same parent tree: the template references an ancestor-encoded path (walk-up), even when the cwd-encoded path exists with a populated memory/ subdirectory.
  • The differential appears stable across fresh sessions — an affected cwd remains affected; an unaffected cwd remains unaffected. (See "Hypotheses" below.)

Impact

In single-project workflows where only one cwd at a time is active, the behavior may go unnoticed — the wrong memory loads silently.

In multi-agent or multi-sub-project workflows where sibling cwds share an ancestor, this produces cross-session memory contamination: agent A's fresh session can load agent B's memory (because both walk up to the same ancestor). Without a defensive application-layer gate, this is load-bearing for agent-identity integrity.

We documented 8 cross-agent contamination incidents over approximately one month of ecosystem operation on this machine before we mitigated by eliminating the ancestor memory/ directory. Each incident was caught by an application-layer gate that verifies the title of the auto-loaded MEMORY.md matches the expected agent name; without that gate, the contamination would have gone unnoticed until memory content was acted on.

Workaround

  1. Ensure no ancestor-encoded project directory under ~/.claude/projects/ has a memory/ subdirectory (rename to memory.bak-<timestamp> if present). With the ancestor absent, walk-up fall-through resolves to a nonexistent path and loads nothing — loss of auto-memory for affected cwds, but no contamination.
  2. Add a session-start discipline in CLAUDE.md or equivalent: before reading or writing to the auto-referenced memory directory, verify the title of MEMORY.md matches the expected project / agent identity. If mismatched, stop and alert the user.

Workaround (1) has an operational cost: agents must use absolute paths to their cwd-encoded memory directory via the Read / Write tools, ignoring the harness template's reference.

Workaround (2) has a per-deployment workflow tax: each agent or project-identity has to carry the gate, and a silent contamination before the gate fires is still possible on short-write paths.
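The title check in workaround (2) could look roughly like the following. It is a sketch of an application-layer gate, not something the harness provides: `memory_title` and `check_memory_identity` are hypothetical names, and the case-insensitive substring match is an assumption that a real deployment may want to tighten (for example, "Agent 1" would also match a title containing "Agent 10").

```python
from pathlib import Path
from typing import Optional


def memory_title(memory_file: Path) -> Optional[str]:
    """Return the text of the first Markdown heading in MEMORY.md, or None."""
    if not memory_file.is_file():
        return None
    for line in memory_file.read_text(encoding="utf-8").splitlines():
        if line.startswith("#"):
            return line.lstrip("#").strip()
    return None


def check_memory_identity(memory_file: Path, expected_name: str) -> bool:
    """True only when the loaded MEMORY.md title names the expected agent.

    A missing file or missing title counts as a mismatch, so the caller
    stops and alerts the user instead of reading or writing memory.
    """
    title = memory_title(memory_file)
    return title is not None and expected_name.lower() in title.lower()
```

Treating "no title" as a failure keeps the gate conservative: an agent never writes into a memory file it cannot positively identify.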

A platform-level fix would eliminate both costs.

Suggested fix

Generate the # auto memory path template deterministically from the current cwd at session start, using the canonical cwd-encoding algorithm already observable in the ~/.claude/projects/ directory-naming convention. No walk-up fall-through. If the target directory does not exist, create it at the cwd-encoded path on first write.
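As an illustration of the suggested behavior, the sketch below derives the memory directory purely from the cwd and creates it on first write. It is not the harness's implementation: `encode_cwd`, `resolve_memory_dir`, and `ensure_memory_dir` are hypothetical names, and the encoding follows only the rule observed in this report.

```python
import re
from pathlib import Path

_DELIMITERS = re.compile(r"[:/\\.]")


def encode_cwd(cwd: str) -> str:
    """Observed encoding: each of ':', '/', '\\', '.' becomes one dash."""
    return _DELIMITERS.sub("-", cwd)


def resolve_memory_dir(cwd: str, projects_dir: Path) -> Path:
    """Return <projects_dir>/<cwd-encoded>/memory without consulting the filesystem.

    No ancestor is ever inspected, so an existing ancestor-encoded
    directory cannot change the result.
    """
    return projects_dir / encode_cwd(cwd) / "memory"


def ensure_memory_dir(cwd: str, projects_dir: Path) -> Path:
    """Create the cwd-encoded memory directory on first write and return it."""
    target = resolve_memory_dir(cwd, projects_dir)
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Keeping the resolver a pure function of the cwd is what makes sibling cwds at equal depth behave identically.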

If walk-up fall-through is intentional (e.g., to support a monorepo-style shared memory across sub-projects), we suggest:

  • Documenting it explicitly in the # auto memory template text. (Current template says "This directory already exists" but the referenced directory did not exist at the cwd-encoded path in our affected cwds — a contradiction with observed behavior.)
  • Making it opt-in via explicit user configuration (e.g., .claude/settings.json key memory.walk_up_parents: true), rather than default behavior.
  • Making it consistent across all cwds at equal depth in a given tree (not sibling-differential as observed).

Hypotheses for the differential

Two possibilities observed but not verified against source:

Hypothesis A (cached template). The harness caches the auto-memory template when the cwd's project directory is first created. If a parent-encoded memory/ existed at that time — even if it is later removed — the cached template persists, and subsequent sessions for that cwd continue to reference the stale ancestor path.

Hypothesis B (longest-ancestor-with-memory). The harness uses a path-resolution algorithm that picks the longest existing ancestor-encoded project directory with a memory/ subdirectory, rather than always picking the cwd-encoded path. In this model, a cwd resolves to an ancestor whenever one of its ancestor-encoded directories has a memory/ subdir, and resolves to its own cwd-encoded path only when none of them does.

Either hypothesis produces the observed sibling-differential behavior. The fix recommended above (deterministic cwd-encoded resolution, no walk-up) is robust to both.
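To make Hypothesis B concrete, here is a toy model of the resolution order it describes. It is purely illustrative and unverified against the harness source: all function names are hypothetical, paths are assumed to use forward slashes, and the set of directories that contain memory/ is passed in rather than read from disk.

```python
import re

_DELIMITERS = re.compile(r"[:/\\.]")


def encode_cwd(cwd: str) -> str:
    """Observed encoding: each of ':', '/', '\\', '.' becomes one dash."""
    return _DELIMITERS.sub("-", cwd)


def ancestor_paths(cwd: str) -> list[str]:
    """Proper ancestors of a forward-slash cwd, nearest (longest) first."""
    parts = cwd.rstrip("/").split("/")
    return ["/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def resolve_hypothesis_b(cwd: str, dirs_with_memory: set[str]) -> str:
    """Model of Hypothesis B: prefer the longest ancestor that has memory/.

    Falls back to the cwd-encoded name only when no ancestor qualifies.
    """
    for ancestor in ancestor_paths(cwd):
        name = encode_cwd(ancestor)
        if name in dirs_with_memory:
            return name
    return encode_cwd(cwd)
```

Under this model, two siblings that share an ancestor with a memory/ directory both resolve to that ancestor, which is the contamination case described under Impact. Removing the ancestor's memory/ sends each sibling back to its own cwd-encoded directory.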

Additional context

  • This is not a security-critical bug (no data leak to external parties).
  • It is load-bearing for AI-assistant ecosystem integrity: memory is a persistent identity / context store, and silent cross-session contamination at that layer causes identity-drift in workflows that rely on per-project or per-agent memory.
  • A workaround is available, but the workaround taxes every deployment with a session-start verification gate.

Thank you for Claude Code — the auto-memory feature is valuable; this specific resolver behavior is the only surface we've had to defensively gate against in our multi-project workflow.

What Should Happen?

Expected behavior

When an MCP server is healthy but its responses are temporarily slow (e.g. due to upstream resource pressure on the host), Claude Code should:

  1. NOT pre-emptively drop the MCP transport. Issuing a system-reminder that declares the server "disconnected" — when the server is actually up and reachable — misrepresents the failure mode and forces the user into REST fallback when retry would have succeeded.

  2. Implement bounded retry/backoff on transient transport timeouts before declaring the connection dead. Even a single retry with a modest backoff would catch the "MCP server overloaded for 30s, then recovers" case which we observed.

  3. Provide an in-session reconnect path. Currently, once the harness declares MCP disconnected, the only recovery is restarting the Claude Code session entirely. A "reconnect MCP" mechanism (manual trigger via slash-command, or automatic on next tool call after N seconds) would let users recover without losing session context.

  4. Surface a less-aggressive system-reminder when transport is degraded but not dead — e.g. "MCP transport slow (last response N seconds); will retry" rather than "MCP server disconnected; use REST fallback." Users can then make an informed choice about waiting vs falling back.

  5. Make the MCP-client HTTP timeout configurable (env var or settings). Different MCP servers have different latency profiles; a one-size timeout will misfire on legitimate slow operations.

  6. Verify server liveness before declaring transport dead. A simple HTTP HEAD or health-check probe would distinguish "server unreachable" from "server slow" — these are very different failure modes that currently trigger the same harness response.

The underlying principle: Claude Code's MCP client should treat slow responses and disconnections as different events with different remediation. Today they collapse to the same "disconnected" treatment, which is misleading and fragile.
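The distinction between "slow" and "dead" can be sketched as a bounded retry followed by a liveness probe. This is an illustration of points 2 and 6 only, not Claude Code's MCP client: `call_with_retry`, `TransportSlow`, and `TransportDead` are hypothetical names, and the caller supplies both the tool call and the health probe.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransportSlow(Exception):
    """Retries timed out, but the server still answers its health probe."""


class TransportDead(Exception):
    """Retries timed out and the health probe failed as well."""


def call_with_retry(
    call: Callable[[], T],
    is_alive: Callable[[], bool],
    retries: int = 2,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on TimeoutError with exponential backoff.

    Only after the last attempt times out is the server probed, so a slow
    but healthy server is reported differently from an unreachable one.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == retries:
                break
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x ... base_delay
    if is_alive():
        raise TransportSlow("server is reachable but responses are timing out")
    raise TransportDead("server did not respond to the liveness probe")
```

A caller could map `TransportSlow` to the softer "transport slow, will retry" reminder and reserve the "disconnected" message for `TransportDead`.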

Error Messages/Logs

NSC PDF processing — CUDA context loss (system-instability evidence)

File "C:\OB1\recipes\pdf-to-wiki-export\.venv\Lib\site-packages\surya\foundation\cache\dynamic_ops.py", line 330, in _decode_update
    .repeat(batch_size, 1)
     ~~~~~~^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: unknown error
Search for `cudaErrorUnknown' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Recognizing Text: 100%|#########9| 3461/3462 [1:06:01<00:01,  1.14s/it]
Failed at page 3461 of 3462 (99.97%) after 1h6min runtime. Owner observed system "acted like it rebooted" during this window. Subsequent PDFs in the same batch processed cleanly post-event (GPU recovered).

MCP server logs — v5.26 contradiction-cascade thrashing during the same window

v5.26 cascade fired: label=neutral
v5.26 cascade fired: label=neutral
v5.26 cascade fired: label=entailment
[... many similar lines ...]
v5.26 cascade fired: label=neutral
v5.26 cascade to llama3 failed: AbortError: The signal has been aborted
v5.26 cascade fired: label=neutral
v5.26 cascade fired: label=neutral
v5.26 cascade to llama3 failed: AbortError: The signal has been aborted
v5.26 cascade fired: label=neutral
Pattern: contradiction-detection cascade (DeBERTa NLI → llama3 fallback) abort-class failures during sustained GPU saturation. Cascade calls llama3 synchronously during capture; CPU-fallback inference exceeded the 30s timeout. Pod itself stayed Running 2d3h, 0 restarts — no server-side crash; only inference-layer thrash.

System telemetry at peak (nvidia-smi)

NVIDIA GeForce RTX 5070, 11389 MiB (used) / 467 MiB (free) / 12227 MiB (total), driver 581.80
96% VRAM saturated. cu128 PyTorch stack on Blackwell.

Ollama state during the event (GET /api/ps)

{"models":[{"name":"llama3:latest","size":4894744576,
"size_vram":0, "context_length":4096, ...}]}
size_vram: 0 = llama3 evicted from VRAM, running on CPU/RAM. Ollama process working set: ~5.5 GB system RAM.

Recovery (post-event)

NVIDIA GeForce RTX 5070, 2661 MiB (used) / 9195 MiB (free) / 12227 MiB (total)
{"models":[]}  # /api/ps - all models unloaded
8.7 GB VRAM freed. System back to baseline within minutes.
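
A quick check of the arithmetic behind the telemetry figures, using only the numbers in the nvidia-smi lines above. Note that the 96% figure matches used / (used + free); measured against the 12227 MiB total, the same reading is about 93%.

```python
# Values in MiB, copied from the peak and recovery nvidia-smi lines.
used_peak, free_peak, total = 11389, 467, 12227
used_after = 2661

share_of_used_plus_free = used_peak / (used_peak + free_peak)
share_of_total = used_peak / total
freed_mib = used_peak - used_after

print(f"used / (used + free): {share_of_used_plus_free:.1%}")
print(f"used / total:         {share_of_total:.1%}")
print(f"freed after recovery: {freed_mib} MiB")
```

The freed amount of 8728 MiB is what the report rounds to 8.7 GB.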

Steps to Reproduce

Environment

  • OS: Windows 11
  • Claude Code: VSCode extension (Windows installer build)
  • MCP server: local HTTP MCP, port 8000 (Deno + Postgres + pgvector)
  • Hardware: NVIDIA RTX 5070 (12 GB VRAM), driver 581.80
  • ML stack: PyTorch + cu128 (Blackwell), Ollama on host (llama3 8B Q4_0)

Setup

  1. Run a Claude Code session with an MCP server whose tools invoke synchronous compute (in our case: contradiction-detection cascade that calls Ollama llama3 inline during capture).
  2. Confirm MCP tools register and work normally — make several captures and searches; verify they complete in expected latency (<3s).

Trigger

  1. In a separate process on the same host, start a sustained GPU-heavy workload that saturates VRAM (we used marker_single processing a 3000+ page PDF; any GPU job that holds >90% VRAM for 10+ minutes should suffice).
  2. Verify the MCP server's downstream inference layer is forced to CPU fallback. Check via curl http://localhost:11434/api/ps and look for "size_vram": 0 on the loaded model, which confirms eviction from VRAM to CPU/RAM.
  3. Continue normal Claude Code session activity in parallel — keep issuing MCP tool calls (captures, searches) at modest cadence.
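The eviction check in step 2 of the trigger can be scripted. The sketch below reads Ollama's /api/ps response and lists loaded models whose `size_vram` is 0. The helper names are hypothetical, and the default base URL simply mirrors the localhost:11434 address used in this report.

```python
import json
from urllib.request import urlopen


def evicted_models(ps_payload: dict) -> list:
    """Names of loaded models reporting size_vram == 0 (running on CPU/RAM)."""
    return [
        model.get("name", "<unnamed>")
        for model in ps_payload.get("models", [])
        if model.get("size_vram") == 0
    ]


def fetch_ps(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """GET /api/ps from a running Ollama instance and parse the JSON body."""
    with urlopen(f"{base_url}/api/ps", timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires Ollama to be running locally; prints an empty list otherwise healthy.
    print(evicted_models(fetch_ps()))
```

An empty `models` list, as seen after recovery, yields no evicted models because nothing is loaded at all.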

Observed

  1. After some duration of degraded MCP-tool latency (individual responses taking 10-30s instead of 1-3s), the Claude Code harness injects a system-reminder stating the MCP server has disconnected.
  2. The MCP tool list (e.g. mcp__<server>__* entries) becomes inaccessible. ToolSearch returns no match for previously-loaded MCP tools.
  3. The system-reminder fires pre-emptively — not as a runtime error from a failed tool call, but as an injected reminder before the next tool invocation.

Expected

  • Transient MCP transport slowness should trigger retry/re-establish logic, or at minimum a less-aggressive system-reminder ("transport slow, retry?" vs "disconnected, fallback now").
  • The MCP server itself remains healthy throughout (verifiable via HTTP health endpoint); only the harness-side transport is dropped.

Server-side ground truth (for differential diagnosis)

kubectl get pods -n <namespace>
# pod restart count: 0; uptime continuous across the disconnect window

curl http://localhost:8000/health
# returns {"status":"ok"} throughout

Workaround

  • REST API fallback to the MCP server works normally; only the MCP transport is affected.
  • Restarting the Claude Code session re-establishes MCP transport (no in-session re-connect path observed).

Reproducibility caveat

Reproducing on-demand requires the specific upstream-pressure setup. The class is likely "any synchronous MCP tool whose response latency exceeds the harness MCP-client timeout under load," not specifically Ollama or contradiction-detection. A simpler synthetic reproducer might be an MCP server that artificially injects N-second sleeps on each tool call.
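
A synthetic stand-in along those lines might look like the sketch below. It is a plain HTTP server, not an MCP implementation: every POST sleeps for a fixed delay before answering, while GET /health replies immediately, which reproduces the "slow but healthy" shape without any GPU load. All names here (`make_slow_server`, `timed_post`, the `/tool` path) are hypothetical.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen


def make_slow_server(delay_seconds: float, port: int = 0) -> ThreadingHTTPServer:
    """Build a server whose POST handler sleeps before replying; port 0 picks a free port."""

    class SlowHandler(BaseHTTPRequestHandler):
        def _reply(self, payload: dict) -> None:
            body = json.dumps(payload).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def do_GET(self) -> None:
            # Health endpoint stays fast even while tool calls are slow.
            self._reply({"status": "ok"})

        def do_POST(self) -> None:
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)  # drain the request body
            time.sleep(delay_seconds)  # simulated slow tool call
            self._reply({"status": "ok", "delayed_by": delay_seconds})

        def log_message(self, *args) -> None:
            pass  # keep the console quiet

    return ThreadingHTTPServer(("127.0.0.1", port), SlowHandler)


def timed_post(url: str, timeout: float = 60.0) -> float:
    """POST an empty JSON object and return the elapsed seconds."""
    start = time.monotonic()
    with urlopen(Request(url, data=b"{}", method="POST"), timeout=timeout) as response:
        response.read()
    return time.monotonic() - start


if __name__ == "__main__":
    server = make_slow_server(delay_seconds=2.0)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    print("POST took", round(timed_post(f"http://127.0.0.1:{port}/tool"), 2), "s")
    server.shutdown()
```

Raising the delay past the client's timeout, while /health keeps answering, isolates the question of whether the harness distinguishes slow responses from a dead server.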

Not confirmed as a regression. This is the first time we've explicitly captured and documented an "MCP transport disconnect mid-session" event with full forensic detail (server-side logs, GPU/VRAM telemetry, cascade-thrash correlation). However, we cannot rule out that this class of failure has been occurring intermittently for some time and simply wasn't surfaced: agents previously fell back to REST or restarted sessions without filing a precise observation.

We don't have version-pinned evidence of "worked in Claude Code version X, broken in version Y."

Claude Model

Opus

Is this a regression?

I don't know

Last Working Version

No response

Claude Code Version

VSCode extension: anthropic.claude-code 2.1.120

Platform

Anthropic API

Operating System

Windows

Terminal/Shell

VS Code integrated terminal

Additional Information

Single-event framing (load-bearing diagnostic insight)

Three symptoms occurred in the same window on the same host:

  1. MCP transport disconnect (this bug)
  2. Sustained inference-layer cascade thrash on the connected MCP server (DeBERTa NLI → llama3 fallback hitting 30s timeouts repeatedly)
  3. CUDA "unknown error" in an unrelated GPU process (PDF processing, marker_single)

Owner observed: "the entire system acted like it rebooted" during the event. Recovery: subsequent operations on a clean GPU completed normally within minutes.

The strongest framing of this bug is therefore not just "MCP transport drops" but "Claude Code subsystems do not gracefully degrade under upstream resource pressure on the host." A fix that addresses MCP transport timeout/retry alone would help, but the broader class might warrant resilience review across other Claude Code subsystems too.

Suggested mitigation directions (customer perspective)

  • Configurable MCP-client HTTP timeout (env var or settings)
  • In-session auto-reconnect on transient transport failure (current behavior: pre-emptive system-reminder + drop; no retry)
  • Less-aggressive system-reminder shape — "transport slow, retry?" rather than "disconnected, REST-fallback now"
  • Optional transport-health telemetry so users see degradation before the cliff

Multi-agent operational context

This host runs 7+ concurrent Claude Code agent sessions in normal operation (different working directories, different MCP servers, often overlapping HTTP traffic to local services). Bugs in MCP transport resilience compound across the fleet — one agent's degraded session amplifies cross-agent coordination latency. Fix value scales with multi-agent deployments.

Companion observation (separate but related)

[Include if Sabretooth's auto-memory walk-up draft is being submitted in parallel:] We have a separate platform-bug draft on Claude Code's auto-memory resolver walking up the cwd-encoded tree past the agent's deep bucket. Same broader class as this bug: Claude Code path-resolution / subsystem behavior is non-uniform and surprising under nominally-supported configurations. Reference: [link to sister submission if filed; or describe inline if cross-filing].

extent analysis

TL;DR

The Claude Code MCP transport disconnects pre-emptively under upstream resource pressure, and a fix is needed to implement bounded retry/backoff on transient transport timeouts.

Guidance

  • Review the MCP-client HTTP timeout configuration to ensure it's suitable for the specific MCP server's latency profile.
  • Implement a retry mechanism with a modest backoff for transient transport timeouts to prevent pre-emptive disconnections.
  • Consider adding an in-session reconnect path to allow users to recover from transport failures without losing session context.
  • Surface a less-aggressive system-reminder for degraded transport, such as "transport slow, retry?" instead of "disconnected, fallback now".
  • Verify server liveness before declaring transport dead to distinguish between slow responses and disconnections.

Example

No specific code example is provided, as the issue requires a review of the Claude Code MCP transport implementation and configuration.

Notes

The issue is not a security-critical bug, but it's load-bearing for AI-assistant ecosystem integrity. A workaround is available, but it taxes every deployment with a session-start verification gate. A platform-level fix would eliminate these costs.

Recommendation

Apply a workaround by configuring the MCP-client HTTP timeout and implementing a retry mechanism, as the issue is not explicitly stated to be fixed in a specific version.


