litellm - MCP tool discovery fails with remote Streamable HTTP servers: asyncio.CancelledError / anyio TaskGroup race [1 comment, 2 participants]

litellm · 2026-03-05 20:26:44


GitHub stats

BerriAI/litellm#22928 • Fetched 2026-04-08 00:39:23

Author: jerudnik
Participants: arthikrangan, jerudnik
Timeline (top): commented ×1, cross-referenced ×1, subscribed ×1

MCP tool discovery systematically fails when using remote servers over Streamable HTTP (transport: "http") or SSE (transport: "sse"). All list_tools calls are cancelled before completion, resulting in 0 tools discovered at startup.

Error Message

MCP client list_tools was cancelled for google_docs
Timeout while listing tools from context7
MCP client list_tools was cancelled for exa

Root Cause

This is the same underlying issue as #20715. The partial fix (switching asyncio.wait_for → anyio.fail_after in _fetch_tools_with_timeout) addressed Root Cause #1, but Root Cause #2 persists:

The MCP SDK's handle_get_stream (in mcp/client/streamable_http.py) opens a long-lived SSE GET connection as part of the anyio.TaskGroup. When this GET connection fails or is rejected (common with remote servers that don't support server-initiated messages), it propagates an exception through the TaskGroup, which cancels the sibling post_writer task — and with it, the list_tools call.

Fix Action

Workaround

Patching handle_get_stream to be a no-op inside the container resolves the issue (confirmed by #20715 reporter).


Description

Environment

LiteLLM version: v1.81.14-stable (also reproduced on main-latest / same version)
Deployment: Podman container on NixOS
MCP servers: 16 servers (13 via mcp-proxy stdio→SSE/HTTP bridges on localhost, 3 direct HTTP)
Transport tested: Both sse and http (Streamable HTTP) — same failure

Configuration

mcp_servers:
  exa:
    url: "http://host.containers.internal:20200/mcp"
    transport: "http"
    description: "Exa AI web search"
    allow_all_keys: true
  # ... 15 more servers

Observed Behavior

Every 30 seconds, LiteLLM attempts tool discovery and every server fails:

MCP client list_tools was cancelled for google_docs
Timeout while listing tools from context7
MCP client list_tools was cancelled for exa

The error alternates between CancelledError and TimeoutError, but the cancellation happens within 1-2 seconds — well before the 30s anyio.fail_after deadline. This is not a real timeout.

Root Cause

Code path

1. _fetch_tools_with_timeout() → anyio.fail_after(30.0) → client.list_tools()
2. list_tools() → run_with_session() → opens TaskGroup with:
   - post_writer (handles request/response; this is the useful one)
   - handle_get_stream (opens SSE GET; this fails for remote servers)
3. handle_get_stream fails → TaskGroup cancels post_writer → list_tools raises CancelledError

Health check also affected

_run_health_check at line ~2495 still uses asyncio.wait_for() (not anyio.fail_after), which compounds the issue.

Suggested Fix

In MCP SDK (mcp/client/streamable_http.py): Make handle_get_stream a no-op or catch its exceptions without propagating to the TaskGroup:

async def handle_get_stream(...):
    try:
        # existing logic
    except Exception:
        logger.debug("GET stream failed, continuing without server push")
        return  # Don't propagate to TaskGroup

In LiteLLM (mcp_server_manager.py): Replace asyncio.wait_for() in _run_health_check (line ~2495) with anyio.fail_after() for consistency:

# Before:
await asyncio.wait_for(client.run_with_session(_noop), timeout=10.0)
# After:
with anyio.fail_after(10.0):
    await client.run_with_session(_noop)

Make timeouts configurable: The 30.0s in _fetch_tools_with_timeout is hardcoded. PR #22287 adds env vars for this, but it hasn't landed in a stable release yet.
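
As a rough illustration of what a configurable timeout could look like, the helper below reads a timeout from an environment variable and falls back to the current hardcoded value. The variable name MCP_TOOL_LISTING_TIMEOUT and the function read_timeout are hypothetical, chosen for this sketch; the names actually introduced by PR #22287 are not stated in this issue.

```python
import os

# 30.0 is the hardcoded value mentioned in the issue.
DEFAULT_TOOL_LISTING_TIMEOUT = 30.0


def read_timeout(env_name, default, environ=None):
    """Return a positive timeout in seconds from the environment, else default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        return default  # ignore values that are not numbers
    # Zero, negative and NaN values are rejected in favour of the default.
    return value if value > 0 else default


if __name__ == "__main__":
    # MCP_TOOL_LISTING_TIMEOUT is a hypothetical variable name.
    print(read_timeout("MCP_TOOL_LISTING_TIMEOUT", DEFAULT_TOOL_LISTING_TIMEOUT))
```

Falling back silently keeps startup robust against a typo in the environment; a stricter deployment might prefer to log or raise on an invalid value instead.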

Workaround

Patching handle_get_stream to be a no-op inside the container resolves the issue (confirmed by #20715 reporter).

extent analysis

Fix Plan

To resolve the MCP tool discovery issue, follow these steps:

Modify the MCP SDK:

In mcp/client/streamable_http.py, update the handle_get_stream function to catch exceptions without propagating to the TaskGroup:

async def handle_get_stream(...):
    try:
        # existing logic
    except Exception:
        logger.debug("GET stream failed, continuing without server push")
        return  # Don't propagate to TaskGroup

Update LiteLLM:

In mcp_server_manager.py, replace asyncio.wait_for() with anyio.fail_after() in the _run_health_check method:

# Before:
await asyncio.wait_for(client.run_with_session(_noop), timeout=10.0)
# After:
with anyio.fail_after(10.0):
    await client.run_with_session(_noop)

Make timeouts configurable:
- Use environment variables to configure timeouts, as proposed in PR #22287.

Verification

To verify the fix:

Restart the LiteLLM service after applying the changes.
Monitor the logs for successful tool discovery and absence of CancelledError or TimeoutError messages.
Test the health check to ensure it completes without errors.
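
One way to make the log check repeatable is a small script that scans captured log text for the two failure messages quoted in this issue. This is only a sketch: the function name failing_servers is made up here, and it assumes the messages keep exactly the wording shown under Observed Behavior.

```python
import re

# Matches the two failure messages quoted in this issue.
FAILURE_PATTERN = re.compile(
    r"MCP client list_tools was cancelled for (?P<cancelled>\S+)"
    r"|Timeout while listing tools from (?P<timed_out>\S+)"
)


def failing_servers(log_text):
    """Map each server name to the set of failure kinds seen in the log."""
    failures = {}
    for line in log_text.splitlines():
        match = FAILURE_PATTERN.search(line)
        if match is None:
            continue
        if match.group("cancelled"):
            name, kind = match.group("cancelled"), "cancelled"
        else:
            name, kind = match.group("timed_out"), "timeout"
        failures.setdefault(name, set()).add(kind)
    return failures


if __name__ == "__main__":
    sample = (
        "MCP client list_tools was cancelled for google_docs\n"
        "Timeout while listing tools from context7\n"
        "MCP client list_tools was cancelled for exa\n"
    )
    print(failing_servers(sample))
```

An empty result after a few discovery cycles suggests the fix took effect; any remaining entries show which servers still fail and how.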

Extra Tips

Consider implementing retries with exponential backoff for handling temporary connection issues.
Review the MCP SDK and LiteLLM code for similar issues that may be caused by unhandled exceptions in TaskGroups.
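
The retry tip can be sketched as a generic helper. Everything here is illustrative: retry_with_backoff and its parameters are hypothetical names, not part of LiteLLM or the MCP SDK, and the choice to retry only on ConnectionError and TimeoutError is an assumption about which failures are temporary.

```python
import asyncio


async def retry_with_backoff(operation, attempts=4, base_delay=0.5,
                             max_delay=8.0, sleep=asyncio.sleep):
    """Await operation(), retrying transient failures with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return await operation()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            if attempt == attempts - 1:
                break  # no delay after the final attempt
            # Delays grow as base_delay * 2**attempt, capped at max_delay.
            await sleep(min(max_delay, base_delay * 2 ** attempt))
    raise last_error


if __name__ == "__main__":
    async def always_ok():
        return "tools listed"

    print(asyncio.run(retry_with_backoff(always_ok)))
```

The sleep parameter is injectable so the helper can be tested without real delays. Retries only help with temporary connection problems; they would not mask the TaskGroup cancellation described in this issue, which fails the same way on every attempt.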
