litellm - 💡(How to fix) Fix MCP tool discovery fails with remote Streamable HTTP servers — asyncio.CancelledError / anyio TaskGroup race [1 comments, 2 participants]

GitHub stats: BerriAI/litellm#22928, fetched 2026-04-08 00:39:23. Comments: 1. Participants: 2. Timeline events: 3 (commented ×1, cross-referenced ×1, subscribed ×1). Reactions: 1.


Description

MCP tool discovery systematically fails when using remote servers over Streamable HTTP (transport: "http") or SSE (transport: "sse"). All list_tools calls are cancelled before completion, resulting in 0 tools discovered at startup.

Environment

  • LiteLLM version: v1.81.14-stable (also reproduced on main-latest / same version)
  • Deployment: Podman container on NixOS
  • MCP servers: 16 servers (13 via mcp-proxy stdio→SSE/HTTP bridges on localhost, 3 direct HTTP)
  • Transport tested: Both sse and http (Streamable HTTP) — same failure

Configuration

mcp_servers:
  exa:
    url: "http://host.containers.internal:20200/mcp"
    transport: "http"
    description: "Exa AI web search"
    allow_all_keys: true
  # ... 15 more servers

Observed Behavior

Every 30 seconds, LiteLLM attempts tool discovery and every server fails:

MCP client list_tools was cancelled for google_docs
Timeout while listing tools from context7
MCP client list_tools was cancelled for exa

The error alternates between CancelledError and TimeoutError, but the cancellation happens within 1-2 seconds — well before the 30s anyio.fail_after deadline. This is not a real timeout.

Root Cause

This is the same underlying issue as #20715. The partial fix (switching asyncio.wait_for → anyio.fail_after in _fetch_tools_with_timeout) addressed Root Cause #1, but Root Cause #2 persists:

The MCP SDK's handle_get_stream (in mcp/client/streamable_http.py) opens a long-lived SSE GET connection as part of the anyio.TaskGroup. When this GET connection fails or is rejected (common with remote servers that don't support server-initiated messages), it propagates an exception through the TaskGroup, which cancels the sibling post_writer task — and with it, the list_tools call.

Code path

  1. _fetch_tools_with_timeout() → anyio.fail_after(30.0) → client.list_tools()
  2. list_tools() → run_with_session() → opens TaskGroup with:
    • post_writer (handles request/response — this is the useful one)
    • handle_get_stream (opens SSE GET — this fails for remote servers)
  3. handle_get_stream fails → TaskGroup cancels post_writer → list_tools raises CancelledError

Health check also affected

_run_health_check at line ~2495 still uses asyncio.wait_for() (not anyio.fail_after), which compounds the issue.

Suggested Fix

  1. In MCP SDK (mcp/client/streamable_http.py): Make handle_get_stream a no-op or catch its exceptions without propagating to the TaskGroup:

    async def handle_get_stream(...):
        try:
            # existing logic
        except Exception:
            logger.debug("GET stream failed, continuing without server push")
            return  # Don't propagate to TaskGroup
  2. In LiteLLM (mcp_server_manager.py): Replace asyncio.wait_for() in _run_health_check (line ~2495) with anyio.fail_after() for consistency:

    # Before:
    await asyncio.wait_for(client.run_with_session(_noop), timeout=10.0)
    # After:
    with anyio.fail_after(10.0):
        await client.run_with_session(_noop)
  3. Make timeouts configurable: The 30.0s in _fetch_tools_with_timeout is hardcoded. PR #22287 adds env vars for this, but it hasn't landed in a stable release yet.

Workaround

Patching handle_get_stream to be a no-op inside the container resolves the issue (confirmed by #20715 reporter).

extent analysis

Fix Plan

To resolve the MCP tool discovery issue, follow these steps:

  1. Modify the MCP SDK:
    • In mcp/client/streamable_http.py, update the handle_get_stream function to catch exceptions without propagating to the TaskGroup:
      async def handle_get_stream(...):
          try:
              # existing logic
          except Exception:
              logger.debug("GET stream failed, continuing without server push")
              return  # Don't propagate to TaskGroup
  2. Update LiteLLM:
    • In mcp_server_manager.py, replace asyncio.wait_for() with anyio.fail_after() in the _run_health_check method:
      # Before:
      await asyncio.wait_for(client.run_with_session(_noop), timeout=10.0)
      # After:
      with anyio.fail_after(10.0):
          await client.run_with_session(_noop)
  3. Make timeouts configurable:
    • Use environment variables to configure timeouts, as proposed in PR #22287.
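
As a rough illustration of step 3, a timeout could be read from the environment with a safe fallback. This is a sketch only: the variable name EXAMPLE_MCP_TOOL_LISTING_TIMEOUT and the helper read_timeout are hypothetical, and the names actually introduced by PR #22287 may differ.

```python
import os

DEFAULT_TOOL_LISTING_TIMEOUT = 30.0  # matches the hardcoded value cited in the issue


def read_timeout(env_name, default, environ=None):
    """Return a positive float from the environment, or the default if unset or invalid."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        return default  # ignore malformed values rather than crash at startup
    return value if value > 0 else default


# Hypothetical variable name, used for illustration only.
timeout = read_timeout("EXAMPLE_MCP_TOOL_LISTING_TIMEOUT", DEFAULT_TOOL_LISTING_TIMEOUT)
print(timeout)
```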

Verification

To verify the fix:

  • Restart the LiteLLM service after applying the changes.
  • Monitor the logs for successful tool discovery and absence of CancelledError or TimeoutError messages.
  • Test the health check to ensure it completes without errors.
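
The log check can be partly automated. The sketch below assumes the two failure message formats quoted under Observed Behavior and scans log lines for them; the function name failing_servers is hypothetical, and other LiteLLM versions may word these messages differently.

```python
import re

# Patterns taken from the log lines quoted in this issue.
FAILURE_PATTERNS = (
    re.compile(r"MCP client list_tools was cancelled for (\S+)"),
    re.compile(r"Timeout while listing tools from (\S+)"),
)


def failing_servers(log_lines):
    """Return the sorted names of servers that appear in a failure message."""
    failed = set()
    for line in log_lines:
        for pattern in FAILURE_PATTERNS:
            match = pattern.search(line)
            if match:
                failed.add(match.group(1))
    return sorted(failed)


sample = [
    "MCP client list_tools was cancelled for google_docs",
    "Timeout while listing tools from context7",
    "MCP client list_tools was cancelled for exa",
]
print(failing_servers(sample))  # ['context7', 'exa', 'google_docs']
```

An empty result over a few discovery cycles would suggest the fix is holding.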

Extra Tips

  • Consider implementing retries with exponential backoff for handling temporary connection issues.
  • Review the MCP SDK and LiteLLM code for similar issues that may be caused by unhandled exceptions in TaskGroups.
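
The retry suggestion could look roughly like the following. This is a generic sketch and not part of LiteLLM or the MCP SDK: retry_with_backoff, flaky_list_tools, and the delay values are hypothetical, and the set of retried exception types is an assumption.

```python
import asyncio
import random


async def retry_with_backoff(operation, attempts=4, base_delay=0.01, max_delay=1.0):
    """Await operation(), retrying transient errors with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return await operation()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            if attempt == attempts - 1:
                break  # out of attempts, give up
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from many servers started together.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise last_error


calls = {"count": 0}


async def flaky_list_tools():
    """Simulated discovery call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return ["search"]


tools = asyncio.run(retry_with_backoff(flaky_list_tools))
print(tools, calls["count"])
```

Retries only help with transient failures; they would not mask the TaskGroup cancellation described above, which recurs on every attempt.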
