litellm - ✅(Solved) Fix [Bug]: content_block_start dropped during /v1/messages → GitHub Copilot streaming, triggering non-streaming fallback + output truncation [2 pull requests, 1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
BerriAI/litellm#24765Fetched 2026-04-08 01:49:17
View on GitHub
Comments
1
Participants
2
Timeline
5
Reactions
0
Author
Participants
Timeline (top)
labeled ×4commented ×1

Error Message

The failure is deterministic for large outputs (e.g., Write tool calls producing >~400 lines) and creates an unrecoverable retry loop: streaming error → non-streaming fallback → output truncation → retry → same streaming error. | Attempt | stop_reason | output_tokens | Error | [ERROR] Error streaming, falling back to non-streaming mode: Content block not found

Root Cause

The /v1/messagesgithub_copilot path translates requests through two format conversions:

Claude Code → Anthropic SSE → [LiteLLM translates to OpenAI] → GitHub Copilot → [OpenAI SSE response] → [LiteLLM translates back to Anthropic SSE] → Claude Code

During this double translation, a content_block_start event is dropped for large streaming responses. Claude Code maintains an internal array of content blocks indexed by content_block_start events. When a content_block_delta arrives for an index with no start entry, it throws:

RangeError("Content block not found")

Claude Code catches this and falls back to non-streaming (stream=false). The GitHub Copilot API enforces a max_non_streaming_output_tokens cap (lower than the streaming max_output_tokens) on the /chat/completions endpoint, truncating the Write tool's JSON mid-parameter.

PR fix notes

PR #69: Fix 5 reliability bugs in gateway routing

Description (problem / solution / changelog)

Summary

Five reliability bugs in `server/routes/gateway.ts` grounded in LiteLLM/Portkey issue patterns:

  1. Weighted random was mathematically broken. The old sort comparator `Math.random() * totalWeight - a.weight - ...` was not a valid weighted sampler. Replaced with weighted-without-replacement shuffle. QA'd empirically: 10k trials at 90/10 → 90.2%/9.8% first-pick (was basically random before). 50/30/20 → 49.8%/30.3%/19.9%.
  2. No target cooldown tracking. Failing targets retried every request. Added in-memory cooldown: 3 consecutive failures → 30s park. Cooling targets kept as last-resort tail so we never return zero candidates. Matches LiteLLM's core reliability pattern.
  3. Retry-After ignored on 429s. Parallels LiteLLM #16286. Added `parseRetryAfterMs()` for both delta-seconds and HTTP-date, capped at 60s. 429 feeds the cooldown map with the provider's window; other 4xx don't (client wrong, not target).
  4. `configCache.clear()` was global — any user's CRUD op wiped every user's cache. Replaced with `invalidateUserCache(userId)`.
  5. Budget footgun: `remaining = 2000` would start a doomed attempt with `timeoutMs = 1000`. Raised `MIN_ATTEMPT_HEADROOM_MS` to 3000.

Why

Research against LiteLLM issues: #19985 fallback retry cycle reset, #15955 timeout type bug, #23546 infinite retry loop, #16286 retry-after dropped, #24765 streaming fallback loop. Portkey #1023 unhandled promise rejections. Our code shared several failure modes.

Test plan

  • `gateway.ts` imports cleanly under tsx
  • Weighted shuffle distribution verified empirically over 10k trials
  • Doesn't touch `server/index.ts`
  • Manual: point gateway at two providers, kill one, verify cooldown kicks in after 3 failures
  • Manual: trigger 429 with Retry-After, verify respected

Open gaps (bigger refactors, NOT in this PR)

  • Streaming support — currently silently drops `stream: true` and always returns JSON. Matches LiteLLM #6532 / #24765 failure pattern. Important for Hetty/Julian.
  • Redis-backed cooldown for horizontal scale (in-memory breaks across Vercel instances)
  • Per-project budget caps

🤖 Generated with Claude Code

Changed files

  • server/routes/gateway.ts (modified, +148/-23)

PR #71: Add SSE streaming support to gateway (openai + openai-compatible)

Description (problem / solution / changelog)

Summary

Gateway silently dropped `stream:true` — this was the LiteLLM #6532 / #24765 failure pattern flagged as open in #69. Streaming is the default for most real LLM client usage; Julian/Hetty will immediately notice.

Changes

  • New optional `sendStream()` on `ProviderAdapter` interface
  • `server/providers/openai-stream.ts` — shared SSE proxy helper with usage extraction
  • openai + openai-compatible (togetherai, nebius, groq, fireworks, deepseek) wire up streaming via the shared helper
  • Forces `stream_options.include_usage` so the final chunk carries token counts for billing
  • Gateway dispatches `stream:true` requests to `executeStreamWithFallback`
  • Pre-commit failures (no bytes written) still fall back to the next target
  • Post-commit failures write an OpenAI-shaped `stream_error` chunk + `[DONE]` and close — fallback impossible after headers commit
  • SSE headers: `content-type: text/event-stream`, `cache-control: no-cache, no-transform`, `connection: keep-alive`, `x-accel-buffering: no`

Test plan

  • `npm run lint` — 0 errors
  • SDK tests — 45/45 passing
  • QA harness with mocked fetch, 3 scenarios:
    1. Happy path — 4 SSE chunks proxied, usage 12/5 parsed from final chunk, sink ended cleanly
    2. Pre-commit HTTP 500 — error body returned, sink never touched (caller can fall back)
    3. Mid-stream network error — committed sink gets `stream_error` + `[DONE]` event, then closes
  • Manual: real OpenAI key + curl with `stream:true`, verify SSE bytes arrive
  • Manual: kill primary provider mid-connection test (unplug wifi after first chunk) and verify error chunk surfaces in client

Open gaps (follow-ups, not in this PR)

  • Anthropic streaming — different SSE format, explicit 501 for now
  • Google streaming — different format, explicit 501 for now
  • Request body capture for streamed responses is `null` — would need buffered accumulation

🤖 Generated with Claude Code

Changed files

  • server/providers/openai-compatible.ts (modified, +19/-0)
  • server/providers/openai-stream.ts (added, +194/-0)
  • server/providers/openai.ts (modified, +19/-0)
  • server/providers/types.ts (modified, +22/-0)
  • server/routes/gateway.ts (modified, +226/-0)

Code Example

Claude CodeAnthropic SSE[LiteLLM translates to OpenAI]GitHub Copilot[OpenAI SSE response][LiteLLM translates back to Anthropic SSE]Claude Code

---

RangeError("Content block not found")

---

"supported_endpoints": ["/v1/messages", "/chat/completions"]

---

model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: github_copilot/claude-sonnet-4.6

general_settings:
  enable_anthropic_routes: true

---

[ERROR] Error streaming, falling back to non-streaming mode: Content block not found

---
RAW_BUFFERClick to expand / collapse

Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

When Claude Code sends streaming requests through LiteLLM proxy's /v1/messages endpoint routed to GitHub Copilot (github_copilot/claude-* models), the Anthropic SSE translation layer intermittently drops content_block_start events for large responses. This causes Claude Code to throw RangeError("Content block not found") and fall back to non-streaming mode, where a lower output token cap truncates large tool calls.

The failure is deterministic for large outputs (e.g., Write tool calls producing >~400 lines) and creates an unrecoverable retry loop: streaming error → non-streaming fallback → output truncation → retry → same streaming error.

Root cause

The /v1/messagesgithub_copilot path translates requests through two format conversions:

Claude Code → Anthropic SSE → [LiteLLM translates to OpenAI] → GitHub Copilot → [OpenAI SSE response] → [LiteLLM translates back to Anthropic SSE] → Claude Code

During this double translation, a content_block_start event is dropped for large streaming responses. Claude Code maintains an internal array of content blocks indexed by content_block_start events. When a content_block_delta arrives for an index with no start entry, it throws:

RangeError("Content block not found")

Claude Code catches this and falls back to non-streaming (stream=false). The GitHub Copilot API enforces a max_non_streaming_output_tokens cap (lower than the streaming max_output_tokens) on the /chat/completions endpoint, truncating the Write tool's JSON mid-parameter.

Downstream consequence

Three consecutive Write tool calls failed identically:

Attemptstop_reasonoutput_tokensError
1max_tokens16,000InputValidationError: required parameter content is missing
2max_tokens16,000same
3max_tokens16,000same

The non-streaming cap is confirmed by direct testing: same model, same max_tokens=20000 — non-streaming returns 16,000 (capped), streaming returns 20,000 (not capped).

Response format fingerprinting

The two code paths produce distinct response formats, confirming which path was used:

streamContent-TypeModel in responseHas total_tokens?
truetext/event-streamclaude-sonnet-4.6No
falseapplication/jsongithub_copilot/Claude Sonnet 4.6Yes

All three failed Write calls had the non-streaming fingerprint, confirming the fallback triggered.

Bypass discovery

GitHub Copilot's /models API shows Claude models support both endpoints:

"supported_endpoints": ["/v1/messages", "/chat/completions"]

When calling GitHub Copilot's native /v1/messages endpoint directly (bypassing the OpenAI translation), a non-streaming request with max_tokens=20000 returned 20,000+ tokens — the non-streaming cap does not apply on the native Anthropic endpoint. This suggests the cleanest fix may be to route Claude requests through GitHub Copilot's native /v1/messages endpoint, avoiding the SSE double-translation entirely.

Related issues

  • #21128 — AnthropicStreamWrapper hardcodes type: "text" at index 0, breaking extended thinking (same adapter subsystem, different trigger)
  • #24134 / #24721 — Anthropic adapter drops input_json_delta during content block transitions (same adapter, closed)
  • #13373 / #14315 — Claude Code "Streaming fallback triggered" through LiteLLM (same symptom, closed)
  • #24004 — Mid-stream fallback not supported for anthropic_messages route type (open)

Environment

  • Claude Code 2.1.85/2.1.86
  • LiteLLM 1.81.1
  • Provider: GitHub Copilot (github_copilot/claude-sonnet-4.6, github_copilot/claude-opus-4.6-1m)
  • OS: Windows 11

Steps to Reproduce

  1. Configure LiteLLM with Claude models routed through GitHub Copilot:
model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: github_copilot/claude-sonnet-4.6

general_settings:
  enable_anthropic_routes: true
  1. Use Claude Code (VS Code extension or CLI) with ANTHROPIC_BASE_URL=http://localhost:4000

  2. Ask Claude Code to write a file exceeding ~400 lines in a single Write tool call

  3. Observe in the Claude Code extension log:

[ERROR] Error streaming, falling back to non-streaming mode: Content block not found
  1. Observe "Write failed" — the Write tool's content parameter is missing due to output truncation on the non-streaming fallback path

Relevant log output

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.81.1

Twitter / LinkedIn details

No response

extent analysis

Fix Plan

To fix the issue, we need to route Claude requests through GitHub Copilot's native /v1/messages endpoint, avoiding the SSE double-translation entirely. Here are the steps:

  • Update the LiteLLM configuration to use the native GitHub Copilot endpoint:
model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: github_copilot/claude-sonnet-4.6
      endpoint: /v1/messages
  • Modify the Claude Code to use the updated endpoint:
import requests

# ...

def send_request(model, prompt):
    url = f"{ANTHROPIC_BASE_URL}/v1/messages"
    headers = {"Content-Type": "application/json"}
    data = {"model": model, "prompt": prompt}
    response = requests.post(url, headers=headers, json=data)
    # ...
  • Update the Anthropic adapter to handle the native endpoint:
class AnthropicAdapter:
    # ...

    def send_request(self, model, prompt):
        url = f"{self.base_url}/v1/messages"
        # ...

Verification

To verify the fix, follow these steps:

  • Restart the LiteLLM proxy and Claude Code
  • Send a large Write tool request (>400 lines) through Claude Code
  • Check the Claude Code log for any errors related to content block not found
  • Verify that the Write tool call is successful and the output is not truncated

Extra Tips

  • Make sure to update the LiteLLM configuration and Claude Code to use the native GitHub Copilot endpoint
  • Test the fix with different models and prompts to ensure it works consistently
  • Monitor the Claude Code log for any errors related to content block not found and adjust the fix as needed

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

litellm - ✅(Solved) Fix [Bug]: content_block_start dropped during /v1/messages → GitHub Copilot streaming, triggering non-streaming fallback + output truncation [2 pull requests, 1 comments, 2 participants]