litellm - ✅(Solved) Fix [Bug]: content_block_start dropped during /v1/messages → GitHub Copilot streaming, triggering non-streaming fallback + output truncation [2 pull requests, 1 comments, 2 participants]

PBDMSFT · 2026-03-29T23:14:47Z

[litellm] PR 69: Fix 5 reliability bugs in gateway routing - Repository: katrinalaszlo/observe - Author: katrinalaszlo - State: closed | merged: True - Link: h… # PR #69: Fix 5 reliability bugs in gateway routing - Repository: katrinalaszlo/observe - Author: katrinalaszlo - State: closed | merged: True - Link: https://github.com/katrinalaszlo/observe/pull/69 ## Description (problem / solution / changelog) ## Summary Five reliability bugs in \`server/routes/gateway.ts\` grounded in LiteLLM/Portkey issue patterns: 1. **Weighted random was mathematically broken.** The old sort comparator \`Math.random() * totalWeight - a.weight - ...\` was not a valid weighted sampler. Replaced with weighted-without-replacement shuffle. QA'd empirically: 10k trials at 90/10 → 90.2%/9.8% first-pick (was basically random before). 50/30/20 → 49.8%/30.3%/19.9%. 2. **No target cooldown tracking.** Failing targets retried every request. Added in-memory cooldown: 3 consecutive failures → 30s park. Cooling targets kept as last-resort tail so we never return zero candidates. Matches LiteLLM's core reliability pattern. 3. **Retry-After ignored on 429s.** Parallels LiteLLM [#16286](https://github.com/BerriAI/litellm/issues/16286). Added \`parseRetryAfterMs()\` for both delta-seconds and HTTP-date, capped at 60s. 429 feeds the cooldown map with the provider's window; other 4xx don't (client wrong, not target). 4. **\`configCache.clear()\` was global** — any user's CRUD op wiped every user's cache. Replaced with \`invalidateUserCache(userId)\`. 5. **Budget footgun:** \`remaining = 2000\` would start a doomed attempt with \`timeoutMs = 1000\`. Raised \`MIN_ATTEMPT_HEADROOM_MS\` to 3000. ## Why Research against LiteLLM issues: [#19985](https://github.com/BerriAI/litellm/issues/19985) fallback retry cycle reset, [#15955](https://github.com/BerriAI/litellm/issues/15955) timeout type bug, [#23546](https://github.com/BerriAI/litellm/issues/23546) infinite retry loop, [#16286](https://github.com/BerriAI/litellm/issues/16286) retry-after dropped, [#24765](https://github.com/BerriAI/litellm/issues/24765) streaming fallback loop. Portkey [#1023](https://github.com/Portkey-AI/gateway/issues/1023) unhandled promise rejections. Our code shared several failure modes. ## Test plan - [x] \`gateway.ts\` imports cleanly under tsx - [x] Weighted shuffle distribution verified empirically over 10k trials - [x] Doesn't touch \`server/index.ts\` - [ ] Manual: point gateway at two providers, kill one, verify cooldown kicks in after 3 failures - [ ] Manual: trigger 429 with Retry-After, verify respected ## Open gaps (bigger refactors, NOT in this PR) - **Streaming support** — currently silently drops \`stream: true\` and always returns JSON. Matches LiteLLM [#6532](https://github.com/BerriAI/litellm/issues/6532) / [#24765](https://github.com/BerriAI/litellm/issues/24765) failure pattern. Important for Hetty/Julian. - **Redis-backed cooldown** for horizontal scale (in-memory breaks across Vercel instances) - **Per-project budget caps** 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Changed files - `server/routes/gateway.ts` (modified, +148/-23) --- # PR #71: Add SSE streaming support to gateway (openai + openai-compatible) - Repository: katrinalaszlo/observe - Author: katrinalaszlo - State: closed | merged: True - Link: https://github.com/katrinalaszlo/observe/pull/71 ## Description (problem / solution / changelog) ## Summary Gateway silently dropped \`stream:true\` — this was the [LiteLLM #6532](https://github.com/BerriAI/litellm/issues/6532) / [#24765](https://github.com/BerriAI/litellm/issues/24765) failure pattern flagged as open in #69. Streaming is the default for most real LLM client usage; Julian/Hetty will immediately notice. ## Changes - New optional \`sendStream()\` on \`ProviderAdapter\` interface - \`server/providers/openai-stream.ts\` — shared SSE proxy helper with usage extraction - openai + openai-compatible (togetherai, nebius, groq, fireworks, deepseek) wire up streaming via the shared helper - Forces \`stream_options.include_usage\` so the final chunk carries token counts for billing - Gateway dispatches \`stream:true\` requests to \`executeStreamWithFallback\` - Pre-commit failures (no bytes written) still fall back to the next target - Post-commit failures write an OpenAI-shaped \`stream_error\` chunk + \`[DONE]\` and close — fallback impossible after headers commit - SSE headers: \`content-type: text/event-stream\`, \`cache-control: no-cache, no-transform\`, \`connection: keep-alive\`, \`x-accel-buffering: no\` ## Test plan - [x] \`npm run lint\` — 0 errors - [x] SDK tests — 45/45 passing - [x] QA harness with mocked fetch, 3 scenarios: 1. **Happy path** — 4 SSE chunks proxied, usage 12/5 parsed from final chunk, sink ended cleanly 2. **Pre-commit HTTP 500** — error body returned, sink never touched (caller can fall back) 3. **Mid-stream network error** — committed sink gets \`stream_error\` + \`[DONE]\` event, then closes

litellm2026-03-29 23:14:47

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

BerriAI/litellm#24765•Fetched 2026-04-08 01:49:17

View on GitHub

Comments

Participants

Timeline

Reactions

Author

PBDMSFT

Participants

adelnobel

PBDMSFT

Timeline (top)

labeled ×4commented ×1

Error Message

The failure is deterministic for large outputs (e.g., Write tool calls producing >~400 lines) and creates an unrecoverable retry loop: streaming error → non-streaming fallback → output truncation → retry → same streaming error. | Attempt | stop_reason | output_tokens | Error | [ERROR] Error streaming, falling back to non-streaming mode: Content block not found

Root Cause

The /v1/messages → github_copilot path translates requests through two format conversions:

Claude Code → Anthropic SSE → [LiteLLM translates to OpenAI] → GitHub Copilot → [OpenAI SSE response] → [LiteLLM translates back to Anthropic SSE] → Claude Code

During this double translation, a content_block_start event is dropped for large streaming responses. Claude Code maintains an internal array of content blocks indexed by content_block_start events. When a content_block_delta arrives for an index with no start entry, it throws:

RangeError("Content block not found")

Claude Code catches this and falls back to non-streaming (stream=false). The GitHub Copilot API enforces a max_non_streaming_output_tokens cap (lower than the streaming max_output_tokens) on the /chat/completions endpoint, truncating the Write tool's JSON mid-parameter.

Code Example

Claude Code → Anthropic SSE → [LiteLLM translates to OpenAI] → GitHub Copilot → [OpenAI SSE response] → [LiteLLM translates back to Anthropic SSE] → Claude Code

---

RangeError("Content block not found")

---

"supported_endpoints": ["/v1/messages", "/chat/completions"]

---

model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: github_copilot/claude-sonnet-4.6

general_settings:
  enable_anthropic_routes: true

---

[ERROR] Error streaming, falling back to non-streaming mode: Content block not found

---

RAW_BUFFERClick to expand / collapse

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

When Claude Code sends streaming requests through LiteLLM proxy's /v1/messages endpoint routed to GitHub Copilot (github_copilot/claude-* models), the Anthropic SSE translation layer intermittently drops content_block_start events for large responses. This causes Claude Code to throw RangeError("Content block not found") and fall back to non-streaming mode, where a lower output token cap truncates large tool calls.

Root cause

The /v1/messages → github_copilot path translates requests through two format conversions:

Claude Code → Anthropic SSE → [LiteLLM translates to OpenAI] → GitHub Copilot → [OpenAI SSE response] → [LiteLLM translates back to Anthropic SSE] → Claude Code

RangeError("Content block not found")

Downstream consequence

Three consecutive Write tool calls failed identically:

Attempt	stop_reason	output_tokens	Error
1	max_tokens	16,000	`InputValidationError: required parameter content is missing`
2	max_tokens	16,000	same
3	max_tokens	16,000	same

The non-streaming cap is confirmed by direct testing: same model, same max_tokens=20000 — non-streaming returns 16,000 (capped), streaming returns 20,000 (not capped).

Response format fingerprinting

The two code paths produce distinct response formats, confirming which path was used:

stream	Content-Type	Model in response	Has total_tokens?
true	text/event-stream	`claude-sonnet-4.6`	No
false	application/json	`github_copilot/Claude Sonnet 4.6`	Yes

All three failed Write calls had the non-streaming fingerprint, confirming the fallback triggered.

Bypass discovery

GitHub Copilot's /models API shows Claude models support both endpoints:

"supported_endpoints": ["/v1/messages", "/chat/completions"]

When calling GitHub Copilot's native /v1/messages endpoint directly (bypassing the OpenAI translation), a non-streaming request with max_tokens=20000 returned 20,000+ tokens — the non-streaming cap does not apply on the native Anthropic endpoint. This suggests the cleanest fix may be to route Claude requests through GitHub Copilot's native /v1/messages endpoint, avoiding the SSE double-translation entirely.

Related issues

#21128 — AnthropicStreamWrapper hardcodes type: "text" at index 0, breaking extended thinking (same adapter subsystem, different trigger)
#24134 / #24721 — Anthropic adapter drops input_json_delta during content block transitions (same adapter, closed)
#13373 / #14315 — Claude Code "Streaming fallback triggered" through LiteLLM (same symptom, closed)
#24004 — Mid-stream fallback not supported for anthropic_messages route type (open)

Environment

Claude Code 2.1.85/2.1.86
LiteLLM 1.81.1
Provider: GitHub Copilot (github_copilot/claude-sonnet-4.6, github_copilot/claude-opus-4.6-1m)
OS: Windows 11

Steps to Reproduce

Configure LiteLLM with Claude models routed through GitHub Copilot:

model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: github_copilot/claude-sonnet-4.6

general_settings:
  enable_anthropic_routes: true

Use Claude Code (VS Code extension or CLI) with ANTHROPIC_BASE_URL=http://localhost:4000
Ask Claude Code to write a file exceeding ~400 lines in a single Write tool call
Observe in the Claude Code extension log:

[ERROR] Error streaming, falling back to non-streaming mode: Content block not found

Observe "Write failed" — the Write tool's content parameter is missing due to output truncation on the non-streaming fallback path

Relevant log output

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.81.1

Twitter / LinkedIn details

No response

extent analysis

Fix Plan

To fix the issue, we need to route Claude requests through GitHub Copilot's native /v1/messages endpoint, avoiding the SSE double-translation entirely. Here are the steps:

Update the LiteLLM configuration to use the native GitHub Copilot endpoint:

model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: github_copilot/claude-sonnet-4.6
      endpoint: /v1/messages

Modify the Claude Code to use the updated endpoint:

import requests

# ...

def send_request(model, prompt):
    url = f"{ANTHROPIC_BASE_URL}/v1/messages"
    headers = {"Content-Type": "application/json"}
    data = {"model": model, "prompt": prompt}
    response = requests.post(url, headers=headers, json=data)
    # ...

Update the Anthropic adapter to handle the native endpoint:

class AnthropicAdapter:
    # ...

    def send_request(self, model, prompt):
        url = f"{self.base_url}/v1/messages"
        # ...

Verification

To verify the fix, follow these steps:

Restart the LiteLLM proxy and Claude Code
Send a large Write tool request (>400 lines) through Claude Code
Check the Claude Code log for any errors related to content block not found
Verify that the Write tool call is successful and the output is not truncated

Extra Tips

Make sure to update the LiteLLM configuration and Claude Code to use the native GitHub Copilot endpoint
Test the fix with different models and prompts to ensure it works consistently
Monitor the Claude Code log for any errors related to content block not found and adjust the fix as needed

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #retriever error #indexing error #output truncation

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.