litellm - 💡(How to fix) Fix [Bug]: Streaming responses are buffered when using LiteLLM proxy with SGLang

litellm · 2026-04-20 11:47:54

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

Streaming still gets buffered through LiteLLM proxy

I'm running SGLang behind the LiteLLM proxy. Direct requests stream fine: tokens come in one by one. Through LiteLLM, though, everything arrives at once after the full response is done, which makes it useless for chat apps.

Same model, same prompt, just different endpoints:

Direct SGLang: TTFT=272ms, inter-token gap avg=24ms. Streaming works.
Through LiteLLM: TTFT=1740ms, inter-token gap avg=0ms. All 49 tokens arrive in the same millisecond.

Here's a minimal repro:

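import asyncio
import json
import time

import aiohttp


async def test(base_url, api_key):
    """Print time to first token and average inter-token gap for one streamed completion."""
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {
        "model": "GLM-5.1-FP8",
        "messages": [{"role": "user", "content": "Explain recursion"}],
        "stream": True,
    }
    times = []  # arrival time of each content-bearing chunk, in ms since the request started
    start = time.perf_counter()
    async with aiohttp.ClientSession() as s:
        async with s.post(f"{base_url}/chat/completions", json=payload, headers=headers) as resp:
            async for line in resp.content:  # yields the SSE body one line at a time
                line = line.decode().strip()
                if not line.startswith("data: ") or line[6:] == "[DONE]":
                    continue
                choices = json.loads(line[6:]).get("choices") or []
                if not choices:
                    continue  # skip chunks that carry no choices
                delta = choices[0].get("delta") or {}
                c = delta.get("content") or delta.get("reasoning_content")
                if c:
                    times.append((time.perf_counter() - start) * 1000)
    if not times:
        print("no content tokens received")
        return
    gaps = [times[i] - times[i - 1] for i in range(1, len(times))]
    gap_avg = sum(gaps) / len(gaps) if gaps else 0.0  # a single token has no gaps
    print(f"TTFT={times[0]:.0f}ms  gap_avg={gap_avg:.0f}ms")


asyncio.run(test("<your-litellm-endpoint>", "<your-key>"))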

LiteLLM 1.81.12
sglang / GLM-5.1-FP8
Python 3.12
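
To compare the direct and proxied runs on the same terms, the arrival times collected by the repro can be summarized with a small helper. This is a minimal sketch, not part of the original report: `summarize_stream` and its `burst_gap_ms` threshold are hypothetical names, and the two input lists are synthetic values shaped like the reported numbers (TTFT 272ms with a 24ms gap, and 49 tokens all at 1740ms), not captured data.

```python
from statistics import mean


def summarize_stream(arrivals_ms, burst_gap_ms=1.0):
    """Summarize token arrival times given in ms since the request started."""
    if not arrivals_ms:
        return {"tokens": 0, "ttft_ms": None, "gap_avg_ms": None, "looks_buffered": False}
    # Gap between each token and the one before it.
    gaps = [b - a for a, b in zip(arrivals_ms, arrivals_ms[1:])]
    gap_avg = mean(gaps) if gaps else None
    return {
        "tokens": len(arrivals_ms),
        "ttft_ms": arrivals_ms[0],
        "gap_avg_ms": gap_avg,
        # Assumed heuristic: an average gap below the threshold means the
        # tokens arrived as one burst rather than as a stream.
        "looks_buffered": gap_avg is not None and gap_avg < burst_gap_ms,
    }


# Synthetic inputs mirroring the reported measurements.
direct = summarize_stream([272 + 24 * i for i in range(49)])
proxied = summarize_stream([1740.0] * 49)
print("direct :", direct)
print("proxied:", proxied)
```

With these inputs the direct run is not flagged and the proxied run is, which matches the behaviour described in the report. The 1ms threshold is arbitrary and may need tuning for slower or faster backends.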

Steps to Reproduce

Relevant log output

What part of LiteLLM is this about?

No response

What LiteLLM version are you on?

1.81.12

Twitter / LinkedIn details

No response

extent analysis

TL;DR

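Through the LiteLLM proxy the whole response arrives in one burst (TTFT=1740ms, 0ms average inter-token gap), while the same request sent directly to SGLang streams token by token (TTFT=272ms, 24ms average gap). The root cause has not been identified; the suggested mitigation is to review the proxy's streaming configuration and to instrument the client to find where the buffering happens.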

Guidance

Verify that the LiteLLM proxy is configured to support streaming responses, and check its documentation for any settings or flags that affect streaming for this backend.
Check whether the aiohttp client has options that influence how the response body is read, such as the read buffer size or the iteration mode. Note that the same client script shows token-by-token arrival against SGLang directly, so the client is an unlikely cause.
Add logging to the client-side code to inspect the response headers and the raw body as it arrives; this may show whether the response is delivered as a stream at all.
Check the LiteLLM proxy logs for errors or warnings related to streaming or buffering.
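
As one way to carry out the header inspection suggested above, the response headers can be checked for generic HTTP signs that a body was assembled in full before being sent. This is a sketch under assumptions: `buffering_hints` is a hypothetical helper, the checks are general HTTP conventions rather than anything LiteLLM or SGLang documents, and a clean result does not prove the stream is unbuffered.

```python
def buffering_hints(headers):
    """Return hints from the response headers of a request sent with stream=True."""
    # Header names are case-insensitive, so normalize them first.
    h = {k.lower(): v for k, v in headers.items()}
    hints = []
    ctype = h.get("content-type", "")
    if "text/event-stream" not in ctype.lower():
        hints.append(f"Content-Type is {ctype!r}, not text/event-stream")
    if "content-length" in h:
        # A body whose length is known up front was usually complete before sending.
        hints.append("Content-Length is set, so the body size was known before sending")
    if h.get("content-encoding", "identity").lower() != "identity":
        hints.append("Content-Encoding is set; a compressing hop may hold data back")
    return hints


# Example inputs: a typical streamed response and a fully materialized one.
print(buffering_hints({"Content-Type": "text/event-stream", "Transfer-Encoding": "chunked"}))
print(buffering_hints({"Content-Type": "application/json", "Content-Length": "1234"}))
```

In the repro script this could be called as `buffering_hints(resp.headers)` right after the `s.post(...)` context is entered, once against the SGLang endpoint and once against the proxy, so the two header sets can be compared.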

Example

No fix is shown here, because the problem appears to lie in configuration or in the interaction between the client and the LiteLLM proxy rather than in a specific code snippet.

Notes

The root cause of the issue is unclear, but it appears to be related to how the LiteLLM proxy handles streaming responses. Further investigation and debugging are needed to determine the exact cause and find a solution.
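
One way to narrow down where the buffering happens is to look at raw network reads instead of parsed lines: if many SSE events arrive inside a single read, they were grouped before reaching the client. The sketch below assumes a hypothetical helper `chunk_arrivals` that accepts any async iterator of bytes; `_fake_burst` is a stand-in for a buffered response so the example runs without a server.

```python
import asyncio
import time


async def chunk_arrivals(chunks):
    """Return (elapsed_ms, n_bytes, n_data_lines) for each raw chunk of an async byte iterator."""
    start = time.perf_counter()
    rows = []
    async for chunk in chunks:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Rough count of SSE events in this read; it also matches the literal
        # text "data: " inside a payload, so treat it as an estimate.
        rows.append((elapsed_ms, len(chunk), chunk.count(b"data: ")))
    return rows


async def _fake_burst():
    # Stand-in for a buffered stream: three SSE events delivered in one read.
    yield b'data: {"a":1}\n\ndata: {"a":2}\n\ndata: [DONE]\n\n'


rows = asyncio.run(chunk_arrivals(_fake_burst()))
for elapsed_ms, n_bytes, n_events in rows:
    print(f"{elapsed_ms:.1f}ms  {n_bytes} bytes  ~{n_events} events")
```

With aiohttp, the same helper could be fed `resp.content.iter_any()`, which yields data in whatever pieces it was received. A streamed response would be expected to show many rows with about one event each; a single row holding all events would point to buffering upstream of the client.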

Recommendation

Apply a workaround: adjust the LiteLLM proxy configuration or the client-side code so that chunks are delivered as they are produced. The symptoms point to buffering somewhere on the proxy path, but no specific setting has been confirmed as the fix.
