litellm [Bug]: Streaming SSE output differs from upstream for /v1/messages and /v1/responses



Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

In our environment, the upstream backend (http://upstream/v1) appears to return well-formed streaming SSE events for:

  1. Anthropic-compatible POST /v1/messages with stream: true
  2. OpenAI Responses POST /v1/responses with stream: true

However, when sending the same requests to LiteLLM directly (not via our gateway), the streaming SSE output differs in ways that some strict client SDKs cannot parse.

Environment details:

  • LiteLLM image: ghcr.io/berriai/litellm:v1.83.10-stable.patch.1
  • Observed response header on inference endpoints: x-litellm-version: 1.83.10
  • LiteLLM base URL (inside compose network): http://litellm:4000
  • Upstream backend: http://upstream/v1 (LMStudio v0.4.12)
  • Model under test:
    • Upstream model id: nvidia/nemotron-3-nano-4b
    • LiteLLM model group: nemotron-3-nano-4b (configured to route to the upstream model)

LiteLLM runtime config (as deployed):

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  store_model_in_db: true
litellm_settings:
  model_list: []
  request_timeout: 6000
router_settings:
  routing_strategy: simple-shuffle
  model_group_alias: {}
  fallbacks: []

Observed differences:

  1. POST /v1/messages streaming
  • Upstream sends content_block_start(index=0) before any content_block_delta(index=0).
  • LiteLLM sends content_block_delta events with index: -1, and in this repro we did not observe a preceding content_block_start.
  2. POST /v1/responses streaming
  • Upstream includes lifecycle events like response.created, and output structure events like response.output_item.added and response.content_part.added before response.output_text.delta.
  • LiteLLM begins the stream with response.output_text.delta immediately, and uses item_id values of the form resp_....
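The `/v1/messages` ordering difference above can be checked mechanically. The following is a minimal sketch, not part of LiteLLM or any SDK: `check_messages_stream` is a hypothetical helper, and the two sample sequences are modeled on the trimmed logs in this report rather than on complete captures.

```python
def check_messages_stream(events):
    """Return ordering problems found in a list of (event_name, data) pairs.

    Hypothetical helper for illustration; it only encodes the two
    expectations described in this report.
    """
    problems = []
    open_blocks = set()  # indices announced by content_block_start
    for name, data in events:
        if name == "content_block_start":
            open_blocks.add(data["index"])
        elif name == "content_block_delta":
            idx = data.get("index")
            if not isinstance(idx, int) or idx < 0:
                problems.append(f"delta has invalid index {idx!r}")
            elif idx not in open_blocks:
                problems.append(f"delta for index {idx} before content_block_start")
        elif name == "content_block_stop":
            open_blocks.discard(data.get("index"))
    return problems


# Sequences modeled on the trimmed logs in this report.
upstream = [
    ("message_start", {"type": "message_start"}),
    ("content_block_start", {"type": "content_block_start", "index": 0,
                             "content_block": {"type": "text", "text": ""}}),
    ("content_block_delta", {"type": "content_block_delta", "index": 0,
                             "delta": {"type": "text_delta", "text": "Hello"}}),
]
litellm = [
    ("message_start", {"type": "message_start"}),
    ("content_block_delta", {"type": "content_block_delta", "index": -1,
                             "delta": {"type": "text_delta", "text": "\n\nHel"}}),
]

print(check_messages_stream(upstream))  # []
print(check_messages_stream(litellm))   # ['delta has invalid index -1']
```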

Client impact we observed in practice (examples):

  • Anthropic SDKs may raise: text part -1 not found
  • OpenAI Responses aggregators may raise: text part resp_... not found
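To illustrate why a strict client can fail this way, here is a toy accumulator that only accepts a delta for a part that was announced first. It is an assumption about the general failure mode, not the actual code of the Anthropic or OpenAI SDKs; the class and exception names are invented for this sketch.

```python
class TextPartNotFound(Exception):
    """Raised when a delta refers to a part that was never announced."""


class StrictTextAccumulator:
    """Toy accumulator (hypothetical): deltas must follow a matching start."""

    def __init__(self):
        self.parts = {}

    def start(self, key):
        # Called for content_block_start / response.content_part.added.
        self.parts[key] = ""

    def delta(self, key, text):
        if key not in self.parts:
            raise TextPartNotFound(f"text part {key} not found")
        self.parts[key] += text


# Upstream ordering: the part is announced, then filled.
acc = StrictTextAccumulator()
acc.start(0)
acc.delta(0, "Hello")

# Ordering seen from LiteLLM in this report: a delta with index -1 and no start.
bad = StrictTextAccumulator()
try:
    bad.delta(-1, "\n\nHel")
except TextPartNotFound as exc:
    print(exc)  # text part -1 not found
```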

Steps to Reproduce

Notes:

  • LiteLLM requests require Authorization: Bearer <LITELLM_MASTER_KEY>.
  • The commands below are simplified to focus on streaming behavior.

Upstream (direct):

curl -sS -N http://upstream/v1/messages \
  -H 'Content-Type: application/json' \
  -d '{"model":"nvidia/nemotron-3-nano-4b","messages":[{"role":"user","content":"hi"}],"max_tokens":32,"stream":true}'

curl -sS -N http://upstream/v1/responses \
  -H 'Content-Type: application/json' \
  -d '{"model":"nvidia/nemotron-3-nano-4b","input":"hi","max_output_tokens":32,"stream":true}'

LiteLLM (direct, from inside the compose network):

docker compose exec -T gateway python - <<'PY'
import os, httpx
BASE='http://litellm:4000'
MASTER=os.environ['LITELLM_MASTER_KEY']
headers={'Authorization': f'Bearer {MASTER}', 'Content-Type':'application/json'}

with httpx.stream('POST', f'{BASE}/v1/messages', headers=headers, json={
  "model":"nemotron-3-nano-4b",
  "messages":[{"role":"user","content":"hi"}],
  "max_tokens":32,
  "stream":True
}, timeout=20) as s:
  for line in s.iter_lines():
    print(line)

with httpx.stream('POST', f'{BASE}/v1/responses', headers=headers, json={
  "model":"nemotron-3-nano-4b",
  "input":"hi",
  "max_output_tokens":32,
  "stream":True
}, timeout=20) as s:
  for line in s.iter_lines():
    print(line)
PY
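To compare the two streams event by event rather than line by line, the printed lines can be grouped into events. This is a simplified sketch of SSE framing (it ignores `id:` and `retry:` fields and trims whitespace loosely); `parse_sse` is a hypothetical helper, and the sample lines are modeled on the LiteLLM output shown in this report.

```python
import json


def parse_sse(lines):
    """Group raw SSE lines into (event_name, payload) tuples.

    event_name is None when the stream omits the `event:` field.
    payload is a dict when the data is JSON, otherwise the raw string.
    """
    events, name, data = [], None, []
    for line in list(lines) + [""]:  # trailing "" flushes the last event
        if line == "":
            if data:
                raw = "\n".join(data)
                try:
                    payload = json.loads(raw)
                except json.JSONDecodeError:
                    payload = raw  # e.g. the "[DONE]" sentinel
                events.append((name, payload))
            name, data = None, []
        elif line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
    return events


sample = [
    'event: message_start',
    'data: {"type": "message_start"}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": -1, "delta": {"type": "text_delta", "text": "Hel"}}',
    '',
    'data: [DONE]',
]

for name, payload in parse_sse(sample):
    print(name, payload["type"] if isinstance(payload, dict) else payload)
```

In the repro script, `s.iter_lines()` could be passed to `parse_sse` in place of `sample`.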

Relevant log output

Upstream: `POST /v1/messages` (`stream=true`) (trimmed)

event: message_start
data: {"type":"message_start",...}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
...


LiteLLM: `POST /v1/messages` (`stream=true`) (trimmed)

event: message_start
data: {"type": "message_start", ...}

event: content_block_delta
data: {"type": "content_block_delta", "index": -1, "delta": {"type": "text_delta", "text": "\n\nHel"}}
...


Upstream: `POST /v1/responses` (`stream=true`) (trimmed)

event: response.created
data: {"type":"response.created",...}

event: response.output_item.added
data: {"type":"response.output_item.added",...}

event: response.content_part.added
data: {"type":"response.content_part.added",...}
...


LiteLLM: `POST /v1/responses` (`stream=true`) (trimmed)

data: {"type":"response.output_text.delta","item_id":"resp_...","output_index":0,"content_index":0,"delta":"\nHell","model":"nemotron-3-nano-4b"}
...

data: {"type":"response.completed",...}
data: [DONE]
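The `/v1/responses` difference in the logs above comes down to which event types arrive before the first `response.output_text.delta`. A minimal sketch of that check follows; the helper name and the list of expected events are assumptions drawn only from the upstream log in this report, not from a formal specification.

```python
# Event types the upstream log shows before the first text delta.
REQUIRED_BEFORE_DELTA = (
    "response.created",
    "response.output_item.added",
    "response.content_part.added",
)


def missing_before_first_delta(event_types):
    """Return the expected event types not seen before the first text delta."""
    seen = []
    for event_type in event_types:
        if event_type == "response.output_text.delta":
            return [r for r in REQUIRED_BEFORE_DELTA if r not in seen]
        seen.append(event_type)
    return []  # no text delta in the stream, nothing to check


# Type sequences modeled on the trimmed logs above.
upstream_types = [
    "response.created",
    "response.output_item.added",
    "response.content_part.added",
    "response.output_text.delta",
]
litellm_types = ["response.output_text.delta", "response.completed"]

print(missing_before_first_delta(upstream_types))  # []
print(missing_before_first_delta(litellm_types))
# ['response.created', 'response.output_item.added', 'response.content_part.added']
```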

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.10-stable.patch.1

Twitter / LinkedIn details

No response
