litellm - [Bug]: Streaming SSE output differs from upstream for /v1/messages and /v1/responses


Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

In our environment, the upstream backend (http://upstream/v1) appears to return well-formed streaming SSE events for:

- Anthropic-compatible POST /v1/messages with stream: true
- OpenAI Responses POST /v1/responses with stream: true

However, when the same requests are sent to LiteLLM directly (not via our gateway), the streaming SSE output differs in ways that some strict client SDKs cannot parse.

Environment details:

- LiteLLM image: ghcr.io/berriai/litellm:v1.83.10-stable.patch.1
- Observed response header on inference endpoints: x-litellm-version: 1.83.10
- LiteLLM base URL (inside compose network): http://litellm:4000
- Upstream backend: http://upstream/v1 (LMStudio v0.4.12)
- Model under test:
  - Upstream model id: nvidia/nemotron-3-nano-4b
  - LiteLLM model group: nemotron-3-nano-4b (configured to route to the upstream model)

LiteLLM runtime config (as deployed):

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  store_model_in_db: true
litellm_settings:
  model_list: []
  request_timeout: 6000
router_settings:
  routing_strategy: simple-shuffle
  model_group_alias: {}
  fallbacks: []

Observed differences:

POST /v1/messages streaming

- Upstream sends content_block_start(index=0) before any content_block_delta(index=0).
- LiteLLM sends content_block_delta events with index: -1, and in this repro we did not observe a preceding content_block_start.
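
The ordering difference above can be checked mechanically. The sketch below is a hypothetical helper (`check_messages_stream` is not part of LiteLLM or any SDK) that takes already-parsed event payloads and reports any content_block_delta whose index is negative or was not opened by a preceding content_block_start. It relies only on the fields visible in the logs of this report; the sample events are abbreviated copies of those logs.

```python
from typing import Iterable


def check_messages_stream(events: Iterable[dict]) -> list:
    """Return ordering problems found in a /v1/messages event sequence."""
    open_blocks = set()  # indexes announced by content_block_start
    problems = []
    for pos, event in enumerate(events):
        kind = event.get("type")
        index = event.get("index")
        if kind == "content_block_start":
            open_blocks.add(index)
        elif kind == "content_block_delta":
            if not isinstance(index, int) or index < 0:
                problems.append(f"event {pos}: invalid index {index!r}")
            elif index not in open_blocks:
                problems.append(f"event {pos}: delta for unopened block {index}")
        elif kind == "content_block_stop":
            open_blocks.discard(index)
    return problems


# Abbreviated from the upstream log in this report.
upstream_events = [
    {"type": "message_start"},
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Hello"}},
]

# Abbreviated from the LiteLLM log in this report.
litellm_events = [
    {"type": "message_start"},
    {"type": "content_block_delta", "index": -1,
     "delta": {"type": "text_delta", "text": "\n\nHel"}},
]

print(check_messages_stream(upstream_events))
print(check_messages_stream(litellm_events))
```

A check like this could run in a regression test against captured streams, so that a change in event ordering is caught before a client SDK trips over it.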

POST /v1/responses streaming

- Upstream includes lifecycle events like response.created, and output structure events like response.output_item.added and response.content_part.added before response.output_text.delta.
- LiteLLM begins the stream with response.output_text.delta immediately, and uses item_id values of the form resp_....
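
A similar check can be sketched for the Responses stream. The helper below is hypothetical (`check_responses_stream` is an illustrative name) and makes an assumption the trimmed logs do not confirm: that response.output_item.added carries the item under `item.id`, and that response.content_part.added carries `item_id` and `content_index`, as in the public OpenAI Responses streaming schema. The ids `msg_example` and `resp_example` are made-up placeholders, not values from the real streams.

```python
from typing import Iterable


def check_responses_stream(events: Iterable[dict]) -> list:
    """Return problems where a text delta arrives before its structure events."""
    created = False
    known_items = set()   # ids announced by response.output_item.added
    known_parts = set()   # (item_id, content_index) pairs announced so far
    problems = []
    for pos, event in enumerate(events):
        kind = event.get("type")
        if kind == "response.created":
            created = True
        elif kind == "response.output_item.added":
            # Assumed field layout: {"item": {"id": ...}}
            known_items.add(event.get("item", {}).get("id"))
        elif kind == "response.content_part.added":
            known_parts.add((event.get("item_id"), event.get("content_index")))
        elif kind == "response.output_text.delta":
            item_id = event.get("item_id")
            part = event.get("content_index")
            if not created:
                problems.append(f"event {pos}: delta before response.created")
            if item_id not in known_items:
                problems.append(f"event {pos}: item {item_id!r} was never added")
            if (item_id, part) not in known_parts:
                problems.append(
                    f"event {pos}: content part {part} of {item_id!r} was never added"
                )
    return problems


# Shape of the upstream stream, with placeholder ids.
upstream_events = [
    {"type": "response.created"},
    {"type": "response.output_item.added", "output_index": 0,
     "item": {"id": "msg_example"}},
    {"type": "response.content_part.added", "item_id": "msg_example",
     "output_index": 0, "content_index": 0},
    {"type": "response.output_text.delta", "item_id": "msg_example",
     "output_index": 0, "content_index": 0, "delta": "Hello"},
]

# Shape of the LiteLLM stream, with a placeholder id.
litellm_events = [
    {"type": "response.output_text.delta", "item_id": "resp_example",
     "output_index": 0, "content_index": 0, "delta": "\nHell"},
    {"type": "response.completed"},
]

print(check_responses_stream(upstream_events))
for problem in check_responses_stream(litellm_events):
    print(problem)
```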

Client impact we observed in practice (examples):

- Anthropic SDKs may raise: text part -1 not found
- OpenAI Responses aggregators may raise: text part resp_... not found
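
To illustrate why a strict client fails on such a stream, here is a toy accumulator. It is not the code of any real SDK; `TextAccumulator` is a hypothetical class that only mimics the general pattern of registering a text part on content_block_start and appending to it on content_block_delta, which is enough to reproduce an error of the same shape.

```python
class TextAccumulator:
    """Toy strict accumulator for /v1/messages text deltas (illustrative only)."""

    def __init__(self):
        self.parts = {}  # block index -> accumulated text

    def feed(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "content_block_start":
            block = event.get("content_block", {})
            self.parts[event["index"]] = block.get("text", "")
        elif kind == "content_block_delta":
            index = event["index"]
            if index not in self.parts:
                # A strict client has nowhere to put this text.
                raise ValueError(f"text part {index} not found")
            self.parts[index] += event["delta"]["text"]


# Well-ordered stream (abbreviated from the upstream log): accumulates normally.
ok = TextAccumulator()
ok.feed({"type": "content_block_start", "index": 0,
         "content_block": {"type": "text", "text": ""}})
ok.feed({"type": "content_block_delta", "index": 0,
         "delta": {"type": "text_delta", "text": "Hello"}})
print(ok.parts)

# Stream shaped like the LiteLLM log: the first delta has no registered part.
strict = TextAccumulator()
try:
    strict.feed({"type": "content_block_delta", "index": -1,
                 "delta": {"type": "text_delta", "text": "\n\nHel"}})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```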

Steps to Reproduce

Notes:

LiteLLM requests require Authorization: Bearer <LITELLM_MASTER_KEY>.
The commands below are simplified to focus on streaming behavior.

Upstream (direct):

curl -sS -N http://upstream/v1/messages \
  -H 'Content-Type: application/json' \
  -d '{"model":"nvidia/nemotron-3-nano-4b","messages":[{"role":"user","content":"hi"}],"max_tokens":32,"stream":true}'

curl -sS -N http://upstream/v1/responses \
  -H 'Content-Type: application/json' \
  -d '{"model":"nvidia/nemotron-3-nano-4b","input":"hi","max_output_tokens":32,"stream":true}'

LiteLLM (direct, from inside the compose network):

docker compose exec -T gateway python - <<'PY'
import os, httpx

# LiteLLM proxy address inside the compose network.
BASE = 'http://litellm:4000'
MASTER = os.environ['LITELLM_MASTER_KEY']
headers = {'Authorization': f'Bearer {MASTER}', 'Content-Type': 'application/json'}

# Anthropic-compatible endpoint: print each raw SSE line as it arrives.
with httpx.stream('POST', f'{BASE}/v1/messages', headers=headers, json={
  "model": "nemotron-3-nano-4b",
  "messages": [{"role": "user", "content": "hi"}],
  "max_tokens": 32,
  "stream": True
}, timeout=20) as s:
  for line in s.iter_lines():
    print(line)

# OpenAI Responses endpoint: same model group, same raw-line dump.
with httpx.stream('POST', f'{BASE}/v1/responses', headers=headers, json={
  "model": "nemotron-3-nano-4b",
  "input": "hi",
  "max_output_tokens": 32,
  "stream": True
}, timeout=20) as s:
  for line in s.iter_lines():
    print(line)
PY
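
The script above prints raw SSE lines. To compare streams event by event, the lines can be grouped into (event name, payload) pairs. The following is a minimal sketch, assuming standard SSE framing (fields separated by a colon, events terminated by a blank line); `parse_sse` is a hypothetical helper name, and non-JSON payloads such as the [DONE] sentinel are passed through as plain strings.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def _decode(data: str):
    """Decode a data payload as JSON, falling back to the raw string."""
    try:
        return json.loads(data)
    except ValueError:
        return data  # e.g. the "[DONE]" sentinel


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], object]]:
    """Group SSE lines into (event_name, payload) pairs."""
    event_name = None
    data_lines = []
    for line in lines:
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield event_name, _decode("\n".join(data_lines))
            event_name, data_lines = None, []
            continue
        if line.startswith(":"):
            continue  # SSE comment line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one optional space after the colon
        if field == "event":
            event_name = value
        elif field == "data":
            data_lines.append(value)
    if data_lines:
        # Flush an event that was not followed by a blank line.
        yield event_name, _decode("\n".join(data_lines))


sample_lines = [
    "event: message_start",
    'data: {"type": "message_start"}',
    "",
    "data: [DONE]",
    "",
]
parsed = list(parse_sse(sample_lines))
for name, payload in parsed:
    print(name, payload)
```

In the script above, `s.iter_lines()` could be passed to such a helper in place of the sample list.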

Relevant log output

Upstream: `POST /v1/messages` (`stream=true`) (trimmed)

event: message_start
data: {"type":"message_start",...}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
...


LiteLLM: `POST /v1/messages` (`stream=true`) (trimmed)

event: message_start
data: {"type": "message_start", ...}

event: content_block_delta
data: {"type": "content_block_delta", "index": -1, "delta": {"type": "text_delta", "text": "\n\nHel"}}
...


Upstream: `POST /v1/responses` (`stream=true`) (trimmed)

event: response.created
data: {"type":"response.created",...}

event: response.output_item.added
data: {"type":"response.output_item.added",...}

event: response.content_part.added
data: {"type":"response.content_part.added",...}
...


LiteLLM: `POST /v1/responses` (`stream=true`) (trimmed)

data: {"type":"response.output_text.delta","item_id":"resp_...","output_index":0,"content_index":0,"delta":"\nHell","model":"nemotron-3-nano-4b"}
...

data: {"type":"response.completed",...}
data: [DONE]

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.10-stable.patch.1
