litellm - ✅ (Solved) Fix [Bug]: Streaming usage lost when backend (vLLM) sends usage in a separate empty-choices chunk after finish_reason [1 pull request, 1 participant]

GitHub stats
BerriAI/litellm#25389 (fetched 2026-04-09 07:52:19)
Comments: 0 · Participants: 1 · Timeline events: 2 · Reactions: 0
Timeline (top): labeled ×2

Root Cause

vLLM delivers usage in a separate, trailing SSE chunk that has an empty choices array, sent after the chunk that carries finish_reason: stop:

data: {"choices":[{"finish_reason":"stop","delta":{"content":""},...}], ...}   ← chunk N-1
data: {"choices":[], "usage":{"prompt_tokens":26,"completion_tokens":242,"total_tokens":268}}   ← chunk N
data: [DONE]

LiteLLM's stream consumer stops processing after it sees finish_reason in chunk N-1, so chunk N (which contains the usage) is never read. As a result:

  • The usage is not captured internally by LiteLLM.
  • The Gemini translation layer has nothing to convert into usageMetadata, so the field is absent from the response forwarded to the client.

Fix Action

Workaround

A lightweight reverse proxy can be placed between LiteLLM and vLLM to merge the two trailing chunks before LiteLLM sees them:

gemini-cli → LiteLLM (:4000) → merge-proxy (:8082) → vLLM (:8081)

The merge proxy buffers the finish_reason chunk and, if the next chunk is a usage-only chunk (choices: []), injects the usage field into the buffered chunk before forwarding. This is a workaround, not a fix; ideally LiteLLM handles this natively.

Alternatively, setting fake_stream: true on the model avoids the issue entirely by using a non-streaming request to vLLM, at the cost of increased time-to-first-token.
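The buffer-and-inject step of the merge proxy could look roughly like the sketch below. It is not the reporter's actual proxy: the function name merge_trailing_usage is hypothetical, and the sketch assumes the proxy has already split the upstream response into SSE data payloads (the text after "data: "), leaving out the HTTP plumbing entirely.

```python
import json
from typing import Iterable, Iterator, Optional


def merge_trailing_usage(payloads: Iterable[str]) -> Iterator[str]:
    """Yield SSE data payloads, folding a usage-only chunk into the
    finish_reason chunk that precedes it.

    `payloads` holds the text after "data: " for each event, including the
    literal "[DONE]" sentinel. Hypothetical helper, for illustration only.
    """
    buffered: Optional[dict] = None  # finish_reason chunk being held back

    for payload in payloads:
        if payload.strip() == "[DONE]":
            if buffered is not None:
                yield json.dumps(buffered)
                buffered = None
            yield payload
            continue

        chunk = json.loads(payload)
        choices = chunk.get("choices") or []

        if buffered is not None:
            if not choices and chunk.get("usage"):
                # Usage-only chunk: inject usage into the held chunk, drop this one.
                buffered["usage"] = chunk["usage"]
                yield json.dumps(buffered)
                buffered = None
                continue
            # Anything else: release the held chunk unchanged.
            yield json.dumps(buffered)
            buffered = None

        if any(c.get("finish_reason") for c in choices):
            buffered = chunk  # hold until the next payload is known
        else:
            yield payload

    if buffered is not None:
        # Stream ended without [DONE]; flush what was held.
        yield json.dumps(buffered)
```

Holding back only the finish_reason chunk keeps the added latency to a single event, so time-to-first-token is unaffected, unlike the fake_stream option.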

PR fix notes

PR #25410: fix(streaming): preserve usage from trailing empty-choices chunks (vLLM)

Description (problem / solution / changelog)

Relevant issues

Fixes #25389

Pre-Submission checklist

  • I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type

🐛 Bug Fix

Changes

vLLM (and some other OpenAI-compatible backends) sends usage data in a trailing SSE chunk with an empty choices array, after the finish_reason: "stop" chunk:

data: {"choices":[{"finish_reason":"stop","delta":{"content":""}}]}   ← chunk N-1
data: {"choices":[], "usage":{"prompt_tokens":26,...}}                 ← chunk N
data: [DONE]

In chunk_creator(), the else branch for empty choices returned None when stream_options wasn't set. This meant the usage chunk was never accumulated in self.chunks, so calculate_total_usage() had nothing to work with and usage was lost from _hidden_params.

The fix: when the empty-choices chunk carries usage data, return it so __next__() accumulates it in self.chunks. The existing downstream logic (lines 1884-1904) already handles this correctly — it strips usage from the response, recreates it, checks emptiness, and continues without yielding the chunk to the caller.

Before: _hidden_params["usage"] shows zeros or is missing for vLLM streaming responses.
After: _hidden_params["usage"] correctly reflects the provider-reported token counts.
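The before/after behaviour can be illustrated with a toy model of the accumulation step. This is a sketch under stated assumptions, not LiteLLM's code: consume_stream and keep_usage_only_chunks are hypothetical names, the accumulated list merely stands in for self.chunks, and the flag toggles between the old behaviour (empty-choices chunk dropped) and the fixed behaviour (chunk accumulated but not yielded).

```python
from typing import Any, Dict, List, Optional, Tuple

Chunk = Dict[str, Any]


def consume_stream(
    chunks: List[Chunk], keep_usage_only_chunks: bool
) -> Tuple[List[Chunk], Optional[Dict[str, int]]]:
    """Toy model of the described behaviour (not LiteLLM's implementation).

    Returns (chunks yielded to the caller, usage recovered from the
    accumulated chunks, or None if no usage was accumulated).
    """
    accumulated: List[Chunk] = []  # stands in for self.chunks
    yielded: List[Chunk] = []

    for chunk in chunks:
        if chunk.get("choices"):
            accumulated.append(chunk)
            yielded.append(chunk)
        elif keep_usage_only_chunks and chunk.get("usage"):
            # Fixed behaviour: remember the usage-only chunk, but never
            # hand an empty-choices chunk to the caller.
            accumulated.append(chunk)
        # Otherwise (old behaviour): the empty-choices chunk is dropped.

    usage = next(
        (c["usage"] for c in reversed(accumulated) if c.get("usage")), None
    )
    return yielded, usage


vllm_pattern: List[Chunk] = [
    {"choices": [{"delta": {"content": "hi"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": ""}, "finish_reason": "stop"}]},
    {"choices": [], "usage": {"prompt_tokens": 26, "completion_tokens": 242, "total_tokens": 268}},
]

before = consume_stream(vllm_pattern, keep_usage_only_chunks=False)
after = consume_stream(vllm_pattern, keep_usage_only_chunks=True)
```

In both runs the caller sees the same two chunks; only the recovered usage differs, which mirrors the two properties the PR's test checks.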

Test

Added test_usage_chunk_empty_choices_vllm_pattern which simulates the exact vLLM 3-chunk streaming pattern. It verifies both that:

  1. Usage data is preserved in _hidden_params with correct token counts
  2. No empty-choices chunk leaks to the caller

All 51 streaming handler tests pass.

Changed files

  • litellm/litellm_core_utils/streaming_handler.py (modified, +6/-0)
  • tests/test_litellm/litellm_core_utils/test_streaming_handler.py (modified, +86/-0)
Original issue body

Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

Bug Description

When using LiteLLM proxy with gemini_api: true (to serve Gemini-compatible endpoints for clients like gemini-cli) forwarding to a vLLM backend, the usage / usageMetadata field is missing from the response received by the client.

Environment

  • LiteLLM version: latest (reproduced on recent releases)
  • Backend: vLLM (OpenAI-compatible endpoint)
  • Client: gemini-cli (expects Gemini-format usageMetadata)
  • Config: gemini_api: true, stream_options.include_usage: true

Expected Behavior

LiteLLM should continue reading the stream until data: [DONE] rather than stopping at the first finish_reason chunk. The trailing usage chunk should be consumed, merged, and correctly translated into usageMetadata when the Gemini compatibility layer is active.

Actual Behavior

LiteLLM stops consuming the stream upon receiving finish_reason: stop. The subsequent usage-only chunk is discarded, and no usage information reaches the client.

Suggested Fix

In the SSE stream consumer, do not close the stream on finish_reason. Instead, keep reading until data: [DONE] is received, and collect any usage data that arrives in subsequent chunks.
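A consumer that follows this suggestion treats data: [DONE] as the only terminator and keeps collecting after finish_reason. The sketch below is a generic illustration over raw SSE lines, not LiteLLM's stream consumer; read_until_done is a hypothetical name, and the chunk shape is assumed to match the OpenAI-compatible example shown earlier.

```python
import json
from typing import Any, Dict, Iterable, List, Optional, Tuple


def read_until_done(
    lines: Iterable[str],
) -> Tuple[str, Optional[str], Optional[Dict[str, Any]]]:
    """Consume raw SSE lines until "data: [DONE]".

    Returns (concatenated content, finish_reason, usage). Hypothetical
    helper for illustration only.
    """
    text_parts: List[str] = []
    finish_reason: Optional[str] = None
    usage: Optional[Dict[str, Any]] = None

    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # the only terminator; finish_reason does not end the loop

        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # may arrive in a chunk with choices: []
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            text_parts.append(delta.get("content") or "")
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]

    return "".join(text_parts), finish_reason, usage
```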

Steps to Reproduce

  1. Start a vLLM server and confirm it returns a trailing usage chunk (curl example below).
  2. Configure LiteLLM proxy with gemini_api: true pointing to the vLLM backend.
  3. Send a streaming request via gemini-cli (or any Gemini-compatible client).
  4. Observe that the response contains no usageMetadata.

Verify vLLM trailing-chunk behavior directly:

curl http://localhost:8081/v1beta/models/GLM-4.7:streamGenerateContent \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {
        "parts": [{"text": "Hello, help me write a quicksort and save it in main.py"}]
      }
    ]
  }'

The last two data chunks before [DONE] will look exactly like the example above.

What LiteLLM version are you on ?

v1.83.3


extent analysis

TL;DR

Modify the SSE stream consumer in LiteLLM to continue reading the stream until data: [DONE] is received, rather than stopping at the first finish_reason chunk.

Guidance

  • Identify the SSE stream consumer code in LiteLLM and update it to not close the stream on finish_reason.
  • Verify that the updated code correctly handles the trailing usage chunk by checking for the presence of usageMetadata in the response.
  • Consider implementing a temporary workaround using a reverse proxy to merge the trailing chunks, as described in the issue.
  • Test the fix by sending a streaming request via gemini-cli and observing that the response contains the expected usageMetadata field.

Example

No code snippet is provided, as the issue does not include the relevant code from the SSE stream consumer.

Notes

The suggested fix assumes that the issue is solely due to the premature closure of the SSE stream consumer. However, other factors may be contributing to the problem, and additional debugging may be necessary.

Recommendation

Apply the suggested fix to the SSE stream consumer code, as it directly addresses the identified issue and should allow LiteLLM to correctly handle the trailing usage chunk.
