litellm - ✅ (Solved) Fix reasoning_tokens exceeds completion_tokens in streaming responses from vLLM [1 pull request, 1 comment, 2 participants]

GitHub stats
BerriAI/litellm#24526 · Fetched 2026-04-08 01:27:21
Comments: 1 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, referenced ×1

Fix Action

Fixed

PR fix notes

PR #24546: fix(streaming): ensure completion_tokens >= reasoning_tokens in streaming usage

Description (problem / solution / changelog)

Summary

Fixes #24526

Some backends like vLLM report reasoning_tokens and completion_tokens separately in streaming responses instead of including reasoning tokens in the completion_tokens total. This violates the OpenAI API standard where completion_tokens should always be >= reasoning_tokens (since reasoning tokens are a subset of completion tokens).

Example from the issue (vLLM serving Qwen3.5-397B):

completion_tokens=64, reasoning_tokens=79   # reasoning_tokens > completion_tokens!
completion_tokens=68, reasoning_tokens=71
completion_tokens=135, reasoning_tokens=152

Root cause: ChunkProcessor.calculate_usage() in streaming_chunk_builder_utils.py passes through the backend-reported completion_tokens and reasoning_tokens values without validating that completion_tokens >= reasoning_tokens.

Fix: After assembling the usage object, if reasoning_tokens > completion_tokens, add reasoning_tokens to completion_tokens (since the backend clearly excluded them from the total) and recalculate total_tokens.

  • When reasoning_tokens > completion_tokens: adjusts completion_tokens = completion_tokens + reasoning_tokens and updates total_tokens
  • When reasoning_tokens <= completion_tokens: no change (backend already follows OpenAI convention)
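
The adjustment described above can be sketched in a few lines of Python. This is a minimal illustration operating on a plain dict, not the actual code in ChunkProcessor.calculate_usage(); the helper name normalize_usage and the prompt_tokens value of 20 are assumptions made for the example.

```python
from typing import Any, Dict


def normalize_usage(usage: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `usage` in which completion_tokens >= reasoning_tokens.

    Hypothetical helper for illustration; it is not part of LiteLLM.
    """
    fixed = dict(usage)
    details = fixed.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens") or 0
    completion = fixed.get("completion_tokens") or 0

    if reasoning > completion:
        # The backend evidently left reasoning tokens out of the total,
        # so fold them back in and recompute total_tokens.
        completion += reasoning
        fixed["completion_tokens"] = completion
        fixed["total_tokens"] = (fixed.get("prompt_tokens") or 0) + completion

    return fixed


# First pair from the issue; prompt_tokens=20 is a made-up value.
vllm_usage = {
    "prompt_tokens": 20,
    "completion_tokens": 64,
    "total_tokens": 84,
    "completion_tokens_details": {"reasoning_tokens": 79},
}
print(normalize_usage(vllm_usage)["completion_tokens"])  # 143
```

The check is a heuristic: a response whose backend excluded reasoning tokens but happened to report fewer reasoning tokens than visible tokens would pass through unchanged.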

Test plan

  • Added test_streaming_reasoning_tokens_exceeds_completion_tokens — verifies that when vLLM reports completion_tokens=64, reasoning_tokens=79, the fix produces completion_tokens=143 (64+79) with correct total_tokens
  • Added test_streaming_reasoning_tokens_within_completion_tokens_unchanged — verifies that when the backend already reports completion_tokens >= reasoning_tokens (e.g., OpenAI), values are left unchanged
  • All 11 tests in test_streaming_chunk_builder_utils.py pass

Changed files

  • litellm/litellm_core_utils/streaming_chunk_builder_utils.py (modified, +20/-0)
  • litellm/llms/openai/chat/guardrail_translation/handler.py (modified, +3/-3)
  • litellm/proxy/management_helpers/audit_logs.py (modified, +9/-5)
  • tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py (modified, +119/-0)


Bug Description

When proxying streaming chat completion requests from vLLM (serving Qwen3.5-397B-A17B-FP8), LiteLLM reports reasoning_tokens greater than completion_tokens in the usage response. Per the OpenAI API standard, completion_tokens should be the total output tokens (including reasoning), and reasoning_tokens should be a subset within completion_tokens_details.

Steps to Reproduce

  1. Configure LiteLLM to proxy a vLLM instance serving a reasoning model (e.g. Qwen3.5-397B):

     model_list:
       - model_name: Qwen/Qwen3.5-397B-A17B-FP8
         litellm_params:
           model: openai/Qwen/Qwen3.5-397B-A17B-FP8
           api_base: http://<vllm-host>:8000/v1
           api_key: "no-key-required"

  2. Send a streaming chat completion request:

     curl -X POST http://<litellm-host>:4000/v1/chat/completions \
       -H "Authorization: Bearer <key>" \
       -H "Content-Type: application/json" \
       -d '{
         "model": "Qwen/Qwen3.5-397B-A17B-FP8",
         "messages": [{"role": "user", "content": "Write a short poem about clouds"}],
         "stream": true
       }'

  3. Observe the final streaming chunk's usage field.
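
To make the last step concrete, the following sketch pulls the last usage object out of an OpenAI-style server-sent-event stream. It assumes the proxy emits `data: {...}` lines terminated by `data: [DONE]`; the helper name last_usage and the sample stream (including prompt_tokens=20) are made up for illustration, and real chunks carry more fields.

```python
import json
from typing import Any, Dict, Iterable, Optional


def last_usage(sse_lines: Iterable[str]) -> Optional[Dict[str, Any]]:
    """Return the last non-empty `usage` object found in an SSE stream."""
    usage = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


# Made-up stream for illustration only.
sample_stream = [
    'data: {"choices": [{"delta": {"content": "Soft"}}]}',
    "",
    'data: {"choices": [], "usage": {"prompt_tokens": 20, "completion_tokens": 64, '
    '"total_tokens": 84, "completion_tokens_details": {"reasoning_tokens": 79}}}',
    "",
    "data: [DONE]",
]

usage = last_usage(sample_stream)
print(usage["completion_tokens"], usage["completion_tokens_details"]["reasoning_tokens"])
```

In practice the lines would come from the curl output or an HTTP client iterating over the response body rather than from a hard-coded list.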

Examples from production logs

completion_tokens=64, reasoning_tokens=79
completion_tokens=68, reasoning_tokens=71
completion_tokens=78, reasoning_tokens=83
completion_tokens=38, reasoning_tokens=43
completion_tokens=135, reasoning_tokens=152

In all cases, reasoning_tokens > completion_tokens, which violates the OpenAI standard. A compliant usage object keeps reasoning_tokens within completion_tokens, for example:

"usage": {
  "completion_tokens": 150,  // should be total (reasoning + output)
  "completion_tokens_details": {
    "reasoning_tokens": 80   // should be <= completion_tokens
  }
}

Impact

Downstream tools that validate server token counts see this as inconsistent and clamp output tokens to 0, breaking token throughput metrics.
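
One plausible way such a tool arrives at 0 is by deriving visible output tokens as completion_tokens minus reasoning_tokens and clamping negative results. The sketch below assumes that logic; the issue does not describe the downstream tool's implementation, and the function name is hypothetical.

```python
def visible_output_tokens(completion_tokens: int, reasoning_tokens: int) -> int:
    """Derive non-reasoning output tokens, clamping negative results to 0.

    Assumed validator behaviour, for illustration only.
    """
    return max(0, completion_tokens - reasoning_tokens)


# Pairs taken from the production logs in this issue.
for completion, reasoning in [(64, 79), (68, 71), (135, 152)]:
    print(completion, reasoning, visible_output_tokens(completion, reasoning))
```

Under this assumption every logged pair yields 0 visible tokens, so any throughput metric built on that value reads as zero even though the model produced output.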

Environment

  • LiteLLM version: v1.81.12-stable (ghcr.io/berriai/litellm-database:main-v1.81.12-stable)
  • Backend: vLLM serving Qwen/Qwen3.5-397B-A17B-FP8 via OpenAI-compatible API
  • Request type: streaming chat completions
  • LiteLLM config: model configured as openai/Qwen/Qwen3.5-397B-A17B-FP8

Expected Behavior

completion_tokens should always be >= reasoning_tokens, representing the total output token count.

Actual Behavior

completion_tokens appears to count only the visible output, while reasoning tokens are reported separately in reasoning_tokens, resulting in reasoning_tokens > completion_tokens.

extent analysis

Fix Plan

To resolve the issue, LiteLLM's streaming usage calculation must ensure that completion_tokens covers both visible output tokens and reasoning tokens.

  1. Update the usage calculation:

    • Per the PR notes, the relevant code is ChunkProcessor.calculate_usage() in streaming_chunk_builder_utils.py, which passes backend-reported values through without validation.
    • Ensure that completion_tokens is the sum of all output tokens, including those from reasoning.
  2. Adjust the usage response:

    • Recalculate total_tokens from the corrected completion_tokens before returning the usage field.
    • Verify that completion_tokens is always greater than or equal to reasoning_tokens.

Example Code Snippet (Python):

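def calculate_tokens(output_token_ids, reasoning_token_ids):
    # Inputs are token sequences from the model's tokenizer, not raw strings:
    # len() on a string would count characters rather than tokens.
    reasoning_tokens = len(reasoning_token_ids)

    # completion_tokens is the total output: visible tokens plus reasoning tokens
    completion_tokens = len(output_token_ids) + reasoning_tokens

    return completion_tokens, reasoning_tokens

def generate_usage_response(completion_tokens, reasoning_tokens):
    # Build the OpenAI-style usage fragment; reasoning_tokens is a subset
    # of completion_tokens and sits under completion_tokens_details.
    return {
        "completion_tokens": completion_tokens,
        "completion_tokens_details": {
            "reasoning_tokens": reasoning_tokens
        }
    }

# Example usage with made-up token IDs (4 visible tokens, 3 reasoning tokens)
output_token_ids = [101, 102, 103, 104]
reasoning_token_ids = [201, 202, 203]
completion_tokens, reasoning_tokens = calculate_tokens(output_token_ids, reasoning_token_ids)
usage_response = generate_usage_response(completion_tokens, reasoning_tokens)
print(usage_response)  # completion_tokens is 7, reasoning_tokens is 3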

Verification

To verify the fix, send a streaming chat completion request and check the final streaming chunk's usage field. Ensure that completion_tokens is greater than or equal to reasoning_tokens.
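
A small checker can automate this comparison. The sketch below is an assumption-laden illustration: usage_violations is a hypothetical helper, the prompt_tokens value is made up, and the total_tokens check assumes total_tokens equals prompt_tokens plus completion_tokens.

```python
from typing import Any, Dict, List


def usage_violations(usage: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in an OpenAI-style usage dict."""
    problems = []
    prompt = usage.get("prompt_tokens") or 0
    completion = usage.get("completion_tokens") or 0
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens") or 0

    if reasoning > completion:
        problems.append(
            f"reasoning_tokens ({reasoning}) > completion_tokens ({completion})"
        )

    total = usage.get("total_tokens")
    if total is not None and total != prompt + completion:
        problems.append(
            f"total_tokens ({total}) != prompt_tokens + completion_tokens "
            f"({prompt + completion})"
        )
    return problems


# Values before and after the fix for the first logged pair (prompt_tokens is made up).
before = {"prompt_tokens": 20, "completion_tokens": 64, "total_tokens": 84,
          "completion_tokens_details": {"reasoning_tokens": 79}}
after = {"prompt_tokens": 20, "completion_tokens": 143, "total_tokens": 163,
         "completion_tokens_details": {"reasoning_tokens": 79}}

print(usage_violations(before))
print(usage_violations(after))
```

An empty list for the post-fix usage indicates that both invariants hold.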

Extra Tips

  • Review the OpenAI API documentation to ensure compliance with the latest standards.
  • Consider adding unit tests to validate the token counting logic and usage response generation.
  • Monitor production logs for any further discrepancies in token counts.
