litellm - ✅(Solved) Fix reasoning_tokens exceeds completion_tokens in streaming responses from vLLM [1 pull requests, 1 comments, 2 participants]

litellm • 2026-03-24 16:55:06

GitHub stats

BerriAI/litellm#24526 • Fetched 2026-04-08 01:27:21

Author

ofektiko

Participants

mehmoodosman

ofektiko

Timeline (top)

commented ×1, cross-referenced ×1, labeled ×1, referenced ×1

Fix Action

Fix proposed

Proposed in PR (open, not merged when fetched): fix(streaming): ensure completion_tokens >= reasoning_tokens in streaming usage (https://github.com/BerriAI/litellm/pull/24546)

PR fix notes

PR #24546: fix(streaming): ensure completion_tokens >= reasoning_tokens in streaming usage

Repository: BerriAI/litellm
Author: Krishnachaitanyakc
State: open | merged: False
Link: https://github.com/BerriAI/litellm/pull/24546

Description (problem / solution / changelog)

Summary

Fixes #24526

Some backends like vLLM report reasoning_tokens and completion_tokens separately in streaming responses instead of including reasoning tokens in the completion_tokens total. This violates the OpenAI API standard where completion_tokens should always be >= reasoning_tokens (since reasoning tokens are a subset of completion tokens).

Example from the issue (vLLM serving Qwen3.5-397B):

completion_tokens=64, reasoning_tokens=79   # reasoning_tokens > completion_tokens!
completion_tokens=68, reasoning_tokens=71
completion_tokens=135, reasoning_tokens=152

Root cause: ChunkProcessor.calculate_usage() in streaming_chunk_builder_utils.py passes through the backend-reported completion_tokens and reasoning_tokens values without validating that completion_tokens >= reasoning_tokens.

Fix: After assembling the usage object, if reasoning_tokens > completion_tokens, add reasoning_tokens to completion_tokens (since the backend clearly excluded them from the total) and recalculate total_tokens.

- When reasoning_tokens > completion_tokens: adjusts completion_tokens = completion_tokens + reasoning_tokens and updates total_tokens
- When reasoning_tokens <= completion_tokens: no change (backend already follows OpenAI convention)
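
The adjustment described above can be sketched as a small standalone function. This is a minimal illustration, not the code from PR #24546: the function name `normalize_streaming_usage` and the plain-dict usage shape are assumptions made for the example, and the `prompt_tokens` value of 20 is invented, since the issue does not report prompt token counts.

```python
from copy import deepcopy


def normalize_streaming_usage(usage: dict) -> dict:
    """Return a copy of `usage` in which completion_tokens >= reasoning_tokens.

    Hypothetical helper: it mirrors the rule described in the PR summary,
    but it is not LiteLLM's actual implementation.
    """
    fixed = deepcopy(usage)
    details = fixed.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens") or 0
    completion = fixed.get("completion_tokens") or 0

    if reasoning > completion:
        # The backend evidently left reasoning tokens out of the total,
        # so fold them in and recompute total_tokens.
        fixed["completion_tokens"] = completion + reasoning
        fixed["total_tokens"] = (fixed.get("prompt_tokens") or 0) + fixed["completion_tokens"]
    return fixed


# First sample from the issue; prompt_tokens=20 is an assumed value.
vllm_usage = {
    "prompt_tokens": 20,
    "completion_tokens": 64,
    "total_tokens": 84,
    "completion_tokens_details": {"reasoning_tokens": 79},
}
fixed_usage = normalize_streaming_usage(vllm_usage)
print(fixed_usage["completion_tokens"], fixed_usage["total_tokens"])  # 143 163
```

The comparison is a heuristic: if a backend excludes reasoning tokens but happens to report reasoning_tokens <= completion_tokens, the values pass through unchanged.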

Test plan

- Added test_streaming_reasoning_tokens_exceeds_completion_tokens: verifies that when vLLM reports completion_tokens=64, reasoning_tokens=79, the fix produces completion_tokens=143 (64+79) with correct total_tokens
- Added test_streaming_reasoning_tokens_within_completion_tokens_unchanged: verifies that when the backend already reports completion_tokens >= reasoning_tokens (e.g., OpenAI), values are left unchanged
- All 11 tests in test_streaming_chunk_builder_utils.py pass

Changed files

litellm/litellm_core_utils/streaming_chunk_builder_utils.py (modified, +20/-0)
litellm/llms/openai/chat/guardrail_translation/handler.py (modified, +3/-3)
litellm/proxy/management_helpers/audit_logs.py (modified, +9/-5)
tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py (modified, +119/-0)

Bug Description

When proxying streaming chat completion requests from vLLM (serving Qwen3.5-397B-A17B-FP8), LiteLLM reports reasoning_tokens greater than completion_tokens in the usage response. Per the OpenAI API standard, completion_tokens should be the total output tokens (including reasoning), and reasoning_tokens should be a subset within completion_tokens_details.

Steps to Reproduce

Configure LiteLLM to proxy a vLLM instance serving a reasoning model (e.g. Qwen3.5-397B):

model_list:
  - model_name: Qwen/Qwen3.5-397B-A17B-FP8
    litellm_params:
      model: openai/Qwen/Qwen3.5-397B-A17B-FP8
      api_base: http://<vllm-host>:8000/v1
      api_key: "no-key-required"

Send a streaming chat completion request:

curl -X POST http://<litellm-host>:4000/v1/chat/completions \
  -H "Authorization: Bearer <key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-397B-A17B-FP8",
    "messages": [{"role": "user", "content": "Write a short poem about clouds"}],
    "stream": true
  }'

Observe the final streaming chunk's usage field.

Examples from production logs

completion_tokens=64, reasoning_tokens=79
completion_tokens=68, reasoning_tokens=71
completion_tokens=78, reasoning_tokens=83
completion_tokens=38, reasoning_tokens=43
completion_tokens=135, reasoning_tokens=152

In all cases, reasoning_tokens > completion_tokens, which violates the OpenAI convention. For comparison, a compliant usage object looks like this:

"usage": {
  "completion_tokens": 150,  // should be total (reasoning + output)
  "completion_tokens_details": {
    "reasoning_tokens": 80   // should be <= completion_tokens
  }
}

Impact

Downstream tools that validate server token counts see this as inconsistent and clamp output tokens to 0, breaking token throughput metrics.
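
To make the failure mode concrete, here is a rough sketch of how a metrics tool might derive visible output tokens. The function name and the clamping rule are assumptions about such tools in general, not the code of any specific one.

```python
def visible_output_tokens(completion_tokens: int, reasoning_tokens: int) -> int:
    """Hypothetical downstream metric: output tokens excluding reasoning.

    Assumes the OpenAI convention that reasoning tokens are a subset of
    completion tokens; a negative difference is treated as invalid and
    clamped to 0.
    """
    return max(0, completion_tokens - reasoning_tokens)


# Samples from the issue's production logs: every one clamps to 0.
samples = [(64, 79), (68, 71), (78, 83), (38, 43), (135, 152)]
clamped = [visible_output_tokens(c, r) for c, r in samples]
print(clamped)  # [0, 0, 0, 0, 0]

# With reasoning folded into the total (64 + 79 = 143), the metric recovers.
print(visible_output_tokens(143, 79))  # 64
```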

Environment

LiteLLM version: v1.81.12-stable (ghcr.io/berriai/litellm-database:main-v1.81.12-stable)
Backend: vLLM serving Qwen/Qwen3.5-397B-A17B-FP8 via OpenAI-compatible API
Request type: streaming chat completions
LiteLLM config: model configured as openai/Qwen/Qwen3.5-397B-A17B-FP8

Expected Behavior

completion_tokens should always be >= reasoning_tokens, representing the total output token count.

Actual Behavior

completion_tokens appears to exclude reasoning tokens, which are counted separately in reasoning_tokens. As a result, reasoning_tokens can exceed completion_tokens.

extent analysis

Fix Plan

To resolve the issue, LiteLLM's streaming usage assembly needs to guarantee that completion_tokens covers both visible output and reasoning tokens. Per the PR notes above, the relevant code is ChunkProcessor.calculate_usage() in streaming_chunk_builder_utils.py.

Update the token counting logic:
- After assembling the usage object, compare reasoning_tokens with completion_tokens.
- If reasoning_tokens is larger, treat the backend's completion_tokens as excluding reasoning tokens and add reasoning_tokens to it.

Adjust the usage response:
- Recalculate total_tokens from the corrected completion_tokens.
- Verify that completion_tokens is always greater than or equal to reasoning_tokens.

Example Code Snippet (Python, illustrative only; not LiteLLM's implementation):

def calculate_tokens(output_token_count, reasoning_token_count):
    # Both arguments are token counts, not strings: len() of a string
    # would count characters rather than tokens.
    # Total completion tokens = visible output tokens + reasoning tokens.
    completion_tokens = output_token_count + reasoning_token_count

    # Reasoning tokens are reported as a subset of completion tokens.
    reasoning_tokens = reasoning_token_count

    return completion_tokens, reasoning_tokens

def generate_usage_response(completion_tokens, reasoning_tokens):
    # Mirror the shape of the usage field shown in the issue.
    usage = {
        "completion_tokens": completion_tokens,
        "completion_tokens_details": {
            "reasoning_tokens": reasoning_tokens
        }
    }
    return usage

# Example usage with the first sample from the issue, assuming vLLM's
# completion_tokens=64 excluded the 79 reasoning tokens.
completion_tokens, reasoning_tokens = calculate_tokens(64, 79)
usage_response = generate_usage_response(completion_tokens, reasoning_tokens)
print(usage_response)

Verification

To verify the fix, send a streaming chat completion request and check the final streaming chunk's usage field. Ensure that completion_tokens is greater than or equal to reasoning_tokens.
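
One way to script this check is to scan the server-sent event lines of the stream for the last chunk that carries a usage object. The sketch below assumes the OpenAI-style streaming format (lines of the form `data: {json}`, terminated by `data: [DONE]`); the helper names and the sample lines are illustrative, not captured output.

```python
import json
from typing import Iterable, Optional


def last_usage(sse_lines: Iterable[str]) -> Optional[dict]:
    """Return the usage dict from the last chunk that has one, else None."""
    usage = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


def usage_is_consistent(usage: dict) -> bool:
    """True when completion_tokens >= reasoning_tokens."""
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens") or 0
    return (usage.get("completion_tokens") or 0) >= reasoning


# Illustrative stream (made-up chunks in the assumed format).
sample_stream = [
    'data: {"choices": [{"delta": {"content": "Soft"}}]}',
    "",
    'data: {"choices": [], "usage": {"completion_tokens": 143, '
    '"completion_tokens_details": {"reasoning_tokens": 79}}}',
    "data: [DONE]",
]
usage = last_usage(sample_stream)
print(usage_is_consistent(usage))  # True
```

Against a live proxy, the lines could come from the saved output of the curl request in the reproduction steps. Depending on the server, a usage chunk may only be emitted when the request sets stream_options to {"include_usage": true}.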

Extra Tips

- Review the OpenAI API documentation to ensure compliance with the latest standards.
- Consider adding unit tests to validate the token counting logic and usage response generation.
- Monitor production logs for any further discrepancies in token counts.
