vllm - ✅(Solved) Fix [Bug]: Duplicate token with different logprob when requesting top [1 pull requests, 1 comments, 2 participants]

CoolFish88 · 2026-03-10T14:22:47Z

[vllm] PR 36746: Bugfix Deduplicate sampled token in top logprobs output - Repository: vllm-project/vllm - Author: mvanhorn - State: open | merged: False - Lin… # PR #36746: [Bugfix] Deduplicate sampled token in top_logprobs output - Repository: vllm-project/vllm - Author: mvanhorn - State: open | merged: False - Link: https://github.com/vllm-project/vllm/pull/36746 ## Description (problem / solution / changelog) ## Purpose Fixes #36660 When the sampled token is already among the top-k logprobs (common for greedy or near-greedy decoding), `compute_topk_logprobs()` returns the same token twice - once as the sampled token (column 0) and once in the top-k list. This causes duplicate entries in the API response's `top_logprobs` field. ## Root Cause In `vllm/v1/worker/gpu/sample/logprob.py`, lines 103-106: ```python logprob_token_ids = sampled_token_ids.unsqueeze(-1) # [batch, 1] if num_logprobs > 0: topk_indices = torch.topk(logits, num_logprobs, dim=-1).indices logprob_token_ids = torch.cat((logprob_token_ids, topk_indices), dim=1) ``` The sampled token is prepended, then `topk` tokens are appended without checking for overlap. ## Fix - Request `num_logprobs + 1` from `torch.topk` (one extra to account for potential duplicate) - Identify the first occurrence of the sampled token in the top-k results - Remove it using a stable sort that preserves the original ranking order - Slice to exactly `num_logprobs` entries This handles both cases correctly: - **Sampled token IS in top-k:** removed from top-k, leaving exactly `num_logprobs` unique additional entries - **Sampled token is NOT in top-k:** no removal needed, the extra entry is simply sliced off ## Test Plan - Verified with `ruff check` and `ruff format` (passes) - The fix uses only standard PyTorch operations (`topk`, `gather`, `sort`) with no new dependencies - Existing logprob tests in `tests/v1/sample/test_logprobs.py` cover the end-to-end behavior - Manual verification: with `top_logprobs=5` and greedy sampling, the sampled token should appear only once in the response ## Changed files - `vllm/v1/worker/gpu/sample/logprob.py` (modified, +21/-1) ## Fixed - Fixed by PR: [Bugfix] Deduplicate sampled token in top_logprobs output (https://github.com/vllm-project/vllm/pull/36746) ### Your current environment I am using vLLM 0.14 within the AWS LMI container: 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi19.0.0-cu128 ### 🐛 Describe the bug When requesting top_logprobs = 5 using the OpenAI chat completion schema (loprobs: True, max_tokens:1, max_completion_tokens:1), I get duplicate tokens with different probabilities. `Example: output["choices"][0]["logprobs"]["content"][0]["top_logprobs"] -> [{'token': 'none', 'logprob': -0.0012816318776458502, 'bytes': [110, 111, 110, 101]}, {'token': 'helper', 'logprob': -7.25128173828125, 'bytes': [104, 101, 108, 112, 101, 114]}, {'token': 'explanation', 'logprob': -7.75128173828125, 'bytes': [101, 120, 112, 108, 97, 110, 97, 116, 105, 111, 110]}, {'token': 'None', 'logprob': -9.87628173828125, 'bytes': [78, 111, 110, 101]}, {'token': 'none', 'logprob': -10.43878173828125, 'bytes': [110, 111, 110, 101]}]` Notice the first and last tokens in the serie are identical in name and sequence of bytes, but have different log probabilities. Why does this happen? ### Before submitting a new issue... - [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

vllm2026-03-10 14:22:47

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#36660•Fetched 2026-04-08 00:35:34

View on GitHub

Comments

Participants

Timeline

Reactions

Author

CoolFish88

Participants

CoolFish88

mvanhorn

Timeline (top)

referenced ×2commented ×1cross-referenced ×1labeled ×1

Fix Action

Fixed

Fixed by PR: [Bugfix] Deduplicate sampled token in top_logprobs output (https://github.com/vllm-project/vllm/pull/36746)

PR fix notes

PR #36746: [Bugfix] Deduplicate sampled token in top_logprobs output

Repository: vllm-project/vllm
Author: mvanhorn
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36746

Description (problem / solution / changelog)

Purpose

Fixes #36660

When the sampled token is already among the top-k logprobs (common for greedy or near-greedy decoding), compute_topk_logprobs() returns the same token twice - once as the sampled token (column 0) and once in the top-k list. This causes duplicate entries in the API response's top_logprobs field.

Root Cause

In vllm/v1/worker/gpu/sample/logprob.py, lines 103-106:

logprob_token_ids = sampled_token_ids.unsqueeze(-1)  # [batch, 1]
if num_logprobs > 0:
    topk_indices = torch.topk(logits, num_logprobs, dim=-1).indices
    logprob_token_ids = torch.cat((logprob_token_ids, topk_indices), dim=1)

The sampled token is prepended, then topk tokens are appended without checking for overlap.

Fix

Request num_logprobs + 1 from torch.topk (one extra to account for potential duplicate)
Identify the first occurrence of the sampled token in the top-k results
Remove it using a stable sort that preserves the original ranking order
Slice to exactly num_logprobs entries

This handles both cases correctly:

Sampled token IS in top-k: removed from top-k, leaving exactly num_logprobs unique additional entries
Sampled token is NOT in top-k: no removal needed, the extra entry is simply sliced off

Test Plan

Verified with ruff check and ruff format (passes)
The fix uses only standard PyTorch operations (topk, gather, sort) with no new dependencies
Existing logprob tests in tests/v1/sample/test_logprobs.py cover the end-to-end behavior
Manual verification: with top_logprobs=5 and greedy sampling, the sampled token should appear only once in the response

Changed files

vllm/v1/worker/gpu/sample/logprob.py (modified, +21/-1)

RAW_BUFFERClick to expand / collapse

Your current environment

I am using vLLM 0.14 within the AWS LMI container: 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi19.0.0-cu128

🐛 Describe the bug

When requesting top_logprobs = 5 using the OpenAI chat completion schema (loprobs: True, max_tokens:1, max_completion_tokens:1), I get duplicate tokens with different probabilities.

Example: output["choices"][0]["logprobs"]["content"][0]["top_logprobs"] -> [{'token': 'none', 'logprob': -0.0012816318776458502, 'bytes': [110, 111, 110, 101]}, {'token': 'helper', 'logprob': -7.25128173828125, 'bytes': [104, 101, 108, 112, 101, 114]}, {'token': 'explanation', 'logprob': -7.75128173828125, 'bytes': [101, 120, 112, 108, 97, 110, 97, 116, 105, 111, 110]}, {'token': 'None', 'logprob': -9.87628173828125, 'bytes': [78, 111, 110, 101]}, {'token': 'none', 'logprob': -10.43878173828125, 'bytes': [110, 111, 110, 101]}]

Notice the first and last tokens in the serie are identical in name and sequence of bytes, but have different log probabilities. Why does this happen?

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the issue of duplicate tokens with different probabilities, we need to modify the code to handle token normalization and probability calculation.

Step-by-Step Solution

Token Normalization: Normalize the tokens to ensure that identical tokens are not treated as separate entities.
Probability Calculation: Recalculate the log probabilities for the normalized tokens.

Example Code

import json

def normalize_tokens(top_logprobs):
    # Create a dictionary to store unique tokens and their log probabilities
    unique_tokens = {}
    for token_info in top_logprobs:
        token = token_info['token'].lower()  # Normalize token to lowercase
        if token in unique_tokens:
            # If token already exists, update its log probability
            unique_tokens[token] = max(unique_tokens[token], token_info['logprob'])
        else:
            unique_tokens[token] = token_info['logprob']
    
    # Convert the dictionary back to a list of token information
    normalized_top_logprobs = []
    for token, logprob in unique_tokens.items():
        # Find the original token information with the maximum log probability
        max_logprob_token_info = max([t for t in top_logprobs if t['token'].lower() == token], key=lambda x: x['logprob'])
        normalized_top_logprobs.append({
            'token': max_logprob_token_info['token'],
            'logprob': logprob,
            'bytes': max_logprob_token_info['bytes']
        })
    
    return normalized_top_logprobs

# Example usage:
output = {
    "choices": [{
        "logprobs": {
            "content": [{
                "top_logprobs": [
                    {'token': 'none', 'logprob': -0.0012816318776458502, 'bytes': [110, 111, 110, 101]},
                    {'token': 'helper', 'logprob': -7.25128173828125, 'bytes': [104, 101, 108, 112, 101, 114]},
                    {'token': 'explanation', 'logprob': -7.75128173828125, 'bytes': [101, 120, 112, 108, 97, 110, 97, 116, 105, 111, 110]},
                    {'token': 'None', 'logprob': -9.87628173828125, 'bytes': [78, 111, 110, 101]},
                    {'token': 'none', 'logprob': -10.43878173828125, 'bytes': [110, 111, 110, 101]}
                ]
            }]
        }
    }]
}

normalized_top_logprobs = normalize_tokens(output["choices"][0]["logprobs"]["content"][0]["top_logprobs"])
print(json.dumps(normalized

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #tensor shape #autograd error #request timeout

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - ✅(Solved) Fix [Bug]: Duplicate token with different logprob when requesting top [1 pull requests, 1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fixed

PR fix notes

PR #36746: [Bugfix] Deduplicate sampled token in top_logprobs output

Description (problem / solution / changelog)

Purpose

Root Cause

Fix

Test Plan

Changed files

Your current environment

🐛 Describe the bug

Before submitting a new issue...

extent analysis

Fix Plan

Step-by-Step Solution

Example Code

Still need to ship something?

TRENDING

vllm - ✅(Solved) Fix [Bug]: Duplicate token with different logprob when requesting top [1 pull requests, 1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fixed

PR fix notes

PR #36746: [Bugfix] Deduplicate sampled token in top_logprobs output

Description (problem / solution / changelog)

Purpose

Root Cause

Fix

Test Plan

Changed files

Your current environment

🐛 Describe the bug

Before submitting a new issue...

extent analysis

Fix Plan

Step-by-Step Solution

Example Code

Still need to ship something?

RELATED_DISCOVERY

TRENDING