vllm - ✅(Solved) Fix [Feature]: Speculative Decoding using draft_model does not use draft_probs [1 pull requests, 1 participants]

vllm · 2026-04-17 13:20:14

vllm-project/vllm#40149•Fetched 2026-04-18 05:52:18

Author

Graval504

Root Cause

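Because GPUModelRunner._sample() passes draft_probs=None, rejection_random_sample_kernel runs with NO_DRAFT_PROBS set and treats every draft_prob value as 1. The acceptance test for a draft token x therefore uses p(x) in place of the ratio p(x)/q(x). Since q(x) ≤ 1, this can only lower the chance of acceptance, which leads to a significantly lower acceptance rate than probabilistic rejection sampling should achieve.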

# vllm/v1/sample/rejection_sampler.py
class RejectionSampler(nn.Module):
    def forward(
        ...
        output_token_ids = rejection_sample(
            metadata.draft_token_ids,
            metadata.num_draft_tokens,
            metadata.max_spec_len,
            metadata.cu_num_draft_tokens,
            draft_probs,
            target_logits,
            bonus_token_ids,
            sampling_metadata,
        )
        ...
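
To make the effect of the fallback concrete, the following is a minimal sketch that computes the expected acceptance probability of a single draft token in both modes. It is plain Python and not vLLM code: the function name, the toy distributions p and q, and the assumption that the fallback accepts a token with probability min(1, p(x)/1) are all illustrative.

```python
def expected_acceptance(p, q, use_draft_probs):
    """Expected probability that one draft token x ~ q is accepted.

    p: target distribution, q: draft distribution (lists that sum to 1).
    use_draft_probs=False mimics the fallback where draft_prob is fixed to 1.
    """
    total = 0.0
    for p_x, q_x in zip(p, q):
        if q_x == 0.0:
            continue  # this token is never proposed by the draft model
        draft_prob = q_x if use_draft_probs else 1.0
        accept = min(1.0, p_x / draft_prob)
        total += q_x * accept  # weight by how often x is drafted
    return total


# Toy distributions over a 3-token vocabulary (illustrative values only).
p = [0.6, 0.3, 0.1]
q = [0.5, 0.4, 0.1]

with_q = expected_acceptance(p, q, use_draft_probs=True)    # about 0.90
fallback = expected_acceptance(p, q, use_draft_probs=False)  # about 0.43
print(f"with draft_probs: {with_q:.2f}, fallback: {fallback:.2f}")
```

With the real q, the expected acceptance equals the overlap sum of min(p(x), q(x)); with the fallback it drops to the sum of q(x)·p(x), which is never larger.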

PR fix notes

PR #40269: [Bugfix][Spec Decode] Wire draft_probs into probabilistic draft_model rejection

Repository: vllm-project/vllm
Author: bedeks
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40269

Description (problem / solution / changelog)

Co-authored-by: OpenAI Codex

Purpose

Fixes #40149 by wiring draft-model proposal probabilities through the legacy V1 speculative decoding path when rejection_sample_method="probabilistic".

Previously, GPUModelRunner._sample() passed None for draft_probs, which forced the rejection sampler onto its no-draft-probs fallback instead of using the draft model’s actual proposal distribution. This change captures draft probabilities in the proposer, preserves them across the runner boundary, realigns them by request, and passes them into RejectionSampler so probabilistic rejection sampling can use the intended p(x) / q(x) logic for draft_model.
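
The "realign by request" step can be pictured with a small sketch. This is not the PR's code: the function name, the dictionary-based cache, and the request ids are hypothetical, and plain lists stand in for tensors. It only illustrates reordering cached per-request probability rows to the current batch order and slicing them to the number of draft tokens actually scheduled.

```python
def realign_draft_probs(cached, batch_order, num_draft_tokens):
    """Flatten cached draft probabilities into the current batch order.

    cached: dict mapping request id -> list of per-step probability rows.
    batch_order: request ids in the order used by the current batch.
    num_draft_tokens: dict mapping request id -> draft tokens scheduled now.
    Returns a flat list of rows, shaped like [num_tokens, vocab_size].
    """
    rows = []
    for req_id in batch_order:
        n = num_draft_tokens.get(req_id, 0)
        # Slice so that only rows for scheduled draft tokens are kept.
        rows.extend(cached.get(req_id, [])[:n])
    return rows


# Hypothetical cache written by the proposer in the previous step.
cached = {
    "req-a": [[0.9, 0.1], [0.2, 0.8]],
    "req-b": [[0.5, 0.5], [0.3, 0.7]],
}
# The batch was reordered, and req-b has only one draft token scheduled.
aligned = realign_draft_probs(cached, ["req-b", "req-a"], {"req-a": 2, "req-b": 1})
print(aligned)
```

The point of the sketch is that row i of the flattened result must describe the same draft token as entry i of the flattened draft_token_ids, otherwise p(x)/q(x) would be computed against the wrong proposal distribution.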

Test Plan

.venv/bin/python -m py_compile tests/v1/spec_decode/test_eagle.py tests/v1/worker/test_gpu_model_runner.py vllm/v1/spec_decode/eagle.py vllm/v1/worker/gpu_model_runner.py
.venv/bin/python -m pytest tests/v1/worker/test_gpu_model_runner.py -k reordered_draft_probs -v
.venv/bin/python -m pytest tests/v1/spec_decode/test_eagle.py -k probabilistic_draft_probs -v
Manual GPU validation on equivalent code:
- compared baseline vs fixed probabilistic draft-model acceptance on Qwen/Qwen3-1.7B + Qwen/Qwen3-0.6B

Test Result

py_compile: passed
tests/v1/worker/test_gpu_model_runner.py -k reordered_draft_probs -v
- verifies that runner-side cached draft_probs are reordered and sliced correctly before being passed to RejectionSampler
tests/v1/spec_decode/test_eagle.py -k probabilistic_draft_probs -v
- verifies that the proposer captures the expected per-step draft probabilities in probabilistic mode
Manual GPU validation on an L40S with equivalent code showed consistent improvement in speculative acceptance:
- run 1: acceptance_rate 0.2207 -> 0.4512, acceptance_len 1.6620 -> 2.3535
- run 2: acceptance_rate 0.2207 -> 0.4491, acceptance_len 1.6620 -> 2.3474
- run 3: acceptance_rate 0.2255 -> 0.4551, acceptance_len 1.6766 -> 2.3653
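
As a quick sanity check on these figures, the sketch below computes the relative gains per run. It also tests one assumption that the PR does not state: that acceptance_len ≈ 1 + k × acceptance_rate with k = 3 speculative tokens per step. The value of k is inferred from the numbers, not taken from the source.

```python
# (baseline_rate, fixed_rate, baseline_len, fixed_len), copied from the results above.
runs = [
    (0.2207, 0.4512, 1.6620, 2.3535),
    (0.2207, 0.4491, 1.6620, 2.3474),
    (0.2255, 0.4551, 1.6766, 2.3653),
]
ASSUMED_K = 3  # assumption: number of speculative tokens, not stated in the PR


def summarize(runs, k):
    """Return per-run relative gains and the residual of len ~= 1 + k * rate."""
    summary = []
    for base_rate, fixed_rate, base_len, fixed_len in runs:
        residual = max(
            abs(1 + k * base_rate - base_len),
            abs(1 + k * fixed_rate - fixed_len),
        )
        summary.append({
            "rate_gain": fixed_rate / base_rate,
            "len_gain": fixed_len / base_len,
            "len_residual": residual,
        })
    return summary


summary = summarize(runs, ASSUMED_K)
for i, row in enumerate(summary, start=1):
    print(f"run {i}: rate x{row['rate_gain']:.2f}, "
          f"len x{row['len_gain']:.2f}, residual {row['len_residual']:.4f}")
```

In all three runs the acceptance rate roughly doubles and the accepted length grows by about 1.4x. The small residuals suggest the two metrics are consistent with each other under the assumed k.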

Changed files

tests/v1/spec_decode/test_eagle.py (modified, +102/-0)
tests/v1/worker/test_gpu_model_runner.py (modified, +37/-0)
vllm/v1/spec_decode/eagle.py (modified, +50/-3)
vllm/v1/worker/gpu_model_runner.py (modified, +52/-4)

Code Example

"""
Args:
    draft_probs (Optional[torch.Tensor]): 
        Probability distribution for the draft tokens. Shape is
        [num_tokens, vocab_size]. Can be None if probabilities are
        not provided, which is the case for ngram spec decode.
"""

---

# vllm/v1/worker/gpu_model_runner.py
class GPUModelRunner(
    ...
    def __init__(
        ...
        self.rejection_sampler = RejectionSampler(self.sampler)
        ...

    def _sample(
        ...
        sampler_output = self.rejection_sampler(
            spec_decode_metadata,
            None,  # draft_probs
            logits,
            sampling_metadata,
        )
        return sampler_output

---

# vllm/v1/sample/rejection_sampler.py
class RejectionSampler(nn.Module):
    def forward(
        ...
        output_token_ids = rejection_sample(
            metadata.draft_token_ids,
            metadata.num_draft_tokens,
            metadata.max_spec_len,
            metadata.cu_num_draft_tokens,
            draft_probs,
            target_logits,
            bonus_token_ids,
            sampling_metadata,
        )
        ...

def rejection_sample(
    ...
    # Rejection sampling for random sampling requests.
    rejection_random_sample_kernel[(batch_size,)](
        output_token_ids,
        cu_num_draft_tokens,
        draft_token_ids,
        draft_probs,
        target_probs,
        bonus_token_ids,
        recovered_token_ids,
        uniform_probs,
        is_greedy,
        max_spec_len,
        vocab_size,
        NO_DRAFT_PROBS=draft_probs is None,
    )
    return output_token_ids

def rejection_random_sample_kernel(  # launched above with grid (batch_size,)
        ...
            if NO_DRAFT_PROBS:
                draft_prob = 1
        ...

🚀 The feature, motivation and pitch

I am reporting this as a feature request, but it could also be considered a bug, because the current implementation deviates from the intended speculative decoding logic.

When a draft_model is used with probabilistic rejection sampling, speculative decoding should follow the distribution-matching logic defined in Leviathan et al. (2022). Specifically, each draft token x should be accepted or rejected based on the ratio $p(x)/q(x)$, where p is the target model's distribution and q is the draft model's.

In the current vLLM V1 implementation, however, the GPUModelRunner._sample method in vllm/v1/worker/gpu_model_runner.py passes None as the draft_probs argument to the RejectionSampler. Notably, the docstring for RejectionSampler.forward explicitly states:

"""
Args:
    draft_probs (Optional[torch.Tensor]): 
        Probability distribution for the draft tokens. Shape is
        [num_tokens, vocab_size]. Can be None if probabilities are
        not provided, which is the case for ngram spec decode.
"""

However, even when using a model-based draft approach (not ngram), draft_probs is still being passed as None.

# vllm/v1/worker/gpu_model_runner.py
class GPUModelRunner(
    ...
    def __init__(
        ...
        self.rejection_sampler = RejectionSampler(self.sampler)
        ...

    def _sample(
        ...
        sampler_output = self.rejection_sampler(
            spec_decode_metadata,
            None,  # draft_probs
            logits,
            sampling_metadata,
        )
        return sampler_output

# vllm/v1/sample/rejection_sampler.py
class RejectionSampler(nn.Module):
    def forward(
        ...
        output_token_ids = rejection_sample(
            metadata.draft_token_ids,
            metadata.num_draft_tokens,
            metadata.max_spec_len,
            metadata.cu_num_draft_tokens,
            draft_probs,
            target_logits,
            bonus_token_ids,
            sampling_metadata,
        )
        ...

def rejection_sample(
    ...
    # Rejection sampling for random sampling requests.
    rejection_random_sample_kernel[(batch_size,)](
        output_token_ids,
        cu_num_draft_tokens,
        draft_token_ids,
        draft_probs,
        target_probs,
        bonus_token_ids,
        recovered_token_ids,
        uniform_probs,
        is_greedy,
        max_spec_len,
        vocab_size,
        NO_DRAFT_PROBS=draft_probs is None,
    )
    return output_token_ids

def rejection_random_sample_kernel(  # launched above with grid (batch_size,)
        ...
            if NO_DRAFT_PROBS:
                draft_prob = 1
        ...

We need to ensure that the actual logprobs or probabilities from the draft model are correctly captured and passed through the GPUModelRunner to the RejectionSampler.
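
For reference, one verification step of the probabilistic scheme described above can be sketched in plain Python. This illustrates the textbook accept/resample rule and is not vLLM's kernel: the function names are made up for the example, and lists of floats stand in for probability tensors.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    raw = [max(0.0, p_x - q_x) for p_x, q_x in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # p == q everywhere: a rejection cannot occur, so any fallback works.
        return list(p)
    return [r / total for r in raw]


def verify_draft_token(x, p, q, rng):
    """Verify draft token x (sampled from q) against target distribution p.

    Returns (token, accepted). q[x] > 0 is assumed because x was drawn from q.
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # Rejected: draw a replacement token from the residual distribution.
    residual = residual_distribution(p, q)
    token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return token, False


rng = random.Random(0)
# The target puts all mass on token 0, so a drafted token 1 is always rejected
# and replaced by token 0.
print(verify_draft_token(1, [1.0, 0.0], [0.5, 0.5], rng))
```

When draft_probs is None, the q[x] term in this rule is what gets replaced by 1.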

extent analysis

TL;DR

Pass the draft model's actual proposal probabilities to the RejectionSampler instead of None, so that probabilistic rejection sampling follows the intended speculative decoding logic.

Guidance

Identify where the draft_probs are calculated in the draft_model and ensure they are correctly passed to the GPUModelRunner.
Modify the GPUModelRunner._sample method to pass the actual draft_probs to the RejectionSampler instead of None.
Verify that the RejectionSampler is correctly using the provided draft_probs by checking the rejection_random_sample_kernel function.
Test the updated implementation to ensure it follows the distribution-matching logic defined in Leviathan et al. (2022).
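
One way to approach the last step is an exact, deterministic check on toy distributions rather than a statistical test. The sketch below is a hypothetical helper, not part of vLLM's test suite. It computes the output distribution of a single accept/resample step analytically and confirms that it equals the target distribution p.

```python
def output_distribution(p, q):
    """Exact distribution of the token emitted by one accept/resample step.

    A token x ~ q is accepted with probability min(1, p[x] / q[x]); otherwise
    a replacement is drawn from the normalized residual max(0, p - q).
    """
    n = len(p)
    out = [0.0] * n
    raw = [max(0.0, p[i] - q[i]) for i in range(n)]
    z = sum(raw)
    reject_mass = 0.0
    for x in range(n):
        if q[x] == 0.0:
            continue  # never drafted
        accept = min(1.0, p[x] / q[x])
        out[x] += q[x] * accept
        reject_mass += q[x] * (1.0 - accept)
    if z > 0.0:
        for y in range(n):
            out[y] += reject_mass * raw[y] / z
    return out


p = [0.6, 0.3, 0.1]
q = [0.5, 0.4, 0.1]
result = output_distribution(p, q)
print([round(v, 6) for v in result])
```

A check like this only covers the sampling rule itself. Verifying the wiring inside vLLM would still need tests against the real runner and sampler.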

Example

# vllm/v1/worker/gpu_model_runner.py
class GPUModelRunner(
    ...
    def _sample(
        ...
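        # Hypothetical accessor: get_log_probs() is not a confirmed vLLM API.
        # RejectionSampler expects probabilities of shape [num_tokens, vocab_size],
        # so log-probabilities are exponentiated before being passed on.
        draft_probs = self.draft_model.get_log_probs().exp()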
        sampler_output = self.rejection_sampler(
            spec_decode_metadata,
            draft_probs,
            logits,
            sampling_metadata,
        )
        return sampler_output

Notes

The exact implementation of passing the draft_probs to the RejectionSampler may vary depending on the specifics of the draft_model and the GPUModelRunner classes.

Recommendation

Apply workaround: Modify the GPUModelRunner._sample method to pass the actual draft_probs to the RejectionSampler instead of None, as this will ensure the correct implementation of the speculative decoding logic.
