vllm (solved): [Feature]: Speculative Decoding using draft_model does not use draft_probs (1 pull request, 1 participant)

GitHub stats

vllm-project/vllm#40149 (fetched 2026-04-18 05:52:18)

  • Comments: 0
  • Participants: 1
  • Timeline: 1 (labeled ×1)
  • Reactions: 0

Root Cause

Because draft_probs is None, rejection_random_sample_kernel runs with NO_DRAFT_PROBS set and uses draft_prob = 1 for every draft token. The intended acceptance ratio p(x) / q(x) therefore collapses to p(x) alone. Since q(x) ≤ 1, this can only reduce the chance of accepting a draft token, which leads to a significantly lower acceptance rate than theoretically expected.

# vllm/v1/sample/rejection_sampler.py
class RejectionSampler(nn.Module):
    def forward(
        ...
        output_token_ids = rejection_sample(
            metadata.draft_token_ids,
            metadata.num_draft_tokens,
            metadata.max_spec_len,
            metadata.cu_num_draft_tokens,
            draft_probs,
            target_logits,
            bonus_token_ids,
            sampling_metadata,
        )
        ...
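
To make the effect concrete, the following is a minimal sketch of the expected acceptance rate under both rules. It is a toy calculation in plain Python, not vLLM code: the function name `acceptance_rate` and the three-token distributions `p` (target) and `q` (draft) are assumptions chosen for illustration.

```python
def acceptance_rate(p, q, use_draft_probs=True):
    """Expected probability that a token sampled from q is accepted.

    p: target distribution, q: draft distribution (lists that sum to 1).
    With use_draft_probs=False the ratio uses 1 in place of q(x),
    mimicking the draft_prob = 1 fallback described above.
    """
    total = 0.0
    for p_x, q_x in zip(p, q):
        if q_x == 0.0:
            continue  # a token the draft never proposes contributes nothing
        denom = q_x if use_draft_probs else 1.0
        total += q_x * min(1.0, p_x / denom)
    return total


# Illustrative distributions (assumed values, not measurements).
p = [0.6, 0.3, 0.1]
q = [0.5, 0.4, 0.1]

with_q = acceptance_rate(p, q, use_draft_probs=True)
without_q = acceptance_rate(p, q, use_draft_probs=False)
print(f"with draft_probs: {with_q:.2f}, with draft_prob = 1: {without_q:.2f}")
```

With the ratio p(x) / q(x), each term reduces to min(p(x), q(x)), so a draft that is close to the target is accepted most of the time. With the denominator fixed at 1, each term is q(x) · p(x), which is much smaller unless the target is nearly deterministic.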

PR fix notes

PR #40269: [Bugfix][Spec Decode] Wire draft_probs into probabilistic draft_model rejection

Description (problem / solution / changelog)

Co-authored-by: OpenAI Codex

Purpose

Fixes #40149 by wiring draft-model proposal probabilities through the legacy V1 speculative decoding path when rejection_sample_method="probabilistic".

Previously, GPUModelRunner._sample() passed None for draft_probs, which forced the rejection sampler onto its no-draft-probs fallback instead of using the draft model’s actual proposal distribution. This change captures draft probabilities in the proposer, preserves them across the runner boundary, realigns them by request, and passes them into RejectionSampler so probabilistic rejection sampling can use the intended p(x) / q(x) logic for draft_model.
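
The "realign by request" step can be pictured with a small sketch. This is not the PR's code: the PR operates on tensors inside GPUModelRunner and its data layout is not shown here. The function `realign_draft_probs`, the per-request cache dictionary, and the request IDs below are all hypothetical, and plain Python lists stand in for tensors.

```python
def realign_draft_probs(cached, batch_req_ids, num_draft_tokens):
    """Build a flat [num_tokens, vocab_size]-style list in batch order.

    cached: dict mapping request id -> list of per-step probability rows
            captured when the drafts were proposed (hypothetical layout).
    batch_req_ids: request ids in the order used by the current batch.
    num_draft_tokens: number of draft tokens scheduled per request.
    """
    rows = []
    for req_id, n in zip(batch_req_ids, num_draft_tokens):
        req_rows = cached.get(req_id, [])
        if len(req_rows) < n:
            raise ValueError(f"missing draft probs for request {req_id!r}")
        # Keep only the rows for draft tokens that are actually verified.
        rows.extend(req_rows[:n])
    return rows


# Drafts were proposed in order (a, b); the batch is now ordered (b, a).
cached = {
    "a": [[0.9, 0.1], [0.2, 0.8]],
    "b": [[0.5, 0.5]],
}
aligned = realign_draft_probs(cached, ["b", "a"], [1, 1])
print(aligned)
```

The point of the sketch is only that row i of draft_probs must describe the same draft token as row i of the target logits; if the batch is reordered or a request verifies fewer tokens than were proposed, the cached rows have to be reordered and sliced to match.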

Test Plan

  • .venv/bin/python -m py_compile tests/v1/spec_decode/test_eagle.py tests/v1/worker/test_gpu_model_runner.py vllm/v1/spec_decode/eagle.py vllm/v1/worker/gpu_model_runner.py
  • .venv/bin/python -m pytest tests/v1/worker/test_gpu_model_runner.py -k reordered_draft_probs -v
  • .venv/bin/python -m pytest tests/v1/spec_decode/test_eagle.py -k probabilistic_draft_probs -v
  • Manual GPU validation on equivalent code:
    • compared baseline vs fixed probabilistic draft-model acceptance on Qwen/Qwen3-1.7B + Qwen/Qwen3-0.6B

Test Result

  • py_compile: passed
  • tests/v1/worker/test_gpu_model_runner.py -k reordered_draft_probs -v
    • verifies that runner-side cached draft_probs are reordered and sliced correctly before being passed to RejectionSampler
  • tests/v1/spec_decode/test_eagle.py -k probabilistic_draft_probs -v
    • verifies that the proposer captures the expected per-step draft probabilities in probabilistic mode
  • Manual GPU validation on an L40S with equivalent code showed consistent improvement in speculative acceptance:
    • run 1: acceptance_rate 0.2207 -> 0.4512, acceptance_len 1.6620 -> 2.3535
    • run 2: acceptance_rate 0.2207 -> 0.4491, acceptance_len 1.6620 -> 2.3474
    • run 3: acceptance_rate 0.2255 -> 0.4551, acceptance_len 1.6766 -> 2.3653


Changed files

  • tests/v1/spec_decode/test_eagle.py (modified, +102/-0)
  • tests/v1/worker/test_gpu_model_runner.py (modified, +37/-0)
  • vllm/v1/spec_decode/eagle.py (modified, +50/-3)
  • vllm/v1/worker/gpu_model_runner.py (modified, +52/-4)


🚀 The feature, motivation and pitch

I am reporting this as a feature request, but it can also be considered a bug, as the current implementation deviates from the intended speculative decoding logic.

When using draft_model with probabilistic rejection sampling in speculative decoding, the system should follow the distribution-matching logic defined in Leviathan et al. (2022). Specifically, tokens should be accepted or rejected based on the ratio $$p(x)/q(x)$$.

In the current vLLM v1 implementation, however, the GPUModelRunner._sample method in vllm/v1/worker/gpu_model_runner.py passes None as the draft_probs argument to the RejectionSampler. Notably, the docstring for RejectionSampler.forward explicitly states:

"""
Args:
    draft_probs (Optional[torch.Tensor]): 
        Probability distribution for the draft tokens. Shape is
        [num_tokens, vocab_size]. Can be None if probabilities are
        not provided, which is the case for ngram spec decode.
"""

However, even when using a model-based draft approach (not ngram), draft_probs is still being passed as None.

# vllm/v1/worker/gpu_model_runner.py
class GPUModelRunner(
    ...
    def __init__(
        ...
        self.rejection_sampler = RejectionSampler(self.sampler)
        ...

    def _sample(
        ...
        sampler_output = self.rejection_sampler(
            spec_decode_metadata,
            None,  # draft_probs
            logits,
            sampling_metadata,
        )
        return sampler_output

Because draft_probs is None, the rejection_random_sample_kernel defaults all draft_prob values to 1. This deviates from the expected probabilistic rejection sampling logic, leading to a significantly lower acceptance rate than theoretically expected.

# vllm/v1/sample/rejection_sampler.py
class RejectionSampler(nn.Module):
    def forward(
        ...
        output_token_ids = rejection_sample(
            metadata.draft_token_ids,
            metadata.num_draft_tokens,
            metadata.max_spec_len,
            metadata.cu_num_draft_tokens,
            draft_probs,
            target_logits,
            bonus_token_ids,
            sampling_metadata,
        )
        ...

def rejection_sample(
    ...
    # Rejection sampling for random sampling requests.
    rejection_random_sample_kernel[(batch_size,)](
        output_token_ids,
        cu_num_draft_tokens,
        draft_token_ids,
        draft_probs,
        target_probs,
        bonus_token_ids,
        recovered_token_ids,
        uniform_probs,
        is_greedy,
        max_spec_len,
        vocab_size,
        NO_DRAFT_PROBS=draft_probs is None,
    )
    return output_token_ids

# Kernel body; the [(batch_size,)] launch grid belongs to the call above,
# not to the definition.
def rejection_random_sample_kernel(
        ...
            if NO_DRAFT_PROBS:
                draft_prob = 1
        ...

We need to ensure that the actual logprobs or probabilities from the draft model are correctly captured and passed through the GPUModelRunner to the RejectionSampler.
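
For reference, a single verification step of the rejection rule from Leviathan et al. can be sketched as follows. This is a plain-Python illustration, not the vLLM kernel: `speculative_accept` and `simulate` are names made up for this sketch, and the distributions are assumed toy values. A draft token x sampled from q is accepted with probability min(1, p(x)/q(x)); on rejection, a replacement is drawn from the normalized residual max(0, p − q).

```python
import random


def speculative_accept(p, q, draft_token, rng):
    """Verify one draft token. Returns (token, accepted)."""
    p_x, q_x = p[draft_token], q[draft_token]
    if q_x > 0.0 and rng.random() < min(1.0, p_x / q_x):
        return draft_token, True
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    # If the residual is all zeros (p == q), fall back to p itself.
    weights = residual if sum(residual) > 0.0 else p
    token = rng.choices(range(len(p)), weights=weights, k=1)[0]
    return token, False


def simulate(p, q, num_trials, seed=0):
    """Empirical output frequencies and acceptance rate over many steps."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(num_trials):
        draft = rng.choices(range(len(q)), weights=q, k=1)[0]
        token, ok = speculative_accept(p, q, draft, rng)
        counts[token] += 1
        accepted += ok
    return [c / num_trials for c in counts], accepted / num_trials


p = [0.6, 0.3, 0.1]  # target distribution (assumed)
q = [0.5, 0.4, 0.1]  # draft distribution (assumed)
freqs, accept_rate = simulate(p, q, num_trials=20000)
print(freqs, accept_rate)
```

The empirical output frequencies should track p while most drafts are accepted, which is the behavior the issue asks for once the draft model's real probabilities reach the sampler.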

Alternatives

No response

Additional context

No response


Extent analysis

TL;DR

Pass the draft model's actual proposal probabilities (not None, and not log-probabilities) to the RejectionSampler so that probabilistic rejection sampling uses the intended p(x) / q(x) acceptance ratio.

Guidance

  • Identify where the draft_probs are calculated in the draft_model and ensure they are correctly passed to the GPUModelRunner.
  • Modify the GPUModelRunner._sample method to pass the actual draft_probs to the RejectionSampler instead of None.
  • Verify that the RejectionSampler is correctly using the provided draft_probs by checking the rejection_random_sample_kernel function.
  • Test the updated implementation to ensure it follows the distribution-matching logic defined in Leviathan et al. (2022).
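
One way to check the distribution-matching property from the last step, independent of vLLM, is to compute the output distribution of the rejection rule exactly rather than by sampling. The sketch below is a toy verification in plain Python; `output_distribution` and the example distributions are assumptions for illustration, not part of any vLLM test suite.

```python
def output_distribution(p, q):
    """Exact distribution of the emitted token under the p/q rejection rule.

    A draft x ~ q is accepted with probability min(1, p(x)/q(x)); otherwise
    a token is drawn from the normalized residual max(0, p - q).
    """
    # q(x) * min(1, p(x)/q(x)) simplifies to min(p(x), q(x)).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    out = []
    for a, r in zip(accept, residual):
        resampled = reject_mass * r / total if total > 0.0 else 0.0
        out.append(a + resampled)
    return out


p = [0.6, 0.3, 0.1]  # target distribution (assumed)
q = [0.5, 0.4, 0.1]  # draft distribution (assumed)
out = output_distribution(p, q)
max_error = max(abs(o - pi) for o, pi in zip(out, p))
print(out, max_error)
```

If the rule is implemented correctly, the emitted-token distribution equals the target distribution p for any draft distribution q, so max_error should be at floating-point noise level. An end-to-end test against the real sampler would need sampled statistics instead, but the same comparison against p applies.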

Example

# vllm/v1/worker/gpu_model_runner.py
class GPUModelRunner(
    ...
    def _sample(
        ...
        # Hypothetical accessor (not a real vLLM API). RejectionSampler expects
        # probabilities of shape [num_tokens, vocab_size], not log-probabilities.
        draft_probs = self.draft_model.get_draft_probs()
        sampler_output = self.rejection_sampler(
            spec_decode_metadata,
            draft_probs,
            logits,
            sampling_metadata,
        )
        return sampler_output

Notes

The exact implementation of passing the draft_probs to the RejectionSampler may vary depending on the specifics of the draft_model and the GPUModelRunner classes.

Recommendation

Apply the fix: modify the GPUModelRunner._sample method to pass the actual draft_probs to the RejectionSampler instead of None, so that probabilistic rejection sampling uses the draft model's proposal distribution. PR #40269 implements this change.
