vllm - ✅(Solved) Fix [RFC]: Dynamic Speculation Length (DSL) with Confidence-Threshold Early Exit for vLLM Speculative Decoding [1 pull requests, 3 comments, 2 participants]

GitHub stats
vllm-project/vllm#36657, fetched 2026-04-08 00:35:35
Comments: 3 · Participants: 2 · Timeline events: 19 · Reactions: 2
Timeline (top): subscribed ×8, mentioned ×6, commented ×3, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #32374: [V1][Spec Decode] Add Dynamic SD

Description (problem / solution / changelog)

Why is Dynamic SD needed?

SD methods must verify K tokens for each sequence during decoding. As the batch size (BS) grows, the effective batch size becomes BS * K, which raises the compute cost of verification. Once BS * K exceeds a critical batch size, SD hurts TPOT. DSD lowers K to an optimal value so that SD keeps paying off.

Use cases

  • A deployment that may see high load. Here K goes down as the workload increases.
  • RL rollouts, which start with a high BS but end with a small BS because a few long-tail requests keep generating tokens and stall the current rollout. Here K goes up towards the end of the rollout.

What this PR does

Addresses https://github.com/vllm-project/vllm/issues/4565. V0 had Milestone 0; V1 had no form of Dynamic SD.

This PR implements something between Milestones 2 and 3 of Dynamic SD (DSD): it dynamically determines the proposed length for speculative decoding from runtime information (batch size and position-level acceptance rate) combined with profiled parameters (the token acceptance rate, used for cold start, and the relative cost of running the draft versus the target model). This lets the proposed length adjust in real time to current system conditions.

Before inference, the approach profiles the following on a representative dataset (similar to how the optimal K is selected for SD without DSD, by iterating on a representative dataset):

  1. the position level acceptance rate for solving the cold start problem
  2. cost of running draft and target model

During inference runtime, the optimal K is found using:

  1. the current batch size
  2. the average position-level acceptance rate (AR) that the system has seen so far. The system waits for warmup_steps before it starts using the measured AR; until then it uses the AR from offline profiling on the representative dataset.

This handles the cold start problem while still letting the system adapt to the running requests. There are many ways to extend this strategy, such as resetting the AR after some number of steps, but those are left for future work. The purpose of the PR is to have at least something working in vLLM.
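The warmup behaviour described above can be sketched as follows. This is a minimal illustration, not the PR's manager code: the class name `AcceptanceRateTracker`, its fields, and the `update` signature are hypothetical, and the per-position rate is assumed to be accepted drafts at that position divided by the number of drafts issued.

```python
from dataclasses import dataclass, field


@dataclass
class AcceptanceRateTracker:
    """Hypothetical helper: per-position acceptance rates with a warmup phase."""

    profiled_rates: list[float]  # offline rates, used for cold start
    warmup_steps: int            # steps to wait before trusting runtime stats
    steps: int = 0
    num_drafts: int = 0
    accepted_per_pos: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # One running counter per draft position.
        self.accepted_per_pos = [0] * len(self.profiled_rates)

    def update(self, num_drafts: int, accepted_per_pos: list[int]) -> None:
        """Record one engine step: drafts issued and acceptances per position."""
        self.steps += 1
        self.num_drafts += num_drafts
        for pos, count in enumerate(accepted_per_pos):
            self.accepted_per_pos[pos] += count

    def rates(self) -> list[float]:
        """Profiled rates during warmup, running averages afterwards."""
        if self.steps < self.warmup_steps or self.num_drafts == 0:
            return list(self.profiled_rates)
        return [count / self.num_drafts for count in self.accepted_per_pos]
```

With `warmup_steps=2`, the first call to `rates()` after one `update` still returns the profiled values; after the second `update` it returns the measured averages.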

The PR computes goodput similarly to TurboSpec, but changes the formula to make it simpler and easier to extend to future models. For a given BS and K: goodput = AL / ITL, where AL (acceptance length) is a function of K and ITL (inter-token latency) is a function of K and BS.
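As a worked illustration of goodput = AL / ITL, the sketch below plugs in numbers copied from the example config shown later in this description. The AL model is an assumption made for this example: one target token per step plus the sum of the per-position acceptance rates up to K. The function names are illustrative and are not taken from the PR.

```python
# ITL values in ms, keyed by batch size and then K, copied from the
# batch_stats field of the example config in this description.
ITL_MS = {
    1: {0: 6.520589930005372, 1: 7.367628160864115,
        3: 8.84066498838365, 5: 10.32649097032845},
    256: {0: 14.491415582597256, 1: 27.138127014040947,
          3: 41.848431108519435, 5: 57.40421102382243},
}
# acceptance_rate_per_pos from the same example config.
ACCEPTANCE_RATE_PER_POS = [
    0.6811801775995416,
    0.3914351188771126,
    0.20352334574620454,
    0.1014036092810083,
    0.051417931824692065,
]


def acceptance_length(k: int, rates: list[float]) -> float:
    """Assumed AL model: one target token plus the accepted drafts per step."""
    return 1.0 + sum(rates[:k])


def goodput(k: int, batch_size: int) -> float:
    """goodput = AL(K) / ITL(K, BS)."""
    return acceptance_length(k, ACCEPTANCE_RATE_PER_POS) / ITL_MS[batch_size][k]


def best_k(batch_size: int) -> int:
    """Pick the profiled K with the highest goodput for this batch size."""
    return max(ITL_MS[batch_size], key=lambda k: goodput(k, batch_size))
```

Under these assumptions the sketch selects K=3 at batch size 1 and K=0 at batch size 256, which matches the qualitative behaviour reported in the results: speculation helps at low BS and is switched off at high BS.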

TurboSpec, on the other hand, profiles the draft and target models separately and builds a regression model over the model config, KV cache size, and batch size to find goodput. This PR takes a simplified approach: the ITL (inter-token latency) of the SD model, i.e., target + draft, is measured directly across batch sizes, which already encapsulates the model config. This makes the setup easier to adapt when the model architecture changes (for example SWA, or a future change that would make the equation more complicated). The setup profiles a given set of batch sizes (BS) and numbers of draft tokens (K), then linearly interpolates between neighboring profiled values for every BS and K between their minimum and maximum. While simple, it works effectively, as shown in the results.
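The interpolation step can be sketched as below, using a subset of the profiled grid from the example config in this description. The helper names are hypothetical, and interpolating along K first and then along BS is an assumption of this sketch; the PR text only states that neighboring values are linearly interpolated.

```python
from bisect import bisect_right

# Subset of the profiled ITL grid (ms) from the example config: BS -> K -> ITL.
ITL_MS = {
    16: {1: 7.852344075217843, 3: 9.518282022327185},
    64: {1: 9.656429989263415, 3: 13.497876934707165},
}


def lerp(x: float, x0: float, x1: float, y0: float, y1: float) -> float:
    """Linear interpolation of y at x between (x0, y0) and (x1, y1)."""
    if x1 == x0:
        return y0
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def bracket(x: float, grid: list[int]) -> tuple[int, int]:
    """Return the two neighboring grid points around x (x already clamped)."""
    hi = min(bisect_right(grid, x), len(grid) - 1)
    lo = max(hi - 1, 0)
    return grid[lo], grid[hi]


def interpolated_itl(batch_size: float, k: float) -> float:
    """Estimate ITL at an unprofiled (BS, K) point inside the profiled range."""
    bs_grid = sorted(ITL_MS)
    k_grid = sorted(ITL_MS[bs_grid[0]])
    # Clamp to the profiled min/max so the sketch never extrapolates.
    bs = min(max(batch_size, bs_grid[0]), bs_grid[-1])
    kk = min(max(k, k_grid[0]), k_grid[-1])
    bs0, bs1 = bracket(bs, bs_grid)
    k0, k1 = bracket(kk, k_grid)
    at_bs0 = lerp(kk, k0, k1, ITL_MS[bs0][k0], ITL_MS[bs0][k1])
    at_bs1 = lerp(kk, k0, k1, ITL_MS[bs1][k0], ITL_MS[bs1][k1])
    return lerp(bs, bs0, bs1, at_bs0, at_bs1)
```

For instance, `interpolated_itl(40, 1)` falls halfway between the BS 16 and BS 64 measurements for K=1, and `interpolated_itl(16, 2)` falls halfway between the K=1 and K=3 measurements at BS 16.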

Results

Offline profiled on MTBench and Tested on MTBench

Setup: 1xH100, Llama 3.1 8B

| MTBench | Vanilla | EAGLE | Dynamic EAGLE |
|---------|---------|-------|---------------|
| BS 1    | 6.3     | 3.98  | 3.98          |
| BS 4    | 6.38    | 4.03  | 4.05          |
| BS 16   | 6.77    | 4.45  | 4.45          |
| BS 64   | 7.94    | 6.78  | 6.56          |
| BS 128  | 10.15   | 11.19 | 9.88          |
| BS 256  | 16.2    | 19.96 | 17.2          |

<img width="603" height="371" alt="image" src="https://github.com/user-attachments/assets/e002fd54-fe14-4c69-a549-db80a7e0d07e" />

The table above reports TPOT (ms). Lower is better.

As we can see,

  • At lower BS, DSD is equal to SD and both are better than vanilla
  • At higher BS, SD is worse than vanilla, while DSD is better than SD and closer to vanilla. However, DSD still pays the overhead of running the draft model for prefill, even when it assigns K=0 and the draft model is not used during decode. This is acceptable because BS can change later, so all tokens need to be prefilled in the draft model already.

Offline profiled on MTBench and Tested on InstructCoder

Profiled on MTBench

| InstructCoder | Vanilla | EAGLE | Dynamic EAGLE | Dynamic EAGLE with runtime AL |
|---------------|---------|-------|---------------|-------------------------------|
| BS 128        | 12.69   | 11.55 | 11.85         | 11.43                         |
| BS 256        | 21.19   | 21.5  | 21.07         | 21.07                         |
<img width="683" height="311" alt="image" src="https://github.com/user-attachments/assets/c8bc353c-a71f-4736-85e7-5fdffd4221ce" />

Here, "Dynamic EAGLE" does not use runtime AL at all. Adding runtime AL to the goodput calculation after the warmup period gives a minor improvement, so for this dataset the MTBench numbers transfer well to InstructCoder, while runtime AL helps the system adapt further to the current workload.

Cmds

Generate DSD Config

time python3 vllm/v1/spec_decode/dynamic/generate_config.py \
    --method eagle \
    --model-dir 'meta-llama/Llama-3.1-8B-Instruct' \
    --draft-dir 'yuhuili/EAGLE-LLaMA3.1-Instruct-8B' \
    --tp 1 \
    --temp 0 \
    --top-p 1.0 \
    --top-k -1 \
    --max-vllm-batch-size 256 \
    --batch-size-list 1 4 16 64 256 \
    --num-speculative-tokens-list 1 3 5 \
    --num-batches 20 \
    --dataset-name hf \
    --dataset-path 'philschmid/mt-bench' \
    --no-oversample \
    --result-dir './log/dynamic_sd_test'
<details> <summary>Example of `dynamic_speculative_config.json` generated</summary>
{
    "is_online": false,
    "batch_stats": {
        "1": {
            "0": 6.520589930005372,
            "1": 7.367628160864115,
            "3": 8.84066498838365,
            "5": 10.32649097032845
        },
        "4": {
            "0": 6.601515458896756,
            "1": 7.472813129425049,
            "3": 8.981170016340911,
            "5": 10.400271974503994
        },
        "16": {
            "0": 6.898819003254175,
            "1": 7.852344075217843,
            "3": 9.518282022327185,
            "5": 11.196403065696359
        },
        "64": {
            "0": 7.774091092869639,
            "1": 9.656429989263415,
            "3": 13.497876934707165,
            "5": 16.831180080771446
        },
        "256": {
            "0": 14.491415582597256,
            "1": 27.138127014040947,
            "3": 41.848431108519435,
            "5": 57.40421102382243
        }
    },
    "max_num_speculative_tokens": 5,
    "acceptance_rate_per_pos": [
        0.6811801775995416,
        0.3914351188771126,
        0.20352334574620454,
        0.1014036092810083,
        0.051417931824692065
    ]
}
</details>
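Because JSON object keys are always strings, a consumer of this file has to convert the batch sizes and K values back to integers before doing any lookup or interpolation. The loader below is a minimal sketch of that step; the function name and the two sanity checks are assumptions for illustration and do not describe how the PR's manager reads the file.

```python
import json


def load_dynamic_config(text: str) -> dict:
    """Parse a dynamic_speculative_config.json payload into typed values."""
    raw = json.loads(text)
    # Convert string keys ("1", "4", ...) back to integer BS and K.
    batch_stats = {
        int(bs): {int(k): float(itl) for k, itl in per_k.items()}
        for bs, per_k in raw["batch_stats"].items()
    }
    rates = [float(r) for r in raw["acceptance_rate_per_pos"]]
    max_k = int(raw["max_num_speculative_tokens"])
    # Assumed invariant: one acceptance rate per draft position.
    if len(rates) != max_k:
        raise ValueError("expected one acceptance rate per draft position")
    # Assumed invariant: no profiled K exceeds the configured maximum.
    for bs, per_k in batch_stats.items():
        if max(per_k) > max_k:
            raise ValueError(f"batch size {bs} profiles K beyond the maximum")
    return {
        "batch_stats": batch_stats,
        "acceptance_rate_per_pos": rates,
        "max_num_speculative_tokens": max_k,
    }
```

After loading, `config["batch_stats"][256][5]` style lookups work with integer keys, which is what interpolation over BS and K needs.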

Benchmark

We chose 20*MAX_CONCURRENCY as the number of prompts so that each setting runs at least 20 batches. Without this, because MTBench has only 80 samples, MAX_CONCURRENCY=1 would run 80 batches while MAX_CONCURRENCY=128 would run only 1 batch.

# vanilla
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 9001 \
  --no-enable-prefix-caching \
  --max-num-seqs 256

# Eagle
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 9001 \
  --speculative_config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}' \
  --no-enable-prefix-caching \
  --max-num-seqs 256

# Dynamic Eagle
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 9001 \
  --speculative_config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3, "dynamic_config_path": "log/dynamic_sd_test_2/tp-1_temp-0.0_top_p-1.0_top_k--1/philschmid/mt-bench/dynamic_speculative_config.json"}' \
  --no-enable-prefix-caching \
  --max-num-seqs 256

# change MAX_CONCURRENCY here.
MAX_CONCURRENCY=1
NUM_PROMPTS=$((MAX_CONCURRENCY * 20))  
time vllm bench serve --port 9001 --save-result --save-detailed \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts ${NUM_PROMPTS} \
    --max-concurrency ${MAX_CONCURRENCY} \
    --result-dir "./log/EAGLE-1"

File changes:

  • vllm/v1/spec_decode/dynamic/generate_config.py is the main script. It schedules the other scripts and produces the config that DSD uses at runtime. It has these stages:
    • Step 1: Uses the offline script to get the AL across different positions.
      • vllm/v1/spec_decode/offline.py is used for this. It is offline_inference/spec_decode.py, moved into vllm/ so that it can be imported here. The same script is also used by tests in CI, so it is an important file.
    • Step 2: Runs profiling to get the ITL across different BS and K using vllm bench sweep.
    • Step 3: Parses the values generated for each BS and K and collates the ITL from them into a config value.
    • Step 4: Saves the Dynamic SD config as a config file.
  • Adds the config class DynamicSpeculativeConfig in vllm/config/speculative.py, which holds the config values during DSD profiling. It also holds the path to the config values.
  • vllm/v1/spec_decode/dynamic/manager.py is the Dynamic SD Manager. It reads the ITL from the DynamicSpeculativeConfig generated above, computes the optimal K for each BS by interpolating across the profiled K and BS values, and provides it to the SD method during proposal.
  • vllm/v1/worker/gpu_model_runner.py initializes the DSD Manager and provides the optimal K for the given BS to the respective SD method during inference.
  • Introduces spec_decoding_stats_all in the scheduler, which collects the stats. dynamic/manager.py uses it to compute the AR and switches to the updated values after warmup_steps.

After Async scheduling and padded drafter compatibility

Results

<img width="910" height="490" alt="image" src="https://github.com/user-attachments/assets/9120c672-4614-4935-9fa2-b0dcf7b5f902" />

The results are similar to those with synchronous scheduling.

File changes for async and padded drafter

vllm/v1/core/sched/async_scheduler.py

Problem: With async scheduling, when dynamic SD changes the optimal K (e.g., from 5 to 3), there is a pipeline latency issue: the scheduler has already committed accounting (num_computed_tokens, num_output_placeholders) for the in-flight batch using the old K.

Solution: two new pieces of state:

  • _pending_optimal_k: int | None stores the optimal K from the model output, deferred until the next schedule() call.
  • _in_flight_decode_req_k: dict[str, int] maps req_id -> committed spec token count for decode requests in the most recently dispatched batch. It is used to know exactly which requests need accounting correction and by how much.

New method _apply_pending_dynamic_sd_update(): Called at the start of schedule(). Applies the deferred K update:

  • Updates _spec_token_placeholders to the new K length (controls how many spec positions the scheduler reserves for future batches → reduces KV block waste).
  • Corrects the in-flight batch's over-committed accounting: for each request in _in_flight_decode_req_k, computes diff = committed_k - optimal_k. If diff > 0 (K decreased), subtracts diff from request.num_output_placeholders and request.num_computed_tokens. If diff <= 0 (K increased), just updates request.spec_token_ids for the next scheduling step (can't retroactively add tokens to an in-flight batch).

Override schedule(): Calls _apply_pending_dynamic_sd_update() and then delegates to super().schedule().

Modified _update_after_schedule(): Resets and populates _in_flight_decode_req_k with req_id -> cur_num_spec_tokens for each non-prefill decode request that was just committed with spec tokens > 0.
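The accounting correction described above can be sketched in isolation. The `Request` class here is a hypothetical stand-in that carries only the fields named in the description, and filling `spec_token_ids` with placeholder ids of length `optimal_k` when K does not decrease is an assumption; the description only says the field is updated for the next scheduling step.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical stand-in holding only the fields named in the description."""

    num_computed_tokens: int
    num_output_placeholders: int
    spec_token_ids: list[int] = field(default_factory=list)


def apply_pending_update(
    requests: dict[str, Request],
    in_flight_k: dict[str, int],
    optimal_k: int,
    placeholder_id: int = -1,
) -> list[int]:
    """Correct in-flight accounting after dynamic SD picks a new K."""
    for req_id, committed_k in in_flight_k.items():
        request = requests[req_id]
        diff = committed_k - optimal_k
        if diff > 0:
            # K decreased: the in-flight batch reserved `diff` tokens too many.
            request.num_output_placeholders -= diff
            request.num_computed_tokens -= diff
        else:
            # K unchanged or increased: only the next step can use the new K.
            request.spec_token_ids = [placeholder_id] * optimal_k
    # Placeholder list sized to the new K, used when reserving future batches.
    return [placeholder_id] * optimal_k
```

For a request committed with K=5 and a new optimal K of 3, both counters drop by 2, which mirrors the diff = committed_k - optimal_k rule in the description.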

vllm/v1/worker/gpu_model_runner.py

Problem: the model runner still processes (and rejects) zero-padded speculative tokens beyond the optimal K, wasting compute. The SchedulerOutput seen by the model runner still contains the old (larger) K from when the batch was scheduled.

Solution: New method _trim_spec_tokens_for_dynamic_sd(scheduler_output): Trims scheduled_spec_decode_tokens in-place to match self._optimal_num_speculative_tokens. The trim applies to each request where scheduled_k > optimal_k.

Modified _update_states(): Inserted a call to _trim_spec_tokens_for_dynamic_sd(scheduler_output) before the ngram_gpu handling block. Conditioned on _optimal_num_speculative_tokens is not None and use_async_scheduling and scheduled_spec_tokens. This ordering ensures original_num_spec_per_req (saved for ngram_gpu's prev_num_draft_len restoration) is based on the dynamically-trimmed K rather than the over-allocated K.

Modified take_draft_token_ids(): When dynamic SD reduced K below num_spec_tokens, truncates each request's draft token list to k entries (the GPU tensor is zero-padded to num_spec_tokens for scatter indexing, but the scheduler should only see real draft tokens).
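The two trimming steps can be illustrated with plain Python lists. This is a simplified sketch: the function names are hypothetical, and lists of ints stand in for the scheduler output and the zero-padded GPU tensor that the real model runner works with.

```python
def trim_scheduled_spec_tokens(
    scheduled: dict[str, list[int]], optimal_k: int
) -> None:
    """Trim each request's scheduled spec tokens in place to at most optimal_k."""
    for tokens in scheduled.values():
        if len(tokens) > optimal_k:
            del tokens[optimal_k:]


def take_real_draft_tokens(padded_rows: list[list[int]], k: int) -> list[list[int]]:
    """Drop the zero padding: keep only the first k draft tokens of each row."""
    return [row[:k] for row in padded_rows]
```

Requests that were scheduled with fewer than `optimal_k` tokens are left unchanged by the first function, and the second returns only the real draft tokens that the scheduler should see.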

PENDING (some of them can be done in future PRs):

  • use online AL to refine the goodput after warmup
  • While this PR only tested EAGLE-1, it can be extended to other methods like EAGLE-3 etc
  • Probably vllm sweep can be used instead of the newly added profiling_client.py and profiling_server.py
  • padded drafter
  • async scheduling
  • add some tests

Changed files

  • tests/v1/spec_decode/test_eagle.py (modified, +2/-0)
  • tests/v1/spec_decode/test_extract_hidden_states.py (modified, +2/-0)
  • tests/v1/spec_decode/test_mtp.py (modified, +1/-0)
  • tests/v1/spec_decode/test_ngram.py (modified, +10/-0)
  • vllm/config/speculative.py (modified, +51/-0)
  • vllm/v1/core/sched/async_scheduler.py (modified, +91/-0)
  • vllm/v1/core/sched/output.py (modified, +5/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +11/-0)
  • vllm/v1/outputs.py (modified, +4/-0)
  • vllm/v1/spec_decode/dynamic/generate_config.py (added, +461/-0)
  • vllm/v1/spec_decode/dynamic/manager.py (added, +179/-0)
  • vllm/v1/spec_decode/eagle.py (modified, +14/-0)
  • vllm/v1/spec_decode/extract_hidden_states.py (modified, +6/-1)
  • vllm/v1/spec_decode/medusa.py (modified, +3/-0)
  • vllm/v1/spec_decode/metrics.py (modified, +4/-0)
  • vllm/v1/spec_decode/ngram_proposer.py (modified, +9/-1)
  • vllm/v1/spec_decode/ngram_proposer_gpu.py (modified, +3/-0)
  • vllm/v1/spec_decode/suffix_decoding.py (modified, +2/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +91/-2)


Motivation.

Standard speculative decoding always runs the draft model for exactly num_speculative_tokens steps per decode iteration. When the draft model is uncertain early in the speculation chain, every additional step wastes compute: the low-confidence tokens are almost certainly rejected by the target model, and the draft forward passes are lost work.

Setting num_speculative_tokens conservatively (e.g., 5) leaves throughput on the table during high-confidence sequences. Setting it aggressively (e.g., 20) wastes compute during low-confidence sequences. DSL resolves this tension by adapting the draft length at runtime, inspired by DISCO (Mamou et al., 2024).

Proposed Change.

PR: https://github.com/vllm-project/vllm/pull/35301

Add a draft_confidence_threshold field (default 0.0, disabled) to SpeculativeConfig. When enabled, the DraftModelProposer evaluates a batch-level mean exit condition at each draft step $i$:

$$\frac{1}{B} \sum_j \max_v \text{softmax}(z_{i,j}) < \tau$$

If the condition holds, drafting stops early and the tokens generated so far are returned. The output tensor is zero-padded to the full num_speculative_tokens width so that all downstream consumers (GDN attention, scheduler, KV-cache) see a fixed-shape tensor; zero-padded positions are rejected by the verifier with no effect on output quality.

Key design choices

  • Mean policy: the exit condition uses the batch-mean confidence, not the minimum, so a single uncertain request in a mixed batch does not short-circuit drafting for the whole batch.
  • No interface change: propose() always returns shape [batch_size, num_speculative_tokens].
  • New DSL metrics (dsl_num_proposals, dsl_num_early_exits, dsl_tokens_generated, dsl_tokens_requested) propagated end-to-end from the draft model through ModelRunnerOutput, scheduler, and benchmark output.
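The difference between the mean policy and a minimum policy can be seen on a small mixed batch. The sketch below uses pure Python instead of tensors, and the function names are illustrative rather than taken from the PR.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidences(batch_logits: list[list[float]]) -> list[float]:
    """Per-request confidence: the largest softmax probability."""
    return [max(softmax(row)) for row in batch_logits]


def should_exit_mean(batch_logits: list[list[float]], tau: float) -> bool:
    """Batch-mean exit condition from the RFC: mean confidence < tau."""
    conf = confidences(batch_logits)
    return sum(conf) / len(conf) < tau


def should_exit_min(batch_logits: list[list[float]], tau: float) -> bool:
    """Alternative min policy, shown only for comparison."""
    return min(confidences(batch_logits)) < tau


# Three confident requests and one uncertain request in the same batch.
MIXED_BATCH = [[5.0, 0.0, 0.0]] * 3 + [[0.0, 0.0, 0.0]]
```

With τ = 0.6, the mean confidence of `MIXED_BATCH` is about 0.82, so the mean policy keeps drafting, whereas the min policy would stop because the single uncertain request has confidence 1/3.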

Usage

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "num_speculative_tokens": 20,
        "draft_confidence_threshold": 0.6,  # enables DSL
    },
)

Or via CLI:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 20 \
  --draft-confidence-threshold 0.6

Benchmark highlights (1× A100 80 GB PCIe)

| Model pair | Dataset | Config | vs. target | vs. SD/5 |
|------------|---------|--------|------------|----------|
| Llama-3.1-8B + Llama-3.2-1B | mt-bench c=1 | DSL/20/τ=0.6 | 1.77× | 1.44× |
| OPT-6.7B + OPT-125M | mt-bench c=1 | DSL/10/τ=0.6 | 3.10× | 1.64× |
| Llama-3.1-8B + Llama-3.2-1B | random c=8 | DSL/15/τ=0.8 | 3.60× | 1.52× |
| Llama-3.1-8B + Llama-3.2-1B | mt-bench c=4 | DSL/20/τ=0.4 | 1.32× | 1.63× |

Feedback Period.

until 2026-03-17

CC List.

@xuechendi @yao-matrix @bigPYJ1151 @keyboardAnt @orenpereg

Any Other Things.

  • Known limitation — GPU→CPU sync: the early-exit check calls .item() on a GPU scalar, forcing a host–device sync per draft step. This is negligible at small batch sizes but may become a bottleneck at high concurrency. A device-side sync-free approach is left as future work.
  • Threshold is not auto-tuned: recommended starting range is 0.4–0.6; optimal value is model-pair and dataset dependent.
  • Scope: DSL is supported only by DraftModelProposer (standard draft-model speculation) in this PR. SpecDecodeBaseProposer exposes a _supports_dsl = False class flag; DraftModelProposer opts in by setting it to True. EAGLE/EAGLE2 and other proposers are unaffected.


extent analysis

Fix Plan

To implement the Dynamic Speculation Length (DSL) feature, follow these steps:

  • Add a draft_confidence_threshold field to SpeculativeConfig with a default value of 0.0.
  • Modify the DraftModelProposer to evaluate a batch-level mean exit condition at each draft step.
  • Implement the early-exit check using the draft_confidence_threshold value.

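Example code (a simplified sketch, not the PR's implementation; draft_step is a hypothetical callable standing in for one draft-model forward pass):

import torch

class SpeculativeConfig:
    def __init__(self, num_speculative_tokens, draft_confidence_threshold=0.0):
        self.num_speculative_tokens = num_speculative_tokens
        # 0.0 disables early exit: the mean confidence is never below 0.
        self.draft_confidence_threshold = draft_confidence_threshold

class DraftModelProposer:
    def __init__(self, speculative_config, draft_step):
        self.speculative_config = speculative_config
        # draft_step(tokens) -> logits of shape [batch_size, vocab_size].
        self.draft_step = draft_step

    def propose(self, last_tokens):
        k = self.speculative_config.num_speculative_tokens
        tau = self.speculative_config.draft_confidence_threshold
        # Fixed-shape output; positions after an early exit stay zero-padded.
        draft = torch.zeros(last_tokens.shape[0], k, dtype=torch.long,
                            device=last_tokens.device)
        tokens = last_tokens
        for i in range(k):
            probs = torch.softmax(self.draft_step(tokens), dim=-1)
            confidence, tokens = probs.max(dim=-1)
            # Batch-mean exit condition; .item() forces a GPU->CPU sync.
            # Assumption: the token from the low-confidence step is dropped.
            if confidence.mean().item() < tau:
                break
            draft[:, i] = tokens
        return draft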

Verification

To verify the fix, test the DraftModelProposer with different draft_confidence_threshold values and batch sizes. Monitor the dsl_num_proposals, dsl_num_early_exits, dsl_tokens_generated, and dsl_tokens_requested metrics to ensure the DSL feature is working correctly.
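One way to read the four counters during verification is to turn them into ratios. The helper below is a sketch under the assumption that the counters mean what their names suggest (proposals made, proposals that exited early, draft tokens actually generated, and draft tokens that would have been generated without early exit); the derived metric names are invented for this example.

```python
def summarize_dsl(
    num_proposals: int,
    num_early_exits: int,
    tokens_generated: int,
    tokens_requested: int,
) -> dict[str, float]:
    """Derive simple ratios from the DSL counters (names are illustrative)."""
    if num_proposals == 0 or tokens_requested == 0:
        return {
            "early_exit_rate": 0.0,
            "draft_fraction_generated": 0.0,
            "mean_tokens_per_proposal": 0.0,
        }
    return {
        # Share of proposals that stopped before num_speculative_tokens.
        "early_exit_rate": num_early_exits / num_proposals,
        # Share of the requested draft tokens that were actually generated.
        "draft_fraction_generated": tokens_generated / tokens_requested,
        # Average effective draft length per proposal.
        "mean_tokens_per_proposal": tokens_generated / num_proposals,
    }
```

An early-exit rate of zero with a non-zero threshold would suggest that the threshold is too low to ever trigger, while a rate near one suggests that num_speculative_tokens is rarely reached.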

Extra Tips

  • The optimal draft_confidence_threshold value is model-pair and dataset dependent. Experiment with different values to find the best threshold for your specific use case.
  • Be aware of the known limitation regarding GPU→CPU sync, which may become a bottleneck at high concurrency. Consider implementing a device-side sync-free approach in the future.
