vllm - [RFC]: Integrate MineDraft speculative decoding into vLLM

Sebastian-dong · 2026-03-24T12:13:36Z

### Motivation.

Current speculative decoding in vLLM executes drafting and verification sequentially, which places drafting on the critical path and limits achievable speedup.

MineDraft introduces a batch-parallel speculative decoding paradigm that overlaps drafting and verification across different batches, effectively hiding drafting latency.

This RFC aims to explore integrating this paradigm into vLLM to further improve throughput and latency without requiring model retraining.

### Proposed Change.

This RFC proposes integrating MineDraft, a batch-parallel speculative decoding framework, into vLLM.

Full design and details: [MineDraft.md](https://github.com/user-attachments/files/26212782/MineDraft.md)

### Summary

- Introduce a two-batch scheduling mechanism
- Overlap drafting and verification:
  - Batch A → verification
  - Batch B → drafting
- Alternate roles per decoding step

### Key Integration Points

- Scheduler: batch-level coordination
- LLMEngine: overlapping execution flow
- KV cache: avoid over-allocation for draft-only requests

### Compatibility

- Continuous batching
- PagedAttention
- Existing speculative decoding methods

### Feedback Period.

1–2 weeks for initial feedback. Happy to iterate quickly based on maintainer suggestions.

### CC List.

_No response_

### Any Other Things.

- I am not the author of MineDraft, but implemented a working version and validated its behavior.
- This feature can be implemented as an optional decoding mode and does not affect existing workflows.
- I plan to split the implementation into multiple smaller PRs if the design is accepted.
- Preliminary results (from the paper):
  - Throughput: up to +75%
  - Latency: up to -39%

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
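To make the motivation concrete, here is a back-of-the-envelope timing model. It is an illustrative assumption and is not taken from the MineDraft paper or from vLLM: if drafting takes `t_draft` and verification takes `t_verify` per decoding step, sequential speculative decoding pays roughly their sum, while an idealized overlap pays roughly the larger of the two. The function names and timings are hypothetical.

```python
def sequential_step_time(t_draft: float, t_verify: float) -> float:
    """Drafting sits on the critical path: both phases run back to back."""
    return t_draft + t_verify


def overlapped_step_time(t_draft: float, t_verify: float) -> float:
    """Idealized overlap: one batch drafts while the other is verified."""
    return max(t_draft, t_verify)


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, chosen only for illustration.
    t_draft, t_verify = 8.0, 20.0
    seq = sequential_step_time(t_draft, t_verify)
    ovl = overlapped_step_time(t_draft, t_verify)
    print(f"sequential: {seq} ms per step, overlapped: {ovl} ms per step")
```

In practice the gain may be smaller than this bound suggests, for example when both phases compete for the same hardware.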


GitHub issue: vllm-project/vllm#38003 (fetched 2026-04-08 01:22:02)



Fix Plan

To integrate MineDraft into vLLM, follow these steps:

1. Implement a two-batch scheduling mechanism in the Scheduler class.
2. Modify the LLMEngine class to overlap the drafting and verification execution flows.
3. Update the KV cache to avoid over-allocation for draft-only requests.

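A simplified, illustrative sketch of the two-batch scheduling mechanism (a toy class, not vLLM's actual Scheduler API):

class Scheduler:
    """Toy two-batch scheduler: one batch is verified while the other drafts."""

    def __init__(self):
        self.batch_a = []  # verified on the next step
        self.batch_b = []  # drafted on the next step

    def schedule(self, batch):
        # Fill batch A first; later batches go to batch B.
        if self.batch_a:
            self.batch_b = batch
        else:
            self.batch_a = batch

    def execute(self):
        # Verify batch A and draft batch B. The calls run sequentially here;
        # a real engine would overlap them.
        if self.batch_a:
            self.verify(self.batch_a)
        if self.batch_b:
            self.draft(self.batch_b)
        # Swap roles so the freshly drafted batch is verified on the next step.
        self.batch_a, self.batch_b = self.batch_b, self.batch_a

    def verify(self, batch):
        pass  # placeholder for target-model verification

    def draft(self, batch):
        pass  # placeholder for draft-model proposal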
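The overlap itself can be sketched separately. The following is a minimal, self-contained illustration that uses Python threads as a stand-in for concurrent execution; `verify`, `draft`, and `run_steps` are hypothetical placeholders, not vLLM or MineDraft functions, and a real engine would presumably overlap GPU work rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor


def verify(batch):
    """Placeholder for target-model verification of drafted tokens."""
    return [f"verified:{req}" for req in batch]


def draft(batch):
    """Placeholder for draft-model token proposal."""
    return [f"drafted:{req}" for req in batch]


def run_steps(batch_a, batch_b, num_steps):
    """Alternate roles each step: one batch is verified while the other drafts."""
    log = []
    verifying, drafting = batch_a, batch_b
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(num_steps):
            # Submit both phases so they can run concurrently.
            verify_future = pool.submit(verify, verifying)
            draft_future = pool.submit(draft, drafting)
            # Wait for both phases before moving to the next decoding step.
            log.append((verify_future.result(), draft_future.result()))
            # Swap roles: the batch that just drafted is verified next.
            verifying, drafting = drafting, verifying
    return log


if __name__ == "__main__":
    for step, (verified, drafted) in enumerate(run_steps(["a"], ["b"], 2)):
        print(step, verified, drafted)
```

Waiting on both futures at the end of each step keeps the two batches in lockstep, which is the simplest way to guarantee that a batch is only verified after its draft has finished.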

Verification

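To evaluate the integration, measure vLLM's throughput and latency with MineDraft enabled and compare them against the existing sequential speculative decoding on the same model and workload. The figures quoted in the issue body (up to +75% throughput, up to -39% latency) are "up to" results from the paper, so treat them as an upper bound rather than a target.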
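A small helper for reporting such a comparison might look like the sketch below. The function name and the sample numbers are hypothetical and are not measurements from the issue or the paper.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    baseline_tokens_per_s, candidate_tokens_per_s = 1000.0, 1500.0
    baseline_latency_ms, candidate_latency_ms = 50.0, 40.0
    throughput_delta = percent_change(baseline_tokens_per_s, candidate_tokens_per_s)
    latency_delta = percent_change(baseline_latency_ms, candidate_latency_ms)
    print(f"throughput: {throughput_delta:+.1f}%")
    print(f"latency: {latency_delta:+.1f}%")
```

A positive throughput change and a negative latency change both indicate an improvement over the baseline.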

Extra Tips

- Implement the integration as an optional decoding mode to avoid affecting existing workflows.
- Split the implementation into multiple smaller PRs for easier review and maintenance.
- Test the integration with continuous batching, PagedAttention, and existing speculative decoding methods to ensure compatibility.
