[RFC]: Integrate MineDraft speculative decoding into vLLM

vllm-project/vllm#38003 · Fetched 2026-04-08 01:22:02
Comments: 0 · Participants: 1 · Timeline events: 1 (labeled ×1) · Reactions: 0

Motivation.

Current speculative decoding in vLLM executes drafting and verification sequentially, which places drafting on the critical path and limits achievable speedup.

MineDraft introduces a batch-parallel speculative decoding paradigm that overlaps drafting and verification across different batches, effectively hiding drafting latency.
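To make the "hiding drafting latency" argument concrete, here is a deliberately simplified cost model. It assumes perfect overlap with no resource contention and ignores how splitting requests into two batches changes per-batch cost; the function names are illustrative and the model is not taken from the MineDraft paper.

```python
def sequential_step_time(t_draft: float, t_verify: float) -> float:
    """Drafting and verification run back to back, so both are on the critical path."""
    return t_draft + t_verify


def overlapped_step_time(t_draft: float, t_verify: float) -> float:
    """Idealized overlap: the slower of the two phases bounds the step."""
    return max(t_draft, t_verify)


def ideal_speedup(t_draft: float, t_verify: float) -> float:
    """Upper bound on the per-step speedup under the assumptions above."""
    return sequential_step_time(t_draft, t_verify) / overlapped_step_time(t_draft, t_verify)
```

Under this model, drafting is fully hidden only when it takes no longer than verification; with equal phase times the bound is 2x, and it shrinks as drafting becomes a smaller share of the step.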

This RFC aims to explore integrating this paradigm into vLLM to further improve throughput and latency without requiring model retraining.

Proposed Change.

This RFC proposes integrating MineDraft, a batch-parallel speculative decoding framework, into vLLM.

Full design and details:

MineDraft.md

Summary

  • Introduce a two-batch scheduling mechanism
  • Overlap drafting and verification:
    • Batch A → verification
    • Batch B → drafting
  • Alternate roles per decoding step
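The role alternation in the summary can be sketched as a small generator. This is only an illustration of the swap pattern; `alternate_roles` is a hypothetical name and not part of vLLM or MineDraft.

```python
from itertools import islice
from typing import Iterator, Tuple


def alternate_roles() -> Iterator[Tuple[str, str]]:
    """Yield (role of batch A, role of batch B) for each decoding step."""
    a_verifies = True  # step 0: batch A is verified while batch B drafts
    while True:
        yield ("verify", "draft") if a_verifies else ("draft", "verify")
        a_verifies = not a_verifies


# Roles for the first four decoding steps.
schedule = list(islice(alternate_roles(), 4))
```

Each batch is verified on one step and drafts on the next, so every batch's drafting runs while the other batch occupies the verification phase.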

Key Integration Points

  • Scheduler: batch-level coordination
  • LLMEngine: overlapping execution flow
  • KV cache: avoid over-allocation for draft-only requests

Compatibility

  • Continuous batching
  • PagedAttention
  • Existing speculative decoding methods

Feedback Period.

1–2 weeks for initial feedback. Happy to iterate quickly based on maintainer suggestions.

CC List.

No response

Any Other Things.

  • I am not the author of MineDraft, but implemented a working version and validated its behavior.
  • This feature can be implemented as an optional decoding mode and does not affect existing workflows.
  • I plan to split the implementation into multiple smaller PRs if the design is accepted.
  • Preliminary results (from the paper):
    • Throughput: up to +75%
    • Latency: up to -39%

extent analysis

Fix Plan

To integrate MineDraft into vLLM, follow these steps:

  • Implement a two-batch scheduling mechanism in the Scheduler class.
  • Modify the LLMEngine class to overlap drafting and verification execution flows.
  • Update the KV cache to avoid over-allocation for draft-only requests.

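Example sketch of the two-batch scheduling mechanism. The verify and draft methods are placeholders, not vLLM APIs, and the calls below run one after the other; a real implementation would launch them concurrently.

class Scheduler:
    """Toy two-batch scheduler that swaps batch roles every step."""

    def __init__(self):
        self.batch_a = []
        self.batch_b = []
        self.a_verifies = True  # which batch is verified on the next step

    def schedule(self, batch):
        # Fill batch A first, then batch B.
        if self.batch_a:
            self.batch_b = batch
        else:
            self.batch_a = batch

    def verify(self, batch):
        # Placeholder: run the target model on the batch's drafted tokens.
        pass

    def draft(self, batch):
        # Placeholder: run the draft model to propose tokens for the batch.
        pass

    def execute(self):
        # One batch is verified while the other drafts; roles swap each step.
        if self.a_verifies:
            self.verify(self.batch_a)
            self.draft(self.batch_b)
        else:
            self.draft(self.batch_a)
            self.verify(self.batch_b)
        self.a_verifies = not self.a_verifies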

Verification

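To verify the integration, measure throughput and latency with MineDraft enabled and compare them against a baseline run of vLLM's existing sequential speculative decoding on the same model, hardware, and workload. The figures quoted in the issue body (up to +75% throughput and up to -39% latency) are best-case results reported in the paper, so treat them as upper bounds rather than expected outcomes.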
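A small helper for reporting the comparison as a percentage change might look like the following. The measurement values are placeholders chosen for illustration, not benchmark results, and the variable names are assumptions.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Return the change of candidate relative to baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline * 100.0


# Placeholder measurements, NOT real benchmark results.
baseline_tokens_per_s, minedraft_tokens_per_s = 1000.0, 1500.0
baseline_latency_s, minedraft_latency_s = 2.0, 1.5

throughput_delta = relative_change(baseline_tokens_per_s, minedraft_tokens_per_s)
latency_delta = relative_change(baseline_latency_s, minedraft_latency_s)
```

A positive throughput delta and a negative latency delta both indicate an improvement over the baseline.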

Extra Tips

  • Implement the integration as an optional decoding mode to avoid affecting existing workflows.
  • Split the implementation into multiple smaller PRs for easier review and maintenance.
  • Test the integration with continuous batching, PagedAttention, and existing speculative decoding methods to ensure compatibility.
