Fix Action

Fix / Workaround

A recent blog post presents DFlash as a potentially superior speculative decoding method compared to Eagle3. Speculators can now train a DFlash model, but vLLM cannot yet load speculators-produced DFlash models without conversion. Similar algorithms such as Eagle3 already have speculators support, so users can serve a speculators model with a single command:

vllm serve RedHatAI/Qwen3-235B-A22B-Instruct-2507-speculator.eagle3

We would like the same for dflash models as well.

PR fix notes

PR #38300: [Speculative Decoding] Add DFlash speculators config parsing

Repository: vllm-project/vllm
Author: ZhanqiuHu
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/38300

Description (problem / solution / changelog)

Summary

- Adds DFlash speculators config parsing support (algos.py)
- Allows user --speculative-config to override auto-detected values (config.py)
- Updates qwen3_dflash.py weight loading: d2t/t2d/verifier handling (similar to Eagle3 patterns)
- Adds E2E test for DFlash speculators auto-detect path

Closes #38240
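
The override behaviour described in the summary can be pictured as a dictionary merge in which user-supplied keys win over auto-detected ones. The following is a minimal sketch under that assumption, not the code in config.py: `merge_speculative_config` and the example values are hypothetical.

```python
from typing import Optional


def merge_speculative_config(auto_detected: dict, user_supplied: Optional[dict]) -> dict:
    """Return auto-detected settings with user-supplied keys taking precedence.

    Hypothetical helper for illustration; it is not part of vLLM.
    """
    merged = dict(auto_detected)  # copy so the detected values stay untouched
    for key, value in (user_supplied or {}).items():
        if value is not None:  # treat None as "not set by the user"
            merged[key] = value
    return merged


# Example values only: what auto-detection might yield for a DFlash checkpoint.
auto = {"method": "dflash", "num_speculative_tokens": 8}
# What a user might pass through --speculative-config.
user = {"num_speculative_tokens": 4}

merged = merge_speculative_config(auto, user)
print(merged)
```

Copying the detected values before merging keeps the two sources distinguishable, which helps when logging which settings came from the checkpoint and which came from the command line.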

Test Results (`shanjiaz/speculators-dflash-format`, Qwen3-8B target)

GSM8K Correctness (1319 questions, 5-shot, batched)

Accuracy: 0.885 (Qwen3-8B baseline: ~0.87-0.92)
Mean AL: 1.84

Magpie Acceptance Rates (200 prompts, batch-size-1)

Mean AL: 1.77 (min 1.45, max 2.09)
Per-position acceptance rate: [0.478, 0.181, 0.069, 0.023, 0.007, 0.002, 0.001, 0.000]
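
The per-position rates and the mean acceptance length are related: each draft step yields one token from the target model plus the accepted draft tokens, so the aggregate acceptance length is one plus the sum of the per-position rates. The short calculation below reproduces this from the reported numbers. It assumes that a position is only accepted when all earlier positions were accepted, which is what makes the ratio of neighbouring rates readable as a conditional acceptance rate.

```python
# Per-position acceptance rates reported for the Magpie run.
per_position_rates = [0.478, 0.181, 0.069, 0.023, 0.007, 0.002, 0.001, 0.000]

# One guaranteed token per draft step, plus the expected number of accepted drafts.
aggregate_al = 1 + sum(per_position_rates)

# Conditional rate: given that position i-1 was accepted, how often is position i accepted?
# Assumes acceptance at a position implies acceptance of every earlier position.
conditional = [per_position_rates[0]] + [
    cur / prev if prev > 0 else 0.0
    for prev, cur in zip(per_position_rates, per_position_rates[1:])
]

print(f"aggregate acceptance length: {aggregate_al:.3f}")
print("conditional acceptance rates:", [round(c, 3) for c in conditional])
```

The aggregate value comes out at about 1.76, slightly below the reported mean of 1.77. A small gap is expected because the reported mean averages per-prompt acceptance lengths, whereas the per-position rates are computed over all drafts at once.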

@shanjiaz @fynnsu Ready for review. Needs confirmation on qwen3_dflash.py changes.

Magpie validation script

"""
Test DFlash speculators auto-detect path.

Loads a speculators-format model directly from HF (no config patching)
and measures acceptance length on the magpie dataset.

Usage:
    chg run --gpus 1 -- python my_wip/dflash_speculators/test_speculators_path.py
"""

import os

from tqdm import tqdm

from vllm import LLM, SamplingParams
from vllm.v1.metrics.reader import Metric

DEFAULT_MODEL = "shanjiaz/speculators-dflash-format"


def _metric_map(metrics):
    return {m.name: m.value for m in metrics if hasattr(m, "value")}


def compute_acceptance_len(
    metrics: list[Metric], prev_metrics: list[Metric] | None = None
) -> float:
    name2metric = _metric_map(metrics)
    n_drafts = name2metric["vllm:spec_decode_num_drafts"]
    n_accepted = name2metric["vllm:spec_decode_num_accepted_tokens"]
    if prev_metrics is not None:
        prev = _metric_map(prev_metrics)
        n_drafts -= prev["vllm:spec_decode_num_drafts"]
        n_accepted -= prev["vllm:spec_decode_num_accepted_tokens"]
    if n_drafts <= 0:
        return 1.0
    return 1 + (n_accepted / n_drafts)


def load_magpie_dataset(max_prompts=200):
    from datasets import load_dataset
    ds = load_dataset("Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1", split="train")
    prompts = []
    for i, row in enumerate(ds):
        if i >= max_prompts:
            break
        if "instruction" in row and row["instruction"]:
            prompts.append(row["instruction"])
        elif "conversations" in row and row["conversations"]:
            for turn in row["conversations"]:
                if turn.get("role") == "user" or turn.get("from") == "human":
                    prompts.append(turn.get("content", turn.get("value", "")))
                    break
    return prompts


def main():
    model_path = os.environ.get("SPEC_MODEL_PATH", DEFAULT_MODEL)

    print(f"\nLoading model via speculators auto-detect: {model_path}")
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        max_model_len=4096,
        max_num_seqs=128,
        gpu_memory_utilization=0.85,
        enforce_eager=False,
        disable_log_stats=False,
    )

    sc = llm.llm_engine.vllm_config.speculative_config
    print(f"Detected: method={sc.method}, "
          f"num_speculative_tokens={sc.num_speculative_tokens}, "
          f"draft_model={sc.model}")

    tokenizer = llm.get_tokenizer()

    print("\nLoading magpie dataset...")
    prompts_raw = load_magpie_dataset(max_prompts=200)
    print(f"Loaded {len(prompts_raw)} prompts")

    sampling_params = SamplingParams(temperature=0, max_tokens=2048)

    prev_metrics = None
    acceptance_lengths = []

    for i in tqdm(range(len(prompts_raw)), desc="Processing magpie"):
        prompt_text = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompts_raw[i]}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,
        )

        llm.generate([prompt_text], sampling_params, use_tqdm=False)
        current_metrics = llm.get_metrics()
        al = compute_acceptance_len(current_metrics, prev_metrics)
        prev_metrics = current_metrics
        acceptance_lengths.append(al)

    mean_al = sum(acceptance_lengths) / len(acceptance_lengths)

    print("\n" + "=" * 60)
    print("RESULTS — Speculators Auto-Detect Path")
    print(f"Model: {model_path}")
    print(f"Dataset: magpie ({len(prompts_raw)} prompts)")
    print(f"Mean Acceptance Length: {mean_al:.3f}")
    print(f"Min AL: {min(acceptance_lengths):.3f}")
    print(f"Max AL: {max(acceptance_lengths):.3f}")

    final_metrics = llm.get_metrics()
    name2val = _metric_map(final_metrics)
    n_drafts = name2val.get("vllm:spec_decode_num_drafts", 0)
    per_pos_rates = []
    if n_drafts > 0:
        for m in final_metrics:
            if hasattr(m, "values") and "per_pos" in m.name:
                per_pos_rates = [v / n_drafts for v in m.values]
                break
    if per_pos_rates:
        rate_strs = ", ".join(f"{r:.3f}" for r in per_pos_rates)
        print(f"Per-position acceptance rate: [{rate_strs}]")

    print("=" * 60)

    print("\nFinal aggregate metrics:")
    for key in sorted(name2val):
        if "spec_decode" in key:
            print(f"  {key}: {name2val[key]}")


if __name__ == "__main__":
    main()
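
The script measures acceptance length per prompt by diffing cumulative counters between two metric snapshots. The sketch below isolates that logic so it can be run without a GPU. `FakeCounter` and `snapshot` are hypothetical stand-ins for vLLM's metric objects, and the counter values are made up for illustration.

```python
from dataclasses import dataclass

DRAFTS = "vllm:spec_decode_num_drafts"
ACCEPTED = "vllm:spec_decode_num_accepted_tokens"


@dataclass
class FakeCounter:
    """Hypothetical stand-in for a vLLM counter metric (name plus cumulative value)."""
    name: str
    value: int


def acceptance_len(metrics, prev_metrics=None):
    """Acceptance length over the window between two cumulative snapshots."""
    cur = {m.name: m.value for m in metrics}
    n_drafts = cur[DRAFTS]
    n_accepted = cur[ACCEPTED]
    if prev_metrics is not None:
        prev = {m.name: m.value for m in prev_metrics}
        n_drafts -= prev[DRAFTS]
        n_accepted -= prev[ACCEPTED]
    if n_drafts <= 0:
        return 1.0  # no drafts in this window: only target-model tokens were produced
    return 1 + n_accepted / n_drafts


def snapshot(drafts, accepted):
    """Build a fake metrics snapshot with the two counters the computation reads."""
    return [FakeCounter(DRAFTS, drafts), FakeCounter(ACCEPTED, accepted)]


first = snapshot(100, 80)    # counters after the first prompt
second = snapshot(250, 170)  # cumulative counters after the second prompt

al_first = acceptance_len(first)            # 1 + 80 / 100
al_second = acceptance_len(second, first)   # 1 + (170 - 80) / (250 - 100)
print(al_first, al_second)
```

Diffing against the previous snapshot matters because the counters are cumulative: without it, every prompt after the first would report a running average rather than its own acceptance length.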

## Changed files

- `.buildkite/test_areas/spec_decode.yaml` (modified, +13/-0)
- `tests/v1/spec_decode/test_speculators_dflash.py` (added, +171/-0)
- `vllm/model_executor/models/qwen3_dflash.py` (modified, +8/-1)
- `vllm/transformers_utils/configs/speculators/algos.py` (modified, +31/-0)

🚀 The feature, motivation and pitch

vllm serve RedHatAI/Qwen3-235B-A22B-Instruct-2507-speculator.eagle3

We would like the same for dflash models as well.

Alternatives

We currently have to convert a dflash speculator model to the format vllm expects. The process is manual and might discourage users from trying out our dflash training support.

Additional context

- shanjiaz/dflash-qwen3-8b: a manually converted model that serves correctly on the working branch of DFlash support.
- shanjiaz/qwen3-8b-speculator-format: a model produced by the speculators training code, in the expected speculators format.

Note: The second model was trained on a much smaller dataset, so it will not perform as well. A fully validated model will be provided soon. These models are for reference only.

It would be great to add tests in a format similar to the Eagle3 speculators tests.


extent analysis

Fix Plan

To let vLLM load speculators-format DFlash models without manual conversion, the merged PR does the following:

Parse the DFlash speculators config so the draft method is auto-detected
Let a user-supplied --speculative-config override the auto-detected values
Adjust DFlash weight loading
Add tests for DFlash speculators models

Step-by-Step Solution

Parse the speculators config:
- Add DFlash handling to vllm/transformers_utils/configs/speculators/algos.py, next to the existing Eagle3 support.
- Keep values passed through --speculative-config instead of overwriting them with auto-detected ones (config.py).
Update weight loading:
- Handle the d2t, t2d, and verifier weights in vllm/model_executor/models/qwen3_dflash.py, following the Eagle3 patterns.
Add tests for DFlash speculators models:
- Add tests/v1/spec_decode/test_speculators_dflash.py and register it in .buildkite/test_areas/spec_decode.yaml.
- Model the tests on the Eagle3 speculators tests, using a speculators-format checkpoint such as shanjiaz/speculators-dflash-format.

Example Code

# Illustrative sketch only: these helper names are hypothetical, not vLLM's API.
# The merged fix parses the speculators config; it does not load the draft
# model through transformers.
SUPPORTED_ALGORITHMS = {"eagle3", "dflash"}

def detect_algorithm(config: dict):
    # Assumption: the speculators config names its algorithm in this field.
    return config.get("speculators_model_type")

def to_speculative_settings(config: dict) -> dict:
    algorithm = detect_algorithm(config)
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"Unsupported speculators algorithm: {algorithm!r}")
    return {"method": algorithm}

Verification

Serve a speculators-format DFlash model directly: vllm serve shanjiaz/speculators-dflash-format
Verify that the server starts without manual conversion, that the detected speculative method is DFlash, and that the mean acceptance length is above 1.

Extra Tips

Make sure to update the documentation to reflect the new support for DFlash models.
Consider adding a check to ensure that the loaded model is a valid DFlash model.


vllm - ✅(Solved) Fix [Feature]: dflash speculator model support [1 pull requests, 1 comments, 1 participants]
