vllm - ✅(Solved) Fix [Feature]: dflash speculator model support [1 pull requests, 1 comments, 1 participants]

GitHub stats
vllm-project/vllm#38240 · Fetched 2026-04-08 01:37:05
Comments: 1 · Participants: 1 · Timeline: 6 · Reactions: 0
Timeline (top): subscribed ×2, commented ×1, cross-referenced ×1, labeled ×1

Fix Action

Fix / Workaround

DFlash was recently introduced in a blog post as a potentially better speculative decoding method than Eagle-3. Speculators can now train a DFlash model, but models produced by speculators cannot yet be loaded directly in vLLM without conversion. Similar algorithms such as Eagle3 already have speculators support, so users can serve a speculators model as simply as:

vllm serve RedHatAI/Qwen3-235B-A22B-Instruct-2507-speculator.eagle3

We would like the same for dflash models as well.

PR fix notes

PR #38300: [Speculative Decoding] Add DFlash speculators config parsing

Description (problem / solution / changelog)

Summary

  • Adds DFlash speculators config parsing support (algos.py)
  • Allows user --speculative-config to override auto-detected values (config.py)
  • Updates qwen3_dflash.py weight loading: d2t/t2d/verifier handling (similar to Eagle3 patterns)
  • Adds E2E test for DFlash speculators auto-detect path
  • Closes #38240

Test Results (shanjiaz/speculators-dflash-format, Qwen3-8B target)

GSM8K Correctness (1319 questions, 5-shot, batched)

  • Accuracy: 0.885 (Qwen3-8B baseline: ~0.87-0.92)
  • Mean AL: 1.84

Magpie Acceptance Rates (200 prompts, batch-size-1)

  • Mean AL: 1.77 (min 1.45, max 2.09)
  • Per-position acceptance rate: [0.478, 0.181, 0.069, 0.023, 0.007, 0.002, 0.001, 0.000]
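
As a rough cross-check, the per-position rates can be summed to recover an aggregate acceptance length. This assumes each rate is the number of tokens accepted at that draft position divided by the number of drafts, which is how the validation script computes it. The result is about 1.76, close to the reported mean of 1.77. The small gap is expected because the reported figure averages per-prompt values rather than pooling all drafts.

```python
# Per-position acceptance rates reported above.
per_pos_rates = [0.478, 0.181, 0.069, 0.023, 0.007, 0.002, 0.001, 0.000]

# Same formula as the script's 1 + accepted / drafts, with the accepted
# tokens per draft written as the sum of the per-position rates.
aggregate_al = 1 + sum(per_pos_rates)
print(f"Aggregate acceptance length: {aggregate_al:.3f}")
```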

@shanjiaz @fynnsu Ready for review. Needs confirmation on qwen3_dflash.py changes.

<details> <summary>Magpie validation script</summary>
"""
Test DFlash speculators auto-detect path.

Loads a speculators-format model directly from HF (no config patching)
and measures acceptance length on the magpie dataset.

Usage:
    chg run --gpus 1 -- python my_wip/dflash_speculators/test_speculators_path.py
"""

import os

from tqdm import tqdm

from vllm import LLM, SamplingParams
from vllm.v1.metrics.reader import Metric

DEFAULT_MODEL = "shanjiaz/speculators-dflash-format"


def _metric_map(metrics):
    return {m.name: m.value for m in metrics if hasattr(m, "value")}


def compute_acceptance_len(
    metrics: list[Metric], prev_metrics: list[Metric] | None = None
) -> float:
    name2metric = _metric_map(metrics)
    n_drafts = name2metric["vllm:spec_decode_num_drafts"]
    n_accepted = name2metric["vllm:spec_decode_num_accepted_tokens"]
    if prev_metrics is not None:
        prev = _metric_map(prev_metrics)
        n_drafts -= prev["vllm:spec_decode_num_drafts"]
        n_accepted -= prev["vllm:spec_decode_num_accepted_tokens"]
    if n_drafts <= 0:
        return 1.0
    return 1 + (n_accepted / n_drafts)


def load_magpie_dataset(max_prompts=200):
    from datasets import load_dataset
    ds = load_dataset("Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1", split="train")
    prompts = []
    for i, row in enumerate(ds):
        if i >= max_prompts:
            break
        if "instruction" in row and row["instruction"]:
            prompts.append(row["instruction"])
        elif "conversations" in row and row["conversations"]:
            for turn in row["conversations"]:
                if turn.get("role") == "user" or turn.get("from") == "human":
                    prompts.append(turn.get("content", turn.get("value", "")))
                    break
    return prompts


def main():
    model_path = os.environ.get("SPEC_MODEL_PATH", DEFAULT_MODEL)

    print(f"\nLoading model via speculators auto-detect: {model_path}")
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        max_model_len=4096,
        max_num_seqs=128,
        gpu_memory_utilization=0.85,
        enforce_eager=False,
        disable_log_stats=False,
    )

    sc = llm.llm_engine.vllm_config.speculative_config
    print(f"Detected: method={sc.method}, "
          f"num_speculative_tokens={sc.num_speculative_tokens}, "
          f"draft_model={sc.model}")

    tokenizer = llm.get_tokenizer()

    print("\nLoading magpie dataset...")
    prompts_raw = load_magpie_dataset(max_prompts=200)
    print(f"Loaded {len(prompts_raw)} prompts")

    sampling_params = SamplingParams(temperature=0, max_tokens=2048)

    prev_metrics = None
    acceptance_lengths = []

    for i in tqdm(range(len(prompts_raw)), desc="Processing magpie"):
        prompt_text = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompts_raw[i]}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,
        )

        llm.generate([prompt_text], sampling_params, use_tqdm=False)
        current_metrics = llm.get_metrics()
        al = compute_acceptance_len(current_metrics, prev_metrics)
        prev_metrics = current_metrics
        acceptance_lengths.append(al)

    mean_al = sum(acceptance_lengths) / len(acceptance_lengths)

    print("\n" + "=" * 60)
    print("RESULTS — Speculators Auto-Detect Path")
    print(f"Model: {model_path}")
    print(f"Dataset: magpie ({len(prompts_raw)} prompts)")
    print(f"Mean Acceptance Length: {mean_al:.3f}")
    print(f"Min AL: {min(acceptance_lengths):.3f}")
    print(f"Max AL: {max(acceptance_lengths):.3f}")

    final_metrics = llm.get_metrics()
    name2val = _metric_map(final_metrics)
    n_drafts = name2val.get("vllm:spec_decode_num_drafts", 0)
    per_pos_rates = []
    if n_drafts > 0:
        for m in final_metrics:
            if hasattr(m, "values") and "per_pos" in m.name:
                per_pos_rates = [v / n_drafts for v in m.values]
                break
    if per_pos_rates:
        rate_strs = ", ".join(f"{r:.3f}" for r in per_pos_rates)
        print(f"Per-position acceptance rate: [{rate_strs}]")

    print("=" * 60)

    print("\nFinal aggregate metrics:")
    for key in sorted(name2val):
        if "spec_decode" in key:
            print(f"  {key}: {name2val[key]}")


if __name__ == "__main__":
    main()

</details>

## Changed files

- `.buildkite/test_areas/spec_decode.yaml` (modified, +13/-0)
- `tests/v1/spec_decode/test_speculators_dflash.py` (added, +171/-0)
- `vllm/model_executor/models/qwen3_dflash.py` (modified, +8/-1)
- `vllm/transformers_utils/configs/speculators/algos.py` (modified, +31/-0)


🚀 The feature, motivation and pitch

DFlash was recently introduced in a blog post as a potentially better speculative decoding method than Eagle-3. Speculators can now train a DFlash model, but models produced by speculators cannot yet be loaded directly in vLLM without conversion. Similar algorithms such as Eagle3 already have speculators support, so users can serve a speculators model as simply as:

vllm serve RedHatAI/Qwen3-235B-A22B-Instruct-2507-speculator.eagle3

We would like the same for dflash models as well.

Alternatives

We currently have to convert a dflash speculator model to the format vllm expects. The process is manual and might discourage users from trying out our dflash training support.

Additional context

  • shanjiaz/dflash-qwen3-8b: a manually converted model that serves correctly on the working branch of DFlash support.
  • shanjiaz/qwen3-8b-speculator-format: produced by the speculators training code, in the expected speculators format.

Note: The second model was trained on a much smaller dataset, so it will not be as good. A fully validated model will be provided soon. These models are just for reference.

It would be great to add tests in a format similar to the Eagle3 speculators tests.


extent analysis

Fix Plan

To load speculators-format DFlash models in vLLM without manual conversion, the linked PR does the following:

  • Teaches the speculators config parser to recognize the DFlash format
  • Adds tests for DFlash speculator models

Step-by-Step Solution

  1. Parse the DFlash speculators config:
    • Add DFlash handling to vllm/transformers_utils/configs/speculators/algos.py so that a speculators-format checkpoint is auto-detected, as Eagle3 checkpoints already are.
    • Let a user-supplied --speculative-config override the auto-detected values.
  2. Update weight loading:
    • Adjust vllm/model_executor/models/qwen3_dflash.py to handle the d2t/t2d/verifier entries, following the Eagle3 patterns.
  3. Add tests for DFlash speculator models:
    • Add an E2E test, tests/v1/spec_decode/test_speculators_dflash.py, that covers the auto-detect path.
    • Model it on the Eagle3 speculators tests, using a speculators-format checkpoint such as shanjiaz/speculators-dflash-format.

Example Code

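# Hypothetical sketch of registry-style config parsing; this is not the code
# from PR #38300, and the field and architecture names are placeholders.
SPECULATOR_UPDATERS = {}

def register_speculator(name):
    def decorator(fn):
        SPECULATOR_UPDATERS[name] = fn
        return fn
    return decorator

@register_speculator("dflash")
def update_dflash(config_dict: dict, vllm_config: dict) -> None:
    # Copy what the draft model needs from the speculators config.
    vllm_config["draft_vocab_size"] = config_dict.get("draft_vocab_size")
    vllm_config["architectures"] = ["PlaceholderDFlashDraftModel"]

def to_vllm_config(algorithm: str, config_dict: dict) -> dict:
    if algorithm not in SPECULATOR_UPDATERS:
        raise ValueError(f"Unsupported speculators algorithm: {algorithm}")
    vllm_config: dict = {}
    SPECULATOR_UPDATERS[algorithm](config_dict, vllm_config)
    return vllm_config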

Verification

  • Serve a speculators-format DFlash model directly with the updated vLLM: vllm serve shanjiaz/speculators-dflash-format
  • Check the detected speculative config (method, num_speculative_tokens, draft model) and the acceptance length, as the Magpie validation script does.
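
For a running server, one way to check the acceptance length is to read the same two counters the validation script uses from the Prometheus metrics text. The sketch below rests on assumptions that should be verified against the actual server: that the metrics are exposed in Prometheus text format, and that the counters carry a `_total` suffix there. The helper names and sample text are hypothetical.

```python
import re

# Matches "name{labels} value" or "name value"; deliberately simple.
SAMPLE_LINE = re.compile(r"^([A-Za-z_:][\w:]*)(\{[^}]*\})?\s+(\S+)")


def parse_counters(metrics_text: str) -> dict[str, float]:
    """Sum Prometheus sample values by metric name, ignoring labels."""
    counters: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            name, _labels, value = match.groups()
            counters[name] = counters.get(name, 0.0) + float(value)
    return counters


def acceptance_length(counters: dict[str, float], suffix: str = "_total") -> float:
    """Apply the script's formula: 1 + accepted tokens / drafts."""
    drafts = counters.get("vllm:spec_decode_num_drafts" + suffix, 0.0)
    accepted = counters.get("vllm:spec_decode_num_accepted_tokens" + suffix, 0.0)
    return 1.0 if drafts <= 0 else 1 + accepted / drafts


# Made-up sample text, for illustration only.
sample = """\
# TYPE vllm:spec_decode_num_drafts_total counter
vllm:spec_decode_num_drafts_total{model_name="m"} 100.0
vllm:spec_decode_num_accepted_tokens_total{model_name="m"} 84.0
"""
print(acceptance_length(parse_counters(sample)))
```

Pass `suffix=""` if the counters appear without the `_total` suffix.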

Extra Tips

  • Make sure to update the documentation to reflect the new support for DFlash models.
  • Consider adding a check to ensure that the loaded model is a valid DFlash model.
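
The validity check in the second tip could be as small as inspecting the checkpoint's parsed config.json. This sketch assumes the speculators format names its algorithm under a `speculators_model_type` key and that DFlash checkpoints use the value `dflash`. Both are assumptions to confirm against a real checkpoint, and the function name is hypothetical.

```python
def looks_like_dflash_speculator(config: dict) -> bool:
    """Return True if a parsed config.json dict declares a DFlash speculator.

    Assumption: the algorithm is named under "speculators_model_type".
    """
    return str(config.get("speculators_model_type", "")).lower() == "dflash"


print(looks_like_dflash_speculator({"speculators_model_type": "dflash"}))
print(looks_like_dflash_speculator({"speculators_model_type": "eagle3"}))
```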
