vllm - ✅(Solved) Fix [Tracking] DFlash: Update DFlash speculators test checkpoint with a stronger model [1 pull request, 1 participant]

vllm-project/vllm#39519 (fetched 2026-04-11 06:13:04)
Root Cause

The current DFlash speculators E2E test (tests/v1/spec_decode/test_speculators_dflash.py) uses nm-testing/dflash-qwen3-8b-speculators, a 3-layer checkpoint trained with limited data. Its acceptance rate is low (mean AL ~1.84, first-token AR ~47.5%), which limits test coverage for later-token code paths in speculative decoding.

Introduced in #38300.

Fix Action

Fixed

PR fix notes

PR #38300: [Speculative Decoding] Add DFlash speculators config parsing

Description (problem / solution / changelog)

Summary

  • Adds DFlash speculators config parsing support (algos.py)
  • Allows user --speculative-config to override auto-detected values (config.py)
  • Updates qwen3_dflash.py weight loading: d2t/t2d/verifier handling (similar to Eagle3 patterns)
  • Adds E2E test for DFlash speculators auto-detect path
  • Closes #38240

Test Results (shanjiaz/speculators-dflash-format, Qwen3-8B target)

GSM8K Correctness (1319 questions, 5-shot, batched)

  • Accuracy: 0.885 (Qwen3-8B baseline: ~0.87-0.92)
  • Mean AL: 1.84

Magpie Acceptance Rates (200 prompts, batch-size-1)

  • Mean AL: 1.77 (min 1.45, max 2.09)
  • Per-position acceptance rate: [0.478, 0.181, 0.069, 0.023, 0.007, 0.002, 0.001, 0.000]
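The two Magpie numbers above can be cross-checked against each other. A minimal sketch, assuming each per-position rate is accepted tokens at that position divided by the number of drafts (which is how the validation script computes it), so that acceptance length is 1 plus their sum:

```python
# Per-position acceptance rates copied from the Magpie results above.
per_pos_rates = [0.478, 0.181, 0.069, 0.023, 0.007, 0.002, 0.001, 0.000]

# Summing the rates gives accepted draft tokens per draft; the leading 1
# mirrors the script's formula: 1 + (n_accepted / n_drafts).
aggregate_al = 1 + sum(per_pos_rates)

print(f"Aggregate acceptance length: {aggregate_al:.3f}")
```

This gives roughly 1.76, close to the reported mean AL of 1.77. The small gap is plausibly due to rounding and to the reported figure being an average of per-prompt values rather than one aggregate ratio.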

@shanjiaz @fynnsu Ready for review. Needs confirmation on qwen3_dflash.py changes.

<details> <summary>Magpie validation script</summary>

```python
"""
Test DFlash speculators auto-detect path.

Loads a speculators-format model directly from HF (no config patching)
and measures acceptance length on the magpie dataset.

Usage:
    chg run --gpus 1 -- python my_wip/dflash_speculators/test_speculators_path.py
"""

import os

from tqdm import tqdm

from vllm import LLM, SamplingParams
from vllm.v1.metrics.reader import Metric

DEFAULT_MODEL = "shanjiaz/speculators-dflash-format"


def _metric_map(metrics):
    return {m.name: m.value for m in metrics if hasattr(m, "value")}


def compute_acceptance_len(
    metrics: list[Metric], prev_metrics: list[Metric] | None = None
) -> float:
    name2metric = _metric_map(metrics)
    n_drafts = name2metric["vllm:spec_decode_num_drafts"]
    n_accepted = name2metric["vllm:spec_decode_num_accepted_tokens"]
    if prev_metrics is not None:
        prev = _metric_map(prev_metrics)
        n_drafts -= prev["vllm:spec_decode_num_drafts"]
        n_accepted -= prev["vllm:spec_decode_num_accepted_tokens"]
    if n_drafts <= 0:
        return 1.0
    return 1 + (n_accepted / n_drafts)


def load_magpie_dataset(max_prompts=200):
    from datasets import load_dataset
    ds = load_dataset("Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1", split="train")
    prompts = []
    for i, row in enumerate(ds):
        if i >= max_prompts:
            break
        if "instruction" in row and row["instruction"]:
            prompts.append(row["instruction"])
        elif "conversations" in row and row["conversations"]:
            for turn in row["conversations"]:
                if turn.get("role") == "user" or turn.get("from") == "human":
                    prompts.append(turn.get("content", turn.get("value", "")))
                    break
    return prompts


def main():
    model_path = os.environ.get("SPEC_MODEL_PATH", DEFAULT_MODEL)

    print(f"\nLoading model via speculators auto-detect: {model_path}")
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        max_model_len=4096,
        max_num_seqs=128,
        gpu_memory_utilization=0.85,
        enforce_eager=False,
        disable_log_stats=False,
    )

    sc = llm.llm_engine.vllm_config.speculative_config
    print(f"Detected: method={sc.method}, "
          f"num_speculative_tokens={sc.num_speculative_tokens}, "
          f"draft_model={sc.model}")

    tokenizer = llm.get_tokenizer()

    print("\nLoading magpie dataset...")
    prompts_raw = load_magpie_dataset(max_prompts=200)
    print(f"Loaded {len(prompts_raw)} prompts")

    sampling_params = SamplingParams(temperature=0, max_tokens=2048)

    prev_metrics = None
    acceptance_lengths = []

    for i in tqdm(range(len(prompts_raw)), desc="Processing magpie"):
        prompt_text = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompts_raw[i]}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,
        )

        llm.generate([prompt_text], sampling_params, use_tqdm=False)
        current_metrics = llm.get_metrics()
        al = compute_acceptance_len(current_metrics, prev_metrics)
        prev_metrics = current_metrics
        acceptance_lengths.append(al)

    mean_al = sum(acceptance_lengths) / len(acceptance_lengths)

    print("\n" + "=" * 60)
    print("RESULTS — Speculators Auto-Detect Path")
    print(f"Model: {model_path}")
    print(f"Dataset: magpie ({len(prompts_raw)} prompts)")
    print(f"Mean Acceptance Length: {mean_al:.3f}")
    print(f"Min AL: {min(acceptance_lengths):.3f}")
    print(f"Max AL: {max(acceptance_lengths):.3f}")

    final_metrics = llm.get_metrics()
    name2val = _metric_map(final_metrics)
    n_drafts = name2val.get("vllm:spec_decode_num_drafts", 0)
    per_pos_rates = []
    if n_drafts > 0:
        for m in final_metrics:
            if hasattr(m, "values") and "per_pos" in m.name:
                per_pos_rates = [v / n_drafts for v in m.values]
                break
    if per_pos_rates:
        rate_strs = ", ".join(f"{r:.3f}" for r in per_pos_rates)
        print(f"Per-position acceptance rate: [{rate_strs}]")

    print("=" * 60)

    print("\nFinal aggregate metrics:")
    for key in sorted(name2val):
        if "spec_decode" in key:
            print(f"  {key}: {name2val[key]}")


if __name__ == "__main__":
    main()
```

</details>

## Changed files

- `tests/v1/spec_decode/test_speculators_dflash.py` (added, +104/-0)
- `vllm/model_executor/models/qwen3_dflash.py` (modified, +8/-1)
- `vllm/transformers_utils/configs/speculators/algos.py` (modified, +31/-0)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

Summary

The current DFlash speculators E2E test (tests/v1/spec_decode/test_speculators_dflash.py) uses nm-testing/dflash-qwen3-8b-speculators, a 3-layer checkpoint trained with limited data. Its acceptance rate is low (mean AL ~1.84, first-token AR ~47.5%), which limits test coverage for later-token code paths in speculative decoding.

Introduced in #38300.

TODO

Replace the checkpoint with a stronger model.

Reference comparison (GSM8k-5shot, same target model Qwen3-8B)

| Checkpoint | Layers | Mean AL | First-token AR |
|---|---|---|---|
| nm-testing/dflash-qwen3-8b-speculators (current test) | 3 | 1.84 | 47.5% |
| z-lab/Qwen3-8B-DFlash-b16 (reference) | 5 | 3.70 | 75.6% |
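To make the later-token coverage gap concrete, the comparison figures can be turned into a rough estimate of how many draft tokens per draft are accepted beyond the first position. This is a sketch under two assumptions not stated in the issue: that mean AL is computed as 1 + accepted tokens per draft (as in the validation script from #38300), and that first-token AR is the per-draft acceptance rate at position 1.

```python
# (mean AL, first-token AR) taken from the reference comparison above.
checkpoints = {
    "nm-testing/dflash-qwen3-8b-speculators": (1.84, 0.475),
    "z-lab/Qwen3-8B-DFlash-b16": (3.70, 0.756),
}


def later_token_acceptances(mean_al: float, first_token_ar: float) -> float:
    """Accepted draft tokens per draft at positions 2 and beyond.

    Assumes mean_al = 1 + (accepted tokens per draft), so subtracting the
    first-position rate leaves the contribution of later positions.
    """
    return (mean_al - 1.0) - first_token_ar


later = {name: later_token_acceptances(*vals) for name, vals in checkpoints.items()}
for name, value in later.items():
    print(f"{name}: {value:.3f} later-position tokens accepted per draft")
```

Under these assumptions the current checkpoint accepts about 0.37 later-position tokens per draft versus about 1.94 for the reference, which is consistent with the concern that later-token code paths are rarely exercised.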

cc @shanjiaz @benchislett

extent analysis

TL;DR

Replace the current checkpoint nm-testing/dflash-qwen3-8b-speculators with a stronger model, such as z-lab/Qwen3-8B-DFlash-b16, to improve test coverage for speculative decoding.

Guidance

  • Identify a stronger model with better acceptance rates, such as the reference comparison z-lab/Qwen3-8B-DFlash-b16, which has a mean AL of 3.70 and first-token AR of 75.6%.
  • Update the tests/v1/spec_decode/test_speculators_dflash.py test to use the new checkpoint.
  • Verify the improvement in test coverage by comparing the acceptance rates before and after the change.
  • Consider evaluating other stronger models to ensure the best possible test coverage.
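The before/after verification step could be sketched as follows. `SpecDecodeCounts` and the counts below are hypothetical and purely illustrative; they are not measurements and not part of vLLM. The formulas mirror the validation script in the PR notes (acceptance length as 1 + accepted tokens per draft).

```python
from dataclasses import dataclass


@dataclass
class SpecDecodeCounts:
    """Hypothetical container for aggregate speculative-decoding counters."""

    num_drafts: int
    accepted_per_pos: list[int]  # accepted tokens at draft position 1, 2, ...


def mean_acceptance_length(c: SpecDecodeCounts) -> float:
    # Same formula as the validation script: 1 + accepted / drafts.
    if c.num_drafts <= 0:
        return 1.0
    return 1 + sum(c.accepted_per_pos) / c.num_drafts


def first_token_acceptance_rate(c: SpecDecodeCounts) -> float:
    # Fraction of drafts whose first speculative token was accepted.
    if c.num_drafts <= 0 or not c.accepted_per_pos:
        return 0.0
    return c.accepted_per_pos[0] / c.num_drafts


def compare(before: SpecDecodeCounts, after: SpecDecodeCounts) -> dict[str, float]:
    """Return the change in both metrics when switching checkpoints."""
    return {
        "mean_al_delta": mean_acceptance_length(after) - mean_acceptance_length(before),
        "first_token_ar_delta": first_token_acceptance_rate(after)
        - first_token_acceptance_rate(before),
    }


# Made-up counts for illustration only.
before = SpecDecodeCounts(num_drafts=1000, accepted_per_pos=[475, 180, 70])
after = SpecDecodeCounts(num_drafts=1000, accepted_per_pos=[756, 600, 450])
deltas = compare(before, after)
print(deltas)
```

In a real run the counts would come from the engine's speculative-decoding metrics for each checkpoint, collected on the same prompts and sampling settings so the two runs are comparable.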

Example

No code snippet is provided as the issue does not contain sufficient code details.

Notes

The current checkpoint's low acceptance rate limits test coverage, and replacing it with a stronger model should improve the test's effectiveness.

Recommendation

Apply workaround: Replace the current checkpoint with a stronger model, as the current model's limitations are well-documented and a better alternative is available.
