vllm - ✅(Solved) Fix [Bug]: Draft model speculative decoding tests failing: async_scheduling not enabled and engine core initialization errors [1 pull requests, 1 participants]

puririshi98 · 2026-04-03T18:15:16Z



GitHub stats

vllm-project/vllm#38929 • Fetched 2026-04-08 02:44:54

Author: puririshi98
Participants: puririshi98
Timeline (top): referenced ×2, labeled ×1

Error Message

AssertionError: Expected async_scheduling=True for draft_model spec decode, got False.

Root Cause

Error 2: Engine core initialization failure

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

PR fix notes

PR #37916: `tests/v1/e2e/spec_decode`: assert async scheduling is used

Repository: vllm-project/vllm
Author: puririshi98
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/37916

Description (problem / solution / changelog)

Add assertions in all spec decode E2E tests to verify that async scheduling is actually active when a speculative decoding method that supports it (EAGLE, EAGLE3, MTP, draft_model, ngram_gpu) is configured. Checks scheduler_config.async_scheduling on the resolved VllmConfig after the __post_init__ auto-enable logic has run. As requested by @benchislett.
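
The check described above can be sketched roughly as follows. This is an illustration, not the code from the PR: SchedulerConfigStub and VllmConfigStub are hypothetical stand-ins that mirror only the scheduler_config.async_scheduling attribute path, and assert_async_scheduling is a hypothetical helper name. The method names are copied from the description, and the message format follows the reported error.

```python
from dataclasses import dataclass, field

# Method names as listed in the PR description (spelling kept verbatim).
ASYNC_CAPABLE_METHODS = {"EAGLE", "EAGLE3", "MTP", "draft_model", "ngram_gpu"}


@dataclass
class SchedulerConfigStub:
    """Hypothetical stand-in; mirrors only the attribute the tests inspect."""
    async_scheduling: bool = False


@dataclass
class VllmConfigStub:
    """Hypothetical stand-in for the resolved config object."""
    scheduler_config: SchedulerConfigStub = field(
        default_factory=SchedulerConfigStub
    )


def assert_async_scheduling(config, method: str) -> None:
    """Fail if an async-capable method ended up without async scheduling."""
    if method not in ASYNC_CAPABLE_METHODS:
        return  # methods outside the list are not checked
    enabled = config.scheduler_config.async_scheduling
    assert enabled, (
        f"Expected async_scheduling=True for {method} spec decode, "
        f"got {enabled}."
    )
```

Reading the flag from the resolved config, rather than from the arguments passed in, is what lets such a check catch the case where the auto-enable logic silently leaves the flag off.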

Changed files

.buildkite/ci_config_intel.yaml (added, +23/-0)
.buildkite/hardware_tests/amd.yaml (modified, +0/-8)
.buildkite/hardware_tests/cpu.yaml (modified, +4/-8)
.buildkite/image_build/image_build_xpu.sh (added, +34/-0)
.buildkite/intel_jobs/test-intel.yaml (added, +64/-0)
.buildkite/lm-eval-harness/configs/Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml (modified, +3/-0)
.buildkite/lm-eval-harness/configs/Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml (modified, +3/-0)
.buildkite/lm-eval-harness/configs/Qwen3-235B-A22B-Instruct-2507-FP8.yaml (modified, +3/-0)
.buildkite/lm-eval-harness/configs/SparseLlama3.1_2of4_fp8_compressed.yaml (removed, +0/-12)
.buildkite/lm-eval-harness/configs/models-small-rocm.txt (modified, +1/-0)
.buildkite/lm-eval-harness/test_lm_eval_correctness.py (modified, +32/-0)
.buildkite/performance-benchmarks/tests/serving-tests-arm64-cpu.json (modified, +2/-1)
.buildkite/performance-benchmarks/tests/serving-tests-cpu-asr.json (modified, +1/-0)
.buildkite/performance-benchmarks/tests/serving-tests-cpu-text.json (modified, +1/-0)
.buildkite/performance-benchmarks/tests/serving-tests-cpu.json (modified, +1/-0)
.buildkite/performance-benchmarks/tests/serving-tests-hpu.json (modified, +6/-0)
.buildkite/performance-benchmarks/tests/serving-tests.json (modified, +4/-0)
.buildkite/release-pipeline.yaml (modified, +229/-244)
.buildkite/scripts/annotate-release.sh (modified, +4/-2)
.buildkite/scripts/annotate-rocm-release.sh (modified, +6/-5)
.buildkite/scripts/cache-rocm-base-wheels.sh (modified, +7/-16)
.buildkite/scripts/cleanup-nightly-builds.sh (modified, +10/-7)
.buildkite/scripts/generate-and-upload-nightly-index.sh (added, +84/-0)
.buildkite/scripts/hardware_ci/run-amd-test.sh (modified, +3/-25)
.buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh (modified, +21/-20)
.buildkite/scripts/hardware_ci/run-cpu-test-arm.sh (modified, +7/-2)
.buildkite/scripts/hardware_ci/run-intel-test.sh (added, +292/-0)
.buildkite/scripts/hardware_ci/run-xpu-test.sh (modified, +1/-0)
.buildkite/scripts/push-nightly-builds-rocm.sh (added, +62/-0)
.buildkite/scripts/upload-nightly-wheels.sh (modified, +4/-65)
.buildkite/test-amd.yaml (modified, +146/-9)
.buildkite/test_areas/basic_correctness.yaml (modified, +1/-0)
.buildkite/test_areas/benchmarks.yaml (modified, +1/-8)
.buildkite/test_areas/compile.yaml (modified, +2/-0)
.buildkite/test_areas/cuda.yaml (modified, +1/-0)
.buildkite/test_areas/distributed.yaml (modified, +20/-0)
.buildkite/test_areas/engine.yaml (modified, +2/-0)
.buildkite/test_areas/entrypoints.yaml (modified, +2/-0)
.buildkite/test_areas/expert_parallelism.yaml (modified, +5/-2)
.buildkite/test_areas/kernels.yaml (modified, +31/-1)
.buildkite/test_areas/lm_eval.yaml (modified, +1/-0)
.buildkite/test_areas/misc.yaml (modified, +68/-13)
.buildkite/test_areas/model_executor.yaml (modified, +1/-1)
.buildkite/test_areas/model_runner_v2.yaml (modified, +2/-1)
.buildkite/test_areas/models_basic.yaml (modified, +2/-1)
.buildkite/test_areas/models_distributed.yaml (modified, +3/-2)
.buildkite/test_areas/models_language.yaml (modified, +2/-0)
.buildkite/test_areas/models_multimodal.yaml (modified, +5/-1)
.buildkite/test_areas/pytorch.yaml (modified, +13/-1)
.buildkite/test_areas/ray_compat.yaml (modified, +1/-0)
.buildkite/test_areas/spec_decode.yaml (modified, +4/-0)
.github/CODEOWNERS (modified, +20/-7)
.github/mergify.yml (modified, +30/-0)
.github/workflows/new_pr_bot.yml (modified, +10/-4)
.github/workflows/pre-commit.yml (modified, +4/-3)
.pre-commit-config.yaml (modified, +37/-2)
AGENTS.md (modified, +27/-13)
CMakeLists.txt (modified, +304/-315)
benchmarks/attention_benchmarks/benchmark.py (modified, +2/-8)
benchmarks/benchmark_long_document_qa_throughput.py (modified, +1/-2)
benchmarks/benchmark_prefix_caching.py (modified, +1/-2)
benchmarks/benchmark_prioritization.py (modified, +1/-2)
benchmarks/cutlass_benchmarks/sparse_benchmarks.py (removed, +0/-517)
benchmarks/cutlass_benchmarks/utils.py (modified, +0/-48)
benchmarks/fused_kernels/merge_attn_states_benchmarks.py (added, +264/-0)
benchmarks/fused_kernels/silu_mul_block_quant_benchmark.py (added, +211/-0)
benchmarks/kernels/benchmark_fused_collective.py (modified, +19/-7)
benchmarks/kernels/benchmark_moe.py (modified, +2/-3)
benchmarks/kernels/benchmark_router_gemm.py (removed, +0/-134)
benchmarks/kernels/benchmark_vit_bilinear_pos_embed.py (added, +162/-0)
cmake/cpu_extension.cmake (modified, +1/-0)
cmake/external_projects/qutlass.cmake (modified, +4/-4)
cmake/external_projects/vllm_flash_attn.cmake (modified, +1/-1)
cmake/utils.cmake (modified, +41/-4)
csrc/attention/merge_attn_states.cu (modified, +164/-43)
csrc/cache.h (modified, +4/-0)
csrc/cache_kernels.cu (modified, +66/-1)
csrc/cpu/cpu_fused_moe.cpp (modified, +43/-1)
csrc/cpu/generate_cpu_attn_dispatch.py (modified, +1/-1)
csrc/cpu/sgl-kernels/common.h (modified, +8/-0)
csrc/cpu/sgl-kernels/gemm.h (modified, +36/-3)
csrc/cpu/sgl-kernels/gemm_int4.cpp (added, +755/-0)
csrc/cpu/torch_bindings.cpp (modified, +32/-3)
csrc/cpu/utils.cpp (modified, +35/-144)
csrc/cuda_vec_utils.cuh (modified, +2/-2)
csrc/cutlass_extensions/common.hpp (modified, +7/-5)
csrc/cutlass_extensions/cute_utils.cuh (modified, +0/-1)
csrc/cutlass_extensions/epilogue/broadcast_load_epilogue_array_c3x.hpp (modified, +20/-20)
csrc/cutlass_extensions/epilogue/broadcast_load_epilogue_c3x.hpp (modified, +20/-20)
csrc/cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp (modified, +33/-19)
csrc/cutlass_extensions/torch_utils.hpp (modified, +52/-37)
csrc/layernorm_kernels.cu (modified, +1/-1)
csrc/layernorm_quant_kernels.cu (modified, +1/-1)
csrc/libtorch_stable/cutlass_extensions/epilogue/scaled_mm_epilogues_c2x.hpp (renamed, +20/-16)
csrc/libtorch_stable/dispatch_utils.h (added, +69/-0)
csrc/libtorch_stable/ops.h (modified, +128/-0)
csrc/libtorch_stable/quantization/cutlass_w4a8/get_group_starts.cuh (renamed, +36/-25)
csrc/libtorch_stable/quantization/cutlass_w4a8/w4a8_grouped_mm_entry.cu (renamed, +83/-63)
csrc/libtorch_stable/quantization/cutlass_w4a8/w4a8_mm_entry.cu (renamed, +60/-58)
csrc/libtorch_stable/quantization/cutlass_w4a8/w4a8_utils.cu (renamed, +0/-0)



The tests that use assert_draft_model_correctness in tests/v1/e2e/spec_decode/test_spec_decode.py are failing with two types of errors:

Error 1: async_scheduling not enabled

AssertionError: Expected async_scheduling=True for draft_model spec decode, got False.

This assertion is at lines 1084-1086 of the test file. Draft-model spec decode expects async scheduling to be auto-enabled, but that is not happening.

Error 2: Engine core initialization failure

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

This appears to be a more serious runtime error during engine initialization for some test cases.

Failing Tests

All tests that call assert_draft_model_correctness:

test_draft_model_correctness (line 850)
test_draft_model_realistic_example (line 856)
test_draft_model_parallel_drafting (line 871)
test_draft_model_quantization (line 897)
test_draft_model_tensor_parallelism (line 909)

Steps to Reproduce

CUDA_VISIBLE_DEVICES=0 pytest tests/v1/e2e/spec_decode/test_spec_decode.py::test_draft_model_correctness -v
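
To reproduce all five failing tests in one run, the command above can be extended with additional pytest node IDs. The sketch below builds that command from the test names listed under Failing Tests; build_pytest_command is a hypothetical helper, and running every test on a single visible GPU is an assumption that may not hold for the tensor-parallelism test.

```python
import shlex

TEST_FILE = "tests/v1/e2e/spec_decode/test_spec_decode.py"

# Test names taken from the "Failing Tests" list in the issue.
FAILING_TESTS = [
    "test_draft_model_correctness",
    "test_draft_model_realistic_example",
    "test_draft_model_parallel_drafting",
    "test_draft_model_quantization",
    "test_draft_model_tensor_parallelism",
]


def build_pytest_command(tests):
    """Return the pytest argv that selects each test by its node ID."""
    node_ids = [f"{TEST_FILE}::{name}" for name in tests]
    return ["pytest", *node_ids, "-v"]


if __name__ == "__main__":
    # Print a shell command line; the environment prefix matches the issue.
    command = shlex.join(build_pytest_command(FAILING_TESTS))
    print(f"CUDA_VISIBLE_DEVICES=0 {command}")
```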

Environment

vLLM version: 0.19.1rc1.dev17+ga5a623d96.d20260403.precompiled
Installation: Built from source with VLLM_USE_PRECOMPILED=1
Platform: Linux with CUDA 13.2

Expected Behavior

Tests should pass with:

async_scheduling automatically enabled for draft_model spec decode
Engine core initializing successfully

Actual Behavior

Tests fail with async_scheduling=False and engine initialization errors.

Proposed Solution

These tests are being marked in a PR as @pytest.mark.xfail until the underlying issues are resolved: https://github.com/vllm-project/vllm/pull/37916
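
A minimal sketch of what such a marker looks like, assuming a non-strict xfail. The test name, the reason text, and the placeholder body are hypothetical; the real tests call assert_draft_model_correctness instead.

```python
import pytest


@pytest.mark.xfail(
    reason="async_scheduling is not auto-enabled for draft_model spec decode",
    strict=False,  # an unexpected pass is reported as XPASS, not as a failure
)
def test_draft_model_correctness_sketch():
    # Placeholder body standing in for the real correctness check.
    async_scheduling = False
    assert async_scheduling, (
        "Expected async_scheduling=True for draft_model spec decode, got False."
    )
```

With strict=False the suite stays green whether the test fails or passes, so the marker has to be removed by hand once the underlying issues are fixed.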

extent analysis

TL;DR

The draft_model spec decode tests fail for two reported reasons: async_scheduling is not auto-enabled, and engine core initialization fails for some test cases. Marking the tests as @pytest.mark.xfail is a temporary workaround until the underlying issues are resolved.

Guidance

Review tests/v1/e2e/spec_decode/test_spec_decode.py to see where assert_draft_model_correctness is called and why it expects async_scheduling to be enabled.
Investigate engine core initialization to find why it fails for some test cases. The error message points to a "root cause above", so check the earlier log output first.
Rule out a version mismatch or compatibility problem with CUDA 13.2 as the cause of the initialization failure.
Consider whether the test configuration should enable async_scheduling explicitly for draft_model spec decode instead of relying on the auto-enable logic.

Example

No code snippet is provided as the issue does not imply a specific code change but rather an environmental or configuration issue.

Notes

The proposed solution of marking tests as @pytest.mark.xfail is a temporary workaround and does not address the underlying issues.
The root cause of the engine core initialization failure and async_scheduling not being enabled needs to be investigated further.

Recommendation

Apply the workaround: mark the tests as @pytest.mark.xfail until the underlying issues are resolved, as proposed in PR https://github.com/vllm-project/vllm/pull/37916, so that the failures do not block other development.

