vllm - ✅(Solved) Fix [Bug]: Draft model speculative decoding tests failing: async_scheduling not enabled and engine core initialization errors [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#38929 (fetched 2026-04-08 02:44:54)
Comments: 0
Participants: 1
Timeline events: 3 (referenced ×2, labeled ×1)
Reactions: 0

Error Messages

Error 1: async_scheduling not enabled

AssertionError: Expected async_scheduling=True for draft_model spec decode, got False.

Error 2: Engine core initialization failure

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

PR fix notes

PR #37916: tests/v1/e2e/spec_decode: assert async scheduling is used

Description (problem / solution / changelog)

Add assertions in all spec decode E2E tests to verify that async scheduling is actually active when a speculative decoding method that supports it (EAGLE, EAGLE3, MTP, draft_model, ngram_gpu) is configured. The assertions check scheduler_config.async_scheduling on the resolved VllmConfig after the post_init auto-enable logic has run. Requested by @benchislett.
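The check described here can be sketched as follows. This is a minimal, self-contained illustration and not the PR's code: `SimpleNamespace` stands in for the resolved VllmConfig, and the helper name `assert_async_scheduling` is hypothetical. Only the attribute path `scheduler_config.async_scheduling` and the wording of the failure message are taken from this page.

```python
from types import SimpleNamespace


def assert_async_scheduling(vllm_config, method: str) -> None:
    """Fail if async scheduling is off for a spec decode method that should enable it.

    `vllm_config` is any object exposing `scheduler_config.async_scheduling`.
    The helper name and signature are hypothetical.
    """
    actual = vllm_config.scheduler_config.async_scheduling
    assert actual is True, (
        f"Expected async_scheduling=True for {method} spec decode, got {actual}."
    )


# Stand-in for a resolved config; a real test would read it from the engine.
resolved = SimpleNamespace(scheduler_config=SimpleNamespace(async_scheduling=False))

try:
    assert_async_scheduling(resolved, "draft_model")
except AssertionError as exc:
    message = str(exc)
    print(message)
```

Checking the resolved config rather than the user-supplied arguments matters because, per the description, the flag is set by auto-enable logic and not passed explicitly by the tests.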

Changed files

  • .buildkite/ci_config_intel.yaml (added, +23/-0)
  • .buildkite/hardware_tests/amd.yaml (modified, +0/-8)
  • .buildkite/hardware_tests/cpu.yaml (modified, +4/-8)
  • .buildkite/image_build/image_build_xpu.sh (added, +34/-0)
  • .buildkite/intel_jobs/test-intel.yaml (added, +64/-0)
  • .buildkite/lm-eval-harness/configs/Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml (modified, +3/-0)
  • .buildkite/lm-eval-harness/configs/Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml (modified, +3/-0)
  • .buildkite/lm-eval-harness/configs/Qwen3-235B-A22B-Instruct-2507-FP8.yaml (modified, +3/-0)
  • .buildkite/lm-eval-harness/configs/SparseLlama3.1_2of4_fp8_compressed.yaml (removed, +0/-12)
  • .buildkite/lm-eval-harness/configs/models-small-rocm.txt (modified, +1/-0)
  • .buildkite/lm-eval-harness/test_lm_eval_correctness.py (modified, +32/-0)
  • .buildkite/performance-benchmarks/tests/serving-tests-arm64-cpu.json (modified, +2/-1)
  • .buildkite/performance-benchmarks/tests/serving-tests-cpu-asr.json (modified, +1/-0)
  • .buildkite/performance-benchmarks/tests/serving-tests-cpu-text.json (modified, +1/-0)
  • .buildkite/performance-benchmarks/tests/serving-tests-cpu.json (modified, +1/-0)
  • .buildkite/performance-benchmarks/tests/serving-tests-hpu.json (modified, +6/-0)
  • .buildkite/performance-benchmarks/tests/serving-tests.json (modified, +4/-0)
  • .buildkite/release-pipeline.yaml (modified, +229/-244)
  • .buildkite/scripts/annotate-release.sh (modified, +4/-2)
  • .buildkite/scripts/annotate-rocm-release.sh (modified, +6/-5)
  • .buildkite/scripts/cache-rocm-base-wheels.sh (modified, +7/-16)
  • .buildkite/scripts/cleanup-nightly-builds.sh (modified, +10/-7)
  • .buildkite/scripts/generate-and-upload-nightly-index.sh (added, +84/-0)
  • .buildkite/scripts/hardware_ci/run-amd-test.sh (modified, +3/-25)
  • .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh (modified, +21/-20)
  • .buildkite/scripts/hardware_ci/run-cpu-test-arm.sh (modified, +7/-2)
  • .buildkite/scripts/hardware_ci/run-intel-test.sh (added, +292/-0)
  • .buildkite/scripts/hardware_ci/run-xpu-test.sh (modified, +1/-0)
  • .buildkite/scripts/push-nightly-builds-rocm.sh (added, +62/-0)
  • .buildkite/scripts/upload-nightly-wheels.sh (modified, +4/-65)
  • .buildkite/test-amd.yaml (modified, +146/-9)
  • .buildkite/test_areas/basic_correctness.yaml (modified, +1/-0)
  • .buildkite/test_areas/benchmarks.yaml (modified, +1/-8)
  • .buildkite/test_areas/compile.yaml (modified, +2/-0)
  • .buildkite/test_areas/cuda.yaml (modified, +1/-0)
  • .buildkite/test_areas/distributed.yaml (modified, +20/-0)
  • .buildkite/test_areas/engine.yaml (modified, +2/-0)
  • .buildkite/test_areas/entrypoints.yaml (modified, +2/-0)
  • .buildkite/test_areas/expert_parallelism.yaml (modified, +5/-2)
  • .buildkite/test_areas/kernels.yaml (modified, +31/-1)
  • .buildkite/test_areas/lm_eval.yaml (modified, +1/-0)
  • .buildkite/test_areas/misc.yaml (modified, +68/-13)
  • .buildkite/test_areas/model_executor.yaml (modified, +1/-1)
  • .buildkite/test_areas/model_runner_v2.yaml (modified, +2/-1)
  • .buildkite/test_areas/models_basic.yaml (modified, +2/-1)
  • .buildkite/test_areas/models_distributed.yaml (modified, +3/-2)
  • .buildkite/test_areas/models_language.yaml (modified, +2/-0)
  • .buildkite/test_areas/models_multimodal.yaml (modified, +5/-1)
  • .buildkite/test_areas/pytorch.yaml (modified, +13/-1)
  • .buildkite/test_areas/ray_compat.yaml (modified, +1/-0)
  • .buildkite/test_areas/spec_decode.yaml (modified, +4/-0)
  • .github/CODEOWNERS (modified, +20/-7)
  • .github/mergify.yml (modified, +30/-0)
  • .github/workflows/new_pr_bot.yml (modified, +10/-4)
  • .github/workflows/pre-commit.yml (modified, +4/-3)
  • .pre-commit-config.yaml (modified, +37/-2)
  • AGENTS.md (modified, +27/-13)
  • CMakeLists.txt (modified, +304/-315)
  • benchmarks/attention_benchmarks/benchmark.py (modified, +2/-8)
  • benchmarks/benchmark_long_document_qa_throughput.py (modified, +1/-2)
  • benchmarks/benchmark_prefix_caching.py (modified, +1/-2)
  • benchmarks/benchmark_prioritization.py (modified, +1/-2)
  • benchmarks/cutlass_benchmarks/sparse_benchmarks.py (removed, +0/-517)
  • benchmarks/cutlass_benchmarks/utils.py (modified, +0/-48)
  • benchmarks/fused_kernels/merge_attn_states_benchmarks.py (added, +264/-0)
  • benchmarks/fused_kernels/silu_mul_block_quant_benchmark.py (added, +211/-0)
  • benchmarks/kernels/benchmark_fused_collective.py (modified, +19/-7)
  • benchmarks/kernels/benchmark_moe.py (modified, +2/-3)
  • benchmarks/kernels/benchmark_router_gemm.py (removed, +0/-134)
  • benchmarks/kernels/benchmark_vit_bilinear_pos_embed.py (added, +162/-0)
  • cmake/cpu_extension.cmake (modified, +1/-0)
  • cmake/external_projects/qutlass.cmake (modified, +4/-4)
  • cmake/external_projects/vllm_flash_attn.cmake (modified, +1/-1)
  • cmake/utils.cmake (modified, +41/-4)
  • csrc/attention/merge_attn_states.cu (modified, +164/-43)
  • csrc/cache.h (modified, +4/-0)
  • csrc/cache_kernels.cu (modified, +66/-1)
  • csrc/cpu/cpu_fused_moe.cpp (modified, +43/-1)
  • csrc/cpu/generate_cpu_attn_dispatch.py (modified, +1/-1)
  • csrc/cpu/sgl-kernels/common.h (modified, +8/-0)
  • csrc/cpu/sgl-kernels/gemm.h (modified, +36/-3)
  • csrc/cpu/sgl-kernels/gemm_int4.cpp (added, +755/-0)
  • csrc/cpu/torch_bindings.cpp (modified, +32/-3)
  • csrc/cpu/utils.cpp (modified, +35/-144)
  • csrc/cuda_vec_utils.cuh (modified, +2/-2)
  • csrc/cutlass_extensions/common.hpp (modified, +7/-5)
  • csrc/cutlass_extensions/cute_utils.cuh (modified, +0/-1)
  • csrc/cutlass_extensions/epilogue/broadcast_load_epilogue_array_c3x.hpp (modified, +20/-20)
  • csrc/cutlass_extensions/epilogue/broadcast_load_epilogue_c3x.hpp (modified, +20/-20)
  • csrc/cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp (modified, +33/-19)
  • csrc/cutlass_extensions/torch_utils.hpp (modified, +52/-37)
  • csrc/layernorm_kernels.cu (modified, +1/-1)
  • csrc/layernorm_quant_kernels.cu (modified, +1/-1)
  • csrc/libtorch_stable/cutlass_extensions/epilogue/scaled_mm_epilogues_c2x.hpp (renamed, +20/-16)
  • csrc/libtorch_stable/dispatch_utils.h (added, +69/-0)
  • csrc/libtorch_stable/ops.h (modified, +128/-0)
  • csrc/libtorch_stable/quantization/cutlass_w4a8/get_group_starts.cuh (renamed, +36/-25)
  • csrc/libtorch_stable/quantization/cutlass_w4a8/w4a8_grouped_mm_entry.cu (renamed, +83/-63)
  • csrc/libtorch_stable/quantization/cutlass_w4a8/w4a8_mm_entry.cu (renamed, +60/-58)
  • csrc/libtorch_stable/quantization/cutlass_w4a8/w4a8_utils.cu (renamed, +0/-0)

Code Example

AssertionError: Expected async_scheduling=True for draft_model spec decode, got False.

---

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

---

CUDA_VISIBLE_DEVICES=0 pytest tests/v1/e2e/spec_decode/test_spec_decode.py::test_draft_model_correctness -v
Issue body

The tests that use assert_draft_model_correctness in tests/v1/e2e/spec_decode/test_spec_decode.py are failing with two types of errors:

Error 1: async_scheduling not enabled

AssertionError: Expected async_scheduling=True for draft_model spec decode, got False.

This assertion is at lines 1084-1086 of the test file. Draft-model spec decode expects async scheduling to be auto-enabled, but that is not happening.

Error 2: Engine core initialization failure

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

This appears to be a more serious runtime error during engine initialization for some test cases.

Failing Tests

All tests that call assert_draft_model_correctness:

  • test_draft_model_correctness (line 850)
  • test_draft_model_realistic_example (line 856)
  • test_draft_model_parallel_drafting (line 871)
  • test_draft_model_quantization (line 897)
  • test_draft_model_tensor_parallelism (line 909)

Steps to Reproduce

CUDA_VISIBLE_DEVICES=0 pytest tests/v1/e2e/spec_decode/test_spec_decode.py::test_draft_model_correctness -v
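
To reproduce all five failing tests in one run rather than only the first, the pytest node IDs can be built from the test names listed under "Failing Tests". This is a small sketch that assumes the file path and test names in this issue are still current; it only prints the command and does not run it. The tensor-parallelism test may need more than one visible GPU, which is an assumption and not something the issue states.

```python
import shlex

TEST_FILE = "tests/v1/e2e/spec_decode/test_spec_decode.py"

# Test names as listed under "Failing Tests" in this issue.
FAILING_TESTS = [
    "test_draft_model_correctness",
    "test_draft_model_realistic_example",
    "test_draft_model_parallel_drafting",
    "test_draft_model_quantization",
    "test_draft_model_tensor_parallelism",
]


def build_pytest_command(tests):
    """Return a pytest argv list that selects each test by node ID."""
    node_ids = [f"{TEST_FILE}::{name}" for name in tests]
    return ["pytest", *node_ids, "-v"]


command = build_pytest_command(FAILING_TESTS)
# Same environment variable as in the single-test reproduction command.
print("CUDA_VISIBLE_DEVICES=0 " + shlex.join(command))
```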

Environment

  • vLLM version: 0.19.1rc1.dev17+ga5a623d96.d20260403.precompiled
  • Installation: Built from source with VLLM_USE_PRECOMPILED=1
  • Platform: Linux with CUDA 13.2

Expected Behavior

Tests should pass with:

  • async_scheduling automatically enabled for draft_model spec decode
  • Engine core initializing successfully

Actual Behavior

Tests fail with async_scheduling=False and engine initialization errors.

Proposed Solution

These tests are being marked in a PR as @pytest.mark.xfail until the underlying issues are resolved: https://github.com/vllm-project/vllm/pull/37916
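
For illustration, an xfail marker of the kind described might look like the sketch below. The reason text, the choice of `strict=False`, and the placeholder test body are assumptions; the actual change is in the linked PR.

```python
import pytest

# Hypothetical reason string; the real PR may word this differently.
XFAIL_REASON = (
    "draft_model spec decode: async_scheduling not auto-enabled and "
    "engine core initialization errors"
)


@pytest.mark.xfail(reason=XFAIL_REASON, strict=False)
def test_draft_model_correctness():
    # Placeholder body standing in for the real call to
    # assert_draft_model_correctness(...), whose arguments are not shown here.
    raise AssertionError(
        "Expected async_scheduling=True for draft_model spec decode, got False."
    )
```

With `strict=False`, pytest reports the test as XFAIL while it keeps failing and as XPASS once it starts passing, without failing the run in either case, so the marker can be removed when the underlying issues are fixed.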

extent analysis

TL;DR

  • The tests fail because async_scheduling is not auto-enabled for draft_model spec decode and because engine core initialization fails for some test cases. Marking them @pytest.mark.xfail is a temporary workaround until the underlying issues are resolved.

Guidance

  • Review tests/v1/e2e/spec_decode/test_spec_decode.py to see where assert_draft_model_correctness is called and what it requires for async_scheduling to count as enabled.
  • Investigate why engine core initialization fails for some test cases. The RuntimeError says "See root cause above", so look for the underlying exception earlier in the test output.
  • Consider whether the test configuration or environment variables need updating so that async_scheduling is enabled by default for draft_model spec decode.
  • Rule out a version mismatch or a CUDA 13.2 compatibility problem as the cause of the engine core initialization failure.

Example

No code snippet is provided as the issue does not imply a specific code change but rather an environmental or configuration issue.

Notes

  • The proposed solution of marking tests as @pytest.mark.xfail is a temporary workaround and does not address the underlying issues.
  • The root cause of the engine core initialization failure and async_scheduling not being enabled needs to be investigated further.
