vllm: [Bug][V1][Hybrid] IndexError in get_temporal_copy_spec during DFlash speculative decoding + prefix caching

vllm-project/vllm#41884, fetched 2026-05-07 03:32:12

Error Message

File "vllm/v1/worker/gpu_model_runner.py", line 3950, in execute_model
    mamba_utils.preprocess_mamba(...)
File "vllm/v1/worker/mamba_utils.py", line 207, in preprocess_mamba
    collect_mamba_copy_meta(...)
File "vllm/v1/worker/mamba_utils.py", line 124, in collect_mamba_copy_meta
    copy_spec = state_copy_func(
                ^^^^^^^^^^^^^^^^
File "vllm/model_executor/layers/mamba/mamba_utils.py", line 341, in get_temporal_copy_spec
    src_block_id = block_ids[cur_block_idx + num_accepted_tokens - 1]
                   ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range

Followed by:

vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
INFO: Shutting down

Root Cause

get_temporal_copy_spec computes src_block_id = block_ids[cur_block_idx + num_accepted_tokens - 1] with no bounds check. When num_accepted_tokens is large relative to the number of blocks allocated past cur_block_idx, the computed index is at or beyond len(block_ids) and the lookup raises IndexError.

This is the corner case acknowledged in:

  • PR #30877: "Speculative decoding is temporarily disabled in this PR as there are still corner-case bugs when using with prefix-caching in align mode."
  • PR #33726: explicitly lists "Prefix Caching (old style, have not tested 'align' mode)" as a known crash source with mamba spec decode.
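
The out-of-range index can be reproduced in isolation with toy values. This is only a sketch of the arithmetic in the traceback: the list contents and the three numbers below are hypothetical and are not taken from the crashing run or from vLLM's data structures.

```python
# Toy values only: this mirrors the indexing expression from the traceback,
# not vLLM's actual block tables.
block_ids = [10, 11, 12, 13]   # hypothetical allocated block IDs (len == 4)
cur_block_idx = 2              # hypothetical current block position
num_accepted_tokens = 3        # hypothetical number of accepted draft tokens

# Same expression as in get_temporal_copy_spec: 2 + 3 - 1 = 4
idx = cur_block_idx + num_accepted_tokens - 1

try:
    src_block_id = block_ids[idx]   # valid indices are 0..3, so this raises
except IndexError as exc:
    error_message = str(exc)
    print(f"index {idx} with len(block_ids) == {len(block_ids)}: {error_message}")
```

With 15 draft tokens configured, a long accepted run near the end of the allocated blocks is one plausible way to reach this state, although the issue does not confirm the exact values involved.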

Fix Action

Workaround

Either:

  • Disable DFlash speculative decoding (removes the Mamba state copy path entirely)
  • Pass --no-enable-prefix-caching (avoids the align-mode block copy logic)
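
For the second workaround, a small helper can make sure the flag is present in whatever command launches the server. This is a sketch under assumptions: the helper name is hypothetical, the argument list stands in for the flags a deployment already passes, and the model name is the one reported in this issue's environment.

```python
import shlex

def with_prefix_caching_disabled(serve_args):
    """Return a copy of a `vllm serve` argument list with prefix caching off.

    Hypothetical helper for illustration; it only edits the argument list.
    """
    # Drop an explicit enable flag if the caller passed one.
    args = [a for a in serve_args if a != "--enable-prefix-caching"]
    # Add the disable flag named in the workaround, once.
    if "--no-enable-prefix-caching" not in args:
        args.append("--no-enable-prefix-caching")
    return args

# Placeholder command; keep any existing speculative-decoding flags as they are.
base = ["vllm", "serve", "rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm"]
cmd = with_prefix_caching_disabled(base)
print(shlex.join(cmd))
```

Disabling prefix caching trades cache reuse for stability, so it is best treated as a temporary measure until the indexing is guarded upstream.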

Environment

  • vLLM version: 0.1.dev1+gbfde49e28.d20260418 (image ghcr.io/aeon-7/vllm-spark-omni-q36:v1.2)
  • Hardware: NVIDIA GB10 (DGX SPARK, 128GB UMA)
  • Model: rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm (compressed-tensors / PrismaQuant 4.75-bit)
  • Speculative decoding: DFlash, 15 draft tokens, draft model z-lab/Qwen3.6-35B-A3B-DFlash
  • Prefix caching: enabled (default)

Bug Description

EngineCore crashes mid-inference on text-only requests with an IndexError in get_temporal_copy_spec. APIServer then shuts down cleanly. The crash is non-deterministic: it is triggered by a specific acceptance pattern during DFlash decode combined with the Mamba SSM state copy.

Reproduction

Difficult to reproduce deterministically. The crash occurs under concurrent load with varying sequence lengths. The prefix cache hit rate was 68.6% at the time of the crash, suggesting a warmed cache was involved.

Expected Behavior

get_temporal_copy_spec should bounds-check before indexing block_ids, or fall back gracefully when cur_block_idx + num_accepted_tokens - 1 >= len(block_ids).
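
A bounds-checked lookup along these lines might look like the sketch below. The function name, its signature, and the choice to return None are assumptions made for illustration; this is not vLLM's code, and the correct fallback (skipping the copy, clamping to the last block, or something else) is a decision for the maintainers.

```python
from typing import Optional, Sequence

def checked_src_block_id(
    block_ids: Sequence[int],
    cur_block_idx: int,
    num_accepted_tokens: int,
) -> Optional[int]:
    """Hypothetical guard around the lookup shown in the traceback.

    Returns the source block ID when the index is valid, otherwise None so
    the caller can choose a fallback instead of crashing the engine.
    """
    idx = cur_block_idx + num_accepted_tokens - 1
    # Reject negative indices too: Python would silently wrap them around.
    if 0 <= idx < len(block_ids):
        return block_ids[idx]
    return None

# Toy values, not taken from the crashing run.
blocks = [10, 11, 12, 13]
in_range = checked_src_block_id(blocks, cur_block_idx=1, num_accepted_tokens=2)
out_of_range = checked_src_block_id(blocks, cur_block_idx=2, num_accepted_tokens=3)
print(in_range, out_of_range)
```

Returning None rather than clamping keeps the out-of-range case visible to the caller, which seems safer for state copies where a wrong source block could corrupt the SSM state without any error.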
