vllm: [Bug][V1][Hybrid] IndexError in get_temporal_copy_spec during DFlash speculative decoding + prefix caching

vllm • 2026-05-07 01:37:14


vllm-project/vllm#41884•Fetched 2026-05-07 03:32:12


Author

redhelix


Error Message

File "vllm/v1/worker/gpu_model_runner.py", line 3950, in execute_model
    mamba_utils.preprocess_mamba(...)
File "vllm/v1/worker/mamba_utils.py", line 207, in preprocess_mamba
    collect_mamba_copy_meta(...)
File "vllm/v1/worker/mamba_utils.py", line 124, in collect_mamba_copy_meta
    copy_spec = state_copy_func(
                ^^^^^^^^^^^^^^^^
File "vllm/model_executor/layers/mamba/mamba_utils.py", line 341, in get_temporal_copy_spec
    src_block_id = block_ids[cur_block_idx + num_accepted_tokens - 1]
                   ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range

Followed by:

vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
INFO: Shutting down

Root Cause

get_temporal_copy_spec computes src_block_id = block_ids[cur_block_idx + num_accepted_tokens - 1] with no bounds check. When num_accepted_tokens is large relative to the number of blocks allocated beyond cur_block_idx, the computed index reaches or exceeds len(block_ids) and the lookup raises IndexError.

This is the corner case acknowledged in:

PR #30877: "Speculative decoding is temporarily disabled in this PR as there are still corner-case bugs when using with prefix-caching in align mode."
PR #33726: explicitly lists "Prefix Caching (old style, have not tested 'align' mode)" as a known crash source with mamba spec decode.
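The failing arithmetic can be illustrated in isolation. The sketch below is not vLLM code: it reuses only the indexing expression from the traceback, and the values of block_ids, cur_block_idx and num_accepted_tokens are made up for illustration, not taken from the crash.

```python
# Illustrative values only (assumptions); not captured from the crashing request.
block_ids = [7, 8, 9]      # hypothetical block ids allocated to one request
cur_block_idx = 1          # hypothetical current block index
num_accepted_tokens = 3    # hypothetical number of accepted draft tokens

# Same expression as in the traceback: 1 + 3 - 1 = 3, but valid indices are 0..2.
src_index = cur_block_idx + num_accepted_tokens - 1
print("index:", src_index, "len(block_ids):", len(block_ids))

try:
    src_block_id = block_ids[src_index]
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```

With 15 draft tokens configured, as in the reported setup, the offset added to cur_block_idx can be much larger than in this toy case.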

Fix Action

Workaround

Either:

Disable DFlash speculative decoding (removes the Mamba state copy path entirely)
Pass --no-enable-prefix-caching (avoids the align-mode block copy logic)


Environment

vLLM version: 0.1.dev1+gbfde49e28.d20260418 (image ghcr.io/aeon-7/vllm-spark-omni-q36:v1.2)
Hardware: NVIDIA GB10 (DGX SPARK, 128GB UMA)
Model: rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm (compressed-tensors / PrismaQuant 4.75-bit)
Speculative decoding: DFlash, 15 draft tokens, draft model z-lab/Qwen3.6-35B-A3B-DFlash
Prefix caching: enabled (default)

Bug Description

EngineCore crashes mid-inference on text-only requests with an IndexError in get_temporal_copy_spec. APIServer then shuts down cleanly. The crash is non-deterministic: it is triggered by a specific acceptance pattern during DFlash decode + Mamba SSM state copy.



Reproduction

Difficult to reproduce deterministically. The crash occurs under concurrent load with varying sequence lengths. Prefix cache hit rate was 68.6% at time of crash, suggesting a warmed cache was involved.


Expected Behavior

Bounds check before indexing block_ids, or a graceful fallback when cur_block_idx + num_accepted_tokens - 1 >= len(block_ids).
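A minimal sketch of the requested bounds check follows. The helper name pick_src_block_id and its None return value are assumptions made for illustration; this is not vLLM's API or an actual patch, and what a caller should do on None (skip the copy, fall back, or raise a descriptive error) is left open.

```python
from typing import Optional, Sequence


def pick_src_block_id(
    block_ids: Sequence[int],
    cur_block_idx: int,
    num_accepted_tokens: int,
) -> Optional[int]:
    """Return the source block id, or None when the index is out of range.

    Hypothetical helper: it mirrors only the indexing expression shown in
    the traceback and adds an explicit range check in front of it.
    """
    src_index = cur_block_idx + num_accepted_tokens - 1
    # Reject negative indices as well: Python would silently wrap them
    # around to the end of the list instead of raising.
    if not 0 <= src_index < len(block_ids):
        return None
    return block_ids[src_index]


print(pick_src_block_id([7, 8, 9], 1, 2))  # index 2 is in range
print(pick_src_block_id([7, 8, 9], 1, 3))  # index 3 is out of range
```

Returning None keeps the decision with the caller, so an out-of-range index becomes a handled condition rather than an exception that takes down EngineCore.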
