vllm - Spec decode with multimodal pruning gives Eagle drafter shifted embeddings but unshifted M-RoPE positions

vllm, 2026-05-27 23:24:22


I can reproduce a V1 speculative decoding issue where multimodal pruning + M-RoPE cause the Eagle drafter to see M-RoPE positions that do not match the shifted multimodal embeddings gathered for the draft pass.

The suspicious path is:

propose_draft_token_ids() computes target_positions
it calls _gather_mm_embeddings(scheduler_output, shift_computed_tokens=1)
_gather_mm_embeddings() uses req_state.num_computed_tokens + shift_computed_tokens for the multimodal embedding overlap window
but with multimodal pruning + M-RoPE enabled it recomputes/copies M-RoPE positions using unshifted req_state.num_computed_tokens
drafter.propose(..., target_positions=target_positions) then consumes the same target_positions buffer

Runtime instrumentation shows that the multimodal embeddings correspond to the shifted token window, while the positions consumed by the drafter at those same multimodal rows are unshifted. On a later decode step, _gather_mm_embeddings() also mutates the target_positions view before drafter.propose() consumes it.
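To make the mismatch concrete, here is a toy sketch in plain Python. It is not vLLM code: `overlap_window` is a hypothetical helper, and positions are reduced to flat token indices. It assumes only what the path above describes, namely that the embedding window starts at `num_computed_tokens + shift_computed_tokens` while the position recompute starts at the unshifted `num_computed_tokens`.

```python
# Toy model of the off-by-one pairing; the helper name is illustrative, not a vLLM API.

def overlap_window(num_computed_tokens: int, num_scheduled_tokens: int) -> range:
    """Global token indices covered by one forward pass."""
    return range(num_computed_tokens, num_computed_tokens + num_scheduled_tokens)

num_computed_tokens = 0     # req_state.num_computed_tokens in the report
shift_computed_tokens = 1   # value passed by propose_draft_token_ids()
num_scheduled_tokens = 4    # arbitrary small window for illustration

# Rows whose embeddings are gathered for the draft pass (shifted window).
embedding_rows = list(
    overlap_window(num_computed_tokens + shift_computed_tokens, num_scheduled_tokens)
)
# Rows whose positions are produced when the shift is not applied.
position_rows = list(overlap_window(num_computed_tokens, num_scheduled_tokens))

# Each embedding row ends up paired with the position of the previous token.
offsets = [e - p for e, p in zip(embedding_rows, position_rows)]
print(embedding_rows)  # [1, 2, 3, 4]
print(position_rows)   # [0, 1, 2, 3]
print(offsets)         # [1, 1, 1, 1]
```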


Environment

vLLM main: 5963c194787d30ed4a49c1e2e01010d8dffe1e79
vLLM version: 0.21.1rc1.dev343+g5963c1947.d20260527
GPU: A100-SXM4-40GB
Driver: 580.159.03
PyTorch: 2.11.0+cu130
Target model: Qwen/Qwen3-VL-4B-Instruct
Draft model: AngelSlim/Qwen3-VL-4B-Instruct_eagle3
Spec decode: method=eagle3, num_speculative_tokens=2
dtype=bfloat16, enforce_eager=True, max_model_len=1024, max_num_batched_tokens=1024, max_num_seqs=1
Synthetic image input, video_pruning_rate=0.75

I used an image input because a Qwen3-VL video prompt with pruning currently hits a separate non-speculative device mismatch before generation:

RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0

The image case still reaches:

self.is_multimodal_pruning_enabled == True
self.uses_mrope == True
self.supports_mm_inputs == True
self.drafter.supports_mm_inputs == True
_gather_mm_embeddings(..., shift_computed_tokens=1) from propose_draft_token_ids()

Negative controls / observed output

Greedy no-spec with pruning enabled:

{"label":"no_spec_prune","speculative":false,"video_pruning_rate":0.75,"text":"The image displays a gradient transitioning from bright yellow in the top-left to deep purple in the top-right, with a green","token_ids":[785,2168,18689,264,20169,72094,504,9906,13753,304,279,1909,7950,311,5538,24932,304,279,1909,6701,11,448,264,6176]}

Spec decode with pruning disabled matches the no-spec output exactly:

{"label":"spec_no_prune","speculative":true,"video_pruning_rate":0.0,"text":"The image displays a gradient transitioning from bright yellow in the top-left to deep purple in the top-right, with a green","token_ids":[785,2168,18689,264,20169,72094,504,9906,13753,304,279,1909,7950,311,5538,24932,304,279,1909,6701,11,448,264,6176]}

Spec decode with pruning enabled diverges:

{"label":"spec_prune_current","speculative":true,"video_pruning_rate":0.75,"text":"The image features a gradient transitioning from bright yellow in the top-left to deep purple in the top-right, with a green","token_ids":[785,2168,4419,264,20169,72094,504,9906,13753,304,279,1909,7950,311,5538,24932,304,279,1909,6701,11,448,264,6176]}
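A quick way to locate the divergence is to compare the reported token ids directly. The sketch below copies the three `token_ids` lists from the outputs above; `first_divergence` is a hypothetical helper written for this comparison, not part of vLLM.

```python
from typing import List, Optional

# Token ids copied from the three runs reported above.
no_spec_prune = [785, 2168, 18689, 264, 20169, 72094, 504, 9906, 13753, 304, 279, 1909,
                 7950, 311, 5538, 24932, 304, 279, 1909, 6701, 11, 448, 264, 6176]
spec_no_prune = [785, 2168, 18689, 264, 20169, 72094, 504, 9906, 13753, 304, 279, 1909,
                 7950, 311, 5538, 24932, 304, 279, 1909, 6701, 11, 448, 264, 6176]
spec_prune_current = [785, 2168, 4419, 264, 20169, 72094, 504, 9906, 13753, 304, 279, 1909,
                      7950, 311, 5538, 24932, 304, 279, 1909, 6701, 11, 448, 264, 6176]

def first_divergence(a: List[int], b: List[int]) -> Optional[int]:
    """Return the first index where the sequences differ, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

print(first_divergence(no_spec_prune, spec_no_prune))       # None
idx = first_divergence(no_spec_prune, spec_prune_current)
print(idx, no_spec_prune[idx], spec_prune_current[idx])     # 2 18689 4419
```

The only differing id is at index 2, which lines up with the third word of the decoded text ("displays" versus "features"); every other token in this 24-token sample is identical.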

Instrumentation evidence

In the first draft proposal, _gather_mm_embeddings(..., shift_computed_tokens=1) gathers multimodal embeddings for the shifted overlap. The shifted multimodal rows are global token indices [15..23], and the recompute call still uses req_state.num_computed_tokens == 0:

{
  "event": "gather_mm_embeddings_before_recompute",
  "shift_computed_tokens": 1,
  "req_num_computed_tokens": 0,
  "effective_num_computed_tokens_for_embedding_overlap": 1,
  "num_computed_tokens_for_recompute": 0,
  "is_multimodal_pruning_enabled": true,
  "uses_mrope": true,
  "is_mm_embed_true_indices_global_sample": [15,16,17,18,19,20,21,22,23],
  "mrope_values_at_shifted_embedding_indices": [[16,16,16,16,16,16,16,16,16],[16,16,16,17,17,17,18,18,18],[16,17,18,16,17,18,16,17,18]],
  "mrope_values_at_unshifted_indices": [[15,16,16,16,16,16,16,16,16],[15,16,16,16,17,17,17,18,18],[15,16,17,18,16,17,18,16,17]]
}

The drafter then receives the unshifted M-RoPE values at those multimodal rows:

{
  "event": "drafter_first_pass_multimodal_rows",
  "positions_at_mm_indices_sample": [[15,16,16,16,16,16,16,16,16],[15,16,16,16,17,17,17,18,18],[15,16,17,18,16,17,18,16,17]]
}
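The two logs can be cross-checked mechanically. The sketch below copies the three position arrays from the instrumentation output above (one row per M-RoPE axis) and confirms that the drafter rows equal the unshifted values, and that each unshifted row is the shifted row delayed by one token. The variable names are chosen for this check only.

```python
# Values copied from the instrumentation logs above; one row per M-RoPE axis.
shifted = [
    [16, 16, 16, 16, 16, 16, 16, 16, 16],
    [16, 16, 16, 17, 17, 17, 18, 18, 18],
    [16, 17, 18, 16, 17, 18, 16, 17, 18],
]
unshifted = [
    [15, 16, 16, 16, 16, 16, 16, 16, 16],
    [15, 16, 16, 16, 17, 17, 17, 18, 18],
    [15, 16, 17, 18, 16, 17, 18, 16, 17],
]
drafter_positions = [
    [15, 16, 16, 16, 16, 16, 16, 16, 16],
    [15, 16, 16, 16, 17, 17, 17, 18, 18],
    [15, 16, 17, 18, 16, 17, 18, 16, 17],
]

matches_unshifted = drafter_positions == unshifted
matches_shifted = drafter_positions == shifted
# A one-token lag means unshifted[k + 1] == shifted[k] on every axis.
lags_by_one = all(u[1:] == s[:-1] for u, s in zip(unshifted, shifted))

print(matches_unshifted)  # True
print(matches_shifted)    # False
print(lags_by_one)        # True
```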

On a later decode step, _gather_mm_embeddings() mutates target_positions before the drafter consumes it:

{
  "event": "propose_draft_token_ids",
  "total_num_tokens_passed_to_drafter": 3,
  "target_positions_before_gather": {
    "shape": [3,3],
    "head": [36,37,38,36,37,38,36,37],
    "sum": 333,
    "weighted_sum": 1671
  },
  "target_positions_after_gather": {
    "shape": [3,3],
    "head": [34,35,36,34,35,36,34,35],
    "sum": 315,
    "weighted_sum": 1581
  }
}
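The logged checksums can be reproduced from the visible values. The sketch below makes two assumptions that the report does not state: the ninth element of each `[3,3]` buffer continues the repeating pattern shown in `head`, and `weighted_sum` weights the flattened values by 1..n. Under those assumptions both checksums match the log, and every position drops by exactly 2 after the gather.

```python
from typing import List, Tuple

def checksums(rows: List[List[int]]) -> Tuple[int, int]:
    """Return (sum, weighted_sum) over the row-major flattened buffer.

    The 1..n weighting is an assumption that happens to reproduce the log.
    """
    flat = [v for row in rows for v in row]
    return sum(flat), sum((i + 1) * v for i, v in enumerate(flat))

# Full [3,3] buffers inferred from the eight logged head values.
before_gather = [[36, 37, 38], [36, 37, 38], [36, 37, 38]]
after_gather = [[34, 35, 36], [34, 35, 36], [34, 35, 36]]

print(checksums(before_gather))  # (333, 1671)
print(checksums(after_gather))   # (315, 1581)

deltas = {b - a for rb, ra in zip(before_gather, after_gather) for b, a in zip(rb, ra)}
print(deltas)  # {2}
```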

Local patch experiment

I also tried a minimal local patch that passes the shifted value into recompute_mrope_positions() for the draft gather:

num_computed_tokens_for_recompute = req_state.num_computed_tokens + shift_computed_tokens

That changed the instrumented recompute argument from 0 to 1, but it did not restore the no-spec output and did not remove the later checksum mutation:

{"label":"spec_prune_patch_shifted","speculative":true,"video_pruning_rate":0.75,"text":"The image features a gradient transitioning from bright yellow in the top-left to deep purple in the top-right, with a green","token_ids":[785,2168,4419,264,20169,72094,504,9906,13753,304,279,1909,7950,311,5538,24932,304,279,1909,6701,11,448,264,6176]}

So the fix may need a separate draft-position buffer, or otherwise avoid mutating the target-position buffer before drafter.propose() consumes it.
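The buffer-isolation idea can be illustrated without vLLM. In the sketch below, `gather_with_side_effect` and `propose` are hypothetical stand-ins, and Python lists stand in for tensors that share storage. It only shows why a snapshot taken before the gather protects the drafter's input; whether this restores the no-spec output in vLLM is untested here.

```python
from typing import List

def gather_with_side_effect(positions: List[int], start: int) -> None:
    """Stand-in for a gather step that rewrites the position buffer in place."""
    for i in range(len(positions)):
        positions[i] = start + i

def propose(target_positions: List[int]) -> List[int]:
    """Stand-in for the drafter: reports the positions it actually received."""
    return list(target_positions)

# Case 1: target_positions aliases the shared buffer, so the gather leaks into it.
shared = [36, 37, 38]
target_positions = shared                 # same object, like a tensor view
gather_with_side_effect(shared, start=34)
seen_with_alias = propose(target_positions)

# Case 2: snapshot the positions before the gather touches the shared buffer.
shared = [36, 37, 38]
target_positions = list(shared)           # independent copy, like tensor.clone()
gather_with_side_effect(shared, start=34)
seen_with_copy = propose(target_positions)

print(seen_with_alias)  # [34, 35, 36]
print(seen_with_copy)   # [36, 37, 38]
```

The copy costs one extra buffer per draft step. A dedicated draft-position buffer, as suggested above, would avoid the per-step copy while giving the same isolation.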

Expected behavior

Speculative greedy generation should match the non-speculative greedy result, and multimodal embeddings gathered for a shifted draft window should be paired with M-RoPE positions for the same shifted rows. The draft multimodal gather should also not overwrite the target-position buffer that is passed to the drafter.

Prior search

I searched open issues and PRs in this repo for the following exact topics and did not find an open match:

recompute_mrope_positions speculative
multimodal pruning spec decode
shift_computed_tokens mrope
EAGLE M-RoPE pruning
Qwen3-VL speculative multimodal pruning
