vllm - ✅ (Solved) Fix: stale padded request metadata can misclassify Mamba CUDA graph rows [1 pull request, 1 comment, 2 participants]

GitHub stats
vllm-project/vllm#41841 (fetched 2026-05-07 03:32:34)
Comments: 1 · Participants: 2 · Timeline: 4 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, mentioned ×1, subscribed ×1

Full CUDA graph metadata can include stale CPU request-state values for padded rows. For Mamba backends this can make padding rows look like prefills, so a decode-only CUDA graph replay can be classified as mixed decode/prefill metadata.

The relevant path is:

  • GPUModelRunner._build_attention_metadata slices num_computed_tokens_cpu_tensor[:num_reqs_padded] and num_prompt_tokens_cpu_tensor[:num_reqs_padded].
  • Other padded metadata is explicitly neutralized, e.g. padded block table entries are filled with NULL_BLOCK_ID and padded sequence lengths are zeroed.
  • The request-state tensors are not neutralized beyond num_reqs.
  • is_prefilling = num_computed_tokens_cpu < num_prompt_tokens_cpu can therefore be true for inactive padded rows if those rows contain stale values from prior requests.
  • Mamba calls split_decodes_and_prefills(..., treat_short_extends_as_decodes=False), which ORs is_prefilling into the split. A zero-token padded row can become a fake prefill.
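
The path above can be modeled without vLLM. The following is a minimal pure-Python sketch, not vLLM's implementation. `split_sketch` is a hypothetical stand-in built on two assumptions: a row counts as a prefill when its query length exceeds a decode threshold of 1 or, when `treat_short_extends_as_decodes` is false, when its `is_prefilling` flag is set; and the decode/prefill boundary sits at the first such row. Lists stand in for the CPU tensors, and the values are the ones used in the issue's repro.

```python
def split_sketch(query_start_loc, is_prefilling,
                 treat_short_extends_as_decodes, decode_threshold=1):
    """Hypothetical stand-in for the decode/prefill split (not vLLM code).

    Returns (num_decodes, num_prefills, num_decode_tokens, num_prefill_tokens).
    """
    num_reqs = len(query_start_loc) - 1
    num_tokens = query_start_loc[-1]
    query_lens = [query_start_loc[i + 1] - query_start_loc[i]
                  for i in range(num_reqs)]

    # Assumption: long queries are prefills; the flag is OR-ed in on request.
    is_prefill = [q > decode_threshold for q in query_lens]
    if not treat_short_extends_as_decodes:
        is_prefill = [p or f for p, f in zip(is_prefill, is_prefilling)]

    if not any(is_prefill):
        return (num_reqs, 0, num_tokens, 0)

    # Everything before the first prefill row is treated as decode.
    first_prefill = is_prefill.index(True)
    num_decode_tokens = query_start_loc[first_prefill]
    return (first_prefill, num_reqs - first_prefill,
            num_decode_tokens, num_tokens - num_decode_tokens)


num_reqs, num_reqs_padded = 2, 4
query_start_loc = [0, 1, 2, 2, 2]  # two 1-token decodes, two empty padded rows

# Persistent per-slot buffers; rows 2 and 3 hold leftovers from older requests.
num_computed_tokens_cpu = [7, 8, 1, 2]
num_prompt_tokens_cpu = [4, 4, 3, 4]

# Slicing to num_reqs_padded pulls the stale rows into the comparison.
is_prefilling = [c < p for c, p in zip(num_computed_tokens_cpu[:num_reqs_padded],
                                       num_prompt_tokens_cpu[:num_reqs_padded])]
stale_split = split_sketch(query_start_loc, is_prefilling,
                           treat_short_extends_as_decodes=False)
print(is_prefilling)  # [False, False, True, True]
print(stale_split)    # (2, 2, 2, 0)
```

In this model the two padded rows carry no tokens, yet they are counted as two prefills, which is the misclassification the issue describes.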

Fix Action

Fixed

PR fix notes

PR #41873: [Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba

Description (problem / solution / changelog)

Purpose

Fix stale is_prefilling metadata in padded CUDA graph rows that misclassifies Mamba decode requests as prefills.

After condense() compacts active requests, vacated slots retain stale num_computed_tokens and num_prompt_tokens values. When the batch is padded for CUDA graphs, the comparison num_computed < num_prompt can yield True for padding rows. Mamba is the only backend that passes treat_short_extends_as_decodes=False to split_decodes_and_prefills, so these stale True values get OR-ed into is_prefill and shift the decode/prefill boundary, misclassifying real decode rows as prefills.

Fix: zero is_prefilling[num_reqs:] immediately after computation.
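
The idea behind this fix can be illustrated on plain Python lists. This is a sketch under stated assumptions, not the PR's code: `mask_padded_rows` is a hypothetical helper, and a list of booleans stands in for the `is_prefilling` tensor.

```python
def mask_padded_rows(is_prefilling, num_reqs):
    """Return a copy with every row at index >= num_reqs forced to False.

    Hypothetical helper mirroring the idea of zeroing is_prefilling[num_reqs:].
    """
    return is_prefilling[:num_reqs] + [False] * (len(is_prefilling) - num_reqs)


num_reqs = 2
# Rows 0-1 are live decodes; rows 2-3 are padding whose flags came from stale state.
stale_flags = [False, False, True, True]
fixed_flags = mask_padded_rows(stale_flags, num_reqs)
print(fixed_flags)  # [False, False, False, False]

# A live row that really is prefilling keeps its flag; only padding is cleared.
live_prefill_flags = mask_padded_rows([True, False, True], num_reqs=2)
print(live_prefill_flags)  # [True, False, False]
```

Masking the derived flag leaves the underlying request-state buffers untouched, so only the rows past `num_reqs` change.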

Resolves #41841

Test Plan

pytest tests/v1/attention/test_attention_splitting.py -xvs

Test Result

23 passed, 20 warnings in 5.49s
  • test_split_decodes_stale_padded_is_prefilling — verifies correct splitting after zeroing padded rows
  • test_split_decodes_stale_padded_is_prefilling_without_fix — confirms stale data shifts the boundary without the fix

Changed files

  • tests/v1/attention/test_attention_splitting.py (modified, +66/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +3/-0)

Minimal Repro

This repro is synthetic and does not require a model or dataset. It demonstrates the metadata classification issue directly.

import torch

from vllm.v1.attention.backend import CommonAttentionMetadata
from vllm.v1.attention.backends.utils import split_decodes_and_prefills

num_reqs = 2
num_reqs_padded = 4

# Two real decode rows, then two padded rows with zero query length.
query_start_loc = torch.tensor([0, 1, 2, 2, 2], dtype=torch.int32)
seq_lens = torch.tensor([8, 9, 0, 0], dtype=torch.int32)

# Real rows are decode-only, but padded rows contain stale state from older requests.
num_computed_tokens = torch.tensor([7, 8, 1, 2], dtype=torch.int32)
num_prompt_tokens = torch.tensor([4, 4, 3, 4], dtype=torch.int32)

is_prefilling = num_computed_tokens[:num_reqs_padded] < num_prompt_tokens[:num_reqs_padded]
metadata = CommonAttentionMetadata(
    query_start_loc=query_start_loc,
    query_start_loc_cpu=query_start_loc,
    seq_lens=seq_lens,
    num_reqs=num_reqs_padded,
    num_actual_tokens=2,
    max_query_len=1,
    max_seq_len=9,
    block_table_tensor=torch.zeros((num_reqs_padded, 1), dtype=torch.int32),
    slot_mapping=torch.arange(2, dtype=torch.int64),
    is_prefilling=is_prefilling,
)

print(split_decodes_and_prefills(metadata, treat_short_extends_as_decodes=False))

Current output, as (num_decodes, num_prefills, num_decode_tokens, num_prefill_tokens):

(2, 2, 2, 0)

Expected output for a padded decode-only full CUDA graph batch:

(4, 0, 2, 0)

If the padded CPU request-state rows are cleared before computing is_prefilling, the split returns the expected decode-only result:

fixed_num_computed = num_computed_tokens[:num_reqs_padded].clone()
fixed_num_prompt = num_prompt_tokens[:num_reqs_padded].clone()
fixed_num_computed[num_reqs:num_reqs_padded] = 0
fixed_num_prompt[num_reqs:num_reqs_padded] = 0
fixed_metadata = metadata.replace(
    is_prefilling=fixed_num_computed < fixed_num_prompt,
)
print(split_decodes_and_prefills(fixed_metadata, treat_short_extends_as_decodes=False))
# (4, 0, 2, 0)

Why this matters

This only affects padded CUDA graph metadata rows. It is most likely to surface in workloads that combine:

  • Mamba or hybrid Mamba models
  • full CUDA graph replay with padded request rows
  • prefix caching / Mamba align mode or any path sensitive to Mamba decode-vs-prefill metadata
  • long decode batches where request rows churn and stale inactive-row state is more likely to matter

Possible Fix

When building common attention metadata with num_reqs_padded > num_reqs, clone and zero num_computed_tokens_cpu and num_prompt_tokens_cpu for padded rows before computing is_prefilling.
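
The issue proposes zeroing the counters for padded rows, while PR #41873 zeroes the derived flag instead. A small pure-Python check, using the repro's values and hypothetical helper names, suggests both approaches give the same `is_prefilling` for padded rows, because `0 < 0` is false. Lists stand in for the CPU tensors; this is an illustration, not vLLM code.

```python
def flags_after_zeroing_counters(num_computed, num_prompt, num_reqs):
    """Zero copies of both counters for padded rows, then compare."""
    computed = list(num_computed)  # copy so the persistent buffers stay intact
    prompt = list(num_prompt)
    for i in range(num_reqs, len(computed)):
        computed[i] = 0
        prompt[i] = 0
    return [c < p for c, p in zip(computed, prompt)]


def flags_after_masking(num_computed, num_prompt, num_reqs):
    """Compare first, then force the padded rows to False."""
    flags = [c < p for c, p in zip(num_computed, num_prompt)]
    flags[num_reqs:] = [False] * (len(flags) - num_reqs)
    return flags


num_computed = [7, 8, 1, 2]
num_prompt = [4, 4, 3, 4]
zeroed = flags_after_zeroing_counters(num_computed, num_prompt, num_reqs=2)
masked = flags_after_masking(num_computed, num_prompt, num_reqs=2)
print(zeroed == masked, zeroed)  # True [False, False, False, False]
```

Masking the flag avoids cloning two buffers, which may be why the PR took that route, though the source does not state the reason.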
