vllm - ✅ (Solved) Stale padded request metadata can misclassify Mamba CUDA graph rows [1 pull request, 1 comment, 2 participants]

vllm · 2026-05-06 16:45:46


vllm-project/vllm#41841 • Fetched 2026-05-07 03:32:34


Author: tianshu-Michael-yu
Participants: liulanze, tianshu-Michael-yu


Full CUDA graph metadata can include stale CPU request-state values for padded rows. For Mamba backends this can make padding rows look like prefills, so a decode-only CUDA graph replay can be classified as mixed decode/prefill metadata.

The relevant path is:

GPUModelRunner._build_attention_metadata slices num_computed_tokens_cpu_tensor[:num_reqs_padded] and num_prompt_tokens_cpu_tensor[:num_reqs_padded].
Other padded metadata is explicitly neutralized, e.g. padded block table entries are filled with NULL_BLOCK_ID and padded sequence lengths are zeroed.
The request-state tensors are not neutralized beyond num_reqs.
is_prefilling = num_computed_tokens_cpu < num_prompt_tokens_cpu can therefore be true for inactive padded rows if those rows contain stale values from prior requests.
Mamba calls split_decodes_and_prefills(..., treat_short_extends_as_decodes=False), which ORs is_prefilling into the split. A zero-token padded row can become a fake prefill.
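
The last step above can be illustrated with a small stand-alone model of the split. This is a simplified sketch in plain Python, not vLLM's implementation: `toy_split` is a hypothetical helper, and both the decode threshold of 1 token and the reading of the result as (num_decodes, num_prefills, num_decode_tokens, num_prefill_tokens) are assumptions inferred from the repro output on this page.

```python
def toy_split(query_start_loc, is_prefilling, decode_threshold=1):
    """Hypothetical, simplified model of a decode/prefill split.

    Rows are assumed to be ordered decodes first, prefills last. A row counts
    as a prefill if its query is longer than decode_threshold OR its
    is_prefilling flag is set (the behaviour described for Mamba).
    """
    num_reqs = len(query_start_loc) - 1
    num_tokens = query_start_loc[-1]
    query_lens = [query_start_loc[i + 1] - query_start_loc[i] for i in range(num_reqs)]
    is_prefill = [
        qlen > decode_threshold or flag
        for qlen, flag in zip(query_lens, is_prefilling)
    ]
    if not any(is_prefill):
        return (num_reqs, 0, num_tokens, 0)
    # Everything from the first prefill row onward is treated as prefill.
    first_prefill = is_prefill.index(True)
    num_decode_tokens = query_start_loc[first_prefill]
    return (
        first_prefill,
        num_reqs - first_prefill,
        num_decode_tokens,
        num_tokens - num_decode_tokens,
    )


# Two live decode rows followed by two zero-token padded rows.
query_start_loc = [0, 1, 2, 2, 2]

stale = toy_split(query_start_loc, [False, False, True, True])
clean = toy_split(query_start_loc, [False, False, False, False])
print(stale)  # (2, 2, 2, 0)
print(clean)  # (4, 0, 2, 0)
```

In this model the padded rows contribute no tokens either way; only the stale flag changes how many rows are counted as prefills.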

Why this matters

This only affects padded CUDA graph metadata rows. It is most likely to surface in workloads that combine:

Mamba or hybrid Mamba models
full CUDA graph replay with padded request rows
prefix caching / Mamba align mode or any path sensitive to Mamba decode-vs-prefill metadata
long decode batches where request rows churn and stale inactive-row state is more likely to matter

Fix Action

Fix proposed (not yet merged at fetch time)

Proposed in PR: [Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba (https://github.com/vllm-project/vllm/pull/41873)

PR fix notes

PR #41873: [Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba

Repository: vllm-project/vllm
Author: liulanze
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41873

Description (problem / solution / changelog)

Purpose

Fix stale is_prefilling metadata in padded CUDA graph rows that misclassifies Mamba decode requests as prefills.

After condense() compacts active requests, vacated slots retain stale num_computed_tokens and num_prompt_tokens values. When the batch is padded for CUDA graphs, the comparison num_computed < num_prompt can yield True for padding rows. Mamba is the only backend that passes treat_short_extends_as_decodes=False to split_decodes_and_prefills, so these stale True values get OR-ed into is_prefill and shift the decode/prefill boundary, misclassifying real decode rows as prefills.
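
How a vacated slot ends up holding old values can be sketched with a toy batch. The `toy_condense` helper below is hypothetical and far simpler than vLLM's `condense()`; it only assumes that compaction moves live rows forward and does not clear the slots they leave behind. The slot values are illustrative.

```python
def toy_condense(rows, live):
    """Hypothetical compaction: move live rows to the front, in place.

    rows: list of (num_computed_tokens, num_prompt_tokens) per slot.
    live: list of booleans, True where the slot holds an active request.
    Returns the number of live rows. Vacated slots are NOT cleared.
    """
    write = 0
    for read, is_live in enumerate(live):
        if is_live:
            rows[write] = rows[read]
            write += 1
    return write


# Slots 0 and 2 hold decoding requests; slots 1 and 3 held requests that were
# removed while still mid-prompt (illustrative values).
rows = [(7, 4), (1, 3), (8, 4), (2, 4)]
live = [True, False, True, False]

num_reqs = toy_condense(rows, live)
num_reqs_padded = 4  # padded up to the CUDA graph batch size

# Comparing over the padded range also reads the slots that were never cleared.
is_prefilling = [computed < prompt for computed, prompt in rows[:num_reqs_padded]]
print(num_reqs)       # 2
print(is_prefilling)  # [False, False, False, True]
```

Slot 3 still carries the removed request's (2, 4) pair, so its flag is True even though no request lives there.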

Fix: zero is_prefilling[num_reqs:] immediately after computation.
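
In tensor terms, the fix described above amounts to masking the tail of the flag vector. A minimal sketch with illustrative values; the variable names mirror the description and are not necessarily the exact lines changed in gpu_model_runner.py.

```python
import torch

num_reqs = 2
num_reqs_padded = 4

# Illustrative values: rows 2-3 are padding that still carries old request state.
num_computed_tokens = torch.tensor([7, 8, 1, 2], dtype=torch.int32)
num_prompt_tokens = torch.tensor([4, 4, 3, 4], dtype=torch.int32)

# The comparison allocates a fresh bool tensor, so editing it in place
# does not touch the persistent request-state buffers.
is_prefilling = num_computed_tokens[:num_reqs_padded] < num_prompt_tokens[:num_reqs_padded]
stale_flags = is_prefilling.tolist()

# Padded rows never represent a real request, so force their flag to False.
is_prefilling[num_reqs:] = False

print(stale_flags)             # [False, False, True, True]
print(is_prefilling.tolist())  # [False, False, False, False]
```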

Resolves #41841

Test Plan

pytest tests/v1/attention/test_attention_splitting.py -xvs

Test Result

23 passed, 20 warnings in 5.49s

test_split_decodes_stale_padded_is_prefilling — verifies correct splitting after zeroing padded rows
test_split_decodes_stale_padded_is_prefilling_without_fix — confirms stale data shifts the boundary without the fix


Changed files

tests/v1/attention/test_attention_splitting.py (modified, +66/-0)
vllm/v1/worker/gpu_model_runner.py (modified, +3/-0)

Minimal Repro

This repro is synthetic and does not require a model or dataset. It demonstrates the metadata classification issue directly.

import torch

from vllm.v1.attention.backend import CommonAttentionMetadata
from vllm.v1.attention.backends.utils import split_decodes_and_prefills

num_reqs = 2
num_reqs_padded = 4

# Two real decode rows, then two padded rows with zero query length.
query_start_loc = torch.tensor([0, 1, 2, 2, 2], dtype=torch.int32)
seq_lens = torch.tensor([8, 9, 0, 0], dtype=torch.int32)

# Real rows are decode-only, but padded rows contain stale state from older requests.
num_computed_tokens = torch.tensor([7, 8, 1, 2], dtype=torch.int32)
num_prompt_tokens = torch.tensor([4, 4, 3, 4], dtype=torch.int32)

is_prefilling = num_computed_tokens[:num_reqs_padded] < num_prompt_tokens[:num_reqs_padded]
metadata = CommonAttentionMetadata(
    query_start_loc=query_start_loc,
    query_start_loc_cpu=query_start_loc,
    seq_lens=seq_lens,
    num_reqs=num_reqs_padded,
    num_actual_tokens=2,
    max_query_len=1,
    max_seq_len=9,
    block_table_tensor=torch.zeros((num_reqs_padded, 1), dtype=torch.int32),
    slot_mapping=torch.arange(2, dtype=torch.int64),
    is_prefilling=is_prefilling,
)

print(split_decodes_and_prefills(metadata, treat_short_extends_as_decodes=False))

Current output:

(2, 2, 2, 0)

Expected output for a padded decode-only full CUDA graph batch:

(4, 0, 2, 0)

If the padded CPU request-state rows are cleared before computing is_prefilling, the split returns the expected decode-only result:

fixed_num_computed = num_computed_tokens[:num_reqs_padded].clone()
fixed_num_prompt = num_prompt_tokens[:num_reqs_padded].clone()
fixed_num_computed[num_reqs:num_reqs_padded] = 0
fixed_num_prompt[num_reqs:num_reqs_padded] = 0
fixed_metadata = metadata.replace(
    is_prefilling=fixed_num_computed < fixed_num_prompt,
)
print(split_decodes_and_prefills(fixed_metadata, treat_short_extends_as_decodes=False))
# (4, 0, 2, 0)

Possible Fix

When building common attention metadata with num_reqs_padded > num_reqs, clone and zero num_computed_tokens_cpu and num_prompt_tokens_cpu for padded rows before computing is_prefilling.
