vllm - ✅(Solved) Fix [CI Failure]: Language Models Tests (Hybrid) 1 - granite-4.0-tiny-preview prefix caching regression [1 pull requests, 1 comments, 2 participants]

vllm · 2026-05-19 11:30:08

GitHub stats

vllm-project/vllm#43090•Fetched 2026-05-20 03:39:55

Author

elvircrn

Participants

elvircrn

njhill

Timeline (top)

mentioned ×2, subscribed ×2, commented ×1

Error Message

FAILED models/language/generation/test_hybrid.py::test_models[5-64-ibm-granite/granite-4.0-tiny-preview] - AssertionError: Test1:

Root Cause

Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)

PR fix notes

PR #42766: [Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim

Repository: vllm-project/vllm
Author: NickLucche
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/42766

Description (problem / solution / changelog)

With MRv2 being on by default for dense Qwen models (i.e. Qwen3-0.6B), we found a discrepancy in how KV cache tensors are exposed to connectors through the register_kv_caches API when the kernel and logical block_size "don't match", e.g.:

num_blocks, 2, 128, ...      <==MRV2
num_blocks*2, 2, 64, ...    <==MRV1

assuming block_size=128 and kernel_block_size=64 on the FlashInfer (FI) backend
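
To make the two shapes above concrete, here is a minimal sketch of how the leading dimensions relate when one logical block is split into several kernel blocks. The helper name `kernel_view_shape` and the trailing dimensions `(8, 64)` are hypothetical and chosen only for illustration; this is not the code from the PR, and it only computes shapes without saying anything about memory layout.

```python
from math import prod


def kernel_view_shape(num_blocks, block_size, kernel_block_size, trailing_dims):
    """Return the shape exposed when each logical block is viewed as
    block_size // kernel_block_size kernel blocks (hypothetical helper)."""
    if block_size % kernel_block_size != 0:
        raise ValueError("block_size must be a multiple of kernel_block_size")
    ratio = block_size // kernel_block_size
    # The leading dim grows by `ratio`; the per-block token dim shrinks by it.
    return (num_blocks * ratio, 2, kernel_block_size, *trailing_dims)


# Assumed example values: 10 logical blocks, trailing dims (8, 64).
logical_shape = (10, 2, 128, 8, 64)               # "MRV2"-style logical shape
kernel_shape = kernel_view_shape(10, 128, 64, (8, 64))  # "MRV1"-style shape

# Both shapes describe the same number of elements.
assert prod(logical_shape) == prod(kernel_shape)
```

A connector that indexes blocks by the first dimension would see a different block count under the two shapes, which is why the PR aims to expose one of them consistently.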

see https://buildkite.com/vllm/ci/builds/66356/canvas?jid=019e2a39-ca79-4903-8a53-15733744bade&tab=output

This PR merely re-applies the old logic we used in MRv1 for viewing tensors here to ensure this is exposed consistently across all connectors. I am not fully up to speed wrt why MRv2 is exposing logical shape only, so I welcome any comment with more context on it to come up with a fix that is more aligned with MRv2 design.

Test with

 FLASHINFER=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh

or a similar PD example using e.g. block_size=128 with FI.

Resolves https://github.com/vllm-project/vllm/issues/42846

Changed files

tests/test_config.py (modified, +0/-7)
vllm/config/vllm.py (modified, +0/-4)
vllm/v1/worker/gpu/attn_utils.py (modified, +49/-12)
vllm/v1/worker/gpu/block_table.py (modified, +12/-2)
vllm/v1/worker/gpu/model_runner.py (modified, +6/-4)
vllm/v1/worker/gpu/spec_decode/eagle/speculator.py (modified, +1/-1)

Code Example

FAILED models/language/generation/test_hybrid.py::test_models[5-64-ibm-granite/granite-4.0-tiny-preview] - AssertionError: Test1:

---

vllm_no_cache:       'china' rank=1, logprob=-3.011
vllm_partial_cache:  'china' rank=2, logprob=-3.119  ('\n' rank=1 at same logprob)

Name of failing test

tests/models/language/generation/test_hybrid.py::test_models[5-64-ibm-granite/granite-4.0-tiny-preview]

Basic information

Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

The test compares vLLM outputs between cached and uncached runs for the hybrid (Mamba/SSM) model ibm-granite/granite-4.0-tiny-preview. The assertion output_id_0 in logprobs_elem_1 fails: the top token from one run isn't in the top-k logprobs of the other run when prefix caching is involved.

The outputs diverge between cached and uncached paths. For example in Test1, vllm_no_cache produces "china" (rank 1) while vllm_partial_cache has "china" drop to rank 2 with "\n" tied at the same logprob. This is a prefix caching correctness issue for hybrid/SSM models.
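
The kind of check described above can be sketched as follows. This is a simplified illustration, not the test's actual helper: the function name `tokens_close` and all token and logprob values in the example are made up, and the real test may differ in details.

```python
def tokens_close(token_a, topk_a, token_b, topk_b):
    """Hypothetical closeness check for one decoding position.

    token_a / token_b: the token each run actually sampled.
    topk_a / topk_b:   dict mapping token -> logprob for each run's top-k.
    Two runs are treated as close if they picked the same token, or if each
    run's token still appears in the other run's top-k logprobs.
    """
    if token_a == token_b:
        return True
    return token_a in topk_b and token_b in topk_a


# Made-up values: the runs disagree, but each token is in the other's top-k.
ok = tokens_close("x", {"x": -1.0, "y": -1.2}, "y", {"y": -1.0, "x": -1.1})

# Made-up values: run A's token "x" is missing from run B's top-k.
bad = tokens_close("x", {"x": -1.0, "y": -1.2}, "y", {"y": -1.0, "z": -1.1})
```

Under this sketch a tie or near-tie between two candidates is tolerated, while a token falling out of the other run's top-k is reported as a divergence.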

📝 History of failing test

Bisection:

Last passing build: #66633 (May 18 nightly, commit 23c15acd)
Last passing build: #66759 (May 18 daily, commit cd49a05d)
First failing build: #66835 (May 19 nightly, commit 9fd8487d)

The 18 commits between cd49a05d and 9fd8487d were checked for relevance. The most likely root cause is PR #42766 which changes how KV cache block tables are initialized:

Introduces kernel_block_sizes that can differ from logical block_sizes
Changes block table sizing: max_num_blocks * blocks_per_kv_block
Changes block_sizes_tensor to use kernel_block_sizes instead of block_sizes
Changes append_block_ids to expand blocks when blocks_per_kv_block > 1
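
The block table changes listed above can be illustrated with a small sketch. The function names and the contiguous mapping (logical block `b` covers kernel blocks `b * k` through `b * k + k - 1`) are assumptions made for illustration and are not taken from the PR diff.

```python
def expand_block_ids(block_ids, blocks_per_kv_block):
    """Map logical block IDs to kernel block IDs (assumed contiguous layout)."""
    if blocks_per_kv_block < 1:
        raise ValueError("blocks_per_kv_block must be >= 1")
    return [
        block_id * blocks_per_kv_block + offset
        for block_id in block_ids
        for offset in range(blocks_per_kv_block)
    ]


def block_table_width(max_num_blocks, blocks_per_kv_block):
    """Number of block table entries needed per request after expansion."""
    return max_num_blocks * blocks_per_kv_block


# With block_size=128 and kernel_block_size=64, blocks_per_kv_block is 2.
kernel_ids = expand_block_ids([0, 3], 2)  # -> [0, 1, 6, 7]
width = block_table_width(4, 2)           # -> 8
```

If a cache group ends up with an unexpected blocks_per_kv_block, both the IDs and the table width change, which is one way a layout mismatch of the kind suspected here could arise.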

For hybrid/SSM models, the Mamba cache group goes through a different path in prepare_kernel_block_sizes, which may produce a different block size than before, changing the KV cache layout seen during prefix caching and causing cached vs uncached outputs to diverge.

CC List.

@aorwall

Tagging since it seems related to https://github.com/vllm-project/vllm/pull/42766.
