vllm - ✅(Solved) Fix [CI Failure]: Language Models Tests (Hybrid) 1 - granite-4.0-tiny-preview prefix caching regression [1 pull requests, 1 comments, 2 participants]

GitHub stats
vllm-project/vllm#43090 · Fetched 2026-05-20 03:39:55
Comments: 1 · Participants: 2 · Timeline events: 5 · Reactions: 0
Timeline (top): mentioned ×2, subscribed ×2, commented ×1

Error Message

FAILED models/language/generation/test_hybrid.py::test_models[5-64-ibm-granite/granite-4.0-tiny-preview] - AssertionError: Test1:

Root Cause

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

PR fix notes

PR #42766: [Bugfix][MRV2] Fix KVCache tensor explicit kernel_block_size dim

Description (problem / solution / changelog)

With MRv2 being on by default for dense Qwen models (e.g. Qwen3-0.6B), we found a discrepancy in how KV cache tensors are exposed to connectors through the register_kv_caches API when the kernel and logical block_size "don't match", e.g.:

num_blocks, 2, 128, ...      <==MRV2
num_blocks*2, 2, 64, ...    <==MRV1

assuming block_size=128 and kernel_block_size=64 on the FlashInfer (FI) backend.

see https://buildkite.com/vllm/ci/builds/66356/canvas?jid=019e2a39-ca79-4903-8a53-15733744bade&tab=output

This PR merely re-applies the old logic we used in MRv1 for viewing tensors here to ensure this is exposed consistently across all connectors. I am not fully up to speed wrt why MRv2 is exposing logical shape only, so I welcome any comment with more context on it to come up with a fix that is more aligned with MRv2 design.
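The shape mismatch described above can be illustrated with a small shape calculation. This is a minimal sketch, not the PR's code: `kernel_view_shape` is a hypothetical helper, and the trailing dimensions (8 KV heads, head size 64) and block count are assumed purely for illustration. It only shows that both shapes cover the same number of elements, which is why a connector written against one layout would misindex the other.

```python
from math import prod


def kernel_view_shape(logical_shape, block_size, kernel_block_size):
    """Return the shape seen when each logical block is split into kernel blocks.

    Hypothetical helper: assumes a layout of (num_blocks, 2, block_size, *rest),
    matching the two shapes quoted in the PR description.
    """
    num_blocks, kv, bs, *rest = logical_shape
    if bs != block_size:
        raise ValueError("logical_shape does not match block_size")
    if block_size % kernel_block_size != 0:
        raise ValueError("block_size must be a multiple of kernel_block_size")
    blocks_per_kv_block = block_size // kernel_block_size
    return (num_blocks * blocks_per_kv_block, kv, kernel_block_size, *rest)


# Assumed example dimensions: 10 blocks, 8 KV heads, head size 64.
logical = (10, 2, 128, 8, 64)
kernel = kernel_view_shape(logical, block_size=128, kernel_block_size=64)

print(kernel)                         # (20, 2, 64, 8, 64)
print(prod(logical) == prod(kernel))  # True
```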

Test with

 FLASHINFER=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh

or a similar PD example using, e.g., block_size=128 with FI.

Resolves https://github.com/vllm-project/vllm/issues/42846

Changed files

  • tests/test_config.py (modified, +0/-7)
  • vllm/config/vllm.py (modified, +0/-4)
  • vllm/v1/worker/gpu/attn_utils.py (modified, +49/-12)
  • vllm/v1/worker/gpu/block_table.py (modified, +12/-2)
  • vllm/v1/worker/gpu/model_runner.py (modified, +6/-4)
  • vllm/v1/worker/gpu/spec_decode/eagle/speculator.py (modified, +1/-1)


Name of failing test

tests/models/language/generation/test_hybrid.py::test_models[5-64-ibm-granite/granite-4.0-tiny-preview]

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test


The test compares vLLM outputs between cached and uncached runs for the hybrid (Mamba/SSM) model ibm-granite/granite-4.0-tiny-preview. The assertion output_id_0 in logprobs_elem_1 fails: the top token from one run is not in the top-k logprobs of the other run when prefix caching is involved.

The outputs diverge between cached and uncached paths. For example in Test1, vllm_no_cache produces "china" (rank 1) while vllm_partial_cache has "china" drop to rank 2 with "\n" tied at the same logprob. This is a prefix caching correctness issue for hybrid/SSM models.
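The kind of check described above can be sketched as follows. This is an assumption-laden illustration, not the test suite's actual helper: `tokens_close` is a hypothetical name, and the top-k tables are invented for the example rather than copied from the CI log. The idea is that when two runs pick different tokens at a position, each run's choice should still appear among the other run's top-k candidates.

```python
def tokens_close(token_0, topk_0, token_1, topk_1):
    """Check one position where two runs chose different tokens.

    topk_0 / topk_1 map candidate token -> logprob for each run.
    The check passes only if each run's chosen token is present in the
    other run's top-k candidates.
    """
    return token_0 in topk_1 and token_1 in topk_0


# Invented top-k tables (k=2) for a single position.
topk_uncached = {"china": -3.0, "\n": -3.2}
topk_cached_tie = {"\n": -3.1, "china": -3.1}   # tie: both tokens still in top-k
topk_cached_drift = {"\n": -2.9, "the": -3.0}   # "china" fell out of top-k

print(tokens_close("china", topk_uncached, "\n", topk_cached_tie))    # True
print(tokens_close("china", topk_uncached, "\n", topk_cached_drift))  # False
```

A near-tie that merely swaps ranks would pass a check of this shape; a failure means one run's chosen token is missing from the other run's candidate list entirely.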

FAILED models/language/generation/test_hybrid.py::test_models[5-64-ibm-granite/granite-4.0-tiny-preview] - AssertionError: Test1:

Example divergence:

vllm_no_cache:       'china' rank=1, logprob=-3.011
vllm_partial_cache:  'china' rank=2, logprob=-3.119  ('\n' rank=1 at same logprob)

📝 History of failing test

Bisection:

  • Last passing build: #66633 (May 18 nightly, commit 23c15acd)
  • Last passing build: #66759 (May 18 daily, commit cd49a05d)
  • First failing build: #66835 (May 19 nightly, commit 9fd8487d)

The 18 commits between cd49a05d and 9fd8487d were checked for relevance. The most likely root cause is PR #42766, which changes how KV cache block tables are initialized:

  • Introduces kernel_block_sizes that can differ from logical block_sizes
  • Changes block table sizing: max_num_blocks * blocks_per_kv_block
  • Changes block_sizes_tensor to use kernel_block_sizes instead of block_sizes
  • Changes append_block_ids to expand blocks when blocks_per_kv_block > 1
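The block expansion listed above can be sketched roughly as below. This is a hedged illustration, not the PR's implementation: `expand_block_ids` is a hypothetical function, and it assumes that logical block `b` maps to the contiguous kernel blocks starting at `b * blocks_per_kv_block`.

```python
def expand_block_ids(block_ids, blocks_per_kv_block):
    """Map logical block ids to kernel block ids.

    Assumption: logical block b covers kernel blocks
    [b * blocks_per_kv_block, (b + 1) * blocks_per_kv_block).
    """
    if blocks_per_kv_block == 1:
        return list(block_ids)
    return [
        b * blocks_per_kv_block + offset
        for b in block_ids
        for offset in range(blocks_per_kv_block)
    ]


block_size, kernel_block_size = 128, 64
blocks_per_kv_block = block_size // kernel_block_size  # 2
max_num_blocks = 4  # assumed value for illustration

print(expand_block_ids([0, 3, 5], blocks_per_kv_block))  # [0, 1, 6, 7, 10, 11]
print(max_num_blocks * blocks_per_kv_block)              # 8 block table entries
```

Under this assumption, any cache group whose kernel block size differs from what it used before the change would see a different block table layout, which fits the hypothesis in the issue.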

For hybrid/SSM models, the Mamba cache group goes through a different path in prepare_kernel_block_sizes, which may produce a different block size than before, changing the KV cache layout seen during prefix caching and causing cached vs uncached outputs to diverge.

CC List.

@aorwall

Tagging since it seems related to https://github.com/vllm-project/vllm/pull/42766.
