vllm - 💡(How to fix) Fix [CI Failure]: mi355_2: NixlConnector PD + Spec Decode acceptance (2 GPUs) [2 comments, 1 participants]

AndreasKaratzas · 2026-04-30T02:30:34Z

### Name of failing test

`(command rocm-smi || true) && export VLLM_TEST_GROUP_NAME=mi355_2-nixlconnector-pd---spec-decode-acceptance-2-gpus && export VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 && cd /vllm-workspace/tests && uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt && ROCM_ATTN=1 bash v1/kv_connector/nixl_integration/spec_decode_acceptance_test.sh`

### Basic information

- [ ] Flaky test
- [x] Can reproduce locally
- [ ] Caused by external libraries (e.g. bug in `transformers`)

### 🧪 Describe the failing test

`AssertionError: All kv cache tensors must have the same number of blocks`

### 📝 History of failing test

- Current streak start: 2026-04-28
- First failure in 60d window: 2026-04-21
- Last successful nightly: 2026-04-27
- Break frequency (60d, pass↔fail flips): 3
- Latest nightly date: 2026-04-29
- Latest build(s): [amd-ci #8058](https://buildkite.com/vllm/amd-ci/builds/8058)
- Latest hardware status: `mi250_2`=fail
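The history figures can be cross-checked with a little arithmetic. The sketch below counts pass↔fail flips and finds the start of the current streak for a hypothetical sequence of nightly results. The per-day outcomes are not given in the issue: the 2026-04-20 entry and the omission of the days between 2026-04-21 and 2026-04-27 are assumptions, chosen only so that the sequence agrees with the dates reported above.

```python
from datetime import date

# Hypothetical nightly outcomes (assumption): consistent with the dates in the
# issue, but the real per-day results are not published there.
nightly = [
    (date(2026, 4, 20), "pass"),  # assumed earlier passing run
    (date(2026, 4, 21), "fail"),  # first failure in the 60d window
    (date(2026, 4, 27), "pass"),  # last successful nightly
    (date(2026, 4, 28), "fail"),  # current streak start
    (date(2026, 4, 29), "fail"),  # latest nightly
]


def count_flips(results):
    """Count adjacent pass<->fail transitions in date order."""
    ordered = [status for _, status in sorted(results)]
    return sum(a != b for a, b in zip(ordered, ordered[1:]))


def current_streak_start(results):
    """Return the first date of the trailing run of identical outcomes."""
    ordered = sorted(results)
    last_status = ordered[-1][1]
    start = ordered[-1][0]
    for day, status in reversed(ordered):
        if status != last_status:
            break
        start = day
    return start


print("flips:", count_flips(nightly))
print("streak start:", current_streak_start(nightly))
```

Under these assumptions the three reported flips read as pass to fail on 2026-04-21, a recovery by 2026-04-27, and a new break on 2026-04-28, which fits the issue's choice to leave "Flaky test" unchecked while marking the failure as reproducible locally.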

GitHub stats

vllm-project/vllm#41319 • Fetched 2026-05-01 05:34:17

Author

AndreasKaratzas

Participants

AndreasKaratzas

Timeline (top)

added_to_project_v2 ×2, commented ×2, labeled ×1, mentioned ×1


extent analysis

TL;DR

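The test fails with "AssertionError: All kv cache tensors must have the same number of blocks" while running spec_decode_acceptance_test.sh. The most likely fix is to make every KV cache tensor in this configuration use the same block count, either by changing how the script sets up the run or by changing the code that allocates the tensors.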

Guidance

Reproduce the failure locally with the command listed under "Name of failing test" (the issue marks it as reproducible locally) and capture the full traceback, which shows where the assertion is raised.
Inspect the code that allocates the KV cache tensors and compare their block counts; the assertion fails when at least one tensor differs from the rest.
Check requirements/kv_connectors_rocm.txt for recently changed dependencies, keeping in mind that the issue does not mark the failure as caused by external libraries.
Compare the test history across hardware: the issue is filed for mi355_2, while the latest hardware status listed is mi250_2=fail, so confirm which configurations are actually affected.

Example

The issue includes neither spec_decode_acceptance_test.sh nor the code that allocates the KV cache tensors, so no specific patch can be shown here.
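Purely as an illustration of the invariant the assertion names, and not as vLLM's actual code, the sketch below groups cache tensors by block count and reports which ones disagree. It makes several assumptions: shapes are passed as plain tuples, the block count is the leading dimension, and the layer names, shapes, and both helper functions are hypothetical.

```python
from collections import defaultdict


def group_by_num_blocks(cache_shapes, block_dim=0):
    """Group layer names by their size along the assumed block dimension."""
    groups = defaultdict(list)
    for name, shape in cache_shapes.items():
        groups[shape[block_dim]].append(name)
    return dict(groups)


def check_uniform_blocks(cache_shapes, block_dim=0):
    """Raise AssertionError, naming the offending layers, if block counts differ."""
    groups = group_by_num_blocks(cache_shapes, block_dim)
    if len(groups) > 1:
        detail = "; ".join(
            f"{count} blocks: {sorted(names)}"
            for count, names in sorted(groups.items())
        )
        raise AssertionError(
            "All kv cache tensors must have the same number of blocks "
            f"({detail})"
        )


# Hypothetical shapes (assumption): the first dimension is the block count.
shapes = {
    "layer_a": (1024, 2, 16, 8, 128),
    "layer_b": (1024, 2, 16, 8, 128),
    "layer_c": (512, 2, 16, 8, 128),
}

try:
    check_uniform_blocks(shapes)
except AssertionError as err:
    print(err)
```

Listing the mismatching layers in the message, rather than only stating that a mismatch exists, is the kind of diagnostic that would narrow down which allocation path to inspect when reproducing the failure.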

Notes

The exact fix depends on what spec_decode_acceptance_test.sh does and on the code that allocates the KV cache tensors; neither is included in the issue, so the root cause cannot be determined from the issue alone.

Recommendation

Either adjust how spec_decode_acceptance_test.sh sets up the run so that the mismatched allocation is not triggered (a workaround), or update the code that allocates the KV cache tensors so that their block counts agree. A workaround may be quicker because the failure appears tied to one test group and hardware configuration, but it leaves the underlying mismatch in place, so treat it as temporary.
