vllm - 💡(How to fix) Fix [CI Failure]: mi300_1: V1 Core + KV + Metrics [2 comments, 1 participants]

GitHub stats
Issue: vllm-project/vllm#41323
Fetched: 2026-05-01 05:34:14
Comments: 2
Participants: 1
Timeline events: 8 (top: added_to_project_v2 ×2, commented ×2, labeled ×1, mentioned ×1)
Reactions: 0

Error Message

ERROR 04-29 07:29:19 [worker.py:2112] Traceback (most recent call last):
ERROR 04-29 07:29:19 [worker.py:2112]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py", line 2096, in _read_blocks
ERROR 04-29 07:29:19 [worker.py:2112]     handle = self.nixl_wrapper.make_prepped_xfer(
ERROR 04-29 07:29:19 [worker.py:2112]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 07:29:19 [worker.py:2112]   File "/vllm-workspace/tests/v1/kv_connector/unit/test_nixl_connector.py", line 1957, in make_prepped_xfer
ERROR 04-29 07:29:19 [worker.py:2112]     raise RuntimeError("BAD STATUS")
ERROR 04-29 07:29:19 [worker.py:2112] RuntimeError: BAD STATUS



Name of failing test

(command rocm-smi || true) &&
export VLLM_TEST_GROUP_NAME=mi300_1-v1-core---kv---metrics &&
export VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 &&
cd /vllm-workspace/tests &&
uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt &&
pytest -v -s -m 'not cpu_test' v1/core &&
pytest -v -s v1/executor &&
pytest -v -s v1/kv_offload &&
pytest -v -s v1/worker &&
pytest -v -s -m 'not cpu_test' v1/kv_connector/unit &&
pytest -v -s -m 'not cpu_test' v1/metrics &&
pip install -U git+https://github.com/robertgshaw2-redhat/lm-evaluation-harness.git@streaming-api &&
pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

ERROR 04-29 07:29:19 [worker.py:2112] Traceback (most recent call last):
ERROR 04-29 07:29:19 [worker.py:2112]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py", line 2096, in _read_blocks
ERROR 04-29 07:29:19 [worker.py:2112]     handle = self.nixl_wrapper.make_prepped_xfer(
ERROR 04-29 07:29:19 [worker.py:2112]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 07:29:19 [worker.py:2112]   File "/vllm-workspace/tests/v1/kv_connector/unit/test_nixl_connector.py", line 1957, in make_prepped_xfer
ERROR 04-29 07:29:19 [worker.py:2112]     raise RuntimeError("BAD STATUS")
ERROR 04-29 07:29:19 [worker.py:2112] RuntimeError: BAD STATUS

📝 History of failing test

  • Current streak start: 2026-04-23
  • First failure in 60d window: 2026-04-21
  • Last successful nightly: 2026-04-22
  • Break frequency (60d, pass↔fail flips): 2
  • Latest nightly date: 2026-04-29
  • Latest build(s): amd-ci #8058
  • Latest hardware status: mi300_1=fail
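
The "break frequency" figure above counts pass↔fail flips. As a rough illustration of that idea, the sketch below counts adjacent nightly results whose status differs. The function name `count_flips` and the sample data are hypothetical, and the CI dashboard's exact definition (window boundaries, handling of skipped runs) is an assumption that is not confirmed by this issue.

```python
def count_flips(results):
    """Count adjacent pairs of nightly results whose status differs."""
    return sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)


# Illustrative data only, not the real nightly history for mi300_1.
nightly = ["pass", "pass", "fail", "pass", "fail", "fail"]
flips = count_flips(nightly)
```

A low flip count combined with a long current streak points toward a persistent breakage rather than an intermittent one, which is worth keeping in mind when deciding whether to treat the test as flaky.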

extent analysis

TL;DR

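The logged RuntimeError: BAD STATUS is raised by the make_prepped_xfer defined in tests/v1/kv_connector/unit/test_nixl_connector.py (line 1957), which _read_blocks in worker.py calls at line 2096. Because the exception comes from test code rather than from the NIXL library, the logged traceback may be deliberate failure injection. Identify the assertion that actually fails on mi300_1 before changing any code.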

Guidance

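  • Read _read_blocks in worker.py (around line 2096) alongside the make_prepped_xfer stand-in in test_nixl_connector.py (around line 1957) to see which failure the test injects and what behavior it expects from the worker.
  • Check how _read_blocks handles an exception from make_prepped_xfer: the error is logged at worker.py:2112, so confirm that the follow-up handling matches what the test asserts.
  • Find the first failing test and assertion in the log of amd-ci #8058; the traceback in this issue is an ERROR log line and is not necessarily the failure itself.
  • Compare changes that landed between the last successful nightly (2026-04-22) and the start of the current streak (2026-04-23), and check whether the same test passes on other hardware, to separate a code regression from an mi300_1-specific problem.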
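
The pattern visible in the traceback, a test double whose make_prepped_xfer always raises and a caller that logs the error, can be sketched as follows. This is a minimal illustration under stated assumptions and not vLLM's implementation: `FailingWrapper`, `Worker`, `read_blocks`, and `failed_requests` are hypothetical names, and only `make_prepped_xfer`, `nixl_wrapper`, and the "BAD STATUS" message come from the log.

```python
import logging

logger = logging.getLogger("nixl_sketch")


class FailingWrapper:
    """Hypothetical test double: every transfer preparation fails."""

    def make_prepped_xfer(self, *args, **kwargs):
        raise RuntimeError("BAD STATUS")


class Worker:
    """Hypothetical stand-in for the code that calls the wrapper."""

    def __init__(self, wrapper):
        self.nixl_wrapper = wrapper
        self.failed_requests = set()

    def read_blocks(self, request_id, block_ids):
        try:
            handle = self.nixl_wrapper.make_prepped_xfer("READ", block_ids)
        except Exception:
            # Log the traceback and record the failure instead of crashing.
            logger.exception("transfer setup failed for %s", request_id)
            self.failed_requests.add(request_id)
            return None
        return handle


worker = Worker(FailingWrapper())
result = worker.read_blocks("req-1", [0, 1, 2])
```

In a test built this way, the ERROR traceback appears in the output even when the test passes, so the log alone does not show which assertion failed.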

Notes

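The failure is reported for one test group (mi300_1: V1 Core + KV + Metrics) on one hardware target. The issue does not say whether the same tests pass elsewhere or which assertion fails, so the cause could be in the test, in the worker's error handling, or in the ROCm environment.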

Recommendation

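No fix has been confirmed. Start from the build log of amd-ci #8058 to find the failing test and assertion, then check how _read_blocks handles the exception raised by the test's make_prepped_xfer on mi300_1. The status handling there is a plausible cause of the failure but has not been verified.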
