vllm - 💡(How to fix) Fix [CI Failure]: mi250_1: Kernels Core Operation Test [3 comments, 2 participants]

GitHub stats
vllm-project/vllm#37704 (fetched 2026-04-08 01:08:45)
Comments: 3
Participants: 2
Timeline events: 18
Reactions: 0
Timeline (top): mentioned ×4, subscribed ×4, commented ×3, project_v2_item_status_changed ×3

Error Message

RMS norm seems to be buggy on MI250. We might need to fall back to MI325 for this. Investigation pending. Error logs:

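Root Cause

Not yet determined; the issue marks the investigation as pending. The issue template offers these classifications, and the scrape does not show which, if any, were selected:

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)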

Code Example

FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[False-cuda:0-0-0.01-dtype0-False-768-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[False-cuda:0-0-0.01-dtype0-False-5120-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[False-cuda:0-0-0.01-dtype0-False-8192-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[False-cuda:0-0-1.0-dtype0-False-5120-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[False-cuda:0-0-1.0-dtype0-False-8192-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[False-cuda:0-0-10.0-dtype0-False-768-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[False-cuda:0-0-10.0-dtype0-False-5120-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[False-cuda:0-0-10.0-dtype0-False-8192-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[True-cuda:0-0-0.01-dtype0-False-768-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[True-cuda:0-0-0.01-dtype0-False-5120-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[True-cuda:0-0-0.01-dtype0-False-8192-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[True-cuda:0-0-1.0-dtype0-False-5120-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[True-cuda:0-0-1.0-dtype0-False-8192-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[True-cuda:0-0-10.0-dtype0-False-8-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[True-cuda:0-0-10.0-dtype0-False-5120-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[True-cuda:0-0-10.0-dtype0-False-8192-4096]
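
One way to triage the list above is to tally the values seen at each position of the failing parameter IDs, which shows which parameters stay constant across every failure. The sketch below uses a hypothetical helper (tally_failed_params is not part of vLLM) and treats the ID fields as positional, since the log does not name them. Only three of the failing lines are embedded as sample input.

```python
from collections import Counter

# Three lines copied from the log above; paste the full list to tally everything.
SAMPLE_LOG = """\
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[False-cuda:0-0-0.01-dtype0-False-768-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[True-cuda:0-0-10.0-dtype0-False-8-4096]
FAILED kernels/core/test_layernorm.py::test_fused_rms_norm_quant[True-cuda:0-0-10.0-dtype0-False-8192-4096
"""


def tally_failed_params(log_text: str) -> list:
    """Count the distinct values at each position of the failing test IDs.

    Hypothetical helper: fields are split on "-", which assumes no field
    contains a hyphen (true for the IDs in this log).
    """
    columns = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("FAILED") or "[" not in line:
            continue
        # Keep the text after the first "[" and tolerate a missing "]".
        params = line.split("[", 1)[1].rstrip("]")
        for position, value in enumerate(params.split("-")):
            if position == len(columns):
                columns.append(Counter())
            columns[position][value] += 1
    return columns


columns = tally_failed_params(SAMPLE_LOG)
for position, counts in enumerate(columns):
    print(position, dict(counts))
```

A position with a single value marks a parameter shared by all failures. In the log above, for example, every failing ID carries dtype0 and ends in 4096.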

Name of failing test

pytest -s -v kernels/core/test_layernorm.py::test_fused_rms_norm_quant

📝 History of failing test

https://buildkite.com/vllm/amd-ci/builds/6721/steps/canvas?sid=019d09d4-70af-4ff1-90ed-6b2680d926f6&tab=output

extent analysis

Fix Plan

To unblock CI while the RMS norm failure on MI250 is investigated, test_fused_rms_norm_quant can be skipped on MI250 so that it runs only on other hardware, such as MI325. This is a workaround, not a fix for the kernel. The steps:

  • Add a check of the device name to test_fused_rms_norm_quant.
  • If the device is an MI250, skip the test. Falling back to MI325 means running the test on MI325 hardware; that is a CI configuration change, not something the test body can do.

Example code:

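import pytest
import torch

# The real test is parametrized; its arguments are omitted here.
def test_fused_rms_norm_quant():
    # Query the device name only when a GPU is visible, because
    # get_device_name() raises on a CPU-only machine.
    if torch.cuda.is_available():
        device_name = torch.cuda.get_device_name(0)
        # Use pytest.skip() so the run is reported as skipped; a bare
        # return would be reported as a pass.
        if "MI250" in device_name:
            pytest.skip("fused RMS norm quant fails on MI250; investigation pending")

    # Rest of the test code
    # ...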

Alternatively, you can use a decorator to skip the test on MI250:

import functools

import pytest
import torch

def skip_on_mi250(func):
    # functools.wraps keeps the original signature visible, and the wrapper
    # forwards whatever arguments the test receives.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if torch.cuda.is_available() and "MI250" in torch.cuda.get_device_name(0):
            pytest.skip("fused RMS norm quant fails on MI250; investigation pending")
        return func(*args, **kwargs)
    return wrapper

@skip_on_mi250
def test_fused_rms_norm_quant():
    # Rest of the test code
    ...

Verification

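To verify the change, rerun pytest -s -v kernels/core/test_layernorm.py::test_fused_rms_norm_quant on an MI250 machine and confirm that the cases are reported as SKIPPED rather than FAILED. On MI325, the same command should still run the test instead of skipping it.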
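
The name-matching part of the skip can also be checked without MI250 hardware by isolating it in a pure function. A minimal sketch, assuming a hypothetical helper is_mi250 and illustrative device-name strings (the issue does not show the exact string that torch.cuda.get_device_name returns on the CI machines):

```python
def is_mi250(device_name: str) -> bool:
    """Return True when a reported device name refers to an MI250-class GPU.

    Hypothetical helper: the comparison is case-insensitive, which is slightly
    more tolerant than the plain substring check in the Fix Plan.
    """
    return "MI250" in device_name.upper()


# Illustrative names only; substitute the strings your machines report.
EXAMPLE_NAMES = [
    "AMD Instinct MI250X/MI250",
    "AMD Instinct MI325X",
    "NVIDIA A100-SXM4-80GB",
]

for name in EXAMPLE_NAMES:
    action = "skip" if is_mi250(name) else "run"
    print(f"{name}: {action}")
```

In the test itself, the result of torch.cuda.get_device_name(0) would be passed to this function, so the GPU-dependent call stays in one place.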

Extra Tips

  • If the job moves to MI325, update the CI configuration so that this test still runs there.
  • Put the reason for the skip in the skip message or a comment and reference the tracking issue (vllm-project/vllm#37704), so the skip can be removed once the MI250 failure is resolved.
  • If the root cause turns out to be in an external library rather than in the vLLM kernel, report it to that library's issue tracker.
