vllm - ✅(Solved) Fix [CI Failure]: Test Eval Marlin Qwen3-30B-A3B-Fp8 [1 pull requests, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#38101 · Fetched 2026-04-08 01:26:38
Comments: 2 · Participants: 2 · Timeline: 16 · Reactions: 0
Timeline (top): mentioned ×5, subscribed ×5, commented ×2, added_to_project_v2 ×1

Root Cause

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

PR fix notes

PR #32929: [FP8]add FP8 WoQ kernel abstraction.

Description (problem / solution / changelog)

Purpose

This PR refactors the FP8 linear kernel stack to integrate the Marlin kernel into the FP8 kernel abstraction and to centralize kernel selection. After this change, the FP8 execution path has a single intentional divergence: block-scaled scaled_mm, which should not use Marlin.

Changes

  1. Centralized FP8 kernel selection via init_fp8_linear_kernel(). Uses init_fp8_linear_kernel() to select the appropriate FP8 kernel implementation (e.g., W8A16 vs. W8A8) based on configuration and platform capability.
  2. Added MarlinFP8ScaledMMLinearKernel. Introduces a Marlin-backed FP8 kernel implementation under the scaled-mm kernel abstraction, enabling Marlin to be selected and used through the unified FP8 kernel interface.
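
The centralized selection described above can be pictured with a small sketch. It is not vLLM's init_fp8_linear_kernel(); the KernelConfig fields, the kernel class names, and the selection rules are all hypothetical and only illustrate one entry point choosing between a W8A8 scaled-mm kernel, a Marlin-style W8A16 kernel, and a block-scaled path that never falls back to Marlin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelConfig:
    # Hypothetical fields, chosen for illustration only.
    activation_bits: int   # 8 -> W8A8, 16 -> W8A16 (weight-only)
    block_scaled: bool     # block-scaled scaled_mm path
    has_native_fp8: bool   # platform offers hardware FP8 matmul


class ScaledMMKernel:
    name = "scaled_mm"


class BlockScaledMMKernel:
    name = "block_scaled_mm"


class MarlinKernel:
    name = "marlin"


def select_fp8_kernel(cfg: KernelConfig):
    """Single place where the FP8 kernel choice is made (toy rules)."""
    # Block-scaled scaled_mm is the one intentional divergence:
    # it is never routed to Marlin.
    if cfg.block_scaled:
        return BlockScaledMMKernel()
    # Weight-only FP8, or no native FP8 support: use the Marlin-style kernel.
    if cfg.activation_bits == 16 or not cfg.has_native_fp8:
        return MarlinKernel()
    return ScaledMMKernel()


if __name__ == "__main__":
    for cfg in (
        KernelConfig(8, False, True),
        KernelConfig(8, False, False),
        KernelConfig(8, True, False),
    ):
        print(cfg, "->", select_fp8_kernel(cfg).name)
```

Keeping the decision in one function means every quantization scheme that asks for a kernel gets the same answer for the same configuration, which is the point of the refactor. It also means every caller has to honor the selected kernel's full lifecycle.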

Follow-up (post-merge)

  • Add XPU W8A16 GEMM kernel support in the FP8 linear path once this refactor is merged.

Test Plan

CI

Test Result

lm-eval result for qwen3-4B on a 3090 (screenshot: https://github.com/user-attachments/assets/a7bbd3bf-308b-445b-be56-42d8da1cdb41)



Changed files

  • vllm/model_executor/kernels/linear/__init__.py (modified, +4/-0)
  • vllm/model_executor/kernels/linear/scaled_mm/__init__.py (modified, +4/-0)
  • vllm/model_executor/kernels/linear/scaled_mm/marlin.py (added, +120/-0)
  • vllm/model_executor/layers/quantization/fbgemm_fp8.py (modified, +0/-12)
  • vllm/model_executor/layers/quantization/fp8.py (modified, +49/-71)

Name of failing test

tests/evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[Qwen3-30B-A3B-Fp8-CT-Channel-marlin]

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

pytest tests/evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/moe-refactor/config-h100.txt -k "Qwen3-30B-A3B-Fp8-CT-Channel-marlin" -v -s

Fails with "'QKVParallelLinear' object has no attribute 'workspace'", raised in vllm/vllm/model_executor/kernels/linear/scaled_mm/marlin.py, line 101, in apply_weights, which is called from schemes/compressed_tensors_w8a8_fp8.py. The cause is that CompressedTensorsW8A8Fp8 doesn't call fp8_linear.process_weights_after_loading(layer), so the workspace attribute that apply_weights reads is presumably never set on the layer.

Introduced in https://github.com/vllm-project/vllm/pull/32929
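
The failure mode can be reproduced in miniature without vLLM or a GPU. The classes below are toy stand-ins, not the real MarlinFP8ScaledMMLinearKernel or CompressedTensorsW8A8Fp8; they only assume what the report states, namely that the kernel's apply_weights reads layer.workspace and that the scheme skips the kernel's process_weights_after_loading.

```python
class ToyLayer:
    """Stand-in for a linear layer; it only carries attributes."""


class ToyMarlinKernel:
    def process_weights_after_loading(self, layer):
        # The kernel attaches its scratch buffer here.
        layer.workspace = [0] * 4

    def apply_weights(self, layer, x):
        # Reads the buffer; fails if the hook above never ran.
        return x, len(layer.workspace)


class BrokenScheme:
    def __init__(self):
        self.fp8_linear = ToyMarlinKernel()

    def process_weights_after_loading(self, layer):
        pass  # Forgets to delegate to the kernel.

    def apply_weights(self, layer, x):
        return self.fp8_linear.apply_weights(layer, x)


class FixedScheme(BrokenScheme):
    def process_weights_after_loading(self, layer):
        # Delegating lets the kernel finish its own setup.
        self.fp8_linear.process_weights_after_loading(layer)


def run(scheme):
    layer = ToyLayer()
    scheme.process_weights_after_loading(layer)
    try:
        scheme.apply_weights(layer, 1.0)
        return "ok"
    except AttributeError as exc:
        return str(exc)


if __name__ == "__main__":
    print(run(BrokenScheme()))
    print(run(FixedScheme()))
```

The broken variant fails only at the first forward call, long after weight loading, which matches the symptom of the error surfacing inside apply_weights rather than during model setup.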

📝 History of failing test

https://buildkite.com/vllm/ci/builds/57706/steps/canvas?sid=019d1e6f-1784-460b-b631-fcea0f90d7ff&tab=output

CC List.

@jikunshang @robertgshaw2-redhat @mgoin @tjtanaa

extent analysis

Fix Plan

To fix the issue, CompressedTensorsW8A8Fp8 must call fp8_linear.process_weights_after_loading(layer), which it currently skips.

Here are the steps:

  • Modify schemes/compressed_tensors_w8a8_fp8.py to call fp8_linear.process_weights_after_loading(layer) once the weights are loaded.
  • Confirm that this call leaves the workspace attribute set on the layer, since apply_weights in marlin.py reads it.

Example code:

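# schemes/compressed_tensors_w8a8_fp8.py
# Sketch only. It assumes the scheme stores the selected kernel as
# self.fp8_linear; the issue names fp8_linear but not where it lives.

class CompressedTensorsW8A8Fp8:
    # ...
    def process_weights_after_loading(self, layer):
        # ... existing weight post-processing ...
        # Let the selected FP8 kernel finish its own setup. On the Marlin
        # path this is expected to create layer.workspace, which
        # apply_weights reads.
        self.fp8_linear.process_weights_after_loading(layer)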

Verification

To verify the fix, run the failing test again:

pytest tests/evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/moe-refactor/config-h100.txt -k "Qwen3-30B-A3B-Fp8-CT-Channel-marlin" -v -s

If the test passes, the fix is successful.

Extra Tips

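  • After the change, check that layers on the Marlin path have the workspace attribute once weight loading finishes.
  • Re-run the other FP8 entries in configs/moe-refactor/config-h100.txt, not only the Marlin one, to make sure the extra call doesn't introduce regressions on other kernels.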
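
One way to make a missing workspace fail early and with a clearer message is a small guard that runs before the first forward pass. The helper below is a hypothetical sketch, not part of vLLM; the name require_attrs and the attribute list are assumptions for illustration.

```python
from types import SimpleNamespace


def require_attrs(layer, names, kernel_name):
    """Raise a descriptive error if a kernel's expected attributes are missing.

    Hypothetical helper: call it once after weight loading, before the
    first forward pass.
    """
    missing = [n for n in names if not hasattr(layer, n)]
    if missing:
        raise RuntimeError(
            f"{type(layer).__name__} is missing {missing} required by "
            f"{kernel_name}; was process_weights_after_loading called?"
        )


if __name__ == "__main__":
    ready = SimpleNamespace(workspace=[0])
    require_attrs(ready, ["workspace"], "marlin")  # passes silently

    not_ready = SimpleNamespace()
    try:
        require_attrs(not_ready, ["workspace"], "marlin")
    except RuntimeError as exc:
        print(exc)
```

A check like this turns a late AttributeError inside the kernel into an error that points at the skipped setup step.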
