vllm - ✅(Solved) Fix test_fused_marlin_moe borderline tolerance failure at m=666, K=2048 on L4 [1 pull requests, 1 participants]

vllm • 2026-04-11 00:06:28


GitHub stats

vllm-project/vllm#39549 • Fetched 2026-04-11 06:12:49

Author: blake-snc

Participants: blake-snc

Timeline (top): cross-referenced ×1

Error Message

test_fused_marlin_moe fails on L4 (SM89) at m=666, K=2048 with max absolute error 0.04297 vs atol=4e-2. Failures are 1-4 elements out of 1.4M (0.0003%).

The error is inherent 4-bit weight quantization noise accumulated over K=2048 at the test's fixed seed (torch.cuda.manual_seed(1))
The error varies by L4 instance — PR #39024 passed all MOE tests on a different L4
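
As a quick sanity check on the figures above, the following sketch recomputes the margin over tolerance and the failing fraction. The element count of 1.4M is the rounded figure from the report and is treated as exact here only for the estimate; the variable names are illustrative and do not come from the test.

```python
# Figures quoted in the report. The element count is rounded in the source,
# so the resulting percentage is approximate.
max_abs_err = 0.04297        # observed max absolute error
atol = 4e-2                  # absolute tolerance used by the test
failing_elements = 4         # upper end of the reported 1-4 range
total_elements = 1_400_000   # "1.4M" as stated; assumed exact for this estimate

# How far the worst element lands above the tolerance.
margin = max_abs_err - atol

# Share of output elements that exceed the tolerance, in percent.
failing_pct = 100 * failing_elements / total_elements

print(f"margin over atol: {margin:.5f}")
print(f"failing fraction: {failing_pct:.4f}%")
```

The margin comes out just under 0.003 and the failing fraction rounds to 0.0003%, matching the numbers quoted in the issue.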

Fix Action

Fix / Workaround

Investigation:

FP32 accumulation is used on SM80+ (use_fp16_accum = false) — not an accumulation precision issue
Partial block handling is correct: predicated loads skip padding rows, output writes are guarded by row < block_num_valid_tokens
The default_vllm_config fixture does not change Marlin kernel dispatch — moe_wna16_marlin_gemm is a direct C++ call with no config-dependent paths
The error is inherent 4-bit weight quantization noise accumulated over K=2048 at the test's fixed seed (torch.cuda.manual_seed(1))
m=1 and m=123 pass at the same K=2048. Only m=666 (chosen as a non-aligned stress test) is borderline
The error varies by L4 instance — PR #39024 passed all MOE tests on a different L4

PR fix notes

PR #35568: [Bugfix] Fix SM121 (DGX Spark) exclusion from Marlin/CUTLASS FP8 paths

Repository: vllm-project/vllm
Author: blake-snc
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/35568

Description (problem / solution / changelog)

Summary

SM121 (DGX Spark GB10) shares the same FP8 MMA capabilities as SM120 (RTX 5090) — both support native mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32. However, SM121 is excluded from all Marlin and CUTLASS FP8 codepaths by exact-match arch guards (== 120, in [89, 120], enable_sm120_only).

This fixes 8 locations across codegen, runtime, dispatch, and tests using bounded SM12x family checks (arch // 10 == 12, major_capability == 12, enable_sm120_family, is_device_capability_family(120)):

Codegen (FP8 kernel template generation):

csrc/quantization/marlin/generate_kernels.py: arch in [89, 120] → arch == 89 or arch // 10 == 12
csrc/moe/marlin_moe_wna16/generate_kernels.py: same fix

Runtime (FP8 activation gate):

csrc/moe/marlin_moe_wna16/ops.cu: == 120 → major_capability == 12

CUTLASS FP8 dispatch (kernel wrapper):

csrc/quantization/w8a8/cutlass/c3x/scaled_mm.cuh: enable_sm120_only → enable_sm120_family
csrc/quantization/w8a8/cutlass/c3x/scaled_mm_sm120_fp8_dispatch.cuh: same fix

Tests (FP8 test case generation):

tests/kernels/moe/test_moe.py: get_device_capability() not in [89, 120] → proper is_device_capability(89) / is_device_capability_family(120) API calls
tests/kernels/quantization/test_marlin_gemm.py: same fix

Python-side FP8 input validation:

vllm/model_executor/layers/quantization/utils/marlin_utils.py: is_device_capability(120) → is_device_capability_family(120)

All checks use bounded SM12x family matching (covers SM120/SM121 but won't accidentally match future SM13x).

The enable_sm120_only → enable_sm120_family change in the CUTLASS dispatch headers also resolves the CUTLASS FP4 GEMM failure on SM121 reported in #30163 ("Failed to run cutlass FP4 gemm on sm120. Error: Error Internal"), since enable_sm120_only uses __CUDA_ARCH__ == 1200 which excludes SM121 (__CUDA_ARCH__ == 1210), while enable_sm120_family uses >= 1200 && < 1300.
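
The difference between the exact-match guards and the bounded family guards described in this summary can be modelled in a few lines of Python. This is only a sketch under stated assumptions: `fp8_allowed_exact` and `fp8_allowed_family` are hypothetical names for the old and new codegen conditions, and the two `enable_sm120_*` functions are plain Python stand-ins for the C++ guards, using the `__CUDA_ARCH__` comparisons quoted above rather than the real header code.

```python
def fp8_allowed_exact(arch: int) -> bool:
    """Old codegen condition as quoted: arch in [89, 120]."""
    return arch in [89, 120]


def fp8_allowed_family(arch: int) -> bool:
    """New codegen condition as quoted: SM89 exactly, or any SM12x."""
    return arch == 89 or arch // 10 == 12


def enable_sm120_only(cuda_arch: int) -> bool:
    """Stand-in for the exact-match guard (__CUDA_ARCH__ == 1200)."""
    return cuda_arch == 1200


def enable_sm120_family(cuda_arch: int) -> bool:
    """Stand-in for the bounded guard (>= 1200 && < 1300)."""
    return 1200 <= cuda_arch < 1300


# Compare both capability-level conditions for the architectures
# listed in the PR's validation table.
for arch in (89, 90, 120, 121, 130):
    print(arch, fp8_allowed_exact(arch), fp8_allowed_family(arch))

# Compare both __CUDA_ARCH__-level guards for SM120, SM121, and a future SM130.
for cuda_arch in (1200, 1210, 1300):
    print(cuda_arch, enable_sm120_only(cuda_arch), enable_sm120_family(cuda_arch))
```

In this model SM121 is rejected by both exact-match checks and accepted by both family checks, while SM90 and SM130 stay excluded either way.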

Validation

Tested on DGX Spark (NVIDIA GB10, SM121a / capability 12.1):

Marlin FP4 GEMM (all 5 configs including N=100544): PASS
CUTLASS FP4 dispatch: cutlass_scaled_mm_supports_fp4(121) = True

Capability check logic:

SM89 (Ada):   allowed via exact match ✓
SM90 (Hopper): blocked ✓
SM120 (RTX 5090): allowed ✓
SM121 (DGX Spark): allowed ✓
SM130 (future): not matched ✓

Subsumes #35803. Fixes #35432. Fixes #30163. Relates to #30135.

Contributed by Second Nature Computing (https://joinsecondnature.com)

Test plan

Validated on SM121a hardware (DGX Spark)
Marlin FP4 GEMM passes all 5 test configs
enable_sm120_family verified in common.hpp with correct >= 1200 && < 1300 range guard
is_device_capability_family(120) verified: uses to_int() // 10 == 120 // 10
Pre-commit hooks pass

🤖 Generated with Claude Code

Changed files

csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/scaled_mm.cuh (modified, +1/-1)
csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/scaled_mm_sm120_fp8_dispatch.cuh (modified, +1/-1)
csrc/moe/marlin_moe_wna16/generate_kernels.py (modified, +3/-3)
csrc/moe/marlin_moe_wna16/ops.cu (modified, +2/-2)
csrc/quantization/marlin/generate_kernels.py (modified, +3/-3)
tests/kernels/moe/test_moe.py (modified, +5/-1)
tests/kernels/quantization/test_marlin_gemm.py (modified, +2/-1)
vllm/model_executor/layers/quantization/utils/marlin_utils.py (modified, +2/-2)

RAW_BUFFER

test_fused_marlin_moe_with_bias already uses @pytest.mark.flaky(reruns=2) for similar borderline behavior. The base test should probably match, since the 0.003 margin over tolerance is within hardware variance range.
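
The rerun behaviour proposed here can be illustrated with a standard-library stand-in. This is a sketch only: the `flaky` helper below is hypothetical and merely mimics the idea behind `@pytest.mark.flaky(reruns=2)` (one initial attempt plus up to two reruns). It is not the pytest plugin, and the error values are invented to model run-to-run variance around the tolerance.

```python
import functools


def flaky(reruns):
    """Hypothetical stand-in for a rerun marker: retry on AssertionError."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            attempts = reruns + 1  # initial attempt plus the allowed reruns
            for attempt in range(1, attempts + 1):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError:
                    # Give up only after the last allowed attempt.
                    if attempt == attempts:
                        raise
        return wrapper
    return decorator


calls = {"n": 0}


@flaky(reruns=2)
def borderline_check():
    calls["n"] += 1
    # Illustrative values: the first attempt lands just above atol,
    # later attempts land just below it.
    err = 0.04297 if calls["n"] == 1 else 0.0399
    assert err <= 4e-2, f"max abs error {err} exceeds atol"
    return err


result = borderline_check()
print(f"passed after {calls['n']} attempts with error {result}")
```

A rerun marker of this kind hides a borderline failure only when a later attempt actually lands inside the tolerance; a check that fails on every attempt still fails.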

Observed during CI for #35568. Only surfaces when MOE tests are triggered by csrc/ changes.

extent analysis

TL;DR

Marking test_fused_marlin_moe as flaky with reruns may mitigate the intermittent failure, which the issue attributes to 4-bit weight quantization noise.

Guidance

Consider adding @pytest.mark.flaky(reruns=2) to test_fused_marlin_moe to account for borderline behavior similar to test_fused_marlin_moe_with_bias.
Verify that the error margin is within the expected hardware variance range to determine if the failure is due to quantization noise.
Investigate if the test's fixed seed (torch.cuda.manual_seed(1)) contributes to the reproducibility of the failure.
Review the test's tolerance value (atol=4e-2) to ensure it is suitable for the specific test case.

Example

No code snippet is provided as it is not clearly supported by the issue.

Notes

The solution may not apply if the failure is not due to quantization noise or hardware variance. Further investigation may be needed to determine the root cause of the failure.

Recommendation

Apply workaround: Marking the test as flaky with reruns is a reasonable workaround given the intermittent nature of the failure and the small error margin.
