vllm - ✅(Solved) Fix test_fused_marlin_moe borderline tolerance failure at m=666, K=2048 on L4 [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#39549
Fetched 2026-04-11 06:12:49
Comments: 0
Participants: 1
Timeline: 1
Reactions: 0
Timeline (top): cross-referenced ×1

Error Message

test_fused_marlin_moe fails on L4 (SM89) at m=666, K=2048 with max absolute error 0.04297 vs atol=4e-2. Failures are 1-4 elements out of 1.4M (0.0003%).

  • The error is inherent 4-bit weight quantization noise accumulated over K=2048 at the test's fixed seed (torch.cuda.manual_seed(1))
  • The error varies by L4 instance — PR #39024 passed all MOE tests on a different L4
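The size of the miss can be checked directly from the figures quoted above. The sketch below is only arithmetic on those numbers; it treats "1.4M" as exactly 1,400,000 elements and uses the upper end of the "1-4 elements" range, both of which are assumptions made for illustration.

```python
# Figures quoted in the issue text.
max_abs_error = 0.04297
atol = 4e-2

# Assumptions for illustration: upper end of "1-4 elements", and "1.4M" taken as exact.
failing_elements = 4
total_elements = 1_400_000

# How far the worst element lands beyond the absolute tolerance.
margin = max_abs_error - atol

# Share of output elements that violate the tolerance, as a percentage.
failing_percent = 100.0 * failing_elements / total_elements

print(f"margin over atol: {margin:.5f}")
print(f"failing share:    {failing_percent:.4f}%")
```

The margin comes out at roughly 0.003, which is the figure the issue later cites when arguing that the failure sits within hardware variance.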

Fix Action

Fix / Workaround

Investigation:

  • FP32 accumulation is used on SM80+ (use_fp16_accum = false) — not an accumulation precision issue
  • Partial block handling is correct: predicated loads skip padding rows, output writes are guarded by row < block_num_valid_tokens
  • The default_vllm_config fixture does not change Marlin kernel dispatch — moe_wna16_marlin_gemm is a direct C++ call with no config-dependent paths
  • The error is inherent 4-bit weight quantization noise accumulated over K=2048 at the test's fixed seed (torch.cuda.manual_seed(1))
  • m=1 and m=123 pass at the same K=2048. Only m=666 (chosen as a non-aligned stress test) is borderline
  • The error varies by L4 instance — PR #39024 passed all MOE tests on a different L4
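To make the quantization-noise point concrete, here is a minimal, CPU-only sketch of how rounding weights to 4-bit levels perturbs a length-2048 dot product, and how the size of that perturbation depends on the random seed. It is not the Marlin kernel or the vLLM test: the symmetric per-tensor scheme, the uniform value ranges, and the names `quantize_4bit` and `dot_product_error` are all assumptions chosen for illustration, and the magnitudes it prints are not comparable to the 0.04297 reported above.

```python
import random

# Assumed scale: maps weights in [-1, 1] onto integer levels -7..7.
SCALE = 1.0 / 7.0


def quantize_4bit(w, scale=SCALE):
    """Round w to the nearest of 16 signed levels (-8..7) and scale back."""
    level = max(-8, min(7, round(w / scale)))
    return level * scale


def dot_product_error(seed, k=2048):
    """Absolute error of a length-k dot product when the weights are quantized."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    activations = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    exact = sum(a * w for a, w in zip(activations, weights))
    quantized = sum(a * quantize_4bit(w) for a, w in zip(activations, weights))
    return abs(exact - quantized)


# Same K, different seeds: the accumulated error changes from draw to draw.
for seed in (1, 2, 3):
    print(seed, dot_product_error(seed))
```

Each weight is off by at most half a quantization step, but the accumulated error over K terms depends on the particular values drawn, which is consistent with one fixed seed and shape landing just over a tolerance while others stay under it.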

PR fix notes

PR #35568: [Bugfix] Fix SM121 (DGX Spark) exclusion from Marlin/CUTLASS FP8 paths

Description (problem / solution / changelog)

Summary

SM121 (DGX Spark GB10) shares the same FP8 MMA capabilities as SM120 (RTX 5090) — both support native mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32. However, SM121 is excluded from all Marlin and CUTLASS FP8 codepaths by exact-match arch guards (== 120, in [89, 120], enable_sm120_only).

This fixes 8 locations across codegen, runtime, dispatch, and tests using bounded SM12x family checks (arch // 10 == 12, major_capability == 12, enable_sm120_family, is_device_capability_family(120)):

Codegen (FP8 kernel template generation):

  • csrc/quantization/marlin/generate_kernels.py: arch in [89, 120] → arch == 89 or arch // 10 == 12
  • csrc/moe/marlin_moe_wna16/generate_kernels.py: same fix

Runtime (FP8 activation gate):

  • csrc/moe/marlin_moe_wna16/ops.cu: == 120 → major_capability == 12

CUTLASS FP8 dispatch (kernel wrapper):

  • csrc/quantization/w8a8/cutlass/c3x/scaled_mm.cuh: enable_sm120_only → enable_sm120_family
  • csrc/quantization/w8a8/cutlass/c3x/scaled_mm_sm120_fp8_dispatch.cuh: same fix

Tests (FP8 test case generation):

  • tests/kernels/moe/test_moe.py: get_device_capability() not in [89, 120] → proper is_device_capability(89) / is_device_capability_family(120) API calls
  • tests/kernels/quantization/test_marlin_gemm.py: same fix

Python-side FP8 input validation:

  • vllm/model_executor/layers/quantization/utils/marlin_utils.py: is_device_capability(120) → is_device_capability_family(120)

All checks use bounded SM12x family matching (covers SM120/SM121 but won't accidentally match future SM13x).

The enable_sm120_only → enable_sm120_family change in the CUTLASS dispatch headers also resolves the CUTLASS FP4 GEMM failure on SM121 reported in #30163 ("Failed to run cutlass FP4 gemm on sm120. Error: Error Internal"), since enable_sm120_only uses __CUDA_ARCH__ == 1200 which excludes SM121 (__CUDA_ARCH__ == 1210), while enable_sm120_family uses >= 1200 && < 1300.
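The difference between the two guards can be modeled in a few lines. The functions below are Python stand-ins written for illustration only: the real guards are C++ wrappers keyed on `__CUDA_ARCH__`, and the sketch reproduces nothing beyond the two conditions quoted in the paragraph above.

```python
def enable_sm120_only(cuda_arch: int) -> bool:
    """Model of the exact-match guard: __CUDA_ARCH__ == 1200."""
    return cuda_arch == 1200


def enable_sm120_family(cuda_arch: int) -> bool:
    """Model of the bounded family guard: __CUDA_ARCH__ >= 1200 && < 1300."""
    return 1200 <= cuda_arch < 1300


# 1200 is SM120, 1210 is SM121, 1300 stands for a future SM130.
for cuda_arch in (1200, 1210, 1300):
    print(cuda_arch, enable_sm120_only(cuda_arch), enable_sm120_family(cuda_arch))
```

Under the exact-match guard the SM121 value 1210 is rejected, while the bounded range accepts it and still rejects 1300.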

Validation

Tested on DGX Spark (NVIDIA GB10, SM121a / capability 12.1):

Marlin FP4 GEMM (all 5 configs including N=100544): PASS
CUTLASS FP4 dispatch: cutlass_scaled_mm_supports_fp4(121) = True
Capability check logic:

SM89 (Ada):   allowed via exact match ✓
SM90 (Hopper): blocked ✓
SM120 (RTX 5090): allowed ✓
SM121 (DGX Spark): allowed ✓
SM130 (future): not matched ✓
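The table above follows from the codegen predicate quoted earlier (`arch == 89 or arch // 10 == 12`). A small sketch that evaluates it for the same five architectures; the function name `fp8_marlin_allowed` is hypothetical and does not appear in the PR.

```python
def fp8_marlin_allowed(arch: int) -> bool:
    """Hypothetical helper: exact match for SM89, bounded family match for SM12x."""
    return arch == 89 or arch // 10 == 12


# Same architectures as the capability table above.
for arch in (89, 90, 120, 121, 130):
    status = "allowed" if fp8_marlin_allowed(arch) else "blocked"
    print(f"SM{arch}: {status}")
```

Integer division by 10 drops the minor version, so 120 and 121 both map to family 12 while 130 maps to 13 and is not matched.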

Subsumes #35803. Fixes #35432. Fixes #30163. Relates to #30135.

Contributed by Second Nature Computing (https://joinsecondnature.com)

Test plan

  • Validated on SM121a hardware (DGX Spark)
  • Marlin FP4 GEMM passes all 5 test configs
  • enable_sm120_family verified in common.hpp with correct >= 1200 && < 1300 range guard
  • is_device_capability_family(120) verified: uses to_int() // 10 == 120 // 10
  • Pre-commit hooks pass

🤖 Generated with Claude Code

Changed files

  • csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/scaled_mm.cuh (modified, +1/-1)
  • csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/scaled_mm_sm120_fp8_dispatch.cuh (modified, +1/-1)
  • csrc/moe/marlin_moe_wna16/generate_kernels.py (modified, +3/-3)
  • csrc/moe/marlin_moe_wna16/ops.cu (modified, +2/-2)
  • csrc/quantization/marlin/generate_kernels.py (modified, +3/-3)
  • tests/kernels/moe/test_moe.py (modified, +5/-1)
  • tests/kernels/quantization/test_marlin_gemm.py (modified, +2/-1)
  • vllm/model_executor/layers/quantization/utils/marlin_utils.py (modified, +2/-2)
RAW_BUFFER

test_fused_marlin_moe_with_bias already uses @pytest.mark.flaky(reruns=2) for similar borderline behavior. The base test should probably match, since the 0.003 margin over tolerance is within hardware variance range.
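In the test suite itself the change would amount to adding the same `@pytest.mark.flaky(reruns=2)` decorator to the base test. What such a rerun policy does can be sketched without pytest. The decorator `rerun_on_failure` below is a hypothetical stand-in written for illustration, not the plugin's implementation, and the second measurement (0.0390) is an invented value used only to show a passing retry.

```python
import functools


def rerun_on_failure(reruns):
    """Hypothetical stand-in for a rerun marker: retry up to `reruns` extra times."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(reruns + 1):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError:
                    # Out of retries: surface the last failure.
                    if attempt == reruns:
                        raise
        return wrapper
    return decorator


# Simulated borderline results: the first run lands just over atol, the second under.
# 0.04297 is from the issue; 0.0390 is an invented illustrative value.
_measurements = iter([0.04297, 0.0390])


@rerun_on_failure(reruns=2)
def borderline_check(atol=4e-2):
    max_abs_error = next(_measurements)
    assert max_abs_error <= atol, f"{max_abs_error} exceeds atol={atol}"
    return max_abs_error


result = borderline_check()
print("passed with max_abs_error =", result)
```

A rerun policy only helps when the measured error actually differs between attempts. If a given machine reproduces the same value on every run with the fixed seed, retries on that machine will keep failing and the tolerance itself would need review.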

Observed during CI for #35568. Only surfaces when MOE tests are triggered by csrc/ changes.

extent analysis

TL;DR

Marking test_fused_marlin_moe as flaky with reruns may mitigate the intermittent failure, which the issue attributes to 4-bit weight quantization noise landing just over the tolerance.

Guidance

  • Consider adding @pytest.mark.flaky(reruns=2) to test_fused_marlin_moe to account for borderline behavior similar to test_fused_marlin_moe_with_bias.
  • Verify that the error margin is within the expected hardware variance range to determine if the failure is due to quantization noise.
  • Investigate if the test's fixed seed (torch.cuda.manual_seed(1)) contributes to the reproducibility of the failure.
  • Review the test's tolerance value (atol=4e-2) to ensure it is suitable for the specific test case.

Example

No code snippet is provided as it is not clearly supported by the issue.

Notes

The solution may not apply if the failure is not due to quantization noise or hardware variance. Further investigation may be needed to determine the root cause of the failure.

Recommendation

Apply workaround: Marking the test as flaky with reruns is a reasonable workaround given the intermittent nature of the failure and the small error margin.
