vllm - [Bug]: [SM90][FP8 blockwise] swap_ab path for small/non-multiple-of-4 M fails in can_implement() with kInvalid [2 comments, 1 participant]

vllm-project/vllm#40016 (fetched 2026-04-17 08:27:38)


Your current environment

The output of `python collect_env.py`:
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 4.3.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 | packaged by Anaconda, Inc. | (main, May  6 2024, 19:46:43) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-5.15.0-124-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No devices found.
Nvidia driver version        : Could not collect
cuDNN version                : Probably one of the following:
...
==============================
         vLLM Info
==============================
vLLM Version                 : 0.1.dev15639+g1472223c4.d20260416 (git sha: 1472223c4, date: 20260416)
...
==============================
     Environment Variables
==============================
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root


Note: the attached `collect_env.py` output was captured when the instance was in a no-GPU state, so it reports `Is CUDA available: False` / `No devices found`. The bug itself was reproduced earlier on H800 / SM90 with GPU available; the issue here is not that the code failed to load, but that after the SM90 swap_ab port the failing cases move from the old explicit `m must be divisible by 4` check to CUTLASS `can_implement() -> kInvalid`.

Describe the Bug

This appears to be a separate SM90 FP8 blockwise issue, independent from the decode-side performance investigation in #38697.

On SM90, FP8 blockwise currently hard-fails for cases such as m=1 or m=33 with:

m must be divisible by 4

I tried porting the SM100/SM120-style swap_ab graceful handling to SM90. That removes the original hard check, and the new path is definitely exercised, but all minimal repro cases now fail earlier in CUTLASS can_implement() with:

cutlass_gemm_caller can_implement failed with kInvalid (Invalid status, code=11)

So the issue is no longer the old explicit m % 4 == 0 guard. The failure has moved into the SM90 CUTLASS blockwise implementation itself.

Minimal Repro Cases

All of the following fail consistently in the new SM90 swap_ab path:

  • m=1, n=256, k=128
  • m=1, n=16384, k=1024
  • m=33, n=1024, k=1024
  • m=33, n=8192, k=128

Before the port, these were blocked by the explicit m must be divisible by 4 check.
After the port, they all reach ops.cutlass_scaled_mm(...), but fail in can_implement() with kInvalid.
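
The pre-port behavior for these shapes can be summarized in a small model. This is only a sketch: `old_sm90_guard` is a hypothetical Python stand-in for the explicit host-side check, which in vLLM lives in the C++ dispatch header, and it does not call any real vLLM or CUTLASS API.

```python
# Hypothetical stand-in for the old explicit SM90 host-side check.
# The shapes are the four minimal repro cases listed above.
REPRO_CASES = [
    (1, 256, 128),
    (1, 16384, 1024),
    (33, 1024, 1024),
    (33, 8192, 128),
]


def old_sm90_guard(m: int) -> None:
    """Raise the pre-port error when m is not a multiple of 4."""
    if m % 4 != 0:
        raise ValueError("m must be divisible by 4")


def blocked_by_old_guard(m: int) -> bool:
    """Return True if the old guard would have rejected this m."""
    try:
        old_sm90_guard(m)
    except ValueError:
        return True
    return False


for m, n, k in REPRO_CASES:
    print(f"m={m}, n={n}, k={k}: blocked_by_old_guard={blocked_by_old_guard(m)}")
```

Every repro case has an `m` that is not a multiple of 4, so all four were rejected before reaching the kernel.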

Findings

I compared the SM90 host-side swap_ab argument construction against the SM100/SM120 implementations and did not find an obvious missing host-side field swap.

The following are all swapped consistently:

  • ptr_A/B
  • dA/dB
  • ptr_SFA/SFB
  • layout_SFA/SFB
  • prob_shape
  • c_stride
  • LayoutC/D transpose
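
For readers unfamiliar with swap_ab, the fields above are swapped together because the kernel computes the transposed problem, relying on the identity (A·B)^T = B^T·A^T. The sketch below checks that identity in pure Python. It is illustrative only: the helper names are made up, and it does not model the FP8 scale factors (SFA/SFB), which is where the issue reports the failure.

```python
# Pure-Python check of the identity behind swap_ab: (A @ B)^T == B^T @ A^T.
# Helper names are hypothetical; no scale factors are modeled here.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*x)]


def swap_ab_matmul(a, b):
    """Compute A @ B by solving the swapped problem and transposing back."""
    d_t = matmul(transpose(b), transpose(a))  # swapped problem has shape (n, m)
    return transpose(d_t)


a = [[1, 2, 3]]                # m=1, k=3 (a small-M case)
b = [[1, 0], [0, 1], [2, 2]]   # k=3, n=2
assert swap_ab_matmul(a, b) == matmul(a, b)
```

With the operands swapped, the original `m` becomes the kernel's N dimension, which is why the problem shape, strides, scale pointers, and output layout all have to be swapped consistently.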

The most likely root cause appears to be the scale path, not the ordinary A/B pointer swap.

Under the current SM90 non-array TMA blockwise mainloop, with the swap_ab config:

  • ScaleGranularity = (128, 1, 128)
  • TileShape = (128, 32, 128)

the mainloop compiles into a form where SFB is loaded via TMA. Then can_implement() performs TMA alignment checks on args.layout_SFB.

For these failing cases, layout_SFB ends up with a dynamic stride that depends on the original m. When m=1 or m=33, that stride is not 4-float aligned, so check_alignment<4>(args.layout_SFB) fails and the adapter only reports kInvalid.
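
A simplified model of that alignment check is sketched below. It assumes the dynamic SFB stride equals the original `m`, measured in float elements; the issue only says the stride "depends on" `m`, so the exact relation is an assumption, and `sfb_stride_is_aligned` is a hypothetical helper rather than CUTLASS code. Four 4-byte floats correspond to the 16-byte granularity that TMA requires of global strides.

```python
# Simplified model of check_alignment<4>(args.layout_SFB) as described above.
# Assumption: the dynamic SFB stride equals the original m, in float elements.
FLOAT_BYTES = 4
ALIGN_ELEMS = 4  # the <4> in check_alignment<4>: 4 floats = 16 bytes


def sfb_stride_is_aligned(m: int) -> bool:
    """Return True if an SFB stride of m floats is a multiple of 4 floats."""
    stride_elems = m  # assumed relation between stride and original m
    return stride_elems % ALIGN_ELEMS == 0


for m in (1, 33, 4, 32):
    stride_bytes = m * FLOAT_BYTES
    print(f"m={m}: stride={stride_bytes} bytes, aligned={sfb_stride_is_aligned(m)}")
```

Under this model, `m=1` and `m=33` give strides of 4 and 132 bytes, neither of which is a multiple of 16, which matches the reported `kInvalid` results for the repro cases.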

Why This Looks SM90-Specific

SM100/SM120 already have graceful handling for:

  • m < 16
  • m % 4 != 0

However, they use a different blockwise layout/mainloop family.

The current SM90 path appears to be a non-array TMA blockwise kernel family with stricter scale-layout alignment requirements, so directly porting the SM100/SM120 host-side swap_ab logic is not sufficient.

Conclusion

This no longer looks like test-only busywork or a missing host-side swap field.

It looks like the current SM90 non-array TMA FP8 blockwise path does not support the intended swap_ab semantics for small or non-4-aligned M, because the scale-layout alignment constraints fail in can_implement().

So this likely needs either:

  • a different SM90 kernel family/path for this case, or
  • an explicit statement that SM90 does not support this graceful swap_ab handling yet.

Old vs New Behavior

| m | n | k | Old SM90 behavior | New behavior after swap_ab port |
|---|---|---|---|---|
| 1 | 256 | 128 | m must be divisible by 4 | can_implement -> kInvalid (code=11) |
| 1 | 16384 | 1024 | m must be divisible by 4 | can_implement -> kInvalid (code=11) |
| 33 | 1024 | 1024 | m must be divisible by 4 | can_implement -> kInvalid (code=11) |
| 33 | 8192 | 128 | m must be divisible by 4 | can_implement -> kInvalid (code=11) |

Local Investigation Patch

The investigation patch was applied in:

  • csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/scaled_mm_blockwise_sm90_fp8_dispatch.cuh
  • csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/cutlass_gemm_caller.cuh (for expanded status reporting)
  • tests/kernels/quantization/test_cutlass_scaled_mm.py

Local Logs

I have local logs for the 4 minimal repro cases showing the post-port kInvalid failure path:

  • blockwise_case_1_256_128_status.log
  • blockwise_case_1_16384_1024_status.log
  • blockwise_case_33_1024_1024_status.log
  • blockwise_case_33_8192_128_status.log

I can paste the full traceback if needed.

extent analysis

TL;DR

The SM90 non-array TMA FP8 blockwise path does not support the intended swap_ab semantics for small or non-4-aligned M due to scale-layout alignment constraints, and a different kernel family or path may be needed.

Guidance

  • Investigate alternative SM90 kernel families or paths that can handle small or non-4-aligned M values with the swap_ab semantics.
  • Review the scale-layout alignment requirements for the current SM90 non-array TMA blockwise kernel family to determine if any modifications can be made to support the intended swap_ab behavior.
  • Consider adding explicit statements or documentation to indicate that SM90 does not currently support the swap_ab handling for small or non-4-aligned M values.
  • Examine the local investigation patch and logs to gain a deeper understanding of the issue and potential solutions.

Example

No code snippet is provided as the issue is related to the underlying kernel family and scale-layout alignment constraints, which requires a more in-depth analysis of the CUTLASS implementation.

Notes

The issue appears to be specific to the SM90 architecture and the non-array TMA FP8 blockwise kernel family. The SM100/SM120 architectures have different blockwise layouts and mainloop families, which may not be directly applicable to SM90.

Recommendation

Apply a workaround by using a different kernel family or path for SM90 that can handle small or non-4-aligned M values with the swap_ab semantics, or add explicit statements to indicate the current limitations of SM90 support.
