vllm - 💡(How to fix) Fix [CI Failure]: `Qwen3.6-35B-A3B-FP8` fails on `NVIDIA GB10` with `cutlass_scaled_mm` / `cutlass_gemm_caller Error Internal` under vLLM nightly + CUDA 13.0 [1 participants]

GitHub stats
vllm-project/vllm#40758 (fetched 2026-04-24 10:36:25)

  • Comments: 0
  • Participants: 1
  • Timeline events: 2 (added_to_project_v2 ×1, labeled ×1)
  • Reactions: 0

Error Message

RuntimeError: cutlass_gemm_caller, /workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/cutlass_gemm_caller.cuh:61, Error Internal

Root Cause

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

Fix Action

Fix / Workaround

No fix or workaround is confirmed in the issue. The author's open question: is there a known workaround for the cutlass_scaled_mm / cutlass_gemm_caller Error Internal failure on GB10?

Code Example

--model /models/Qwen3.6-35B-A3B-FP8
--served-model-name Qwen3.6-35B-A3B-FP8
--gpu-memory-utilization 0.7
--max-model-len 4096
--enforce-eager

---

RuntimeError: cutlass_gemm_caller, /workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/cutlass_gemm_caller.cuh:61, Error Internal

---

torch.ops._C.cutlass_scaled_mm.default(...)

Name of failing test

Qwen3.6-35B-A3B-FP8 on an NVIDIA GB10 system

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

Summary

I am testing Qwen3.6-35B-A3B-FP8 on an NVIDIA GB10 system with:

  • NVIDIA-SMI 580.142
  • Driver Version: 580.142
  • CUDA Version: 13.0

I can reproduce a startup failure in vLLM when launching the FP8 model through the OpenAI server.

The failure happens during engine initialization / profile run and crashes inside:

  • torch.ops._C.cutlass_scaled_mm.default(...)
  • cutlass_gemm_caller ... Error Internal

📝 History of failing test

Environment

  • Hardware: NVIDIA GB10
  • Driver: 580.142
  • CUDA: 13.0
  • Image: vllm/vllm-openai:nightly
  • vLLM in container log:
    • 0.19.2rc1.dev134+gfe9c3d6c5
  • Host Python env:
    • torch 2.11.0+cu130
    • vllm 0.19.2rc1.dev142+g4a79262e0
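When reporting an environment like the one above, it can help to capture the GPU name, compute capability, and driver version in one step. The following is a minimal sketch, assuming an `nvidia-smi` recent enough to support the `compute_cap` query field; the helper names `parse_gpu_line` and `query_gpus` are hypothetical and not part of vLLM.

```python
import subprocess

# Fields requested from nvidia-smi. `compute_cap` is assumed to be
# supported by the installed driver; older drivers may reject it.
QUERY_FIELDS = "name,compute_cap,driver_version"


def parse_gpu_line(line: str) -> dict:
    """Parse one line of `nvidia-smi --query-gpu=... --format=csv,noheader`."""
    # Split from the right so a comma inside the GPU name cannot shift fields.
    name, compute_cap, driver = [part.strip() for part in line.rsplit(",", 2)]
    major, minor = compute_cap.split(".")
    return {"name": name, "sm": f"sm_{major}{minor}", "driver": driver}


def query_gpus() -> list:
    """Run nvidia-smi and return one dict per visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Calling `query_gpus()` on the affected machine and pasting the result into the issue makes the architecture explicit. An illustrative line such as `NVIDIA GB10, 12.1, 580.142` (constructed from the values in this report, not captured output) would parse to `sm_121`, the architecture the author asks about.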

Model

  • Qwen3.6-35B-A3B-FP8

Launch args

Current relevant launch args:

--model /models/Qwen3.6-35B-A3B-FP8
--served-model-name Qwen3.6-35B-A3B-FP8
--gpu-memory-utilization 0.7
--max-model-len 4096
--enforce-eager

Before adding --enforce-eager, the crash also went through:

  • vllm/compilation/cuda_graph.py
  • torch/_inductor
  • cutlass_scaled_mm

With --enforce-eager, vLLM reports that torch.compile and CUDAGraph are disabled, which avoids the old path, but I am still validating whether the model can fully come up in this mode.
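To compare the eager and non-eager paths side by side, the launch arguments can be generated from one place. This is a sketch only: the flags are the ones listed in this report, while the function name `build_server_args` and its defaults are assumptions made for illustration.

```python
import shlex


def build_server_args(
    model_path: str,
    served_name: str,
    *,
    enforce_eager: bool = True,
    gpu_memory_utilization: float = 0.7,
    max_model_len: int = 4096,
) -> list:
    """Return the server arguments from the report as a list of strings."""
    args = [
        "--model", model_path,
        "--served-model-name", served_name,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
    ]
    if enforce_eager:
        # Per the report, this disables torch.compile and CUDAGraph.
        args.append("--enforce-eager")
    return args


eager = build_server_args("/models/Qwen3.6-35B-A3B-FP8", "Qwen3.6-35B-A3B-FP8")
compiled = build_server_args(
    "/models/Qwen3.6-35B-A3B-FP8", "Qwen3.6-35B-A3B-FP8", enforce_eager=False
)
print(shlex.join(eager))
print(shlex.join(compiled))
```

Running both variants and recording which frames appear in each traceback separates the compilation path from the FP8 GEMM call itself.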

Error

Relevant traceback:

RuntimeError: cutlass_gemm_caller, /workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/cutlass_gemm_caller.cuh:61, Error Internal

And the call site is:

torch.ops._C.cutlass_scaled_mm.default(...)
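When scanning container logs for this failure, the error line can be split into the reporting function, the source location, and the status text. The sketch below assumes the message keeps the exact shape quoted above; `parse_cutlass_error` is a hypothetical helper, and the interpretation of "Error Internal" as CUTLASS's generic internal-error status is an assumption, not something the issue states.

```python
import re

# Matches lines shaped like the RuntimeError quoted in this report.
ERROR_RE = re.compile(
    r"RuntimeError: (?P<func>\w+), (?P<path>\S+):(?P<line>\d+), "
    r"Error (?P<status>\w+)"
)


def parse_cutlass_error(text: str):
    """Return a dict describing the first matching error, or None."""
    match = ERROR_RE.search(text)
    if match is None:
        return None
    info = match.groupdict()
    info["line"] = int(info["line"])
    return info


message = (
    "RuntimeError: cutlass_gemm_caller, "
    "/workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/"
    "cutlass_gemm_caller.cuh:61, Error Internal"
)
print(parse_cutlass_error(message))
```

The `w8a8/cutlass/c3x` path segment in the parsed location is worth quoting when searching the issue tracker, since it narrows the failure to one kernel family.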

CC List.

What I already checked

  • Switched from vllm/vllm-openai:latest to vllm/vllm-openai:nightly
  • Upgraded host environment to:
    • torch 2.11.0+cu130
    • vllm 0.19.2rc1.dev142+...
  • Confirmed this is not only an old-image issue
  • Confirmed this is not the earlier KV-cache sizing failure
  • Confirmed the FP8 path specifically is involved
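The report lists two different builds: the container log shows 0.19.2rc1.dev134+gfe9c3d6c5, while the host environment has 0.19.2rc1.dev142+g4a79262e0. A quick comparison makes that gap explicit when stating which build actually crashed. This is a sketch that assumes the `<base>.dev<N>+g<commit>` layout seen in those two strings; the helper names are hypothetical.

```python
import re

# Layout assumed from the two version strings quoted in the report.
DEV_RE = re.compile(r"^(?P<base>.+?)\.dev(?P<dev>\d+)\+g(?P<commit>[0-9a-f]+)$")


def parse_dev_version(version: str) -> dict:
    """Split a development version string into base, dev number, and commit."""
    match = DEV_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version layout: {version!r}")
    parts = match.groupdict()
    parts["dev"] = int(parts["dev"])
    return parts


def describe_build_gap(first: str, second: str) -> dict:
    """Summarize how two development builds differ."""
    a, b = parse_dev_version(first), parse_dev_version(second)
    return {
        "same_base": a["base"] == b["base"],
        "dev_delta": b["dev"] - a["dev"],
        "same_commit": a["commit"] == b["commit"],
    }


gap = describe_build_gap(
    "0.19.2rc1.dev134+gfe9c3d6c5",  # container log
    "0.19.2rc1.dev142+g4a79262e0",  # host Python env
)
print(gap)
```

Because the two builds share a base version but differ in commit, a follow-up comment should state whether the traceback came from the container or from the host install.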

Question

Is Qwen3.6-35B-A3B-FP8 on GB10 / CUDA 13.0 currently expected to work in vLLM nightly?

If yes, is there a known workaround for the cutlass_scaled_mm / cutlass_gemm_caller Error Internal failure on GB10?

Possible things I would like guidance on:

  • recommended nightly image / commit for GB10
  • required environment variables or flags
  • whether FP8 currently requires disabling a specific backend
  • whether this is a known CUTLASS / torch / vLLM issue on sm_121

extent analysis

TL;DR

No fix or workaround is confirmed in this issue. The failure occurs in the FP8 cutlass_scaled_mm path (cutlass_gemm_caller ... Error Internal) on GB10 with CUDA 13.0, and it persists on the nightly image. The open questions are whether this combination is currently supported in vLLM nightly and whether specific environment variables, flags, or a different FP8 backend are required.

Guidance

  • Check the vLLM release notes and issue tracker for known limitations of FP8 (cutlass_scaled_mm) on GB10 / sm_121 with CUDA 13.0; the author asks whether this combination is expected to work at all.
  • Compare runs with and without --enforce-eager. The flag disables torch.compile and CUDAGraph, which avoids the earlier crash path through vllm/compilation/cuda_graph.py and torch/_inductor, but the author has not yet confirmed that the model fully starts in this mode.
  • Look for environment variables or flags required on GB10 with CUDA 13.0, possibly specific to the nightly image or commit in use; none are identified in the issue.
  • Pin a specific nightly image or commit when retesting. The author already switched from vllm/vllm-openai:latest to vllm/vllm-openai:nightly and the failure persisted, so an outdated image alone does not explain it.

Example

No verified fix snippet is available. At fetch time the issue had no comments, so any configuration change for Qwen3.6-35B-A3B-FP8 on GB10 with CUDA 13.0 remains untested.

Notes

The solution may depend on the specific configuration and versioning of the software and hardware components involved, including the vllm version, torch version, and CUDA version. Compatibility issues between these components could be the root cause of the failure.

Recommendation

Apply a workaround
