vllm - 💡(How to fix) Fix [CI Failure]: `Qwen3.6-35B-A3B-FP8` fails on `NVIDIA GB10` with `cutlass_scaled_mm` / `cutlass_gemm_caller Error Internal` under vLLM nightly + CUDA 13.0 [1 participants]

vllm · 2026-04-24 02:45:59


GitHub stats

vllm-project/vllm#40758 • Fetched 2026-04-24 10:36:25

Author

amuin-2hz

Participants

amuin-2hz

Timeline (top)

added_to_project_v2 ×1, labeled ×1

Error Message

RuntimeError: cutlass_gemm_caller, /workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/cutlass_gemm_caller.cuh:61, Error Internal

Root Cause

Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)

Fix Action

Fix / Workaround

The issue asks whether a known workaround exists for the cutlass_scaled_mm / cutlass_gemm_caller Error Internal failure on GB10; none is confirmed in the issue text.

Code Example

--model /models/Qwen3.6-35B-A3B-FP8
--served-model-name Qwen3.6-35B-A3B-FP8
--gpu-memory-utilization 0.7
--max-model-len 4096
--enforce-eager

---

RuntimeError: cutlass_gemm_caller, /workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/cutlass_gemm_caller.cuh:61, Error Internal

---

torch.ops._C.cutlass_scaled_mm.default(...)


Name of failing test

Qwen3.6-35B-A3B-FP8 on an NVIDIA GB10 system

Basic information

Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

Summary

I am testing Qwen3.6-35B-A3B-FP8 on an NVIDIA GB10 system with:

NVIDIA-SMI 580.142
Driver Version: 580.142
CUDA Version: 13.0

I can reproduce a startup failure in vLLM when launching the FP8 model through the OpenAI server.

The failure happens during engine initialization / profile run and crashes inside:

torch.ops._C.cutlass_scaled_mm.default(...)
cutlass_gemm_caller ... Error Internal

📝 History of failing test

Environment

Hardware: NVIDIA GB10
Driver: 580.142
CUDA: 13.0
Image: vllm/vllm-openai:nightly
vLLM in container log:
- 0.19.2rc1.dev134+gfe9c3d6c5
Host Python env:
- torch 2.11.0+cu130
- vllm 0.19.2rc1.dev142+g4a79262e0
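
The container log and the host environment list different vLLM dev builds, so it helps to confirm which commit a given traceback came from. The following is a minimal sketch, assuming the version strings follow the setuptools-scm style `<base>.dev<N>+g<short-hash>`; the helper names `parse_dev_version` and `same_build` are hypothetical and not part of vLLM.

```python
import re

# Assumed layout (setuptools-scm style): "<base>.dev<N>+g<short-hash>",
# e.g. "0.19.2rc1.dev134+gfe9c3d6c5".
_DEV_VERSION = re.compile(
    r"^(?P<base>[0-9A-Za-z.]+?)\.dev(?P<distance>\d+)\+g(?P<commit>[0-9a-f]+)$"
)


def parse_dev_version(version: str) -> dict:
    """Split a dev version string into base version, commit distance, and short hash."""
    match = _DEV_VERSION.match(version.strip())
    if match is None:
        raise ValueError(f"not a dev build version: {version!r}")
    return {
        "base": match["base"],
        "distance": int(match["distance"]),
        "commit": match["commit"],
    }


def same_build(a: str, b: str) -> bool:
    """Return True only when both version strings point at the same commit."""
    return parse_dev_version(a)["commit"] == parse_dev_version(b)["commit"]


# The two builds reported in the Environment section.
container = parse_dev_version("0.19.2rc1.dev134+gfe9c3d6c5")
host = parse_dev_version("0.19.2rc1.dev142+g4a79262e0")
print(container["commit"], host["commit"])
```

Because the two hashes differ, a traceback from the container and one from the host are not evidence about the same build.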

Model

Qwen3.6-35B-A3B-FP8

Launch args

Current relevant launch args:

--model /models/Qwen3.6-35B-A3B-FP8
--served-model-name Qwen3.6-35B-A3B-FP8
--gpu-memory-utilization 0.7
--max-model-len 4096
--enforce-eager

Before adding --enforce-eager, the crash also went through:

vllm/compilation/cuda_graph.py
torch/_inductor
cutlass_scaled_mm

With --enforce-eager, vLLM reports that torch.compile and CUDAGraph are disabled, which avoids the old path, but I am still validating whether the model can fully come up in this mode.
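
To compare the two modes described above, the launch command can be built once with `--enforce-eager` and once without. This is a sketch only: it assumes the server is started through the `vllm.entrypoints.openai.api_server` module, and the helper name `build_server_argv` is hypothetical. The flag values are the ones listed in the launch args.

```python
import sys


def build_server_argv(model_path: str, served_name: str, enforce_eager: bool) -> list[str]:
    """Assemble the server command line; nothing is launched here."""
    argv = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",  # assumed entry point
        "--model", model_path,
        "--served-model-name", served_name,
        "--gpu-memory-utilization", "0.7",
        "--max-model-len", "4096",
    ]
    if enforce_eager:
        # Per the issue, this disables torch.compile and CUDAGraph.
        argv.append("--enforce-eager")
    return argv


eager = build_server_argv("/models/Qwen3.6-35B-A3B-FP8", "Qwen3.6-35B-A3B-FP8", True)
compiled = build_server_argv("/models/Qwen3.6-35B-A3B-FP8", "Qwen3.6-35B-A3B-FP8", False)
print(" ".join(eager))
```

Either list can be passed to `subprocess.run` with the output captured, so the two tracebacks can be compared side by side.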

Error

Relevant traceback:

RuntimeError: cutlass_gemm_caller, /workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/cutlass_gemm_caller.cuh:61, Error Internal

And the call site is:

torch.ops._C.cutlass_scaled_mm.default(...)
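
When collecting logs from several runs, it can be useful to pull the structured parts out of the error line shown above. The sketch below parses only the message format seen in this traceback; the names `CutlassFailure` and `parse_cutlass_failure` are hypothetical. Reading the `c3x` path segment as vLLM's CUTLASS 3.x kernel directory is an assumption, not something the issue states.

```python
import re
from typing import NamedTuple, Optional


class CutlassFailure(NamedTuple):
    caller: str       # function named in the message
    source_file: str  # kernel source path inside the image
    line: int         # line number in that file
    status: str       # status text reported by the kernel wrapper


# Matches lines shaped like:
# "RuntimeError: <caller>, <path>:<line>, <status>"
_PATTERN = re.compile(
    r"RuntimeError: (?P<caller>\w+), (?P<file>\S+):(?P<line>\d+), (?P<status>.+)$"
)


def parse_cutlass_failure(text: str) -> Optional[CutlassFailure]:
    """Return the first matching error line in a log, or None if there is none."""
    for raw in text.splitlines():
        match = _PATTERN.search(raw.strip())
        if match:
            return CutlassFailure(
                match["caller"], match["file"], int(match["line"]), match["status"]
            )
    return None


log = (
    "RuntimeError: cutlass_gemm_caller, "
    "/workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/"
    "cutlass_gemm_caller.cuh:61, Error Internal"
)
failure = parse_cutlass_failure(log)
# Assumption: a "/c3x/" path segment marks the CUTLASS 3.x kernel family.
uses_c3x = failure is not None and "/c3x/" in failure.source_file
print(failure, uses_c3x)
```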

CC List.

What I already checked

Switched from vllm/vllm-openai:latest to vllm/vllm-openai:nightly
Upgraded host environment to:
- torch 2.11.0+cu130
- vllm 0.19.2rc1.dev142+...
Confirmed this is not only an old-image issue
Confirmed this is not the earlier KV-cache sizing failure
Confirmed the FP8 path specifically is involved

Question

Is Qwen3.6-35B-A3B-FP8 on GB10 / CUDA 13.0 currently expected to work in vLLM nightly?

If yes, is there a known workaround for the cutlass_scaled_mm / cutlass_gemm_caller Error Internal failure on GB10?

Possible things I would like guidance on:

recommended nightly image / commit for GB10
required environment variables or flags
whether FP8 currently requires disabling a specific backend
whether this is a known CUTLASS / torch / vLLM issue on sm_121
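
Since the last question depends on the GPU architecture, it is worth confirming what the runtime actually reports. This is a minimal sketch, assuming PyTorch is installed in the same environment as vLLM; `arch_name` and `detect_capability` are hypothetical helper names, and `torch.cuda.get_device_capability` is the only library call used.

```python
from typing import Optional, Tuple


def arch_name(capability: Tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as an sm_XY architecture name."""
    major, minor = capability
    return f"sm_{major}{minor}"


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of device 0, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = detect_capability()
print(arch_name(capability) if capability else "no CUDA device visible")
```

If the issue's reference to sm_121 is accurate, the GB10 system would be expected to report that architecture here; a different value would suggest the container sees the device differently from the host.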

extent analysis

TL;DR

No confirmed fix or workaround is recorded for the cutlass_scaled_mm / cutlass_gemm_caller Error Internal failure on GB10. The evidence points to a compatibility gap in the FP8 scaled-matmul path between vLLM nightly, GB10 (sm_121), and CUDA 13.0; narrowing it down likely requires testing specific image commits, environment variables or flags, and alternative FP8 backends.

Guidance

Check the vLLM documentation and release notes for known issues or limitations affecting FP8 models such as Qwen3.6-35B-A3B-FP8 on GB10 with CUDA 13.0.
Compare runs with and without --enforce-eager. The flag disables torch.compile and CUDAGraph, which takes vllm/compilation/cuda_graph.py and torch/_inductor out of the crash path; the issue has not yet confirmed whether the model starts fully in this mode.
Look for environment variables or flags recommended for GB10 with CUDA 13.0, which may be specific to the nightly image or commit in use.
Test other vllm/vllm-openai tags or specific commits to see whether the failure persists. Switching from latest to nightly did not resolve it, so the problem is not limited to an old image.

Example

No specific code snippet can be provided without further details on the exact implementation or requirements of the Qwen3.6-35B-A3B-FP8 model and its interaction with the GB10 hardware and CUDA 13.0.

Notes

The outcome may depend on the exact vLLM, torch, and CUDA versions in use, and an incompatibility among them could be the root cause. The container log and the host environment report different vLLM builds (0.19.2rc1.dev134+gfe9c3d6c5 and 0.19.2rc1.dev142+g4a79262e0), so record which build produced a given traceback.

Recommendation

Apply a workaround


#pipeline error #runtime error #dependency conflict #environment setup #environment variable
