vllm: [Bug]: --quantization fp8 fails on Qwen3.5 hybrid (qwen3_next gated delta net) with cutlass_scaled_mm Error Internal on GB10 sm_121 [1 comment, 2 participants]

vllm-project/vllm#40934, fetched 2026-04-27 05:29:13

Loading a Qwen3.5 hybrid (Qwen3_5ForConditionalGeneration / qwen3_next) model with --quantization fp8 fails during profile_run with RuntimeError: Error Internal from torch.ops._C.cutlass_scaled_mm. The same model loads and serves cleanly at --dtype bfloat16 without --quantization. The crash reproduces on a fine-tuned 9B variant of this architecture; I haven't been able to test the public weights yet because of model size.

The failure originates in vllm/_custom_ops.py:845, cutlass_scaled_mm — a generic "Error Internal" assert from the cutlass kernel itself, with no further detail. The traceback site differs depending on whether the model has populated vision weights, but the failing op is the same in both cases.



Environment

  • vLLM: 0.18.0
  • Hardware: NVIDIA GB10 (DGX Spark), compute capability 12.1 (sm_121)
  • PyTorch: 2.10.0+cu130, CUDA 13.0
  • transformers: 5.5.4
  • compressed-tensors: 0.13.0
  • OS: Ubuntu 24.04 (aarch64)
  • Driver / runtime: cu130 with LD_LIBRARY_PATH=/path/to/cuda12-compat for legacy soname symlinks (this is the documented setup pattern for GB10 + cu130 right now)

Repro

vllm serve /path/to/Qwen3.5-9B-fine-tuned \
  --host 0.0.0.0 --port 9003 \
  --served-model-name qwen-alfred-router \
  --enforce-eager \
  --dtype bfloat16 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --skip-mm-profiling \
  --gpu-memory-utilization 0.2 \
  --max-model-len 8192

--quantization fp8 alone (without --kv-cache-dtype or --skip-mm-profiling) reproduces it identically.

Removing --quantization fp8 (everything else identical, except --gpu-memory-utilization raised to 0.3) boots cleanly and serves traffic.

Behavior

vLLM logs:

INFO ... Selected CutlassFP8ScaledMMLinearKernel for Fp8OnlineLinearMethod
INFO ... Using FLASHINFER attention backend ...
INFO ... Starting to load model ...
ERROR ... RuntimeError: Error Internal
RuntimeError: Engine core initialization failed.

Traceback A — vision tower (with all-zero vision weights)

The model has injected zero-init visual.* weights (it is a text-only fine-tune of a multimodal base) so that vLLM's strict weight-completeness check passes. With FP8 dynamic quantization on, warmup hits:

File ".../vllm/v1/worker/gpu_worker.py", line 388, in determine_available_memory
    self.model_runner.profile_run()
File ".../vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
    hidden_states, last_hidden_states = self._dummy_run(...)
File ".../vllm/model_executor/models/qwen2_5_vl.py", line 372, in forward
    x, _ = self.qkv(x)
File ".../vllm/model_executor/layers/linear.py", line 582, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
File ".../vllm/model_executor/layers/quantization/fp8.py", line 516, in apply
    return self.fp8_linear.apply_weights(layer, x, bias)
File ".../vllm/model_executor/kernels/linear/scaled_mm/ScaledMMLinearKernel.py", line 148, in apply_weights
File ".../vllm/model_executor/kernels/linear/scaled_mm/cutlass.py", line 170, in apply_scaled_mm
    output = ops.cutlass_scaled_mm(...)
File ".../vllm/_custom_ops.py", line 845, in cutlass_scaled_mm
    torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
RuntimeError: Error Internal

Initially I thought this was a degenerate-FP8-scale issue from the all-zero vision weights. So I re-ran the inject script with small random init (0.02 * randn) instead of zeros to give the per-tensor scales something non-degenerate to work with.
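
To make the degenerate-scale suspicion concrete: a per-tensor FP8 scale is commonly computed as max(|w|) divided by the largest finite E4M3 value (448), so an all-zero tensor yields a scale of exactly 0. The sketch below is plain Python and is not vLLM's quantization code; the helper names and the eps threshold are hypothetical, and whether vLLM clamps the scale is not established here.

```python
# Sketch only: illustrates why all-zero weights give a degenerate per-tensor
# FP8 scale. This is NOT vLLM's quantization code.

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 E4M3 (e4m3fn)


def per_tensor_scale(weights):
    """Return amax(|w|) / FP8_E4M3_MAX, a common per-tensor scale definition."""
    amax = max((abs(w) for w in weights), default=0.0)
    return amax / FP8_E4M3_MAX


def is_degenerate(scale, eps=1e-12):
    """A zero (or near-zero) scale cannot be divided by safely. eps is hypothetical."""
    return scale < eps


zero_init = [0.0] * 8                    # stands in for zero-init visual.* weights
small_init = [0.02, -0.01, 0.03, -0.04]  # stands in for 0.02 * randn values

zero_scale = per_tensor_scale(zero_init)
small_scale = per_tensor_scale(small_init)

print("zero-init :", zero_scale, is_degenerate(zero_scale))
print("small-init:", small_scale, is_degenerate(small_scale))
```

The zero-init case gives a scale of 0, while the small random init gives a small but usable scale. That matches the observation that switching to random init changed where the crash happens.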

Traceback B — language model in_proj_qkvz (with random-init vision weights)

With non-zero vision weights, warmup gets past the vision tower and crashes inside the language model's gated-delta-net layer:

File ".../vllm/model_executor/models/qwen3_5.py", line 765, in forward
    hidden_states = self.language_model.model(...)
File ".../vllm/model_executor/models/qwen3_next.py", line 1385, in forward
    hidden_states, residual = layer(...)
File ".../vllm/model_executor/models/qwen3_next.py", line 1269, in forward
    self.linear_attn(...)
File ".../vllm/model_executor/models/qwen3_5.py", line 183, in forward
    mixed_qkvz, _ = self.in_proj_qkvz(hidden_states)
File ".../vllm/model_executor/layers/linear.py", line 582, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
File ".../vllm/model_executor/layers/quantization/fp8.py", line 516, in apply
    return self.fp8_linear.apply_weights(layer, x, bias)
File ".../vllm/model_executor/kernels/linear/scaled_mm/cutlass.py", line 170, in apply_scaled_mm
    output = ops.cutlass_scaled_mm(...)
File ".../vllm/_custom_ops.py", line 845, in cutlass_scaled_mm
    torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
RuntimeError: Error Internal

This is Qwen3NextGatedDeltaNet.in_proj_qkvz — a fused 4-output (Q / K / V / Z) projection in the linear-attention branch of the hybrid model. The Z gate makes the output dim 4× hidden rather than the usual 3×, which I suspect is the kernel-side issue.
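
One way to test the shape suspicion is to check whether the fused projection's weight dimensions meet the kernel's alignment requirement. The sketch below assumes a 16-element alignment, which is typical for FP8 tensor-core GEMMs but not confirmed for this kernel on sm_121. The helper names and the sizes are hypothetical and should be replaced with the model's real input and output dimensions.

```python
# Sketch only: hypothetical helpers, not part of vLLM. The 16-element alignment
# is an assumption about the kernel, and the sizes below are made up.

ALIGNMENT = 16  # assumed alignment requirement for the weight dimensions


def fused_out_dim(part_dims):
    """Output width of a fused projection is the sum of its parts (Q, K, V, Z)."""
    return sum(part_dims)


def misaligned_dims(in_dim, out_dim, alignment=ALIGNMENT):
    """Return the names of weight dimensions that are not multiples of `alignment`."""
    bad = []
    if in_dim % alignment:
        bad.append("in_dim")
    if out_dim % alignment:
        bad.append("out_dim")
    return bad


hidden = 4096                             # hypothetical hidden size
qkvz = fused_out_dim([hidden] * 4)        # 4-way fused: Q, K, V, Z
print(qkvz, misaligned_dims(hidden, qkvz))

# A made-up fused width that is not a multiple of 16, for contrast.
odd = fused_out_dim([1000, 1000, 1000, 1008])
print(odd, misaligned_dims(hidden, odd))
```

If the real dimensions pass this check, alignment alone does not explain the failure and the shape hypothesis needs a different mechanism.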

Hypothesis

The cutlass_scaled_mm kernel selected by CutlassFP8ScaledMMLinearKernel doesn't handle the projection output layout used by Qwen3NextGatedDeltaNet.in_proj_qkvz (Q/K/V/Z fused, not Q/K/V). It probably also struggles with degenerate per-tensor scales from the all-zero or very small vision tower weights that are typical of text-only fine-tunes of multimodal bases.

The fact that random-init vision weights moved the crash into the language model — rather than fixing it — is what makes me think this is a layout/shape issue in cutlass kernel selection, not just a numerics issue.

Workarounds I tried

  1. --skip-mm-profiling — does not help. It skips MM memory profiling but the language-model warmup still runs _dummy_run which exercises the same FP8 path on the language model's in_proj_qkvz, hitting Traceback B.
  2. Random-init vision weights instead of zero-init — moves the crash from vision tower to language model (Traceback A → B), confirms vision weights aren't the root cause.
  3. --kv-cache-dtype fp8_e4m3 is independent — the failure happens before KV cache allocation, in determine_available_memory -> profile_run.

What I'd love to know

  1. Is cutlass_scaled_mm known not to handle the in_proj_qkvz layout (4-way fused output)? If yes, can Fp8LinearMethod route this layer through a different kernel, or skip quantization for layers with fused Z gates?
  2. Would a compressed-tensors static FP8 W8A8 model produced by llmcompressor route through the same cutlass_scaled_mm at runtime, or a different kernel? (i.e., is offline calibration likely to dodge this, or hit the same wall?)
  3. Is there a more informative error mode I can enable for cutlass_scaled_mm so the assertion text comes through instead of "Error Internal"?
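
On question 3, a first step that needs no code changes is to rerun the repro with general-purpose debugging variables set. The sketch below assembles such a run. CUDA_LAUNCH_BLOCKING, TORCH_SHOW_CPP_STACKTRACES and VLLM_LOGGING_LEVEL are existing CUDA, PyTorch and vLLM environment variables, but none of them is confirmed to change the CUTLASS status text, so treat this as a way to get more context around the failure rather than a fix. The helper names are hypothetical, and the command is a trimmed form of the repro above (dropping the host, port and served-model-name flags is an assumption that they do not matter).

```python
import os
import shlex

# Assumption: general-purpose debugging variables. None of them is confirmed
# to replace "Error Internal" with a more detailed message.
DEBUG_ENV = {
    "CUDA_LAUNCH_BLOCKING": "1",        # surface CUDA errors at the failing launch
    "TORCH_SHOW_CPP_STACKTRACES": "1",  # include C++ frames in PyTorch errors
    "VLLM_LOGGING_LEVEL": "DEBUG",      # more verbose vLLM logging
}

# Trimmed repro: the report says --kv-cache-dtype and --skip-mm-profiling are
# not needed to reproduce the crash.
REPRO_CMD = [
    "vllm", "serve", "/path/to/Qwen3.5-9B-fine-tuned",
    "--enforce-eager",
    "--dtype", "bfloat16",
    "--quantization", "fp8",
    "--gpu-memory-utilization", "0.2",
    "--max-model-len", "8192",
]


def build_debug_env(base_env):
    """Return a copy of `base_env` with the debug variables set."""
    env = dict(base_env)
    env.update(DEBUG_ENV)
    return env


def as_shell_line(env_overrides, cmd):
    """Render the overrides and the command as one copy-pasteable shell line."""
    assignments = [f"{k}={shlex.quote(v)}" for k, v in sorted(env_overrides.items())]
    return " ".join(assignments + [shlex.join(cmd)])


env = build_debug_env(os.environ)  # could be passed to subprocess.run(..., env=env)
shell_line = as_shell_line(DEBUG_ENV, REPRO_CMD)
print(shell_line)
```

The printed line can be pasted into a shell as-is. The env dict is built separately in case the run is launched from Python instead.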

Happy to gather more diagnostics on this hardware (GB10 / sm_121 is unusual but real).

extent analysis

TL;DR

A candidate fix, not yet verified, is to have Fp8LinearMethod route the in_proj_qkvz layer through a different kernel, or skip quantization for layers with fused Z gates.

Guidance

  • Investigate if cutlass_scaled_mm is known to not handle the in_proj_qkvz layout and if there's an alternative kernel that can be used.
  • Consider modifying the Fp8LinearMethod to skip quantization for layers with fused Z gates, such as Qwen3NextGatedDeltaNet.in_proj_qkvz.
  • Look into enabling a more informative error mode for cutlass_scaled_mm to get a more detailed error message instead of "Error Internal".
  • Verify if using a compressed-tensors static FP8 W8A8 model produced by llmcompressor would route through a different kernel at runtime.

Example

No verified fix is available yet; the issue first requires investigation into the cutlass_scaled_mm kernel and the Fp8LinearMethod implementation.

Notes

Both tracebacks fail in the same op, cutlass_scaled_mm: Traceback A hits it in the vision tower's qkv projection, and Traceback B hits it in Qwen3NextGatedDeltaNet.in_proj_qkvz. Switching to random-init vision weights moved the crash into the language model rather than fixing it, which suggests the problem is not purely numerical. The reporter suspects the layout/shape of the projection output.

Recommendation

A possible workaround, untested here, is to modify Fp8LinearMethod to skip quantization for layers with fused Z gates. Both tracebacks fail inside cutlass_scaled_mm, so leaving the affected layers unquantized may let the model load and serve. Note that Traceback A fails in the vision tower's qkv projection, so skipping only in_proj_qkvz may not be enough.
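
The skip-by-name idea can be sketched as a simple pattern match over module names. Everything here is hypothetical: the patterns, the example module names, and the assumption that the FP8 path can be given such a skip list at all. It only shows how the affected layers could be separated from the rest.

```python
import re

# Sketch only: hypothetical patterns and module names. Whether vLLM's online
# FP8 path accepts a skip list like this is an assumption, not a confirmed feature.
SKIP_PATTERNS = [
    r"\.in_proj_qkvz$",  # fused Q/K/V/Z projection from Traceback B
    r"^visual\.",        # vision tower layers from Traceback A
]


def should_skip_fp8(module_name, patterns=SKIP_PATTERNS):
    """True if the module should stay unquantized under this sketch."""
    return any(re.search(p, module_name) for p in patterns)


def partition(module_names):
    """Split names into (quantized, skipped), preserving the input order."""
    skipped = [n for n in module_names if should_skip_fp8(n)]
    quantized = [n for n in module_names if not should_skip_fp8(n)]
    return quantized, skipped


names = [
    "language_model.model.layers.0.linear_attn.in_proj_qkvz",
    "language_model.model.layers.0.mlp.down_proj",
    "visual.blocks.0.attn.qkv",
]
quantized, skipped = partition(names)
print("quantize:", quantized)
print("skip    :", skipped)
```

Skipping the vision tower as well as in_proj_qkvz covers both crash sites seen in the tracebacks, at the cost of keeping those layers in bfloat16.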
