vllm: [Bug] MarlinFP8 kernel silently skips weight transpose for square matrices (N==K), corrupting FP8 inference on sm_75–sm_88 GPUs

Fix Action

Fixed

Environment

  • GPU: sm_75–sm_88 (e.g. A10, A40, RTX A6000) — any GPU where supports_fp8() returns False
  • Model: any FP8 model with square weight matrices, e.g. ibm-granite/granite-4.1-8b-fp8
  • Quantization: compressed-tensors or modelopt FP8 with channel/tensor-wise scales

Symptoms

FP8 models produce garbage output (repeated punctuation, incoherent tokens) on GPUs lacking native FP8 compute, while the identical non-FP8 model works correctly on the same hardware. No errors, no NaN — silent data corruption.

# FP8 model on A40 (sm_86)
OUTPUT: ',,,,,,,,,,,,,,,,,,,,'

# BF16 model on A40 (sm_86)
OUTPUT: 'The capital of France is Paris...'

Root Cause

MarlinFP8ScaledMMLinearKernel.process_weights_after_loading uses a shape comparison to decide whether to transpose the weight to (K, N) before passing it to the Marlin kernel:

# vllm/model_executor/kernels/linear/scaled_mm/marlin.py
if w_q.shape != (layer.input_size_per_partition, layer.output_size_per_partition):
    replace_parameter(layer, "weight", w_q.t())

For square matrices where N == K (e.g. q_proj and o_proj at hidden_dim × hidden_dim = 4096×4096), the tuples (N, K) and (K, N) are identical, so the condition always evaluates to False and the transpose is silently skipped. The weight stays in (N, K) checkpoint orientation while Marlin requires (K, N), producing completely wrong GEMM results.
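The ambiguity can be reproduced without any GPU, since it comes purely from comparing shape tuples. The sketch below mirrors the quoted condition in a standalone function, then shows one possible alternative that carries the orientation explicitly. The `layout` tag and both function names are hypothetical illustrations, not vLLM APIs, and the merged fix may take a different approach.

```python
def needs_transpose_by_shape(weight_shape, k, n):
    """Mirror of the shape-based check: transpose unless the shape is already (K, N)."""
    return tuple(weight_shape) != (k, n)


def needs_transpose_by_layout(layout):
    """Hypothetical alternative: decide from an explicit orientation tag.

    "NK" means checkpoint orientation (N, K); "KN" means already Marlin-ready.
    """
    if layout not in ("NK", "KN"):
        raise ValueError(f"unknown layout tag: {layout!r}")
    return layout == "NK"


# Non-square gate_proj-like weight stored as (N, K) = (12800, 4096):
# the tuple differs from (K, N), so a transpose is requested.
print(needs_transpose_by_shape((12800, 4096), k=4096, n=12800))  # True

# Square q_proj-like weight stored as (N, K) = (4096, 4096):
# (N, K) and (K, N) are the same tuple, so the transpose is skipped.
print(needs_transpose_by_shape((4096, 4096), k=4096, n=4096))  # False

# With an explicit tag, the square case is no longer ambiguous.
print(needs_transpose_by_layout("NK"))  # True
```

The point of the second function is only that the decision should not depend on information that disappears when N == K.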

In dense models where num_heads × head_dim equals hidden_dim, q_proj and o_proj are square. Their corrupted outputs overwhelm the residual stream and cascade through all layers.

Why it escaped CI

  • Most test shapes are non-square (e.g. MLP gate_proj at 12800×4096), and those were always transposed correctly.
  • sm_89+ GPUs (H100, H200) use the native FP8 GEMM path and never reach the Marlin branch.
  • Isolated Marlin kernel tests all pass — the kernel math is correct; only the input weight layout is wrong.
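A layout-level regression check that includes a square, non-symmetric weight would have caught this. The sketch below is a pure-Python stand-in (lists of rows instead of tensors, no Marlin kernel involved), so every helper name here is an assumption made for illustration rather than part of vLLM's test suite.

```python
def matmul(a, b):
    """Plain-Python product of a (M x K) and b (K x N), both given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def linear_reference(x, w_nk):
    """Reference result y = x @ W^T for a weight stored in checkpoint orientation (N, K)."""
    return matmul(x, transpose(w_nk))


def linear_with_shape_check(x, w_nk, k, n):
    """Apply the shape-based transpose decision, then multiply as if the weight were (K, N)."""
    shape = (len(w_nk), len(w_nk[0]))
    w = transpose(w_nk) if shape != (k, n) else w_nk
    return matmul(x, w)


x = [[1, 2]]                        # one activation row, K = 2
w_rect = [[1, 2], [3, 4], [5, 6]]   # N = 3, K = 2 (non-square)
w_square = [[1, 2], [3, 4]]         # N = K = 2, deliberately non-symmetric

# Non-square case: the shape check triggers the transpose and matches the reference.
print(linear_with_shape_check(x, w_rect, k=2, n=3) == linear_reference(x, w_rect))  # True

# Square case: no exception is raised, but the result silently differs.
print(linear_with_shape_check(x, w_square, k=2, n=2))  # [[7, 10]]
print(linear_reference(x, w_square))                   # [[5, 11]]
```

The square weight has to be non-symmetric, because a symmetric matrix equals its own transpose and would hide the bug.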

Affected Code Path

On sm_75–sm_88, FP8 models fall back through:

  • CompressedTensors: CompressedTensorsW8A8Fp8 (needs sm_89) → CompressedTensorsW8A16Fp8 → MarlinFP8ScaledMMLinearKernel
  • ModelOpt: ModelOptFp8LinearMethod / ModelOptFp8PcPtLinearMethod → same MarlinFP8ScaledMMLinearKernel
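To check whether a given GPU lands on the affected branch, the capability split described in this report can be sketched as a small function. The function name and return labels are hypothetical and this is not vLLM's dispatch code. The input is a (major, minor) tuple of the kind returned by `torch.cuda.get_device_capability()`, and the handling of devices below sm_75 is an assumption, since the report does not cover them.

```python
def select_fp8_linear_path(capability):
    """Return which FP8 linear path a device would take, per the ranges in this report.

    capability: (major, minor) compute capability, e.g. (8, 6) for an A40.
    """
    major, minor = capability
    sm = major * 10 + minor
    if sm >= 89:
        # Native FP8 GEMM path; the Marlin branch is never reached.
        return "native_fp8"
    if sm >= 75:
        # Weight-only FP8 fallback through the Marlin kernel (the affected path).
        return "marlin_w8a16"
    # Assumption: devices below sm_75 are outside the scope of this report.
    raise ValueError(f"sm_{sm} is not covered by this sketch")


print(select_fp8_linear_path((8, 6)))  # marlin_w8a16
print(select_fp8_linear_path((9, 0)))  # native_fp8
```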

Related

  • PR #38092 — introduced the regression
  • Issue #33314 — layout canonicalization (referenced in the TODO comment in the affected code)
