vllm - [Bug] MarlinFP8 kernel silently skips weight transpose for square matrices (N==K), corrupting FP8 inference on sm_75–sm_88 GPUs

vllm, 2026-05-31 12:11:44

Root Cause

MarlinFP8ScaledMMLinearKernel.process_weights_after_loading uses a shape comparison to decide whether to transpose the weight to (K, N) before passing it to the Marlin kernel:

# vllm/model_executor/kernels/linear/scaled_mm/marlin.py
if w_q.shape != (layer.input_size_per_partition, layer.output_size_per_partition):
    replace_parameter(layer, "weight", w_q.t())

For square matrices where N == K (e.g. q_proj and o_proj with hidden_dim × hidden_dim = 4096×4096 weights), the tuples (N, K) and (K, N) are identical, so the condition always evaluates to False and the transpose is silently skipped. The weight stays in the (N, K) checkpoint orientation while Marlin requires (K, N), and the GEMM produces completely wrong results.

In dense models, q_proj and o_proj are typically square (whenever num_heads × head_dim equals hidden_dim). Their corrupted outputs overwhelm the residual stream and cascade through all subsequent layers.
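
The failure mode can be reproduced with plain tuples, without any GPU. The following is a minimal sketch, not the vLLM code: `needs_transpose_by_shape` mirrors the shape comparison quoted above, while `needs_transpose_by_layout` and its `"NK"`/`"KN"` tags are hypothetical names illustrating one possible alternative (deciding from an explicit layout tag). The actual change in the fixing PR may differ.

```python
# Hypothetical stand-ins: n is the output size, k the input size per partition.
def needs_transpose_by_shape(weight_shape, k, n):
    """Shape-only check, mirroring the comparison quoted above."""
    return tuple(weight_shape) != (k, n)


def needs_transpose_by_layout(layout):
    """Assumed alternative: decide from an explicit layout tag, not the shape."""
    return layout != "KN"


# Non-square MLP projection: checkpoint stores (N, K) = (12800, 4096).
rect = needs_transpose_by_shape((12800, 4096), k=4096, n=12800)

# Square attention projection: (N, K) = (4096, 4096) is the same tuple as (K, N).
square = needs_transpose_by_shape((4096, 4096), k=4096, n=4096)

# With an explicit tag, the square case is no longer ambiguous.
tagged = needs_transpose_by_layout("NK")

print(rect, square, tagged)  # True False True
```

The shape check cannot distinguish the two orientations of a square weight, because the shape carries no orientation information once N == K.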

Fix Action

Fixed by PR: [Bugfix] Fix MarlinFP8 weight transpose silently skipped for square matrices (N==K) (https://github.com/vllm-project/vllm/pull/44113)

Environment

GPU: sm_75–sm_88 (e.g. A10, A40, RTX A6000) — any GPU where supports_fp8() returns False
Model: any FP8 model with square weight matrices, e.g. ibm-granite/granite-4.1-8b-fp8
Quantization: compressed-tensors or modelopt FP8 with channel/tensor-wise scales

Symptoms

FP8 models produce garbage output (repeated punctuation, incoherent tokens) on GPUs lacking native FP8 compute, while the identical non-FP8 model works correctly on the same hardware. No errors, no NaN — silent data corruption.

# FP8 model on A40 (sm_86)
OUTPUT: ',,,,,,,,,,,,,,,,,,,,'

# BF16 model on A40 (sm_86)
OUTPUT: 'The capital of France is Paris...'
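
Why this produces no error can be seen in a small, pure-Python sketch. The matrices and helper names below are illustrative only and are not taken from vLLM. With a square weight, multiplying by W instead of its transpose yields an output of the same shape, so nothing fails. Only the values are wrong.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


x = [[1.0, 2.0]]                 # one activation row, K = 2
w_nk = [[1.0, 2.0], [3.0, 4.0]]  # square weight in (N, K) checkpoint layout

expected = matmul(x, transpose(w_nk))  # y = x @ W^T, the intended result
skipped = matmul(x, w_nk)              # transpose skipped: shapes still line up

print(expected)  # [[5.0, 11.0]]
print(skipped)   # [[7.0, 10.0]]
```

Both results are finite 1×2 rows, which is consistent with the absence of errors or NaNs described above.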

Why it escaped CI

Most test shapes are non-square (e.g. MLP gate_proj at 12800×4096) — those always transposed correctly.
sm_89+ GPUs (H100, H200) use the native FP8 GEMM path and never reach the Marlin branch.
Isolated Marlin kernel tests all pass — the kernel math is correct; only the input weight layout is wrong.
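
A layout check that covers the square case explicitly would have caught this. The sketch below is an assumption about how such a regression check could look, written in plain Python with hypothetical helpers (`to_kn_by_shape`, `make_nk_weight`, `layout_is_correct`). It is not the test added by the fixing PR.

```python
import itertools


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def to_kn_by_shape(w, k, n):
    """Shape-based normalization, as in the snippet under Root Cause."""
    shape = (len(w), len(w[0]))
    return transpose(w) if shape != (k, n) else w


def make_nk_weight(n, k):
    """Deterministic (N, K) weight with distinct entries, so W != W^T."""
    counter = itertools.count(1)
    return [[next(counter) for _ in range(k)] for _ in range(n)]


def layout_is_correct(n, k, normalize):
    """True if normalize() turns an (N, K) weight into its (K, N) transpose."""
    w_nk = make_nk_weight(n, k)
    return normalize(w_nk, k, n) == transpose(w_nk)


# Shapes chosen so that the square case N == K is covered explicitly.
shapes = [(3, 2), (2, 3), (2, 2)]
results = {shape: layout_is_correct(*shape, to_kn_by_shape) for shape in shapes}
print(results)  # {(3, 2): True, (2, 3): True, (2, 2): False}
```

A suite containing only the first two shapes passes, which matches the observation that non-square test shapes always transposed correctly.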

Affected Code Path

On sm_75–sm_88, FP8 models fall back through:

CompressedTensors: CompressedTensorsW8A8Fp8 (needs sm_89) → CompressedTensorsW8A16Fp8 → MarlinFP8ScaledMMLinearKernel
ModelOpt: ModelOptFp8LinearMethod / ModelOptFp8PcPtLinearMethod → same MarlinFP8ScaledMMLinearKernel
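
The capability boundaries described above can be summarized in a small sketch. The function name and the returned labels are hypothetical and do not correspond to vLLM's dispatch code. The branch for capabilities below sm_75 is an assumption, since this report does not cover them.

```python
def select_fp8_linear_path(capability: int) -> str:
    """Map a CUDA compute capability (e.g. 86 for sm_86) to a path label."""
    if capability >= 89:
        # sm_89+: native FP8 GEMM, the Marlin branch is never reached.
        return "native_fp8"
    if capability >= 75:
        # sm_75 to sm_88: weight-only FP8 fallback through the Marlin kernel.
        return "marlin_w8a16"
    # Assumption: anything older is outside the range this report discusses.
    return "unsupported"


paths = {cap: select_fp8_linear_path(cap) for cap in (75, 86, 89, 90)}
print(paths)
```

Only the `marlin_w8a16` branch in this sketch corresponds to the code path where the skipped transpose matters.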

Related

PR #38092: introduced the regression
Issue #33314: layout canonicalization (referenced in the TODO comment in the affected code)
