vllm - ✅(Solved) Fix [Bug]: Ampere sm_86 can't load W4A16 quant at TP=2 when a layer's output dim halves to <64 (Marlin min_thread_n block) [1 pull request]

vllm · 2026-04-20 12:29:11


Error Message

ValueError: Failed to find a kernel that can implement the WNA16 linear layer. The Marlin entry in the fallback chain reads: MarlinLinearKernel cannot implement due to: Weight output_size_per_partition = 32 is not divisible by min_thread_n = 64. Consider reducing tensor_parallel_size or running with --quantization gptq.

Root Cause

On RTX 3090 (Ampere sm_86), vllm/vllm-openai:latest (0.19.1), :gemma4-cu130, and :nightly (0.19.2rc1.dev21) all fail to load Intel/Qwen3.6-35B-A3B-int4-AutoRound at --tensor-parallel-size 2. One layer has an output dim of 64, which halves to 32 per rank and then trips Marlin's min_thread_n = 64 constraint. TP=1 works fine (156 TPS short-prompt on a single card). The quant is therefore unusable on dual-card consumer Ampere setups.
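The shape arithmetic behind the failure can be checked in a few lines. This is a minimal sketch: `MIN_THREAD_N`, `per_rank_n`, and `marlin_n_ok` are illustrative names rather than vLLM APIs, and only the values 64 and TP=2 come from the report.

```python
MIN_THREAD_N = 64  # value quoted in the error message (min_thread_n = 64)


def per_rank_n(output_dim: int, tp_size: int) -> int:
    """Output columns each rank holds when a layer is split evenly across ranks."""
    assert output_dim % tp_size == 0, "output dim must split evenly across ranks"
    return output_dim // tp_size


def marlin_n_ok(n: int) -> bool:
    """True when the per-rank output dim satisfies the divisibility check."""
    return n % MIN_THREAD_N == 0


for tp in (1, 2):
    n = per_rank_n(64, tp)
    print(f"TP={tp}: per-rank n={n}, passes check={marlin_n_ok(n)}")
```

At TP=1 the layer keeps all 64 columns and passes; at TP=2 each rank holds 32 and the check fails.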

Fix Action

Fix / Workaround

============================== CPU Info

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7543 32-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 1
Stepping: 1
BogoMIPS: 5599.65
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor fsrm
Virtualization: AMD-V
L1d cache: 1 MiB (16 instances)
L1i cache: 1 MiB (16 instances)
L2 cache: 8 MiB (16 instances)
L3 cache: 256 MiB (16 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsa: Vulnerable: Clear CPU buffers attempted, no microcode
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Not affected


PR fix notes

PR #40361: [Kernel][Bugfix] Marlin W4A16: pad sub-tile output dims on load

Repository: vllm-project/vllm
Author: noonghunna
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40361

Description (problem / solution / changelog)

Summary

The Marlin W4A16 kernel requires each per-rank output dim to be a multiple of GPTQ_MARLIN_MIN_THREAD_N = 64. When a packed layer's natural output dim shards below 64 under TP (e.g. Qwen3.5 GatedDeltaNet.in_proj_ba with num_v_heads=64 at TP>=2, or Intel/Qwen3.6-35B-A3B-int4-AutoRound whose packing yields an n=32 shard at TP=2), load fails with size_n=... not divisible by tile_n_size=64. On Ampere there is no stock fallback (Machete / CutlassW4A8 are sm_90+, AllSpark requires group_size=-1, etc.), so TP=2 is effectively unusable for these quants.

This PR adds a generic pad-on-load path inside MarlinLinearKernel:

- can_implement validates the shape check against round_up(n, 64).
- process_weights_after_loading pads qweight, scales, qzeros and bias along the output dim to the next tile multiple with zeros, swaps self.config.partition_weight_shape to the padded value so downstream repack / permute / zero-point transforms see the padded size, and stores the original n on the layer for later slicing.
- apply_weights calls marlin_gemm at padded_n and slices the extra columns off the output.
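The pad, multiply, slice sequence described in the steps above can be illustrated without any GPU code. The following is a minimal sketch under stated assumptions: it uses plain Python float lists instead of packed int4 tensors, an ordinary matrix multiply instead of marlin_gemm, and the helper names (`round_up`, `pad_columns`, `apply_padded`) are hypothetical and not taken from the PR. It only shows why zero-padding the output dim and slicing afterwards leaves the result unchanged.

```python
import random

TILE_N = 64  # tile multiple named in the PR summary (GPTQ_MARLIN_MIN_THREAD_N)


def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def pad_columns(weight, tile=TILE_N):
    """Zero-pad a k x n weight (list of rows) along n; return (padded, original n)."""
    n = len(weight[0])
    padded_n = round_up(n, tile)
    if padded_n == n:  # already tile-aligned: nothing to do
        return weight, n
    return [row + [0.0] * (padded_n - n) for row in weight], n


def matmul(x, w):
    """Plain reference matmul: (m x k) @ (k x n)."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def apply_padded(x, padded_weight, orig_n):
    """Multiply at the padded width, then drop the padding columns."""
    return [row[:orig_n] for row in matmul(x, padded_weight)]


random.seed(0)
for n in (32, 48, 96):  # the sub-tile sizes exercised by the PR's new test
    w = [[random.random() for _ in range(n)] for _ in range(16)]
    x = [[random.random() for _ in range(16)] for _ in range(2)]
    w_pad, orig_n = pad_columns(w)
    assert len(w_pad[0]) % TILE_N == 0
    assert apply_padded(x, w_pad, orig_n) == matmul(x, w)
```

Because the padding columns are all zeros and are sliced away, the first n output columns are computed from exactly the same operands as in the unpadded case.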

Runtime cost is zero — padding happens once at load. VRAM cost is a few KB per affected layer (zero-filled padding along output dim). When the shard is already tile-aligned, _maybe_pad_n returns early with padded_n == orig_n and the path is a no-op.

Also adds explicit attribute annotations to MPLinearKernel so mypy can type-check self.config / self.w_*_name accesses the new code introduces (also clears two pre-existing has-type errors in marlin.py).

Why this does not duplicate #36329

| | #36329 (sonusflow) | This PR |
| --- | --- | --- |
| Scope | Qwen3.5 GDN `in_proj_ba` only | Any layer, any Marlin-backed quant |
| Approach | Swap `MergedColumnParallelLinear` -> `ReplicatedLinear` | Pad inside the kernel itself |
| Future models with the same bug | Still need a per-model PR | Auto-handled |
| VRAM cost per affected layer | Full replication (~MB) | Zero-padded columns (~KB) |

These are complementary — #36329 removes a specific Marlin call site; this PR makes the remaining Marlin call sites tolerate sub-tile n.

Closes #35924 (generically, not Qwen3.5-GDN-specific)
Related: #40354

Testing

- test_marlin_gemm_sub_tile_n_pad[{32,48,96}] (new) in tests/kernels/quantization/test_marlin_gemm.py exercises the pad-before-repack -> marlin_gemm -> slice sequence and checks numerical parity against a reference matmul on the un-padded weight. All three parametrizations pass on RTX 3090 (sm_86, CUDA 12.9, nightly base).
- End-to-end on 2x RTX 3090 with Intel/Qwen3.6-35B-A3B-int4-AutoRound at TP=2: model loads, produces coherent output on OpenAI-compat /v1/completions, benches at 170.5 tok/s (up from 156 tok/s at TP=1 on the same hardware, and 137 tok/s palmfuture/Qwen3.6-35B-A3B-GPTQ-Int4 TP=2 baseline).
- Regression on Intel/gemma-4-31B-it-int4-AutoRound at TP=2 (tile-aligned shards): no padding log fires, 60.4 tok/s vs 61.0 on :latest, within noise.
- pre-commit run --files clean (ruff, ruff format, typos, mypy, SPDX, ...).
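For readers comparing the reported throughput figures, the relative changes can be worked out directly. This is a small arithmetic sketch; the helper name `pct_change` is illustrative, and the inputs are the tok/s numbers quoted in the testing notes above.

```python
def pct_change(new: float, base: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


# Figures quoted in the PR's testing notes (tok/s).
print(f"TP=2 with padding vs TP=1:          {pct_change(170.5, 156.0):+.1f}%")
print(f"TP=2 with padding vs GPTQ-Int4 TP=2: {pct_change(170.5, 137.0):+.1f}%")
print(f"gemma-4 regression run vs :latest:   {pct_change(60.4, 61.0):+.1f}%")
```

The first two comparisons come out to roughly +9% and +24%, and the regression run differs by about 1%, which matches the "within noise" reading.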

Test plan

- pytest tests/kernels/quantization/test_marlin_gemm.py::test_marlin_gemm_sub_tile_n_pad -v
- End-to-end TP=2 smoke bench on an AutoRound n=32-shard model
- Regression bench on a tile-aligned AutoRound model (fast path untouched)
- CI tests (needs ready label from a maintainer)

AI assistance (Claude Opus 4.7) was used to draft and verify the patch; the submitter reviewed every changed line.

Changed files

tests/kernels/quantization/test_marlin_gemm.py (modified, +76/-0)
vllm/model_executor/kernels/linear/mixed_precision/MPLinearKernel.py (modified, +6/-0)
vllm/model_executor/kernels/linear/mixed_precision/marlin.py (modified, +94/-3)


Your current environment

Collecting environment information...

============================== System Info

OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version : Could not collect
CMake version : Could not collect
Libc version : glibc-2.35

============================== PyTorch Info

PyTorch version : 2.11.0+cu129
Is debug build : False
CUDA used to build PyTorch : 12.9
ROCM used to build PyTorch : N/A
XPU used to build PyTorch : N/A

============================== Python Environment

Python version : 3.12.13 (main, Mar 4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform : Linux-6.8.0-110-generic-x86_64-with-glibc2.35

============================== CUDA / GPU Info

Is CUDA available : True
CUDA runtime version : 12.9.86
CUDA_MODULE_LOADING set to :
GPU models and configuration :
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
Nvidia driver version : 595.58.03
cuDNN version : Could not collect
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True

============================== CPU Info

============================== Versions of relevant libraries

[pip3] flashinfer-python==0.6.7
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.17.1.4
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.28.9
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+cu129
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu129
[pip3] torchvision==0.26.0+cu129
[pip3] transformers==5.5.4
[pip3] triton==3.6.0
[conda] Could not collect

============================== vLLM Info

ROCM Version : Could not collect
vLLM Version : 0.19.2rc1.dev21+g893611813 (git sha: 893611813)
vLLM Build Flags: CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled; XPU: Disabled
GPU Topology:

        GPU0    GPU1    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0    X       PHB     0-15            0                N/A
GPU1    PHB     X       0-15            0                N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

============================== Environment Variables

NVIDIA_REQUIRE_CUDA=cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 
brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571
CUDA_VERSION=12.9.1
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_ENABLE_CUDA_COMPATIBILITY=0
TORCH_CUDA_ARCH_LIST=7.0 7.5 8.0 8.9 9.0 10.0 12.0
VLLM_USAGE_SOURCE=production-docker-image
NVIDIA_CTK_LIBCUDA_DIR=/usr/lib/x86_64-linux-gnu
NVIDIA_VISIBLE_DEVICES=void
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

🐛 Describe the bug

Problem

Repro

docker run --rm --gpus all --ipc host --shm-size 16g -p 8000:8000 \
  -v "$HF_CACHE":/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model Intel/Qwen3.6-35B-A3B-int4-AutoRound \
  --tensor-parallel-size 2 --max-model-len 32768 \
  --gpu-memory-utilization 0.92 --disable-custom-all-reduce \
  --trust-remote-code

Same failure on :gemma4-cu130 and :nightly.

Kernel-selection fallback chain (from error)

ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons:
CutlassW4A8LinearKernel requires capability 90, current compute capability is 86
MacheteLinearKernel requires capability 90, current compute capability is 86
AllSparkLinearKernel cannot implement due to: For Ampere GPU, AllSpark does not support group_size = 128. Only group_size = -1 are supported.
MarlinLinearKernel cannot implement due to: Weight output_size_per_partition = 32 is not divisible by min_thread_n = 64. Consider reducing tensor_parallel_size or running with --quantization gptq.
ConchLinearKernel cannot implement due to: conch-triton-kernels is not installed, please install it via pip install conch-triton-kernels and try again!
ExllamaLinearKernel cannot implement due to: Exllama only supports float16 activations
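The shape of this error (one rejection reason per candidate kernel) suggests a try-each-in-order selection loop. The sketch below is a toy model of that pattern, not vLLM's actual choose_mp_linear_kernel: `LayerSpec`, `Kernel`, `choose_kernel`, and the three check functions are hypothetical, and only the rejection conditions are taken from the message above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LayerSpec:
    capability: int        # e.g. 86 for sm_86
    group_size: int        # quantization group size
    n_per_partition: int   # per-rank output dim


@dataclass
class Kernel:
    name: str
    # Returns None when the kernel can implement the layer, else a reason string.
    check: Callable[[LayerSpec], Optional[str]]


def needs_sm90(spec: LayerSpec) -> Optional[str]:
    if spec.capability >= 90:
        return None
    return f"requires capability 90, current compute capability is {spec.capability}"


def needs_no_groups(spec: LayerSpec) -> Optional[str]:
    if spec.group_size == -1:
        return None
    return f"does not support group_size = {spec.group_size}"


def needs_tile_aligned_n(spec: LayerSpec) -> Optional[str]:
    if spec.n_per_partition % 64 == 0:
        return None
    return f"output_size_per_partition = {spec.n_per_partition} is not divisible by min_thread_n = 64"


def choose_kernel(spec: LayerSpec, kernels: List[Kernel]) -> str:
    """Return the first kernel that accepts the layer; otherwise report every reason."""
    reasons = []
    for kernel in kernels:
        reason = kernel.check(spec)
        if reason is None:
            return kernel.name
        reasons.append(f"{kernel.name}: {reason}")
    raise ValueError("Failed to find a kernel. Reasons:\n" + "\n".join(reasons))


# Toy chain: a subset of the kernels named in the error, in the same order.
KERNELS = [
    Kernel("Machete", needs_sm90),
    Kernel("AllSpark", needs_no_groups),
    Kernel("Marlin", needs_tile_aligned_n),
]
```

In this toy chain, `choose_kernel(LayerSpec(86, 128, 64), KERNELS)` falls through to Marlin, while the same call with `n_per_partition=32` raises with all three reasons, mirroring the reported failure on sm_86.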

Stack trace points at gdn_linear_attn.py → MergedColumnParallelLinear → gptq_marlin.create_weights → choose_mp_linear_kernel.

Why the workaround advice doesn't help on Ampere

- --quantization gptq: routes through the same WNA16 kernel set, same failure.
- Reduce tensor-parallel-size: works (TP=1 loads), but dual-card users lose capacity.
- pip install conch-triton-kernels: not present in any published vLLM image (verified on :latest, :gemma4-cu130, :nightly). Users would need a custom image.
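To check whether a given image or environment already ships the package, the standard library can query installed distributions by name. This is a minimal sketch: `installed_version` is an illustrative helper, and the distribution name is the one quoted in the error message.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution name as it appears in the ConchLinearKernel error line.
print("conch-triton-kernels:", installed_version("conch-triton-kernels"))
```

Running this inside the container (for example via python3 -c or an interactive shell) prints a version string when the package is present and None otherwise.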

Model provider's position

I raised this on the HF repo ([Intel/Qwen3.6-35B-A3B-int4-AutoRound discussion]). Their response: padding the tensor at quantization time trades TP flexibility for TP=1 overhead — they suggested vLLM handle it at inference time instead.

Request

One of the following would unblock Ampere dual-card users:

1. Automatic pad-at-load-time in the Marlin path. When output_size_per_partition % min_thread_n != 0, pad the loaded weight slice to the next multiple of 64 and mask the extra columns during matmul. This is a one-time load-time cost; runtime is unaffected.
2. Bundle conch-triton-kernels in the stable Docker images (it's already referenced in the fallback chain; users hitting the error are asked to install it but can't easily).
3. Better error message pointing at the actual root cause: the current "Consider reducing tensor_parallel_size or running with --quantization gptq" suggestion doesn't work.

Option (1) would be the most general fix — any W4A16 quant with a small-dim layer would benefit, not just this specific model.

Environment

- GPUs: 2× NVIDIA RTX 3090 (Ampere, compute capability 8.6), PCIe-only, no NVLink
- Driver: 595.58.03 / CUDA 13.2
- Images tested: vllm/vllm-openai:latest (0.19.1), :gemma4-cu130, :nightly (0.19.2rc1.dev21)
- Quant: Intel/Qwen3.6-35B-A3B-int4-AutoRound (auto-round packing method, packed as auto_round:auto_gptq, bits=4, group_size=128, symmetric)
- Comparison: Intel/gemma-4-31B-it-int4-AutoRound loads and benches cleanly at TP=2 on the same hardware (61 TPS short-prompt decode), which confirms the issue is specific to this quant's tensor shapes, not the AutoRound format in general.


extent analysis

TL;DR

The most likely fix is automatic pad-at-load-time in the Marlin path, covering cases where the per-rank output dimension is not a multiple of Marlin's min_thread_n (64).

Guidance

The error comes from MarlinLinearKernel's requirement that output_size_per_partition be divisible by min_thread_n (64); here the per-rank value is 32.
The current workaround suggestions, such as reducing the tensor parallel size or running with --quantization gptq, do not resolve the issue for Ampere dual-card users.
To fix the issue, one of the following solutions can be implemented:
- Automatic pad-at-load-time in the Marlin path to pad the loaded weight slice to the next multiple of 64 and mask the extra columns during matmul.
- Bundle conch-triton-kernels in the stable Docker images to provide an alternative kernel implementation.
- Improve the error message to point to the actual root cause of the issue.

Example

No code snippet is provided as the issue is related to the underlying kernel implementation and requires changes to the Marlin path or the addition of alternative kernel implementations.

Notes

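The issue was reported against Intel/Qwen3.6-35B-A3B-int4-AutoRound on Ampere, but it is not limited to that model: the PR notes the same failure for Qwen3.5 GatedDeltaNet.in_proj_ba at TP>=2. Any Marlin-backed W4A16 layer whose per-rank output dim is not a multiple of 64 is affected, and Ampere is hit hardest because it has no stock fallback kernel.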

Recommendation

Implement automatic pad-at-load-time in the Marlin path: it is a general fix for any W4A16 quantization with small-dim layers, not just the model in question. PR #40361 takes this approach; it is listed above as open and not yet merged.
