vllm - ✅(Solved) Fix [Bug/Feature] TurboQuant + Hybrid MoE (Qwen3.6-35B-A3B) broken on Ampere (SM 80-86) — 13 patches with fixes [3 pull requests, 3 comments, 3 participants]

vllm, 2026-04-17 09:17:18


vllm-project/vllm#40124•Fetched 2026-04-18 05:52:27


TurboQuant KV cache (k8v4) combined with hybrid MoE models (Qwen3.6-35B-A3B-FP8: 30 MoE + 10 dense layers, E=256, top_k=8, block_shape=[128,128]) does not work on Ampere GPUs (SM 80-86). Multiple code paths assume Ada Lovelace or newer (SM ≥ 89), and hybrid model geometry is not handled correctly.

We have identified and fixed 13 issues, packaged as a runtime monkey-patch applied at container startup. All patches are available as open source:

Repository: https://github.com/Sandermage/genesis-vllm-patches

PR fix notes

PR #40127: fix: add SM>=89 guard for Triton block FP8 and Marlin fallback on Ampere

Repository: vllm-project/vllm
Author: Sandermage
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40127

Description (problem / solution / changelog)

Summary

Triton block FP8 kernel (TritonFp8BlockScaledMMKernel) uses fp8e4nv format which requires SM >= 89 (Ada Lovelace+). On Ampere GPUs (SM 86, e.g. RTX A5000/A6000), this causes a silent failure when loading block-quantized FP8 models (e.g. block_shape=[128,128]).

Changes

1. `TritonFp8BlockScaledMMKernel.is_supported()` — SM capability check

Query current_platform.get_device_capability() when compute_capability is not passed
Return False with descriptive message when SM < 89
Follows the same pattern used by other kernels (e.g. DeepGemmFp8BlockScaledMMKernel)
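The capability check described above can be sketched in isolation. This is a minimal illustration, not the PR's code: the function name `is_block_fp8_supported` and the return shape are assumptions, and the caller passes the capability tuple directly, whereas the PR queries `current_platform.get_device_capability()` when none is given.

```python
from typing import Optional, Tuple

# Minimum compute capability for Triton's fp8e4nv format, as stated in the PR.
MIN_FP8E4NV_CAPABILITY = (8, 9)


def is_block_fp8_supported(
    capability: Optional[Tuple[int, int]],
) -> Tuple[bool, Optional[str]]:
    """Hypothetical stand-in for an is_supported()-style check.

    Returns (supported, reason); reason is None when supported.
    """
    if capability is None:
        return False, "device capability unknown"
    # Tuples compare element-wise, so (8, 6) < (8, 9) and (9, 0) > (8, 9).
    if capability < MIN_FP8E4NV_CAPABILITY:
        major, minor = capability
        return False, f"fp8e4nv requires SM >= 89, got SM {major}{minor}"
    return True, None


print(is_block_fp8_supported((8, 6)))  # RTX A5000 (Ampere): rejected
print(is_block_fp8_supported((8, 9)))  # Ada Lovelace: accepted
```

Returning a reason string alongside the boolean lets the kernel selector log why a candidate was skipped instead of failing later inside the kernel.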

2. `get_fp8_linear_kernel()` — graceful fallback

Wrap block FP8 kernel selection in try/except ValueError
When all block FP8 kernels are unsupported, fall back to per-tensor FP8 kernels (_POSSIBLE_FP8_KERNELS) which includes Marlin
Marlin FP8 works correctly on SM 86 with per-tensor quantization
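The fallback logic can be sketched as follows. Everything here is illustrative: the kernel names, the `Kernel` type, and the capability thresholds are assumptions chosen to mirror the description, not the contents of `_POSSIBLE_FP8_KERNELS` or the real selector.

```python
from typing import Callable, Sequence, Tuple

Capability = Tuple[int, int]
# Hypothetical kernel descriptor: a name plus a predicate on the capability.
Kernel = Tuple[str, Callable[[Capability], bool]]


def choose_kernel(candidates: Sequence[Kernel], capability: Capability) -> str:
    """Return the first supported kernel name, or raise ValueError."""
    for name, supports in candidates:
        if supports(capability):
            return name
    major, minor = capability
    raise ValueError(f"no supported kernel for SM {major}{minor}")


def select_fp8_kernel(
    block_kernels: Sequence[Kernel],
    per_tensor_kernels: Sequence[Kernel],
    capability: Capability,
) -> str:
    """Prefer block FP8 kernels; fall back to per-tensor kernels if none fit."""
    try:
        return choose_kernel(block_kernels, capability)
    except ValueError:
        return choose_kernel(per_tensor_kernels, capability)


# Illustrative candidate lists (thresholds are assumptions).
BLOCK = [("triton_block_fp8", lambda cap: cap >= (8, 9))]
PER_TENSOR = [("marlin_fp8", lambda cap: cap >= (8, 0))]

print(select_fp8_kernel(BLOCK, PER_TENSOR, (8, 6)))  # marlin_fp8
print(select_fp8_kernel(BLOCK, PER_TENSOR, (9, 0)))  # triton_block_fp8
```

If the per-tensor list also has no supported kernel, the second `ValueError` propagates, so an unsupported device still fails loudly rather than silently.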

Testing

Tested on 2× NVIDIA RTX A5000 (SM 86, Ampere) with:

Model: Qwen3.6-35B-A3B-FP8 (MoE, block_shape=[128,128])
vLLM: v0.19.1 nightly, V1 engine, tensor_parallel_size=2
Result: 145+ tok/s generation speed, stable across 10 benchmark runs
Without this fix: crash during model loading (no supported FP8 block kernel)

Closes https://github.com/vllm-project/vllm/issues/40124 (Ampere FP8 part)
Genesis Project patches: https://github.com/Sandermage/genesis-vllm-patches

Changed files

vllm/model_executor/kernels/linear/__init__.py (modified, +46/-17)
vllm/model_executor/kernels/linear/scaled_mm/triton.py (modified, +9/-0)

PR #40128: fix: handle non-divisible page sizes in hybrid model KV cache unification

Repository: vllm-project/vllm
Author: Sandermage
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40128

Description (problem / solution / changelog)

Summary

unify_kv_cache_spec_page_size() raises NotImplementedError when page sizes across different layer types are not evenly divisible. This breaks hybrid models that mix attention (with TurboQuant KV cache) and recurrent layers (Mamba/DeltaNet).

Root cause: TurboQuant k8v4 with head_dim=256 yields page_size = block_size × num_kv_heads × 388 = 12416 bytes (so block_size × num_kv_heads = 32), while the DeltaNet/Mamba state is 12648448 bytes (about 12.06 MiB). 12648448 % 12416 = 8960 ≠ 0, which triggers the error.

Changes

Replace the hard NotImplementedError with LCM-based padding:

Fast path preserved: when all smaller page sizes divide max_page_size evenly, behavior is unchanged
Slow path (new): compute LCM of all smaller page sizes, pad max_page_size UP to the nearest multiple of that LCM
For the padded layer, use page_size_padded via dataclasses.replace(); for layers that divide evenly, scale block_size as before
Memory overhead is typically <0.1% (logged at INFO level)
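The padding arithmetic can be checked with a short sketch. The function name `pad_max_page_size` is hypothetical and only models the size computation; the PR additionally rewrites the layer specs (`page_size_padded`, scaled `block_size`), which is not shown. `math.lcm` requires Python 3.9 or newer.

```python
import math
from typing import Iterable


def pad_max_page_size(page_sizes: Iterable[int]) -> int:
    """Return the unified page size: the largest size, padded up if needed
    so that every smaller page size divides it evenly."""
    sizes = sorted(set(page_sizes))
    max_size, smaller = sizes[-1], sizes[:-1]
    # Fast path: every smaller size already divides the maximum.
    if all(max_size % s == 0 for s in smaller):
        return max_size
    # Slow path: round the maximum up to the next multiple of the LCM.
    unit = math.lcm(*smaller)
    return -(-max_size // unit) * unit  # ceiling division, then scale back


attn_page = 12416        # TurboQuant k8v4 attention page, from the summary
state_page = 12648448    # DeltaNet/Mamba state size, from the summary

padded = pad_max_page_size([attn_page, state_page])
overhead = (padded - state_page) / state_page
print(padded)               # 12651904
print(padded % attn_page)   # 0
print(f"{overhead:.4%}")    # 0.0273%
```

For the sizes quoted in this PR the padding adds 3456 bytes per page, which is consistent with the stated overhead of less than 0.1%.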

Testing

Tested on Qwen3.6-35B-A3B-FP8 (hybrid: 30 MoE + 10 dense layers) with TurboQuant k8v4 KV cache on 2× RTX A5000:

Model loads successfully with unified page sizes
145+ tok/s, 160k context window, 10/10 stability runs

Without this fix: crash at model init with NotImplementedError: The page size of the layer is not divisible by the maximum page size.

Refs https://github.com/vllm-project/vllm/issues/40124
Genesis Project patches: https://github.com/Sandermage/genesis-vllm-patches

Changed files

vllm/v1/core/kv_cache_utils.py (modified, +51/-11)

PR #40129: perf: add tuned Triton MoE configs for RTX A5000 (E=256, N=512)

Repository: vllm-project/vllm
Author: Sandermage
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40129

Description (problem / solution / changelog)

Summary

Add pre-tuned Triton kernel configurations for the NVIDIA RTX A5000 (SM 86, Ampere) with E=256 experts, N=512 intermediate size.

There are currently no tuned MoE configs for Ampere consumer/workstation GPUs. The Triton autotuner defaults work but leave ~15-20% performance on the table.

Tuning Methodology

Configs were generated using vLLM's built-in Triton autotuner on 2× RTX A5000 with:

Model: Qwen3.6-35B-A3B-FP8 (256 experts, top_k=8, block_shape=[128,128])
Batch sizes: 1, 2, 4, 8, 16, 32, 64, 128, 256
Best configs selected by lowest kernel execution time across 100 iterations per batch size

Key observations for SM 86 (Ampere):

BLOCK_SIZE_M=16 is optimal — MoE routes few tokens per expert
BLOCK_SIZE_K varies 64-256 depending on batch size
num_stages kept conservative (1-4) due to SM 86 shared memory constraints

Performance

Metric	Without config	With config
Generation tok/s	~125	~145
Improvement	baseline	+16%

File

vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_RTX_A5000.json

Follows the same naming convention as existing configs (e.g., E=256,N=512,device_name=NVIDIA_H100_80GB_HBM3.json).
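The file name acts as the lookup key for a tuned config. The sketch below reconstructs that naming scheme purely from the file names that appear in this PR; the helper `moe_config_file_name` is hypothetical and is not vLLM's own function.

```python
from typing import Optional, Sequence


def moe_config_file_name(
    num_experts: int,
    n: int,
    device_name: str,
    dtype: Optional[str] = None,
    block_shape: Optional[Sequence[int]] = None,
) -> str:
    """Build a tuned-config file name in the style used by this PR."""
    parts = [
        f"E={num_experts}",
        f"N={n}",
        # Spaces in the device name become underscores.
        f"device_name={device_name.replace(' ', '_')}",
    ]
    if dtype:
        parts.append(f"dtype={dtype}")
    if block_shape:
        dims = ",".join(str(b) for b in block_shape)
        parts.append(f"block_shape=[{dims}]")
    return ",".join(parts) + ".json"


print(moe_config_file_name(256, 512, "NVIDIA RTX A5000"))
print(moe_config_file_name(256, 512, "NVIDIA RTX A5000", "fp8_w8a8", [128, 128]))
```

The second call reproduces the name listed under "Changed files", which carries the extra `dtype` and `block_shape` selectors that the shorter name in this section omits.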

Refs https://github.com/vllm-project/vllm/issues/40124
Genesis Project: https://github.com/Sandermage/genesis-vllm-patches

Changed files

vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_RTX_A5000,dtype=fp8_w8a8,block_shape=[128,128].json (added, +74/-0)

Environment

vLLM: 0.19.1rc1 nightly (vllm/vllm-openai:nightly as of 2026-04-16)
GPUs: 2× NVIDIA RTX A5000 24 GB (SM86, Ampere), tensor_parallel_size=2
Model: Qwen/Qwen3.6-35B-A3B-FP8 (hybrid MoE, block_shape=[128,128])
Config: kv_cache_dtype=turboquant_k8v4, max_model_len=163840, gpu_memory_utilization=0.905
Driver: NVIDIA 570.211.01, CUDA 12.8
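The report does not show the actual launch command. As a rough sketch, the settings above map onto standard `vllm serve` flags as follows; the helper name is hypothetical, the model ID is the name used elsewhere in this report, and the `turboquant_k8v4` value is taken from the Config line and presumably requires a build that includes TurboQuant.

```python
import shlex
from typing import List


def build_serve_command(model: str) -> List[str]:
    """Assemble a `vllm serve` argument list from the Environment settings."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "2",
        "--kv-cache-dtype", "turboquant_k8v4",  # value from the report
        "--max-model-len", str(160 * 1024),     # 163840 tokens
        "--gpu-memory-utilization", "0.905",
    ]


cmd = build_serve_command("Qwen/Qwen3.6-35B-A3B-FP8")
print(shlex.join(cmd))
```

Note that `--enforce-eager` is deliberately absent: the benchmarks in this report show a large slowdown when CUDA graphs are disabled.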

Issues Found (13 patches)

Critical — Model fails to run on Ampere without these

#	Issue	Related PR
1	`TritonBlockFP8ScaledMM` uses `fp8e4nv` which requires SM ≥ 89 → silent compute errors on Ampere	—
2	Kernel selector picks Triton for block FP8 on Ampere instead of Marlin fallback	—
3	TurboQuant BF16→FP8 online cast uses Triton SM ≥ 89 features	Pre #39988
4	Hybrid MoE models break TQ — excluded layers have different cache geometry	Improved #39931
5	MoE vs dense layers produce different block sizes → page allocation mismatch	—
6	KV cache interface does not account for TQ modified block sizes in skipped layers	#39931
7	Backend selector routes excluded (non-TQ) layers through TQ dispatch → shape mismatch	#39931

Stability — Crashes at scale

#	Issue	Related PR
9	`max_num_kv_tokens` includes Mamba state in hybrid models, inflating capacity	ai-jz/vllm#1
12	NaN/Inf in MoE router logits crashes FlashInfer topk sort	#39391
13	`<tool_call>` not treated as end-of-reasoning in `<think>` blocks for Qwen3	#35687

Performance

#	Improvement	Related PR	Impact
8	Dual-stream GDN input projection	#39748	+8% decode
10	`naive_block_assignment` for MoE	#39016 / #29354	+3-5% decode
11	Tuned Triton MoE kernel configs for A5000	Original	-8% to -16.5% kernel latency

Benchmark Results (after all patches)

Hardware: 2× RTX A5000, TP=2, TurboQuant K8V4, max_model_len=160k

Decode speed: 145.3 tok/s (single request, short context)
160k context: 48.6 tok/s ✅
Stability: 10/10 runs, σ=0.11 tok/s
Stress: 6/6 concurrent OK
CUDAGraphs mandatory: --enforce-eager drops speed 143→11 tok/s (13× slower)

Key Findings

FP8 on Ampere works via Marlin — Triton FP8 kernels must be bypassed (patches 1-2)
TurboQuant hybrid model support is incomplete — multiple issues with skip-layer handling (patches 4-7, relates to #39931)
Expert Parallelism NOT needed for 3B active params — TP=2 sufficient
160k context achievable on 2×24 GB with TQ K8V4

Suggested Action

Several of these fixes overlap with open PRs (#39931, #39988, #39748, #39016, #39391, #35687). The Ampere-specific FP8 fixes (patches 1 and 2), the block-size mismatch fix (patch 5), and the tuned Triton configs (patch 11) are original and not covered by any existing PR.

We are happy to contribute proper PRs for the original patches if the maintainers are interested.

Full details, code, and benchmarks: https://github.com/Sandermage/genesis-vllm-patches

extent analysis

TL;DR

Apply the provided monkey-patches from the genesis-vllm-patches repository to enable TurboQuant KV cache with hybrid MoE models on Ampere GPUs.

Guidance

Apply the 13 patches provided in the genesis-vllm-patches repository to address the identified issues with TurboQuant KV cache and hybrid MoE models on Ampere GPUs.
Verify that the patches resolve the issues by running benchmarks and checking for stability and performance improvements.
Consider contributing the original patches to the main repository as proper PRs to ensure long-term support and maintenance.
Be aware that some fixes overlap with existing open PRs, and coordinate with maintainers to avoid duplication of effort.

Example

No code snippet is provided as the issue is resolved through applying existing patches.

Notes

The provided patches are specific to the Ampere GPU architecture and may not be applicable to other hardware configurations. Additionally, the patches are provided as a monkey-patch, which may not be a permanent solution and should be reviewed and integrated into the main codebase.

Recommendation

Use the monkey-patches from the genesis-vllm-patches repository as a workaround. The issues are specific to Ampere GPUs running hybrid MoE models, and the patches provide a functional solution.
