vllm - ✅(Solved) Fix [Bug/Feature] TurboQuant + Hybrid MoE (Qwen3.6-35B-A3B) broken on Ampere (SM 80-86) — 13 patches with fixes [3 pull requests, 3 comments, 3 participants]

vllm-project/vllm#40124 (fetched 2026-04-18 05:52:27)

TurboQuant KV cache (k8v4) combined with hybrid MoE models (Qwen3.6-35B-A3B-FP8: 30 MoE + 10 dense layers, E=256, top_k=8, block_shape=[128,128]) does not work on Ampere GPUs (SM 80-86). Multiple code paths assume Ada Lovelace or newer (SM ≥ 89), and hybrid model geometry is not handled correctly.

We have identified and fixed 13 issues, packaged as a runtime monkey-patch applied at container startup. All patches are available as open source:

Repository: https://github.com/Sandermage/genesis-vllm-patches

PR fix notes

PR #40127: fix: add SM>=89 guard for Triton block FP8 and Marlin fallback on Ampere

Description (problem / solution / changelog)

Summary

Triton block FP8 kernel (TritonFp8BlockScaledMMKernel) uses fp8e4nv format which requires SM >= 89 (Ada Lovelace+). On Ampere GPUs (SM 86, e.g. RTX A5000/A6000), this causes a silent failure when loading block-quantized FP8 models (e.g. block_shape=[128,128]).

Changes

1. TritonFp8BlockScaledMMKernel.is_supported() — SM capability check

  • Query current_platform.get_device_capability() when compute_capability is not passed
  • Return False with descriptive message when SM < 89
  • Follows the same pattern used by other kernels (e.g. DeepGemmFp8BlockScaledMMKernel)

2. get_fp8_linear_kernel() — graceful fallback

  • Wrap block FP8 kernel selection in try/except ValueError
  • When all block FP8 kernels are unsupported, fall back to per-tensor FP8 kernels (_POSSIBLE_FP8_KERNELS) which includes Marlin
  • Marlin FP8 works correctly on SM 86 with per-tensor quantization
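The guard-plus-fallback pattern described in the two changes above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `is_block_fp8_supported`, `pick_block_kernel`, `select_fp8_kernel`, and the kernel name strings are hypothetical stand-ins, and the device capability is passed in as a `(major, minor)` tuple instead of being queried from the platform.

```python
from typing import Optional, Tuple

# Per the PR summary, the fp8e4nv format needs SM >= 89.
MIN_SM_FOR_FP8E4NV = 89


def is_block_fp8_supported(capability: Tuple[int, int]) -> Tuple[bool, Optional[str]]:
    """Hypothetical stand-in for the kernel's is_supported() check."""
    major, minor = capability
    sm = major * 10 + minor
    if sm < MIN_SM_FOR_FP8E4NV:
        return False, (
            f"Triton block FP8 requires SM >= {MIN_SM_FOR_FP8E4NV}, got SM {sm}"
        )
    return True, None


def pick_block_kernel(capability: Tuple[int, int]) -> str:
    """Raise ValueError when no block FP8 kernel supports this device."""
    supported, reason = is_block_fp8_supported(capability)
    if not supported:
        raise ValueError(reason)
    return "triton_block_fp8"  # hypothetical kernel label


def select_fp8_kernel(capability: Tuple[int, int]) -> str:
    """Try block FP8 first, then fall back to a per-tensor kernel."""
    try:
        return pick_block_kernel(capability)
    except ValueError:
        return "marlin_per_tensor_fp8"  # hypothetical fallback label


if __name__ == "__main__":
    print(select_fp8_kernel((8, 6)))  # RTX A5000 class device
    print(select_fp8_kernel((8, 9)))  # Ada Lovelace class device
```

Returning a reason string alongside the boolean keeps the failure message available for logging when the selector falls back.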

Testing

Tested on 2× NVIDIA RTX A5000 (SM 86, Ampere) with:

  • Model: Qwen3.6-35B-A3B-FP8 (MoE, block_shape=[128,128])
  • vLLM: v0.19.1 nightly, V1 engine, tensor_parallel_size=2
  • Result: 145+ tok/s generation speed, stable across 10 benchmark runs
  • Without this fix: crash during model loading (no supported FP8 block kernel)

Changed files

  • vllm/model_executor/kernels/linear/__init__.py (modified, +46/-17)
  • vllm/model_executor/kernels/linear/scaled_mm/triton.py (modified, +9/-0)

PR #40128: fix: handle non-divisible page sizes in hybrid model KV cache unification

Description (problem / solution / changelog)

Summary

unify_kv_cache_spec_page_size() raises NotImplementedError when page sizes across different layer types are not evenly divisible. This breaks hybrid models that mix attention (with TurboQuant KV cache) and recurrent layers (Mamba/DeltaNet).

Root cause: TurboQuant k8v4 with head_dim=256 yields page_size = block_size × num_kv_heads × 388 = 12416 bytes, while the DeltaNet/Mamba state is 12648448 bytes (~12.06 MiB). 12648448 % 12416 = 8960 ≠ 0, triggering the error.

Changes

Replace the hard NotImplementedError with LCM-based padding:

  1. Fast path preserved: when all smaller page sizes divide max_page_size evenly, behavior is unchanged
  2. Slow path (new): compute LCM of all smaller page sizes, pad max_page_size UP to the nearest multiple of that LCM
  3. For the padded layer, use page_size_padded via dataclasses.replace(); for layers that divide evenly, scale block_size as before
  4. Memory overhead is typically <0.1% (logged at INFO level)
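The fast path and the LCM-based slow path can be illustrated with the page sizes quoted in the root cause. This is a simplified sketch that works on plain integers; `unify_page_size` is a hypothetical helper, not the signature of `unify_kv_cache_spec_page_size()`, and it ignores the per-layer spec objects the real function rewrites.

```python
import math
from functools import reduce
from typing import List


def unify_page_size(page_sizes: List[int]) -> int:
    """Return a page size that every smaller page size divides evenly."""
    max_size = max(page_sizes)
    smaller = [p for p in page_sizes if p != max_size]

    # Fast path: the largest page is already a multiple of all others.
    if all(max_size % p == 0 for p in smaller):
        return max_size

    # Slow path: round the largest page UP to a multiple of the LCM
    # of the smaller page sizes.
    lcm = reduce(math.lcm, smaller)
    return -(-max_size // lcm) * lcm  # ceiling division, then scale


if __name__ == "__main__":
    attention_page = 12416      # TurboQuant k8v4 page size from the PR
    state_page = 12648448       # DeltaNet/Mamba state size from the PR
    unified = unify_page_size([attention_page, state_page])
    padding = unified - state_page
    print(unified, padding, f"{100 * padding / state_page:.4f}%")
```

For these two sizes the padded page is 12651904 bytes, which is 3456 bytes (about 0.03%) larger than the original state page and consistent with the <0.1% overhead stated above.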

Testing

Tested on Qwen3.6-35B-A3B-FP8 (hybrid: 30 MoE + 10 dense layers) with TurboQuant k8v4 KV cache on 2× RTX A5000:

  • Model loads successfully with unified page sizes
  • 145+ tok/s, 160k context window, 10/10 stability runs

Without this fix: crash at model init with NotImplementedError: The page size of the layer is not divisible by the maximum page size.

Changed files

  • vllm/v1/core/kv_cache_utils.py (modified, +51/-11)

PR #40129: perf: add tuned Triton MoE configs for RTX A5000 (E=256, N=512)

Description (problem / solution / changelog)

Summary

Add pre-tuned Triton kernel configurations for the NVIDIA RTX A5000 (SM 86, Ampere) with E=256 experts, N=512 intermediate size.

There are currently no tuned MoE configs for Ampere consumer/workstation GPUs. The Triton autotuner defaults work but leave ~15-20% performance on the table.

Tuning Methodology

Configs were generated using vLLM's built-in Triton autotuner on 2× RTX A5000 with:

  • Model: Qwen3.6-35B-A3B-FP8 (256 experts, top_k=8, block_shape=[128,128])
  • Batch sizes: 1, 2, 4, 8, 16, 32, 64, 128, 256
  • Best configs selected by lowest kernel execution time across 100 iterations per batch size

Key observations for SM 86 (Ampere):

  • BLOCK_SIZE_M=16 is optimal — MoE routes few tokens per expert
  • BLOCK_SIZE_K varies 64-256 depending on batch size
  • num_stages kept conservative (1-4) due to SM 86 shared memory constraints
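A quick back-of-the-envelope calculation shows why a small BLOCK_SIZE_M fits this workload. The sketch below assumes perfectly uniform routing, which real routers do not guarantee, and `avg_tokens_per_expert` is a hypothetical helper introduced only for this illustration.

```python
def avg_tokens_per_expert(batch_tokens: int, top_k: int = 8, num_experts: int = 256) -> float:
    """Average tokens routed to each expert, assuming uniform routing."""
    return batch_tokens * top_k / num_experts


if __name__ == "__main__":
    # Batch sizes used in the tuning run described above.
    for batch in (1, 2, 4, 8, 16, 32, 64, 128, 256):
        print(f"batch={batch:>3}  tokens/expert={avg_tokens_per_expert(batch):.3f}")
```

Under that assumption, even the largest tuned batch size (256) averages 8 tokens per expert, so a tile of 16 rows already covers a typical expert's workload.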

Performance

Metric | Without config | With config
Generation tok/s | ~125 | ~145
Improvement | baseline | +16%

File

vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_RTX_A5000.json

Follows the same naming convention as existing configs (e.g., E=256,N=512,device_name=NVIDIA_H100_80GB_HBM3.json).

Changed files

  • vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_RTX_A5000,dtype=fp8_w8a8,block_shape=[128,128].json (added, +74/-0)
Environment

  • vLLM: 0.19.1rc1 nightly (vllm/vllm-openai:nightly as of 2026-04-16)
  • GPUs: 2× NVIDIA RTX A5000 24 GB (SM86, Ampere), tensor_parallel_size=2
  • Model: Qwen/Qwen3.6-35B-A3B-FP8 (hybrid MoE, block_shape=[128,128])
  • Config: kv_cache_dtype=turboquant_k8v4, max_model_len=163840, gpu_memory_utilization=0.905
  • Driver: NVIDIA 570.211.01, CUDA 12.8
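For reproduction, the configuration above maps onto a `vllm serve` command line roughly as sketched below. The values come from the environment list; the model identifier follows the naming in the issue title and is an assumption, `build_serve_command` is a hypothetical helper, and the `turboquant_k8v4` cache dtype is only expected to be accepted by a build that includes TurboQuant support.

```python
import shlex
from typing import Dict, List, Union

# Assumed model id (follows the issue title); adjust to the actual repository.
MODEL = "Qwen/Qwen3.6-35B-A3B-FP8"

# Settings taken from the environment list in the issue.
SERVE_OPTIONS: Dict[str, Union[int, float, str]] = {
    "tensor-parallel-size": 2,
    "kv-cache-dtype": "turboquant_k8v4",
    "max-model-len": 163840,  # 160 * 1024 tokens
    "gpu-memory-utilization": 0.905,
}


def build_serve_command(model: str, options: Dict[str, Union[int, float, str]]) -> List[str]:
    """Turn an options mapping into an argv list for `vllm serve`."""
    cmd = ["vllm", "serve", model]
    for flag, value in options.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_serve_command(MODEL, SERVE_OPTIONS)))
```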

Issues Found (13 patches)

Critical — Model fails to run on Ampere without these

# | Issue | Related PR
1 | TritonBlockFP8ScaledMM uses fp8e4nv which requires SM ≥ 89 → silent compute errors on Ampere | -
2 | Kernel selector picks Triton for block FP8 on Ampere instead of Marlin fallback | -
3 | TurboQuant BF16→FP8 online cast uses Triton SM ≥ 89 features | Pre #39988
4 | Hybrid MoE models break TQ: excluded layers have different cache geometry | Improved #39931
5 | MoE vs dense layers produce different block sizes → page allocation mismatch | -
6 | KV cache interface does not account for TQ modified block sizes in skipped layers | #39931
7 | Backend selector routes excluded (non-TQ) layers through TQ dispatch → shape mismatch | #39931

Stability — Crashes at scale

# | Issue | Related PR
9 | max_num_kv_tokens includes Mamba state in hybrid models, inflating capacity | ai-jz/vllm#1
12 | NaN/Inf in MoE router logits crashes FlashInfer topk sort | #39391
13 | <tool_call> not treated as end-of-reasoning in <think> blocks for Qwen3 | #35687
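The issue does not describe how patch 12 handles non-finite router logits. One plausible guard, shown here purely as an assumption and in plain Python rather than on GPU tensors, is to replace NaN/Inf values with a large negative floor before the top-k selection so that the sort only ever sees finite numbers. `sanitize_logits`, `top_k_indices`, and the floor value are hypothetical.

```python
import math
from typing import List


def sanitize_logits(logits: List[float], floor: float = -1e4) -> List[float]:
    """Replace NaN and +/-Inf with a finite floor so sorting is well defined."""
    return [x if math.isfinite(x) else floor for x in logits]


def top_k_indices(logits: List[float], k: int) -> List[int]:
    """Indices of the k largest sanitized logits, highest first."""
    clean = sanitize_logits(logits)
    order = sorted(range(len(clean)), key=lambda i: clean[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    router_logits = [0.1, float("nan"), 2.0, float("inf"), 1.0]
    print(top_k_indices(router_logits, k=2))
```

Mapping +Inf to the floor as well is a deliberate choice in this sketch: it treats any non-finite logit as untrustworthy, so the affected expert is deprioritized instead of always being selected.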

Performance

# | Improvement | Related PR | Impact
8 | Dual-stream GDN input projection | #39748 | +8% decode
10 | naive_block_assignment for MoE | #39016 / #29354 | +3-5% decode
11 | Tuned Triton MoE kernel configs for A5000 | Original | -8% to -16.5% kernel latency

Benchmark Results (after all patches)

Hardware: 2× RTX A5000, TP=2, TurboQuant K8V4, max_model_len=160k

  • Decode speed: 145.3 tok/s (single request, short context)
  • 160k context: 48.6 tok/s ✅
  • Stability: 10/10 runs, σ=0.11 tok/s
  • Stress: 6/6 concurrent OK
  • CUDAGraphs mandatory: --enforce-eager drops speed 143→11 tok/s (13× slower)

Key Findings

  1. FP8 on Ampere works via Marlin — Triton FP8 kernels must be bypassed (patches 1-2)
  2. TurboQuant hybrid model support is incomplete — multiple issues with skip-layer handling (patches 4-7, relates to #39931)
  3. Expert Parallelism NOT needed for 3B active params — TP=2 sufficient
  4. 160k context achievable on 2×24 GB with TQ K8V4

Suggested Action

Several of these fixes overlap with open PRs (#39931, #39988, #39748, #39016, #39391, #35687). The Ampere-specific FP8 issues (patches 1, 2, 5) and the tuned Triton configs (patch 11) are original and not covered by any existing PR.

We are happy to contribute proper PRs for the original patches if the maintainers are interested.

Full details, code, and benchmarks: https://github.com/Sandermage/genesis-vllm-patches

extent analysis

TL;DR

Apply the provided monkey-patches from the genesis-vllm-patches repository to enable TurboQuant KV cache with hybrid MoE models on Ampere GPUs.

Guidance

  • Apply the 13 patches provided in the genesis-vllm-patches repository to address the identified issues with TurboQuant KV cache and hybrid MoE models on Ampere GPUs.
  • Verify that the patches resolve the issues by running benchmarks and checking for stability and performance improvements.
  • Consider contributing the original patches to the main repository as proper PRs to ensure long-term support and maintenance.
  • Be aware that some fixes overlap with existing open PRs, and coordinate with maintainers to avoid duplication of effort.

Example

No code snippet is provided as the issue is resolved through applying existing patches.

Notes

The provided patches are specific to the Ampere GPU architecture and may not be applicable to other hardware configurations. Additionally, the patches are provided as a monkey-patch, which may not be a permanent solution and should be reviewed and integrated into the main codebase.

Recommendation

Apply the workaround by using the provided monkey-patches from the genesis-vllm-patches repository, as the issues are specific to the Ampere GPU architecture and the hybrid MoE models, and the patches provide a functional solution.
