vllm - ✅(Solved) Fix [Bug]: Kimi-K2.5 compressed-tensors MoE Marlin repack fails with PTX toolchain error on H200 (CUDA 12.8, driver 570.133.20) [1 pull request, 1 participant]

GitHub stats

vllm-project/vllm#38619 · Fetched 2026-04-08 01:58:55
Comments: 0 · Participants: 1 · Timeline: 4 · Reactions: 0
Timeline (top): referenced ×2, cross-referenced ×1, subscribed ×1

Error Message

torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

Root Cause

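Per PR #38669, MARLIN_OTHER_ARCHS and MARLIN_MOE_OTHER_ARCHS in CMakeLists.txt were set to "7.5;8.0+PTX", so on sm_90 (H100/H200) the driver must JIT-compile sm_80 PTX at runtime. Wheels built with CTK 12.9+ embed PTX with a newer ISA version than the CUDA 12.8 driver supports. --enforce-eager does not help because the crash occurs during weight repacking, not during inference graph capture.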

Fix Action

Workaround

Loading in BF16 (--quantization None) bypasses the Marlin MoE repack entirely but requires significantly more GPU memory.

PR fix notes

PR #38669: Fix Marlin repack PTX incompatibility on H100/H200 (CUDA 12.8)

Description (problem / solution / changelog)

Summary

Fixes #38619. The Marlin MoE repack kernel (gptq_marlin_moe_repack) crashes with "CUDA error: the provided PTX was compiled with an unsupported toolchain" when serving quantized MoE models (e.g. Kimi K2.5) on H100/H200 with a CUDA 12.8 driver. Pre-built wheels compiled with a newer CUDA toolkit embed PTX that the 12.8 driver cannot JIT-compile.

Root cause: MARLIN_OTHER_ARCHS and MARLIN_MOE_OTHER_ARCHS in CMakeLists.txt were set to "7.5;8.0+PTX", meaning on sm_90 (H100/H200) the driver must JIT-compile sm_80 PTX at runtime. If the wheel was built with CTK 12.9+, the embedded PTX uses a newer ISA version than the 12.8 driver supports.
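
The arch-list reasoning above can be modeled in a few lines of Python. This is only a simplified sketch, not vLLM or CUDA code: the function names are hypothetical, and it assumes the usual conventions that an entry such as "8.0+PTX" produces both sm_80 SASS and compute_80 PTX, that SASS only runs on devices of the same major architecture, and that PTX can be JIT-compiled only when the build toolkit is not newer than the CUDA version the driver supports.

```python
from typing import List, Tuple

Version = Tuple[int, int]


def parse_arch_list(arch_list: str) -> Tuple[List[Version], List[Version]]:
    """Split e.g. "7.5;8.0+PTX" into (SASS targets, PTX targets)."""
    sass: List[Version] = []
    ptx: List[Version] = []
    for entry in arch_list.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        has_ptx = entry.endswith("+PTX")
        major, minor = entry.removesuffix("+PTX").split(".")
        cc = (int(major), int(minor))
        sass.append(cc)          # assumption: every entry yields SASS
        if has_ptx:
            ptx.append(cc)       # "+PTX" additionally embeds PTX
    return sass, ptx


def load_path(arch_list: str, device_cc: Version) -> str:
    """Return how a kernel built for arch_list would load on device_cc."""
    sass, ptx = parse_arch_list(arch_list)
    # Simplified rule: SASS is usable on the same major arch, minor <= device.
    if any(m == device_cc[0] and n <= device_cc[1] for m, n in sass):
        return "sass"
    # Otherwise the driver must JIT-compile PTX from an older-or-equal target.
    if any(cc <= device_cc for cc in ptx):
        return "ptx-jit"
    return "unsupported"


def ptx_toolchain_error_expected(arch_list: str, device_cc: Version,
                                 build_toolkit: Version,
                                 driver_cuda: Version) -> bool:
    """True when the kernel needs PTX JIT and the PTX is too new for the driver."""
    return (load_path(arch_list, device_cc) == "ptx-jit"
            and build_toolkit > driver_cuda)


# H200 is sm_90; old list forces PTX JIT, new list provides native SASS.
print(load_path("7.5;8.0+PTX", (9, 0)))
print(load_path("7.5;8.0;9.0+PTX", (9, 0)))
```

Under this model, a wheel built with CTK 12.9 and the old arch list fails on a CUDA 12.8 driver, while the same wheel with 9.0 added never reaches the JIT path on H100/H200.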

Changes:

  • CMakeLists.txt: Add 9.0 to both MARLIN_OTHER_ARCHS and MARLIN_MOE_OTHER_ARCHS ("7.5;8.0;9.0+PTX"), so H100/H200 get native sm_90 SASS for Marlin repack kernels. The +PTX moves to 9.0 to preserve forward compatibility for future architectures.
  • vllm/_custom_ops.py: Wrap all four Marlin repack functions (gptq_marlin_repack, awq_marlin_repack, and their MoE variants) with try/except that catches the "unsupported toolchain" CUDA error and raises a diagnostic message including the driver version and build-from-source instructions.
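
The second change, wrapping the repack functions so the raw CUDA error becomes an actionable message, might look roughly like the sketch below. It is not the code from the PR: the decorator name, the message wording, and the stand-in repack function are hypothetical, and it assumes the error surfaces as a RuntimeError (or a subclass of it) whose text contains "unsupported toolchain".

```python
import functools

# Substring of the CUDA error text quoted in the issue.
TOOLCHAIN_MARKER = "unsupported toolchain"


def with_ptx_diagnostic(driver_version: str):
    """Decorator factory (hypothetical) that rewrites the PTX toolchain error."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except RuntimeError as err:
                # Unrelated runtime errors propagate unchanged.
                if TOOLCHAIN_MARKER not in str(err):
                    raise
                raise RuntimeError(
                    f"{fn.__name__} failed: the embedded PTX is newer than "
                    f"driver {driver_version} can JIT-compile. Build vLLM from "
                    f"source with a CUDA toolkit matching the driver."
                ) from err
        return wrapper
    return decorator


# Stand-in for a repack op; it only simulates the failure from the issue.
@with_ptx_diagnostic(driver_version="570.133.20")
def fake_repack():
    raise RuntimeError(
        "CUDA error: the provided PTX was compiled with an unsupported toolchain."
    )
```

Chaining with `from err` keeps the original CUDA error available as `__cause__`, so the full traceback is still visible in logs.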

Testing

Validated on an M2 cluster node:

  • Hardware: NVIDIA H200 (144GB), driver 570.133.20, CUDA 12.8
  • Build: vLLM built from source with CUDA_HOME=/usr/local/cuda-12.8, PyTorch 2.10.0+cu128
  • Model: moonshotai/Kimi-K2.5 (1T params, compressed-tensors WNA16 INT4, 384 MoE experts)
  • Config: TP8, --enforce-eager, --max-model-len 32768
  • Result: All 64 checkpoint shards loaded, Marlin MoE repack completed without errors, server started and responded to health checks on port 8042. Previously this crashed during process_weights_after_loading with the PTX toolchain error.

Changed files

  • CMakeLists.txt (modified, +12/-4)
  • vllm/_custom_ops.py (modified, +60/-12)

Original issue

Your current environment

  • vLLM versions tested: 0.17.1 and 0.18.0 (both fail identically)
  • GPU: NVIDIA H200 (144GB) x 32 (4 nodes, TP8 x PP4)
  • Driver: 570.133.20
  • CUDA: 12.8
  • PyTorch: 2.10.0+cu128
  • Python: 3.12.13
  • OS: Linux Ubuntu 22.04.5 LTS running kernel 5.15.0-153-generic on x86_64
  • Install method: pip install vllm (prebuilt wheel)
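
When comparing a failing machine against the environment above, it can help to collect the same fields programmatically. The following is a minimal sketch under a few assumptions: `collect_env` is a hypothetical helper, `nvidia-smi` may or may not be on the PATH, and PyTorch may or may not be installed, so every probe degrades to None instead of raising.

```python
import subprocess
import sys


def collect_env() -> dict:
    """Gather Python, driver/GPU, and PyTorch CUDA details (best effort)."""
    env = {"python": sys.version.split()[0]}

    # Driver version and GPU name, one line per GPU.
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version,name",
             "--format=csv,noheader"],
            check=True, capture_output=True, text=True, timeout=10,
        ).stdout
        env["gpus"] = [ln.strip() for ln in out.splitlines() if ln.strip()]
    except (OSError, subprocess.SubprocessError):
        env["gpus"] = None  # nvidia-smi missing, failed, or timed out

    # CUDA toolkit version PyTorch was built with, and device capability.
    try:
        import torch
    except ImportError:
        env["torch"] = None
        return env
    env["torch"] = torch.__version__
    env["torch_cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        env["capability"] = torch.cuda.get_device_capability(0)
    return env


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Note that this reports the toolkit PyTorch was built against, not the toolkit used to build the vLLM wheel, which is the version that matters for the embedded Marlin PTX.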

Describe the bug

Serving moonshotai/Kimi-K2.5 fails during process_weights_after_loading when the Marlin MoE repack kernel (gptq_marlin_moe_repack) attempts to execute. The model uses compressed-tensors quantization (WNA16, INT4, group_size=32) with MoE (384 experts). The checkpoint shards load successfully (64/64), but the subsequent Marlin weight repacking crashes with:

torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

Key observation: A standard (non-MoE) GPTQ-INT4 model (Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4) loads and serves correctly on the same cluster with the same vLLM install. The gptq_marlin path for dense models works; only the gptq_marlin_moe_repack path for MoE models fails. This suggests the PTX incompatibility is specific to the MoE Marlin kernel, not the dense Marlin kernel.

--enforce-eager does not help because the crash occurs during weight repacking, not during inference graph capture.

Failing stack trace

File "compressed_tensors_moe.py", line 1492, in process_weights_after_loading
    marlin_w13_qweight = ops.gptq_marlin_moe_repack(
File "_custom_ops.py", line 1332, in gptq_marlin_moe_repack
    output[e] = torch.ops._C.gptq_marlin_repack(
torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

Steps to reproduce

pip install vllm==0.17.1  # or 0.18.0, both fail
vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 4 \
    --trust-remote-code \
    --distributed-executor-backend ray \
    --max-model-len 262144 \
    --served-model-name kimi-k2.5 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --enforce-eager

Workaround

Loading in BF16 (--quantization None) bypasses the Marlin MoE repack entirely but requires significantly more GPU memory.

Related issues

  • #36235 (same PTX error for AWQ models on same driver version)
  • #35718 (garbled output from Kimi-K2.5 INT4 on H200)
  • #30834 (closed, PTX toolchain error on H100, resolved by building from source)

Expected behavior

The Marlin MoE repack kernel should work on H200 with CUDA 12.8 / driver 570.133.20, or vLLM should fall back to a compatible code path when the PTX is incompatible.
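
The fallback behavior requested here could be sketched as follows. This is illustrative only: `call_with_fallback` and the two stand-in functions are hypothetical, vLLM is not known to expose such a hook, and the sketch assumes the PTX load failure leaves the CUDA context usable for the alternative path.

```python
import warnings


def call_with_fallback(primary, fallback, *args, **kwargs):
    """Try primary; on the PTX toolchain error, warn and use fallback instead."""
    try:
        return primary(*args, **kwargs), "primary"
    except RuntimeError as err:
        # Only the specific toolchain error triggers the fallback.
        if "unsupported toolchain" not in str(err):
            raise
        warnings.warn(
            f"{primary.__name__} unavailable ({err}); "
            f"falling back to {fallback.__name__}"
        )
        return fallback(*args, **kwargs), "fallback"


# Stand-ins that simulate an incompatible kernel and a compatible code path.
def marlin_repack_stub(weights):
    raise RuntimeError(
        "CUDA error: the provided PTX was compiled with an unsupported toolchain."
    )


def reference_repack_stub(weights):
    return list(weights)


result, path = call_with_fallback(marlin_repack_stub, reference_repack_stub, [1, 2, 3])
```

Returning which path was taken makes it easy to log the degraded mode once at load time instead of failing the whole server start.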

Extended analysis

TL;DR

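The most likely fix is to build vLLM from source against a CUDA toolkit that matches the driver (CUDA 12.8 here), so the Marlin repack kernels no longer depend on PTX the driver cannot JIT-compile. PR #38669 targets the prebuilt-wheel case by adding native sm_90 SASS for these kernels.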

Guidance

  • Confirm the scope of the failure: on the same cluster and vLLM install, a dense GPTQ-INT4 model (Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4) loads and serves correctly, while the gptq_marlin_moe_repack path fails.
  • Consider building vLLM from source, as in the resolution of issue #30834, so the Marlin MoE repack kernel is compiled with a toolkit compatible with CUDA 12.8 and driver 570.133.20.
  • As a temporary workaround, loading the model in BF16 (--quantization None) bypasses the Marlin MoE repack, but this requires significantly more GPU memory.
  • Review related issues #36235 (same PTX error for AWQ models on the same driver version) and #35718 (garbled output from Kimi-K2.5 INT4 on H200) for additional insights or workarounds.

Example

No code snippet is provided, as the issue is related to a specific kernel and CUDA compatibility.

Notes

The failure depends on the combination of prebuilt wheel, GPU architecture, and driver version. PR #38669 was validated by building vLLM from source with CUDA 12.8 on an H200 with driver 570.133.20, where the Marlin MoE repack completed without errors.

Recommendation

Apply the BF16 workaround (--quantization None) if GPU memory allows. Otherwise, build vLLM from source with a CUDA toolkit matching the driver, which is the configuration validated in PR #38669.
