vllm - ✅(Solved) Fix [Bug]: Kimi-K2.5 compressed-tensors MoE Marlin repack fails with PTX toolchain error on H200 (CUDA 12.8, driver 570.133.20) [1 pull requests, 1 participants]

DavidBellamy · 2026-03-31T08:53:13Z



GitHub stats

vllm-project/vllm#38619•Fetched 2026-04-08 01:58:55


Error Message

torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

Root Cause

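Per PR #38669, MARLIN_OTHER_ARCHS and MARLIN_MOE_OTHER_ARCHS in CMakeLists.txt were set to "7.5;8.0+PTX", so on sm_90 (H100/H200) the driver must JIT-compile sm_80 PTX at runtime. If the wheel was built with CTK 12.9+, the embedded PTX uses a newer ISA version than the CUDA 12.8 driver supports. --enforce-eager does not help because the crash occurs during weight repacking, not during inference graph capture.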

Fix Action

Workaround

Loading in BF16 (--quantization None) bypasses the Marlin MoE repack entirely but requires significantly more GPU memory.

PR fix notes

PR #38669: Fix Marlin repack PTX incompatibility on H100/H200 (CUDA 12.8)

Repository: vllm-project/vllm
Author: DavidBellamy
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38669

Description (problem / solution / changelog)

Summary

Fixes #38619. The Marlin MoE repack kernel (gptq_marlin_moe_repack) crashes with CUDA error: the provided PTX was compiled with an unsupported toolchain when serving quantized MoE models (e.g. Kimi K2.5) on H100/H200 with a CUDA 12.8 driver, because pre-built wheels compiled with a newer CUDA toolkit generate PTX that the 12.8 driver cannot JIT-compile.

Root cause: MARLIN_OTHER_ARCHS and MARLIN_MOE_OTHER_ARCHS in CMakeLists.txt were set to "7.5;8.0+PTX", meaning on sm_90 (H100/H200) the driver must JIT-compile sm_80 PTX at runtime. If the wheel was built with CTK 12.9+, the embedded PTX uses a newer ISA version than the 12.8 driver supports.
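
To make the root cause concrete, here is a minimal Python sketch of how an arch list decides whether a GPU runs native SASS or falls back to PTX JIT. It is not vLLM's build logic. The helper names (parse_arch_list, load_path) are hypothetical, and it assumes that "X.Y+PTX" means both SASS and PTX are embedded for X.Y, as in the PyTorch-style arch list convention. It also uses the general CUDA rule that SASS is compatible within a major version, while PTX can be JIT-compiled for any equal or newer capability.

```python
def parse_arch_list(arch_list):
    """Split a list such as "7.5;8.0+PTX" into (sass_caps, ptx_caps).

    Assumption: an entry with +PTX embeds both SASS and PTX for that capability.
    """
    sass, ptx = set(), set()
    for entry in arch_list.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        has_ptx = entry.endswith("+PTX")
        major, minor = entry.removesuffix("+PTX").split(".")
        cap = (int(major), int(minor))
        sass.add(cap)
        if has_ptx:
            ptx.add(cap)
    return sass, ptx


def load_path(arch_list, device_cap):
    """Return how a device with capability (major, minor) would load the kernel."""
    sass, ptx = parse_arch_list(arch_list)
    # SASS runs on the same major version with an equal or higher minor version.
    if any(c[0] == device_cap[0] and c[1] <= device_cap[1] for c in sass):
        return "native SASS"
    # Otherwise the driver must JIT-compile PTX built for an older capability.
    if any(c <= device_cap for c in ptx):
        return "PTX JIT"
    return "no compatible code"


print(load_path("7.5;8.0+PTX", (9, 0)))      # old setting on H100/H200
print(load_path("7.5;8.0;9.0+PTX", (9, 0)))  # setting proposed in the PR
```

Under these assumptions the old list sends sm_90 down the PTX JIT path, which is where a driver/toolkit mismatch can surface, while the proposed list gives sm_90 native SASS.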

Changes:

CMakeLists.txt: Add 9.0 to both MARLIN_OTHER_ARCHS and MARLIN_MOE_OTHER_ARCHS ("7.5;8.0;9.0+PTX"), so H100/H200 get native sm_90 SASS for Marlin repack kernels. The +PTX moves to 9.0 to preserve forward compatibility for future architectures.
vllm/_custom_ops.py: Wrap all four Marlin repack functions (gptq_marlin_repack, awq_marlin_repack, and their MoE variants) with try/except that catches the "unsupported toolchain" CUDA error and raises a diagnostic message including the driver version and build-from-source instructions.
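
The try/except change described above could look roughly like the following sketch. This is not the code from the PR: the decorator name with_ptx_diagnostics, the message wording, and the stand-in function fake_repack are all hypothetical, and the sketch assumes the CUDA error surfaces as a RuntimeError (or a subclass of it) whose text contains "unsupported toolchain".

```python
import functools

# Substring taken from the error message reported in the issue.
PTX_ERROR_MARKER = "unsupported toolchain"


def with_ptx_diagnostics(fn):
    """Re-raise PTX toolchain errors with a more actionable message (hypothetical helper)."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except RuntimeError as exc:
            if PTX_ERROR_MARKER not in str(exc):
                raise  # unrelated errors pass through unchanged
            raise RuntimeError(
                f"{fn.__name__} failed because the installed driver could not "
                "JIT-compile the PTX embedded in this build. Consider updating "
                "the driver or trying to build vLLM from source against the "
                "local CUDA toolkit."
            ) from exc

    return wrapper


@with_ptx_diagnostics
def fake_repack():
    # Stand-in for a repack op; it only simulates the reported failure.
    raise RuntimeError(
        "CUDA error: the provided PTX was compiled with an unsupported toolchain."
    )


try:
    fake_repack()
except RuntimeError as err:
    print(err)
```

Chaining with `from exc` keeps the original CUDA error in the traceback, so the diagnostic adds context without hiding the underlying failure.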

Testing

Validated on an M2 cluster node:

Hardware: NVIDIA H200 (144GB), driver 570.133.20, CUDA 12.8
Build: vLLM built from source with CUDA_HOME=/usr/local/cuda-12.8, PyTorch 2.10.0+cu128
Model: moonshotai/Kimi-K2.5 (1T params, compressed-tensors WNA16 INT4, 384 MoE experts)
Config: TP8, --enforce-eager, --max-model-len 32768
Result: All 64 checkpoint shards loaded, Marlin MoE repack completed without errors, server started and responded to health checks on port 8042. Previously this crashed during process_weights_after_loading with the PTX toolchain error.

Changed files

CMakeLists.txt (modified, +12/-4)
vllm/_custom_ops.py (modified, +60/-12)

Your current environment

vLLM versions tested: 0.17.1 and 0.18.0 (both fail identically)
GPU: NVIDIA H200 (144GB) x 32 (4 nodes, TP8 x PP4)
Driver: 570.133.20
CUDA: 12.8
PyTorch: 2.10.0+cu128
Python: 3.12.13
OS: Linux Ubuntu 22.04.5 LTS running kernel 5.15.0-153-generic on x86_64
Install method: pip install vllm (prebuilt wheel)
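
The environment above matters because of the version relationship between the toolkit that built the wheel and the CUDA version the driver supports. The sketch below illustrates that comparison only. The function names are hypothetical, and it simplifies the real rule (which is about PTX ISA versions) to a major.minor comparison, assuming a newer toolkit emits PTX the older driver cannot JIT-compile.

```python
def parse_version(text):
    """Turn "12.8" or "12.8.1" into a comparable (major, minor) tuple."""
    parts = text.split(".")
    return int(parts[0]), int(parts[1])


def ptx_jit_supported(build_toolkit, driver_cuda):
    """Simplified check: PTX from a toolkit newer than the driver's CUDA version
    is assumed not to be JIT-compilable by that driver."""
    return parse_version(build_toolkit) <= parse_version(driver_cuda)


# Driver 570.133.20 reports CUDA 12.8 in the issue; CTK 12.9+ is the case
# the PR names as problematic.
print(ptx_jit_supported("12.9", "12.8"))
print(ptx_jit_supported("12.8", "12.8"))
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating "12.10" as older than "12.9".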

Describe the bug

Serving moonshotai/Kimi-K2.5 fails during process_weights_after_loading when the Marlin MoE repack kernel (gptq_marlin_moe_repack) attempts to execute. The model uses compressed-tensors quantization (WNA16, INT4, group_size=32) with MoE (384 experts). The checkpoint shards load successfully (64/64), but the subsequent Marlin weight repacking crashes with:

torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

Key observation: A standard (non-MoE) GPTQ-INT4 model (Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4) loads and serves correctly on the same cluster with the same vLLM install. The gptq_marlin path for dense models works; only the gptq_marlin_moe_repack path for MoE models fails. This suggests the PTX incompatibility is specific to the MoE Marlin kernel, not the dense Marlin kernel.

--enforce-eager does not help because the crash occurs during weight repacking, not during inference graph capture.

Failing stack trace

File "compressed_tensors_moe.py", line 1492, in process_weights_after_loading
    marlin_w13_qweight = ops.gptq_marlin_moe_repack(
File "_custom_ops.py", line 1332, in gptq_marlin_moe_repack
    output[e] = torch.ops._C.gptq_marlin_repack(
torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

Steps to reproduce

pip install vllm==0.17.1  # or 0.18.0, both fail
vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 4 \
    --trust-remote-code \
    --distributed-executor-backend ray \
    --max-model-len 262144 \
    --served-model-name kimi-k2.5 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --enforce-eager

Workaround

Loading in BF16 (--quantization None) bypasses the Marlin MoE repack entirely but requires significantly more GPU memory.
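
A rough back-of-the-envelope estimate shows why the BF16 workaround is costly. The sketch below uses the "1T params" and group_size=32 figures quoted on this page. It assumes one 16-bit scale per group and, as a simplification, that every parameter is quantized, which overstates the savings if some layers stay in higher precision. It counts weights only, not KV cache or activations.

```python
PARAMS = 1e12        # "1T params", as quoted in the PR testing notes
GROUP_SIZE = 32      # quantization group size from the issue
SCALE_BITS = 16      # assumption: one 16-bit scale per group


def weight_gb(params, bits_per_weight):
    """Weight storage in GB (10^9 bytes) for a given average bits per weight."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = weight_gb(PARAMS, 16)
int4_gb = weight_gb(PARAMS, 4 + SCALE_BITS / GROUP_SIZE)

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"INT4 weights: ~{int4_gb:.0f} GB")
print(f"Ratio: ~{bf16_gb / int4_gb:.1f}x")
```

Under these assumptions the BF16 weights come to about 2,000 GB versus roughly 560 GB for INT4 with scales, a bit over 3.5 times more memory for weights alone.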

Related issues

#36235 (same PTX error for AWQ models on same driver version)
#35718 (garbled output from Kimi-K2.5 INT4 on H200)
#30834 (closed, PTX toolchain error on H100, resolved by building from source)

Expected behavior

The Marlin MoE repack kernel should work on H200 with CUDA 12.8 / driver 570.133.20, or vLLM should fall back to a compatible code path when the PTX is incompatible.

extent analysis

TL;DR

The most likely fix is to build vLLM from source against the installed CUDA 12.8 toolkit, so that the Marlin MoE repack kernel is compiled with a toolchain the 570.133.20 driver supports.

Guidance

Confirm that the failure is limited to the MoE Marlin repack path: on the same install, a dense GPTQ-INT4 model (Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4) loads and serves correctly.
Build vLLM from source with the local CUDA 12.8 toolkit, as in the resolution of issue #30834, so the kernels match what driver 570.133.20 supports.
As a temporary workaround, loading the model in BF16 (--quantization None) bypasses the Marlin MoE repack, but this requires significantly more GPU memory.
Review related issues #36235 (same PTX error for AWQ models on the same driver version) and #35718 (garbled Kimi-K2.5 INT4 output on H200) for additional workarounds.

Example

No code snippet is provided, as the issue is related to a specific kernel and CUDA compatibility.

Notes

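The failure was observed only on the MoE Marlin repack path with CUDA 12.8 and driver 570.133.20; a dense GPTQ-INT4 model served correctly on the same install. In the PR's test on an H200, a source build with CUDA_HOME=/usr/local/cuda-12.8 completed the repack without errors, but PR #38669 was still open and unmerged when this page was fetched.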

Recommendation

If enough GPU memory is available, apply the BF16 workaround (--quantization None) as a stopgap. Otherwise, build vLLM from source against CUDA 12.8, which the PR's testing reports as working, until a prebuilt wheel includes the fix.
