[Bug]: Kimi-K2.5 compressed-tensors on GB10 / SM 12.1 still binds to Marlin under --moe-backend triton and degenerates into repeated tokens

vllm, 2026-04-20 13:05:20

moonshotai/Kimi-K2.5 on a 16-node DGX Spark cluster with NVIDIA GB10 (SM 12.1) is not usable through public vLLM builds.

On vllm/vllm-openai:nightly, a low-memory launch profile is stable enough to reach:

- /health = 200
- /v1/models returns kimi-k2.5

But completions still collapse into degenerate token repetition:

The capital of France is -> foss foss foss ...

Additionally, an explicit --moe-backend triton request reaches the vllm serve command line, but runtime still resolves the model path to Marlin:

Using CompressedTensorsWNA16MarlinMoEMethod
Using Marlin backend for WNA16 MoE (group_size=32, num_bits=4)

This looks like either:

- the compressed-tensors Kimi path ignoring the requested MoE backend, or
- Marlin WNA16 MoE producing numerically incorrect results on GB10 / SM 12.1.

Root Cause

The shard integrity and scale checks in the report indicate that the degenerate output is not caused by broken weights or broken dequantization metadata. The input to the Marlin WNA16 MoE kernel looks correct, which points to the kernel path itself producing numerically incorrect output on GB10 / SM 12.1.

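Fix / Workaround

We patched our launcher so that --moe-backend triton is explicitly present in the final vllm serve command line. This did not change the outcome: runtime logs still showed the Marlin WNA16 MoE method being selected, so no working workaround has been found yet.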

Code Example

--max-model-len 32768
--gpu-memory-utilization 0.85
--enforce-eager
--max-num-batched-tokens 512

---

vllm serve /root/.cache/huggingface/hub/models--moonshotai--Kimi-K2.5/snapshots/54383e83fa343a1331754112fb9e3410c55efa2f \
  --served-model-name kimi-k2.5 \
  --tensor-parallel-size 16 \
  --distributed-executor-backend ray \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --enforce-eager \
  --max-num-batched-tokens 512 \
  --disable-custom-all-reduce \
  --host 0.0.0.0 --port 8001

---

curl -s http://127.0.0.1:8001/health
curl -s http://127.0.0.1:8001/v1/models
curl -s http://127.0.0.1:8001/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"kimi-k2.5",
    "prompt":"The capital of France is",
    "max_tokens":10,
    "temperature":0.0
  }'

---

foss foss foss foss foss foss foss foss foss foss

---

--moe-backend triton

---

Using CompressedTensorsWNA16MarlinMoEMethod
Using Marlin backend for WNA16 MoE (group_size=32, num_bits=4)

---

docker run --rm --gpus all vllm/vllm-openai:nightly \
  python3 -c "
import torch
d = torch.cuda.get_device_properties(0)
print(f'SM {d.major}.{d.minor}, arch_list={torch.cuda.get_arch_list()}')
"

Your current environment

vLLM issue draft: Kimi-K2.5 compressed-tensors on GB10 / SM 12.1 still binds to Marlin and returns degenerate token repetition

Hardware

16x DGX Spark
GPU: NVIDIA GB10
Compute capability: SM 12.1

Model

moonshotai/Kimi-K2.5
revision: 54383e83fa343a1331754112fb9e3410c55efa2f
compressed-tensors
pack-quantized
group_size=32
num_bits=4
type=int

Model integrity was verified across all 16 nodes:

- 64/64 shards present
- safetensors total size consistent with the index
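
A shard-presence check like the one described above can be scripted per node. The following is a minimal sketch, assuming the snapshot directory contains the usual Hugging Face `model.safetensors.index.json` with a `weight_map` entry; the helper name `check_shards` is hypothetical and is not part of vLLM.

```python
import json
from pathlib import Path


def check_shards(snapshot_dir, index_name="model.safetensors.index.json"):
    """Return (expected_shards, missing_shards) for one snapshot directory."""
    root = Path(snapshot_dir)
    index = json.loads((root / index_name).read_text())
    # weight_map maps tensor name -> shard file name; collect the unique shards.
    expected = sorted(set(index["weight_map"].values()))
    missing = [name for name in expected if not (root / name).is_file()]
    return expected, missing
```

Run on each node against the snapshot path, a healthy cache for this model would report 64 expected shards and an empty missing list. Comparing byte sizes against the index is a separate step that this sketch does not cover.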

We also checked a representative packed expert tensor and its BF16 scales inside the live container:

shard: model-00030-of-000064.safetensors
tensor: language_model.model.layers.29.mlp.experts.0.gate_proj.weight_packed
scale: language_model.model.layers.29.mlp.experts.0.gate_proj.weight_scale
observed:
- packed shape [2048, 896], dtype torch.int32
- packed min/max -2129572924 / 2147462455
- scale shape [2048, 224], dtype torch.bfloat16
- scale min/max 0.002411 / 0.014526
- scale mean/std 0.005768 / 0.001194
- no NaN, no Inf, not all-zero

So this does not look like a corrupted cache or obviously broken scale metadata.
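
The reported shapes are also internally consistent with the quantization config. As a small arithmetic sketch, assuming INT4 values are packed along the last dimension of the int32 tensor and that there is one scale per group per row (the function name is illustrative only):

```python
def expected_scale_cols(packed_cols, num_bits=4, group_size=32, word_bits=32):
    """Number of scale columns implied by one packed int32 weight row."""
    values_per_word = word_bits // num_bits        # 8 INT4 values per int32
    unpacked_cols = packed_cols * values_per_word  # logical weight columns
    if unpacked_cols % group_size:
        raise ValueError("row length is not a multiple of group_size")
    return unpacked_cols // group_size


# Shapes reported above: packed [2048, 896], scale [2048, 224].
assert expected_scale_cols(896) == 224
```

Under those assumptions, 896 packed columns correspond to 7168 logical columns, and 7168 / 32 gives the 224 scale columns that were observed.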

Dequantization metadata verified

The packed weights and BF16 scales for a representative MoE expert tensor were read directly from the snapshot inside the running vLLM container. Values are within the expected numerical range for INT4 pack-quantized compressed-tensors format:

- packed int32 range covers the full representation space, which is expected for bit-packed INT4 storage
- BF16 scales are in the [0.0024, 0.0145] range with a non-degenerate distribution
- no NaN, no Inf, not all-zero
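
To illustrate why extreme int32 values are unremarkable for packed weights, here is a hedged sketch that splits one 32-bit word into eight 4-bit fields. The lowest-bits-first order and the unsigned interpretation are assumptions made for illustration; the actual compressed-tensors layout and zero-point handling were not verified here.

```python
def unpack_int4(word, num_bits=4, word_bits=32):
    """Split one packed int32 into unsigned 4-bit fields, lowest bits first."""
    word &= (1 << word_bits) - 1  # view a negative int32 as its raw bit pattern
    mask = (1 << num_bits) - 1
    return [(word >> shift) & mask for shift in range(0, word_bits, num_bits)]


# The observed packed min/max are bit patterns, not meaningful magnitudes.
for value in (-2129572924, 2147462455):
    print(value, unpack_int4(value))
```

Each field stays in the 0 to 15 range regardless of how large or negative the containing int32 looks, so the packed min/max alone says little about weight quality.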

Images tested

Known-bad public builds

nvcr.io/nvidia/vllm:26.03-py3
- image id: sha256:4c5e61c590207edb771c294014193c719ca9eee330c0b51756f4b6a25951360d
- symptom: service runs but outputs are corrupted (foss foss foss ...)
vllm/vllm-openai:nightly
- image id: sha256:d78917343e4159618d8fd766d800809246a68b56cf132464b0eeec91a23bc5ca
- torch.cuda.get_arch_list():
  - ['sm_80', 'sm_90', 'sm_100', 'sm_120', 'compute_120']

For cluster rollout we built a derived image with Ray installed:

big3/vllm-openai:nightly-ray

Stable low-memory profile

The nightly image needed a reduced-memory profile to avoid tail-worker instability:

--max-model-len 32768
--gpu-memory-utilization 0.85
--enforce-eager
--max-num-batched-tokens 512

With that profile, /health and /v1/models work, but generation still collapses into degenerate token repetition.

Launch command shape

We launch with:

vllm serve /root/.cache/huggingface/hub/models--moonshotai--Kimi-K2.5/snapshots/54383e83fa343a1331754112fb9e3410c55efa2f \
  --served-model-name kimi-k2.5 \
  --tensor-parallel-size 16 \
  --distributed-executor-backend ray \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --enforce-eager \
  --max-num-batched-tokens 512 \
  --disable-custom-all-reduce \
  --host 0.0.0.0 --port 8001
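
A launcher can assemble this command programmatically, which makes it easier to confirm exactly which flags reach vllm serve. The sketch below mirrors the flags from this report; the function name `build_serve_command` and the placeholder model path are hypothetical, and the optional `--moe-backend` argument is simply appended when requested.

```python
import shlex


def build_serve_command(model_path, moe_backend=None):
    """Assemble the vllm serve argv from this report, optionally adding --moe-backend."""
    cmd = [
        "vllm", "serve", model_path,
        "--served-model-name", "kimi-k2.5",
        "--tensor-parallel-size", "16",
        "--distributed-executor-backend", "ray",
        "--trust-remote-code",
        "--max-model-len", "32768",
        "--gpu-memory-utilization", "0.85",
        "--enforce-eager",
        "--max-num-batched-tokens", "512",
        "--disable-custom-all-reduce",
        "--host", "0.0.0.0", "--port", "8001",
    ]
    if moe_backend is not None:
        cmd += ["--moe-backend", moe_backend]
    return cmd


# Print the final command line so it can be compared with the running process.
print(shlex.join(build_serve_command("/path/to/snapshot", moe_backend="triton")))
```

Building the argv as a list avoids shell quoting surprises, and printing the joined form gives a single line to check against the process table.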

Reproduction result

After startup:

curl -s http://127.0.0.1:8001/health
curl -s http://127.0.0.1:8001/v1/models
curl -s http://127.0.0.1:8001/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"kimi-k2.5",
    "prompt":"The capital of France is",
    "max_tokens":10,
    "temperature":0.0
  }'

Observed completion text:

foss foss foss foss foss foss foss foss foss foss
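
For automated smoke tests, this failure mode can be flagged with a simple heuristic. The sketch below is an assumption-laden check, not something vLLM provides: it treats a completion as degenerate when a single whitespace-separated token makes up most of the text, and the thresholds are arbitrary defaults.

```python
from collections import Counter


def is_degenerate(text, min_tokens=5, threshold=0.8):
    """Heuristic: True when one whitespace-separated token dominates the text."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        return False  # too short to judge
    _, top_count = Counter(tokens).most_common(1)[0]
    return top_count / len(tokens) >= threshold
```

Applied to the completion above, every token is identical, so the check returns True; a normal sentence with varied words returns False.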

Triton override attempt

We patched our launcher so that --moe-backend triton is explicitly present in the final vllm serve command line.

Confirmed running command line included:

--moe-backend triton

But runtime logs still showed:

Using CompressedTensorsWNA16MarlinMoEMethod
Using Marlin backend for WNA16 MoE (group_size=32, num_bits=4)

So the requested backend does not appear to take effect for this Kimi compressed-tensors path.
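
The mismatch between the requested flag and the logged backend can be detected mechanically. This is a minimal sketch that relies only on the Marlin log line quoted in this report; what other backends log is not known here, and treating `marlin` as a possible flag value is an assumption. Both helper names are hypothetical.

```python
import re

# Log line reported in this issue; log lines of other backends are not known here.
MARLIN_LINE = re.compile(r"Using Marlin backend for WNA16 MoE")


def requested_moe_backend(argv):
    """Return the value given to --moe-backend (either spelling), else None."""
    for i, arg in enumerate(argv):
        if arg == "--moe-backend" and i + 1 < len(argv):
            return argv[i + 1]
        if arg.startswith("--moe-backend="):
            return arg.split("=", 1)[1]
    return None


def backend_mismatch(argv, log_text):
    """True when a non-Marlin backend was requested but the Marlin line was logged."""
    requested = requested_moe_backend(argv)
    if requested is None or requested.lower() == "marlin":  # assumed flag value
        return False
    return bool(MARLIN_LINE.search(log_text))
```

With the command line and logs from this report, `backend_mismatch` returns True, which is the condition a launcher could turn into a hard failure instead of serving incorrect output.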

Quick reproduction check

Anyone with access to GB10 or other SM 12.x hardware can first confirm the runtime target with:

docker run --rm --gpus all vllm/vllm-openai:nightly \
  python3 -c "
import torch
d = torch.cuda.get_device_properties(0)
print(f'SM {d.major}.{d.minor}, arch_list={torch.cuda.get_arch_list()}')
"

Expected on this platform:

- SM 12.1 on GB10
- arch list including sm_120
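
The relationship between the device capability and the arch list can be made explicit. The sketch below is an illustration under CUDA's usual compatibility rules (binaries built for the same major version and a lower-or-equal minor version run on the device, and PTX can be JIT-compiled for newer targets); the function name and category labels are hypothetical, and the result says nothing about whether a given kernel is numerically correct.

```python
import re


def classify_arch_support(major, minor, arch_list):
    """Classify how a device (major.minor) is covered by a torch arch list."""
    target = major * 10 + minor
    sm, ptx = [], []
    for entry in arch_list:
        m = re.fullmatch(r"(sm|compute)_(\d+)", entry)
        if m:  # suffixed entries such as 'sm_90a' are skipped in this sketch
            (sm if m.group(1) == "sm" else ptx).append(int(m.group(2)))
    if target in sm:
        return "exact"
    # Same major, lower-or-equal minor: binary compatible within the major version.
    if any(a // 10 == major and a % 10 <= minor for a in sm):
        return "same-major binary"
    if any(p <= target for p in ptx):
        return "ptx-jit"
    return "unsupported"
```

For SM 12.1 and the nightly arch list shown earlier, this returns "same-major binary": the image carries no sm_121-specific build, so GB10 would run code compiled for sm_120.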

Expected behavior

--moe-backend triton should either:
- switch the Kimi compressed-tensors MoE path away from Marlin, or
- fail clearly if that model path cannot use Triton.
The default runtime should not collapse into degenerate token repetition on GB10 / SM 12.1.

Actual behavior

- Runtime becomes healthy enough to serve /health and /v1/models
- Generation still collapses into degenerate token repetition
- Requested --moe-backend triton still resolves to Marlin at runtime

Request

Please clarify whether moonshotai/Kimi-K2.5 compressed-tensors on GB10 / SM 12.1 is expected to support:

- correct generation on Marlin WNA16 MoE
- explicit fallback to Triton via --moe-backend triton

At the moment this looks like a correctness bug on GB10 / SM 12.1 rather than a pure startup bug.

extent analysis

TL;DR

On GB10 / SM 12.1, the moonshotai/Kimi-K2.5 compressed-tensors model likely does not generate correctly on the Marlin WNA16 MoE path, and --moe-backend triton does not switch the model away from that path. The result is degenerate token repetition.

Guidance

Verify that the --moe-backend triton flag is present in the final vllm serve command line, then check the runtime logs for the selected MoE method (in this report, the logs still showed "Using Marlin backend for WNA16 MoE").
Check the model integrity and dequantization metadata to ensure that they are correct and not causing the issue.
Test the model on a different hardware platform or with a different MoE backend to isolate the issue.
Consider updating to a newer version of the vllm image or moonshotai/Kimi-K2.5 model to see if the issue is resolved.

Example

No code example is provided as the issue is related to a specific model and hardware configuration.

Notes

The issue seems to be specific to the moonshotai/Kimi-K2.5 model on GB10 / SM 12.1 hardware, and it is unclear if this is a bug or a limitation of the model or hardware. Further investigation is needed to determine the root cause of the issue.

Recommendation

No confirmed workaround exists yet. To isolate the issue, test the model on a different hardware platform or with a different MoE backend, and try a newer vllm image or moonshotai/Kimi-K2.5 revision to see whether the issue is resolved.
