vllm - 💡(How to fix) Fix [Bug]: EAGLE-3 acceptance rate collapses to 0% with Kimi-K2.5 at max_model

ccgibson · 2026-03-21T21:50:08Z

[vllm] Your current environment - vLLM v0.18.0 release, not nightly - 8x NVIDIA B200 141 GB HBM each - Model: moonshotai/Kimi-K2.5 1T MoE, 32B active, compress… ### Your current environment - vLLM v0.18.0 (release, not nightly) - 8x NVIDIA B200 (141 GB HBM each) - Model: `moonshotai/Kimi-K2.5` (1T MoE, 32B active, compressed-tensors 4-bit) - CUDA 13.0, Driver 580.126.09 ### 🐛 Describe the bug EAGLE-3 speculative decoding acceptance rate progressively collapses to 0% during generation when `--max-model-len 262144`. This causes the model to produce repetitive/degenerate output until `max_tokens` is hit. The bug is **100% reproducible** and occurs within seconds of starting generation — even on small prompts (~100 tokens). **Critical detail**: The same model + EAGLE-3 head works perfectly at `--max-model-len 32768`, achieving 36.5% overall acceptance and +43% throughput improvement. The bug only manifests at 262K. ### What we tested We systematically tested every combination we could think of: | Config | Draft Head | `num_speculative_tokens` | Draft `max_model_len` | Result | |--------|-----------|-------------------------|----------------------|--------| | A | `nvidia/Kimi-K2.5-Thinking-Eagle3` | 3 | inherited (262K) | ❌ 0% collapse | | B | `lightseekorg/kimi-k2.5-eagle3` | 3 | inherited (262K) | ❌ 0% collapse | | C | `lightseekorg/kimi-k2.5-eagle3` | 3 | 32768 | ❌ 0% collapse | | D | `lightseekorg/kimi-k2.5-eagle3` | 1 | 32768 | ❌ 0% collapse | | E (benchmark) | `nvidia/Kimi-K2.5-Thinking-Eagle3` | 3 | 32768 (target also 32K) | ✅ 36.5% acceptance, +43% throughput | **Configs A-D all fail identically.** The only working config (E) had both target AND draft at 32K — which defeats the purpose of a 262K context model. ### Acceptance rate collapse pattern The collapse is progressive and happens mid-generation: ``` # Healthy start SpecDecoding metrics: acceptance rate: 85.3%, Per-position: 0.853 # 10 seconds later - degrading SpecDecoding metrics: acceptance rate: 31.9%, Per-position: 0.319 # 10 seconds later - fully collapsed SpecDecoding metrics: acceptance rate: 0.0%, Accepted: 0, Drafted: 752 ``` With `num_speculative_tokens=3`, the degenerate pattern locks to: ``` Per-position acceptance rate: 1.000, 0.000, 0.000 ``` ...indicating position 0 is auto-accepted but positions 1-2 are always rejected. With `num_speculative_tokens=1`, it collapses to `0.000` entirely. ### Two different EAGLE-3 heads, same bug We tested both available heads: - **`nvidia/Kimi-K2.5-Thinking-Eagle3`** (1.8B, DeepSeek MoE arch, YaRN scaling factor=64 from base 4096) - **`lightseekorg/kimi-k2.5-eagle3`** (3B, Llama arch, native RoPE theta=1M, no YaRN) Both exhibit identical collapse behavior. Since they have completely different architectures and RoPE strategies, the bug is **in vLLM's EAGLE-3 verification/coordination logic**, not in the draft models. ### Likely root cause We suspect this is related to #37435 (speculative/MTP draft config drops target hf-overrides). Kimi-K2.5 uses YaRN RoPE scaling (`original_max_position_embeddings: 4096` → `max_position_embeddings: 262144`). If the draft model's positional encoding doesn't match the target model's at runtime — even if the config files are correct — the verification step will reject all draft tokens once generation extends beyond certain position ranges. This would also explain why it works at `max_model_len=32768`: at shorter contexts, the positional encoding discrepancy hasn't accumulated enough to cause divergence. ### vLLM launch command ```bash vllm serve moonshotai/Kimi-K2.5 \ --tensor-parallel-size 8 \ --max-model-len 262144 \ --max-num-batched-tokens 32768 \ --max-num-seqs 512 \ --dtype bfloat16 \ --gpu-memory-utilization 0.9 \ --enable-chunked-prefill \ --trust-remote-code \ --enable-prefix-caching \ --enable-expert-parallel \ --compilation_config.pass_config.fuse_allreduce_rms true \ --mm-encoder-tp-mode data \ --speculative-config '{"method": "eagle3", "model": "lightseekorg/kimi-k2.5-eagle3", "num_speculative_tokens": 3, "max_model_len": 32768}' ``` ### Impact This blocks production deployment of EAGLE-3 for Kimi-K2.5 at its full 262K context length. The +43% throughput gain is validated at 32K but unusable at production context lengths. We've had to disable speculative decoding entirely. ### Related issues - #21269 — "Endless Generation near Context Window with Eagle3/Spec Dec" (same symptom, different model) - #37435 — "Speculative/MTP draft config drops target hf-overrides" (likely root cause) - #36872 — Progressive acceptance rate collapse with Qwen3.5 + MTP (same pattern) ### Before submitting a new issue... - [x] I have searched existing issues - [x] I have verified this is reproducible on v0.18.0 - [x] I have tested multiple draft models and configurations

Code Example

# Healthy start
SpecDecoding metrics: acceptance rate: 85.3%, Per-position: 0.853

# 10 seconds later - degrading
SpecDecoding metrics: acceptance rate: 31.9%, Per-position: 0.319

# 10 seconds later - fully collapsed
SpecDecoding metrics: acceptance rate: 0.0%, Accepted: 0, Drafted: 752

---

Per-position acceptance rate: 1.000, 0.000, 0.000

---

vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 512 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --enable-chunked-prefill \
    --trust-remote-code \
    --enable-prefix-caching \
    --enable-expert-parallel \
    --compilation_config.pass_config.fuse_allreduce_rms true \
    --mm-encoder-tp-mode data \
    --speculative-config '{"method": "eagle3", "model": "lightseekorg/kimi-k2.5-eagle3", "num_speculative_tokens": 3, "max_model_len": 32768}'

Your current environment

vLLM v0.18.0 (release, not nightly)
8x NVIDIA B200 (141 GB HBM each)
Model: moonshotai/Kimi-K2.5 (1T MoE, 32B active, compressed-tensors 4-bit)
CUDA 13.0, Driver 580.126.09

🐛 Describe the bug

EAGLE-3 speculative decoding acceptance rate progressively collapses to 0% during generation when --max-model-len 262144. This causes the model to produce repetitive/degenerate output until max_tokens is hit. The bug is 100% reproducible and occurs within seconds of starting generation — even on small prompts (~100 tokens).

Critical detail: The same model + EAGLE-3 head works perfectly at --max-model-len 32768, achieving 36.5% overall acceptance and +43% throughput improvement. The bug only manifests at 262K.

What we tested

We systematically tested every combination we could think of:

Config	Draft Head	`num_speculative_tokens`	Draft `max_model_len`	Result
A	`nvidia/Kimi-K2.5-Thinking-Eagle3`	3	inherited (262K)	❌ 0% collapse
B	`lightseekorg/kimi-k2.5-eagle3`	3	inherited (262K)	❌ 0% collapse
C	`lightseekorg/kimi-k2.5-eagle3`	3	32768	❌ 0% collapse
D	`lightseekorg/kimi-k2.5-eagle3`	1	32768	❌ 0% collapse
E (benchmark)	`nvidia/Kimi-K2.5-Thinking-Eagle3`	3	32768 (target also 32K)	✅ 36.5% acceptance, +43% throughput

Configs A-D all fail identically. The only working config (E) had both target AND draft at 32K — which defeats the purpose of a 262K context model.

Acceptance rate collapse pattern

The collapse is progressive and happens mid-generation:

# Healthy start
SpecDecoding metrics: acceptance rate: 85.3%, Per-position: 0.853

# 10 seconds later - degrading
SpecDecoding metrics: acceptance rate: 31.9%, Per-position: 0.319

# 10 seconds later - fully collapsed
SpecDecoding metrics: acceptance rate: 0.0%, Accepted: 0, Drafted: 752

With num_speculative_tokens=3, the degenerate pattern locks to:

Per-position acceptance rate: 1.000, 0.000, 0.000

...indicating position 0 is auto-accepted but positions 1-2 are always rejected.

With num_speculative_tokens=1, it collapses to 0.000 entirely.

Two different EAGLE-3 heads, same bug

We tested both available heads:

nvidia/Kimi-K2.5-Thinking-Eagle3 (1.8B, DeepSeek MoE arch, YaRN scaling factor=64 from base 4096)
lightseekorg/kimi-k2.5-eagle3 (3B, Llama arch, native RoPE theta=1M, no YaRN)

Both exhibit identical collapse behavior. Since they have completely different architectures and RoPE strategies, the bug is in vLLM's EAGLE-3 verification/coordination logic, not in the draft models.

Likely root cause

We suspect this is related to #37435 (speculative/MTP draft config drops target hf-overrides). Kimi-K2.5 uses YaRN RoPE scaling (original_max_position_embeddings: 4096 → max_position_embeddings: 262144). If the draft model's positional encoding doesn't match the target model's at runtime — even if the config files are correct — the verification step will reject all draft tokens once generation extends beyond certain position ranges.

This would also explain why it works at max_model_len=32768: at shorter contexts, the positional encoding discrepancy hasn't accumulated enough to cause divergence.

vLLM launch command

vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 512 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --enable-chunked-prefill \
    --trust-remote-code \
    --enable-prefix-caching \
    --enable-expert-parallel \
    --compilation_config.pass_config.fuse_allreduce_rms true \
    --mm-encoder-tp-mode data \
    --speculative-config '{"method": "eagle3", "model": "lightseekorg/kimi-k2.5-eagle3", "num_speculative_tokens": 3, "max_model_len": 32768}'

Impact

This blocks production deployment of EAGLE-3 for Kimi-K2.5 at its full 262K context length. The +43% throughput gain is validated at 32K but unusable at production context lengths. We've had to disable speculative decoding entirely.

Related issues

#21269 — "Endless Generation near Context Window with Eagle3/Spec Dec" (same symptom, different model)
#37435 — "Speculative/MTP draft config drops target hf-overrides" (likely root cause)
#36872 — Progressive acceptance rate collapse with Qwen3.5 + MTP (same pattern)

Before submitting a new issue...

I have searched existing issues
I have verified this is reproducible on v0.18.0
I have tested multiple draft models and configurations

extent analysis

Fix Plan

To address the issue of the EAGLE-3 speculative decoding acceptance rate collapsing to 0% when --max-model-len 262144, we need to ensure that the positional encoding of the draft model matches the target model's at runtime.

Update the speculative-config to match the target model's max_model_len:

--speculative-config '{"method": "eagle3", "model": "lightseekorg/kimi-k2.5-eagle3", "num_speculative_tokens": 3, "max_model_len": 262144}'

2. **Verify that the draft model's positional encoding is correctly configured**:
   Ensure that the `original_max_position_embeddings` and `max_position_embeddings` are correctly set in the model configuration to match the target model's settings.
3. **Test with different `num_speculative_tokens` values**:
   Try reducing the `num_speculative_tokens` value to 1 and test if the issue persists.

### Verification
To verify that the fix worked, monitor the acceptance rate during generation and check for any signs of collapse:
```bash
# Healthy start
SpecDecoding metrics: acceptance rate: 85.3%, Per-position: 0.853

The acceptance rate should remain stable throughout the generation process.

Extra Tips

Ensure that the model and draft configurations are correctly set up and match the target model's settings.
Monitor the acceptance rate and adjust the num_speculative_tokens value as needed to achieve a stable acceptance rate.
Refer to related issues (#21269, #37435, #36872) for additional information and potential workarounds.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: EAGLE-3 acceptance rate collapses to 0% with Kimi-K2.5 at max_model_len=262144 [1 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Likely root cause

Code Example

Your current environment

🐛 Describe the bug

What we tested

Acceptance rate collapse pattern

Two different EAGLE-3 heads, same bug

Likely root cause

vLLM launch command

Impact

Related issues

Before submitting a new issue...

extent analysis

Fix Plan

Extra Tips

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: EAGLE-3 acceptance rate collapses to 0% with Kimi-K2.5 at max_model_len=262144 [1 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Likely root cause

Code Example

Your current environment

🐛 Describe the bug

What we tested

Acceptance rate collapse pattern

Two different EAGLE-3 heads, same bug

Likely root cause

vLLM launch command

Impact

Related issues

Before submitting a new issue...

extent analysis

Fix Plan

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING