vllm - [Bug]: EAGLE-3 acceptance rate collapses to 0% with Kimi-K2.5 at max_model_len=262144 [1 comment, 1 participant]

vllm-project/vllm#37773, fetched 2026-04-08 01:12:58


Your current environment

  • vLLM v0.18.0 (release, not nightly)
  • 8x NVIDIA B200 (141 GB HBM each)
  • Model: moonshotai/Kimi-K2.5 (1T MoE, 32B active, compressed-tensors 4-bit)
  • CUDA 13.0, Driver 580.126.09

🐛 Describe the bug

EAGLE-3 speculative decoding acceptance rate progressively collapses to 0% during generation when --max-model-len 262144. This causes the model to produce repetitive/degenerate output until max_tokens is hit. The bug is 100% reproducible and occurs within seconds of starting generation — even on small prompts (~100 tokens).

Critical detail: The same model + EAGLE-3 head works perfectly at --max-model-len 32768, achieving 36.5% overall acceptance and +43% throughput improvement. The bug only manifests at 262K.

What we tested

We systematically tested every combination we could think of:

| Config | Draft Head | num_speculative_tokens | Draft max_model_len | Result |
|---|---|---|---|---|
| A | nvidia/Kimi-K2.5-Thinking-Eagle3 | 3 | inherited (262K) | ❌ 0% collapse |
| B | lightseekorg/kimi-k2.5-eagle3 | 3 | inherited (262K) | ❌ 0% collapse |
| C | lightseekorg/kimi-k2.5-eagle3 | 3 | 32768 | ❌ 0% collapse |
| D | lightseekorg/kimi-k2.5-eagle3 | 1 | 32768 | ❌ 0% collapse |
| E (benchmark) | nvidia/Kimi-K2.5-Thinking-Eagle3 | 3 | 32768 (target also 32K) | ✅ 36.5% acceptance, +43% throughput |

Configs A-D all fail identically. The only working config (E) had both target AND draft at 32K — which defeats the purpose of a 262K context model.

Acceptance rate collapse pattern

The collapse is progressive and happens mid-generation:

# Healthy start
SpecDecoding metrics: acceptance rate: 85.3%, Per-position: 0.853

# 10 seconds later - degrading
SpecDecoding metrics: acceptance rate: 31.9%, Per-position: 0.319

# 10 seconds later - fully collapsed
SpecDecoding metrics: acceptance rate: 0.0%, Accepted: 0, Drafted: 752

With num_speculative_tokens=3, the degenerate pattern locks to:

Per-position acceptance rate: 1.000, 0.000, 0.000

...indicating position 0 is auto-accepted but positions 1-2 are always rejected.

With num_speculative_tokens=1, it collapses to 0.000 entirely.
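
To catch this pattern automatically, the log lines can be parsed and checked for the locked state. The following is a minimal sketch that assumes the log format matches the excerpts quoted in this issue; real vLLM log lines may carry extra fields, and the helper names `parse_metrics` and `looks_collapsed` are hypothetical.

```python
import re

# Patterns mirror the log excerpts quoted in this issue. Real vLLM log lines
# may include additional fields, so treat the exact format as an assumption.
RATE_RE = re.compile(r"acceptance rate:\s*([0-9.]+)%")
PER_POSITION_RE = re.compile(
    r"Per-position(?: acceptance rate)?:\s*([0-9.]+(?:\s*,\s*[0-9.]+)*)"
)


def parse_metrics(line):
    """Return (overall_rate, per_position_rates); either may be None."""
    rate_match = RATE_RE.search(line)
    pos_match = PER_POSITION_RE.search(line)
    rate = float(rate_match.group(1)) / 100.0 if rate_match else None
    per_position = (
        [float(x) for x in pos_match.group(1).split(",")] if pos_match else None
    )
    return rate, per_position


def looks_collapsed(rate, per_position):
    """True when the overall rate is zero or every position after the first is zero."""
    if rate is not None and rate == 0.0:
        return True
    if per_position and len(per_position) > 1:
        return all(p == 0.0 for p in per_position[1:])
    return False


sample_lines = [
    "SpecDecoding metrics: acceptance rate: 85.3%, Per-position: 0.853",
    "SpecDecoding metrics: acceptance rate: 0.0%, Accepted: 0, Drafted: 752",
    "Per-position acceptance rate: 1.000, 0.000, 0.000",
]
for line in sample_lines:
    rate, per_position = parse_metrics(line)
    print(looks_collapsed(rate, per_position), rate, per_position)
```

Treating "first position accepted, all later positions zero" as collapsed is a choice based on the pattern reported above, not a vLLM-defined criterion.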

Two different EAGLE-3 heads, same bug

We tested both available heads:

  • nvidia/Kimi-K2.5-Thinking-Eagle3 (1.8B, DeepSeek MoE arch, YaRN scaling factor=64 from base 4096)
  • lightseekorg/kimi-k2.5-eagle3 (3B, Llama arch, native RoPE theta=1M, no YaRN)

Both exhibit identical collapse behavior. Since they have completely different architectures and RoPE strategies, the bug is most likely in vLLM's EAGLE-3 verification/coordination logic rather than in the draft models.

Likely root cause

We suspect this is related to #37435 (speculative/MTP draft config drops target hf-overrides). Kimi-K2.5 uses YaRN RoPE scaling (original_max_position_embeddings: 4096, max_position_embeddings: 262144). If the draft model's positional encoding doesn't match the target model's at runtime, even if the config files are correct, the verification step will reject all draft tokens once generation extends beyond certain position ranges.

This would also explain why it works at max_model_len=32768: at shorter contexts, the positional encoding discrepancy hasn't accumulated enough to cause divergence.
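
The intuition that a positional mismatch grows with position can be illustrated numerically. The sketch below computes the scaling factor implied by the two values quoted above, then compares rotary angles for an unscaled position against a linearly scaled one. Plain linear position interpolation is used here as a stand-in: YaRN treats frequency bands differently, and `ROPE_BASE` and `HEAD_DIM` are illustrative assumptions rather than values read from either model.

```python
ORIGINAL_MAX_POS = 4096   # original_max_position_embeddings (from the issue)
MAX_POS = 262144          # max_position_embeddings (from the issue)
ROPE_BASE = 10000.0       # assumption: illustrative base, not read from the model
HEAD_DIM = 64             # assumption: illustrative rotary dimension


def scaling_factor(original_max_pos, max_pos):
    """Context extension factor implied by the two position limits."""
    return max_pos / original_max_pos


def inv_freq(pair_index, head_dim=HEAD_DIM, base=ROPE_BASE):
    """Standard RoPE inverse frequency for one rotary dimension pair."""
    return base ** (-2.0 * pair_index / head_dim)


def angle_gap(position, pair_index, factor):
    """Angle difference (radians) between an unscaled and a linearly scaled position."""
    unscaled = position * inv_freq(pair_index)
    scaled = (position / factor) * inv_freq(pair_index)
    return unscaled - scaled


factor = scaling_factor(ORIGINAL_MAX_POS, MAX_POS)
print("scaling factor:", factor)
for pos in (100, 4096, 32768):
    print(pos, angle_gap(pos, pair_index=0, factor=factor))
```

Under this simplification the gap grows linearly with position, which is the sense in which a mismatch could "accumulate". It does not by itself explain why the collapse appears within seconds on prompts of about 100 tokens.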

vLLM launch command

vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 512 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --enable-chunked-prefill \
    --trust-remote-code \
    --enable-prefix-caching \
    --enable-expert-parallel \
    --compilation_config.pass_config.fuse_allreduce_rms true \
    --mm-encoder-tp-mode data \
    --speculative-config '{"method": "eagle3", "model": "lightseekorg/kimi-k2.5-eagle3", "num_speculative_tokens": 3, "max_model_len": 32768}'
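
When sweeping several draft configurations, hand-editing the JSON inside a shell-quoted string is error-prone. A minimal sketch of building the command from Python follows; it uses only flags that appear in the command above (most are omitted for brevity), and `build_serve_command` is a hypothetical helper, not part of vLLM.

```python
import json
import shlex

# Values copied from the launch command in this issue.
speculative_config = {
    "method": "eagle3",
    "model": "lightseekorg/kimi-k2.5-eagle3",
    "num_speculative_tokens": 3,
    "max_model_len": 32768,
}


def build_serve_command(target_model, target_max_len, spec_config):
    """Return an argv list; json.dumps handles the quoting of the nested config."""
    return [
        "vllm", "serve", target_model,
        "--tensor-parallel-size", "8",
        "--max-model-len", str(target_max_len),
        "--speculative-config", json.dumps(spec_config),
    ]


cmd = build_serve_command("moonshotai/Kimi-K2.5", 262144, speculative_config)
# shlex.join produces a shell-safe string that can be pasted into a terminal.
print(shlex.join(cmd))
```

Passing the argv list directly to `subprocess.run(cmd)` would avoid shell quoting altogether.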

Impact

This blocks production deployment of EAGLE-3 for Kimi-K2.5 at its full 262K context length. The +43% throughput gain is validated at 32K but unusable at production context lengths. We've had to disable speculative decoding entirely.

Related issues

  • #21269 — "Endless Generation near Context Window with Eagle3/Spec Dec" (same symptom, different model)
  • #37435 — "Speculative/MTP draft config drops target hf-overrides" (likely root cause)
  • #36872 — Progressive acceptance rate collapse with Qwen3.5 + MTP (same pattern)

Before submitting a new issue...

  • I have searched existing issues
  • I have verified this is reproducible on v0.18.0
  • I have tested multiple draft models and configurations

extent analysis

Fix Plan

To address the EAGLE-3 acceptance rate collapsing to 0% at --max-model-len 262144, the draft model's positional encoding needs to match the target model's at runtime. None of the steps below is a confirmed fix; they help narrow down where the mismatch occurs.

  1. Set the draft max_model_len in speculative-config to the target's value. Configs A and B, where the draft inherited 262K, already collapsed, so this alone is unlikely to resolve the issue:

--speculative-config '{"method": "eagle3", "model": "lightseekorg/kimi-k2.5-eagle3", "num_speculative_tokens": 3, "max_model_len": 262144}'

  2. Verify that the draft model's positional encoding is correctly configured. Check that original_max_position_embeddings and max_position_embeddings, as resolved at runtime, match what the model configuration declares; per #37435, target hf-overrides can be dropped when the draft config is built.

  3. Test with different num_speculative_tokens values. Config D already showed that reducing num_speculative_tokens to 1 still collapses to 0.000, so this is mainly useful for confirming behavior after other changes.
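
The positional-encoding check can be reduced to a dictionary diff between what a model's config declares and what is observed at runtime. This is a sketch under stated assumptions: the key names follow common Hugging Face config conventions, the `declared` values for the lightseekorg head come from the description in this issue, and the `observed` dictionary is a made-up example, since the issue does not show how to read the resolved values out of vLLM.

```python
ROPE_KEYS = ("max_position_embeddings", "rope_theta", "rope_scaling")


def diff_rope_settings(declared, observed, keys=ROPE_KEYS):
    """Return {key: (declared_value, observed_value)} for every key that differs."""
    diffs = {}
    for key in keys:
        declared_value = declared.get(key)
        observed_value = observed.get(key)
        if declared_value != observed_value:
            diffs[key] = (declared_value, observed_value)
    return diffs


# Declared: lightseekorg/kimi-k2.5-eagle3 as described in the issue
# (native RoPE theta=1M, no YaRN).
declared = {"rope_theta": 1_000_000, "rope_scaling": None}

# Observed: hypothetical runtime values, invented for illustration only.
observed = {
    "rope_theta": 1_000_000,
    "rope_scaling": {"type": "yarn", "factor": 64.0},
}

print(diff_rope_settings(declared, observed))
```

An empty result would mean the runtime values agree with the declared ones for the compared keys; any entry points at a setting worth inspecting.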

Verification

To verify a fix, monitor the acceptance rate during generation and check for signs of collapse. A healthy reading looks like this:

```text
# Healthy start
SpecDecoding metrics: acceptance rate: 85.3%, Per-position: 0.853
```

The acceptance rate should remain stable throughout generation rather than decaying toward 0%.

Extra Tips

  • Ensure that the model and draft configurations are correctly set up and match the target model's settings.
  • Monitor the acceptance rate and adjust the num_speculative_tokens value as needed to achieve a stable acceptance rate.
  • Refer to related issues (#21269, #37435, #36872) for additional information and potential workarounds.
