vllm - ✅(Solved) Fix [Kimi] Track Kimi K2.5/K2.6 MLA + EAGLE serving on Blackwell (DCP4/DCP8, FP8 KV, draft backend split) [6 pull requests, 1 comments, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#40608Fetched 2026-04-23 07:23:57
View on GitHub
Comments
1
Participants
1
Timeline
9
Reactions
0
Participants
Timeline (top)
cross-referenced ×6subscribed ×2commented ×1

Root Cause

Current K2.6 recipe needs target/draft backend split because the draft is non-MLA:

Fix Action

Fix / Workaround

Getting Kimi running well on Blackwell required a mix of:

  • MLA + DCP + FP8 KV correctness fixes
  • speculative decode runtime fixes
  • draft/target backend decoupling
  • model-specific speculative decode improvements
  • narrow local runtime workarounds and tuning knobs

Notes:

  • DCP=8 also works and is benchmarked below.
  • the VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN threshold is currently a local runtime policy, not an upstream default.
  • this recipe currently still performs best with an external NCCL graph XML; a no-XML patched-NCCL path is functional but not yet performance-equivalent.

We also tested a patched NCCL 2.29.7 path without NCCL_GRAPH_FILE.

PR fix notes

PR #40609: [Core] Enable FP8 KV cache with DCP for MLA

Description (problem / solution / changelog)

Summary

  • enable MLA decode-context-parallel prefill with FP8 KV cache
  • gather and dequantize KV cache correctly for the DCP prefill path
  • add distributed coverage for kv_cache_dtype=fp8

Context

This PR supersedes the dormant work in #34795 and is tracked from #40608.

Scope

  • vllm/model_executor/layers/attention/mla_attention.py
  • tests/distributed/test_context_parallel.py

Why

Kimi-style MLA serving on Blackwell relies on TRITON_MLA + DCP + FP8 KV, and the DCP prefill path needs explicit FP8 cache handling to work correctly.

Validation

Validated locally on Blackwell/SM120 with Kimi MLA workloads and added test coverage for DCP + FP8 KV.

Changed files

  • tests/distributed/test_context_parallel.py (modified, +10/-1)
  • vllm/model_executor/layers/attention/mla_attention.py (modified, +92/-19)

PR #40610: [SpecDecode] Fix async proposer synchronization

Description (problem / solution / changelog)

Summary

  • move prepare_inputs_event.record() so speculative proposer GPU work is included in the synchronization window
  • prevent the next async batch from reusing persistent state and block-table-related metadata while proposer work from the previous batch is still in flight

Context

Tracked from #40608.

Root cause

This came from debugging async scheduling + speculative decode ordering, not from a single clean deterministic stacktrace.

The key issue was the lifetime of prepare_inputs_event versus the actual proposer GPU work:

  • the event was being recorded from the input-preparation path, which made the step look "finished" too early
  • proposer-side GPU work for speculative decode runs later from sample_tokens()
  • that ordering allowed the next batch to enter execute_model() / _update_states() and start mutating persistent batch state while the proposer from the previous batch was still reading it on the GPU

In practice this is a concurrency-sensitive race: the failure mode is timing-dependent and can show up as nondeterministic instability / stale state usage rather than one obvious reproducible exception.

The shared state at risk here includes the persistent structures used across batches in the async path, especially block-table-related metadata and other proposer inputs derived from the batch state.

Why this fix

This patch makes prepare_inputs_event represent the full GPU lifetime of the step, not just the input-preparation portion of it.

Concretely, it re-records the event after sample_tokens() finishes, which is the point where speculative proposer GPU work has also completed. That closes the race where the next batch could otherwise observe and mutate shared state too early.

I kept the change intentionally small and local to gpu_model_runner.py because the regression surface is concurrency-sensitive, and this is the least invasive way to restore the intended happens-before relationship.

Scope

  • vllm/v1/worker/gpu_model_runner.py

Notes

This is kept as a draft PR because the code change is small but the regression surface is concurrency-sensitive.

Changed files

  • vllm/v1/worker/gpu_model_runner.py (modified, +16/-0)

PR #40611: [SpecDecode] Allow draft-specific attention backend and KV dtype

Description (problem / solution / changelog)

Summary

  • allow speculative draft models to use a different attention backend than the target model
  • allow speculative draft models to use a different KV-cache dtype than the target model
  • plumb the per-draft overrides through EAGLE setup

Context

Tracked from #40608.

Why

Kimi serving needs target/draft backend decoupling:

  • the target model runs best with TRITON_MLA
  • some draft models are non-MLA and need a different backend or KV-cache dtype

Without this split, otherwise valid target/draft pairings are forced onto a single backend/dtype choice.

Scope

  • vllm/config/speculative.py
  • vllm/v1/spec_decode/eagle.py
  • vllm/v1/attention/backends/flash_attn.py
  • vllm/v1/worker/cp_utils.py

Notes

Kept as draft for discussion because it broadens speculative-config surface area.

Changed files

  • vllm/config/speculative.py (modified, +18/-1)
  • vllm/v1/attention/backend.py (modified, +27/-0)
  • vllm/v1/attention/backends/flash_attn.py (modified, +3/-2)
  • vllm/v1/attention/backends/flashinfer.py (modified, +3/-2)
  • vllm/v1/spec_decode/eagle.py (modified, +33/-4)
  • vllm/v1/worker/cp_utils.py (modified, +4/-2)

PR #40612: [SpecDecode] Add local argmax helper for Llama Eagle3 drafts

Description (problem / solution / changelog)

Summary

  • add get_top_tokens() support for LlamaForCausalLMEagle3
  • enable local argmax reduction for Llama-based EAGLE3 drafts when vocab mapping is identity

Context

Tracked from #40608.

Why

Large-vocab speculative draft models benefit from reducing TP communication with local argmax reduction. This brings the Llama-based EAGLE3 path in line with the same optimization already being discussed for other draft families.

Scope

  • vllm/model_executor/models/llama_eagle3.py

Related

  • complements existing upstream work in #39419 rather than replacing it

Changed files

  • vllm/model_executor/models/llama_eagle3.py (modified, +19/-0)

PR #40613: [SpecDecode] Add seq-length gate for speculative decode

Description (problem / solution / changelog)

Summary

  • add an opt-in seq-length gate that disables speculative decode above a configured threshold
  • make the gate effective both in the scheduler path and in the worker-side drafter fit check

Context

Tracked from #40608.

Why

For Kimi MLA long-context workloads, speculative decode can become slower than target-only decoding even when acceptance is healthy. An explicit seq-length gate is a pragmatic way to keep short-context wins without forcing long-context regressions.

Scope

  • vllm/v1/core/sched/async_scheduler.py
  • vllm/v1/worker/gpu_model_runner.py

Notes

This is intentionally a draft/discussion PR. The current implementation is an environment-based runtime knob used in local deployment recipes, not a proposed final public API shape.

Changed files

  • vllm/v1/core/sched/async_scheduler.py (modified, +10/-1)
  • vllm/v1/spec_decode/utils.py (modified, +22/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +38/-1)

PR #40614: [Attention] Tune TRITON_MLA for SM120 + FP8 decode

Description (problem / solution / changelog)

Summary

  • cap num_kv_splits for B=1, SM120, FP8 KV MLA decode
  • reduce BLOCK_H for the same narrow configuration

Context

Tracked from #40608.

Why

For Blackwell/SM120 MLA decode with FP8 KV cache, the default split heuristic can overshoot and increase split/merge overhead at large local sequence lengths. A narrower BLOCK_H setting is also useful for the same configuration.

Scope

  • vllm/v1/attention/backends/mla/triton_mla.py
  • vllm/v1/attention/ops/triton_decode_attention.py

Notes

This is intentionally a draft/discussion PR. The tuning is narrow (B=1, SM120, FP8 KV) and should be evaluated on its own merits rather than mixed with correctness changes.

Changed files

  • vllm/v1/attention/backends/mla/triton_mla.py (modified, +13/-0)
  • vllm/v1/attention/ops/triton_decode_attention.py (modified, +9/-0)

Code Example

VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN=7000 \
VLLM_ENABLE_PCIE_ALLREDUCE=1 \
NCCL_P2P_LEVEL=SYS \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
NCCL_GRAPH_FILE=/path/to/nccl_graph_opt.xml \
VLLM_LOG_STATS_INTERVAL=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
VLLM_MARLIN_INPUT_DTYPE=fp8 \
python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2.5 \
  --served-model-name Kimi-K2.5 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 5000 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --async-scheduling \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 128 \
  --mm-processor-cache-gb 0 \
  --mm-encoder-tp-mode weights \
  --attention-backend TRITON_MLA \
  --kv-cache-dtype fp8 \
  --decode-context-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --speculative-config '{"model":"lightseekorg/kimi-k2.5-eagle3-mla","method":"eagle3","num_speculative_tokens":3,"draft_attention_backend":"TRITON_MLA","draft_kv_cache_dtype":"fp8","rejection_sample_method":"probabilistic"}'

---

VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN=16384 \
VLLM_ENABLE_PCIE_ALLREDUCE=1 \
NCCL_P2P_LEVEL=SYS \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
NCCL_GRAPH_FILE=/path/to/nccl_graph_opt.xml \
VLLM_LOG_STATS_INTERVAL=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
VLLM_MARLIN_INPUT_DTYPE=fp8 \
python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2.6 \
  --served-model-name Kimi-K2.6 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 5000 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --async-scheduling \
  --gpu-memory-utilization 0.94 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 128 \
  --mm-processor-cache-gb 0 \
  --mm-encoder-tp-mode weights \
  --language-model-only \
  --attention-backend TRITON_MLA \
  --kv-cache-dtype fp8 \
  --decode-context-parallel-size 8 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3","method":"eagle3","num_speculative_tokens":3,"draft_attention_backend":"FLASH_ATTN","draft_kv_cache_dtype":"bfloat16","use_local_argmax_reduction":true,"rejection_sample_method":"probabilistic"}'
RAW_BUFFERClick to expand / collapse

This issue tracks the current Blackwell Kimi serving stack we validated locally and the split PRs needed to upstream the useful pieces.

Scope

Target workloads:

  • moonshotai/Kimi-K2.5
  • moonshotai/Kimi-K2.6

Target hardware/runtime assumptions:

  • Blackwell / SM120
  • TRITON_MLA
  • FP8 KV cache
  • tensor parallel size 8
  • decode context parallel size 4 or 8
  • speculative decode with EAGLE-style drafters where it is performance-positive

Why this issue exists

Getting Kimi running well on Blackwell required a mix of:

  • MLA + DCP + FP8 KV correctness fixes
  • speculative decode runtime fixes
  • draft/target backend decoupling
  • model-specific speculative decode improvements
  • narrow local runtime workarounds and tuning knobs

This issue is meant to:

  1. document the current working launch recipes,
  2. record the DCP4/DCP8 benchmark state,
  3. link the split PRs,
  4. separate upstream-worthy fixes from local-only runtime knobs.

Related upstream context

Existing upstream work:

  • #34795 [Core] Enable FP8 KV cache with Decode Context Parallel (DCP) for MLA
  • #39419 [SpecDecode] Reduce TP communication for large-vocab draft models in DFlash/PARD speculative decoding
  • #39930 [Attention] Fix attention backend selection with DFlash
  • #39995 dflash with flashinfer

Working recipes

Kimi-K2.5 + K2.5 Eagle3 MLA draft

This is the strongest current path for Kimi speculative decode on Blackwell.

VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN=7000 \
VLLM_ENABLE_PCIE_ALLREDUCE=1 \
NCCL_P2P_LEVEL=SYS \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
NCCL_GRAPH_FILE=/path/to/nccl_graph_opt.xml \
VLLM_LOG_STATS_INTERVAL=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
VLLM_MARLIN_INPUT_DTYPE=fp8 \
python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2.5 \
  --served-model-name Kimi-K2.5 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 5000 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --async-scheduling \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 128 \
  --mm-processor-cache-gb 0 \
  --mm-encoder-tp-mode weights \
  --attention-backend TRITON_MLA \
  --kv-cache-dtype fp8 \
  --decode-context-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --speculative-config '{"model":"lightseekorg/kimi-k2.5-eagle3-mla","method":"eagle3","num_speculative_tokens":3,"draft_attention_backend":"TRITON_MLA","draft_kv_cache_dtype":"fp8","rejection_sample_method":"probabilistic"}'

Notes:

  • DCP=8 also works and is benchmarked below.
  • the VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN threshold is currently a local runtime policy, not an upstream default.
  • this recipe currently still performs best with an external NCCL graph XML; a no-XML patched-NCCL path is functional but not yet performance-equivalent.

Kimi-K2.6 + K2.6 Eagle3 draft

Current K2.6 recipe needs target/draft backend split because the draft is non-MLA:

VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN=16384 \
VLLM_ENABLE_PCIE_ALLREDUCE=1 \
NCCL_P2P_LEVEL=SYS \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
NCCL_GRAPH_FILE=/path/to/nccl_graph_opt.xml \
VLLM_LOG_STATS_INTERVAL=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
VLLM_MARLIN_INPUT_DTYPE=fp8 \
python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2.6 \
  --served-model-name Kimi-K2.6 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 5000 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --async-scheduling \
  --gpu-memory-utilization 0.94 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 128 \
  --mm-processor-cache-gb 0 \
  --mm-encoder-tp-mode weights \
  --language-model-only \
  --attention-backend TRITON_MLA \
  --kv-cache-dtype fp8 \
  --decode-context-parallel-size 8 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3","method":"eagle3","num_speculative_tokens":3,"draft_attention_backend":"FLASH_ATTN","draft_kv_cache_dtype":"bfloat16","use_local_argmax_reduction":true,"rejection_sample_method":"probabilistic"}'

Notes:

  • long-context K2.6 speculative decode still needs the threshold gate; otherwise speculation is often slower than baseline.
  • the K2.6 draft path is not as mature as the K2.5 MLA draft path.

Benchmarks

Kimi-K2.5 + lightseekorg/kimi-k2.5-eagle3-mla

Using llm_decode_bench.py against the XML-based best-known runtime recipe.

DCP=4

Selected decode throughput (tok/s):

  • ctx=0, C=1: 85.5
  • ctx=16k, C=1: 52.7
  • ctx=32k, C=1: 52.7
  • ctx=64k, C=1: 52.7
  • ctx=0, C=64: 929.6
  • ctx=16k, C=64: 826.7
  • ctx=32k, C=64: 698.8
  • ctx=64k, C=64: 508.4

Selected prefill throughput:

  • 8k: 7885 tok/s
  • 16k: 7998 tok/s
  • 32k: 7693 tok/s
  • 64k: 6999 tok/s
  • 128k: 6108 tok/s

DCP=8

Selected decode throughput (tok/s):

  • ctx=0, C=1: 78.6
  • ctx=16k, C=1: 47.7
  • ctx=32k, C=1: 49.7
  • ctx=64k, C=1: 49.7
  • ctx=0, C=64: 924.6
  • ctx=16k, C=64: 763.5
  • ctx=32k, C=64: 635.7
  • ctx=64k, C=64: 445.2

Selected prefill throughput:

  • 8k: 7907 tok/s
  • 16k: 8015 tok/s
  • 32k: 7695 tok/s
  • 64k: 6991 tok/s
  • 128k: 6096 tok/s

Interpretation

  • For the validated XML-based setup, DCP=4 currently wins over DCP=8 on decode throughput for Kimi-K2.5 + MLA draft.
  • Prefill throughput is very similar between DCP=4 and DCP=8.
  • Long-context speculative decode is only beneficial below a tuned threshold; above it, target-only decode is faster.

No-XML status

We also tested a patched NCCL 2.29.7 path without NCCL_GRAPH_FILE.

Current result:

  • end-to-end works without XML
  • decode throughput at long context is roughly comparable or slightly better in some cases
  • short-context decode is worse
  • prefill is significantly worse

Conclusion for now:

  • the no-XML path is promising but not yet a drop-in replacement for the current best-known XML-based runtime recipe.

Proposed split PRs

These are the changes we currently believe should be split into separate PRs.

  1. MLA + DCP + FP8 KV correctness/support
  2. Async speculative proposer synchronization fix
  3. Draft-specific attention backend + draft KV dtype
  4. Llama Eagle3 local argmax helper
  5. Optional seq-length gate for speculative decode (discussion draft)
  6. Narrow SM120 FP8 Triton MLA tuning (discussion draft)

Qwen3/Qwen3-DFlash local argmax is already covered by #39419.

Request for feedback

I would like feedback on:

  • whether #34795 should be superseded by a rebased continuation PR if the original author remains inactive,
  • whether the seq-length speculation gate should be represented as a config/CLI option rather than an env-only knob,
  • whether the narrow SM120 FP8 Triton MLA tuning should be considered separately from correctness fixes,
  • and whether the Kimi K2.5/K2.6 recipes should live in documentation once the split PRs land.

extent analysis

TL;DR

To improve the performance of Kimi on Blackwell, apply the validated launch recipes and consider splitting the proposed changes into separate PRs for easier maintenance and upstreaming.

Guidance

  • Review the provided launch recipes for Kimi-K2.5 and Kimi-K2.6 to ensure correct configuration and environment variables.
  • Consider the trade-offs between DCP=4 and DCP=8 for decode throughput and prefill throughput based on the benchmark results.
  • Evaluate the proposed split PRs to determine the best approach for upstreaming the changes, including the MLA + DCP + FP8 KV correctness/support and draft-specific attention backend + draft KV dtype.
  • Provide feedback on the requested topics, such as whether #34795 should be superseded by a rebased continuation PR and whether the seq-length speculation gate should be represented as a config/CLI option.

Example

VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN=7000 \
VLLM_ENABLE_PCIE_ALLREDUCE=1 \
NCCL_P2P_LEVEL=SYS \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
NCCL_GRAPH_FILE=/path/to/nccl_graph_opt.xml \
VLLM_LOG_STATS_INTERVAL=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
VLLM_MARLIN_INPUT_DTYPE=fp8 \
python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2.5 \
  --served-model-name Kimi-K2.5 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 5000 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --async-scheduling \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 128 \
  --mm-processor-cache-gb 0 \
  --mm-encoder-tp-mode weights \
  --attention-backend TRITON_MLA \
  --kv-cache-dtype fp8 \
  --decode-context-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --speculative-config '{"model":"lightseekorg/kimi-k2.5-eagle3-mla","method":"eagle3","num_speculative_tokens":3,"draft_attention_backend":"TRITON_MLA","draft_kv_cache_dtype":"fp8","rejection_sample_method":"probabilistic"}'

Notes

The provided launch

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING