vllm - [Kimi] Track Kimi K2.5/K2.6 MLA + EAGLE serving on Blackwell (DCP4/DCP8, FP8 KV, draft backend split) [6 pull requests, 1 comment, 1 participant]

vllm • 2026-04-22 11:09:08


vllm-project/vllm#40608 • Fetched 2026-04-23 07:23:57


Author

voipmonitor


Root Cause

The current K2.6 recipe needs a target/draft backend split because the draft is non-MLA.

Fix Action

Fix / Workaround

Getting Kimi running well on Blackwell required a mix of:

MLA + DCP + FP8 KV correctness fixes
speculative decode runtime fixes
draft/target backend decoupling
model-specific speculative decode improvements
narrow local runtime workarounds and tuning knobs


PR fix notes

PR #40609: [Core] Enable FP8 KV cache with DCP for MLA

Repository: vllm-project/vllm
Author: voipmonitor
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40609

Description (problem / solution / changelog)

Summary

enable MLA decode-context-parallel prefill with FP8 KV cache
gather and dequantize KV cache correctly for the DCP prefill path
add distributed coverage for kv_cache_dtype=fp8

Context

This PR supersedes the dormant work in #34795 and is tracked from #40608.

Scope

vllm/model_executor/layers/attention/mla_attention.py
tests/distributed/test_context_parallel.py

Why

Kimi-style MLA serving on Blackwell relies on TRITON_MLA + DCP + FP8 KV, and the DCP prefill path needs explicit FP8 cache handling to work correctly.
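
The gather-then-dequantize order described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: integer rounding stands in for FP8, plain lists of (token position, quantized value) pairs stand in for each DCP rank's KV shard, and the helper names are hypothetical rather than taken from mla_attention.py.

```python
def quantize(values, scale):
    # Integer rounding is a stand-in for FP8 storage (assumption for illustration).
    return [round(v / scale) for v in values]


def gather_and_dequantize(rank_shards, scale):
    """Collect quantized KV entries from every rank, then dequantize once.

    rank_shards: one list per rank of (token_position, quantized_value) pairs.
    """
    # Stand-in for an all-gather: concatenate what each rank holds.
    gathered = [item for shard in rank_shards for item in shard]
    # Restore the original token order before attention consumes the values.
    gathered.sort(key=lambda item: item[0])
    # Dequantize after the gather so every rank applies the same scale.
    return [q * scale for _, q in gathered]


kv = [0.5, -1.0, 1.5, 2.0]
scale = 0.5
q = quantize(kv, scale)
# Two hypothetical ranks, each holding part of the sequence.
shards = [[(0, q[0]), (2, q[2])], [(1, q[1]), (3, q[3])]]
restored = gather_and_dequantize(shards, scale)
```

In this toy case the values are exact multiples of the scale, so the round trip is lossless; real FP8 storage is lossy.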

Validation

Validated locally on Blackwell/SM120 with Kimi MLA workloads and added test coverage for DCP + FP8 KV.

Changed files

tests/distributed/test_context_parallel.py (modified, +10/-1)
vllm/model_executor/layers/attention/mla_attention.py (modified, +92/-19)

PR #40610: [SpecDecode] Fix async proposer synchronization

Repository: vllm-project/vllm
Author: voipmonitor
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40610

Description (problem / solution / changelog)

Summary

move prepare_inputs_event.record() so speculative proposer GPU work is included in the synchronization window
prevent the next async batch from reusing persistent state and block-table-related metadata while proposer work from the previous batch is still in flight

Context

Tracked from #40608.

Root cause

This came from debugging async scheduling + speculative decode ordering, not from a single clean deterministic stack trace.

The key issue was the lifetime of prepare_inputs_event versus the actual proposer GPU work:

the event was being recorded from the input-preparation path, which made the step look "finished" too early
proposer-side GPU work for speculative decode runs later from sample_tokens()
that ordering allowed the next batch to enter execute_model() / _update_states() and start mutating persistent batch state while the proposer from the previous batch was still reading it on the GPU

In practice this is a concurrency-sensitive race: the failure mode is timing-dependent and can show up as nondeterministic instability / stale state usage rather than one obvious reproducible exception.

The shared state at risk here includes the persistent structures used across batches in the async path, especially block-table-related metadata and other proposer inputs derived from the batch state.

Why this fix

This patch makes prepare_inputs_event represent the full GPU lifetime of the step, not just the input-preparation portion of it.

Concretely, it re-records the event after sample_tokens() finishes, which is the point where speculative proposer GPU work has also completed. That closes the race where the next batch could otherwise observe and mutate shared state too early.

I kept the change intentionally small and local to gpu_model_runner.py because the regression surface is concurrency-sensitive, and this is the least invasive way to restore the intended happens-before relationship.
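
The ordering problem can be modeled with a toy in-order stream. This is a minimal sketch, not the code in gpu_model_runner.py: ToyStream, step, and the op names are hypothetical, and the "event" is reduced to a position in the queue.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStream:
    """Toy in-order GPU stream: ops are queued now and finish later."""
    queue: list = field(default_factory=list)
    done: int = 0  # number of queued ops that have finished

    def enqueue(self, name):
        self.queue.append(name)

    def record_event(self):
        # The event completes once every op queued before it has finished.
        return len(self.queue)

    def wait(self, event):
        # Host-side wait: finish just enough queued ops to satisfy the event.
        self.done = max(self.done, event)

    def pending(self):
        return self.queue[self.done:]


def step(record_after_proposer):
    stream = ToyStream()
    stream.enqueue("prepare_inputs")
    event = stream.record_event()  # early placement: input preparation only
    stream.enqueue("target_forward")
    stream.enqueue("proposer_reads_block_table")  # launched from sample_tokens()
    if record_after_proposer:
        event = stream.record_event()  # re-record after sample_tokens()
    # The next batch waits on the event and then mutates persistent state.
    stream.wait(event)
    return stream.pending()  # ops still in flight when mutation starts


early = step(record_after_proposer=False)
late = step(record_after_proposer=True)
```

With the early placement, the proposer read is still pending when the next batch is allowed to mutate state. With the re-recorded event, nothing is pending.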

Scope

vllm/v1/worker/gpu_model_runner.py

Notes

This is kept as a draft PR because the code change is small but the regression surface is concurrency-sensitive.

Changed files

vllm/v1/worker/gpu_model_runner.py (modified, +16/-0)

PR #40611: [SpecDecode] Allow draft-specific attention backend and KV dtype

Repository: vllm-project/vllm
Author: voipmonitor
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40611

Description (problem / solution / changelog)

Summary

allow speculative draft models to use a different attention backend than the target model
allow speculative draft models to use a different KV-cache dtype than the target model
plumb the per-draft overrides through EAGLE setup

Context

Tracked from #40608.

Why

Kimi serving needs target/draft backend decoupling:

the target model runs best with TRITON_MLA
some draft models are non-MLA and need a different backend or KV-cache dtype

Without this split, otherwise valid target/draft pairings are forced onto a single backend/dtype choice.
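
A minimal sketch of how per-draft overrides could resolve, assuming the draft inherits the target's setting when no override is given. The keys draft_attention_backend and draft_kv_cache_dtype come from the speculative-config used in the recipes in this issue; the function name and the fallback rule are assumptions, not the code in vllm/config/speculative.py.

```python
def resolve_draft_settings(target_backend, target_kv_dtype, speculative_config):
    """Return (attention backend, KV-cache dtype) for the draft model."""
    # Assumed rule: an explicit draft override wins, otherwise inherit the target.
    backend = speculative_config.get("draft_attention_backend") or target_backend
    kv_dtype = speculative_config.get("draft_kv_cache_dtype") or target_kv_dtype
    return backend, kv_dtype


# Non-MLA draft paired with an MLA target (values as in the K2.6 recipe).
split = resolve_draft_settings(
    "TRITON_MLA",
    "fp8",
    {"draft_attention_backend": "FLASH_ATTN", "draft_kv_cache_dtype": "bfloat16"},
)

# No overrides: the draft follows the target.
inherited = resolve_draft_settings("TRITON_MLA", "fp8", {})
```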

Scope

vllm/config/speculative.py
vllm/v1/spec_decode/eagle.py
vllm/v1/attention/backends/flash_attn.py
vllm/v1/worker/cp_utils.py

Notes

Kept as draft for discussion because it broadens speculative-config surface area.

Changed files

vllm/config/speculative.py (modified, +18/-1)
vllm/v1/attention/backend.py (modified, +27/-0)
vllm/v1/attention/backends/flash_attn.py (modified, +3/-2)
vllm/v1/attention/backends/flashinfer.py (modified, +3/-2)
vllm/v1/spec_decode/eagle.py (modified, +33/-4)
vllm/v1/worker/cp_utils.py (modified, +4/-2)

PR #40612: [SpecDecode] Add local argmax helper for Llama Eagle3 drafts

Repository: vllm-project/vllm
Author: voipmonitor
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40612

Description (problem / solution / changelog)

Summary

add get_top_tokens() support for LlamaForCausalLMEagle3
enable local argmax reduction for Llama-based EAGLE3 drafts when vocab mapping is identity

Context

Tracked from #40608.

Why

Large-vocab speculative draft models benefit from reducing TP communication with local argmax reduction. This brings the Llama-based EAGLE3 path in line with the same optimization already being discussed for other draft families.
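
The idea behind local argmax reduction can be sketched with plain lists, assuming the vocabulary is split into contiguous shards across TP ranks. Each rank reduces its own shard to one (value, global index) pair, so only those pairs need to be exchanged instead of full logit shards. The function names are hypothetical and this is not the get_top_tokens() implementation.

```python
def local_top(shard, offset):
    """Best logit in one rank's vocab shard, with its global token id."""
    idx = max(range(len(shard)), key=shard.__getitem__)
    return shard[idx], offset + idx


def sharded_argmax(shards):
    """Global argmax computed from one candidate per rank."""
    candidates = []
    offset = 0
    for shard in shards:
        candidates.append(local_top(shard, offset))
        offset += len(shard)
    # Only len(shards) candidates are compared here, not the full vocabulary.
    _, best_idx = max(candidates, key=lambda c: c[0])
    return best_idx


shards = [[0.1, 2.0], [3.5, -1.0], [0.0, 3.4]]
token_id = sharded_argmax(shards)
```

The result matches an argmax over the concatenated logits, which is why this only applies directly when the draft-to-target vocab mapping is the identity.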

Scope

vllm/model_executor/models/llama_eagle3.py

Related

Complements existing upstream work in #39419 rather than replacing it.

Changed files

vllm/model_executor/models/llama_eagle3.py (modified, +19/-0)

PR #40613: [SpecDecode] Add seq-length gate for speculative decode

Repository: vllm-project/vllm
Author: voipmonitor
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40613

Description (problem / solution / changelog)

Summary

add an opt-in seq-length gate that disables speculative decode above a configured threshold
make the gate effective both in the scheduler path and in the worker-side drafter fit check

Context

Tracked from #40608.

Why

For Kimi MLA long-context workloads, speculative decode can become slower than target-only decoding even when acceptance is healthy. An explicit seq-length gate is a pragmatic way to keep short-context wins without forcing long-context regressions.
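
A minimal sketch of an opt-in gate of this kind. The environment variable name is the one used in the recipes in this issue; the helper name and the exact comparison are assumptions, not the code from the PR.

```python
import os


def speculation_enabled(seq_len, env=os.environ):
    """Return False once seq_len exceeds the configured threshold."""
    raw = env.get("VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN")
    if not raw:
        return True  # opt-in: no variable set means no gate
    return seq_len <= int(raw)


# A plain dict stands in for the process environment.
env = {"VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN": "7000"}
short_ok = speculation_enabled(4096, env)
long_ok = speculation_enabled(16384, env)
```

Keeping the decision in one helper is one way to let the scheduler path and the worker-side fit check agree on the same threshold.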

Scope

vllm/v1/core/sched/async_scheduler.py
vllm/v1/worker/gpu_model_runner.py

Notes

This is intentionally a draft/discussion PR. The current implementation is an environment-based runtime knob used in local deployment recipes, not a proposed final public API shape.

Changed files

vllm/v1/core/sched/async_scheduler.py (modified, +10/-1)
vllm/v1/spec_decode/utils.py (modified, +22/-0)
vllm/v1/worker/gpu_model_runner.py (modified, +38/-1)

PR #40614: [Attention] Tune TRITON_MLA for SM120 + FP8 decode

Repository: vllm-project/vllm
Author: voipmonitor
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40614

Description (problem / solution / changelog)

Summary

cap num_kv_splits for B=1, SM120, FP8 KV MLA decode
reduce BLOCK_H for the same narrow configuration

Context

Tracked from #40608.

Why

For Blackwell/SM120 MLA decode with FP8 KV cache, the default split heuristic can overshoot and increase split/merge overhead at large local sequence lengths. A narrower BLOCK_H setting is also useful for the same configuration.
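
A sketch of how a narrow cap could be applied, assuming the default split count is computed elsewhere. The function name, its parameters, and the numbers in the usage lines are purely illustrative and are not the values used in triton_mla.py.

```python
def pick_num_kv_splits(default_splits, batch_size, is_sm120, fp8_kv, cap):
    """Cap the split count only for the B=1, SM120, FP8 KV case."""
    narrow_case = batch_size == 1 and is_sm120 and fp8_kv
    return min(default_splits, cap) if narrow_case else default_splits


# Illustrative numbers only: a default of 64 splits and a cap of 16.
capped = pick_num_kv_splits(64, batch_size=1, is_sm120=True, fp8_kv=True, cap=16)
untouched = pick_num_kv_splits(64, batch_size=2, is_sm120=True, fp8_kv=True, cap=16)
```

Every other configuration keeps the default heuristic, which is what keeps the tuning separable from the correctness changes.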

Scope

vllm/v1/attention/backends/mla/triton_mla.py
vllm/v1/attention/ops/triton_decode_attention.py

Notes

This is intentionally a draft/discussion PR. The tuning is narrow (B=1, SM120, FP8 KV) and should be evaluated on its own merits rather than mixed with correctness changes.

Changed files

vllm/v1/attention/backends/mla/triton_mla.py (modified, +13/-0)
vllm/v1/attention/ops/triton_decode_attention.py (modified, +9/-0)


This issue tracks the current Blackwell Kimi serving stack we validated locally and the split PRs needed to upstream the useful pieces.

Scope

Target workloads:

moonshotai/Kimi-K2.5
moonshotai/Kimi-K2.6

Target hardware/runtime assumptions:

Blackwell / SM120
TRITON_MLA
FP8 KV cache
tensor parallel size 8
decode context parallel size 4 or 8
speculative decode with EAGLE-style drafters where it is performance-positive

Why this issue exists

Getting Kimi running well on Blackwell required a mix of:

MLA + DCP + FP8 KV correctness fixes
speculative decode runtime fixes
draft/target backend decoupling
model-specific speculative decode improvements
narrow local runtime workarounds and tuning knobs

This issue is meant to:

document the current working launch recipes,
record the DCP4/DCP8 benchmark state,
link the split PRs,
separate upstream-worthy fixes from local-only runtime knobs.

Related upstream context

Existing upstream work:

#34795 [Core] Enable FP8 KV cache with Decode Context Parallel (DCP) for MLA
#39419 [SpecDecode] Reduce TP communication for large-vocab draft models in DFlash/PARD speculative decoding
#39930 [Attention] Fix attention backend selection with DFlash
#39995 dflash with flashinfer

Working recipes

Kimi-K2.5 + K2.5 Eagle3 MLA draft

This is the strongest current path for Kimi speculative decode on Blackwell.

VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN=7000 \
VLLM_ENABLE_PCIE_ALLREDUCE=1 \
NCCL_P2P_LEVEL=SYS \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
NCCL_GRAPH_FILE=/path/to/nccl_graph_opt.xml \
VLLM_LOG_STATS_INTERVAL=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
VLLM_MARLIN_INPUT_DTYPE=fp8 \
python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2.5 \
  --served-model-name Kimi-K2.5 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 5000 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --async-scheduling \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 128 \
  --mm-processor-cache-gb 0 \
  --mm-encoder-tp-mode weights \
  --attention-backend TRITON_MLA \
  --kv-cache-dtype fp8 \
  --decode-context-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --speculative-config '{"model":"lightseekorg/kimi-k2.5-eagle3-mla","method":"eagle3","num_speculative_tokens":3,"draft_attention_backend":"TRITON_MLA","draft_kv_cache_dtype":"fp8","rejection_sample_method":"probabilistic"}'

Notes:

DCP=8 also works and is benchmarked below.
the VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN threshold is currently a local runtime policy, not an upstream default.
this recipe currently still performs best with an external NCCL graph XML; a no-XML patched-NCCL path is functional but not yet performance-equivalent.
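
The --speculative-config value is a JSON string, which is easy to misquote in a shell. One option is to build it from a dict, as in the sketch below. The keys and values are copied from the K2.5 recipe above; the variable names are illustrative.

```python
import json

speculative_config = {
    "model": "lightseekorg/kimi-k2.5-eagle3-mla",
    "method": "eagle3",
    "num_speculative_tokens": 3,
    "draft_attention_backend": "TRITON_MLA",
    "draft_kv_cache_dtype": "fp8",
    "rejection_sample_method": "probabilistic",
}

# Compact separators reproduce the whitespace-free form used in the recipe.
speculative_arg = json.dumps(speculative_config, separators=(",", ":"))
print(speculative_arg)
```

The resulting string can then be passed as a single argument, for example through subprocess with an argument list, so no shell quoting is involved.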

Kimi-K2.6 + K2.6 Eagle3 draft

The current K2.6 recipe needs a target/draft backend split because the draft is non-MLA:

VLLM_SPECULATIVE_DISABLE_ABOVE_SEQ_LEN=16384 \
VLLM_ENABLE_PCIE_ALLREDUCE=1 \
NCCL_P2P_LEVEL=SYS \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
NCCL_GRAPH_FILE=/path/to/nccl_graph_opt.xml \
VLLM_LOG_STATS_INTERVAL=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
VLLM_MARLIN_INPUT_DTYPE=fp8 \
python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2.6 \
  --served-model-name Kimi-K2.6 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 5000 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --async-scheduling \
  --gpu-memory-utilization 0.94 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 128 \
  --mm-processor-cache-gb 0 \
  --mm-encoder-tp-mode weights \
  --language-model-only \
  --attention-backend TRITON_MLA \
  --kv-cache-dtype fp8 \
  --decode-context-parallel-size 8 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3","method":"eagle3","num_speculative_tokens":3,"draft_attention_backend":"FLASH_ATTN","draft_kv_cache_dtype":"bfloat16","use_local_argmax_reduction":true,"rejection_sample_method":"probabilistic"}'

Notes:

long-context K2.6 speculative decode still needs the threshold gate; otherwise speculation is often slower than baseline.
the K2.6 draft path is not as mature as the K2.5 MLA draft path.

Benchmarks

Kimi-K2.5 + `lightseekorg/kimi-k2.5-eagle3-mla`

Measured with llm_decode_bench.py against the best-known XML-based runtime recipe.

DCP=4

Selected decode throughput (tok/s):

ctx=0, C=1: 85.5
ctx=16k, C=1: 52.7
ctx=32k, C=1: 52.7
ctx=64k, C=1: 52.7
ctx=0, C=64: 929.6
ctx=16k, C=64: 826.7
ctx=32k, C=64: 698.8
ctx=64k, C=64: 508.4

Selected prefill throughput:

8k: 7885 tok/s
16k: 7998 tok/s
32k: 7693 tok/s
64k: 6999 tok/s
128k: 6108 tok/s

DCP=8

Selected decode throughput (tok/s):

ctx=0, C=1: 78.6
ctx=16k, C=1: 47.7
ctx=32k, C=1: 49.7
ctx=64k, C=1: 49.7
ctx=0, C=64: 924.6
ctx=16k, C=64: 763.5
ctx=32k, C=64: 635.7
ctx=64k, C=64: 445.2

Selected prefill throughput:

8k: 7907 tok/s
16k: 8015 tok/s
32k: 7695 tok/s
64k: 6991 tok/s
128k: 6096 tok/s

Interpretation

For the validated XML-based setup, DCP=4 currently wins over DCP=8 on decode throughput for Kimi-K2.5 + MLA draft.
Prefill throughput is very similar between DCP=4 and DCP=8.
Long-context speculative decode is only beneficial below a tuned threshold; above it, target-only decode is faster.
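
The decode comparison can be made explicit by computing the relative gap from the selected numbers above. The figures are copied from the DCP=4 and DCP=8 lists; the C label is kept as written there, and the helper name is illustrative.

```python
DECODE_TOK_S = {
    # (ctx, C): (DCP=4 tok/s, DCP=8 tok/s)
    ("0", 1): (85.5, 78.6),
    ("16k", 1): (52.7, 47.7),
    ("32k", 1): (52.7, 49.7),
    ("64k", 1): (52.7, 49.7),
    ("0", 64): (929.6, 924.6),
    ("16k", 64): (826.7, 763.5),
    ("32k", 64): (698.8, 635.7),
    ("64k", 64): (508.4, 445.2),
}


def dcp4_advantage_percent(table):
    """Relative decode-throughput gain of DCP=4 over DCP=8, in percent."""
    return {key: 100.0 * (d4 - d8) / d8 for key, (d4, d8) in table.items()}


advantage = dcp4_advantage_percent(DECODE_TOK_S)
for (ctx, c), pct in advantage.items():
    print(f"ctx={ctx}, C={c}: DCP=4 is {pct:+.1f}% vs DCP=8")
```

On these selected points DCP=4 is ahead in every row, with the gap at C=64 widening as context grows.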

No-XML status

We also tested a patched NCCL 2.29.7 path without NCCL_GRAPH_FILE.

Current result:

end-to-end works without XML
decode throughput at long context is roughly comparable or slightly better in some cases
short-context decode is worse
prefill is significantly worse

Conclusion for now:

the no-XML path is promising but not yet a drop-in replacement for the current best-known XML-based runtime recipe.

Proposed split PRs

These are the changes we currently believe should be split into separate PRs.

MLA + DCP + FP8 KV correctness/support
Async speculative proposer synchronization fix
Draft-specific attention backend + draft KV dtype
Llama Eagle3 local argmax helper
Optional seq-length gate for speculative decode (discussion draft)
Narrow SM120 FP8 Triton MLA tuning (discussion draft)

Qwen3/Qwen3-DFlash local argmax is already covered by #39419.

Request for feedback

I would like feedback on:

whether #34795 should be superseded by a rebased continuation PR if the original author remains inactive,
whether the seq-length speculation gate should be represented as a config/CLI option rather than an env-only knob,
whether the narrow SM120 FP8 Triton MLA tuning should be considered separately from correctness fixes,
and whether the Kimi K2.5/K2.6 recipes should live in documentation once the split PRs land.

extent analysis

TL;DR

To improve the performance of Kimi on Blackwell, apply the validated launch recipes and consider splitting the proposed changes into separate PRs for easier maintenance and upstreaming.

Guidance

Review the provided launch recipes for Kimi-K2.5 and Kimi-K2.6 to ensure correct configuration and environment variables.
Consider the trade-offs between DCP=4 and DCP=8 for decode throughput and prefill throughput based on the benchmark results.
Evaluate the proposed split PRs to determine the best approach for upstreaming the changes, including the MLA + DCP + FP8 KV correctness/support and draft-specific attention backend + draft KV dtype.
Provide feedback on the requested topics, such as whether #34795 should be superseded by a rebased continuation PR and whether the seq-length speculation gate should be represented as a config/CLI option.

