vllm: [Performance] DSV3.2 Indexer: Overlap indexer op || q_b_proj + MLA RoPE on separate CUDA streams

GitHub stats
vllm-project/vllm#39308 · Fetched 2026-04-09 07:51:59
Comments: 0 · Participants: 1 · Timeline: 1 (labeled ×1) · Reactions: 0


Motivation

In the DeepSeek-V3.2 attention layer, q_b_proj + MLA RoPE and the full indexer (projections + op) execute sequentially on the same CUDA stream despite having no data dependency.

This is complementary to the multi-stream overlap of the indexer k+w path || q path (separate issue): see https://github.com/vllm-project/vllm/issues/39309

Current Execution

 #4  QKV A Proj (fused_a_gemm)                                   7.7us
 #8  q_a_rmsnorm + split qkv_lora                                2.3us
 #9  kv_a_rmsnorm + split kv_lora                                2.5us
#14  Q B Proj (q_b_proj)                                        15.1us   → reads q_c
#15  MLA RoPE (Q RoPE + KV RoPE)                                 2.0us   → reads q + k_pe
      Full indexer (self.indexer(hidden_states, q_c, ...)):
 #5  wk_weights_proj (splitK)                                    4.8us
 #6  splitKreduce                                                2.8us
 #7  k_norm (LayerNorm)                                          2.2us
#10  wq_b                                                        7.2us
#11  Indexer RoPE (q+k)                                           2.0us
#12  FP8 quant                                                    2.3us
#13  W scale                                                      1.4us
#16  Indexer Cache (k_quant_and_cache)                            2.5us
#17  fill                                                         1.1us
#18  Indexer MQA (paged_mqa_logits)                               4.5us
#19  Logits Top K (topk_kernel)                                   1.4us
#20  concat_and_cache_mla                                         2.0us   ← start of mla_attn
      ... MLA attention continues ...
<img width="3170" height="206" alt="Image" src="https://github.com/user-attachments/assets/cba4af23-e311-4624-af36-f4d9a3c33bd5" />

Proposed Standalone Execution

Run the full indexer on the default stream and q_b_proj + MLA RoPE on an aux stream in parallel. In the current code, q_b_proj is already called outside the indexer — no changes to Indexer.forward() needed. The sync point is before mla_attn(), which needs both MLA q (from q_b_proj) and topk_indices (from the indexer):

 #4  QKV A Proj                                                   7.7us
 #8  q_a_rmsnorm + split                                          2.3us
 #9  kv_a_rmsnorm + split                                         2.5us
                    ┌──────────────────────────┴─────────────────────────┐
Default stream (full indexer):              Aux stream (q_b_proj + MLA RoPE):
──────────────────────────                  ────────────────────────────
 #5  wk_weights_proj         4.8us          #14  Q B Proj (q_b_proj)     15.1us
 #6  splitKreduce            2.8us          #15  MLA RoPE (Q+KV)          2.0us
 #7  k_norm                  2.2us               → MLA q ready
#10  wq_b                    7.2us               (17.1us total)
#11  Indexer RoPE            2.0us
#12  FP8 quant               2.3us
#13  W scale                 1.4us
#16  Indexer Cache            2.5us
#17  fill                    1.1us
#18  Indexer MQA             4.5us
#19  Logits Top K            1.4us
     → topk_indices ready
     (32.2us total)
                    └──────────────────────────┬─────────────────────────┘
                                         sync (mla_attn needs both MLA q + topk_indices)
                            #20  concat_and_cache_mla               2.0us
                            #21  kv_b_proj (W_UV)                   5.9us
                                 ... sparse FlashMLA ...
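
To make the arithmetic behind the two listings explicit, here is a minimal sketch that compares the sequential and overlapped schedules. The kernel timings are copied from the issue; the model itself is an assumption, not a measurement: it treats the two streams as perfectly concurrent and ignores launch overhead, synchronization cost, and contention for GPU resources.

```python
# Kernel timings in microseconds, copied from the listings in the issue.
PROLOGUE = {  # runs before the fork, on the default stream
    "QKV A Proj": 7.7,
    "q_a_rmsnorm + split": 2.3,
    "kv_a_rmsnorm + split": 2.5,
}
INDEXER = {  # full indexer, default stream
    "wk_weights_proj": 4.8,
    "splitKreduce": 2.8,
    "k_norm": 2.2,
    "wq_b": 7.2,
    "Indexer RoPE": 2.0,
    "FP8 quant": 2.3,
    "W scale": 1.4,
    "Indexer Cache": 2.5,
    "fill": 1.1,
    "Indexer MQA": 4.5,
    "Logits Top K": 1.4,
}
MLA_Q = {  # q_b_proj + MLA RoPE, aux stream in the proposal
    "Q B Proj (q_b_proj)": 15.1,
    "MLA RoPE (Q+KV)": 2.0,
}


def total_us(kernels: dict) -> float:
    """Sum kernel durations, rounded to 0.1us to hide float noise."""
    return round(sum(kernels.values()), 1)


prologue = total_us(PROLOGUE)
indexer = total_us(INDEXER)
mla_q = total_us(MLA_Q)

# Current execution: everything back to back on one stream.
sequential = round(prologue + mla_q + indexer, 1)
# Proposed execution (idealized): the longer of the two streams decides.
overlapped = round(prologue + max(indexer, mla_q), 1)
ideal_saving = round(sequential - overlapped, 1)

print(f"indexer path: {indexer}us, q_b_proj + MLA RoPE path: {mla_q}us")
print(f"sequential: {sequential}us, overlapped (ideal): {overlapped}us")
print(f"ideal saving: {ideal_saving}us")
```

Under these assumptions the region up to the sync point drops from 61.8us to 44.7us, so the upper bound on the saving is the whole 17.1us aux path. Real gains are likely to be smaller because concurrent kernels share the GPU.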

Combined with https://github.com/vllm-project/vllm/issues/39309 (k+w path || q path overlap)

If implemented together with the k+w path || q path multi-stream overlap (https://github.com/vllm-project/vllm/issues/39309), the aux stream is reused across two fork/sync phases. That change shortens the indexer's critical path, making the two streams more balanced:

AR + Add + RMS → hidden_states ready
  ↓ fork #1 (issue 1)
  Default: QKV A Proj → q_a_rmsnorm → kv_a_rmsnorm → wq_b     
  Aux:     wk_weights_proj → reduce → k_norm                       
  ↓ sync #1 (before Indexer RoPE)
  Default continues: Indexer RoPE → FP8 quant → W scale      
  ↓ fork #2 (issue 2 — aux stream reused)
  Default: Indexer Cache → fill → MQA logits → TopK                
  Aux:     Q B Proj → MLA RoPE                                     
  ↓ sync #2 (before mla_attn)
  MLA attention → MoE → ...
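
The two-phase schedule above can be checked with the same kind of idealized model. This is a sketch under stated assumptions: the per-kernel timings are reused from the earlier listings in the issue, each fork/sync phase is assumed to take as long as its slower stream, and synchronization overhead and resource contention are ignored. The helper names `chain` and `fork_join` are hypothetical.

```python
def chain(*durations_us: float) -> float:
    """Duration of kernels that run back to back on one stream."""
    return sum(durations_us)


def fork_join(default_us: float, aux_us: float) -> float:
    """Ideal duration of one fork/sync phase: the slower stream decides."""
    return max(default_us, aux_us)


# Phase 1: fork #1 .. sync #1
p1_default = chain(7.7, 2.3, 2.5, 7.2)  # QKV A Proj, q_a_rmsnorm, kv_a_rmsnorm, wq_b
p1_aux = chain(4.8, 2.8, 2.2)           # wk_weights_proj, reduce, k_norm

# Single-stream stretch between sync #1 and fork #2
middle = chain(2.0, 2.3, 1.4)           # Indexer RoPE, FP8 quant, W scale

# Phase 2: fork #2 .. sync #2
p2_default = chain(2.5, 1.1, 4.5, 1.4)  # Indexer Cache, fill, MQA logits, TopK
p2_aux = chain(15.1, 2.0)               # Q B Proj, MLA RoPE

combined = round(
    fork_join(p1_default, p1_aux) + middle + fork_join(p2_default, p2_aux), 1
)

print(f"phase 1: default {p1_default:.1f}us vs aux {p1_aux:.1f}us")
print(f"phase 2: default {p2_default:.1f}us vs aux {p2_aux:.1f}us")
print(f"combined critical path (ideal): {combined}us")
```

With these numbers the idealized critical path up to sync #2 is 42.5us. In phase 2 the model suggests the aux stream (17.1us) becomes the longer side rather than the default stream (9.5us), which is worth confirming with a profiler trace.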

extent analysis

TL;DR

Run the full indexer on the default stream and q_b_proj + MLA RoPE on an auxiliary stream in parallel, so that the 17.1us q_b_proj + MLA RoPE path overlaps with the 32.2us indexer path instead of running before it.

Guidance

  • Identify the synchronization point before mla_attn() where both MLA q (from q_b_proj) and topk_indices (from the indexer) are needed.
  • Implement a fork point before q_b_proj to execute it and MLA RoPE on an auxiliary stream, allowing the full indexer to run on the default stream.
  • Ensure proper synchronization between the two streams before mla_attn() to guarantee that both required inputs are ready.
  • Consider combining this optimization with the k+w path || q path multi-stream overlap (https://github.com/vllm-project/vllm/issues/39309) for further performance improvements.

Example

The issue itself contains no code. The Proposed Standalone Execution section illustrates the intended parallelization.
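
As a rough illustration of the fork/sync pattern described in the guidance, the following is a minimal sketch using PyTorch CUDA streams. It is not vLLM's implementation: `run_overlapped`, `main_fn`, and `aux_fn` are hypothetical names introduced here, and the sketch falls back to sequential execution when PyTorch, CUDA, or an aux stream is unavailable.

```python
from typing import Callable, Tuple, TypeVar

try:
    import torch
except ImportError:  # lets the sketch run (sequentially) without PyTorch
    torch = None

A = TypeVar("A")
B = TypeVar("B")


def run_overlapped(
    main_fn: Callable[[], A],
    aux_fn: Callable[[], B],
    aux_stream=None,
) -> Tuple[A, B]:
    """Run main_fn on the current stream and aux_fn on aux_stream, then join.

    Hypothetical helper: main_fn would wrap the indexer call and aux_fn the
    q_b_proj + MLA RoPE calls.
    """
    if torch is None or not torch.cuda.is_available() or aux_stream is None:
        # Fallback: no overlap possible, keep the original sequential order.
        aux_out = aux_fn()
        main_out = main_fn()
        return main_out, aux_out

    main_stream = torch.cuda.current_stream()

    # Fork: the aux stream must wait for work already queued on the main
    # stream, so the inputs of aux_fn (e.g. q_c, k_pe) are ready.
    aux_stream.wait_stream(main_stream)
    with torch.cuda.stream(aux_stream):
        aux_out = aux_fn()

    # The main stream continues with its own work in the meantime.
    main_out = main_fn()

    # Join: later kernels on the main stream (e.g. mla_attn) wait for aux work.
    main_stream.wait_stream(aux_stream)
    # Caveat: tensors allocated on aux_stream and consumed on the main stream
    # may also need Tensor.record_stream(main_stream) so the caching allocator
    # does not reuse their memory too early.
    return main_out, aux_out


# Usage with plain callables; with CUDA, pass aux_stream=torch.cuda.Stream().
topk_indices, mla_q = run_overlapped(lambda: "topk_indices", lambda: "mla_q")
print(topk_indices, mla_q)
```

Creating the aux stream once and passing it in, rather than allocating a new stream per call, matches the issue's note that the aux stream is reused across fork/sync phases.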

Notes

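The gain depends on the measured kernel times and on how evenly the two streams are balanced. With the timings in the issue, the indexer path (32.2us) is nearly twice as long as the q_b_proj + MLA RoPE path (17.1us), so at most 17.1us can be hidden. Kernels running concurrently also share GPU resources, so the real gain may be smaller. Profile before and after the change to confirm that the overlap actually occurs.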

Recommendation

Apply the proposed optimization: run the full indexer on the default stream and q_b_proj + MLA RoPE on an auxiliary stream in parallel, with a sync before mla_attn(). Because q_b_proj is already called outside the indexer, this needs no changes to Indexer.forward().
