vllm - [Performance] DSV3.2 Indexer: Overlap indexer op || q_b_proj + MLA RoPE on separate CUDA streams

Motivation

In the DeepSeek-V3.2 attention layer, q_b_proj + MLA RoPE and the full indexer (projections + op) execute sequentially on the same CUDA stream even though neither consumes the other's output. Both read q_c, and their results are first needed together at mla_attn().

This is complementary to the multi-stream overlap of the indexer k+w path || q path (separate issue): see https://github.com/vllm-project/vllm/issues/39309

Current Execution

 #4  QKV A Proj (fused_a_gemm)                                   7.7us
 #8  q_a_rmsnorm + split qkv_lora                                2.3us
 #9  kv_a_rmsnorm + split kv_lora                                2.5us
       ↓
#14  Q B Proj (q_b_proj)                                        15.1us   → reads q_c
#15  MLA RoPE (Q RoPE + KV RoPE)                                 2.0us   → reads q + k_pe
       ↓
      Full indexer (self.indexer(hidden_states, q_c, ...)):
 #5  wk_weights_proj (splitK)                                    4.8us
 #6  splitKreduce                                                2.8us
 #7  k_norm (LayerNorm)                                          2.2us
#10  wq_b                                                        7.2us
#11  Indexer RoPE (q+k)                                           2.0us
#12  FP8 quant                                                    2.3us
#13  W scale                                                      1.4us
#16  Indexer Cache (k_quant_and_cache)                            2.5us
#17  fill                                                         1.1us
#18  Indexer MQA (paged_mqa_logits)                               4.5us
#19  Logits Top K (topk_kernel)                                   1.4us
       ↓
#20  concat_and_cache_mla                                         2.0us   ← start of mla_attn
      ... MLA attention continues ...

Proposed Standalone Execution

Run the full indexer on the default stream and q_b_proj + MLA RoPE on an aux stream in parallel. In the current code, q_b_proj is already called outside the indexer, so no changes to Indexer.forward() are needed. The sync point is before mla_attn(), which needs both MLA q (from q_b_proj) and topk_indices (from the indexer):

 #4  QKV A Proj                                                   7.7us
 #8  q_a_rmsnorm + split                                          2.3us
 #9  kv_a_rmsnorm + split                                         2.5us
                    ┌──────────────────────────┴─────────────────────────┐
Default stream (full indexer):              Aux stream (q_b_proj + MLA RoPE):
──────────────────────────                  ────────────────────────────
 #5  wk_weights_proj         4.8us          #14  Q B Proj (q_b_proj)     15.1us
 #6  splitKreduce            2.8us          #15  MLA RoPE (Q+KV)          2.0us
 #7  k_norm                  2.2us               → MLA q ready
#10  wq_b                    7.2us               (17.1us total)
#11  Indexer RoPE            2.0us
#12  FP8 quant               2.3us
#13  W scale                 1.4us
#16  Indexer Cache            2.5us
#17  fill                    1.1us
#18  Indexer MQA             4.5us
#19  Logits Top K            1.4us
     → topk_indices ready
     (32.2us total)
                    └──────────────────────────┬─────────────────────────┘
                                         sync (mla_attn needs both MLA q + topk_indices)
                                               ↓
                            #20  concat_and_cache_mla               2.0us
                            #21  kv_b_proj (W_UV)                   5.9us
                                 ... sparse FlashMLA ...
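
As a rough check on the diagram above, the ideal gain can be estimated from the per-kernel timings it lists. This is a back-of-the-envelope sketch, not a measurement: it assumes the two streams overlap perfectly, with no contention for SMs or memory bandwidth and no fork/sync overhead. The dictionary keys are illustrative labels, not vLLM identifiers.

```python
# Per-kernel timings in microseconds, copied from the diagram above.
# Keys are illustrative labels only (assumption), not vLLM identifiers.
INDEXER_US = {
    "wk_weights_proj": 4.8,
    "splitKreduce": 2.8,
    "k_norm": 2.2,
    "wq_b": 7.2,
    "indexer_rope": 2.0,
    "fp8_quant": 2.3,
    "w_scale": 1.4,
    "indexer_cache": 2.5,
    "fill": 1.1,
    "indexer_mqa": 4.5,
    "logits_topk": 1.4,
}
Q_PATH_US = {"q_b_proj": 15.1, "mla_rope": 2.0}


def branch_total(kernels):
    """Total time of kernels that run back-to-back on one stream."""
    return sum(kernels.values())


indexer_us = branch_total(INDEXER_US)
q_path_us = branch_total(Q_PATH_US)

# Today: both branches share one stream, so their times add up.
sequential_us = indexer_us + q_path_us
# Idealized overlap: the longer branch alone determines the elapsed time.
overlapped_us = max(indexer_us, q_path_us)
ideal_saving_us = sequential_us - overlapped_us

print(f"sequential: {sequential_us:.1f}us")
print(f"overlapped: {overlapped_us:.1f}us")
print(f"ideal saving: {ideal_saving_us:.1f}us")
```

Under these assumptions the segment drops from 49.3us to 32.2us, and the indexer remains the critical path.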

Combined with https://github.com/vllm-project/vllm/issues/39309 (k+w path || q path overlap)

If implemented together with the k+w path || q path multi-stream overlap (https://github.com/vllm-project/vllm/issues/39309), the aux stream is reused across two fork/sync phases. That change shortens the indexer's critical path, making the two streams more balanced:

AR + Add + RMS → hidden_states ready
  ↓ fork #1 (issue #39309)
  Default: QKV A Proj → q_a_rmsnorm → kv_a_rmsnorm → wq_b
  Aux:     wk_weights_proj → reduce → k_norm
  ↓ sync #1 (before Indexer RoPE)
  Default continues: Indexer RoPE → FP8 quant → W scale
  ↓ fork #2 (this issue; aux stream reused)
  Default: Indexer Cache → fill → MQA logits → TopK
  Aux:     Q B Proj → MLA RoPE
  ↓ sync #2 (before mla_attn)
  MLA attention → MoE → ...
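
The two-phase schedule above can be sanity-checked with a small critical-path model. The sketch below is an idealized estimate built only from the kernel timings quoted in this issue: it treats streams within a phase as perfectly concurrent and ignores fork/sync overhead and resource contention. The function name and the nested-list layout are illustrative choices, not part of vLLM.

```python
def critical_path_us(phases):
    """Elapsed time of a fork/join schedule, in microseconds.

    `phases` is a list of phases that run one after another. Each phase is a
    list of streams that run concurrently, and each stream is a list of kernel
    times executed back-to-back. A phase lasts as long as its slowest stream.
    """
    return sum(max(sum(stream) for stream in phase) for phase in phases)


combined = [
    # fork #1: default = QKV A Proj, q_a_rmsnorm, kv_a_rmsnorm, wq_b
    #          aux     = wk_weights_proj, reduce, k_norm
    [[7.7, 2.3, 2.5, 7.2], [4.8, 2.8, 2.2]],
    # default only: Indexer RoPE, FP8 quant, W scale
    [[2.0, 2.3, 1.4]],
    # fork #2: default = Indexer Cache, fill, MQA logits, TopK
    #          aux     = Q B Proj, MLA RoPE
    [[2.5, 1.1, 4.5, 1.4], [15.1, 2.0]],
]

# Baseline: the same kernels flattened onto a single stream.
single_stream = [[[t for phase in combined for stream in phase for t in stream]]]

print(f"single stream: {critical_path_us(single_stream):.1f}us")
print(f"combined overlap: {critical_path_us(combined):.1f}us")
```

In this model the combined schedule takes 42.5us against 61.8us on a single stream. Note that in fork #2 the aux branch (17.1us) is longer than the default branch (9.5us), so q_b_proj + MLA RoPE becomes the critical path of that phase.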

Extent analysis

TL;DR

Run the full indexer on the default stream and q_b_proj + MLA RoPE on an auxiliary stream in parallel, so that the q path (17.1us) is hidden behind the indexer (32.2us) rather than running back-to-back with it.

Guidance

Identify the synchronization point before mla_attn() where both MLA q (from q_b_proj) and topk_indices (from the indexer) are needed.
Implement a fork point before q_b_proj to execute it and MLA RoPE on an auxiliary stream, allowing the full indexer to run on the default stream.
Ensure proper synchronization between the two streams before mla_attn() to guarantee that both required inputs are ready.
Consider combining this optimization with the k+w path || q path multi-stream overlap (https://github.com/vllm-project/vllm/issues/39309) for further performance improvements.
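
The fork and sync steps listed above can be sketched with PyTorch's stream API. This is a minimal illustration of the pattern, not the vLLM implementation: `run_forked`, `main_fn`, and `aux_fn` are hypothetical names, and the callables stand in for the full indexer and for q_b_proj + MLA RoPE. The sketch falls back to sequential execution when PyTorch or CUDA is unavailable.

```python
try:
    import torch
except ImportError:  # lets the control flow be exercised without PyTorch
    torch = None


def run_forked(main_fn, aux_fn, aux_stream=None):
    """Run main_fn on the current stream and aux_fn on aux_stream, then join.

    Hypothetical helper: main_fn stands in for the full indexer (producing
    topk_indices) and aux_fn for q_b_proj + MLA RoPE (producing MLA q).
    """
    if torch is None or not torch.cuda.is_available():
        # No CUDA: keep today's behavior (q path first, then the indexer).
        aux_out = aux_fn()
        main_out = main_fn()
        return main_out, aux_out

    main_stream = torch.cuda.current_stream()
    if aux_stream is None:
        # Real code would create this once and reuse it across layers.
        aux_stream = torch.cuda.Stream()

    # Fork: the aux stream must wait until its inputs (e.g. q_c) exist.
    aux_stream.wait_stream(main_stream)
    with torch.cuda.stream(aux_stream):
        aux_out = aux_fn()
    main_out = main_fn()

    # Join: mla_attn needs both results, so the main stream waits for aux.
    main_stream.wait_stream(aux_stream)
    return main_out, aux_out


# Placeholder callables; real ones would launch the kernels named above.
topk_indices, mla_q = run_forked(lambda: "topk_indices", lambda: "mla_q")
```

Depending on tensor lifetimes, outputs allocated on the aux stream and consumed on the main stream may also need `Tensor.record_stream` so the caching allocator does not reuse their memory too early.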

Example

The issue itself does not include implementation code; the Proposed Standalone Execution section illustrates the intended parallelization.

Notes

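The benefit depends on actual kernel times and on how well the two streams are balanced. With the timings listed above, the aux stream (17.1us) finishes well before the default stream (32.2us), so the indexer remains the critical path. Profile before and after the change to confirm that the overlap actually materializes.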
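
One way to compare the two variants is to time the same region before and after the change. The helper below is a sketch under assumptions: `mean_time_us` is a hypothetical name, it uses CUDA events when a GPU is present and a wall-clock timer otherwise, and it measures whatever callable it is given rather than any specific vLLM function.

```python
import time

try:
    import torch
except ImportError:  # fall back to wall-clock timing without PyTorch
    torch = None


def mean_time_us(fn, iters=100, warmup=10):
    """Mean duration of fn() in microseconds over `iters` calls."""
    for _ in range(warmup):
        fn()

    if torch is not None and torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn()
        # If fn joins its aux stream back into the current stream, the end
        # event also covers the work done on the aux stream.
        end.record()
        end.synchronize()
        return start.elapsed_time(end) * 1000.0 / iters  # elapsed_time is in ms

    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1e6 / iters


# Placeholder workload; substitute the sequential and overlapped variants.
baseline_us = mean_time_us(lambda: None, iters=10, warmup=2)
```

Microsecond-scale kernels are sensitive to launch overhead, so a profiler trace is a useful cross-check on numbers obtained this way.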

Recommendation

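Run the full indexer on the default stream and q_b_proj + MLA RoPE on an auxiliary stream in parallel. With the kernel timings listed above, perfect overlap would hide the 17.1us q path behind the 32.2us indexer path, cutting this segment from 49.3us to 32.2us. The real gain is likely to be smaller once fork/sync overhead and contention between the streams are counted. Because q_b_proj is already called outside the indexer, the change needs no modification to Indexer.forward().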
