vllm - ✅(Solved) Fix [Performance] DSV3.2 Indexer: Overlap indexer k+w path || q path on separate CUDA streams [45 pull requests, 1 participant]

vllm · 2026-04-08 14:51:23

GitHub stats

vllm-project/vllm#39299•Fetched 2026-04-09 07:51:58

Author

LopezCastroRoberto

Participants

LopezCastroRoberto

Timeline (top)

closed ×1, cross-referenced ×1, labeled ×1

PR fix notes

PR #39695: Introduce De-dup/Similarity-Check in CI Workflow for PR/Issue

Repository: vllm-project/vllm
Author: panpan0000
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39695

Description (problem / solution / changelog)

Co-Author: Trae + GPT5.3-Codex

Purpose

Provides an example to illustrate https://github.com/vllm-project/vllm/issues/39694

Example algorithm:

Scoring: 0.75 * text_similarity + 0.25 * file_overlap.
Threshold used for the report: 0.75.
Caching: the GitHub Actions CI cache temporarily stores the GitHub API results for the most recent 1000 PRs, 500 issues, etc.
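
A minimal sketch of the scoring rule above, for illustration only. The PR's actual scripts are not shown on this page, so the helper names, the use of `difflib.SequenceMatcher` for text similarity, and the Jaccard index for file overlap are all assumptions; only the weights and the threshold come from the description.

```python
from difflib import SequenceMatcher

# Weights and threshold as stated in the PR description.
TEXT_WEIGHT = 0.75
FILE_WEIGHT = 0.25
THRESHOLD = 0.75


def text_similarity(a: str, b: str) -> float:
    """Assumed metric: case-insensitive SequenceMatcher ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def file_overlap(files_a, files_b) -> float:
    """Assumed metric: Jaccard index of the two changed-file sets."""
    a, b = set(files_a), set(files_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_score(text_sim: float, overlap: float) -> float:
    """Weighted sum from the description: 0.75 * text + 0.25 * files."""
    return TEXT_WEIGHT * text_sim + FILE_WEIGHT * overlap


def is_high_similarity(text_a, text_b, files_a, files_b) -> bool:
    """True when the combined score reaches the report threshold."""
    score = combined_score(text_similarity(text_a, text_b),
                           file_overlap(files_a, files_b))
    return score >= THRESHOLD
```

With these weights, two PRs with identical text but no files in common score exactly 0.75, so text similarity alone can reach the threshold.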

Test Plan

Ran the similarity check on the 1000 most recent PRs:

High-similarity pairs (>= 0.75): 26

Test Result

PR Similarity

Repo: vllm-project/vllm
PR count: 1000
Candidate pairs: 17375
High-similarity pairs (>= 0.75): 26

Score	Text	Files	PR A	PR B
100%	100%	100%	#39553 Okakarpa shadow clone	#39577 Okakarpa shadow clone
99%	99%	100%	#37929 [Core] Use standalone autograd_cache_key for compilation dedup optimization	#39517 [Core] Use standalone autograd_cache_key for compilation dedup optimization
96%	95%	100%	#37947 [DRAFT][XPU] Upgrade torch 2.11 for xpu	#39257 [XPU] update triton version for torch 2.11 upgrade
96%	95%	100%	#37947 [DRAFT][XPU] Upgrade torch 2.11 for xpu	#39313 [XPU] upgrade to triton-xpu 3.7.0
95%	97%	88%	#38249 [Misc] Organize NixlConnector into own directory	#39354 [KVConnector][NIXL] Organize NIXL connector into its own directory
95%	93%	100%	#39410 [XPU] Disable fusion passes on XPU Platform	#39671 use spawn multiproc method on xpu
94%	92%	100%	#38856 [LMCache] vLLM Block Allocation Event	#39719 fix(lmcache): correct store for cached requests while enable prefix cache
94%	91%	100%	#39606 Pass extra_config to the constructor of LMCacheMPXXXAdapter	#39719 fix(lmcache): correct store for cached requests while enable prefix cache
94%	91%	100%	#39257 [XPU] update triton version for torch 2.11 upgrade	#39313 [XPU] upgrade to triton-xpu 3.7.0
91%	100%	67%	#39432 Gfx1250 wip	#39437 Gfx1250 wip rebase test
90%	92%	85%	#36823 [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace	#38775 [vLLM IR] 4/N Compile native implementation
90%	86%	100%	#39402 [kv_offload+HMA[10/N]: Support load with multiple KV groups	#39403 [kv_offload+HMA][11/N]: Support store with multiple KV groups
86%	98%	50%	#23995 Feature/deepseek v31 lora support	#39661 [DOC] Update Gemma 4
82%	76%	100%	#39110 [Core] Disable HMA for eagle/MTP with sliding window models	#39376 [Core] Disable HMA for eagle/MTP with sliding window models
82%	76%	100%	#39401 [kv_offload+HMA][9/N]: Support lookup with multiple KV groups	#39402 [kv_offload+HMA[10/N]: Support load with multiple KV groups
82%	76%	100%	#39401 [kv_offload+HMA][9/N]: Support lookup with multiple KV groups	#39403 [kv_offload+HMA][11/N]: Support store with multiple KV groups
80%	96%	33%	#26583 add log for request trace	#39646 V0.12.0 support n sampling delay split to eliminate redundant prefill computation and memory
79%	97%	22%	#35721 [LoRA] Support dual CUDA streams-Linear Layer	#37297 [LoRA] Support FP8 LoRA E2E inference-dense model
79%	94%	32%	#39153 [Frontend][4/n] Improve pooling entrypoints	pooling.
79%	74%	91%	#38775 [vLLM IR] 4/N Compile native implementation	#39453 Port activations to IR op 1/3
79%	88%	50%	#39312 [Mergify] Update model vendor auto-label rules	#39429 [CI/Build] Update auto-rebase rule
78%	100%	13%	#39723 [SimpleCPUOffloadConnector]: Add support for `reset_cache()`	#39726 [SimpleCPUOffloadConnector]: Add support for reset_cache()
77%	98%	14%	#38780 [vLLM IR][RMSNorm] Port GemmaRMSNorm to vLLM IR Ops	#38798 [vLLM IR][RMSNorm] Port RMSNormGated to vLLM IR Ops
77%	69%	100%	#39744 [v1] Expose num_prompt_tokens in CommonAttentionMetadata	#39745 [v1] Expose num_prompt_tokens in CommonAttentionMetadata
77%	81%	62%	#23133 Split compressed_tensors_moe.py into separate wna16, int8, fp8, nvfp4	#29427 [Refactor] Split up compressed_tensors_moe.py into separate files per method
76%	82%	59%	#39267 [vllm IR] 1/N Port FP8 Quantization to vLLM IR Ops	#39481 [vllm IR] Port FP8 Quantization to vLLM IR Ops

Similar Issues:

Repo: vllm-project/vllm
Issue count: 500
Candidate pairs: 9909
High-similarity pairs (>= 0.75): 12

Match Score	Desc Similarity	Title Overlap	Issue A	Issue B
100%	100%	100%	#39270 [Bug]: Qwen3.5 crashes when using suffix-decoding	#39271 [Bug]: Qwen3.5 crashes when using suffix-decoding
100%	100%	100%	#39372 [Bug]:	#39373 [Bug]:
100%	100%	100%	#39372 [Bug]:	#39374 [Bug]:
100%	100%	100%	#39373 [Bug]:	#39374 [Bug]:
100%	100%	100%	#39433 RFC: Add logit_scale to PoolerConfig for Affine Score Calibration (Platt Scaling)	#39434 [RFC]: Add logit_scale to PoolerConfig for Affine Score Calibration (Platt Scaling)
100%	100%	100%	#39299 [Performance] DSV3.2 Indexer: Overlap indexer k+w path
81%	95%	25%	#31888 [Usage]: rollout slow	#38642 [Usage]: Model return value reasoning_content
80%	88%	50%	#38734 [Transformers v5] SarvamMLAForCausalLM	#38740 [Transformers v5] NemotronParseForConditionalGeneration
79%	94%	20%	#29245 [Usage]: Starting qwen3 vl is extremely slow while sglang starts quickly; what could be the cause?	#38642 [Usage]: Model return value reasoning_content
77%	92%	17%	#29245 [Usage]: Starting qwen3 vl is extremely slow while sglang starts quickly; what could be the cause?	#31888 [Usage]: rollout slow
77%	89%	29%	#38384 [Transformers v5] Distributed shutdown test timetout	#38740 [Transformers v5] NemotronParseForConditionalGeneration
76%	88%	31%	#31661 [Bug]: jina-reranker-m0 [image_index] IndexError: list index out of range	#32151 [Bug]: jina-reranker-m0 infer error

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

</details>

Changed files

.github/workflows/detect-duplicate-issues.yml (added, +64/-0)
.github/workflows/detect-duplicate-prs.yml (added, +55/-0)
.github/workflows/scripts/detect_duplicate_issues.py (added, +453/-0)
.github/workflows/scripts/detect_duplicate_prs.py (added, +317/-0)

🚀 The feature, motivation and pitch

Motivation

In the DeepSeek-V3.2 attention layer, the indexer's k+w path (wk_weights_proj → k_norm) and the q path (QKV A Proj → q_a_rmsnorm → wq_b) both read hidden_states as input but have no data dependency between them. Currently they execute sequentially on the same CUDA stream. Overlapping them on separate streams hides the k+w path entirely behind the longer q path.

wk_weights_proj(hidden_states) reads the original layer input directly — it does NOT depend on QKV A Proj output. Only the q path depends on QKV A Proj (via q_c = q_a_layernorm(split(fused_qkv_a_proj(hidden_states)))). So the fork can happen as soon as hidden_states is available (after AR + Add + RMS), with wk_weights_proj running in parallel with QKV A Proj itself.

PR #38684 already fused wk + weights_proj into a single wk_weights_proj GEMM. This proposal is the natural next step: overlap the fused GEMM and its downstream ops with the q path on a secondary stream.

Current Execution

All operations run sequentially on the default CUDA stream:

Stream 19 (default) — all sequential:

 #4  QKV A Proj (fused_a_gemm)                    → produces qr + hidden_states
       ↓
 #5  wk_weights_proj (splitK)               4.8us   → reads hidden_states → kw
 #6  splitKreduce                           2.8us     kw split → k (128d), weights (64d)
 #7  k_norm (LayerNorm)                     2.2us   → k_norm(k), split → k_pe, k_nope
       ↓
 #8  q_a_rmsnorm + split qkv_lora          2.3us   → reads qr
 #9  kv_a_rmsnorm + split kv_lora          2.5us
#10  wq_b                                   7.2us   → reads q_c, split → q_pe, q_nope
       ↓
#11  Indexer RoPE(q+k) + splits + cats      2.0us   → rotary_emb(q_pe, k_pe) combined
#12  Indexer Q FP8 quant                    2.3us   → q → q_fp8, q_scale
#13  Indexer W scale                        1.4us   → weights * q_scale * scale
       ↓
       continues to q_b_proj → indexer op → MLA attention...

Proposed Execution

After AR + Add + RMS, fork into two streams. Both read hidden_states. The sync point is before the indexer's rotary_emb(q_pe, k_pe) call, which is a single kernel that processes both q and k and requires outputs from both streams:

 AR + Add + RMS — produces hidden_states
                    ┌──────────────────────────┴─────────────────────────┐
Default stream (q path):                    Aux stream (k+w path):
────────────────────────                    ──────────────────────
 #4  QKV A Proj               7.7us         #5  wk_weights_proj         4.8us
     split → q_c, kv_lora                   #6  splitKreduce            2.8us
 #8  q_a_rmsnorm + split      2.3us         #7  k_norm + split kw       2.2us
 #9  kv_a_rmsnorm + split     2.5us              → k_pe, k_nope, weights ready
#10  wq_b                     7.2us              (9.8us total)
     → q_pe, q_nope ready
     (19.7us total)
                    └──────────────────────────┬─────────────────────────┘
                                         sync (RoPE needs both q_pe + k_pe)
                                               ↓
                            #11  RoPE(q_pe, k_pe) combined          2.0us
                            #12  FP8 quant(q)                       2.3us
                            #13  W scale (weights * q_scale)        1.4us
                                 → continues to q_b_proj, indexer op...

With these timings, the aux stream (9.8us) should finish well before the default stream (19.7us), so the default stream should not stall at the sync point.
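
The timings above can be sanity-checked with a small timeline model. This is a sketch under idealized assumptions: kernels on different streams overlap perfectly, with no SM contention and no launch or synchronization overhead. The `Op` class and `simulate` helper are hypothetical names; the durations are the ones listed in the diagrams, with 7.7us for QKV A Proj taken from the proposed-execution diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    stream: str
    duration_us: float
    deps: tuple = ()


def simulate(ops):
    """Return the finish time (us) of each op; ops are listed in issue order."""
    stream_free = {}  # time at which each stream becomes idle
    finish = {}
    for op in ops:
        ready = max((finish[d] for d in op.deps), default=0.0)
        start = max(ready, stream_free.get(op.stream, 0.0))
        finish[op.name] = start + op.duration_us
        stream_free[op.stream] = finish[op.name]
    return finish


def build_ops(kw_stream):
    """Ops #4 to #11; the k+w path runs on `kw_stream`."""
    return [
        Op("qkv_a_proj", "default", 7.7),
        Op("wk_weights_proj", kw_stream, 4.8),
        Op("splitk_reduce", kw_stream, 2.8, ("wk_weights_proj",)),
        Op("k_norm", kw_stream, 2.2, ("splitk_reduce",)),
        Op("q_a_rmsnorm", "default", 2.3, ("qkv_a_proj",)),
        Op("kv_a_rmsnorm", "default", 2.5, ("qkv_a_proj",)),
        Op("wq_b", "default", 7.2, ("q_a_rmsnorm",)),
        # RoPE is the sync point: it needs q_pe (wq_b) and k_pe (k_norm).
        Op("rope", "default", 2.0, ("wq_b", "k_norm")),
    ]


sequential = simulate(build_ops("default"))
overlapped = simulate(build_ops("aux"))
saved_us = sequential["rope"] - overlapped["rope"]
print(f"ideal saving per layer: {saved_us:.1f}us")
```

In this model the saving equals the full length of the k+w path, because the aux stream finishes before `wq_b` does. Real savings are likely smaller if the two GEMMs compete for the same SMs.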

Data Dependency Analysis

After AR + Add + RMS produces hidden_states, two independent paths exist through to the sync point:

                 hidden_states (from AR + Add + RMS)
                   /                    \
         QKV A Proj (GEMM)         wk_weights_proj (GEMM)
                 |                       |
         split → q_c, kv_lora      k = kw[:,:128], weights = kw[:,128:]
                 |                       |
         q_a_rmsnorm (RMS)          k_norm (LayerNorm)
                 |                       |
            wq_b (GEMM)            split k → k_pe
                 |                       |
         split q → q_pe                 |
                 \                      /
              rotary_emb(q_pe, k_pe)  ← SYNC (single kernel, needs both)
                 /             \
        cat → q             cat → k
                 |
        FP8 quant(q) → q_scale
                 |
        W scale ← weights + q_scale
                 |
        indexer_op ← q_fp8, k, scaled_weights
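
The independence claim in the graph above can be checked mechanically. The sketch below encodes the edges up to the sync point as a plain dictionary (a hand transcription of the diagram, not something extracted from vLLM) and confirms that the two paths share no ancestor other than hidden_states. The `ancestors` and `independent` helpers are hypothetical.

```python
# Each node maps to the nodes it reads from (transcribed from the diagram).
DEPS = {
    "hidden_states": [],
    "qkv_a_proj": ["hidden_states"],
    "q_a_rmsnorm": ["qkv_a_proj"],
    "wq_b": ["q_a_rmsnorm"],
    "q_pe": ["wq_b"],
    "wk_weights_proj": ["hidden_states"],
    "k_norm": ["wk_weights_proj"],
    "k_pe": ["k_norm"],
    "rotary_emb": ["q_pe", "k_pe"],
}


def ancestors(node, deps):
    """All nodes that `node` transitively depends on."""
    seen = set()
    stack = list(deps[node])
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(deps[cur])
    return seen


def independent(a, b, deps):
    """True when neither node depends on the other."""
    return a not in ancestors(b, deps) and b not in ancestors(a, deps)


shared = ancestors("q_pe", DEPS) & ancestors("k_pe", DEPS)
print(sorted(shared))
```

Because the only shared ancestor is hidden_states, the fork can be placed right after it is produced, and rotary_emb is the first node that needs both paths.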

Considerations

Future extension: Splitting the rotary_emb call into separate q/k calls and extending the aux stream through k_quant_and_cache could push the sync point later and hide more work, at the cost of additional code changes.


extent analysis

TL;DR

To improve performance, overlap the k+w path and the q path on separate CUDA streams by forking after AR + Add + RMS, allowing wk_weights_proj to run in parallel with QKV A Proj.

Guidance

Identify the independent paths after AR + Add + RMS produces hidden_states: one for QKV A Proj and the other for wk_weights_proj.
Implement a fork into two streams at this point, ensuring both paths can execute concurrently without data dependency issues.
Verify that the aux stream (k+w path) completes before the default stream (q path) to avoid stalls at the sync point before rotary_emb(q_pe, k_pe).
Consider future extensions, such as splitting the rotary_emb call, to further optimize performance.

Example

The issue itself contains no code; its proposed execution plan illustrates how the k+w and q paths would run in parallel.
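
As a rough illustration of the fork/join pattern the guidance describes, here is a minimal PyTorch sketch. It is not the vLLM implementation: `fork_join`, `q_path`, and `kw_path` are hypothetical names, and the two callables stand in for the real q path and k+w path. It falls back to sequential execution when no aux stream is given or the input is not on a CUDA device.

```python
from typing import Any, Callable, Optional, Tuple

import torch


def fork_join(
    hidden_states: torch.Tensor,
    q_path: Callable[[torch.Tensor], Any],
    kw_path: Callable[[torch.Tensor], Any],
    aux_stream: Optional["torch.cuda.Stream"] = None,
) -> Tuple[Any, Any]:
    """Run q_path on the current stream and kw_path on aux_stream, then join."""
    if aux_stream is None or not hidden_states.is_cuda:
        # No second stream available: plain sequential execution.
        return q_path(hidden_states), kw_path(hidden_states)

    main_stream = torch.cuda.current_stream()
    # Fork: the aux stream must not start before hidden_states is produced.
    aux_stream.wait_stream(main_stream)
    with torch.cuda.stream(aux_stream):
        kw_out = kw_path(hidden_states)
    # The q path is enqueued on the main stream while the aux stream runs.
    q_out = q_path(hidden_states)
    # Join: the next kernel (RoPE in the issue) needs outputs of both paths.
    main_stream.wait_stream(aux_stream)
    return q_out, kw_out
```

A caller would create the aux stream once, for example with `torch.cuda.Stream()`, and reuse it across layers rather than allocating one per call. Whether the kernels actually overlap depends on how much of the GPU each one occupies.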

Notes

The benefit depends on the actual kernel execution times on the target hardware and on how efficiently the CUDA streams are managed. Profile the layer before and after the change to confirm that the kernels on the two streams actually overlap and that the sync point does not stall.

Recommendation

Apply the proposed optimization: fork execution into two streams after AR + Add + RMS so that the k+w and q paths run in parallel. By the issue's timings, this could hide the 9.8us k+w path behind the 19.7us q path.
