transformers - 💡(How to fix) Fix Discrepancy Between the Paper’s Claimed Sparse Attention Complexity and the HuggingFace Implementation Caused by Expanding KV to S * k

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
RAW_BUFFERClick to expand / collapse

In the HuggingFace Transformers implementation of DeepSeek V4, the Compressed Sparse Attention (CSA) module gathers the top‑k compressed blocks selected by the indexer for each query, flattens them into a tensor of shape [B, 1, S*k, head_dim], and then directly concatenates this tensor with the sliding‑window KV:

Final KV shape: [B, 1, S_sw + S*k, head_dim]

Attention weight matrix: [B, H, S, S_sw + S*k]

This leads to a computational cost of approximately O(S · (S_sw + S·k)) = O(S²·k). However, the DeepSeek V4 paper (§2.3.1) claims a sparse attention complexity of O(S · (S_sw + k)) — that is, each query only dot‑products with a fixed number of KV entries, without quadratic growth in the sequence length S.

The actual shape evolution in the code is: Total compressed blocks T → compressed_kv [B, 1, T, D] Indexer selects top‑k indices [B, S, k] Gather and flatten [B, 1, Sk, D] Concatenate with sliding‑window KV [B, 1, S_sw + Sk, D] Broadcast to all heads via repeat_kv [B, H, S_sw + Sk, D] q @ k^T [B, H, S, S_sw + Sk]

During training or long‑sequence prefill where S is large, the computational cost far exceeds the linear complexity promised by the paper.

Question: Is this implementation a reference solution intended only to work with standard attention APIs (matching theoretical complexity only when S=1, i.e., token‑by‑token decoding), or is there a plan to restore the O(S·(S_sw + k)) complexity later through block‑sparse kernels or FlashAttention sparse patterns?

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

transformers - 💡(How to fix) Fix Discrepancy Between the Paper’s Claimed Sparse Attention Complexity and the HuggingFace Implementation Caused by Expanding KV to S * k