transformers - 💡(How to fix) Fix Discrepancy Between the Paper’s Claimed Sparse Attention Complexity and the HuggingFace Implementation Caused by Expanding KV to S * k

transformers2026-05-13 01:26:16

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

RAW_BUFFERClick to expand / collapse

In the HuggingFace Transformers implementation of DeepSeek V4, the Compressed Sparse Attention (CSA) module gathers the top‑k compressed blocks selected by the indexer for each query, flattens them into a tensor of shape [B, 1, S*k, head_dim], and then directly concatenates this tensor with the sliding‑window KV:

Final KV shape: [B, 1, S_sw + S*k, head_dim]

Attention weight matrix: [B, H, S, S_sw + S*k]

This leads to a computational cost of approximately O(S · (S_sw + S·k)) = O(S²·k). However, the DeepSeek V4 paper (§2.3.1) claims a sparse attention complexity of O(S · (S_sw + k)) — that is, each query only dot‑products with a fixed number of KV entries, without quadratic growth in the sequence length S.

The actual shape evolution in the code is: Total compressed blocks T → compressed_kv [B, 1, T, D] Indexer selects top‑k indices [B, S, k] Gather and flatten [B, 1, Sk, D] Concatenate with sliding‑window KV [B, 1, S_sw + Sk, D] Broadcast to all heads via repeat_kv [B, H, S_sw + Sk, D] q @ k^T [B, H, S, S_sw + Sk]

During training or long‑sequence prefill where S is large, the computational cost far exceeds the linear complexity promised by the paper.

Question: Is this implementation a reference solution intended only to work with standard attention APIs (matching theoretical complexity only when S=1, i.e., token‑by‑token decoding), or is there a plan to restore the O(S·(S_sw + k)) complexity later through block‑sparse kernels or FlashAttention sparse patterns?

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #retrieval issue #search optimization #API routing #API middleware

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

transformers - 💡(How to fix) Fix Discrepancy Between the Paper’s Claimed Sparse Attention Complexity and the HuggingFace Implementation Caused by Expanding KV to S * k

Recommended Tools

GitHub issue graph ai analysis

Still need to ship something?

TRENDING

transformers - 💡(How to fix) Fix Discrepancy Between the Paper’s Claimed Sparse Attention Complexity and the HuggingFace Implementation Caused by Expanding KV to S * k

Recommended Tools

GitHub issue graph ai analysis

Still need to ship something?

RELATED_DISCOVERY

TRENDING