transformers - DeepSeek-V4 CSA eager path may not preserve per-query top-k masking for S > 1

huggingface/transformers#45758 (fetched 2026-05-04 04:58:12)

System Info

Observed on main after DeepSeek-V4 support was added in #45643.

Relevant file:

  • src/transformers/models/deepseek_v4/modeling_deepseek_v4.py
  • DeepseekV4CSACompressor.forward
  • DeepseekV4Attention.forward

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset

Reproduction

This is a code-reading report rather than a failing runtime script. I may be missing an intended invariant in the eager path, but the current tensor shapes look non-equivalent to per-query CSA sparse attention when seq_len > 1.

In DeepseekV4CSACompressor.forward, CSA first obtains per-query top-k compressed KV indices:

topk = self.indexer(hidden_states, q_residual, position_ids, past_key_values, layer_idx)  # [B, S, k]
expanded = compressed_kv.unsqueeze(2).expand(-1, -1, seq_len, -1, -1)                    # [B, 1, S, T, D]
idx = topk.unsqueeze(1).unsqueeze(-1).expand(-1, 1, -1, -1, self.head_dim)               # [B, 1, S, k, D]
return torch.gather(expanded, 3, idx).reshape(batch, 1, -1, self.head_dim)               # [B, 1, S*k, D]

Conceptually, the gathered tensor before reshape is per-query selected compressed KV:

selected[b, 0, t, :, :] = compressed entries selected for query t
shape: [B, 1, S, k, D]

After .reshape(batch, 1, -1, D), the selected entries for all query positions are flattened into one KV axis:

shape: [B, 1, S*k, D]
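The flattening can be checked on toy tensors. This is a minimal sketch, not the model code: the sizes, the hand-picked top-k indices, and the `owner` bookkeeping vector are all assumptions made for illustration. Only the gather/reshape pattern is taken from the snippet above.

```python
import torch

B, S, T, k, D = 1, 3, 5, 2, 4  # toy sizes; T = number of compressed entries

# Toy compressed KV: entry j is filled with the value j so it stays identifiable.
compressed_kv = torch.arange(T, dtype=torch.float32).view(1, 1, T, 1).expand(B, 1, T, D)

# Hypothetical per-query top-k indices, shape [B, S, k].
topk = torch.tensor([[[0, 1], [1, 2], [3, 4]]])

expanded = compressed_kv.unsqueeze(2).expand(-1, -1, S, -1, -1)   # [B, 1, S, T, D]
idx = topk.unsqueeze(1).unsqueeze(-1).expand(-1, 1, -1, -1, D)    # [B, 1, S, k, D]
selected = torch.gather(expanded, 3, idx)                         # [B, 1, S, k, D]
flat = selected.reshape(B, 1, -1, D)                              # [B, 1, S*k, D]

# Row-major flattening: query t owns flat[:, :, t*k:(t+1)*k].
owner = torch.arange(S).repeat_interleave(k)                      # [S*k]

print(selected.shape, flat.shape)
print(flat[0, 0, :, 0].tolist(), owner.tolist())
```

After the reshape, the query axis survives only implicitly, as the position of each entry inside the flattened axis. Nothing in the tensor itself records which query selected which entry.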

Then DeepseekV4Attention.forward concatenates those entries after the sliding-window KV branch:

compressed_kv = self.compressor(...)
kv = torch.cat([kv, compressed_kv], dim=2)

if isinstance(attention_mask, torch.Tensor) and kv.shape[2] > attention_mask.shape[-1]:
    attention_mask = F.pad(attention_mask, (0, kv.shape[2] - attention_mask.shape[-1]), value=0.0)

The dense mask is right-padded with 0.0, which appears to make the entire flattened compressed segment visible to every query. For S > 1, that means query t0 can attend to compressed entries that were selected for query t1, t2, etc.
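The effect of the padding can be reproduced in isolation. The sketch below assumes an additive mask convention (0.0 = visible, -inf = hidden) and a toy causal window mask of shape [1, 1, S, W]; the window length `W` and the mask layout are illustrative assumptions, not values read from the model.

```python
import torch
import torch.nn.functional as F

S, W, k = 3, 3, 2  # W = sliding-window KV length (toy value)

# Additive causal mask over the window branch: -inf above the diagonal, 0.0 elsewhere.
window_mask = torch.full((1, 1, S, W), float("-inf")).triu(1)

# Same padding pattern as in the attention forward: extend the last dim with 0.0.
kv_len = W + S * k
padded = F.pad(window_mask, (0, kv_len - window_mask.shape[-1]), value=0.0)

# Visibility of the flattened compressed segment, per query row.
visible = padded[0, 0, :, W:] == 0.0          # [S, S*k]
print(visible.sum(dim=-1).tolist())           # entries visible to each query
```

Under these assumptions every query row sees all S*k compressed entries rather than its own k.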

A minimal shape example:

B = 1, S = 3, k = 2

logical selected entries:
  q0 -> [A0, A1]
  q1 -> [B0, B1]
  q2 -> [C0, C1]

after flatten:
  [A0, A1, B0, B1, C0, C1]

if the compressed segment is mask-padded with 0.0:
  q0 can attend to A*, B*, C*
  q1 can attend to A*, B*, C*
  q2 can attend to A*, B*, C*

That seems different from query-specific sparse attention unless an additional block mask maps each query t to only its own flattened segment [t*k : (t+1)*k].
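Such a block mask could be built without a Python loop. The helper name `csa_block_mask` is hypothetical, and the additive 0.0 / -inf convention is an assumption carried over from the `F.pad(..., value=0.0)` call above. This is a sketch of the idea, not a patch against the model.

```python
import torch

def csa_block_mask(seq_len, k, dtype=torch.float32):
    """Hypothetical helper: additive mask of shape [S, S*k].

    0.0 on each query's own flattened segment [t*k, (t+1)*k), -inf elsewhere.
    """
    owner = torch.arange(seq_len).repeat_interleave(k)                 # [S*k]
    own = owner.unsqueeze(0) == torch.arange(seq_len).unsqueeze(1)     # [S, S*k]
    mask = torch.full((seq_len, seq_len * k), float("-inf"), dtype=dtype)
    return mask.masked_fill(own, 0.0)

# Reproduce the toy example: B = 1, S = 3, k = 2.
mask = csa_block_mask(3, 2)
labels = ["A0", "A1", "B0", "B1", "C0", "C1"]
visible = {
    f"q{t}": [labels[j] for j in range(6) if mask[t, j] == 0.0]
    for t in range(3)
}
print(visible)
```

Concatenating this mask to the window mask along the last dimension, in place of the zero padding, would restrict each query to its own segment.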

There is a second related question around causal visibility. The paper describes the index score between a query token t and a preceding compressed block s (s < floor(t / m)). In the current eager code, I do not see an explicit query-dependent visible-range mask before:

return index_scores.topk(topk, dim=-1).indices

So for multi-token prefill/training-style forwards, it is not obvious where future-containing compressed blocks are excluded from top-k selection.
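One way such a visible-range constraint might look is sketched below. The function name `causal_topk`, its arguments, and the reading of m as the compression ratio are assumptions for illustration. Only the condition s < floor(t / m) comes from the description above. The sketch also returns a validity flag, because early queries can have fewer than k visible blocks.

```python
import torch

def causal_topk(index_scores, positions, m, topk):
    """Hypothetical sketch: top-k over compressed blocks s with s < floor(t / m).

    index_scores: [B, S, T] scores between each query and each compressed block.
    positions:    [B, S] absolute query positions t (integer tensor).
    Returns indices [B, S, k] and a bool tensor [B, S, k] marking valid picks.
    """
    T = index_scores.shape[-1]
    limit = torch.div(positions, m, rounding_mode="floor").unsqueeze(-1)  # [B, S, 1]
    block_ids = torch.arange(T, device=index_scores.device)               # [T]
    visible = block_ids < limit                                           # [B, S, T]
    masked = index_scores.masked_fill(~visible, float("-inf"))
    values, indices = masked.topk(min(topk, T), dim=-1)
    valid = torch.isfinite(values)  # False where only hidden blocks were left
    return indices, valid

# Toy check: scores grow with the block id, so an unmasked top-1 always picks block 2.
scores = torch.arange(3, dtype=torch.float32).expand(1, 6, 3)
positions = torch.arange(6).unsqueeze(0)
indices, valid = causal_topk(scores, positions, m=2, topk=1)
print(indices[valid].tolist(), valid[0, :, 0].tolist())
```

Picks flagged as invalid would still need to be hidden in the attention mask, since `topk` always returns k indices even when fewer blocks are visible.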

Expected behavior

For CSA with seq_len > 1, each query position should attend only to the compressed KV entries selected for that same query, plus the intended sliding-window KV entries.

Equivalently, one of these should hold:

  1. the selected compressed KV remains logically shaped as [B, H_kv, S, k, D] and the attention kernel consumes it per query;
  2. the flattened [B, H_kv, S*k, D] layout is paired with a query-dependent block mask so query t only sees its own segment; or
  3. the eager path is documented/guarded as decode-only for CSA (S == 1) if that is the intended supported usage.
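Options 1 and 2 can be compared numerically on random data. The sketch below is a simplified single-head setting with no sliding-window branch, and it reuses the selected entries as both keys and values. These simplifications are assumptions for illustration and do not mirror the model's attention.

```python
import torch

torch.manual_seed(0)
B, S, k, D = 2, 3, 2, 8
q = torch.randn(B, S, D)
selected = torch.randn(B, S, k, D)  # per-query selected compressed KV
scale = D ** -0.5

# Option 1: consume the [B, S, k, D] layout per query.
scores_pq = torch.einsum("bsd,bskd->bsk", q, selected) * scale
out_pq = torch.einsum("bsk,bskd->bsd", scores_pq.softmax(-1), selected)

# Option 2: flatten to [B, S*k, D] and add a query-dependent block mask.
flat = selected.reshape(B, S * k, D)
owner = torch.arange(S).repeat_interleave(k)                       # [S*k]
own = owner.unsqueeze(0) == torch.arange(S).unsqueeze(1)           # [S, S*k]
block_mask = torch.full((S, S * k), float("-inf")).masked_fill(own, 0.0)
scores_flat = torch.einsum("bsd,bnd->bsn", q, flat) * scale
out_masked = torch.einsum("bsn,bnd->bsd", (scores_flat + block_mask).softmax(-1), flat)

# Flattened layout with every compressed entry visible (zero padding).
out_unmasked = torch.einsum("bsn,bnd->bsd", scores_flat.softmax(-1), flat)

print(torch.allclose(out_pq, out_masked, atol=1e-5))
print(torch.allclose(out_pq, out_unmasked, atol=1e-4))
```

In this setting the block-masked flattened layout matches the per-query computation, while the fully visible layout generally does not.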

If the current implementation relies on some invariant that makes this equivalent, could you point me to where that masking/visibility constraint is enforced?

extent analysis

TL;DR

If the code reading in the report is correct, one fix is a query-dependent block mask over the flattened compressed segment, so that each query position attends only to the compressed KV entries selected for that same query.

Guidance

  • Review the DeepseekV4CSACompressor.forward method to understand how the compressed KV indices are obtained and how they are used in the attention mechanism.
  • Consider introducing a block mask that maps each query t to only its own flattened segment [t*k : (t+1)*k] to enforce query-specific sparse attention.
  • Investigate the causal visibility of compressed blocks and ensure that future-containing blocks are excluded from top-k selection for each query token.
  • Verify that the attention kernel consumes the selected compressed KV per query, or that the flattened layout is paired with a query-dependent block mask.

Example

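# Sketch of a block mask for the flattened compressed segment.
# Assumes attention_mask is additive (0.0 = visible, -inf = hidden)
# with shape [B, 1, S, T_window]; batch, seq_len, k, device come from context.
block_mask = torch.full((batch, 1, seq_len, seq_len * k), float("-inf"), device=device)
for t in range(seq_len):
    # Query t may only see its own segment [t*k, (t+1)*k).
    block_mask[:, :, t, t*k:(t+1)*k] = 0.0
# Append to the window mask in place of the 0.0 padding, so it is applied before softmax.
attention_mask = torch.cat([attention_mask, block_mask.to(attention_mask.dtype)], dim=-1)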

Notes

The current implementation may rely on an invariant that makes the eager path equivalent to the expected behavior, but it is not immediately clear where this masking/visibility constraint is enforced. Further investigation is needed to determine the root cause of the issue.

Recommendation

If the report holds, replace the 0.0 padding over the flattened compressed segment with a query-dependent block mask, so that each query position attends only to the compressed KV entries selected for it. The mask should be additive and applied before the softmax. Zeroing attention weights after the softmax would leave each row unnormalized.
