transformers - [DeepSeekV4] Potential RoPE theta mismatch between main attention and compressed KV branches

Description

I noticed a potential inconsistency between the official DeepSeekV4 inference/model.py implementation released on Hugging Face and the current transformers implementation in modeling_deepseek_v4.py.

In the official inference/model.py, the RoPE theta seems to be selected based on self.compress_ratio:

  • layers without compression, i.e. pure sliding-window attention, use rope_theta = 10000
  • layers with compression, i.e. CSA / HCA layers, use compress_rope_theta = 40000

As a result, in CSA / HCA layers, the main query, sliding-window KV, compressed KV, and indexer Q/K appear to share the same RoPE base.
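The per-layer selection described above can be sketched as follows. This is only an illustration of the behaviour as reported in this issue, not code copied from inference/model.py: the function names, the dictionary keys, and the convention that a missing or zero compress_ratio marks a pure sliding-window layer are all assumptions. The theta values are the ones quoted above.

```python
from typing import Dict, Optional

# Theta values as quoted in this issue for the official implementation.
ROPE_THETA = 10000.0
COMPRESS_ROPE_THETA = 40000.0


def select_rope_theta(compress_ratio: Optional[int]) -> float:
    """Pick one RoPE base for a whole layer (hypothetical helper).

    Assumption: a compress_ratio of None or 0 means the layer has no
    compression, i.e. it is a pure sliding-window attention layer.
    """
    if not compress_ratio:
        return ROPE_THETA
    return COMPRESS_ROPE_THETA


def layer_rope_bases(compress_ratio: Optional[int]) -> Dict[str, float]:
    """Every branch of the layer receives the same base (hypothetical helper)."""
    theta = select_rope_theta(compress_ratio)
    return {
        "main_query": theta,
        "sliding_window_kv": theta,
        "compressed_kv": theta,
        "indexer_qk": theta,
    }


# The ratio 4 is an arbitrary non-zero example, not a value from the model config.
print(layer_rope_bases(0))
print(layer_rope_bases(4))
```

Under these assumptions, a layer never sees two different bases, which is the property the rest of this report relies on.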

However, in the transformers implementation, DeepSeekV4 defines two RoPE types:

  • main, with rope_theta = 10000
  • compress, with compress_rope_theta = 160000

From my reading of the code:

  • the main attention query and the normal sliding-window KV use main RoPE
  • the HCA / CSA compressed KV uses compress RoPE
  • the CSA indexer query and indexer key also use compress RoPE

Therefore, in CSA / HCA layers, the final attention seems to mix KV entries encoded with different RoPE bases:

main query:         theta = 10000
sliding-window KV:  theta = 10000
compressed KV:      theta = 160000

Concern

My concern is about the inverse RoPE applied to the attention output.

DeepSeekV4 uses shared KV, so the same KV tensor acts as both key and value. Since the value part therefore carries RoPE-rotated channels, the attention output needs to be inverse-rotated.

However, if the attention output is aggregated from both:

  • sliding-window values rotated with theta = 10000
  • compressed values rotated with theta = 160000

then a single inverse rotation using the main RoPE theta may not exactly cancel the rotation applied to the compressed values.

In contrast, the official inference/model.py implementation appears to avoid this issue by using a unified RoPE theta for the whole CSA / HCA layer.
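The cancellation concern can be checked numerically with a small standalone sketch. It is deliberately simplified and is not the transformers code path: it rotates a single made-up value vector at a single position, ignores attention weighting, and represents each channel pair as a complex number (the actual pair layout in the model may differ). The helper names and the toy dimension of 8 are assumptions; only the theta values come from this issue.

```python
import cmath


def rope_angles(position: int, dim: int, theta: float) -> list:
    """Angle for each channel pair i: position * theta ** (-2i / dim)."""
    return [position * theta ** (-2.0 * i / dim) for i in range(dim // 2)]


def rotate(pairs: list, position: int, theta: float, inverse: bool = False) -> list:
    """Apply RoPE (or its inverse) to channel pairs stored as complex numbers."""
    dim = 2 * len(pairs)
    sign = -1.0 if inverse else 1.0
    angles = rope_angles(position, dim, theta)
    return [z * cmath.exp(1j * sign * a) for z, a in zip(pairs, angles)]


def max_abs_error(a: list, b: list) -> float:
    """Largest per-pair deviation between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Toy value vector with 4 channel pairs (dim = 8) at an arbitrary position.
value = [complex(1.0, 0.5), complex(-0.3, 0.8), complex(0.2, -0.7), complex(0.9, 0.1)]
position = 37

# Rotate and inverse-rotate with the same base: the rotation cancels.
matched = rotate(rotate(value, position, 10000.0), position, 10000.0, inverse=True)

# Rotate with the compress base, inverse-rotate with the main base.
mismatched = rotate(rotate(value, position, 160000.0), position, 10000.0, inverse=True)

matched_error = max_abs_error(matched, value)
mismatched_error = max_abs_error(mismatched, value)
print("matched bases, max error:   ", matched_error)
print("mismatched bases, max error:", mismatched_error)
```

In this toy setting the first pair still cancels, because its frequency is 1 for any base, while the remaining pairs keep a residual rotation that grows with position. Whether the residual matters in the real model depends on details not covered here.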

Questions

Could you clarify whether this difference is intentional? Is the transformers implementation expected to differ from the official inference/model.py implementation in this way?
