transformers - 💡(How to fix) Fix [DeepSeekV4] Potential RoPE theta mismatch between main attention and compressed KV branches

transformers · 2026-05-12 03:46:09

I noticed a potential inconsistency between the official DeepSeekV4 inference/model.py implementation released on Hugging Face and the current transformers implementation in modeling_deepseek_v4.py.

In the official inference/model.py, the RoPE theta seems to be selected based on self.compress_ratio:

layers without compression, i.e. pure sliding-window attention, use rope_theta = 10000
layers with compression, i.e. CSA / HCA layers, use compress_rope_theta = 40000

As a result, in CSA / HCA layers, the main query, sliding-window KV, compressed KV, and indexer Q/K appear to share the same RoPE base.
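As a rough illustration of the per-layer selection described above, the sketch below picks a single RoPE base from a layer's compression ratio. The `RopeBases` dataclass, the `select_rope_theta` helper, and the convention that a ratio of 0 means "no compression" are assumptions made for this example, not code taken from inference/model.py. The numeric defaults are the values quoted in this issue.

```python
from dataclasses import dataclass


@dataclass
class RopeBases:
    # Values as quoted in this issue for the official implementation.
    rope_theta: float = 10000.0
    compress_rope_theta: float = 40000.0


def select_rope_theta(compress_ratio: int, bases: RopeBases) -> float:
    """Hypothetical helper: return one RoPE base for the whole layer.

    Assumption: compress_ratio == 0 marks a pure sliding-window layer,
    and any non-zero ratio marks a CSA / HCA layer.
    """
    if compress_ratio:
        return bases.compress_rope_theta
    return bases.rope_theta


if __name__ == "__main__":
    bases = RopeBases()
    # The ratios below are placeholders chosen only for illustration.
    for ratio in (0, 4, 128):
        print(ratio, select_rope_theta(ratio, bases))
```

With a selection like this, every branch inside a given layer (query, sliding-window KV, compressed KV, indexer Q/K) would read the same base, which is the behavior the issue attributes to the official code.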

However, in the transformers implementation, DeepSeekV4 defines two RoPE types:

main, with rope_theta = 10000
compress, with compress_rope_theta = 160000

From my reading of the code:

the main attention query and the normal sliding-window KV use main RoPE
the HCA / CSA compressed KV uses compress RoPE
the CSA indexer query and indexer key also use compress RoPE

Therefore, in CSA / HCA layers, the final attention seems to mix KV entries encoded with different RoPE bases:

main query:         theta = 10000
sliding-window KV:  theta = 10000
compressed KV:      theta = 160000
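To make the difference between the two bases concrete, here is a minimal sketch that computes the standard RoPE inverse frequencies, `theta ** (-2i / d)`, for both values. The rotary dimension of 64 is an assumption chosen for readable numbers, and `rope_inv_freq` is a hypothetical helper, not a function from transformers.

```python
def rope_inv_freq(theta: float, rotary_dim: int) -> list[float]:
    """Standard RoPE inverse frequencies, one per rotated channel pair."""
    return [theta ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]


if __name__ == "__main__":
    rotary_dim = 64  # assumed for illustration only
    main = rope_inv_freq(10000.0, rotary_dim)
    compress = rope_inv_freq(160000.0, rotary_dim)
    # Pair 0 always has frequency 1.0; every other pair rotates at a
    # different rate under the two bases.
    for i in (0, 8, 31):
        print(i, main[i], compress[i])
```

Only the first channel pair agrees between the two bases. For every other pair, the same token position maps to a different rotation angle depending on which base encoded it.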

Concern

My concern is about the inverse RoPE applied to the attention output.

DeepSeekV4 uses shared KV, so the KV tensor acts both as key and value. Since the value part carries RoPE-rotated channels, the attention output needs to be inverse-rotated.

However, if the attention output is aggregated from both:

sliding-window values rotated with theta = 10000
compressed values rotated with theta = 160000

then a single inverse rotation using the main RoPE theta may not exactly cancel the rotation applied to the compressed values.
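The concern can be checked numerically on a single channel pair. The sketch below is a deliberately simplified model, not the transformers code path: it rotates one 2-D pair forward with one base and backward with another at the same position, then measures how far the result lands from the original vector. All helper names, the rotary dimension of 64, the pair index, and the position are assumptions for illustration.

```python
import math


def rope_angle(position: int, pair_index: int, rotary_dim: int, theta: float) -> float:
    """Rotation angle RoPE applies to one channel pair at a given position."""
    return position * theta ** (-2 * pair_index / rotary_dim)


def rotate_pair(x: float, y: float, angle: float) -> tuple[float, float]:
    """Rotate the 2-D vector (x, y) counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c


def round_trip_error(position: int, pair_index: int, rotary_dim: int,
                     forward_theta: float, inverse_theta: float) -> float:
    """Distance from the original vector after forward + inverse rotation."""
    x, y = 1.0, 0.0
    fwd = rope_angle(position, pair_index, rotary_dim, forward_theta)
    inv = rope_angle(position, pair_index, rotary_dim, inverse_theta)
    fx, fy = rotate_pair(x, y, fwd)
    bx, by = rotate_pair(fx, fy, -inv)
    return math.hypot(bx - x, by - y)


if __name__ == "__main__":
    # Same base forward and backward: the rotation cancels (up to rounding).
    print(round_trip_error(100, 8, 64, 10000.0, 10000.0))
    # Forward with the compress base, inverse with the main base: it does not.
    print(round_trip_error(100, 8, 64, 160000.0, 10000.0))
```

In this toy setting the mismatched case leaves a residual rotation that grows with position, which is the effect the issue is asking about. Whether it matters in the real model depends on how the implementation applies the inverse rotation, which this sketch does not attempt to reproduce.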

In contrast, the official inference/model.py implementation appears to avoid this issue by using a unified RoPE theta for the whole CSA / HCA layer.

Questions

Could you clarify whether this difference is intentional? Is the transformers implementation expected to differ from the official inference/model.py implementation in this way?
