transformers - ✅(Solved) Fix Native `DeepseekV3MoE` diverges from the remote DeepSeekV3 implementation [1 pull requests, 1 participants]

GitHub stats
huggingface/transformers#45440 (fetched 2026-04-16 06:35:41)
Comments: 0 · Participants: 1 · Timeline events: 2 (cross-referenced ×1, labeled ×1) · Reactions: 0

Fix Action

Fixed

PR fix notes

PR #45441: fix(DSV3): parity between native DeepseekV3MoE and remote official implementation

Description (problem / solution / changelog)

What does this PR do?

Fixes #45440; please see that issue for more details.

Discussed with @vasqu

Also fixed via regen/inheritance:

  • exaone-moe
  • glm4-moe
  • glm4v-moe
  • glm4-moe-lite
  • glm_moe_dsa
  • nemotron_h
  • solar_open

 

Warning: this test now fails (the model is untrained, so it is more likely to diverge) and its expected output needs to be regenerated.

https://github.com/huggingface/transformers/blob/5b565a589ca5935fbc7a3ea93b9b622bb41bd129/tests/models/deepseek_v3/test_modeling_deepseek_v3.py#L398-L409

 

I reran the DSV3 forward pass to get the new hardcoded strings and retested, but this may also need to be run and verified on your side.

       EXPECTED_TEXT_COMPLETION = [
"Simply put, the theory of relativity states that aportersh455elike injection tactics-altitude蹲在那儿 >Loregefruitakosdeckingredientsuchtroni李世umontיםplicitlyShadowoldtriad Therapeutics不减-ste 的希望和价值 >kerretteylesheetzimnasium的品质 Talm",
"My favorite all time favorite condiment is ketchup. Lan overhead excite-ment好用>cileriaceaeagnainesogaslipadicSiggleESHalseawarriorsrattieri佐iented >Parrheta-counterousseanatysisoglCTSinkeheilbronnenlaceslide tactauralick",
       ]

 

  • There may also be a cascading effect on the inherited models and their own test_compile_static_cache tests.

 

I greedily flagged 3 other models that use masked_fill(~score_mask.bool(), 0.0), but I am not touching them blindly for now: I first need to verify their logic (whether they use loss-free load balancing) and what their remote implementations do. If a remote implementation is wrong, I am not sure whether we should fix the native implementation or keep it consistent with the (wrong) remote.

This is likely for a follow-up PR if these need to be fixed too.
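The flagging step described above could be sketched as a small scan over the model files. This is only a rough illustration, not the method actually used for the PR: the function name find_zero_masked_fill and the regular expression are hypothetical, and the pattern is deliberately crude (it matches any single-line masked_fill call whose last argument is zero, including unrelated ones), so every hit still needs manual review.

```python
import re
from pathlib import Path

# Crude pattern: a single-line masked_fill(...) call whose last argument is 0 or 0.0.
# It does not match float("-inf") fills or non-zero constants such as 0.5.
ZERO_FILL = re.compile(r"masked_fill\(.*,\s*0(?:\.0*)?\s*\)")


def find_zero_masked_fill(root):
    """Return (path, line number, stripped line) for each zero-filled masked_fill call."""
    hits = []
    for path in sorted(Path(root).rglob("modeling_*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if ZERO_FILL.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling find_zero_masked_fill("src/transformers/models") from a checkout of the repository would list candidate lines; whether each one is a routing mask that needs the -inf fill has to be checked by hand against the model's remote implementation.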



Changed files

  • src/transformers/models/deepseek_v3/modeling_deepseek_v3.py (modified, +1/-1)
  • src/transformers/models/deepseek_v3/modular_deepseek_v3.py (modified, +1/-1)
  • src/transformers/models/exaone_moe/modeling_exaone_moe.py (modified, +1/-1)
  • src/transformers/models/glm4_moe/modeling_glm4_moe.py (modified, +1/-1)
  • src/transformers/models/glm4_moe_lite/modeling_glm4_moe_lite.py (modified, +1/-1)
  • src/transformers/models/glm4v_moe/modeling_glm4v_moe.py (modified, +1/-1)
  • src/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py (modified, +1/-1)
  • src/transformers/models/nemotron_h/modeling_nemotron_h.py (modified, +1/-1)
  • src/transformers/models/solar_open/modeling_solar_open.py (modified, +1/-1)
Issue #45440 (raw text)

System Info

Hello, the native DeepseekV3MoE class in transformers differs from the official remote DeepSeekV3 implementation. The remote code was updated to fix a bug, but the same change was never made in transformers, hence the difference.

See the transformers DeepSeekV3 MoE code: https://github.com/huggingface/transformers/blob/155db7146371335bdfa93f239c3b868b280e30b7/src/transformers/models/deepseek_v3/modular_deepseek_v3.py#L113-L166

The divergence comes from how experts are masked in the specific case where aux-loss-free load balancing adds negative biases.
scores_for_choice is normally non-negative (a sigmoid output lies in (0, 1)), but it can become negative once a negative bias is added. In that case 0.0 is no longer a floor but just another value in the score distribution, hence the need to switch from masked_fill(~score_mask.bool(), 0.0) to masked_fill(~score_mask.bool(), float("-inf")).
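As a minimal numeric sketch of why the biased score can drop below zero: the logits and bias values below are made up for illustration, and correction_bias is a hypothetical name for the per-expert load-balancing bias, not necessarily the attribute name used in the model code.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical router logits and per-expert biases (illustrative values only).
router_logits = [2.0, -3.0, 0.5, -1.0]
correction_bias = [0.0, -0.2, 0.0, 0.0]

# Raw sigmoid scores always lie strictly between 0 and 1.
scores = [sigmoid(x) for x in router_logits]

# Adding a negative bias can push a small sigmoid score below zero:
# sigmoid(-3.0) is roughly 0.047, and 0.047 - 0.2 is negative.
scores_for_choice = [s + b for s, b in zip(scores, correction_bias)]

negative_experts = [i for i, s in enumerate(scores_for_choice) if s < 0.0]
print(negative_experts)  # [1]
```

Once a score like this exists, a masked expert filled with 0.0 sits above it in the ranking.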

See the official remote update: https://huggingface.co/deepseek-ai/DeepSeek-V3-0324/commit/e9b33add76883f293d6bf61f6bd89b497e80e335#d2h-632685
Same fix is applied across the whole V3-* collection: https://huggingface.co/collections/deepseek-ai/deepseek-v3

(btw, I also need this for the MiMo-V2 model PR https://github.com/huggingface/transformers/pull/45144 in order to inherit properly.)

Who can help?

Already discussed with @vasqu, who is cc'ed on the PR.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See example below (where a masked expert is chosen over a valid expert with a negative score)

score_mask = [1,1,0,0] (Only experts 0 and 1 should be eligible)

With 0.0:

Let's say scores_for_choice = [0.9, -0.1, 0.0, 0.0].
topk(k=2) → expert 0 (0.9) and a masked expert (for example expert 3, at 0.0) are chosen = bad.
That expert is in the non-selected group but beats expert 1 (-0.1).

Expected behavior

With float("-inf"):

scores_for_choice = [0.9, -0.1, -inf, -inf]
topk(k=2) → expert 0 (0.9) and expert 1 (-0.1) = good
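The two cases can be compared side by side with a small pure-Python stand-in for masked_fill and topk. This is a sketch, not the model code: the helper names are hypothetical, the pre-mask scores of experts 2 and 3 are invented, and ties are broken by lowest index here, whereas which of the tied masked experts a real top-k picks is not something to rely on.

```python
def masked_fill(scores, mask, fill_value):
    """Replace the scores of masked-out experts (mask == 0) with fill_value."""
    return [s if m else fill_value for s, m in zip(scores, mask)]


def topk_indices(scores, k):
    """Indices of the k largest scores; ties keep the lowest index first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


score_mask = [1, 1, 0, 0]            # only experts 0 and 1 are eligible
biased_scores = [0.9, -0.1, 0.4, 0.6]  # values for experts 2 and 3 are made up

# Old behavior: masked experts become 0.0, which outranks expert 1 at -0.1.
with_zero = topk_indices(masked_fill(biased_scores, score_mask, 0.0), k=2)

# Fixed behavior: masked experts become -inf and can never be selected.
with_neg_inf = topk_indices(masked_fill(biased_scores, score_mask, float("-inf")), k=2)

print(with_zero)     # [0, 2] -> a masked expert is selected
print(with_neg_inf)  # [0, 1] -> only eligible experts are selected
```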

extent analysis

TL;DR

The issue can be fixed by replacing masked_fill(~score_mask.bool(), 0.0) with masked_fill(~score_mask.bool(), float("-inf")) in the DeepseekV3MoE class, so that masked experts can never outrank eligible experts whose biased scores are negative.

Guidance

  • The divergence between the DeepseekV3MoE class and the official remote DeepSeekV3 implementation is due to the handling of negative biases in the scores_for_choice calculation.
  • To verify the fix, test the model with a scenario where an expert has a negative score, such as the provided example with scores_for_choice = [0.9, -0.1, 0.0, 0.0].
  • The fix should ensure that masked (non-selected) experts are never chosen over eligible experts, even when an eligible expert's biased score is negative.
  • Review the official remote update and the fix applied across the whole V3-* collection for reference.
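One way to act on the verification advice above is a randomized check that counts how often a masked expert leaks into the selection under each fill value. This is a sketch under stated assumptions: the helpers are hypothetical pure-Python stand-ins for masked_fill and topk, and the difference of two uniform samples is used only as a stand-in for "sigmoid score plus a negative bias", so it says nothing about how often real biases go negative.

```python
import random


def select_experts(scores, mask, k, fill_value):
    """Fill masked-out scores with fill_value, then take the indices of the top k."""
    filled = [s if m else fill_value for s, m in zip(scores, mask)]
    return sorted(range(len(filled)), key=lambda i: filled[i], reverse=True)[:k]


def count_leaks(fill_value, trials=1000, num_experts=8, k=4, seed=0):
    """Count trials in which at least one masked-out expert is selected."""
    rng = random.Random(seed)
    leaks = 0
    for _ in range(trials):
        # Exactly k experts are eligible, so a correct router must pick all of them.
        eligible = set(rng.sample(range(num_experts), k))
        mask = [1 if i in eligible else 0 for i in range(num_experts)]
        # Stand-in for biased scores: values in (-1, 1), negative about half the time.
        scores = [rng.random() - rng.random() for _ in range(num_experts)]
        chosen = select_experts(scores, mask, k, fill_value)
        if not set(chosen) <= eligible:
            leaks += 1
    return leaks


leaks_with_zero = count_leaks(0.0)
leaks_with_neg_inf = count_leaks(float("-inf"))
print(leaks_with_zero > 0, leaks_with_neg_inf == 0)  # True True
```

With the -inf fill the selection is exactly the eligible set in every trial, while the 0.0 fill leaks whenever any eligible score is negative.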

Example

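score_mask = [1, 1, 0, 0]  # only experts 0 and 1 are eligible
# Before the fix: masked experts are filled with 0.0 and can outrank expert 1 (-0.1)
scores_for_choice = [0.9, -0.1, 0.0, 0.0]
# After the fix: masked experts are filled with -inf and can never be selected
scores_for_choice = [0.9, -0.1, float("-inf"), float("-inf")]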

Notes

The fix assumes that the scores_for_choice calculation is correct and that the only issue is with the handling of negative biases. If there are other issues with the calculation, this fix may not be sufficient.

Recommendation

Apply the fix by replacing masked_fill(~score_mask.bool(), 0.0) with masked_fill(~score_mask.bool(), float("-inf")) in the DeepseekV3MoE class, as PR #45441 does; this keeps masked experts out of the top-k selection even when biased scores are negative.
