PR #45441: fix(DSV3): parity between native `DeepseekV3MoE` and remote official implementation


Repository: huggingface/transformers
Author: casinca
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45441

Description (problem / solution / changelog)

What does this PR do?

Please see fix #45440 for more details

Discussed with @vasqu

Also fixed via regen/inheritance:

exaone-moe
glm4-moe
glm4v-moe
glm4-moe-lite
glm_moe_dsa
nemotron_h
solar_open

[!WARNING]

This test now fails (the model is untrained, so its output is more likely to diverge) and its expected output needs to be regenerated.

https://github.com/huggingface/transformers/blob/5b565a589ca5935fbc7a3ea93b9b622bb41bd129/tests/models/deepseek_v3/test_modeling_deepseek_v3.py#L398-L409

I reran the DSV3 forward pass myself to get the new hardcoded strings and retested, but maybe this needs to be run and tested on your side too?
       EXPECTED_TEXT_COMPLETION = [
"Simply put, the theory of relativity states that aportersh455elike injection tactics-altitude蹲在那儿 >Loregefruitakosdeckingredientsuchtroni李世umontיםplicitlyShadowoldtriad Therapeutics不减-ste 的希望和价值 >kerretteylesheetzimnasium的品质 Talm",
"My favorite all time favorite condiment is ketchup. Lan overhead excite-ment好用>cileriaceaeagnainesogaslipadicSiggleESHalseawarriorsrattieri佐iented >Parrheta-counterousseanatysisoglCTSinkeheilbronnenlaceslide tactauralick",
       ]
Could there also be a cascading effect on the inherited models and their own test_compile_static_cache tests?

I greedily flagged 3 other models that use masked_fill(~score_mask.bool(), 0.0), but I'm not blindly touching these for now. I need to verify their logic first (whether they use loss-free load balancing) and what their remote implementations are doing. If a remote is wrong, I'm not sure whether we should fix the native implementation or keep it wrong but consistent with the remote.

This is likely for a follow-up PR if these need to be fixed too.

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by code agents. We are currently bottlenecked by our ability to review and respond to them. As a result, we ask that new users do not submit pure code agent PRs at this time. You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result, this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

I confirm that this is not a pure code agent PR.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Changed files

src/transformers/models/deepseek_v3/modeling_deepseek_v3.py (modified, +1/-1)
src/transformers/models/deepseek_v3/modular_deepseek_v3.py (modified, +1/-1)
src/transformers/models/exaone_moe/modeling_exaone_moe.py (modified, +1/-1)
src/transformers/models/glm4_moe/modeling_glm4_moe.py (modified, +1/-1)
src/transformers/models/glm4_moe_lite/modeling_glm4_moe_lite.py (modified, +1/-1)
src/transformers/models/glm4v_moe/modeling_glm4v_moe.py (modified, +1/-1)
src/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py (modified, +1/-1)
src/transformers/models/nemotron_h/modeling_nemotron_h.py (modified, +1/-1)
src/transformers/models/solar_open/modeling_solar_open.py (modified, +1/-1)

System Info

Hello, the DeepseekV3MoE class in transformers (native) differs from the official remote DeepSeekV3 implementation. The remote was updated to fix a bug, but transformers was not, hence the difference.

<details> <summary> See trf DeepSeekV3 MoE code </summary>

https://github.com/huggingface/transformers/blob/155db7146371335bdfa93f239c3b868b280e30b7/src/transformers/models/deepseek_v3/modular_deepseek_v3.py#L113-L166

</details>

The divergence comes from how experts are masked in the specific case of aux-loss-free load balancing when negative biases are added.
scores_for_choice is normally not negative (sigmoid outputs lie in [0, 1]), but it can become negative once a negative bias is added. So 0.0 is no longer a floor, just another value in the score distribution, hence the need to switch from masked_fill(~score_mask.bool(), 0.0) to masked_fill(~score_mask.bool(), -inf).
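
To make the point about the floor concrete, here is a minimal pure-Python sketch (no torch). The logits and bias values are made up for illustration and are not taken from any checkpoint. It only shows that a sigmoid score plus a negative bias can drop below 0.0.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; its output is strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical router logits and load-balancing biases for 4 experts.
router_logits = [2.0, -2.5, 0.3, -0.4]
bias = [0.0, -0.2, 0.0, 0.0]

gate_scores = [sigmoid(x) for x in router_logits]
# The bias is added only for expert selection, so this is what topk sees.
scores_for_choice = [s + b for s, b in zip(gate_scores, bias)]

for idx, (gate, choice) in enumerate(zip(gate_scores, scores_for_choice)):
    print(f"expert {idx}: sigmoid={gate:.3f} biased={choice:.3f}")
```

Expert 1 ends up with a biased score below zero, so a masked expert filled with 0.0 would outrank it.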

See the official remote update: https://huggingface.co/deepseek-ai/DeepSeek-V3-0324/commit/e9b33add76883f293d6bf61f6bd89b497e80e335#d2h-632685
Same fix is applied across the whole V3-* collection: https://huggingface.co/collections/deepseek-ai/deepseek-v3

(btw, I also need this for the MiMo-V2 model PR https://github.com/huggingface/transformers/pull/45144 in order to inherit properly.)

Who can help?

Already discussed with @vasqu, who is cc'ed in the PR.

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

See example below (where a masked expert is chosen over a valid expert with a negative score)

score_mask = [1,1,0,0] (Only experts 0 and 1 should be eligible)

With 0.0:

Let's say scores_for_choice = [0.9, -0.1, 0.0, 0.0]. topk(k=2) → expert 0 (0.9) and expert 3 (0.0) are chosen = bad. Expert 3 is in the non-selected group but beats expert 1 (-0.1).

Expected behavior

With float("-inf"):

scores_for_choice = [0.9, -0.1, -inf, -inf] topk(k=2) → expert 0 (0.9) and expert 1 (-0.1) = good
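
The two cases above can be reproduced with a small pure-Python sketch. masked_topk is a hypothetical helper written for this illustration, not the transformers code, and the raw scores of the masked experts (0.5 and 0.6) are made up. Experts 2 and 3 tie at 0.0 after masking, so which of them wins depends on tie-breaking (lowest index here, and the tie order in torch.topk may differ). Either way a masked expert is selected.

```python
def masked_topk(scores, score_mask, k, fill_value):
    """Return the indices of the k largest scores after masking ineligible experts."""
    masked = [s if keep else fill_value for s, keep in zip(scores, score_mask)]
    # Stable descending sort: tied scores keep the lower index first.
    order = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    return sorted(order[:k])


raw_scores = [0.9, -0.1, 0.5, 0.6]  # expert 1 is negative because of its bias
score_mask = [1, 1, 0, 0]           # only experts 0 and 1 are eligible

with_zero = masked_topk(raw_scores, score_mask, k=2, fill_value=0.0)
with_neg_inf = masked_topk(raw_scores, score_mask, k=2, fill_value=float("-inf"))

print("fill 0.0 :", with_zero)     # [0, 2] -> a masked expert is chosen
print("fill -inf:", with_neg_inf)  # [0, 1] -> only eligible experts are chosen
```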

transformers - ✅(Solved) Fix Native `DeepseekV3MoE` diverges from the remote DeepSeekV3 implementation [1 pull requests, 1 participants]
