transformers - ✅(Solved) Fix [Bug/Discussion] MLA q_a_layernorm Missing config.rms_norm_eps, Causing 1e-5/1e-6 Precision Error [2 pull requests, 9 comments, 6 participants]

transformers · 2026-02-24 16:14:57


GitHub stats

huggingface/transformers#44261•Fetched 2026-04-08 00:29:32

Timeline (top)

commented ×9, cross-referenced ×2, labeled ×1, mentioned ×1

Fix Action

Fixed

Fixed by PR: [megatron] support GLM-5 megatron (https://github.com/modelscope/ms-swift/pull/8085)
Fixed by PR: Fix missing rms_norm_eps in DeepseekV3 MLA layernorms (https://github.com/huggingface/transformers/pull/44585)

PR fix notes

PR #8085: [megatron] support GLM-5 megatron

Repository: modelscope/ms-swift
Author: Jintao-Huang
State: closed | merged: True
Link: https://github.com/modelscope/ms-swift/pull/8085

Description (problem / solution / changelog)

https://github.com/huggingface/transformers/issues/44360
https://github.com/huggingface/transformers/issues/44261
https://github.com/huggingface/transformers/issues/44485
causal attention_mask in indexer

For precision alignment issues, please refer to these three issues.

Currently, the megatron-swift implementation uses qk_layernorm eps of 1e-5, adds relu in the indexer, and sets rope_interleave to true.
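The rope_interleave setting mentioned above is one of the places where two otherwise identical attention implementations can silently diverge. The following is a minimal sketch of the general idea, not the megatron-swift code: it assumes "interleaved" means rotary pairs are adjacent elements (x[2i], x[2i+1]), while the rotate_half convention pairs x[i] with x[i + d/2]. All function names here are hypothetical.

```python
import math

def deinterleave(x):
    """Reorder [x0, x1, x2, x3, ...] to [x0, x2, ..., x1, x3, ...]."""
    return x[0::2] + x[1::2]

def rope_interleaved(x, angles):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * math.cos(theta) - b * math.sin(theta),
                a * math.sin(theta) + b * math.cos(theta)]
    return out

def rope_half_split(x, angles):
    """Rotate pairs (x[i], x[i + d/2]) by angles[i] (rotate_half convention)."""
    half = len(x) // 2
    first = [x[i] * math.cos(t) - x[i + half] * math.sin(t) for i, t in enumerate(angles)]
    second = [x[i] * math.sin(t) + x[i + half] * math.cos(t) for i, t in enumerate(angles)]
    return first + second

# Illustrative values only: one head vector of dimension 4, two rotation angles.
x = [1.0, 2.0, 3.0, 4.0]
angles = [0.3, 0.7]

# Both layouts agree once the input is reordered consistently.
a = deinterleave(rope_interleaved(x, angles))
b = rope_half_split(deinterleave(x), angles)

# Skipping the reorder rotates the wrong element pairs.
mismatch = rope_half_split(x, angles)

print(all(math.isclose(p, q) for p, q in zip(a, b)))
print(all(math.isclose(p, q) for p, q in zip(a, mismatch)))
```

If a checkpoint stores weights for one layout and the runtime assumes the other, the mismatch is much larger than an eps difference, so it is worth ruling out first when aligning precision.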

Environment Setup

pip install git+https://github.com/NVIDIA/Megatron-LM.git
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform --no-build-isolation

Changed files

docs/source/Instruction/Supported-models-and-datasets.md (modified, +1/-1)
docs/source/Megatron-SWIFT/Command-line-parameters.md (modified, +5/-0)
docs/source/Megatron-SWIFT/Quick-start.md (modified, +1/-1)
docs/source_en/Instruction/Supported-models-and-datasets.md (modified, +1/-1)
docs/source_en/Megatron-SWIFT/Command-line-parameters.md (modified, +5/-0)
docs/source_en/Megatron-SWIFT/Quick-start.md (modified, +1/-1)
swift/megatron/arguments/megatron_args.py (modified, +4/-0)
swift/megatron/init.py (modified, +159/-0)
swift/megatron/model/gpt_bridge.py (modified, +29/-6)
swift/megatron/model/gpt_model.py (modified, +32/-4)
swift/megatron/model/gpts/__init__.py (modified, +1/-0)
swift/megatron/model/model_config.py (modified, +31/-3)
swift/megatron/model/register.py (modified, +17/-0)
swift/megatron/trainers/base.py (modified, +14/-0)

PR #44585: Fix missing rms_norm_eps in DeepseekV3 MLA layernorms

Repository: huggingface/transformers
Author: mvanhorn
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/44585

Description (problem / solution / changelog)

What does this PR do?

Passes eps=config.rms_norm_eps to both q_a_layernorm and kv_a_layernorm in the DeepseekV3 MLA attention module. Without this, these layernorms default to eps=1e-5 instead of the config value (1e-6), causing precision differences compared to vLLM and SGLang implementations.

The fix was applied to modular_deepseek_v3.py and propagated to generated modeling files (deepseek_v3, glm4_moe_lite, longcat_flash, youtu) via make fix-repo.

Note: DeepseekV2 has the same issue but is left for a separate PR to keep this focused.
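To see why a 1e-5 versus 1e-6 epsilon matters, the sketch below evaluates a plain-Python RMSNorm with both values. It is an illustration, not the transformers implementation: the rms_norm function and the input values are hypothetical, and the inputs are deliberately small so that eps is comparable to the mean square of the activations.

```python
import math

def rms_norm(x, weight, eps):
    """Plain-Python RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

def eps_sensitivity(x):
    """Max relative difference between the eps=1e-5 and eps=1e-6 outputs."""
    weight = [1.0] * len(x)
    out_1e5 = rms_norm(x, weight, eps=1e-5)
    out_1e6 = rms_norm(x, weight, eps=1e-6)
    return max(abs(a - b) / abs(b) for a, b in zip(out_1e5, out_1e6))

# Hypothetical low-magnitude activations (mean square around 2e-4).
small = [0.01, -0.02, 0.015, 0.005]
# The same vector scaled by 100 (mean square around 2).
large = [v * 100 for v in small]

print(f"small activations: {eps_sensitivity(small):.3%}")
print(f"large activations: {eps_sensitivity(large):.6%}")
```

The difference is a few percent for the small inputs and negligible for the large ones, so the size of the error in a real model depends on the magnitude of the activations that reach q_a_layernorm and kv_a_layernorm.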

Fixes #44261

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. https://github.com/huggingface/transformers/issues/44261
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

@ArthurZucker @Cyrilvallez (text models, attention)

This contribution was developed with AI assistance (Claude Code).

Changed files

src/transformers/models/deepseek_v3/modeling_deepseek_v3.py (modified, +2/-2)
src/transformers/models/deepseek_v3/modular_deepseek_v3.py (modified, +2/-2)
src/transformers/models/glm4_moe_lite/modeling_glm4_moe_lite.py (modified, +2/-2)
src/transformers/models/longcat_flash/modeling_longcat_flash.py (modified, +2/-2)
src/transformers/models/youtu/modeling_youtu.py (modified, +2/-2)

RAW_BUFFER

System Info

Hello! I noticed that the MLA implementations in transformers, vLLM, SGLang, and Megatron differ slightly, which leads to precision errors (training, inference, RL, ...).

vllm: https://github.com/vllm-project/vllm/blob/a0c70816956298f7dd1d0cf47cfa1a169a413692/vllm/model_executor/models/deepseek_v2.py#L907

sglang: https://github.com/sgl-project/sglang/blob/e6ad58e5daa1476544f813da93f1f2f5078d387f/python/sglang/srt/models/deepseek_v2.py#L1129

transformers: https://github.com/huggingface/transformers/blob/e2bc54f29a58b2d2ee7e7d6eac949c959e063e0f/src/transformers/models/deepseek_v3/modular_deepseek_v3.py#L192

Who can help?

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Expected behavior

extent analysis

Fix Plan

1. Pass config.rms_norm_eps to the MLA layernorms in transformers

PR #44585 passes eps=config.rms_norm_eps to both q_a_layernorm and kv_a_layernorm in the DeepseekV3 MLA attention module, so they no longer fall back to the default eps of the RMSNorm class.
Make the change in transformers/models/deepseek_v3/modular_deepseek_v3.py, then run make fix-repo to propagate it to the generated modeling files (deepseek_v3, glm4_moe_lite, longcat_flash, youtu).

# Sketch of the changed lines; see the PR diff for the exact code.
# Before: both layernorms use the default eps of the RMSNorm class
self.q_a_layernorm = DeepseekV3RMSNorm(config.q_lora_rank)
self.kv_a_layernorm = DeepseekV3RMSNorm(config.kv_lora_rank)

# After: both layernorms take eps from the model config
self.q_a_layernorm = DeepseekV3RMSNorm(config.q_lora_rank, eps=config.rms_norm_eps)
self.kv_a_layernorm = DeepseekV3RMSNorm(config.kv_lora_rank, eps=config.rms_norm_eps)

2. Check vllm and sglang against the same config value

PR #44585 treats the vLLM and SGLang implementations as the point of comparison, so no change is planned there.
Use the lines linked in the issue (vllm/model_executor/models/deepseek_v2.py and sglang/srt/models/deepseek_v2.py) to confirm that both pass the config eps to the MLA layernorms.
DeepseekV2 in transformers has the same issue and is left for a separate PR.

3. Test the updated implementation

Run the example scripts with the updated implementation to verify that the precision errors are resolved.
Test the updated implementation with your own task or dataset to ensure that it works as expected.
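Part of the verification can be automated by listing every module whose epsilon differs from the config value. The helper below is a minimal sketch with a hypothetical name (find_eps_mismatches). It assumes norm layers expose their epsilon as a float attribute called variance_epsilon or eps, which should be checked against the actual classes in use. Stand-in objects with illustrative values replace a real model so the example runs without torch.

```python
from types import SimpleNamespace

def find_eps_mismatches(named_modules, expected_eps,
                        attr_names=("variance_epsilon", "eps")):
    """Return (name, actual_eps) for modules whose epsilon differs from expected_eps.

    named_modules: iterable of (name, module) pairs, e.g. model.named_modules().
    attr_names: attribute names to probe; which one a norm class uses is an assumption.
    """
    mismatches = []
    for name, module in named_modules:
        for attr in attr_names:
            actual = getattr(module, attr, None)
            if isinstance(actual, float):
                if actual != expected_eps:
                    mismatches.append((name, actual))
                break  # only the first epsilon-like attribute is checked
    return mismatches

# Stand-in modules; names and eps values are illustrative, not read from a real model.
fake_modules = [
    ("layers.0.self_attn.q_a_layernorm", SimpleNamespace(variance_epsilon=1e-6)),
    ("layers.0.self_attn.kv_a_layernorm", SimpleNamespace(variance_epsilon=1e-6)),
    ("layers.0.input_layernorm", SimpleNamespace(variance_epsilon=1e-5)),
    ("layers.0.self_attn.q_a_proj", SimpleNamespace()),  # no epsilon attribute, skipped
]

mismatches = find_eps_mismatches(fake_modules, expected_eps=1e-5)
for name, actual in mismatches:
    print(f"{name}: eps={actual}")
```

With a loaded model, the equivalent call would be find_eps_mismatches(model.named_modules(), model.config.rms_norm_eps). An empty result means every detected norm layer agrees with the config.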


