transformers - ✅(Solved) Fix [Bug/Discussion] MLA q_a_layernorm Missing config.rms_norm_eps, Causing 1e-5/1e-6 Precision Error [2 pull requests, 9 comments, 6 participants]

GitHub stats
huggingface/transformers#44261 (fetched 2026-04-08 00:29:32)
Comments: 9
Participants: 6
Timeline events: 15
Reactions: 0
Timeline (top): commented ×9, cross-referenced ×2, labeled ×1, mentioned ×1

Fix Action

Fixed

PR fix notes

PR #8085: [megatron] support GLM-5 megatron

Description (problem / solution / changelog)

  1. https://github.com/huggingface/transformers/issues/44360
  2. https://github.com/huggingface/transformers/issues/44261
  3. https://github.com/huggingface/transformers/issues/44485
  4. causal attention_mask in indexer

For precision alignment issues, please refer to these three issues.

Currently, the megatron-swift implementation uses qk_layernorm eps of 1e-5, adds relu in the indexer, and sets rope_interleave to true.

Environment Setup

pip install git+https://github.com/NVIDIA/Megatron-LM.git
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform --no-build-isolation

Changed files

  • docs/source/Instruction/Supported-models-and-datasets.md (modified, +1/-1)
  • docs/source/Megatron-SWIFT/Command-line-parameters.md (modified, +5/-0)
  • docs/source/Megatron-SWIFT/Quick-start.md (modified, +1/-1)
  • docs/source_en/Instruction/Supported-models-and-datasets.md (modified, +1/-1)
  • docs/source_en/Megatron-SWIFT/Command-line-parameters.md (modified, +5/-0)
  • docs/source_en/Megatron-SWIFT/Quick-start.md (modified, +1/-1)
  • swift/megatron/arguments/megatron_args.py (modified, +4/-0)
  • swift/megatron/init.py (modified, +159/-0)
  • swift/megatron/model/gpt_bridge.py (modified, +29/-6)
  • swift/megatron/model/gpt_model.py (modified, +32/-4)
  • swift/megatron/model/gpts/__init__.py (modified, +1/-0)
  • swift/megatron/model/model_config.py (modified, +31/-3)
  • swift/megatron/model/register.py (modified, +17/-0)
  • swift/megatron/trainers/base.py (modified, +14/-0)

PR #44585: Fix missing rms_norm_eps in DeepseekV3 MLA layernorms

Description (problem / solution / changelog)

What does this PR do?

Passes eps=config.rms_norm_eps to both q_a_layernorm and kv_a_layernorm in the DeepseekV3 MLA attention module. Without this, these layernorms default to eps=1e-5 instead of the config value (1e-6), causing precision differences compared to vLLM and SGLang implementations.
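To see why a 1e-5 versus 1e-6 epsilon matters, here is a minimal pure-Python sketch of RMSNorm without a learned weight. The input vectors are hypothetical, and the function is an illustration of the formula, not the transformers implementation. It suggests that the gap between the two epsilons grows sharply as the activations get smaller, because eps then dominates the mean square.

```python
import math

def rms_norm(x, eps):
    """RMSNorm without a learned weight: x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(p - q) for p, q in zip(a, b))

# Hypothetical activations: one vector of ordinary magnitude, one scaled down by 1e-3.
normal = [0.5, -1.0, 0.25, 2.0]
tiny = [v * 1e-3 for v in normal]

diff_normal = max_abs_diff(rms_norm(normal, 1e-5), rms_norm(normal, 1e-6))
diff_tiny = max_abs_diff(rms_norm(tiny, 1e-5), rms_norm(tiny, 1e-6))

print(f"normal-scale input, max abs diff: {diff_normal:.3e}")
print(f"tiny-scale input,   max abs diff: {diff_tiny:.3e}")
```

With ordinary-magnitude inputs the difference stays small, which may explain why the mismatch shows up as a subtle cross-framework precision gap rather than an obvious failure.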

The fix was applied to modular_deepseek_v3.py and propagated to generated modeling files (deepseek_v3, glm4_moe_lite, longcat_flash, youtu) via make fix-repo.

Note: DeepseekV2 has the same issue but is left for a separate PR to keep this focused.
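Since the same omission can exist in other model files, a small scanner may help locate remaining call sites. The sketch below is hypothetical tooling, not part of the PR: it assumes each layernorm is constructed on a single line with no nested parentheses, and `ExampleRMSNorm` is a placeholder class name.

```python
import re

# Matches assignments such as `self.q_a_layernorm = SomeRMSNorm(args)`.
# Assumption: the constructor call fits on one line and has no nested parentheses.
_LAYERNORM_CALL = re.compile(
    r"self\.(q_a_layernorm|kv_a_layernorm)\s*=\s*\w+\(([^()]*)\)"
)

def find_missing_eps(source):
    """Return (line_number, attribute_name) for each MLA layernorm built without eps=."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = _LAYERNORM_CALL.search(line)
        if match and "eps=" not in match.group(2):
            hits.append((lineno, match.group(1)))
    return hits

# Placeholder source text; ExampleRMSNorm is not a real class.
sample = """
self.q_a_layernorm = ExampleRMSNorm(config.q_lora_rank)
self.kv_a_layernorm = ExampleRMSNorm(config.kv_lora_rank, eps=config.rms_norm_eps)
"""

print(find_missing_eps(sample))
```

To apply it to a real file, read the file contents into a string and pass that string to `find_missing_eps`; multi-line constructor calls would need a more robust parser, for example one built on the `ast` module.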

Fixes #44261

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. https://github.com/huggingface/transformers/issues/44261
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @Cyrilvallez (text models, attention)

This contribution was developed with AI assistance (Claude Code).

Changed files

  • src/transformers/models/deepseek_v3/modeling_deepseek_v3.py (modified, +2/-2)
  • src/transformers/models/deepseek_v3/modular_deepseek_v3.py (modified, +2/-2)
  • src/transformers/models/glm4_moe_lite/modeling_glm4_moe_lite.py (modified, +2/-2)
  • src/transformers/models/longcat_flash/modeling_longcat_flash.py (modified, +2/-2)
  • src/transformers/models/youtu/modeling_youtu.py (modified, +2/-2)

System Info

Hello! I noticed that the MLA implementations in transformers, vllm, sglang, and megatron differ slightly, which leads to precision errors (training, inference, RL, ...).

vllm: https://github.com/vllm-project/vllm/blob/a0c70816956298f7dd1d0cf47cfa1a169a413692/vllm/model_executor/models/deepseek_v2.py#L907

sglang: https://github.com/sgl-project/sglang/blob/e6ad58e5daa1476544f813da93f1f2f5078d387f/python/sglang/srt/models/deepseek_v2.py#L1129

transformers: https://github.com/huggingface/transformers/blob/e2bc54f29a58b2d2ee7e7d6eac949c959e063e0f/src/transformers/models/deepseek_v3/modular_deepseek_v3.py#L192

Who can help?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Expected behavior

extent analysis

Fix Plan

1. Pass the configured epsilon to the MLA layernorms in transformers

  • In src/transformers/models/deepseek_v3/modular_deepseek_v3.py, pass eps=config.rms_norm_eps when constructing q_a_layernorm and kv_a_layernorm, as PR #44585 does.
  • Run make fix-repo so the change propagates to the generated modeling files (deepseek_v3, glm4_moe_lite, longcat_flash, youtu).
# Before (sketch: the norm class name and size arguments are illustrative)
self.q_a_layernorm = DeepseekV3RMSNorm(config.q_lora_rank)
self.kv_a_layernorm = DeepseekV3RMSNorm(config.kv_lora_rank)

# After: both layernorms take their epsilon from the model config
self.q_a_layernorm = DeepseekV3RMSNorm(config.q_lora_rank, eps=config.rms_norm_eps)
self.kv_a_layernorm = DeepseekV3RMSNorm(config.kv_lora_rank, eps=config.rms_norm_eps)

2. Align the other implementations on the same epsilon

  • Treat the vllm and sglang lines linked in the issue as the reference behavior. PR #44585 aims to remove the precision difference against them, so they should need no change.
  • DeepseekV2 in transformers has the same issue and is left for a separate PR.
  • For Megatron, PR #8085 notes that megatron-swift currently uses a qk_layernorm eps of 1e-5; check that this matches the rms_norm_eps of the checkpoint you train.

3. Test the updated implementation

  • Run the example scripts with the updated implementation and compare outputs against vllm or sglang on identical inputs to verify that the precision errors are resolved.
  • Repeat the comparison with your own task or dataset to ensure that it works as expected.
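The verification step can be sketched as a tolerance check between two frameworks' outputs for the same input. This is a minimal illustration using only the standard library. The tolerances and sample values are assumptions chosen for the example, not thresholds taken from the issue.

```python
import math

def compare_outputs(reference, candidate, rel_tol=1e-6, abs_tol=1e-7):
    """Return (all_close, max_abs_diff) for two equal-length sequences of floats."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    pairs = list(zip(reference, candidate))
    max_diff = max((abs(r - c) for r, c in pairs), default=0.0)
    all_close = all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol) for r, c in pairs
    )
    return all_close, max_diff

# Hypothetical logits from two frameworks for the same prompt.
reference_logits = [1.0, -2.5, 0.125]
candidate_logits = [1.00001, -2.5, 0.125]

ok, worst = compare_outputs(reference_logits, candidate_logits)
print(f"within tolerance: {ok}, max abs diff: {worst:.3e}")
```

In practice the inputs would be flattened hidden states or logits dumped from each framework, and the tolerances would depend on the dtype in use (bf16 runs typically need far looser bounds than fp32).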
