transformers - ✅(Solved) Fix [Bug/Discussion] The DSA indexer lacks a ReLU [2 pull requests, 2 comments, 2 participants]

yangdsh · 2026-02-28T19:25:43Z



GitHub stats

huggingface/transformers#44360 • Fetched 2026-04-08 00:29:02

Author: yangdsh
Participants: Rocketknight1, yangdsh
Timeline (top): cross-referenced ×3, mentioned ×3, subscribed ×3, commented ×2

Fix Action

Fixed

Fixed by PR: [megatron] support GLM-5 megatron (https://github.com/modelscope/ms-swift/pull/8085)
Fixed by PR: Fix glm dsa (https://github.com/huggingface/transformers/pull/44564)

PR fix notes

PR #8085: [megatron] support GLM-5 megatron

Repository: modelscope/ms-swift
Author: Jintao-Huang
State: closed | merged: True
Link: https://github.com/modelscope/ms-swift/pull/8085

Description (problem / solution / changelog)

1. https://github.com/huggingface/transformers/issues/44360
2. https://github.com/huggingface/transformers/issues/44261
3. https://github.com/huggingface/transformers/issues/44485
4. causal attention_mask in indexer

For precision alignment issues, please refer to these three issues.

Currently, the megatron-swift implementation uses qk_layernorm eps of 1e-5, adds relu in the indexer, and sets rope_interleave to true.

Environment Setup

pip install git+https://github.com/NVIDIA/Megatron-LM.git
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform --no-build-isolation

Changed files

docs/source/Instruction/Supported-models-and-datasets.md (modified, +1/-1)
docs/source/Megatron-SWIFT/Command-line-parameters.md (modified, +5/-0)
docs/source/Megatron-SWIFT/Quick-start.md (modified, +1/-1)
docs/source_en/Instruction/Supported-models-and-datasets.md (modified, +1/-1)
docs/source_en/Megatron-SWIFT/Command-line-parameters.md (modified, +5/-0)
docs/source_en/Megatron-SWIFT/Quick-start.md (modified, +1/-1)
swift/megatron/arguments/megatron_args.py (modified, +4/-0)
swift/megatron/init.py (modified, +159/-0)
swift/megatron/model/gpt_bridge.py (modified, +29/-6)
swift/megatron/model/gpt_model.py (modified, +32/-4)
swift/megatron/model/gpts/__init__.py (modified, +1/-0)
swift/megatron/model/model_config.py (modified, +31/-3)
swift/megatron/model/register.py (modified, +17/-0)
swift/megatron/trainers/base.py (modified, +14/-0)

PR #44564: Fix glm dsa

Repository: huggingface/transformers
Author: ArthurZucker
State: closed | merged: True
Link: https://github.com/huggingface/transformers/pull/44564

Description (problem / solution / changelog)

What does this PR do?

Fixes #44360

Changed files

src/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py (modified, +1/-1)
src/transformers/models/glm_moe_dsa/modular_glm_moe_dsa.py (modified, +1/-1)


System Info

The model structure of the GLM-MOE-DSA indexer lacks a ReLU here (https://github.com/zRzRzRzRzRzRzR/transformers/blob/4ca30213c6f7aa84b55c280e02730fe14d33dac5/src/transformers/models/glm_moe_dsa/modular_glm_moe_dsa.py#L403) compared to the reference implementation (https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/inference/kernel.py#L241)

Who can help?

@JaredforReal

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

N/A

Expected behavior

Add ReLU

extent analysis

Fix Plan

Add ReLU Activation Function

To fix the issue, we need to add a ReLU (Rectified Linear Unit) activation function to the model structure of the GLM-MOE-DSA indexer.
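For context, the issue itself does not spell out where the ReLU sits. As we understand the DeepSeek-V3.2-Exp reference it links to, the indexer score between a query token t and a key token s can be sketched as follows. The notation is an assumption introduced here, not taken from the issue: H is the number of indexer heads, q_{t,j} the indexer query for head j, k_s the indexer key, and w_{t,j} a per-head weight.

```latex
I_{t,s} = \sum_{j=1}^{H} w_{t,j} \cdot \mathrm{ReLU}\left( q_{t,j} \cdot k_{s} \right)
```

Under this reading, dropping the ReLU lets negative per-head dot products subtract from the score instead of contributing zero, which is the mismatch the issue reports.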

Step-by-Step Solution

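Locate the relevant code: Open modular_glm_moe_dsa.py in the transformers repository. The issue points to line 403 at commit 4ca30213c6f7aa84b55c280e02730fe14d33dac5 of the linked fork; the line number may differ on other revisions.
Add ReLU activation function: ReLU is a function applied to tensors, so apply it to the indexer's per-head query-key scores in the forward computation, as the linked reference implementation does, rather than to a layer object. The variable name below is a placeholder, not the identifier used in the file:

import torch.nn.functional as F

# ...

# scores: per-head query-key dot products in the indexer (placeholder name)
scores = F.relu(scores)  # Add ReLU activation

Update both files: PR #44564 changed one line in each of modeling_glm_moe_dsa.py and modular_glm_moe_dsa.py, so the two files should stay in sync.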

Example Code

Here's an illustrative snippet that includes the ReLU activation function. The class name and tensor shapes are placeholders, not the actual transformers code:

import torch

class ModularGLMMOEDSA(torch.nn.Module):  # placeholder name
    def forward(self, q, k, weights):
        # q: [tokens, heads, dim], k: [keys, dim], weights: [tokens, heads]
        scores = torch.einsum("thd,sd->ths", q, k)  # per-head scores
        scores = torch.nn.functional.relu(scores)  # Add ReLU activation
        return torch.einsum("ths,th->ts", scores, weights)  # sum over heads

Verification

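To verify the fix, run the model with the updated code and compare the indexer's scores (or the resulting token selection) against the reference implementation on the same inputs. A small script or unit test is enough: with the ReLU in place, negative per-head scores should contribute zero to the index score.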
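One way to make this check concrete is to compare index scores computed with and without the ReLU on a small hand-built input. The sketch below is a hypothetical stand-alone function, not the transformers implementation; the tensor layout and the names index_scores, q, k, and weights are assumptions chosen for illustration.

```python
import torch


def index_scores(q, k, weights, use_relu=True):
    """Toy indexer score: sum over heads of weights * act(q . k).

    Assumed shapes (illustrative only):
      q:       [tokens, heads, dim]
      k:       [keys, dim]
      weights: [tokens, heads]
    """
    scores = torch.einsum("thd,sd->ths", q, k)  # per-head dot products
    if use_relu:
        scores = torch.relu(scores)  # negative per-head scores become zero
    return torch.einsum("ths,th->ts", scores, weights)  # weighted sum over heads


# One query token, two heads, two keys, head dimension 1.
q = torch.tensor([[[1.0], [-1.0]]])   # [1, 2, 1]
k = torch.tensor([[2.0], [-3.0]])     # [2, 1]
weights = torch.tensor([[1.0, 0.5]])  # [1, 2]

with_relu = index_scores(q, k, weights, use_relu=True)
without_relu = index_scores(q, k, weights, use_relu=False)
print(with_relu.tolist())     # [[2.0, 1.5]]
print(without_relu.tolist())  # [[1.0, -1.5]]
```

The two results differ because each head has one negative dot product here; a real check would feed identical inputs to the patched indexer and to the reference implementation and compare their outputs.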
