transformers - ✅(Solved) Fix [Bug/Discussion] The DSA indexer lacks a ReLU [2 pull requests, 2 comments, 2 participants]

GitHub stats
huggingface/transformers#44360 (fetched 2026-04-08 00:29:02)
Comments: 2
Participants: 2
Timeline events: 13
Reactions: 1
Timeline (top): cross-referenced ×3, mentioned ×3, subscribed ×3, commented ×2

Fix Action

Fixed

PR fix notes

PR #8085: [megatron] support GLM-5 megatron

Description (problem / solution / changelog)

  1. https://github.com/huggingface/transformers/issues/44360
  2. https://github.com/huggingface/transformers/issues/44261
  3. https://github.com/huggingface/transformers/issues/44485
  4. causal attention_mask in indexer

For precision alignment issues, please refer to these three issues.

Currently, the megatron-swift implementation uses qk_layernorm eps of 1e-5, adds relu in the indexer, and sets rope_interleave to true.
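As a rough illustration of why the eps value matters for precision alignment, the sketch below normalizes a low-variance vector with two different eps values. It is a minimal, dependency-free example: the unparameterized layer norm, the input values, and the comparison value 1e-6 are assumptions chosen for illustration, and the exact normalization variant used by the model is not stated here.

```python
import math


def layer_norm(x, eps):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Hypothetical low-variance activations, chosen so that eps is not negligible.
x = [0.001, 0.002, 0.003, 0.004]

out_1e5 = layer_norm(x, eps=1e-5)  # eps named in the PR description
out_1e6 = layer_norm(x, eps=1e-6)  # a different eps, assumed here for comparison

max_diff = max(abs(a - b) for a, b in zip(out_1e5, out_1e6))
print(f"max abs difference between the two eps settings: {max_diff:.4f}")
```

When the variance of the input is of the same order as eps, the two settings produce visibly different outputs, which is one way two implementations of the same model can drift apart.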

Environment Setup

pip install git+https://github.com/NVIDIA/Megatron-LM.git
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform --no-build-isolation

Changed files

  • docs/source/Instruction/Supported-models-and-datasets.md (modified, +1/-1)
  • docs/source/Megatron-SWIFT/Command-line-parameters.md (modified, +5/-0)
  • docs/source/Megatron-SWIFT/Quick-start.md (modified, +1/-1)
  • docs/source_en/Instruction/Supported-models-and-datasets.md (modified, +1/-1)
  • docs/source_en/Megatron-SWIFT/Command-line-parameters.md (modified, +5/-0)
  • docs/source_en/Megatron-SWIFT/Quick-start.md (modified, +1/-1)
  • swift/megatron/arguments/megatron_args.py (modified, +4/-0)
  • swift/megatron/init.py (modified, +159/-0)
  • swift/megatron/model/gpt_bridge.py (modified, +29/-6)
  • swift/megatron/model/gpt_model.py (modified, +32/-4)
  • swift/megatron/model/gpts/__init__.py (modified, +1/-0)
  • swift/megatron/model/model_config.py (modified, +31/-3)
  • swift/megatron/model/register.py (modified, +17/-0)
  • swift/megatron/trainers/base.py (modified, +14/-0)

PR #44564: Fix glm dsa

Description (problem / solution / changelog)

What does this PR do?

Fixes #44360

Changed files

  • src/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py (modified, +1/-1)
  • src/transformers/models/glm_moe_dsa/modular_glm_moe_dsa.py (modified, +1/-1)

System Info

The model structure of the GLM-MOE-DSA indexer lacks a ReLU here (https://github.com/zRzRzRzRzRzRzR/transformers/blob/4ca30213c6f7aa84b55c280e02730fe14d33dac5/src/transformers/models/glm_moe_dsa/modular_glm_moe_dsa.py#L403) compared to the reference implementation (https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/inference/kernel.py#L241)

Who can help?

@JaredforReal

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

N/A

Expected behavior

Add ReLU

extent analysis

Fix Plan

Add ReLU Activation Function

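To fix the issue, the GLM-MOE-DSA indexer needs the ReLU (Rectified Linear Unit) activation that the reference implementation applies. PR #44564 does this with a one-line change in each of modeling_glm_moe_dsa.py and modular_glm_moe_dsa.py.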
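To make the effect of the missing ReLU concrete, here is a minimal sketch in plain Python. It assumes the index score for a query token and a key is a weighted sum over indexer heads of ReLU(query · key). The function names, tensor layouts, and toy values are hypothetical and are not taken from the transformers source.

```python
def relu(v):
    """Rectified linear unit for a single float."""
    return v if v > 0.0 else 0.0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def index_scores(q, k, w, use_relu=True):
    """Score every key for every query token.

    q: [tokens][heads][dim] indexer queries (hypothetical layout)
    k: [tokens][dim] indexer keys, shared across heads
    w: [tokens][heads] per-head weights
    """
    act = relu if use_relu else (lambda v: v)
    return [
        [sum(w_t[h] * act(dot(q_t[h], k_s)) for h in range(len(q_t))) for k_s in k]
        for q_t, w_t in zip(q, w)
    ]


# Toy input: one query token, two heads, two keys.
q = [[[1.0, 0.0], [0.0, 1.0]]]
k = [[1.0, -2.0], [0.5, 0.5]]
w = [[1.0, 1.0]]

with_relu = index_scores(q, k, w, use_relu=True)
without_relu = index_scores(q, k, w, use_relu=False)
print(with_relu, without_relu)
```

For the first key, one head contributes a negative dot product. Without the ReLU that negative term cancels the positive head, so the two variants can rank keys differently.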

Step-by-Step Solution

  1. Locate the relevant code: Find the file modular_glm_moe_dsa.py in the transformers repository and navigate to line 403 (the line linked in the issue; the number may differ in other revisions).
  2. Add ReLU activation function: Apply ReLU to the tensor of indexer scores, not to a module. The snippet is illustrative; scores is a placeholder name, not the identifier used in the file:
import torch
import torch.nn.functional as F

# ...

scores = torch.tensor([[-1.0, 0.5], [2.0, -3.0]])  # placeholder for the indexer scores
scores = F.relu(scores)  # Add ReLU activation: negative scores become 0
  3. Update the model structure: Apply the same change in modeling_glm_moe_dsa.py so that it matches modular_glm_moe_dsa.py (PR #44564 modifies both files).

Example Code

Here's an illustrative snippet showing where a ReLU belongs in a module. The class is a stand-in for demonstration, not the actual class in transformers:

import torch
import torch.nn.functional as F

class ModularGLMMOEDSA(torch.nn.Module):  # hypothetical stand-in class
    def __init__(self, hidden_size, num_heads, num_layers):
        super().__init__()
        self.fc = torch.nn.Linear(hidden_size, hidden_size)
        self.fc = torch.nn.utils.weight_norm(self.fc)
        # ...

    def forward(self, x):
        # ReLU acts on the layer's output tensor, never on the module itself
        return F.relu(self.fc(x))

Verification

To verify that the fix worked, run the updated indexer and the reference implementation on the same inputs and compare their outputs within a numerical tolerance. A unit test or a short script is enough for this check.
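A comparison of this kind could look like the following sketch. The helper names and the score values are hypothetical. In practice the two lists would come from the reference implementation and the implementation under test.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equally shaped 2-D lists."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def top_k_indices(scores, k):
    """Indices of the k highest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical index scores for one query token over four keys.
reference = [[0.0, 2.5, 1.0, 0.0]]    # stand-in for the reference implementation
candidate = [[-3.0, 2.5, 1.0, -0.5]]  # stand-in for an implementation under test

diff = max_abs_diff(reference, candidate)
same_selection = top_k_indices(reference[0], 2) == top_k_indices(candidate[0], 2)
print(f"max abs diff = {diff}, same top-2 selection = {same_selection}")
```

Checking both quantities is deliberate: in this toy case the selected keys agree even though the raw scores do not, so a matching selection alone does not show that the two implementations are aligned.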
