transformers - 💡(How to fix) Fix cohere2_moe fails training + tensor parallel tests [1 pull request]


transformers 2026-05-25 02:25:30


Error Message

FAILED tests/models/cohere2_moe/test_modeling_cohere2_moe.py::Cohere2MoeModelTest::test_training_overfit - AssertionError: 0.27068585289520714 not greater than 0.9 : Expected loss to decrease by at least 90%, got 27.1%
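The assertion message compares a relative loss reduction against a 0.9 threshold. The test's actual implementation is not shown here, so the following is only a minimal sketch of that kind of check; the function names `loss_reduction` and `check_overfit` are hypothetical.

```python
def loss_reduction(initial_loss: float, final_loss: float) -> float:
    """Fraction by which the loss dropped between the first and last step."""
    if initial_loss <= 0:
        raise ValueError("initial_loss must be positive")
    return (initial_loss - final_loss) / initial_loss


def check_overfit(initial_loss: float, final_loss: float, threshold: float = 0.9) -> None:
    """Fail when the loss did not shrink by more than `threshold` (assumed semantics)."""
    reduction = loss_reduction(initial_loss, final_loss)
    assert reduction > threshold, (
        f"Expected loss to decrease by at least {threshold:.0%}, got {reduction:.1%}"
    )
```

Under this reading, a reported value of 0.2707 means the final loss is still about 73% of the initial loss, far from the overfitting regime the test expects.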

Root Cause

[transformers] The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
2026-05-21 12:47:23 - transformers.training_test - INFO - Expected:  'abcdefghijklmnopqrsabcdefghijklmnopqrsab...'
2026-05-21 12:47:23 - transformers.training_test - INFO - Generated: 'abcdefghijklmnopqrsabcdefghijklmnopqrsab...'

Fix Action

Fixed

Fixed by PR: Fix FSDP2 and distributed checkpointing imports for older PyTorch versions (https://github.com/huggingface/transformers/pull/46141)


Reproduction

Training tests:

FAILED tests/models/cohere2_moe/test_modeling_cohere2_moe.py::Cohere2MoeModelTest::test_training_overfit - AssertionError: 0.27068585289520714 not greater than 0.9 : Expected loss to decrease by at least 90%, got 27.1%

However, the generated text matches the expected text:

[transformers] The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
2026-05-21 12:47:23 - transformers.training_test - INFO - Expected:  'abcdefghijklmnopqrsabcdefghijklmnopqrsab...'
2026-05-21 12:47:23 - transformers.training_test - INFO - Generated: 'abcdefghijklmnopqrsabcdefghijklmnopqrsab...'
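The warning in this log asks for an explicit `attention_mask`, since a mask cannot be inferred from the token ids when the pad token equals the EOS token. A minimal sketch of building one from known sequence lengths, assuming right padding; `build_attention_mask` is a hypothetical helper, not part of transformers.

```python
from typing import List, Optional


def build_attention_mask(lengths: List[int], max_len: Optional[int] = None) -> List[List[int]]:
    """Return a right-padded 0/1 mask: 1 for real tokens, 0 for padding."""
    if max_len is None:
        max_len = max(lengths)
    if any(n > max_len for n in lengths):
        raise ValueError("a sequence is longer than max_len")
    # Each row has `n` ones followed by zeros up to the padded length.
    return [[1] * n + [0] * (max_len - n) for n in lengths]
```

With PyTorch, the result could be wrapped in `torch.tensor(...)` and passed as `attention_mask=` alongside `input_ids`.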

TODO: check whether the weight initialization is the same (cf. a similar issue we had with BLT: https://github.com/huggingface/transformers/pull/42685#issuecomment-3641144381)
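One way to approach this TODO is to compare per-parameter statistics between a reference initialization and the one under test. The sketch below is framework-agnostic and operates on plain lists of floats; the helper names and the tolerances are assumptions chosen for illustration, not values from the test suite.

```python
import math
from statistics import fmean, pstdev
from typing import Dict, List, Tuple


def summarize(params: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each parameter name to its (mean, population std)."""
    return {name: (fmean(values), pstdev(values)) for name, values in params.items()}


def find_init_mismatches(
    reference: Dict[str, List[float]],
    candidate: Dict[str, List[float]],
    rel_tol: float = 0.25,   # arbitrary illustrative tolerance
    abs_tol: float = 1e-3,   # arbitrary illustrative tolerance
) -> List[str]:
    """Names that are missing on one side or whose mean/std differ noticeably."""
    ref_stats, cand_stats = summarize(reference), summarize(candidate)
    mismatches = []
    for name in sorted(ref_stats.keys() | cand_stats.keys()):
        if name not in ref_stats or name not in cand_stats:
            mismatches.append(name)
            continue
        (ref_mean, ref_std), (cand_mean, cand_std) = ref_stats[name], cand_stats[name]
        same_mean = math.isclose(ref_mean, cand_mean, rel_tol=rel_tol, abs_tol=abs_tol)
        same_std = math.isclose(ref_std, cand_std, rel_tol=rel_tol, abs_tol=abs_tol)
        if not (same_mean and same_std):
            mismatches.append(name)
    return mismatches
```

For a PyTorch model, the input could be built as `{name: p.flatten().tolist() for name, p in model.named_parameters()}`. A parameter that shows up with zero std where the reference has a nonzero one is a candidate for a missing initialization branch.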

Expected behavior

Tensor parallel tests (easy to fix)

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 87, in _wrap
    fn(i, *args)
  File "/root/project/tests/test_tensor_parallel_mixin.py", line 105, in _global_wrapper
    func(rank, *func_args, **func_kwargs)
  File "/root/project/tests/test_tensor_parallel_mixin.py", line 297, in _test_tp_generation_impl
    model_tp, model, device = _load_tp_and_reference_models(model_path, model_class)
  File "/root/project/tests/test_tensor_parallel_mixin.py", line 141, in _load_tp_and_reference_models
    model_tp = model_class.from_pretrained(
  File "/root/project/src/transformers/modeling_utils.py", line 4344, in from_pretrained
    model = distribute_model(model, distributed_config, device_mesh)
  File "/root/project/src/transformers/distributed/utils.py", line 149, in distribute_model
    model = apply_tensor_parallel(model, tp_mesh, distributed_config.tp_plan)
  File "/root/project/src/transformers/distributed/tensor_parallel.py", line 495, in apply_tensor_parallel
    ALL_PARALLEL_STYLES[style_value]._apply(submodule, tp_mesh)
  File "/root/project/src/transformers/utils/generic.py", line 1071, in __getitem__
    return self._global_mapping[key]
KeyError: 'rowwise'
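The traceback ends in a plain mapping lookup, so the `KeyError` indicates that the TP plan names a style, `'rowwise'`, that has no entry in `ALL_PARALLEL_STYLES` at that point. A minimal sketch of validating a plan against the registered styles before applying it, so the failure names the offending entries; the helper names, the example patterns, and the example registry contents are hypothetical and do not reflect the library's actual registry.

```python
from typing import Dict, Iterable


def find_unknown_styles(tp_plan: Dict[str, str], registered_styles: Iterable[str]) -> Dict[str, str]:
    """Return the plan entries whose style has no registered implementation."""
    known = set(registered_styles)
    return {pattern: style for pattern, style in tp_plan.items() if style not in known}


def validate_tp_plan(tp_plan: Dict[str, str], registered_styles: Iterable[str]) -> None:
    """Raise a descriptive error instead of a bare KeyError deep in the apply loop."""
    known = sorted(set(registered_styles))
    unknown = find_unknown_styles(tp_plan, known)
    if unknown:
        details = ", ".join(f"{p!r} -> {s!r}" for p, s in sorted(unknown.items()))
        raise ValueError(f"tp_plan uses unregistered styles: {details}; registered: {known}")


# Hypothetical plan and registry, for illustration only.
example_plan = {"layers.*.mlp.up_proj": "colwise", "layers.*.mlp.down_proj": "rowwise"}
example_registry = {"colwise"}
```

Calling `validate_tp_plan(example_plan, example_registry)` raises a `ValueError` that points at the `down_proj` entry. The fix is then either to register the missing style or to change the model's plan to a style that exists.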
