transformers - (How to fix) Fix cohere2_moe fails training + tensor parallel tests [1 pull request]

Error Message

FAILED tests/models/cohere2_moe/test_modeling_cohere2_moe.py::Cohere2MoeModelTest::test_training_overfit - AssertionError: 0.27068585289520714 not greater than 0.9 : Expected loss to decrease by at least 90%, got 27.1%

Root Cause

[transformers] The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
2026-05-21 12:47:23 - transformers.training_test - INFO - Expected:  'abcdefghijklmnopqrsabcdefghijklmnopqrsab...'
2026-05-21 12:47:23 - transformers.training_test - INFO - Generated: 'abcdefghijklmnopqrsabcdefghijklmnopqrsab...'

Fix Action

Fixed

Reproduction

Training tests:

FAILED tests/models/cohere2_moe/test_modeling_cohere2_moe.py::Cohere2MoeModelTest::test_training_overfit - AssertionError: 0.27068585289520714 not greater than 0.9 : Expected loss to decrease by at least 90%, got 27.1%
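The assertion message can be read as a relative loss-reduction check. The following is a minimal sketch, assuming the test computes the reduction as (initial - final) / initial. The function name and the loss values are hypothetical; only the 0.9 threshold comes from the message above.

```python
def loss_reduction(initial_loss: float, final_loss: float) -> float:
    """Fraction by which the loss dropped relative to its starting value."""
    if initial_loss <= 0:
        raise ValueError("initial_loss must be positive")
    return (initial_loss - final_loss) / initial_loss


# Threshold quoted in the failure message ("at least 90%").
THRESHOLD = 0.9

# Hypothetical loss values, chosen only to illustrate the check.
passing = loss_reduction(4.0, 0.2)  # large drop, clears the threshold
failing = loss_reduction(4.0, 3.0)  # small drop, comparable to the 27.1% failure

print(passing > THRESHOLD, failing > THRESHOLD)
```

Under this reading, a reported value of 0.27 means the loss only fell by about a quarter of its starting value during the overfit run.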

However, the generated text matches the expected text:

[transformers] The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
2026-05-21 12:47:23 - transformers.training_test - INFO - Expected:  'abcdefghijklmnopqrsabcdefghijklmnopqrsab...'
2026-05-21 12:47:23 - transformers.training_test - INFO - Generated: 'abcdefghijklmnopqrsabcdefghijklmnopqrsab...'
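The warning in the log appears because padding cannot be told apart from a real end-of-sequence token when both share one id. As an illustration only, here is a minimal sketch of building an explicit attention mask from the known sequence lengths. The helper name and token ids are hypothetical and not taken from the test suite.

```python
def pad_with_mask(sequences, pad_token_id):
    """Right-pad token id lists and return (input_ids, attention_mask)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token (including a genuine EOS), 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical ids: the pad token deliberately equals the EOS token.
EOS = PAD = 2
ids, mask = pad_with_mask([[5, 6, EOS], [7, EOS]], PAD)
print(ids)
print(mask)
```

The mask is derived from the lengths rather than from the ids, so the genuine EOS at the end of the second sequence stays attended while the trailing pad does not.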

TODO: check whether the weight initialization is the same (cf. the similar issue we had with BLT: https://github.com/huggingface/transformers/pull/42685#issuecomment-3641144381)
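One way to approach the TODO is to compare per-parameter statistics of two freshly initialized models. The sketch below is framework-free and assumes the parameters have already been flattened into lists of floats. The parameter names, the 0.02 standard deviation, and the tolerance are all hypothetical.

```python
import random
import statistics


def init_summary(params):
    """Map each parameter name to the (mean, std) of its flattened values."""
    return {
        name: (statistics.fmean(values), statistics.pstdev(values))
        for name, values in params.items()
    }


def find_init_mismatches(reference, candidate, rel_tol=0.25):
    """Names that are missing or whose std deviates from the reference by more than rel_tol."""
    ref, cand = init_summary(reference), init_summary(candidate)
    mismatched = []
    for name, (_, ref_std) in ref.items():
        if name not in cand:
            mismatched.append(name)
            continue
        cand_std = cand[name][1]
        if abs(cand_std - ref_std) > rel_tol * max(ref_std, 1e-12):
            mismatched.append(name)
    return mismatched


# Hypothetical data: one parameter is initialized with a much larger std.
rng = random.Random(0)
reference = {
    "router.weight": [rng.gauss(0.0, 0.02) for _ in range(2000)],
    "mlp.weight": [rng.gauss(0.0, 0.02) for _ in range(2000)],
}
candidate = {
    "router.weight": [rng.gauss(0.0, 1.0) for _ in range(2000)],
    "mlp.weight": [rng.gauss(0.0, 0.02) for _ in range(2000)],
}
mismatches = find_init_mismatches(reference, candidate)
print(mismatches)
```

Comparing standard deviations rather than raw values makes the check independent of the random seed, at the cost of missing differences that preserve the overall scale.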

Expected behavior

Tensor parallel tests (easy to fix) fail with a KeyError:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 87, in _wrap
    fn(i, *args)
  File "/root/project/tests/test_tensor_parallel_mixin.py", line 105, in _global_wrapper
    func(rank, *func_args, **func_kwargs)
  File "/root/project/tests/test_tensor_parallel_mixin.py", line 297, in _test_tp_generation_impl
    model_tp, model, device = _load_tp_and_reference_models(model_path, model_class)
  File "/root/project/tests/test_tensor_parallel_mixin.py", line 141, in _load_tp_and_reference_models
    model_tp = model_class.from_pretrained(
  File "/root/project/src/transformers/modeling_utils.py", line 4344, in from_pretrained
    model = distribute_model(model, distributed_config, device_mesh)
  File "/root/project/src/transformers/distributed/utils.py", line 149, in distribute_model
    model = apply_tensor_parallel(model, tp_mesh, distributed_config.tp_plan)
  File "/root/project/src/transformers/distributed/tensor_parallel.py", line 495, in apply_tensor_parallel
    ALL_PARALLEL_STYLES[style_value]._apply(submodule, tp_mesh)
  File "/root/project/src/transformers/utils/generic.py", line 1071, in __getitem__
    return self._global_mapping[key]
KeyError: 'rowwise'
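The traceback ends in a plain dictionary lookup, so the failure means the style name used in the plan has no entry in the registry at lookup time. The sketch below is a toy stand-in showing how such a mismatch could be detected before the plan is applied. The class, the helper, the registered style, and the plan keys are hypothetical and do not reproduce the actual transformers registry.

```python
class StyleRegistry:
    """Toy stand-in for a mapping of parallel-style names to handlers."""

    def __init__(self, mapping):
        self._global_mapping = dict(mapping)

    def __getitem__(self, key):
        # Raises KeyError for unknown styles, as in the traceback above.
        return self._global_mapping[key]

    def __contains__(self, key):
        return key in self._global_mapping


def unregistered_styles(tp_plan, registry):
    """Return the {module_pattern: style} entries whose style is not registered."""
    return {
        pattern: style
        for pattern, style in tp_plan.items()
        if style not in registry
    }


# Hypothetical registry and plan: only "colwise" is registered here.
registry = StyleRegistry({"colwise": object()})
tp_plan = {"layers.*.q_proj": "colwise", "layers.*.o_proj": "rowwise"}
missing = unregistered_styles(tp_plan, registry)
print(missing)
```

A check like this reports every unknown style with the module pattern that requested it, which is easier to act on than a bare `KeyError` raised from a spawned worker process.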
