transformers - ✅(Solved) Fix IndexError: pop from an empty deque with DeepSpeed ZeRO3 [3 pull requests, 1 participant]

GitHub stats

huggingface/transformers#45137 (fetched 2026-04-08 01:57:48)

  • Comments: 0
  • Participants: 1
  • Timeline events: 6
  • Reactions: 0
  • Timeline (top): mentioned ×2, subscribed ×2, cross-referenced ×1, labeled ×1

Error Message

IndexError: pop from an empty deque
  at deepspeed/runtime/zero/partitioned_param_coordinator.py

PR fix notes

PR #45395: Fix IndexError with DeepSpeed ZeRO-3 when kernels rotary is active

Description (problem / solution / changelog)

Summary

Fixes #45137.

Since #41147, attention layers are decorated with @use_kernelized_func(apply_rotary_pos_emb) which attaches a rotary_fn child nn.Module at init when the kernels library is available.

DeepSpeed ZeRO-3's parameter coordinator traces the module graph at init and expects every registered submodule to run during forward. The attention forward still calls the Python apply_rotary_pos_emb, so rotary_fn is never invoked and the parameter-fetch trace desynchronizes, raising:

IndexError: pop from an empty deque
  at deepspeed/runtime/zero/partitioned_param_coordinator.py

on the second forward (reproducible via TRL's RLOO/GRPO trainers under ZeRO-3, see huggingface/trl#4899).

Changed files

  • docs/source/en/model_doc/pp_chart2table.md (modified, +1/-1)
  • docs/source/en/model_doc/slanext.md (modified, +1/-1)
  • docs/source/en/model_doc/uvdoc.md (modified, +1/-1)
  • src/transformers/integrations/hub_kernels.py (modified, +9/-0)

PR #45414: Fix IndexError with DeepSpeed ZeRO-3 when kernels rotary is active

Description (problem / solution / changelog)

Summary

Fixes #45137. Re-opened from #45395 on a same-repo branch so CI can run.

Since #41147, attention layers are decorated with @use_kernelized_func(apply_rotary_pos_emb) which attaches a rotary_fn child nn.Module at init when the kernels library is available. DeepSpeed ZeRO-3's parameter coordinator traces the module graph at init and expects every registered submodule to fire during forward. The attention forward still calls the plain Python apply_rotary_pos_emb, so rotary_fn is never invoked and the parameter-fetch trace desynchronizes, raising:

IndexError: pop from an empty deque
  at deepspeed/runtime/zero/partitioned_param_coordinator.py

on the second forward (reproducible via TRL's RLOO/GRPO trainers under ZeRO-3, see huggingface/trl#4899).

Fix

Skip attaching the kernelized submodule when is_deepspeed_zero3_enabled() is true. Under ZeRO-3 the Python apply_rotary_pos_emb path is used (same behavior as before #41147). Non-ZeRO-3 users are unaffected.
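The guard described above can be sketched in plain Python. This is an illustration only, not the code in hub_kernels.py: `zero3_enabled_stub`, `set_zero3_stub`, `KernelizedRotaryStub`, `attach_rotary_unless_zero3`, and `ToyAttention` are hypothetical names, and the stub flag stands in for is_deepspeed_zero3_enabled() so the example runs without DeepSpeed or PyTorch.

```python
import functools

# Hypothetical stand-in for is_deepspeed_zero3_enabled(); a mutable flag
# keeps the sketch runnable without DeepSpeed installed.
_STATE = {"zero3": False}


def set_zero3_stub(enabled: bool) -> None:
    """Toggle the pretend ZeRO-3 state (illustration only)."""
    _STATE["zero3"] = enabled


def zero3_enabled_stub() -> bool:
    return _STATE["zero3"]


class KernelizedRotaryStub:
    """Placeholder for the kernelized child module (not a real nn.Module)."""


def attach_rotary_unless_zero3(cls):
    """Class decorator: attach `rotary_fn` at init, except under ZeRO-3."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        if zero3_enabled_stub():
            # Under ZeRO-3, register nothing that forward will never call.
            return
        self.rotary_fn = KernelizedRotaryStub()

    cls.__init__ = patched_init
    return cls


@attach_rotary_unless_zero3
class ToyAttention:
    def __init__(self, hidden_size: int = 8):
        self.hidden_size = hidden_size
```

With the stub flag off, `ToyAttention()` gains a `rotary_fn` attribute; with it on, the attribute is never attached. That mirrors the behavior the PR describes, where ZeRO-3 falls back to the plain Python path.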

The second commit refreshes dates on three model cards (pp_chart2table, slanext, uvdoc) that were missing them on main — required for check-repository-consistency to pass.

Test plan

  • Reproducer from huggingface/trl#4899 no longer raises IndexError: pop from an empty deque
  • Qwen3 forward + kernelize still replaces rotary_fn when not under ZeRO-3
  • make style + check-repository-consistency pass

Changed files

  • docs/source/en/model_doc/pp_chart2table.md (modified, +1/-1)
  • docs/source/en/model_doc/slanext.md (modified, +1/-1)
  • docs/source/en/model_doc/uvdoc.md (modified, +1/-1)
  • src/transformers/integrations/hub_kernels.py (modified, +9/-0)

PR #5541: Update tests with zero3 for RLOO and GRPO once fixed in transformers 5.5.4

Description (problem / solution / changelog)

Update tests with zero3 for RLOO and GRPO once fixed in transformers 5.5.4.

This PR updates the test conditions for ZeRO-3 integration with the transformers library to reflect a recent upstream fix. The tests now only expect failures for a specific range of transformers versions where the issue is known to occur, improving the accuracy of test expectations.

Fix #4899, after the upstream issue in transformers:

has been fixed by:

Follow-up to:

  • #5420
  • #5404
  • #4898
  • #4899

Changes

Test condition updates:

  • In tests/distributed/test_distributed.py, the pytest.mark.xfail condition for the "zero3" parameter of test_rloo and test_grpo now expects a failure only when the transformers version is greater than or equal to 5.0.0 and less than 5.5.4, because the issue is fixed in transformers#45414. The reason message is updated to match.
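The version gate described above amounts to a half-open range check. The sketch below uses only the standard library and hypothetical helper names (`release_tuple`, `zero3_expected_to_fail`); it is not the code in the TRL test file. It compares only the leading numeric release components, so pre-release suffixes such as `.dev0` are ignored, which is a simplification of full PEP 440 ordering.

```python
import re


def release_tuple(version: str) -> tuple:
    """Leading numeric release components, padded to three parts.

    Example: '5.5.0.dev0' -> (5, 5, 0). Suffixes like '.dev0' are dropped,
    which simplifies full PEP 440 version ordering.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    # Pad so that '5.0' compares equal to '5.0.0'.
    return parts + (0,) * max(0, 3 - len(parts))


def zero3_expected_to_fail(transformers_version: str) -> bool:
    """True inside the affected range: >= 5.0.0 and < 5.5.4."""
    return (5, 0, 0) <= release_tuple(transformers_version) < (5, 5, 4)
```

A predicate like this could serve as the condition of a `pytest.mark.xfail` marker on the "zero3" parameter. The reporter's 5.5.0.dev0 falls inside the range, while 5.5.4 does not.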

Note (Cursor summary): Low risk. Only adjusts pytest xfail version gating and messages in distributed tests, with no production code changes.

Overview: Updates distributed tests so the zero3 parameter is only marked xfail for transformers versions >= 5.0.0 and < 5.5.4, reflecting that the upstream ZeRO-3 issue is fixed in transformers 5.5.4.

Also updates the associated xfail reason strings (and keeps strict=True) in test_rloo and test_grpo to document the fixed upstream PR/reference.

Reviewed by Cursor Bugbot for commit fef7620e6204dafedc6e16fb5c42f619bc7a135b.


Changed files

  • tests/distributed/test_distributed.py (modified, +4/-4)

System Info

  • transformers version: 5.5.0.dev0
  • Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
  • Python version: 3.10.18
  • Huggingface_hub version: 1.8.0
  • Safetensors version: 0.6.2
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: 0.18.8
  • PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
  • Using distributed or parallel set-up in script?: <fill in>
  • Using GPU in script?: <fill in>
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

  • kernels: @MekkCyber @drbh

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See related downstream issue in trl, for a full reproducer using TRL's RLOO and GRPO trainers:

Expected behavior

It should raise no error.

extent analysis

TL;DR

The error comes from a kernelized rotary_fn submodule that is registered at init but never called in forward, which desynchronizes DeepSpeed ZeRO-3's parameter-fetch trace. It is fixed upstream by PR #45414; TRL's test update (PR #5541) treats transformers 5.5.4 as the first fixed version.

Guidance

  • Check whether the installed transformers version falls in the range TRL marks as affected (>= 5.0.0 and < 5.5.4). The reporter was on the development version 5.5.0.dev0.
  • Upgrade to a transformers version that includes PR #45414, which skips attaching the kernelized submodule when is_deepspeed_zero3_enabled() is true.
  • See the downstream issue in trl (https://github.com/huggingface/trl/issues/4899) for a full reproducer using the RLOO and GRPO trainers.
  • The error is raised on the second forward, so a single forward pass is not enough to confirm the fix.

Notes

The issue itself gives no details about the task or dataset; the reproducer lives in the trl issue. The fix only changes behavior under ZeRO-3, where the Python apply_rotary_pos_emb path is used (same behavior as before #41147). Non-ZeRO-3 users are unaffected.

Recommendation

Upgrade rather than work around: use a transformers version that contains PR #45414 (5.5.4 according to PR #5541).
