transformers - ✅(Solved) Fix Bug with detecting cache positions in sdpa_mask [1 pull requests, 1 comments, 2 participants]

GitHub stats
huggingface/transformers#45735 (fetched 2026-05-02 05:27:31)
Comments: 1 · Participants: 2 · Timeline events: 10 · Reactions: 0
Timeline (top): subscribed ×4, mentioned ×3, commented ×1, cross-referenced ×1

Error Message

File transformers/models/modernbert/modeling_modernbert.py:472, in ModernBertModel.forward(self, input_ids, attention_mask, position_ids, inputs_embeds, **kwargs)
    465 if not isinstance(attention_mask_mapping := attention_mask, dict):
    466     mask_kwargs = {
    467         "config": self.config,
    468         "inputs_embeds": hidden_states,
    469         "attention_mask": attention_mask,
    470     }
    471     attention_mask_mapping = {
--> 472         "full_attention": create_bidirectional_mask(**mask_kwargs),
    473         "sliding_attention": create_bidirectional_sliding_window_mask(**mask_kwargs),
    474     }
    476 position_embeddings = {}
    477 for layer_type in set(self.config.layer_types):

File transformers/utils/deprecation.py:171, in deprecate_kwarg.<locals>.wrapper.<locals>.wrapped_func(*args, **kwargs)
    167 elif minimum_action in (Action.NOTIFY, Action.NOTIFY_ALWAYS) and not is_torchdynamo_compiling():
    168     # DeprecationWarning is ignored by default, so we use FutureWarning instead
    169     warnings.warn(message, FutureWarning, stacklevel=2)
--> 171 return func(*args, **kwargs)

File transformers/masking_utils.py:1071, in create_bidirectional_mask(config, inputs_embeds, attention_mask, encoder_hidden_states, past_key_values, or_mask_function, and_mask_function)
   1068     use_vmap = True
   1070 # We now create the mask
-> 1071 attention_mask = mask_interface(
   1072     batch_size=batch_size,
   1073     q_length=q_length,
   1074     kv_length=kv_length,
   1075     q_offset=q_offset,
   1076     kv_offset=kv_offset,
   1077     mask_function=mask_factory_function,
   1078     attention_mask=attention_mask,
   1079     # Additional kwargs for sdpa
   1080     allow_is_causal_skip=False,
   1081     allow_is_bidirectional_skip=allow_is_bidirectional_skip,
   1082     dtype=dtype,  # Additional kwarg for eager
   1083     config=config,  # Pass the config as well, in case someone wants to easily have their own mask_interface
   1084     use_vmap=use_vmap,  # Short-circuit to non-vmap expansions for the mask
   1085     device=device,
   1086 )
   1087 return attention_mask

File transformers/masking_utils.py:492, in sdpa_mask(batch_size, q_length, kv_length, q_offset, kv_offset, mask_function, attention_mask, local_size, allow_is_causal_skip, allow_is_bidirectional_skip, allow_torch_fix, use_vmap, device, **kwargs)
    487 if isinstance(q_length, torch.Tensor):
    488     logger.warning_once(
    489         "`cache_position` is deprecated as an arg, and will be removed in Transformers v5.6. Please use `q_length` and "
    490         "`q_offset` instead, similarly to `kv_length` and `kv_offset`"
    491     )
--> 492     q_length, q_offset = q_length.shape[0], q_length[0].to(device)
    494 # Potentially pad the 2D mask
    495 padding_mask = prepare_padding_mask(attention_mask, kv_length, kv_offset)

IndexError: tuple index out of range

Fix Action

Fix / Workaround

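The patch proposed in the issue seems to fix the problem. In both sdpa_mask and flex_attention_mask, the backward-compatibility check isinstance(q_length, torch.Tensor) gains the extra condition q_length.ndim > 0, so a 0-dimensional tensor no longer enters the deprecated cache_position branch. The full diff appears in the issue's Reproduction section below.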

PR fix notes

PR #45740: Fix IndexError in sdpa_mask and flex_attention_mask for 0D tensors during ONNX export

Description (problem / solution / changelog)

Fix for Issue #45735

Problem

When calling torch.onnx.export with ModernBERT models, an IndexError: tuple index out of range occurs in sdpa_mask and flex_attention_mask functions. This happens because during ONNX export, cache_position can be passed as a 0-dimensional tensor (scalar), causing failures when accessing cache_position.shape[0] or cache_position[0].
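The wording of the error follows from two facts: torch.Size is a subclass of Python's tuple, and a 0-dimensional tensor has an empty shape. The sketch below uses plain tuples as stand-ins for tensor shapes so that it runs without torch; first_dim is a hypothetical helper, not part of transformers.

```python
# Plain tuples stand in for tensor shapes (an assumption made for illustration).
scalar_shape = ()      # shape of a 0-dimensional (scalar) tensor
vector_shape = (3,)    # shape of a 1-D tensor holding three positions


def first_dim(shape):
    """Return the size of the first dimension, like `q_length.shape[0]` in the traceback."""
    return shape[0]


try:
    first_dim(scalar_shape)
    error_message = None
except IndexError as exc:
    # An empty shape has no first dimension to read.
    error_message = str(exc)

print(first_dim(vector_shape))
print(error_message)
```

The 1-D shape yields its length, while the empty shape raises the same IndexError reported in the issue.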

Solution

Add a check at the start of both functions to handle 0D tensors by unsqueezing them to 1D before extracting shape information.

Changes

  • src/transformers/masking_utils.py: Added 0D tensor handling in both sdpa_mask and flex_attention_mask
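The idea described under Solution can be sketched roughly as follows. This is not the PR's code: plain ints and lists stand in for 0-D and 1-D tensors so the example runs without torch, and normalize_positions and length_and_offset are hypothetical names. Wrapping a scalar in a list plays the role that unsqueezing a 0-D tensor to 1-D plays in the real code.

```python
def normalize_positions(positions):
    """Wrap a scalar in a list, mirroring an unsqueeze from 0-D to 1-D."""
    if isinstance(positions, int):
        return [positions]
    return list(positions)


def length_and_offset(positions):
    """Extract (length, first position) once the input is guaranteed to be 1-D."""
    positions = normalize_positions(positions)
    return len(positions), positions[0]


print(length_and_offset(5))          # scalar input no longer fails
print(length_and_offset([3, 4, 5]))  # 1-D input behaves as before
```

Under this normalization a scalar is read as a single position, so the extracted length is 1 and the offset is the scalar's value.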

Reproduction

import torch
from transformers import AutoModel, AutoTokenizer

path = "answerdotai/ModernBERT-base"
model = AutoModel.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)
dummy = dict(tokenizer(["test inputs"], return_tensors="pt"))
torch.onnx.export(model, (dummy,), dynamo=False)

Fixes #45735

Changed files

  • src/transformers/masking_utils.py (modified, +4/-0)


System Info

Transformers v5.7.0

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

It looks like https://github.com/huggingface/transformers/pull/44181 introduced the following issue.

File transformers/models/modernbert/modeling_modernbert.py:472, in ModernBertModel.forward(self, input_ids, attention_mask, position_ids, inputs_embeds, **kwargs)
    465 if not isinstance(attention_mask_mapping := attention_mask, dict):
    466     mask_kwargs = {
    467         "config": self.config,
    468         "inputs_embeds": hidden_states,
    469         "attention_mask": attention_mask,
    470     }
    471     attention_mask_mapping = {
--> 472         "full_attention": create_bidirectional_mask(**mask_kwargs),
    473         "sliding_attention": create_bidirectional_sliding_window_mask(**mask_kwargs),
    474     }
    476 position_embeddings = {}
    477 for layer_type in set(self.config.layer_types):

File transformers/utils/deprecation.py:171, in deprecate_kwarg.<locals>.wrapper.<locals>.wrapped_func(*args, **kwargs)
    167 elif minimum_action in (Action.NOTIFY, Action.NOTIFY_ALWAYS) and not is_torchdynamo_compiling():
    168     # DeprecationWarning is ignored by default, so we use FutureWarning instead
    169     warnings.warn(message, FutureWarning, stacklevel=2)
--> 171 return func(*args, **kwargs)

File transformers/masking_utils.py:1071, in create_bidirectional_mask(config, inputs_embeds, attention_mask, encoder_hidden_states, past_key_values, or_mask_function, and_mask_function)
   1068     use_vmap = True
   1070 # We now create the mask
-> 1071 attention_mask = mask_interface(
   1072     batch_size=batch_size,
   1073     q_length=q_length,
   1074     kv_length=kv_length,
   1075     q_offset=q_offset,
   1076     kv_offset=kv_offset,
   1077     mask_function=mask_factory_function,
   1078     attention_mask=attention_mask,
   1079     # Additional kwargs for sdpa
   1080     allow_is_causal_skip=False,
   1081     allow_is_bidirectional_skip=allow_is_bidirectional_skip,
   1082     dtype=dtype,  # Additional kwarg for eager
   1083     config=config,  # Pass the config as well, in case someone wants to easily have their own mask_interface
   1084     use_vmap=use_vmap,  # Short-circuit to non-vmap expansions for the mask
   1085     device=device,
   1086 )
   1087 return attention_mask

File transformers/masking_utils.py:492, in sdpa_mask(batch_size, q_length, kv_length, q_offset, kv_offset, mask_function, attention_mask, local_size, allow_is_causal_skip, allow_is_bidirectional_skip, allow_torch_fix, use_vmap, device, **kwargs)
    487 if isinstance(q_length, torch.Tensor):
    488     logger.warning_once(
    489         "`cache_position` is deprecated as an arg, and will be removed in Transformers v5.6. Please use `q_length` and "
    490         "`q_offset` instead, similarly to `kv_length` and `kv_offset`"
    491     )
--> 492     q_length, q_offset = q_length.shape[0], q_length[0].to(device)
    494 # Potentially pad the 2D mask
    495 padding_mask = prepare_padding_mask(attention_mask, kv_length, kv_offset)

IndexError: tuple index out of range

The following code reproduces this.

import torch
from transformers import AutoModel, AutoTokenizer

# Note that using bert-base-uncased creates another separate ERROR
path = "answerdotai/ModernBERT-base"

model = AutoModel.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

# Generate dummy inputs
dummy = dict(tokenizer(["test inputs"], return_tensors="pt"))

torch.onnx.export(
    model,
    (dummy,),
    dynamo=False
)

The following patch seems to fix the issue.

diff --git a/transformers/masking_utils.py b/tmp/patch.py
index 45e43fd..3d9f496 100644
--- a/transformers/masking_utils.py
+++ b/tmp/patch.py
@@ -484,7 +484,7 @@ def sdpa_mask(
 
     """
     # For BC on `cache_positions` that used to be an arg at the position of `q_length`
-    if isinstance(q_length, torch.Tensor):
+    if isinstance(q_length, torch.Tensor) and q_length.ndim > 0:
         logger.warning_once(
             "`cache_position` is deprecated as an arg, and will be removed in Transformers v5.6. Please use `q_length` and "
             "`q_offset` instead, similarly to `kv_length` and `kv_offset`"
@@ -689,7 +689,7 @@ def flex_attention_mask(
             An optional device to create the mask on.
     """
     # For BC on `cache_positions` that used to be an arg at the position of `q_length`
-    if isinstance(q_length, torch.Tensor):
+    if isinstance(q_length, torch.Tensor) and q_length.ndim > 0:
         logger.warning_once(
             "`cache_position` is deprecated as an arg, and will be removed in Transformers v5.6. Please use `q_length` and "
             "`q_offset` instead, similarly to `kv_length` and `kv_offset`"

Expected behavior

The export works.

extent analysis

TL;DR

Apply the provided patch to the transformers/masking_utils.py file to fix the IndexError: tuple index out of range issue.

Guidance

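  • The error occurs because the backward-compatibility branch in sdpa_mask assumes that any tensor passed as q_length has at least one dimension. A 0-dimensional tensor has an empty shape, so q_length.shape[0] raises IndexError.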
  • The patch fixes this by adding a check for the number of dimensions in the q_length tensor before attempting to access its elements.
  • To apply the patch, update the transformers/masking_utils.py file with the provided changes.
  • After applying the patch, re-run the code that reproduces the issue to verify that the error is resolved.
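The guard described in the guidance above can be sketched without torch. StubTensor and resolve_query_range are hypothetical stand-ins, not transformers APIs, and the handling of the 0-D case (using the scalar value as the length) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union


@dataclass
class StubTensor:
    """Hypothetical stand-in for torch.Tensor exposing only `shape`, `ndim`, and indexing."""

    values: Union[int, Sequence[int]]

    @property
    def shape(self) -> Tuple[int, ...]:
        return () if isinstance(self.values, int) else (len(self.values),)

    @property
    def ndim(self) -> int:
        return len(self.shape)

    def __getitem__(self, index: int) -> int:
        if self.ndim == 0:
            raise IndexError("cannot index a 0-dimensional stand-in tensor")
        return self.values[index]


def resolve_query_range(q_length, q_offset=0):
    """Mimic the backward-compatibility branch with the patched `ndim > 0` guard."""
    if isinstance(q_length, StubTensor) and q_length.ndim > 0:
        # Legacy call style: a 1-D cache_position was passed in place of q_length.
        return q_length.shape[0], q_length[0]
    if isinstance(q_length, StubTensor):
        # 0-D case (assumption): use the scalar as the length and keep the given offset.
        return q_length.values, q_offset
    return q_length, q_offset


print(resolve_query_range(StubTensor([4, 5, 6])))
print(resolve_query_range(StubTensor(7), q_offset=2))
```

With the guard in place, only inputs that have a first dimension reach the shape[0] and [0] accesses.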

Example

No code snippet is necessary, as the patch is provided in the issue.

Notes

The patch was written against Transformers v5.7.0, the version reported in the issue, and may not apply cleanly to other versions.

Recommendation

Apply the workaround by patching transformers/masking_utils.py locally. The issue was reported against v5.7.0, and this page does not name a release that contains the fix; PR #45740 targets the same error.
