transformers: Direct build of layer/attention silently produces non-causal output




Feature request

  • Option A: make the eager fallback honor module.is_causal:
def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    ...
    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    elif getattr(module, "is_causal", False):           # apply causal mask when none is supplied
        L, S = query.shape[-2], key_states.shape[-2]
        causal = torch.ones(L, S, dtype=torch.bool, device=query.device).tril(diagonal=S - L)
        attn_weights = attn_weights.masked_fill(~causal, float("-inf"))
    ...
  • Option B: let users export the fully resolved config. This doesn't change attention behavior, but it gives a way to snapshot and restore the resolved backend:
config.to_dict(include_runtime=True)         # keep _attn_implementation, etc.
config.to_json_string(include_runtime=True)

def _remove_keys_not_serialized(self, d, include_runtime: bool = False):
    keys_to_remove = ["_is_quantized", "_auto_class", "_commit_hash", ...]
    if not include_runtime:
        keys_to_remove += ["_attn_implementation_internal", "_experts_implementation_internal"]
    for key_to_remove in keys_to_remove:
        d.pop(key_to_remove, None)
    ...
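
The diagonal=S - L offset in Option A matters when there are more keys than queries, for example when cached keys precede the current queries. Below is a dependency-free sketch of the same rule in plain Python. The helper name causal_allowed is hypothetical and not part of transformers.

```python
def causal_allowed(q_len: int, kv_len: int) -> list[list[bool]]:
    """Return allowed[i][j], True where query i may attend to key j.

    Mirrors torch.ones(L, S).tril(diagonal=S - L): the last query row sees
    every key, and each earlier row sees one key fewer.
    """
    offset = kv_len - q_len  # number of keys that precede the first query
    return [[j <= i + offset for j in range(kv_len)] for i in range(q_len)]


square = causal_allowed(3, 3)   # no cache: plain lower-triangular mask
cached = causal_allowed(2, 4)   # 2 new queries over 4 keys (2 of them cached)

for row in cached:
    print("".join("x" if ok else "." for ok in row))
```

With q_len == kv_len the offset is zero and the mask is the usual lower triangle. With a cache, the first new query can still see all cached keys plus itself.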

These options are not mutually exclusive, and I think B improves reproducibility. Happy to open a PR for whichever direction maintainers prefer.

Motivation

When a decoder module is built directly from a config instead of via AutoModel.from_pretrained (I do this for distillation), config._attn_implementation is None:

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-4B")
cfg._attn_implementation            # -> None  (never set without a model)
layer = Qwen3DecoderLayer(cfg, 0)   # dispatch falls back to eager

AttentionInterface.get_interface(None, ...) falls back to eager_attention_forward, which applies a mask only if attention_mask is not None. With attention_mask=None, the standalone module therefore computes bidirectional attention instead of causal attention, with no warning or error.
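
The silent fallback can be illustrated with a toy dispatcher in plain Python. This is only a sketch of the behavior described above, not the transformers implementation. The names toy_eager, toy_sdpa, BACKENDS and get_interface are hypothetical stand-ins.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_eager(scores, is_causal, mask=None):
    # Like the eager path described above: masks only when a mask is passed
    # and never consults is_causal.
    rows = []
    for i, row in enumerate(scores):
        if mask is not None:
            row = [s + m for s, m in zip(row, mask[i])]
        rows.append(softmax(row))
    return rows


def toy_sdpa(scores, is_causal, mask=None):
    # Like an is_causal-aware backend: builds the causal mask when none is given.
    if mask is None and is_causal:
        mask = [[0.0 if j <= i else -math.inf for j in range(len(row))]
                for i, row in enumerate(scores)]
    return toy_eager(scores, is_causal, mask)


BACKENDS = {"eager": toy_eager, "sdpa": toy_sdpa}


def get_interface(name, default):
    # An unresolved backend (None) silently falls through to the default.
    return BACKENDS.get(name, default)


scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # 3 queries x 3 keys, equal scores
w_none = get_interface(None, toy_eager)(scores, is_causal=True)
w_sdpa = get_interface("sdpa", toy_eager)(scores, is_causal=True)

print(w_none[0])  # first query spreads weight over all positions, including future ones
print(w_sdpa[0])  # first query attends only to position 0
```

Both calls pass is_causal=True, yet only the second one respects it. That mismatch is the one the reproduction exposes on the real module.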

The Qwen3 attention module already declares itself causal (self.is_causal = True), but the eager path never consults that flag. In addition, to_dict() and to_json_string() strip the resolved backend unconditionally (via _remove_keys_not_serialized), so there is no way to round-trip or recover it.
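
The round-trip gap, and what the include_runtime flag from Option B would change, can be sketched with a toy config class. ToyConfig and its from_dict method are hypothetical and only mimic the stripping behavior. They are not the transformers config API.

```python
import json


class ToyConfig:
    """Hypothetical stand-in for a config object, for illustration only."""

    RUNTIME_KEYS = ("_attn_implementation_internal",)

    def __init__(self, hidden_size, attn_implementation=None):
        self.hidden_size = hidden_size
        self._attn_implementation_internal = attn_implementation

    def to_dict(self, include_runtime=False):
        d = dict(vars(self))
        if not include_runtime:
            # Default behavior: runtime-resolved keys are dropped.
            for key in self.RUNTIME_KEYS:
                d.pop(key, None)
        return d

    @classmethod
    def from_dict(cls, d):
        return cls(d["hidden_size"], d.get("_attn_implementation_internal"))


cfg = ToyConfig(64, attn_implementation="sdpa")

# Serialize to JSON and back, with and without the runtime keys.
default_trip = ToyConfig.from_dict(json.loads(json.dumps(cfg.to_dict())))
runtime_trip = ToyConfig.from_dict(
    json.loads(json.dumps(cfg.to_dict(include_runtime=True)))
)

print(default_trip._attn_implementation_internal)  # None: backend lost
print(runtime_trip._attn_implementation_internal)  # sdpa: backend preserved
```

The default round trip loses the resolved backend, so a module rebuilt from it would fall back to the eager path again.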

Minimal reproduction

import torch
from transformers import Qwen3Config
from transformers.models.qwen3.modeling_qwen3 import Qwen3Attention, Qwen3RotaryEmbedding

cfg = Qwen3Config(hidden_size=64, num_attention_heads=4, num_key_value_heads=2,
                  head_dim=16, intermediate_size=128, num_hidden_layers=1, vocab_size=100)

assert cfg._attn_implementation is None   # no backend resolved on a bare config
attn = Qwen3Attention(cfg, layer_idx=0).eval()
assert attn.is_causal is True             # the module still declares itself causal

B, T = 1, 8
x = torch.randn(B, T, cfg.hidden_size)
pe = Qwen3RotaryEmbedding(cfg)(x, torch.arange(T).unsqueeze(0))

with torch.no_grad():
    out_none, _ = attn(x, position_embeddings=pe, attention_mask=None)  # None -> eager, NOT causal
    cfg._attn_implementation = "sdpa"
    out_sdpa, _ = attn(x, position_embeddings=pe, attention_mask=None)  # sdpa, is_causal=True

print((out_none - out_sdpa).norm())   # large: bidirectional vs causal
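
Because the failure is silent, a small perturbation probe can help catch it: change the last input and check that earlier outputs stay put. The sketch below uses plain Python lists and two toy sequence functions. looks_causal, running_mean and global_mean are hypothetical names, and applying the same idea to tensors is left as an assumption.

```python
def looks_causal(fn, seq, tol=1e-9):
    """Perturb the last input and check that earlier outputs do not move.

    This probes only the last position, so it is a necessary check,
    not a proof of causality.
    """
    base = fn(seq)
    bumped = fn(seq[:-1] + [seq[-1] + 1.0])
    return all(abs(a - b) <= tol for a, b in zip(base[:-1], bumped[:-1]))


def running_mean(seq):
    # Causal: output i depends only on inputs 0..i.
    out, total = [], 0.0
    for i, v in enumerate(seq, 1):
        total += v
        out.append(total / i)
    return out


def global_mean(seq):
    # Non-causal: every output depends on the whole sequence.
    m = sum(seq) / len(seq)
    return [m for _ in seq]


data = [1.0, 2.0, 3.0]
print(looks_causal(running_mean, data))  # True
print(looks_causal(global_mean, data))   # False
```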

Your contribution

I can submit a PR for either Option A or Option B. I'm currently relying on B in my local monkey patch.
