transformers - 💡 (How to fix) AutoModelForSequenceClassification with attn_implementation="flash_attention_3" causes degenerate training (loss increases, model predicts a single class) [1 comment, 2 participants]

GitHub stats
huggingface/transformers#44829 (fetched 2026-04-08 00:58:08)
Comments: 1 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): commented ×1, labeled ×1, mentioned ×1, subscribed ×1

Code Example

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-Embedding-8B",
    num_labels=2,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_3",  # remove this → works
)
# train with HF Trainer on binary classification task

System Info

When fine-tuning Qwen3ForSequenceClassification (loaded via AutoModelForSequenceClassification) with attn_implementation="flash_attention_3", training completely fails: loss increases instead of decreasing, and the model collapses to predicting all examples as one class. Removing attn_implementation="flash_attention_3" (falling back to default attention) fixes the issue immediately.

Environment:

  • Hardware: NVIDIA H100 (Hopper)
  • transformers version: (your version)
  • flash-attn version: (your version)
  • Model: Qwen/Qwen3-Embedding-8B
  • PEFT / LoRA applied on top
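
The version placeholders above can be filled in programmatically. This is a minimal sketch using only the standard library. The distribution names `transformers`, `flash-attn`, `peft`, and `torch` are assumptions about how the packages were installed (a FlashAttention-3 build in particular may be registered under a different name), and `collect_versions` is a hypothetical helper:

```python
from importlib import metadata


def collect_versions(packages):
    """Return {distribution name: installed version or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Distribution names are assumptions; adjust them to your environment.
for name, version in collect_versions(
    ["transformers", "flash-attn", "peft", "torch"]
).items():
    print(f"{name}: {version}")
```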

Observed: loss increases (e.g. 0.35 → 0.41), eval_recall=1.0 with threshold≈0 (all predicted positive), F1 stuck at positive-class base rate.
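
The metric pattern above is what a classifier that labels every example positive would produce. Below is a small worked sketch in plain Python. The label counts are made up for illustration and `prf_all_positive` is a hypothetical helper. With recall fixed at 1.0, precision equals the positive base rate p and F1 equals 2p / (1 + p), so F1 is pinned by the base rate rather than by anything the model learned:

```python
def prf_all_positive(labels):
    """Precision, recall, and F1 when every example is predicted positive."""
    positives = sum(1 for y in labels if y == 1)
    total = len(labels)
    if positives == 0 or total == 0:
        raise ValueError("need at least one positive label")
    precision = positives / total  # TP / (TP + FP): equals the base rate
    recall = 1.0                   # TP / (TP + FN): there are no false negatives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only: 30 positives out of 100 examples.
labels = [1] * 30 + [0] * 70
precision, recall, f1 = prf_all_positive(labels)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```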

Note: The issue appears specific to Qwen3ForSequenceClassification + FA3. The same model backbone with FA3 works correctly in other use cases (e.g. feature extraction / embedding), suggesting the problem lies in the last-token pooling or score head gradient path under FA3.
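
For context on the last-token pooling mentioned above: a sequence-classification head on a decoder-only model scores each sequence from the hidden state at its last non-padding token. The sketch below shows only that index selection in plain Python. It is not the transformers implementation, and `last_non_pad_index`, `PAD`, and the token IDs are hypothetical. If the attention kernel and the pooling step disagreed about where padding sits, the head would read from the wrong position. That is a hypothesis worth checking, not a confirmed cause:

```python
def last_non_pad_index(input_ids, pad_token_id):
    """Index of the last non-padding token in each row of a padded batch."""
    indices = []
    for row in input_ids:
        positions = [i for i, tok in enumerate(row) if tok != pad_token_id]
        if not positions:
            raise ValueError("row contains only padding")
        indices.append(positions[-1])
    return indices


PAD = 0  # hypothetical pad token ID
right_padded = [[11, 12, 13, PAD, PAD], [21, 22, 23, 24, 25]]
left_padded = [[PAD, PAD, 11, 12, 13], [21, 22, 23, 24, 25]]

print(last_non_pad_index(right_padded, PAD))  # [2, 4]
print(last_non_pad_index(left_padded, PAD))   # [4, 4]
```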

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-Embedding-8B",
    num_labels=2,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_3",  # remove this → works
)
# train with HF Trainer on binary classification task

Expected behavior

Normal convergence: the training loss decreases and predictions do not collapse to a single class.

extent analysis

Fix Plan

The failure appears to be tied to loading Qwen3ForSequenceClassification with attn_implementation="flash_attention_3". The following steps are worth trying:

  • Disable flash attention: Remove the attn_implementation="flash_attention_3" argument when loading the model.
  • Update flash-attn library: Ensure that the flash-attn library is up-to-date, as newer versions may have fixed the issue.
  • Use a different attention implementation: Set attn_implementation explicitly to another backend, such as "sdpa" or "eager", instead of "flash_attention_3".

Here's an example code snippet that demonstrates how to disable flash attention:

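import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-Embedding-8B",
    num_labels=2,
    torch_dtype=torch.bfloat16,
    # attn_implementation omitted: falls back to the default attention
)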

Alternatively, you can try updating the flash-attn library and then load the model with the attn_implementation="flash_attention_3" argument.
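
As a sketch of the third step in the list above, the attention backend can be named explicitly instead of relying on the default. `"sdpa"` and `"eager"` are values transformers accepts for `attn_implementation`. The helper names `build_load_kwargs` and `load_classifier` are hypothetical, and whether a given backend trains correctly for this model is something to verify, not something this snippet guarantees:

```python
def build_load_kwargs(attn_implementation=None, num_labels=2):
    """Keyword arguments for from_pretrained; the backend key is omitted when None."""
    kwargs = {"num_labels": num_labels}
    if attn_implementation is not None:
        kwargs["attn_implementation"] = attn_implementation
    return kwargs


def load_classifier(model_name, attn_implementation=None):
    """Load the classifier with the chosen backend (requires torch and transformers)."""
    import torch
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        **build_load_kwargs(attn_implementation),
    )


print(build_load_kwargs("sdpa"))
print(build_load_kwargs())
```

Calling `load_classifier("Qwen/Qwen3-Embedding-8B", "sdpa")` would then load the model from the report with SDPA attention, which keeps the choice of backend in one place while comparing runs.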

Verification

To verify that the fix worked, you can train the model and check the loss and evaluation metrics. The loss should decrease, and the evaluation metrics (e.g., F1 score) should improve.
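
One way to make this check concrete is to compare early and late training losses. The sketch below assumes a log in the shape of the Trainer's `trainer.state.log_history` (a list of dicts, some of which carry a `"loss"` key). `loss_is_decreasing` and the window size are hypothetical choices, and the sample values are illustrative:

```python
def loss_is_decreasing(log_history, window=2):
    """True if the mean of the last `window` losses is below the mean of the first `window`."""
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    if len(losses) < 2 * window:
        raise ValueError("not enough logged losses to compare")
    early = sum(losses[:window]) / window
    late = sum(losses[-window:]) / window
    return late < early


# Illustrative logs; entries without a "loss" key (e.g. eval metrics) are skipped.
diverging = [{"loss": 0.35}, {"loss": 0.36}, {"eval_f1": 0.46}, {"loss": 0.40}, {"loss": 0.41}]
converging = [{"loss": 0.69}, {"loss": 0.55}, {"loss": 0.40}, {"loss": 0.31}]

print(loss_is_decreasing(diverging))   # False
print(loss_is_decreasing(converging))  # True
```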

Extra Tips

  • Make sure to test the model on a small dataset before training on the full dataset to ensure that the issue is resolved.
  • If the issue persists, try debugging the model by printing the intermediate outputs and gradients to identify where the problem lies.
  • Consider opening an issue on the Hugging Face Transformers repository or the flash-attn repository if the problem is not resolved after trying the above steps.
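
For the debugging tip above, one concrete approach is to run the same batch through the model under two attention backends and compare the outputs. The comparison itself can be sketched without any framework. `max_abs_diff` is a hypothetical helper and the numbers are invented, standing in for logits converted to Python lists (for example via `.float().tolist()` on a tensor):

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two nested lists of floats."""
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    worst = 0.0
    for x, y in zip(a, b):
        if isinstance(x, list):
            worst = max(worst, max_abs_diff(x, y))
        else:
            worst = max(worst, abs(x - y))
    return worst


# Invented logits for a batch of 2 examples and 2 classes.
logits_default = [[0.12, -0.40], [1.05, 0.33]]
logits_fa3 = [[0.13, -0.41], [0.20, 0.95]]

print(max_abs_diff(logits_default, logits_fa3))
```

Small differences are expected in bfloat16. A gap large enough to change the predicted class on an identical batch would point at the forward pass, while matching logits would shift suspicion toward the backward pass.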
