transformers - 💡(How to fix) Fix AutoModelForSequenceClassification with attn_implementation="flash_attention_3" causes degenerate training (loss increases, model predicts all-one-class) [1 comments, 2 participants]

transformers · 2026-03-18 13:56:33


GitHub stats

huggingface/transformers#44829 • Fetched 2026-04-08 00:58:08

Author: Jantory

Participants: Jantory, Rocketknight1

Timeline (top): commented ×1, labeled ×1, mentioned ×1, subscribed ×1


System Info

When fine-tuning Qwen3ForSequenceClassification (loaded via AutoModelForSequenceClassification) with attn_implementation="flash_attention_3", training completely fails: loss increases instead of decreasing, and the model collapses to predicting all examples as one class. Removing attn_implementation="flash_attention_3" (falling back to default attention) fixes the issue immediately.

Environment:

Hardware: NVIDIA H100 (Hopper)
transformers version: (your version)
flash-attn version: (your version)
Model: Qwen/Qwen3-Embedding-8B
PEFT / LoRA applied on top

Observed: loss increases (e.g. 0.35 → 0.41), eval_recall=1.0 with threshold≈0 (all predicted positive), F1 stuck at positive-class base rate.
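The observed metrics are what a classifier that always predicts the positive class would produce. The sketch below works through that arithmetic: recall is 1.0, precision equals the positive-class base rate, and F1 follows from the two. The function names are illustrative and not part of any library.

```python
def all_positive_metrics(labels):
    """Precision, recall and F1 of a classifier that predicts 1 for every example."""
    n = len(labels)
    positives = sum(1 for y in labels if y == 1)
    if n == 0 or positives == 0:
        raise ValueError("need at least one positive label")
    precision = positives / n  # every example is predicted positive
    recall = 1.0               # so no positive example is ever missed
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def is_collapsed(predictions):
    """True when every prediction is the same class."""
    return len(set(predictions)) <= 1


if __name__ == "__main__":
    labels = [1, 0, 0, 0]  # toy data with a 25% positive base rate
    print(all_positive_metrics(labels))
    print(is_collapsed([1, 1, 1, 1]))
```

If the evaluation metrics match these values for your label distribution, the model has most likely stopped discriminating between classes rather than merely underfitting.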

Note: The issue appears specific to Qwen3ForSequenceClassification + FA3. The same model backbone with FA3 works correctly in other use cases (e.g. feature extraction / embedding), suggesting the problem lies in the last-token pooling or score head gradient path under FA3.
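For context on the pooling step mentioned in the note: decoder-style sequence classification heads in transformers score the hidden state of the last non-padding token in each row. The pure-Python sketch below only illustrates which position that is; it is not the library's implementation, and the helper name is hypothetical.

```python
def last_non_pad_index(input_ids, pad_token_id):
    """Return the index of the rightmost token that is not padding."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_token_id:
            return i
    raise ValueError("sequence contains only padding tokens")


if __name__ == "__main__":
    PAD = 0  # assumed pad token id for this toy example
    print(last_non_pad_index([5, 6, 7, PAD, PAD], PAD))  # right-padded row
    print(last_non_pad_index([PAD, PAD, 5, 6, 7], PAD))  # left-padded row
```

When debugging, comparing the hidden state at this position between the default attention and FA3 on the same batch is one way to test whether the pooled token is where the two paths diverge.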

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-Embedding-8B",
    num_labels=2,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_3",  # remove this → works
)
# train with HF Trainer on binary classification task

Expected behavior

Normal convergence: the training loss decreases and predictions do not collapse to a single class.

extent analysis

Fix Plan

The failure is tied to attn_implementation="flash_attention_3" on Qwen3ForSequenceClassification. Until the root cause is identified, try the following workarounds:

Disable FlashAttention 3: Remove the attn_implementation="flash_attention_3" argument when loading the model. The reporter confirms that training converges with the default attention.
Update the flash-attn library: Upgrade to the latest release in case a newer version fixes the problem. This is unverified for this issue.
Select a different attention implementation: Pass another value for attn_implementation, for example "sdpa" or "flash_attention_2".

Here's an example code snippet that demonstrates how to disable flash attention:

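import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-Embedding-8B",
    num_labels=2,
    torch_dtype=torch.bfloat16,
    # attn_implementation omitted: falls back to the default attention
)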

Alternatively, update the flash-attn library and retry with attn_implementation="flash_attention_3". Keep the workaround above if training still diverges.

Verification

To verify that the fix worked, you can train the model and check the loss and evaluation metrics. The loss should decrease, and the evaluation metrics (e.g., F1 score) should improve.
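The loss check can be automated from the Trainer's logged history. The sketch below assumes the usual `trainer.state.log_history` layout, a list of dicts where training entries carry a "loss" key; the helper names and the window size are illustrative choices.

```python
def training_losses(log_history):
    """Extract training losses, skipping eval-only and summary entries."""
    return [entry["loss"] for entry in log_history if "loss" in entry]


def loss_trend(losses, window=3):
    """Mean of the last `window` losses minus the mean of the first `window`.

    A negative value means the loss went down over the run.
    """
    if len(losses) < 2 * window:
        raise ValueError("not enough logged losses for the chosen window")
    start = sum(losses[:window]) / window
    end = sum(losses[-window:]) / window
    return end - start


if __name__ == "__main__":
    # Toy history shaped like trainer.state.log_history (values are made up).
    history = [
        {"loss": 0.70, "step": 10},
        {"eval_loss": 0.66, "step": 10},
        {"loss": 0.60, "step": 20},
        {"loss": 0.50, "step": 30},
        {"loss": 0.40, "step": 40},
    ]
    print(loss_trend(training_losses(history), window=2))
```

Running the same check on a short run with and without FA3 gives a like-for-like comparison: a positive trend with FA3 and a negative one without it reproduces the reported behavior.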

Extra Tips

Make sure to test the model on a small dataset before training on the full dataset to ensure that the issue is resolved.
If the issue persists, try debugging the model by printing the intermediate outputs and gradients to identify where the problem lies.
Consider opening an issue on the Hugging Face Transformers repository or the flash-attn repository if the problem is not resolved after trying the above steps.
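For the small-dataset check, a subset that keeps both classes makes a collapse to one class easy to see. This is a minimal sketch using only the standard library; `stratified_subset` is a hypothetical helper, and the per-class count and seed are arbitrary.

```python
import random


def stratified_subset(labels, per_class, seed=0):
    """Pick up to `per_class` example indices from each label value."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    chosen = []
    for y in sorted(by_class):
        idxs = by_class[y]
        chosen.extend(rng.sample(idxs, min(per_class, len(idxs))))
    return sorted(chosen)


if __name__ == "__main__":
    labels = [0] * 10 + [1] * 3  # toy, imbalanced binary labels
    print(stratified_subset(labels, per_class=2))
```

With a Hugging Face `datasets.Dataset`, the returned indices can be passed to `Dataset.select` to build the small training set.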
