transformers - 💡(How to fix) Fix `problem_type="single_label_classification"` with `num_labels=1` leads to degenerate zero loss across multiple sequence-classification models [2 comments, 2 participants]

Q: Expected behavior

Expected behavior: I would expect `num_labels=1` with `problem_type="single_label_classification"` to support binary classification meaningfully for labels `{0, 1}`, instead of silently producing a degenerate zero loss. For example, this could be implemented with a single-logit binary objective such as `BCEWithLogitsLoss`, or by internally mapping this configuration to an equivalent binary-classification setup. In any case, the current behavior of silently returning zero loss seems incorrect.

Actual behavior: The model runs, but training/eval loss becomes degenerate (`0`) because `CrossEntropyLoss` is applied to logits with shape `[..., 1]`.

transformers · 2026-04-16 14:58:54


GitHub stats

huggingface/transformers#45479•Fetched 2026-04-17 08:26:41

Author

BohdanBabii

Participants

BohdanBabii

Mati0kez

Timeline (top)

commented ×2, labeled ×1

Root Cause

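With `num_labels=1` and `problem_type="single_label_classification"`, the forward pass applies `CrossEntropyLoss` to logits with a single class dimension, which produces a degenerate zero loss.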


System Info

Hi, I found what looks like a library-wide issue in transformers affecting multiple ForSequenceClassification models, not just ModernBERT.

If a model is initialized with:

num_labels=1
problem_type="single_label_classification"

the forward pass uses CrossEntropyLoss() with only one output logit. This leads to a degenerate zero loss during training/evaluation instead of performing binary classification meaningfully.

I first observed this with ModernBertForSequenceClassification, but the same logic appears in other sequence-classification models as well (for example RoBERTa and others using the same loss-selection pattern).

In modeling_modernbert.py, the relevant part is:

loss = None
if labels is not None:
    if self.config.problem_type is None:
        if self.num_labels == 1:
            self.config.problem_type = "regression"
        elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
            self.config.problem_type = "single_label_classification"
        else:
            self.config.problem_type = "multi_label_classification"

    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            loss = loss_fct(logits.squeeze(), labels.squeeze())
        else:
            loss = loss_fct(logits, labels)
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    elif self.config.problem_type == "multi_label_classification":
        loss_fct = BCEWithLogitsLoss()
        loss = loss_fct(logits, labels)

With `num_labels=1` and `problem_type="single_label_classification"`, this becomes:

CrossEntropyLoss()(logits.view(-1, 1), labels.view(-1))

which produces a degenerate zero loss because there is only one class dimension.
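
To see why a single class dimension forces the loss to zero, here is a minimal pure-Python sketch of softmax cross-entropy. The helper names `cross_entropy` and `cross_entropy_grad` are hypothetical and written only for illustration; they are not part of transformers or PyTorch. With one logit, the softmax probability is always 1, so both the loss and its gradient vanish regardless of the logit value.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def cross_entropy_grad(logits, target):
    """Gradient of the loss w.r.t. the logits: softmax(logits) - one_hot(target)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total - (1.0 if i == target else 0.0) for i, e in enumerate(exps)]

# One class: the only valid target index is 0, and the loss is zero for any logit.
single_logit_losses = [cross_entropy([z], 0) for z in (-5.0, 0.0, 3.7)]
single_logit_grads = [cross_entropy_grad([z], 0) for z in (-5.0, 0.0, 3.7)]

# Two classes with equal logits: the loss is log(2), so there is a real signal.
two_logit_loss = cross_entropy([0.0, 0.0], 1)

print(single_logit_losses, single_logit_grads, two_logit_loss)
```

Because the gradient is zero as well, training with this configuration cannot update the classifier through the loss, which matches the degenerate behavior described above.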

Why I think this is a bug: This setup naturally suggests binary classification with labels like:

0 -> class 0
1 -> class 1

So from a user perspective, this looks like it should be a valid single-label binary classification setup. Right now, however, num_labels=1 is effectively treated as if there were only one possible class in the loss computation, which makes the classification loss meaningless.

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Minimal reproduction

from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=1,
    problem_type="single_label_classification",
)

input_ids = torch.tensor([[101, 102]])
attention_mask = torch.tensor([[1, 1]])
labels = torch.tensor([0])

outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    labels=labels,
)

print(outputs.logits.shape)
print(outputs.loss)

Observed result: `outputs.loss` is 0 (or degenerate), and the same behavior also shows up during training.

Expected behavior

I would expect `num_labels=1` with `problem_type="single_label_classification"` to support binary classification meaningfully for labels `{0, 1}`, instead of silently producing a degenerate zero loss. For example, this could be implemented with a single-logit binary objective such as `BCEWithLogitsLoss`, or by internally mapping this configuration to an equivalent binary-classification setup. In any case, the current behavior of silently returning zero loss seems incorrect.

Actual behavior: The model runs, but training/eval loss becomes degenerate (`0`) because `CrossEntropyLoss` is applied to logits with shape `[..., 1]`.
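
The two options mentioned under expected behavior (a single-logit binary objective, or an equivalent two-class setup) compute the same loss. The sketch below checks this numerically in pure Python, assuming the two-class logits are fixed to `[0, z]`. The function names are hypothetical and exist only for this illustration.

```python
import math

def bce_with_logit(z, y):
    """Binary cross-entropy on a single logit z with target y in {0, 1}."""
    # softplus(z) - y*z is a numerically stable form of
    # -[y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))].
    softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return softplus - y * z

def two_class_cross_entropy(z, y):
    """Softmax cross-entropy over the logit pair [0, z] with class index y."""
    logits = [0.0, z]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[y]

# Compare both formulations over a few logits and both labels.
max_gap = max(
    abs(bce_with_logit(z, y) - two_class_cross_entropy(z, y))
    for z in (-4.0, -0.5, 0.0, 1.3, 6.0)
    for y in (0, 1)
)
print(max_gap)
```

The equivalence holds because the softmax probability of class 1 over `[0, z]` equals `sigmoid(z)`, so either mapping would give labels `{0, 1}` a meaningful loss.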

extent analysis

TL;DR

The most likely fix is to use BCEWithLogitsLoss instead of CrossEntropyLoss when num_labels=1 and problem_type="single_label_classification".

Guidance

Check the loss function used in the model when num_labels=1 and problem_type="single_label_classification". It should be BCEWithLogitsLoss instead of CrossEntropyLoss.
Verify that the model is producing a non-degenerate loss during training and evaluation.
Consider modifying the modeling_modernbert.py file to use BCEWithLogitsLoss when num_labels=1 and problem_type="single_label_classification".
Test the model with a minimal reproduction script to ensure the fix is working as expected.

Example

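# Handle the single-logit binary case before the generic CrossEntropyLoss branch.
if self.config.problem_type == "single_label_classification" and self.num_labels == 1:
    loss_fct = BCEWithLogitsLoss()
    # BCEWithLogitsLoss needs float targets with the same shape as the logits.
    loss = loss_fct(logits.view(-1), labels.view(-1).float())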
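
The proposed branch can also be tried outside the model. The following is a sketch only, not the library's implementation: `classification_loss` is a hypothetical standalone helper, and it assumes integer labels in `{0, 1}` that are cast to float, since `BCEWithLogitsLoss` expects float targets shaped like the logits.

```python
import torch
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss

def classification_loss(logits, labels, num_labels):
    """Hypothetical helper: pick a single-label loss based on num_labels."""
    if num_labels == 1:
        # One logit per example: treat it as a binary logit for class 1.
        return BCEWithLogitsLoss()(logits.view(-1), labels.view(-1).float())
    # Two or more logits: the usual softmax cross-entropy over class indices.
    return CrossEntropyLoss()(logits.view(-1, num_labels), labels.view(-1))

# Two examples, one logit each, with integer labels 0 and 1.
logits = torch.tensor([[0.3], [-1.2]], requires_grad=True)
labels = torch.tensor([0, 1])

loss = classification_loss(logits, labels, num_labels=1)
loss.backward()

print(loss.item(), logits.grad)
```

Unlike the `CrossEntropyLoss` path with a single logit, this produces a positive loss and a non-zero gradient, which is what the reproduction script should show after a fix.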

Notes

This fix assumes that the intended behavior is to perform binary classification when num_labels=1 and problem_type="single_label_classification". If this is not the case, further modifications may be needed.

Recommendation

Apply workaround: use BCEWithLogitsLoss instead of CrossEntropyLoss when num_labels=1 and problem_type="single_label_classification", as this will allow the model to perform binary classification meaningfully.
