transformers: `problem_type="single_label_classification"` with `num_labels=1` leads to degenerate zero loss across multiple sequence-classification models [2 comments, 2 participants]

huggingface/transformers#45479 · Fetched 2026-04-17 08:26:41

Root Cause

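With num_labels=1 and problem_type="single_label_classification", the forward pass applies CrossEntropyLoss to a single logit, which produces a degenerate zero loss because there is only one class dimension.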

System Info

Hi, I found what looks like a library-wide issue in transformers affecting multiple ForSequenceClassification models, not just ModernBERT.

If a model is initialized with:

num_labels=1
problem_type="single_label_classification"

the forward pass uses CrossEntropyLoss() with only one output logit. This leads to a degenerate zero loss during training/evaluation instead of performing binary classification meaningfully.

I first observed this with ModernBertForSequenceClassification, but the same logic appears in other sequence-classification models as well (for example RoBERTa and others using the same loss-selection pattern).

In modeling_modernbert.py, the relevant part is:

loss = None
if labels is not None:
    if self.config.problem_type is None:
        if self.num_labels == 1:
            self.config.problem_type = "regression"
        elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
            self.config.problem_type = "single_label_classification"
        else:
            self.config.problem_type = "multi_label_classification"

    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            loss = loss_fct(logits.squeeze(), labels.squeeze())
        else:
            loss = loss_fct(logits, labels)
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    elif self.config.problem_type == "multi_label_classification":
        loss_fct = BCEWithLogitsLoss()
        loss = loss_fct(logits, labels)

With `num_labels=1` and `problem_type="single_label_classification"`, this becomes:

CrossEntropyLoss()(logits.view(-1, 1), labels.view(-1))

which produces a degenerate zero loss because there is only one class dimension: the softmax over a single logit is always 1, so the loss is -log(1) = 0 whatever the logit value.
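The arithmetic behind the zero loss can be checked without any model. The following is a minimal pure-Python sketch, not the library's code: `cross_entropy` and `bce_with_logit` are illustrative helpers written for this example, and the logit values are arbitrary.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example: -log(softmax(logits)[target])."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def bce_with_logit(logit, target):
    """Binary cross-entropy on a single logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

test_logits = (-3.0, 0.0, 5.0)

# One class only: the loss is 0 for every logit value, so nothing can be learned.
single_logit_losses = [cross_entropy([z], 0) for z in test_logits]

# Same logits, binary objective with target 1: the loss depends on the logit.
bce_losses = [bce_with_logit(z, 1.0) for z in test_logits]

print(single_logit_losses)
print(bce_losses)
```

With one class, log-sum-exp of the logits equals the logit itself, so the two terms cancel exactly. The binary objective on the same logit does not have this problem.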

Why I think this is a bug: This setup naturally suggests binary classification with labels like:

  • 0 -> class 0
  • 1 -> class 1

So from a user perspective, this looks like it should be a valid single-label binary classification setup. Right now, however, num_labels=1 is effectively treated as if there were only one possible class in the loss computation, which makes the classification loss meaningless.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Minimal reproduction

from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=1,
    problem_type="single_label_classification",
)

input_ids = torch.tensor([[101, 102]])
attention_mask = torch.tensor([[1, 1]])
labels = torch.tensor([0])

outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    labels=labels,
)

print(outputs.logits.shape)
print(outputs.loss)

Observed result: outputs.loss is 0 (or degenerate), and the same behavior also shows up during training.

Expected behavior

I would expect num_labels=1 with problem_type="single_label_classification" to support binary classification meaningfully for labels {0, 1}, instead of silently producing a degenerate zero loss. For example, this could be implemented with a single-logit binary objective such as BCEWithLogitsLoss, or by internally mapping this configuration to an equivalent binary-classification setup. In any case, the current behavior of silently returning zero loss seems incorrect.

Actual behavior: The model runs, but training/eval loss becomes degenerate (0) because CrossEntropyLoss is applied to logits with shape [..., 1].
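The difference between the two objectives on single-logit outputs can be sketched directly in PyTorch. This is an illustration under assumptions, not the library's behavior: the `logits` tensor is a made-up stand-in for `outputs.logits` from a model built with num_labels=1, and the label values are arbitrary.

```python
import torch
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss

# Stand-in for outputs.logits from a model with num_labels=1 (shape [batch, 1]).
logits = torch.tensor([[0.3], [-1.2], [2.0]], requires_grad=True)
labels = torch.tensor([0, 1, 1])  # binary labels, integer dtype

# Current path: cross-entropy over a single class. Only label 0 is a valid
# class index here, so all-zero targets are used for the comparison.
ce_loss = CrossEntropyLoss()(logits.view(-1, 1), torch.zeros_like(labels))
(ce_grad,) = torch.autograd.grad(ce_loss, logits)

# Single-logit binary objective: BCEWithLogitsLoss needs float targets
# with the same shape as the logits.
bce_loss = BCEWithLogitsLoss()(logits.view(-1), labels.to(logits.dtype))
(bce_grad,) = torch.autograd.grad(bce_loss, logits)

print(ce_loss.item(), ce_grad.abs().sum().item())    # loss and gradient magnitude
print(bce_loss.item(), bce_grad.abs().sum().item())  # loss and gradient magnitude
```

As a user-side workaround, the same binary loss could be computed outside the model by calling it without `labels` and passing `outputs.logits` where the stand-in tensor is used here.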

extent analysis

TL;DR

The most likely fix is to use BCEWithLogitsLoss instead of CrossEntropyLoss when num_labels=1 and problem_type="single_label_classification".

Guidance

  • Check the loss function used in the model when num_labels=1 and problem_type="single_label_classification". It should be BCEWithLogitsLoss instead of CrossEntropyLoss.
  • Verify that the model is producing a non-degenerate loss during training and evaluation.
  • Consider modifying the loss selection in modeling_modernbert.py, and in other models that share the same pattern (such as RoBERTa), to use BCEWithLogitsLoss when num_labels=1 and problem_type="single_label_classification".
  • Test the model with a minimal reproduction script to ensure the fix is working as expected.

Example

if self.config.problem_type == "single_label_classification" and self.num_labels == 1:
    # BCEWithLogitsLoss expects float targets with the same shape as the logits
    loss_fct = BCEWithLogitsLoss()
    loss = loss_fct(logits.view(-1), labels.view(-1).to(logits.dtype))

Notes

This fix assumes that the intended behavior is to perform binary classification when num_labels=1 and problem_type="single_label_classification". If this is not the case, further modifications may be needed.
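If binary classification is not the intended meaning of this configuration, one alternative is to fail loudly instead of returning a silent zero loss. The sketch below is a hypothetical guard: `check_problem_type` is a name invented for this example and is not part of transformers.

```python
from typing import Optional

def check_problem_type(num_labels: int, problem_type: Optional[str]) -> None:
    """Hypothetical guard (not a transformers API): reject the degenerate setup."""
    if problem_type == "single_label_classification" and num_labels == 1:
        raise ValueError(
            "single_label_classification with num_labels=1 gives a constant zero "
            "cross-entropy loss; use num_labels=2 for binary classification, or "
            "a single-logit objective such as BCEWithLogitsLoss."
        )

check_problem_type(2, "single_label_classification")  # accepted
check_problem_type(1, "regression")                   # accepted
check_problem_type(1, None)                           # accepted; type is inferred later

try:
    check_problem_type(1, "single_label_classification")
except ValueError as err:
    print(f"rejected: {err}")
```

A check like this would surface the misconfiguration at setup time rather than after a training run that never learns.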

Recommendation

Apply workaround: use BCEWithLogitsLoss instead of CrossEntropyLoss when num_labels=1 and problem_type="single_label_classification", as this will allow the model to perform binary classification meaningfully.
