transformers - ✅(Solved) Fix Add `Olmo2ForSequenceClassification` (and ideally `OlmoForSequenceClassification` / `Olmo3ForSequenceClassification`) [1 pull requests]

Root Cause

AutoModelForSequenceClassification.from_pretrained("allenai/OLMo-2-0425-1B") currently fails because the OLMo family exposes only *Model and *ForCausalLM. All peer decoder architectures (Llama, Mistral, Qwen2, Gemma, Falcon, etc.) ship ForSequenceClassification.

PR fix notes

PR #45551: Add ForSequenceClassification heads for the OLMo family

Description (problem / solution / changelog)

What does this PR do?

Adds ForSequenceClassification for Olmo, Olmo2, and Olmo3 so AutoModelForSequenceClassification.from_pretrained("allenai/OLMo-2-0425-1B") (and the Olmo/Olmo3 equivalents) work.
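
A minimal usage sketch of the call this PR is meant to unblock, assuming a transformers build that already includes these heads. The helper name load_olmo_classifier and the num_labels=113 default (borrowed from the 113-class task described in the issue) are illustrative, not part of the PR. Imports are deferred into the function so that defining it requires neither the library nor a download.

```python
def load_olmo_classifier(model_id="allenai/OLMo-2-0425-1B", num_labels=113):
    """Load an OLMo checkpoint with a freshly initialised classification head.

    Hypothetical helper: assumes a transformers version that ships the
    Olmo*ForSequenceClassification classes added by this PR.
    """
    # Deferred imports: defining the helper needs neither transformers nor network access.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )

    # Decoder-style heads pool the last non-padding token, so batched inputs
    # need a pad token id on the model config. If the tokenizer has no pad
    # token either, this stays None and the caller has to choose one.
    if model.config.pad_token_id is None:
        model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model
```

Calling load_olmo_classifier() downloads the checkpoint; the score head is newly initialised and needs fine-tuning before its predictions mean anything.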

Structure follows the same pattern as Gemma2, Qwen3, and Glm4: modular_olmo.py defines OlmoForSequenceClassification as an empty subclass of LlamaForSequenceClassification (the body is just pass), then Olmo2 and Olmo3 each subclass the previous one. After make fix-repo, the generated modeling_*.py files use the GenericForSequenceClassification mixin, same as Jamba, JetMoe, Ministral3, and Gemma3Text.

Scope is the dense chain only. OlmoHybrid and the MoE branch (Olmoe, FlexOlmo) can come as follow-up PRs.

Fixes #45529

Code Agent Policy

This was drafted with Claude Code. @Rocketknight1 explicitly opted in for this specific change on the coordination issue:

This is welcome! Sequence classification heads are often not included in the initial PR adding a new causal LM, but we're happy to add them. Your reason for needing it is good, and PRs like this are usually very easy to automate, so I'm happy for it to be mostly AI-written.

https://github.com/huggingface/transformers/issues/45529#issuecomment-4288374995

I read every changed line before each commit, ran the local test suite on my machine, and did GPU validation on real checkpoints before opening this. I can defend each change.

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? #45529
  • Did you make sure to update the documentation with your changes? Added autodoc sections to docs/source/en/model_doc/olmo{,2,3}.md.
  • Did you write any new necessary tests? See test plan below.

Changes

  • src/transformers/models/olmo/modular_olmo.py: class OlmoForSequenceClassification(LlamaForSequenceClassification): pass
  • src/transformers/models/olmo2/modular_olmo2.py: subclass of OlmoForSequenceClassification
  • src/transformers/models/olmo3/modular_olmo3.py: subclass of Olmo2ForSequenceClassification
  • Regenerated modeling_olmo.py, modeling_olmo2.py, modeling_olmo3.py via make fix-repo
  • Registered all three in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES in models/auto/modeling_auto.py
  • Added autodoc sections to docs/source/en/model_doc/olmo{,2,3}.md
  • Tests:
    • Olmo and Olmo2 use the older ModelTesterMixin pattern. Added the new class to all_model_classes and text-classification/zero-shot to pipeline_model_mapping.
    • Olmo3 uses the newer CausalLMModelTester. Set sequence_classification_class = Olmo3ForSequenceClassification, which auto-enables the three test_sequence_classification_model* tests from the base class.
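
The registration step in the list above can be pictured as a lookup from a config's model type to a class name. The sketch below is a simplified stand-in, not the library's code: the dictionary reuses the name MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES, while resolve_head, the single llama seed entry, and the keys olmo, olmo2, and olmo3 are assumptions made for illustration.

```python
from collections import OrderedDict

# Simplified stand-in for the auto-mapping before the PR (one peer entry shown).
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
    [("llama", "LlamaForSequenceClassification")]
)


def resolve_head(model_type, mapping):
    """Return the class name registered for model_type, or raise like a failed dispatch."""
    try:
        return mapping[model_type]
    except KeyError:
        raise ValueError(
            f"No sequence-classification head registered for {model_type!r}"
        ) from None


# Before: the OLMo family is missing, so the lookup fails.
try:
    resolve_head("olmo2", MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES)
    failed_before = False
except ValueError:
    failed_before = True

# After: the three entries the PR registers (keys assumed to match each config's model type).
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES.update(
    {
        "olmo": "OlmoForSequenceClassification",
        "olmo2": "Olmo2ForSequenceClassification",
        "olmo3": "Olmo3ForSequenceClassification",
    }
)
resolved_after = resolve_head("olmo2", MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES)
```

The real mapping holds many more architectures; the point is only that the Auto class cannot dispatch to a head that has no entry.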

Test plan

Local, on MPS:

  • make style, make typing, make check-repo all pass
  • pytest tests/models/olmo tests/models/olmo2 tests/models/olmo3: 413 passed, 0 failing. 4 test_tp_* tests deselected because they require CUDA multi-GPU.
  • Olmo3's three test_sequence_classification_model* tests pass

GPU validation notebook with outputs: https://gist.github.com/earino/2bc6f246eef21a36c3c64d64150b9510

Ran on an NVIDIA RTX PRO 6000 Blackwell. For each of allenai/OLMo-1B-hf, allenai/OLMo-2-0425-1B, and allenai/Olmo-3-7B-Instruct:

  • AutoModelForSequenceClassification.from_pretrained(...) dispatches to the right class
  • Forward returns logits of shape (batch, num_labels)
  • Loss is finite at random init, roughly ln(num_labels) as expected
  • Backward produces finite gradients for every trainable parameter
  • The library's LOAD REPORT correctly shows score.weight | MISSING (new head) and lm_head.weight | UNEXPECTED (causal-LM head unused)
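
The "roughly ln(num_labels)" expectation follows from cross-entropy on near-uniform predictions: a freshly initialised head that gives about equal probability to each of K labels incurs a loss of about -ln(1/K) = ln(K). A small pure-Python check, where cross_entropy is a hypothetical helper written for this illustration rather than a library function:

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log softmax(logits)[target], computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


# Equal logits model an uninformative head; the loss equals ln(K) for any target.
binary_loss = cross_entropy([0.0, 0.0], target=1)        # K = 2
many_class_loss = cross_entropy([0.0] * 113, target=7)   # K = 113, as in the issue's task
```

With two labels this is about 0.69, and with 113 labels about 4.73. A real randomly initialised head will not produce exactly equal logits, so the observed loss only approximates these values.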

A LoRA fine-tune on IMDB with allenai/OLMo-2-0425-1B (4.2M trainable params, 250 steps) brings loss from 1.48 over the first 20 steps down to 0.0005 over the last 20. I only ran the full training loop on one of the three because the classification head implementation is identical across them (same GenericForSequenceClassification mixin, different backbones and *PreTrainedModel bases), so the training-loop plumbing is shared; smoke-test forward/backward covers the other two. Happy to extend the LoRA run to Olmo and Olmo3 if you'd prefer.
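
For readers unfamiliar with decoder classification heads, the shared head can be sketched as follows: take the hidden state of the last non-padding token in each sequence and project it to num_labels scores with a bias-free linear layer (consistent with the load report listing only score.weight). The code is a pure-Python simplification under that assumption, not the GenericForSequenceClassification source; last_non_pad_index and score are illustrative names.

```python
def last_non_pad_index(input_ids, pad_token_id):
    """Index of the rightmost token that is not padding (works for left or right padding)."""
    for index in range(len(input_ids) - 1, -1, -1):
        if input_ids[index] != pad_token_id:
            return index
    raise ValueError("sequence contains only padding")


def score(hidden_states, input_ids, weight, pad_token_id):
    """Project the pooled token's hidden state to one logit per label (no bias)."""
    pooled = hidden_states[last_non_pad_index(input_ids, pad_token_id)]
    return [sum(w * h for w, h in zip(row, pooled)) for row in weight]


# Toy example: 4 tokens (the last one is padding), hidden size 2, 3 labels.
input_ids = [11, 12, 13, 0]
hidden_states = [[0.1, 0.2], [0.3, 0.4], [1.0, 2.0], [9.0, 9.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape (num_labels, hidden_size)
logits = score(hidden_states, input_ids, weight, pad_token_id=0)
```

Because only the backbone producing hidden_states differs between Olmo, Olmo2, and Olmo3, a head of this shape behaves the same way on top of each of them.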

Who can review?

cc @Rocketknight1 as requested on #45529.

Changed files

  • docs/source/en/model_doc/olmo.md (modified, +5/-0)
  • docs/source/en/model_doc/olmo2.md (modified, +5/-0)
  • docs/source/en/model_doc/olmo3.md (modified, +5/-0)
  • src/transformers/models/auto/modeling_auto.py (modified, +3/-0)
  • src/transformers/models/olmo/modeling_olmo.py (modified, +6/-2)
  • src/transformers/models/olmo/modular_olmo.py (modified, +6/-0)
  • src/transformers/models/olmo2/modeling_olmo2.py (modified, +6/-2)
  • src/transformers/models/olmo2/modular_olmo2.py (modified, +6/-0)
  • src/transformers/models/olmo3/modeling_olmo3.py (modified, +6/-2)
  • src/transformers/models/olmo3/modular_olmo3.py (modified, +6/-0)
  • tests/models/olmo/test_modeling_olmo.py (modified, +4/-1)
  • tests/models/olmo2/test_modeling_olmo2.py (modified, +4/-1)
  • tests/models/olmo3/test_modeling_olmo3.py (modified, +2/-0)

Motivation

I teach the graduate Applied Deep Learning course at Central European University (ECBS5200, http://earino.github.io/applied-deep-learning). The course runs a six-week project fine-tuning models for a 113-class text classification task (consumer financial complaints) on free-tier Kaggle T4 GPUs, with an explicit module comparing encoder vs decoder architectures and a capstone on model economics.

I insist on fully-open models (open weights + open training data + open training code) so students can inspect and reproduce the full pipeline. OLMo-2 1B is the only small-decoder option in that category that fits a T4, but AutoModelForSequenceClassification.from_pretrained("allenai/OLMo-2-0425-1B") currently fails — which blocks the decoder-comparison module from using the canonical HF classification API.

Adding this head would let a concrete graduate cohort learn decoder fine-tuning on a fully-transparent pipeline using the same idiomatic API they use for the encoder baseline.

Proposed approach

Add Olmo2ForSequenceClassification in modular_olmo2.py as a subclass of LlamaForSequenceClassification re-pointed at Olmo2Model, let make fix-repo regenerate modeling_olmo2.py, register in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES, and extend tests/models/olmo2/test_modeling_olmo2.py with the standard ModelTesterMixin classification tests. End-to-end verified on real data on a GPU before PR.

Questions before I start

  1. Welcome in principle, or was the omission intentional?
  2. Scope — OLMo-2 only, or should I include OLMo and OLMo-3 in the same PR for consistency?
  3. Any preferences on test coverage beyond the standard ModelTesterMixin?

Will disclose AI assistance and link this thread in the PR per the CLAUDE.md agentic contribution policy.

extent analysis

TL;DR

Add Olmo2ForSequenceClassification to enable sequence classification for OLMo-2 models using the canonical Hugging Face classification API.

Guidance

  • The failure stems from the OLMo family lacking a ForSequenceClassification head, which peer decoder architectures (Llama, Mistral, Qwen2, Gemma, Falcon, etc.) already ship.
  • To resolve this, create a new class Olmo2ForSequenceClassification as a subclass of LlamaForSequenceClassification, but pointed at Olmo2Model, and register it in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES.
  • Extend the tests in tests/models/olmo2/test_modeling_olmo2.py with the standard ModelTesterMixin classification tests to ensure the new functionality works as expected.
  • The issue proposed OLMo-2 only and asked whether to include OLMo and OLMo-3; PR #45551 covers all three dense models in one change.

Example

No code snippet is provided as the issue does not contain sufficient code context, but the proposed approach suggests modifying modular_olmo2.py and tests/models/olmo2/test_modeling_olmo2.py.

Notes

The maintainer welcomed the change on the issue thread, and PR #45551 settled the scope: OLMo, OLMo-2, and OLMo-3 are covered, while OlmoHybrid and the MoE branch (Olmoe, FlexOlmo) are left for follow-up PRs.

Recommendation

Adopt the fix by adding Olmo2ForSequenceClassification as described; it enables the canonical Hugging Face classification API for OLMo-2 models, which is what the graduate course project requires.
