transformers - 💡(How to fix) Fix First-class fine-tuning support for Mamba / Mamba-2 SSMs — architecture is production-ready, but the training path in Transformers isn't [5 comments, 4 participants]

GitHub stats
huggingface/transformers#44929 (fetched 2026-04-08 01:16:56)
Comments: 5 · Participants: 4 · Timeline events: 13 · Reactions: 0
Timeline (top): commented ×5, subscribed ×4, mentioned ×3, labeled ×1

Feature request

You can load Mamba models in Transformers — but the moment you try to actually fine-tune one, things fall apart fast. The standard Trainer was built around attention + KV cache assumptions that SSMs simply don't share. Gradient checkpointing breaks in weird ways, DataCollatorForLanguageModeling doesn't account for SSM inputs, and LoRA targeting on Mamba layers is a total wild west — everyone's doing it differently, nobody's sure if it's right. You're shipping hybrid SSM models like OLMo Hybrid in v5.3 but fine-tuning them reliably still feels like a DIY project.

What I'm asking for

A simple official Mamba fine-tuning example script — just one reference implementation people can trust

Motivation

SSM and hybrid architectures aren't experimental anymore — they're in production. The people trying to fine-tune them on medical, legal, and scientific text are exactly who needs this to just work. Right now it doesn't.

Your contribution

Point me at the right files and I'll take a shot at the example script or docs.

extent analysis

Fix Plan

To address the issue, create a simple, official Mamba fine-tuning example script. The steps:

  • Create a new Python script, e.g., mamba_fine_tuning_example.py
  • Import the necessary libraries, including transformers and torch
  • Load a pre-trained Mamba model and tokenizer, and wrap the training texts in a dataset class
  • Define a data collator that pads each batch and builds causal-LM labels (Mamba takes ordinary token IDs, so no SSM-specific input format is needed)
  • Train with the Trainer class; gradient checkpointing and LoRA targeting can be layered on once the basic loop runs
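The LoRA step is the least settled part of this plan, and the issue itself notes there is no consensus on which Mamba layers to target. As a hedged sketch, the helper below picks candidate target modules by name suffix. The suffix list (`in_proj`, `x_proj`, `dt_proj`, `out_proj`) is an assumption based on the linear projections inside the Hugging Face Mamba mixer block, not an official recommendation, and `select_lora_targets` is a hypothetical helper name.

```python
from typing import Iterable, List

# Linear projections inside each Mamba mixer block. Treating these as LoRA
# targets is an assumption of this sketch, not an official recommendation.
ASSUMED_MAMBA_PROJECTIONS = ("in_proj", "x_proj", "dt_proj", "out_proj")


def select_lora_targets(
    module_names: Iterable[str],
    suffixes: Iterable[str] = ASSUMED_MAMBA_PROJECTIONS,
) -> List[str]:
    """Return the module names whose last dotted component is in `suffixes`."""
    wanted = set(suffixes)
    return [name for name in module_names if name.rsplit(".", 1)[-1] in wanted]


# Hand-written names in the style of `model.named_modules()` on a Mamba checkpoint.
example_names = [
    "backbone.embeddings",
    "backbone.layers.0.norm",
    "backbone.layers.0.mixer",
    "backbone.layers.0.mixer.in_proj",
    "backbone.layers.0.mixer.conv1d",
    "backbone.layers.0.mixer.x_proj",
    "backbone.layers.0.mixer.dt_proj",
    "backbone.layers.0.mixer.out_proj",
    "lm_head",
]
targets = select_lora_targets(example_names)
print(targets)
```

With a loaded model, the same helper could be fed `(name for name, _ in model.named_modules())`, and the resulting list passed as `target_modules` to PEFT's `LoraConfig`. Whether adapting all four projections is the right choice for a given task is left open here.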

Example Code

from torch.utils.data import Dataset
from transformers import AutoTokenizer, MambaForCausalLM, Trainer, TrainingArguments

# Load pre-trained Mamba model and tokenizer.
# The checkpoint id is an example; substitute the model you want to fine-tune.
model_id = "state-spaces/mamba-130m-hf"
model = MambaForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # a pad token is required for batching

# Define custom dataset class
class MambaDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Return unpadded token lists; padding is done per batch in the collator
        return self.tokenizer(self.texts[idx], truncation=True, max_length=self.max_length)

# Define custom data collator for causal language modeling
class MambaDataCollator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features):
        # Pad to the longest sequence in the batch
        batch = self.tokenizer.pad(features, return_tensors="pt")
        # Labels are the input ids; the model shifts them internally.
        # Padded positions are set to -100 so the loss ignores them.
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100
        batch["labels"] = labels
        return dict(batch)

# Create dataset and data collator
dataset = MambaDataset(["Example text"], tokenizer)
data_collator = MambaDataCollator(tokenizer)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    eval_strategy="epoch",  # named evaluation_strategy in older Transformers releases
    save_strategy="epoch",  # must match eval_strategy when load_best_model_at_end=True
    learning_rate=1e-5,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # the Trainer reports this whenever labels are present
    greater_is_better=False,
    save_on_each_node=True,
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset,  # placeholder; use a held-out split in practice
    data_collator=data_collator,
)

# Train model
trainer.train()

Verification

To verify that the fix worked, run the example script and check that the model fine-tunes successfully without errors. You can also evaluate the
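One way to make "fine-tunes successfully" concrete is to check that the logged training loss actually trends downward. The sketch below works on a list of dicts in the shape of `trainer.state.log_history`; the helper names, the `window` parameter, and the sample entries are illustrative assumptions, and a falling loss is only a sanity check, not proof of a good model.

```python
from typing import Dict, List


def training_losses(log_history: List[Dict[str, float]]) -> List[float]:
    """Collect training-loss values, skipping eval and summary entries."""
    return [entry["loss"] for entry in log_history if "loss" in entry]


def loss_decreased(log_history: List[Dict[str, float]], window: int = 2) -> bool:
    """True if the mean of the last `window` losses is below the mean of the first `window`."""
    losses = training_losses(log_history)
    if len(losses) < 2 * window:
        return False  # not enough logged steps to compare
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail < head


# Hand-written entries in the shape of `trainer.state.log_history`.
sample_history = [
    {"loss": 3.2, "step": 10},
    {"loss": 2.9, "step": 20},
    {"eval_loss": 2.8, "step": 20},
    {"loss": 2.5, "step": 30},
    {"loss": 2.4, "step": 40},
    {"train_runtime": 12.0, "step": 40},
]
print(loss_decreased(sample_history))
```

After a real run, `loss_decreased(trainer.state.log_history)` would perform the same check, provided `logging_steps` is small enough that several loss values were recorded.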
