transformers - 💡(How to fix) Fix First-class fine-tuning support for Mamba / Mamba-2 SSMs — architecture is production-ready, but the training path in Transformers isn't [5 comments, 4 participants]

transformers 2026-03-22 17:23:58


GitHub stats

huggingface/transformers#44929•Fetched 2026-04-08 01:16:56


Feature request

You can load Mamba models in Transformers — but the moment you try to actually fine-tune one, things fall apart fast. The standard Trainer was built around attention + KV cache assumptions that SSMs simply don't share. Gradient checkpointing breaks in weird ways, DataCollatorForLanguageModeling doesn't account for SSM inputs, and LoRA targeting on Mamba layers is a total wild west — everyone's doing it differently, nobody's sure if it's right. You're shipping hybrid SSM models like OLMo Hybrid in v5.3 but fine-tuning them reliably still feels like a DIY project.

What I'm asking for

A simple official Mamba fine-tuning example script — just one reference implementation people can trust

Motivation

SSM and hybrid architectures aren't experimental anymore — they're in production. The people trying to fine-tune them on medical, legal, and scientific text are exactly who needs this to just work. Right now it doesn't.

Your contribution

Point me at the right files and I'll take a shot at the example script or docs.

extent analysis

Fix Plan

To address the issue, we need to create a simple official Mamba fine-tuning example script. Here are the steps:

Create a new Python script, e.g., mamba_fine_tuning_example.py
Import the necessary libraries, including transformers and torch
Load a pre-trained Mamba model and its tokenizer, and create a custom dataset class that tokenizes the training texts
Define a custom data collator (in place of DataCollatorForLanguageModeling) that pads each batch and builds the labels
Use the Trainer class with gradient checkpointing and LoRA targeting
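The collator step above mostly comes down to padding variable-length sequences and making sure padded positions do not contribute to the loss. Below is a minimal, dependency-free sketch of that logic. The function name `pad_causal_lm_batch` is hypothetical and not part of Transformers; the value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_causal_lm_batch(sequences, pad_token_id):
    """Right-pad token-id lists and build labels that ignore the padding.

    Hypothetical helper for illustration; it works on plain Python lists.
    """
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(list(seq) + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; padded positions are masked out of the loss.
        batch["labels"].append(list(seq) + [IGNORE_INDEX] * n_pad)
    return batch


batch = pad_causal_lm_batch([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, 9, -100]]
```

Right padding is assumed here because a causal model's outputs at earlier positions do not depend on tokens that come later in the sequence.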

Example Code

import torch
from torch.utils.data import Dataset
from transformers import (
    AutoTokenizer,
    MambaForCausalLM,
    Trainer,
    TrainingArguments,
)

# Load pre-trained Mamba model and tokenizer.
# The checkpoint name is an assumption; substitute the Mamba checkpoint you want to tune.
checkpoint = "state-spaces/mamba-130m-hf"
model = MambaForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    # Padding needs a pad token; reuse EOS if the tokenizer does not define one.
    tokenizer.pad_token = tokenizer.eos_token

# Define custom dataset class
class MambaDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        # Trainer needs the dataset length to build its dataloader
        return len(self.texts)

    def __getitem__(self, idx):
        # Return plain lists of token ids; padding happens in the collator
        enc = self.tokenizer(self.texts[idx], truncation=True, max_length=self.max_length)
        return {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}

# Define custom data collator: pad the batch and build causal-LM labels
class MambaDataCollator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features):
        batch = self.tokenizer.pad(features, return_tensors="pt")
        labels = batch["input_ids"].clone()
        # Ignore padded positions in the loss (the model shifts labels internally)
        labels[batch["attention_mask"] == 0] = -100
        batch["labels"] = labels
        return dict(batch)

# Create dataset and data collator
dataset = MambaDataset(["Example text"], tokenizer)
data_collator = MambaDataCollator(tokenizer)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    eval_strategy="epoch",  # named evaluation_strategy in older releases
    save_strategy="epoch",  # must match eval_strategy for load_best_model_at_end
    learning_rate=1e-5,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # lower is better for a language-modeling loss
    greater_is_better=False,
    save_on_each_node=True,
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset,  # reuses the training data only to keep the example short
    data_collator=data_collator,
)

# Train model
trainer.train()
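The script above does not cover the LoRA targeting mentioned in the Fix Plan. Since there is no agreed-upon target list for Mamba layers, one cautious starting point is to list the linear projections a model actually contains before choosing targets. The sketch below is an illustration under assumptions: `lora_target_suffixes` is a hypothetical helper, the stand-in classes replace real PyTorch layers so the snippet runs without a model, and the module names are meant to mirror those of a Mamba mixer block, which you should confirm with `model.named_modules()` on your own checkpoint.

```python
def lora_target_suffixes(named_modules, layer_type="Linear", exclude=("lm_head",)):
    """Collect distinct leaf names of modules whose class name is `layer_type`.

    `named_modules` is any iterable of (name, module) pairs, for example the
    result of `model.named_modules()` on a PyTorch model.
    """
    suffixes = set()
    for name, module in named_modules:
        leaf = name.rsplit(".", 1)[-1]
        if type(module).__name__ == layer_type and leaf not in exclude:
            suffixes.add(leaf)
    return sorted(suffixes)


# Stand-in classes so the example runs without PyTorch (hypothetical).
class Linear:
    pass


class Conv1d:
    pass


# Assumed module layout of one Mamba block plus the output head.
toy_modules = [
    ("backbone.layers.0.mixer.in_proj", Linear()),
    ("backbone.layers.0.mixer.conv1d", Conv1d()),
    ("backbone.layers.0.mixer.x_proj", Linear()),
    ("backbone.layers.0.mixer.dt_proj", Linear()),
    ("backbone.layers.0.mixer.out_proj", Linear()),
    ("lm_head", Linear()),
]

targets = lora_target_suffixes(toy_modules)
print(targets)  # ['dt_proj', 'in_proj', 'out_proj', 'x_proj']
```

A list like this could then be passed as `target_modules` to a PEFT `LoraConfig`. Which subset of these projections is the right one to adapt remains an open question, so treat the output as a candidate list to experiment with, not a recommendation.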

Verification

To verify that the fix worked, run the example script and check that the model fine-tunes successfully without errors. You can also evaluate the
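One simple way to check that a run made progress is to compare the first and last logged training loss. The sketch below assumes log entries shaped like those in `trainer.state.log_history` (a list of dicts); the helper name `loss_trend` and the sample numbers are hypothetical.

```python
def loss_trend(log_history, key="loss"):
    """Return (first, last) values logged under `key`, or None if fewer than two exist."""
    values = [entry[key] for entry in log_history if key in entry]
    if len(values) < 2:
        return None
    return values[0], values[-1]


# Made-up entries in the shape of trainer.state.log_history
example_history = [
    {"loss": 2.91, "step": 10},
    {"eval_loss": 2.75, "step": 10},
    {"loss": 2.40, "step": 20},
]

first, last = loss_trend(example_history)
print(last < first)  # True
```

A falling training loss only shows that optimization is working; it says nothing about quality on held-out data, so an evaluation set separate from the training texts is still needed.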
