transformers - 💡(How to fix) Fix Need an example for FSDP + FP16 training [2 comments, 3 participants]

GitHub stats
huggingface/transformers#44169 · Fetched 2026-04-08 00:30:01
Comments: 2 · Participants: 3 · Timeline: 3 (commented ×2, closed ×1) · Reactions: 0

In my setup, I am trying to run FSDP with FP16 precision. Is there any limitation that prevents using FSDP with FP16? How can I convert my existing code to FSDP with FP16 precision? I believe the ShardedGradScaler from FSDP should be used. How does its implementation differ from the regular GradScaler? It would be great if someone could share a concise example.

extent analysis

Fix Plan

Convert to FSDP with FP16 Precision

Step 1: Install Required Packages

  • Install the transformers and torch packages if they are not already installed
  • No separate package is needed for mixed precision: AMP and FSDP both ship with torch
pip install transformers torch

Step 2: Import Required Modules

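import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler
from transformers import AutoModelForSequenceClassification

Step 3: Initialize FSDP with FP16 Precision

# One process per GPU; torchrun sets LOCAL_RANK for each process
dist.init_process_group(backend='nccl')
torch.cuda.set_device(int(os.environ['LOCAL_RANK']))

# Wrap the model in FSDP with an FP16 mixed-precision policy
model = AutoModelForSequenceClassification.from_pretrained('your_model_name')
fp16_policy = MixedPrecision(
    param_dtype=torch.float16,
    reduce_dtype=torch.float16,
    buffer_dtype=torch.float16,
)
model = FSDP(model, mixed_precision=fp16_policy, device_id=torch.cuda.current_device())

# Initialize ShardedGradScaler (FP16 needs loss scaling to avoid gradient underflow)
scaler = ShardedGradScaler()

Step 4: Train Model with FSDP and FP16 Precision

# Create the optimizer after FSDP wrapping so it sees the sharded parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

# Train model
model.train()
for epoch in range(10):
    # train_loader is a placeholder for your DataLoader; each batch must
    # contain input_ids, attention_mask and labels so that outputs.loss is set
    for batch in train_loader:
        batch = {k: v.to('cuda') for k, v in batch.items()}

        # Zero gradients
        optimizer.zero_grad()

        # Forward pass with autocast
        with torch.autocast(device_type='cuda', dtype=torch.float16):
            outputs = model(**batch)
            loss = outputs.loss

        # Backward pass with scaler; step() skips the update on inf/NaN gradients
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    # Update scheduler once per epoch
    scheduler.step()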
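
The `scaler.scale` / `scaler.step` / `scaler.update` calls in the loop above implement dynamic loss scaling. As a rough illustration of that bookkeeping, here is a dependency-free sketch. The class name `ToyLossScaler` is hypothetical, and the defaults mirror the ones documented for PyTorch's `GradScaler` (initial scale 2**16, growth factor 2.0, backoff factor 0.5, growth interval 2000). It is a simplified model under those assumptions, not PyTorch's implementation.

```python
import math


class ToyLossScaler:
    """Hypothetical, simplified model of dynamic loss scaling (not PyTorch code)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps without inf/NaN gradients

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if the optimizer step is skipped."""
        found_inf = any(not math.isfinite(g) for g in scaled_grads)
        if found_inf:
            # Overflow: skip the update and shrink the scale for the next step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return None

        # Unscale with the same scale that was applied to the loss.
        unscaled = [g / self.scale for g in scaled_grads]

        # After enough clean steps in a row, try a larger scale.
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            self.scale *= self.growth_factor
            self._good_steps = 0
        return unscaled
```

In this toy model a skipped step returns `None`, which corresponds to the real scaler not calling `optimizer.step()` when it finds non-finite gradients.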

Step 5: Verify Fix

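  • Launch the training script with torchrun, one process per GPU
  • Check that the loss stays finite and decreases over the first few hundred steps
  • Verify that the ShardedGradScaler is working: scaler.get_scale() should stay positive and drop only occasionally, when a step is skipped because of an overflow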
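
On how the sharded scaler differs from the regular one: under FSDP each rank holds only a shard of the gradients, so an overflow may be visible on a single rank. `ShardedGradScaler` therefore combines the per-rank overflow flags across the process group before deciding whether to skip the step, so all ranks stay in sync. The dependency-free sketch below simulates that idea with plain lists. `local_found_inf`, `all_reduce_sum` and `should_skip_step` are hypothetical helper names standing in for the real collective, not PyTorch APIs.

```python
import math


def local_found_inf(grad_shard):
    """Return 1.0 if this rank's gradient shard contains inf/NaN, else 0.0."""
    return 1.0 if any(not math.isfinite(g) for g in grad_shard) else 0.0


def all_reduce_sum(flags):
    """Stand-in for a collective: every rank receives the sum of all flags."""
    total = sum(flags)
    return [total] * len(flags)


def should_skip_step(grad_shards):
    """Per-rank decision after the overflow flags have been synchronized."""
    local_flags = [local_found_inf(shard) for shard in grad_shards]
    global_flags = all_reduce_sum(local_flags)
    # A single overflowing shard makes every rank skip the optimizer step.
    return [flag > 0.0 for flag in global_flags]
```

Without the synchronization step, the rank that saw the overflow would skip its update while the others applied theirs, and the sharded parameters would drift apart.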
