transformers - 💡(How to fix) Fix Feature Request: Add SCAO Optimizer integration for 1.5x faster fine-tuning throughput


Feature request

Currently, AdamW is the default standard for fine-tuning via the Trainer class. While robust, its diagonal approximation of loss curvature makes early convergence slow, which is particularly expensive in compute-constrained environments or rapid fine-tuning pipelines (like LoRA/PEFT) where wall-clock time is the primary bottleneck.

I have developed and open-sourced SCAO (Sparse Curvature-Aware Adaptive Optimizer), a second-order PyTorch optimizer designed as a drop-in replacement for AdamW. It delivers Shampoo-quality preconditioned gradients but uses adaptive rank selection (keeping only top-k eigenvectors) to make it highly efficient.

I would like to propose integrating SCAO as a supported optimizer within TrainingArguments (e.g., optim="scao").

Experimental Results & Benchmarks

In our benchmarks against AdamW, SCAO demonstrates a significant advantage in both raw throughput and scaling capabilities:

1M Parameter Model (500 steps):

Throughput: 827 tok/s vs AdamW's 537 tok/s (+54% faster).

Time-to-Target: Reaches a PPL of 14.10 in just 320 seconds (AdamW takes 582s).

Quality: Final PPL gap is negligible (+0.18 PPL compared to AdamW), but achieved in ~60% of the wall-clock time.

5M Parameter Model (Scaling Dominance):

As model complexity increases, SCAO's second-order approximation shines. At 5M parameters, SCAO beat AdamW by 2.55 PPL (23.94 vs 26.49), an improvement of ~9.6%.
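The percentages quoted for both model sizes can be re-derived from the raw figures. The short check below only repeats the arithmetic on the numbers stated in this issue; it does not re-run any benchmark, and the variable names are chosen here for illustration.

```python
# Figures copied from the benchmark summary in this issue (not re-measured).
scao_tok_s, adamw_tok_s = 827, 537          # 1M model throughput, tok/s
scao_secs, adamw_secs = 320, 582            # 1M model, seconds to reach PPL 14.10
scao_ppl_5m, adamw_ppl_5m = 23.94, 26.49    # 5M model final perplexity

throughput_gain = scao_tok_s / adamw_tok_s - 1   # relative speed-up
time_fraction = scao_secs / adamw_secs           # share of AdamW's time-to-target
ppl_gap = adamw_ppl_5m - scao_ppl_5m             # absolute PPL improvement
ppl_rel = ppl_gap / adamw_ppl_5m                 # relative PPL improvement

print(f"throughput gain:  {throughput_gain:.1%}")
print(f"time-to-target:   {time_fraction:.1%} of AdamW's")
print(f"5M PPL gap:       {ppl_gap:.2f} ({ppl_rel:.1%})")
```

The throughput ratio of about 1.54 is where the "1.5x faster" in the title comes from, and the time-to-target ratio works out to roughly 55% of AdamW's wall-clock time.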

Note: SCAO introduces a ~2.2x memory overhead compared to AdamW due to the sparse preconditioner matrices. This overhead becomes a smaller share of total memory at larger batch sizes. For layers exceeding max_precond_dim, SCAO falls back to a diagonal approximation to prevent OOM errors.
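To make the fallback rule concrete, here is a minimal sketch of how a per-layer decision based on max_precond_dim could look. It is not SCAO's actual code: the function name, the rule "fall back when any dimension exceeds the limit", the limit of 2048, and the layer shapes are all assumptions made for illustration.

```python
def choose_preconditioner(shape, max_precond_dim):
    """Pick the update path for one parameter tensor.

    Assumed rule (not taken from the SCAO source): use the diagonal
    fallback when any dimension of the tensor exceeds max_precond_dim,
    otherwise use the curvature-aware preconditioner.
    """
    if max(shape) > max_precond_dim:
        return "diagonal"
    return "curvature"


# Hypothetical layer shapes for a small transformer.
layers = {
    "attn.q_proj": (512, 512),
    "mlp.up_proj": (512, 2048),
    "embed_tokens": (32000, 512),
}

# 2048 is an illustrative limit, not a documented default.
plan = {name: choose_preconditioner(shape, 2048) for name, shape in layers.items()}
for name, path in plan.items():
    print(f"{name:>14}: {path}")
```

Under this assumed rule, only the embedding matrix takes the diagonal path, which is the kind of layer where a dense preconditioner would be most likely to exhaust memory.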

Proposed Implementation Design

Since SCAO inherits from torch.optim.Optimizer, the integration is straightforward:

Add "scao" to OptimizerNames in training_args.py.

Add the initialization logic in Trainer.create_optimizer().

Allow passing SCAO-specific kwargs (like precond_freq, warmup_steps) via optim_args.
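The three steps above can be sketched without touching the library itself. The block below is a self-contained stand-in, not the proposed patch: the OptimizerNames class here is a two-member mock of the real enum, get_optimizer_kwargs is a hypothetical helper, the comma-separated key=value format for optim_args is assumed, and the values 10 and 100 are illustrative rather than recommended settings.

```python
from enum import Enum


class OptimizerNames(str, Enum):
    """Mock of the enum in training_args.py; only two members shown."""
    ADAMW_TORCH = "adamw_torch"
    SCAO = "scao"  # step 1: the member this proposal would add


# Hypothetical type casts for the two kwargs named in the issue.
SCAO_KWARG_TYPES = {"precond_freq": int, "warmup_steps": int}


def parse_optim_args(optim_args):
    """Parse a 'k1=v1,k2=v2' string into a dict of raw string values."""
    parsed = {}
    if optim_args:
        for mapping in optim_args.replace(" ", "").split(","):
            key, value = mapping.split("=")
            parsed[key] = value
    return parsed


def get_optimizer_kwargs(optim, optim_args, learning_rate):
    """Step 2 and 3 in miniature: resolve the name and build optimizer kwargs."""
    name = OptimizerNames(optim)  # raises ValueError for unknown names
    kwargs = {"lr": learning_rate}
    if name is OptimizerNames.SCAO:
        for key, value in parse_optim_args(optim_args).items():
            # Unknown keys are passed through as strings in this sketch.
            kwargs[key] = SCAO_KWARG_TYPES.get(key, str)(value)
    return name, kwargs


name, kwargs = get_optimizer_kwargs(
    "scao", "precond_freq=10, warmup_steps=100", learning_rate=2e-4
)
print(name.value, kwargs)
```

In an actual integration, the returned kwargs would be handed to the SCAO constructor together with the model's parameter groups. Casting the values explicitly matters because everything arriving through optim_args is a string.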

Code repository and Paper

GitHub Repo: https://github.com/whispering3/scao

Next Steps

I already have a working integration of SCAO with the Trainer class locally. If the core team is open to this addition, I would be more than happy to submit a Pull Request with the implementation, comprehensive tests, and documentation.

Would love to hear the maintainers' thoughts on this!

Motivation

I'm always frustrated when fine-tuning LLMs in compute-constrained environments because AdamW's diagonal approximation of loss curvature makes early convergence slow. Wall-clock time is a primary bottleneck for rapid fine-tuning pipelines (like LoRA/PEFT).

Your contribution

I already have a working integration of SCAO with the Trainer class locally, and I am ready to submit a PR if the core team is open to this addition.

Since SCAO inherits from torch.optim.Optimizer, the proposed implementation is very straightforward:

Add "scao" to OptimizerNames in training_args.py.

Add the initialization logic in Trainer.create_optimizer().

Allow passing SCAO-specific kwargs via optim_args.

Let me know if I should go ahead and open the Pull Request!

extent analysis

TL;DR

Integrate SCAO as a supported optimizer within TrainingArguments by adding "scao" to OptimizerNames and implementing initialization logic in Trainer.create_optimizer().

Guidance

  • Review the proposed implementation design to ensure it aligns with the existing codebase and standards.
  • Verify that SCAO's integration does not introduce any compatibility issues with other components or features.
  • Consider adding comprehensive tests and documentation to the Pull Request to facilitate review and maintenance.
  • Evaluate the trade-off between SCAO's improved performance and its increased memory overhead, particularly for larger models or batch sizes.

Example

No code snippet is provided as the issue does not contain specific implementation details that can be confidently reproduced.

Notes

The integration of SCAO may require additional consideration for its memory overhead and potential impact on performance in certain scenarios.

Recommendation

Integrate SCAO as a supported optimizer: the benchmarks reported in the issue show significant performance advantages over AdamW, particularly in compute-constrained environments.
