pytorch - ✅(Solved) Fix Muon documentation lacks minimal example [1 pull request, 2 comments, 2 participants]

GitHub stats
pytorch/pytorch#177029 · Fetched 2026-04-08 00:22:38
Comments: 2 · Participants: 2 · Timeline events: 18 · Reactions: 0
Timeline (top): labeled ×4, referenced ×4, mentioned ×3, subscribed ×3

Fix Action

Fixed

PR fix notes

PR #177262: Add minimal usage example to Muon optimizer docstring (#177029)

Description (problem / solution / changelog)

Fixes #177029. Adds an Example: section to the torch.optim.Muon docstring showing how to split 2D parameters (for Muon) from biases/embeddings (for AdamW), matching the pattern from the external Muon repo's MuonWithAuxAdam but using native PyTorch optimizers.
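The split described above can be sketched roughly as follows. This is not the docstring text from the PR, only an illustration under stated assumptions: the helper names split_params_for_muon and build_optimizers are hypothetical, the "embed"/"head" name filter is an assumed naming convention, and the learning rates are copied from the issue's snippet rather than recommended values. It also assumes a PyTorch build that ships torch.optim.Muon and that this optimizer is given only exactly-2D parameters, as the PR description implies.

```python
from types import SimpleNamespace


def split_params_for_muon(named_params, aux_keywords=("embed", "head")):
    """Split (name, param) pairs into a Muon list and an AdamW list.

    Exactly-2D weights whose names contain none of `aux_keywords` go to Muon;
    everything else (biases, gains, embeddings, output head) goes to AdamW.
    The keyword filter is a hypothetical naming convention, not a PyTorch rule.
    """
    muon_params, adamw_params = [], []
    for name, param in named_params:
        is_aux = any(keyword in name for keyword in aux_keywords)
        if param.ndim == 2 and not is_aux:
            muon_params.append(param)
        else:
            adamw_params.append(param)
    return muon_params, adamw_params


def build_optimizers(model):
    """Create one native optimizer per group (assumes torch.optim.Muon exists)."""
    import torch  # imported lazily so the split helper itself stays torch-free

    muon_params, adamw_params = split_params_for_muon(model.named_parameters())
    # Hyperparameters are taken from the issue's snippet for illustration only.
    muon_opt = torch.optim.Muon(muon_params, lr=0.02, weight_decay=0.01)
    adamw_opt = torch.optim.AdamW(
        adamw_params, lr=3e-4, betas=(0.9, 0.95), weight_decay=0.01
    )
    return muon_opt, adamw_opt


# Torch-free demonstration of the split, using stand-ins that only carry `ndim`.
demo_named = [
    ("body.0.weight", SimpleNamespace(ndim=2)),  # hidden weight matrix
    ("body.0.bias", SimpleNamespace(ndim=1)),    # bias vector
    ("embed.weight", SimpleNamespace(ndim=2)),   # embedding table
    ("head.weight", SimpleNamespace(ndim=2)),    # output head
]
demo_muon, demo_adamw = split_params_for_muon(demo_named)
print(len(demo_muon), len(demo_adamw))  # 1 3
```

With a real model, build_optimizers(model) would return two optimizer objects instead of the single MuonWithAuxAdam instance used by the external package.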

Changed files

  • torch/optim/_muon.py (modified, +23/-0)


Hi,

I was trying to switch from https://github.com/KellerJordan/Muon to https://docs.pytorch.org/docs/stable/generated/torch.optim.Muon.html and could not help but notice that the torch doc is lacking a minimal example like:

# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.90, 0.95), weight_decay=0.01)

# To replace the above, do the following:

from muon import MuonWithAuxAdam
hidden_weights = [p for p in model.body.parameters() if p.ndim >= 2]
hidden_gains_biases = [p for p in model.body.parameters() if p.ndim < 2]
nonhidden_params = [*model.head.parameters(), *model.embed.parameters()]
param_groups = [
    dict(params=hidden_weights, use_muon=True,
         lr=0.02, weight_decay=0.01),
    dict(params=hidden_gains_biases+nonhidden_params, use_muon=False,
         lr=3e-4, betas=(0.9, 0.95), weight_decay=0.01),
]
optimizer = MuonWithAuxAdam(param_groups)

Best, Chris

cc @svekars @sekyondaMeta @AlannaBurke

extent analysis

Problem Summary

Switching from the external Muon package (https://github.com/KellerJordan/Muon, which provides MuonWithAuxAdam) to PyTorch's native torch.optim.Muon optimizer.

Root Cause Analysis

The torch.optim.Muon documentation lacks a minimal usage example, so it is unclear how to reproduce the external package's MuonWithAuxAdam setup with native PyTorch optimizers.

Fix Plan

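The documentation fix itself is PR #177262, which adds an Example: section to the torch.optim.Muon docstring. For reference, the steps below walk through the external package's MuonWithAuxAdam pattern quoted in the issue, which that example mirrors with native PyTorch optimizers: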

Step 1: Import necessary modules

from muon import MuonWithAuxAdam
import torch

Step 2: Separate model parameters into different groups

hidden_weights = [p for p in model.body.parameters() if p.ndim >= 2]
hidden_gains_biases = [p for p in model.body.parameters() if p.ndim < 2]
nonhidden_params = [*model.head.parameters(), *model.embed.parameters()]

Step 3: Create parameter groups for the optimizer

param_groups = [
    dict(params=hidden_weights, use_muon=True,
         lr=0.02, weight_decay=0.01),
    dict(params=hidden_gains_biases+nonhidden_params, use_muon=False,
         lr=3e-4, betas=(0.9, 0.95), weight_decay=0.01),
]

Step 4: Create the optimizer

optimizer = MuonWithAuxAdam(param_groups)
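Step 4 yields a single combined optimizer. If the native setup instead ends up with two optimizer objects (one Muon, one AdamW, as the PR notes suggest), a training step has to drive both. A minimal sketch of that loop shape follows; train_step and the recording stand-ins are hypothetical names used only for illustration, and the function relies solely on the zero_grad(), backward(), and step() methods that PyTorch optimizers and loss tensors provide.

```python
def train_step(compute_loss, optimizers):
    """Run one update that drives every optimizer in `optimizers`."""
    for opt in optimizers:
        opt.zero_grad()          # clear stale gradients in every group
    loss = compute_loss()        # forward pass, returns something with .backward()
    loss.backward()              # one backward pass fills gradients for all params
    for opt in optimizers:
        opt.step()               # each optimizer updates only its own params
    return loss


# Torch-free stand-ins that record the call order, for demonstration only.
class RecordingOptimizer:
    def __init__(self, name, log):
        self.name, self.log = name, log

    def zero_grad(self):
        self.log.append((self.name, "zero_grad"))

    def step(self):
        self.log.append((self.name, "step"))


class RecordingLoss:
    def __init__(self, log):
        self.log = log

    def backward(self):
        self.log.append(("loss", "backward"))


call_log = []
demo_optimizers = [
    RecordingOptimizer("muon", call_log),
    RecordingOptimizer("adamw", call_log),
]
train_step(lambda: RecordingLoss(call_log), demo_optimizers)
print(call_log)
```

In real training code, compute_loss would be a closure over the model and the current batch, and optimizers would be the list of native optimizer objects.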

Verification

To verify that the fix worked, check that the optimizer is constructed without errors and that the model's parameters actually change after a training step.
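One way to make the "created correctly" check concrete is to confirm that every model parameter is owned by exactly one optimizer. The sketch below is an assumption about what such a check could look like, not part of the issue or the PR; check_param_coverage is a hypothetical helper that only reads the param_groups attribute exposed by PyTorch optimizers.

```python
from types import SimpleNamespace


def check_param_coverage(model_params, optimizers):
    """Return (missing, duplicated) parameter counts across all optimizers."""
    seen = {}
    for opt in optimizers:
        for group in opt.param_groups:
            for param in group["params"]:
                seen[id(param)] = seen.get(id(param), 0) + 1
    model_ids = {id(param) for param in model_params}
    missing = len(model_ids - set(seen))                      # in no optimizer
    duplicated = sum(1 for count in seen.values() if count > 1)  # in several
    return missing, duplicated


# Torch-free stand-ins: three "parameters" and two "optimizers".
p1, p2, p3 = object(), object(), object()
opt_a = SimpleNamespace(param_groups=[{"params": [p1]}])
opt_b = SimpleNamespace(param_groups=[{"params": [p2]}])
opt_c = SimpleNamespace(param_groups=[{"params": [p2, p3]}])

incomplete = check_param_coverage([p1, p2, p3], [opt_a, opt_b])  # p3 is missing
complete = check_param_coverage([p1, p2, p3], [opt_a, opt_c])    # full coverage
print(incomplete, complete)  # (1, 0) (0, 0)
```

With a real model, the call would look like check_param_coverage(list(model.parameters()), [muon_opt, adamw_opt]), and a result of (0, 0) indicates a clean split.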

Extra Tips

PR #177262 adds the minimal example to the torch.optim.Muon docstring (torch/optim/_muon.py). MuonWithAuxAdam itself belongs to the external Muon package, not to PyTorch.
