Muon on Graph Neural Networks: Spectral Norm Control Where It Matters

Mar 8, 2026

optimization gnn deep-learning

The optimizer you pick shapes more than the loss curve. It dictates the geometry of your weight matrices: their singular value spectrum, their Lipschitz constant, the conditions under which information actually flows through layers. For most of deep learning’s history, this geometric shaping has been incidental. A side effect of Adam’s element-wise adaptivity, or a byproduct of implicit regularization in SGD. We rarely think about global matrix properties of our weights because, for many architectures, we haven’t needed to.

Graph Neural Networks operate under stricter rules. In a GNN, the spectral norm of each layer directly controls a host of notorious pathologies: oversmoothing, over-squashing, rank collapse, and the Lipschitz stability of message-passing itself.

Muon¹, born from the NanoGPT speedrunning community in late 2024, later used by Moonshot AI for large-scale LLM training in Moonlight², now in PyTorch core³, does something different. It treats weight matrices as 2D matrices, geometric objects with spectral structure, and optimizes them accordingly. While it has proven itself in the transformer world, Muon’s core property has a natural home in graph representation learning.

This post investigates that connection empirically. I’ll unpack the mathematical machinery, map its optimization guarantees to the structural bottlenecks of GNNs, and walk through results from an extensive benchmark suite. The findings are nuanced: Muon is not a drop-in replacement that wins everywhere. But where it works (moderate-depth networks, robustness under structural perturbation), it works for principled reasons with solid statistical evidence.

The Bottlenecks of Deep GNNs

A standard GCN layer updates node representations via:

H^{(\ell+1)} = \sigma\!\left(\hat{A} \, H^{(\ell)} \, W^{(\ell)}\right)

$\hat{A}$ is the normalized adjacency, $H^{(\ell)}$ the node features at layer $\ell$ , $W^{(\ell)}$ the learnable weight matrix, and $\sigma$ a nonlinearity.

Stack enough layers and several failure modes kick in:

Oversmoothing. $\hat{A}$ acts as a low-pass filter over the graph. Repeated multiplication mixes neighbor features until all node representations converge toward the dominant eigenvector. The network can no longer tell structurally distinct neighborhoods apart.

Feature collapse. Distinct from oversmoothing. If the weight matrices are contractive ( $\sigma_{\max} < 1$ ), representations don’t just merge; they decay exponentially toward zero across layers.

Rank collapse. Even if feature scale is maintained, channel diversity can degrade. An ill-conditioned weight matrix (large ratio between largest and smallest singular values) projects features into a low-rank subspace. The network loses expressive capacity. To survive the spatial smoothing from $\hat{A}$ , it needs a rich, full-rank feature space through $W^{(\ell)}$ .

The standard response has been architectural: residual connections, DropEdge, PairNorm, orthogonal linear layers. Muon offers a different angle. Control the spectral properties through the optimizer, without touching the forward architecture.

What Muon Actually Does

Muon stands for MomentUm Orthogonalized by Newton-schulz. It runs a three-phase pipeline each training step.

1. Momentum accumulation. Standard Polyak momentum with Nesterov interpolation:

M_t = \beta M_{t-1} + (1 - \beta) \nabla_W \mathcal{L}

G_t = \text{lerp}(\nabla_W \mathcal{L}, M_t, \beta)

2. Orthogonalization via Newton-Schulz. The key step. Instead of stepping in the direction of $G_t$ , Muon replaces it with an approximately orthogonal matrix: the polar factor $U V^\top$ from the SVD $G_t = U \Sigma V^\top$ .

Computing a full SVD every step costs $O(\min(m, n)^3)$ and is miserable on GPUs (too much branching, not enough dense matmul). Instead, Muon uses Newton-Schulz iterations:

X_0 = \frac{G_t}{\|G_t\|_F \cdot 1.02}

X_{k+1} = a_k X_k + X_k (b_k X_k^\top X_k + c_k (X_k^\top X_k)^2)

After ~5 iterations, $X_5$ approximates $U V^\top$ well. Because it’s an orthogonal projection, all singular values converge toward 1. The coefficients $(a_k, b_k, c_k)$ come from the Polar Express method⁴, tuned for bfloat16 stability. The whole thing is just matrix multiplications, $O(n^2k)$ per step. No cubic SVD.

3. Update with spectral constraint. The orthogonalized update plus directional weight decay:

W_{t+1} = W_t - \eta \left( X_5 + \lambda \cdot W_t \odot \mathbb{1}[X_5 \odot W_t \geq 0] \right)

Bernstein⁵ showed this is equivalent to steepest descent under the spectral norm. You’re minimizing the loss subject to a bound on the operator norm of the perturbation, not the Euclidean distance as in standard SGD.

Combined with directional weight decay, Muon introduces an implicit bias⁶:

\min_W \mathcal{L}(W) \quad \text{s.t.} \quad \sigma_{\max}(W) \lesssim \frac{1}{\lambda}

This is asymptotic, not a hard constraint at every step, but it acts as a persistent regularizing force. The largest singular value of each weight matrix stays bounded, while the optimization amplifies underrepresented gradient directions by equalizing the singular values of the update.

Seeing the Convergence

Drag the slider to watch Newton-Schulz iterations drive a random matrix toward orthogonality. Watch the singular values collapse toward 1:

Step 0/5:Normalized input

X matrix

Singular values

Drag the slider to see singular values converge toward 1.

Why This Matters for GNNs

The Lipschitz constant of a GNN layer is bounded by the spectral norms of its parts⁷:

\text{Lip}(f^{(\ell)}) \leq \|\hat{A}\|_2 \cdot \sigma_{\max}(W^{(\ell)})

For the normalized propagation operators used in standard GCN variants, $\|\hat{A}\|_2$ is typically bounded by or close to 1. That means a large part of the layer’s Lipschitz behavior is carried by $\sigma_{\max}(W^{(\ell)})$ .

Stack $L$ layers and the product of spectral norms across layers governs behavior. If $\sigma_{\max}(W^{(\ell)}) > 1$ consistently, features explode. If $\sigma_{\max}(W^{(\ell)}) < 1$ , they collapse to zero.

Muon’s implicit bias $\sigma_{\max}(W) \lesssim 1/\lambda$ gives a tunable dial on the Lipschitz constant of message-passing.

There’s also a backward story. Recent work on backward oversmoothing⁸ shows that gradients flowing through deep GNNs suffer rank collapse too. The backpropagated signal becomes increasingly low-rank the deeper it goes, making early layers hard to train even when forward oversmoothing is managed. Replacing low-rank gradient signals with approximately orthogonal updates injects full-rank update steps at every layer, fighting backward rank collapse by construction.

Experimental Setup

All experiments ran on an NVIDIA L40S GPU.

A note on the stack: I used a custom Muon implementation on PyTorch 2.5.1 and PyTorch Geometric 2.7.0. (An optimized version later entered PyTorch core in 2.9.) The Newton-Schulz iterations do add overhead; Muon training was roughly 1.5-2x slower per epoch compared to AdamW on these small architectures.

Three optimization regimes:

AdamW: Aggressively tuned per-experiment as a strong baseline.
Muon: Applied to all 2D weight matrices. AdamW retained for 1D parameters, biases, and normalization layers (Muon is undefined for 1D tensors).
MuonAdamW: Hybrid. Only core message-passing weights get Muon; everything else (including the classification head) stays on AdamW.

Every result is aggregated over 10 independent seeds. Error bars show $\pm 1$ standard deviation. I also computed Welch’s t-tests and bootstrap confidence intervals for the pairwise comparisons highlighted below.

Hyperparameters were tuned independently per optimizer and per depth. Dropping Muon into an AdamW config will fail. The optimizer needs different scaling.

Shallow Benchmarks: A Modest but Real Gain

I started with standard 2-layer GCN and GAT architectures on Cora, CiteSeer, and PubMed. These datasets are tiny by modern standards, but in GNN literature they’re well-studied microcosms of message-passing dynamics, good for isolating optimizer effects.

Results on shallow networks are dataset-dependent. On Cora, all strategies perform within noise. On CiteSeer and PubMed, the Muon variants show a measurable edge.

The strongest shallow result is CiteSeer GCN: Muon hits 72.95% vs AdamW’s 69.22% (10-seed aggregate, $p < 0.001$ ). A +3.7pp gain from changing only the optimizer, no extra parameters, no architecture changes, is worth paying attention to. On PubMed, MuonAdamW leads at 77.76% versus AdamW’s 76.62%.

Why does CiteSeer respond more than Cora? It comes down to matrix geometry. CiteSeer has 3703 input features projected into 64 hidden dimensions, producing very tall, rectangular first-layer weight matrices. Muon’s internal spectral norm scaling ( $\text{lr} \times \sqrt{\max(1, m/n)}$ ) has a stronger regularizing effect on these aspect ratios than on squarer matrices.

Where Muon Shines: Scaling Depth

The 2-layer regime was never going to be Muon’s strongest case. The real evidence shows up when you push depth, hitting a sweet spot around 8 layers.

With per-depth hyperparameter tuning, Muon clearly outperforms AdamW on an 8-layer Cora GCN:

At 8 layers: Muon reaches 80.23%, AdamW degrades to 70.99%. That’s a +9.2pp gap ( $p < 0.001$ over 10 seeds). The standard deviations tell their own story: AdamW’s 3.6% reflects unstable training across initializations, while Muon holds at 1.0%.

At 16 layers: Pure Muon struggles, but MuonAdamW pulls ahead at 58.61% vs AdamW’s 53.47% ( $p = 0.04$ ). Pure Muon is directionally better but statistically inconclusive due to rising variance.

At 32 layers: Everything collapses. Spatial oversmoothing at this depth exceeds what optimizer-level spectral control can fix. Architectural interventions are necessary.

The depth-8 success aligns with theory. At 8 layers, $\prod_{\ell=1}^{8} \sigma_{\max}(W^{(\ell)})$ hits the threshold where spectral drift becomes destructive. Deep enough to amplify instabilities exponentially, but shallow enough that optimizer-driven regularization can keep things in check.

The Spectral Story: Uniformity Over Absolute Scale

To understand why Muon enables deeper scaling, look at the singular values during training. Here’s the evolution of the second GCN layer (64x64 weight matrix, Cora):

Epoch:

Toggle between the spectrum view (all singular values at a given epoch) and the line chart tracking $\sigma_{\max}$ over time.

There’s a subtlety here that’s easy to miss. On shallow 2-layer networks, Muon’s $\sigma_{\max}$ actually grows larger than AdamW’s — roughly 13.8 vs 7.5. So Muon doesn’t just blindly crush spectral norms.

The real picture emerges at depth 8, and it’s about uniformity, not absolute scale:

AdamW lets the spectral norm drift unevenly. By epoch 240, its mid-layer $\sigma_{\max}$ hits 3.80, with singular values ranging from 2.76 to 3.96. The condition number spikes. The weight matrix acts as a low-rank projection, killing channel diversity.
Muon keeps the mid-layer $\sigma_{\max}$ at 2.38, with singular values tightly clustered between 2.26 and 2.49. Condition number stays close to 1.
MuonAdamW lands in between: spectral control on message-passing weights, flexibility on the classifier head.

This uniformity is the mechanism. Stack 8 layers with AdamW’s $\sigma_{\max}$ of 3.80 and you get $3.80^8 \approx 43{,}000$ in potential directional amplification. Muon’s 2.38 gives $2.38^8 \approx 1{,}000$ , a much more bounded transformation. Equal singular values mean no single feature dimension dominates, so channel diversity survives repeated graph smoothing.

Gradient Dynamics

The differences aren’t limited to the forward pass. They change backward dynamics too.

I recorded gradient norms and input saliency across epochs to see how loss signals propagate backward. Saliency here measures how strongly the final loss depends on specific input features, a proxy for gradient integrity.

AdamWMuon

In AdamW-trained networks, backward gradient norms tend to decay exponentially in deeper models, or fragment into noisy updates. Poorly conditioned weight matrices mean the backpropagated signal loses directional richness. Backward rank collapse in action.

Muon’s approximately orthogonal updates keep weight matrices well-conditioned, so they serve as better conduits for the backward pass. Gradient signals retain structure and magnitude across layers. The saliency maps from Muon-trained networks are more concentrated. The network gets clearer feedback about which features contributed to errors, enabling more coherent learning in early layers. This explains why Muon converges slower per epoch but reaches a more stable optimum.

Shaded bands show ±1 standard deviation across 3 seeds.

AdamW converges faster early on but plateaus abruptly. Muon takes about 190 epochs to peak vs AdamW’s 90, but arrives at a lower-variance destination.

Robustness: The Clearest Advantage

The most practically relevant finding is the robustness gap under perturbation. I subjected trained models to feature noise (Gaussian, varying $\sigma$ ) and edge dropout (randomly deleting edges at test time).

On the 8-layer Cora GCN, the gap under duress is stark. Feature noise at $\sigma = 0.05$ :

AdamW: Collapses from 72.6% to 24.9%.
Muon: Degrades from 80.4% to 66.8%.

Edge dropout at 40%:

AdamW: 72.6% → 66.9%.
Muon: 80.4% → 76.6%.

This extends to shallow models too. On 2-layer CiteSeer, MuonAdamW loses 2.0pp under 40% edge dropout vs AdamW’s 2.2pp, starting from a baseline 3.5pp higher.

The mechanism is suggestive, but it is not as simple as a single-layer toy bound. For a deep nonlinear GNN, sensitivity depends on the product of propagation, weight, and activation Lipschitz constants across layers. Controlling $\sigma_{\max}(W^{(\ell)})$ does not by itself guarantee robustness, but it tightens one of the main factors in that product:

\|f(x + \delta) - f(x)\| \leq \mathrm{Lip}(f)\,\|\delta\|

In a linear network, that connection is direct. In our nonlinear setting, the relationship is looser, but the empirical pattern is consistent: the runs with tighter, more uniform spectra are also the ones that degrade more gracefully under feature noise and edge removal.

Practical Guidance

When to use Muon/MuonAdamW:

Moderate-depth GCNs (4-16 layers) where oversmoothing and rank collapse are active problems.
Noise-heavy environments requiring robustness to feature noise, sensor drift, or incomplete graph topology.
Tall weight matrices (high input dimensionality relative to hidden dim), where the optimizer’s scale adjustments have the most effect.

When to stick with AdamW:

Shallow, clean setups. On 2-layer networks on Cora, the gains are marginal and not worth the compute.
Extreme depth (32+). Oversmoothing overwhelms optimizer-level control. You need architectural fixes regardless.
Tight compute budgets. Newton-Schulz adds wall-clock time per epoch.

Routing matters for attention architectures. On GATs, excluding the final layer from Muon consistently helps: MuonAdamW (exclude-last) hit 72.98% vs all-matrix Muon at 71.50% on CiteSeer GAT ( $p = 0.049$ ). The attention output projection benefits from AdamW’s element-wise adaptivity.

Hyperparameters. Don’t copy AdamW settings. They live on different scales. Weight decay in Muon controls the asymptotic spectral limit directly ( $\sigma_{\max} \lesssim 1/\lambda$ ), so it’s not just generic regularization, it’s a Lipschitz dial. Effective ranges from my experiments:

Muon LR: 0.02-0.05 (much higher than AdamW’s typical 0.005-0.01)
Muon weight decay: 0.005-0.01 (vs AdamW’s 0.0005)
Momentum: 0.95 (default works well)
Newton-Schulz iterations: 5 (no benefit from more on small GNN matrices)

Bottom Line

Muon is not a magic bullet for GNNs. On shallow architectures and clean benchmarks, the gains are negligible to moderate. The standard GNN regime (2 layers, small clean graphs) was never going to be where a spectral optimizer shows its best.

But the deep scaling result is real. Muon lets you train meaningfully deeper GCNs (+9.2pp at depth 8) through spectral control of the weights, not architectural hacks. The robustness result is equally clear: bounded, uniform singular values yield bounded sensitivity, and it shows up under severe perturbation.

The most interesting direction from here is applying Muon where oversmoothing and rank collapse are the binding constraints: deep equivariant GNNs for molecular simulation, long-range message passing on large graphs, deep graph-level encoders. Early external evidence (Nequix, an E(3)-equivariant potential trained with Muon, recently hit top-3 on Matbench-Discovery at 1/4 the training cost) is promising.

The optimizer shapes the geometry of your weights. When you’re working with graphs, that geometry is the message.

All experiments ran on an NVIDIA L40S GPU with PyTorch 2.5.1+cu121 and PyG 2.7.0. All evaluations use 10-seed aggregation with Welch’s t-tests and bootstrap confidence intervals. The interactive plots and aggregate results used in this post are in this repo.

Keller Jordan, “Muon: An optimizer for hidden layers in neural networks”, October 2024. Set a NanoGPT training speed record on 10/15/24, 35% faster. ↩
Moonshot AI, “Muon is Scalable for LLM Training”, February 2025. Scaled Muon to train Moonlight, a 16B MoE model with ~2x efficiency over AdamW. ↩
torch.optim.Muon, available since PyTorch 2.9. ↩
Amsel et al., “Polar Express Sign Method”, May 2025. Optimized Newton-Schulz coefficients for bfloat16 stability. ↩
Jeremy Bernstein, “Deriving Muon”. Shows Muon is steepest descent under the spectral norm: given gradient $\nabla_W \mathcal{L} = U \Sigma V^\top$ , the constrained update is $\Delta W = -\eta \sqrt{n_\text{out} / n_\text{in}} \cdot U V^\top$ . ↩
“Muon is a Nuclear Lion King”, UT Austin. Proves Muon with nuclear norm K and weight decay constrains $\sigma_{\max}(W) \leq 1/\lambda$ . Also “Muon Optimizes Under Spectral Norm Constraints” for convergence proofs. ↩
Follows from submultiplicativity of the spectral norm. The Lipschitz constant of a linear map is its operator norm. ↩
“Understanding Oversmoothing in GNNs from the Backward Perspective”, May 2025. Also “Training Graph Neural Networks Subject to a Tight Lipschitz Constraint”. ↩