A Cora Depth Study of Geometry and Spectral Optimization in GNNs
This post extends my earlier note on Muon and graph neural networks, but it asks a different question.
If Muon only helps plain deep GCNs, then it may simply be compensating for an architecture that collapses with depth. The more interesting test is whether optimizer-level spectral control still matters once the backbone itself already addresses oversmoothing and oversquashing.
To test that, I ran a focused depth study on Cora across three backbones:
- GCN, as the plain message-passing baseline (both plain backbones are sketched just after this list).
- GraphSAGE, as a second plain backbone with a different aggregation rule.
- GBN, the GraphBoundary-conditioned message passing Neural network from Deeper with Riemannian Geometry, a geometry-aware architecture designed to stay stable at extreme depth.
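To be concrete about what "plain" means for the first two backbones, here is a minimal sketch of the kind of bare message-passing stack I have in mind: no residuals, no normalization, just stacked convolutions. This is illustrative PyG code, not my exact experiment script, and the hidden width of 64 is a placeholder. GBN itself follows the reference implementation from the paper, so I am not sketching it here.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, SAGEConv

class PlainDeepGNN(torch.nn.Module):
    """A bare stack of message-passing layers: no residuals, no normalization.
    This is what "plain" means for the GCN and GraphSAGE baselines."""
    def __init__(self, conv_cls, in_dim, hidden_dim, out_dim, depth):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (depth - 1) + [out_dim]
        self.convs = torch.nn.ModuleList(
            [conv_cls(dims[i], dims[i + 1]) for i in range(depth)]
        )

    def forward(self, x, edge_index):
        for conv in self.convs[:-1]:
            x = F.relu(conv(x, edge_index))
        return self.convs[-1](x, edge_index)

# e.g. 8-layer plain baselines on Cora (1433 input features, 7 classes)
gcn8 = PlainDeepGNN(GCNConv, 1433, 64, 7, depth=8)
sage8 = PlainDeepGNN(SAGEConv, 1433, 64, 7, depth=8)
```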
And I crossed them with three optimizers:
- AdamW
- Muon (its core orthogonalization step is sketched after this list)
- AdaMuon
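When I say "optimizer-level spectral control," I mean Muon's core move: instead of applying raw momentum, it approximately orthogonalizes the update for each 2D weight matrix, pushing the update's singular values toward 1. The sketch below paraphrases that step using the Newton-Schulz iteration from the public Muon reference code; it is not the exact optimizer implementation used in these runs, and it omits details like shape-dependent learning-rate scaling and the handling of non-matrix parameters.

```python
import torch

def newton_schulz_orthogonalize(m: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2D update matrix onto a (semi-)orthogonal matrix.
    This is the spectral-control step at the heart of Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from the Muon reference code
    x = m / (m.norm() + 1e-7)           # normalize so the iteration converges
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_like_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One hedged sketch of a Muon-style update for a single 2D weight matrix
    (scaling details omitted)."""
    momentum_buf.mul_(beta).add_(grad)                   # standard momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)   # orthogonalize the update
    weight.data.add_(update, alpha=-lr)
    return momentum_buf
```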
The result is straightforward: Muon helps plain deep GCNs, does not rescue deep GraphSAGE, and still gives a large lift on top of GBN. The architecture and the optimizer appear to be addressing different parts of the depth problem.
Experimental Setup
This was intentionally a narrow follow-up study rather than a broad benchmark.
- Dataset: Cora
- Stack: PyTorch 2.5.1, PyG 2.7.0
- Hardware: NVIDIA L40S
- Depth sweep: 2, 4, 8, 16, 32, 64, 128, 256 (a per-cell training-loop sketch follows this list)
- Status: interrupted after the primary comparisons completed
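For concreteness, each cell of the sweep is just full-batch Cora training with a given backbone, optimizer, and depth. The loop below is a generic sketch of that per-cell procedure, not my exact script: the epoch count and hyperparameters are placeholders, device handling is omitted, and PlainDeepGNN is the illustrative class from the backbone section above.

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root="data/Cora", name="Cora")
data = dataset[0]

def train_eval(model, optimizer, epochs=200):
    """Full-batch Cora training; returns final test accuracy for one grid cell."""
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        out = model(data.x, data.edge_index)
        F.cross_entropy(out[data.train_mask], data.y[data.train_mask]).backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        pred = model(data.x, data.edge_index).argmax(dim=-1)
    return (pred[data.test_mask] == data.y[data.test_mask]).float().mean().item()

# One slice of the grid: plain GCN × AdamW across the depth sweep.
for depth in [2, 4, 8, 16, 32, 64, 128, 256]:
    model = PlainDeepGNN(GCNConv, dataset.num_features, 64, dataset.num_classes, depth)
    opt = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=5e-4)
    print(depth, train_eval(model, opt))
```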
Completed runs:
- full GCN × {AdamW, Muon, AdaMuon} sweep
- full GraphSAGE × {AdamW, Muon, AdaMuon} sweep
- full GBN × AdamW sweep
- GBN × Muon through 128 layers

Incomplete runs:

- GBN + Muon at 256
- all GBN + AdaMuon
- fresh robustness / gradient diagnostics for this exact geometry study
- external validation on a second dataset
This should therefore be read as an interrupted but informative Cora depth study, not as a finished benchmark paper.
Results
In the accuracy-versus-depth plots for this sweep, the dashed line at 0.2 is a visual anchor. On Cora, once a model spends enough time near that band, it is usually no longer doing meaningful node classification.
The sweep reveals three distinct patterns, one for each backbone.
GCN: Muon Extends the Viable Depth Regime
The first result is the most familiar one. On plain GCN, Muon is strongest in the moderate-depth regime (all numbers here and below are Cora test accuracy):
| Backbone | Depth | AdamW | Muon | AdaMuon |
|---|---|---|---|---|
| GCN | 4 | 0.7910 | 0.8123 | 0.8030 |
| GCN | 8 | 0.7280 | 0.7853 | 0.7450 |
| GCN | 16 | 0.5540 | 0.6077 | 0.5240 |
This lines up with the earlier 10-seed plain-GCN confirmation, where Muon reached 0.8023 ± 0.0099 at depth 8 versus AdamW at 0.7099 ± 0.0360, and the hybrid was the cleaner win at depth 16.
On a standard GCN, then, spectral optimization buys real depth headroom. It does not eliminate depth-induced failure, but it clearly delays it.
GraphSAGE: Optimizer Choice Does Not Rescue a Failing Backbone
GraphSAGE is the negative control that makes the interpretation cleaner.
At shallow-to-moderate depth, Muon is still competitive:
- depth 4: AdamW 0.7607, Muon 0.8000
- depth 8: AdamW 0.6647, Muon 0.7587
Beyond that, however, the backbone fails regardless of optimizer:
| Backbone | Depth | AdamW | Muon | AdaMuon |
|---|---|---|---|---|
| GraphSAGE | 16 | 0.3467 | 0.3150 | 0.3190 |
| GraphSAGE | 64 | 0.2040 | 0.1473 | 0.1473 |
| GraphSAGE | 128 | 0.2023 | 0.2023 | 0.2023 |
This is an important negative result. Optimizer-level spectral control is not a universal cure for deep GNN failure. When the backbone’s geometry is still poor enough, optimizer choice can delay collapse at best; it does not correct the underlying problem.
GBN Changes the Baseline Before the Optimizer Does
The most interesting part of the study begins with the GBN backbone.
GBN + AdamW already behaves very differently from plain GCN and GraphSAGE. It does not catastrophically collapse as depth increases:
| Backbone | Depth | AdamW |
|---|---|---|
| GBN | 32 | 0.5947 |
| GBN | 64 | 0.5753 |
| GBN | 128 | 0.5767 |
| GBN | 256 | 0.5827 |
That is exactly the baseline a geometry-aware deep message-passing model is supposed to create. The architecture removes catastrophic depth collapse under a standard optimizer.
GBN + Muon Still Wins
Once I crossed Muon with GBN, the result stayed strong through every completed depth:
| Backbone | Depth | AdamW | Muon |
|---|---|---|---|
| GBN | 2 | 0.5813 | 0.7003 |
| GBN | 4 | 0.6047 | 0.7720 |
| GBN | 8 | 0.6087 | 0.7837 |
| GBN | 16 | 0.5650 | 0.7867 |
| GBN | 32 | 0.5947 | 0.7793 |
| GBN | 64 | 0.5753 | 0.7863 |
| GBN | 128 | 0.5767 | 0.7780 |
This is the central claim of the post:
Muon is not just a band-aid for bad deep GNN architectures. In this Cora study, it still adds a large lift after a geometry-aware backbone has already stabilized depth.
That is a stronger result than “Muon beats AdamW on a deep GCN.” It suggests that the optimizer is not merely preventing collapse. It is improving the operating point of a backbone that was already designed to survive depth.
Interpretation
I do not want to overstate the mechanism. I did not rerun the full spectral and gradient diagnostic suite for the GBN sweep, so the interpretation here has to stay at the level of informed reading rather than direct mechanistic proof.
Still, the pattern is coherent:
- Plain GCN: Muon helps most when depth is high enough for spectral drift to become destructive, but not so high that the architecture is irrecoverable.
- Plain GraphSAGE: optimizer changes cannot overcome deep geometric failure on their own.
- GBN: geometry-aware message passing removes catastrophic depth collapse under AdamW and establishes a stable deep baseline.
- GBN + Muon: once that baseline exists, spectral optimization still seems to improve how the model uses depth.
The simplest reading is that the architecture and the optimizer are complementary, not redundant. Geometry determines whether deep training stays viable at all; the optimizer helps determine how good the resulting solution is.
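If you want to sanity-check the "spectral drift" reading on your own runs, one lightweight probe is to track the singular-value spread of each layer's weight matrix during training; drifting top singular values and blowing-up condition numbers at depth are the kind of signal I have in mind. This is a generic sketch of such a probe, not the diagnostic suite from the earlier post:

```python
import torch

@torch.no_grad()
def layer_spectral_stats(model):
    """Return (top singular value, condition number) for every 2D weight in the model."""
    stats = {}
    for name, param in model.named_parameters():
        if param.dim() == 2:
            s = torch.linalg.svdvals(param)  # singular values in descending order
            stats[name] = (s[0].item(), (s[0] / s[-1].clamp_min(1e-12)).item())
    return stats
```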
Scope and Next Steps
This is still a narrow result. The next steps are clear:
- one external validation dataset
- completed GBN + AdaMuon
- the same robustness and gradient diagnostics I used in the earlier Muon post, now rerun on the geometry-aware backbone
That is the point where this turns from a strong research note into a more complete empirical study.
Bottom Line
In this Cora depth study, geometry-aware message passing fixed the catastrophic depth problem, but Muon still mattered. The architecture established stability; the optimizer improved the resulting operating point. Those are different gains, and on this evidence they stack.