A Cora Depth Study of Geometry and Spectral Optimization in GNNs
This post extends my earlier note on Muon and graph neural networks, but it asks a different question.
If Muon only helps plain deep GCNs, then it may simply be compensating for an architecture that collapses with depth. The more interesting test is whether optimizer-level spectral control still matters once the backbone itself already addresses oversmoothing and oversquashing.
To test that, I ran a focused depth study on Cora across three backbones:
- GCN, as the plain message-passing baseline (both plain backbones are sketched just after this list).
- GraphSAGE, as a second plain backbone with a different aggregation rule.
- GBN, the GraphBoundary-conditioned message passing Neural network from Deeper with Riemannian Geometry, a geometry-aware architecture designed to stay stable at extreme depth.
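To be concrete about what "plain" means for the first two backbones, here is a minimal sketch of the kind of bare message-passing stack I have in mind: no residuals, no normalization, just stacked convolutions. This is illustrative PyG code, not my exact experiment script, and the hidden width of 64 is a placeholder. GBN itself follows the reference implementation from the paper, so I am not sketching it here.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, SAGEConv

class PlainDeepGNN(torch.nn.Module):
    """A bare stack of message-passing layers: no residuals, no normalization.
    This is what "plain" means for the GCN and GraphSAGE baselines."""
    def __init__(self, conv_cls, in_dim, hidden_dim, out_dim, depth):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (depth - 1) + [out_dim]
        self.convs = torch.nn.ModuleList(
            [conv_cls(dims[i], dims[i + 1]) for i in range(depth)]
        )

    def forward(self, x, edge_index):
        for conv in self.convs[:-1]:
            x = F.relu(conv(x, edge_index))
        return self.convs[-1](x, edge_index)

# e.g. 8-layer plain baselines on Cora (1433 input features, 7 classes)
gcn8 = PlainDeepGNN(GCNConv, 1433, 64, 7, depth=8)
sage8 = PlainDeepGNN(SAGEConv, 1433, 64, 7, depth=8)
```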
And I crossed them with three optimizers:
- AdamW
- Muon (its core orthogonalization step is sketched after this list)
- AdaMuon
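When I say "optimizer-level spectral control," I mean Muon's core move: instead of applying raw momentum, it approximately orthogonalizes the update for each 2D weight matrix, pushing the update's singular values toward 1. The sketch below paraphrases that step using the Newton-Schulz iteration from the public Muon reference code; it is not the exact optimizer implementation used in these runs, and it omits details like shape-dependent learning-rate scaling and the handling of non-matrix parameters.

```python
import torch

def newton_schulz_orthogonalize(m: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2D update matrix onto a (semi-)orthogonal matrix.
    This is the spectral-control step at the heart of Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from the Muon reference code
    x = m / (m.norm() + 1e-7)           # normalize so the iteration converges
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_like_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One hedged sketch of a Muon-style update for a single 2D weight matrix
    (scaling details omitted)."""
    momentum_buf.mul_(beta).add_(grad)                   # standard momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)   # orthogonalize the update
    weight.data.add_(update, alpha=-lr)
    return momentum_buf
```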
The result is straightforward: Muon helps plain deep GCNs, does not rescue deep GraphSAGE, and still gives a large lift on top of GBN. The architecture and the optimizer appear to be addressing different parts of the depth problem.
Experimental Setup
This was intentionally a narrow follow-up study rather than a broad benchmark.
- Dataset: Cora
- Stack: PyTorch 2.5.1, PyG 2.7.0
- Hardware: NVIDIA L40S
- Depth sweep: 2, 4, 8, 16, 32, 64, 128, 256 (a per-cell training-loop sketch follows this list)
- Status: interrupted after the primary comparisons completed
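For concreteness, each cell of the sweep is just full-batch Cora training with a given backbone, optimizer, and depth. The loop below is a generic sketch of that per-cell procedure, not my exact script: the epoch count and hyperparameters are placeholders, device handling is omitted, and PlainDeepGNN is the illustrative class from the backbone section above.

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root="data/Cora", name="Cora")
data = dataset[0]

def train_eval(model, optimizer, epochs=200):
    """Full-batch Cora training; returns final test accuracy for one grid cell."""
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        out = model(data.x, data.edge_index)
        F.cross_entropy(out[data.train_mask], data.y[data.train_mask]).backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        pred = model(data.x, data.edge_index).argmax(dim=-1)
    return (pred[data.test_mask] == data.y[data.test_mask]).float().mean().item()

# One slice of the grid: plain GCN × AdamW across the depth sweep.
for depth in [2, 4, 8, 16, 32, 64, 128, 256]:
    model = PlainDeepGNN(GCNConv, dataset.num_features, 64, dataset.num_classes, depth)
    opt = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=5e-4)
    print(depth, train_eval(model, opt))
```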
Completed runs:
- full GCN × {AdamW, Muon, AdaMuon} sweep
- full GraphSAGE × {AdamW, Muon, AdaMuon} sweep
- full GBN × AdamW sweep
- GBN × Muon through 128 layers

Incomplete runs:

- GBN + Muon at 256
- all GBN + AdaMuon
- fresh robustness / gradient diagnostics for this exact geometry study
- external validation on a second dataset
This should therefore be read as an interrupted but informative Cora depth study, not as a finished benchmark paper.
Results
In the accuracy-versus-depth plots for this sweep, the dashed line at 0.2 is a visual anchor. On Cora, once a model spends enough time near that band, it is usually no longer doing meaningful node classification.
The sweep reveals three distinct patterns, one for each backbone.
GCN: Muon Extends the Viable Depth Regime
The first result is the most familiar one. On plain GCN, Muon is strongest in the moderate-depth regime (all numbers here and below are Cora test accuracy):
| Backbone | Depth | AdamW | Muon | AdaMuon |
|---|---|---|---|---|
| GCN | 4 | 0.7910 | 0.8123 | 0.8030 |
| GCN | 8 | 0.7280 | 0.7853 | 0.7450 |
| GCN | 16 | 0.5540 | 0.6077 | 0.5240 |
This lines up with the earlier 10-seed plain-GCN confirmation, where Muon reached 0.8023 ± 0.0099 at depth 8 versus AdamW at 0.7099 ± 0.0360, and the hybrid was the cleaner win at depth 16.
On a standard GCN, then, spectral optimization buys real depth headroom. It does not eliminate depth-induced failure, but it clearly delays it.
GraphSAGE: Optimizer Choice Does Not Rescue a Failing Backbone
GraphSAGE is the negative control that makes the interpretation cleaner.
At shallow-to-moderate depth, Muon is still competitive:
- depth 4: AdamW 0.7607, Muon 0.8000
- depth 8: AdamW 0.6647, Muon 0.7587
Beyond that, however, the backbone fails regardless of optimizer:
| Backbone | Depth | AdamW | Muon | AdaMuon |
|---|---|---|---|---|
| GraphSAGE | 16 | 0.3467 | 0.3150 | 0.3190 |
| GraphSAGE | 64 | 0.2040 | 0.1473 | 0.1473 |
| GraphSAGE | 128 | 0.2023 | 0.2023 | 0.2023 |
This is an important negative result. Optimizer-level spectral control is not a universal cure for deep GNN failure. When the backbone’s geometry is still poor enough, optimizer choice can delay collapse at best; it does not correct the underlying problem.
GBN Changes the Baseline Before the Optimizer Does
The most interesting part of the study begins with the GBN backbone.
GBN + AdamW already behaves very differently from plain GCN and GraphSAGE. It does not catastrophically collapse as depth increases:
| Backbone | Depth | AdamW |
|---|---|---|
| GBN | 32 | 0.5947 |
| GBN | 64 | 0.5753 |
| GBN | 128 | 0.5767 |
| GBN | 256 | 0.5827 |
That is exactly the baseline a geometry-aware deep message-passing model is supposed to create. The architecture removes catastrophic depth collapse under a standard optimizer.
GBN + Muon Still Wins
Once I crossed Muon with GBN, the result stayed strong through every completed depth:
| Backbone | Depth | AdamW | Muon |
|---|---|---|---|
| GBN | 2 | 0.5813 | 0.7003 |
| GBN | 4 | 0.6047 | 0.7720 |
| GBN | 8 | 0.6087 | 0.7837 |
| GBN | 16 | 0.5650 | 0.7867 |
| GBN | 32 | 0.5947 | 0.7793 |
| GBN | 64 | 0.5753 | 0.7863 |
| GBN | 128 | 0.5767 | 0.7780 |
This is the central claim of the post:
Muon is not just a band-aid for bad deep GNN architectures. In this Cora study, it still adds a large lift after a geometry-aware backbone has already stabilized depth.
That is a stronger result than “Muon beats AdamW on a deep GCN.” It suggests that the optimizer is not merely preventing collapse. It is improving the operating point of a backbone that was already designed to survive depth.
Interpretation
I do not want to overstate the mechanism. I did not rerun the full spectral and gradient diagnostic suite for the GBN sweep, so the interpretation here has to stay at the level of informed reading rather than direct mechanistic proof.
Still, the pattern is coherent:
- Plain GCN: Muon helps most when depth is high enough for spectral drift to become destructive, but not so high that the architecture is irrecoverable.
- Plain GraphSAGE: optimizer changes cannot overcome deep geometric failure on their own.
- GBN: geometry-aware message passing removes catastrophic depth collapse under AdamW and establishes a stable deep baseline.
- GBN + Muon: once that baseline exists, spectral optimization still seems to improve how the model uses depth.
The simplest reading is that the architecture and the optimizer are complementary, not redundant. Geometry determines whether deep training stays viable at all; the optimizer helps determine how good the resulting solution is.
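If you want to sanity-check the "spectral drift" reading on your own runs, one lightweight probe is to track the singular-value spread of each layer's weight matrix during training; drifting top singular values and blowing-up condition numbers at depth are the kind of signal I have in mind. This is a generic sketch of such a probe, not the diagnostic suite from the earlier post:

```python
import torch

@torch.no_grad()
def layer_spectral_stats(model):
    """Return (top singular value, condition number) for every 2D weight in the model."""
    stats = {}
    for name, param in model.named_parameters():
        if param.dim() == 2:
            s = torch.linalg.svdvals(param)  # singular values in descending order
            stats[name] = (s[0].item(), (s[0] / s[-1].clamp_min(1e-12)).item())
    return stats
```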
Scope and Next Steps
This is still a narrow result. The next steps are clear:
- one external validation dataset
- completed GBN + AdaMuon
- the same robustness and gradient diagnostics I used in the earlier Muon post, now rerun on the geometry-aware backbone
That is the point where this turns from a strong research note into a more complete empirical study.
Bottom Line
In this Cora depth study, geometry-aware message passing fixed the catastrophic depth problem, but Muon still mattered. The architecture established stability; the optimizer improved the resulting operating point. Those are different gains, and on this evidence they stack.