Bruno Alano

Sparse Graph Storage for Message-Passing Networks: From COO to LSMGraph

gnn graph-storage sparse-matrices systems

Every graph neural network, somewhere in its forward pass, hits a sparse matrix operation. A GCN layer computes a sparse-dense product $\hat{A} H W$. A GAT layer computes edge-wise attention scores (SDDMM) followed by a weighted aggregation (SpMM). GraphSAGE samples neighbors and gathers features before a learned transform. The specific primitive varies, but they all depend on the same thing: an efficient representation of the graph topology as a sparse adjacency structure.

How you store and access that structure determines the cost of neighbor aggregation, the memory footprint of your graph, and how well your training pipeline utilizes GPU parallelism. Yet most researchers never think about it. They call Data(edge_index=...) in PyTorch Geometric and trust the framework. That trust is well-placed (PyG makes good choices internally, as we will see), but understanding why a particular format is chosen lets you reason about performance bottlenecks, write better custom kernels, and make informed decisions when scaling to billion-edge graphs.

This post covers the sparse representations that matter for message-passing networks: COO, CSR, and CSC; the doubly-compressed variants DCSR and DCSC; GPU-native blocked layouts like Sliced ELLPACK; and a short tour of systems built for dynamic graphs (VCSR, DGAP, LSMGraph, RapidStore). We will pay special attention to why destination-grouped storage has become the dominant internal strategy across GNN frameworks.

The Sparse Matrix as Graph Topology

A directed graph $G = (V, E)$ with $n = |V|$ nodes and $m = |E|$ edges is represented by its adjacency matrix $A \in \{0, 1\}^{n \times n}$, where $A_{ij} = 1$ if edge $(i, j) \in E$, meaning node $i$ sends a message to node $j$. For real-world graphs (social networks, molecules, citation graphs, financial transaction networks) this matrix is sparse. A citation graph with 100,000 nodes might have 500,000 edges, a density of $5 \times 10^{-5}$. The full dense matrix would need 10 billion entries, almost all zeros.

One notational point worth being explicit about: under this convention, row $i$ of $A$ contains the outgoing edges from node $i$, and column $j$ contains the incoming edges to node $j$. The standard GNN aggregation gathers messages from sources into a destination, which means iterating over column $j$. That is equivalent to computing $A^T X$. For undirected graphs $A$ is symmetric, so this distinction does not matter. For directed graphs, it matters a lot, and it directly determines which sparse format is best.

Sparse formats store only the non-zero entries and their positions. The core tradeoff: how fast can you access a specific row (outgoing edges), a specific column (incoming edges), or iterate over all edges?

COO: The Universal Baseline

The Coordinate format (COO) stores two parallel arrays of length $m$: row_indices and col_indices. Position $k$ encodes edge (row_indices[k], col_indices[k]).

If you have used PyG, you know this format already. It is the edge_index tensor: a $2 \times m$ matrix where the first row holds source nodes and the second holds destinations.

[Interactive figure: a 6x6 adjacency matrix where hovering a non-zero cell shows how COO, CSR, and CSC each locate that edge. The running example in COO form: row_indices = [0, 0, 1, 1, 2, 2, 3, 4, 4, 5], col_indices = [1, 2, 0, 3, 3, 4, 5, 3, 5, 1]. Memory: 2 arrays of 10 = 20 values.]

COO is simple and flexible. Appending an edge is $O(1)$. Each edge is independent, so scatter/gather over the edge list is embarrassingly parallel on GPUs. This is why PyG uses COO as its external API: edge_index is a COO tensor that users construct, filter, and transform freely.

The cost of unsorted COO is access locality. Finding all neighbors of a node requires a full $O(m)$ scan. But COO can be sorted by destination index, which gives contiguous access to incoming neighbors via binary search, approaching CSC-level performance for aggregation. Destination-sorted COO with segmented reduction is a common strategy in GNN CUDA kernels, trading insertion flexibility for aggregation throughput.

$$\text{COO storage} = 2m \text{ index entries (plus } m \text{ values for weighted graphs)}$$
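To make the destination-sorted strategy concrete, here is a minimal NumPy sketch (not a framework implementation) of sum-aggregation over a destination-sorted COO edge list, using the running 6-node example; `np.add.at` stands in for the segmented reduction a CUDA kernel would perform:

```python
import numpy as np

# Running example: 6 nodes, 10 directed edges (src -> dst).
src = np.array([0, 0, 1, 1, 2, 2, 3, 4, 4, 5])
dst = np.array([1, 2, 0, 3, 3, 4, 5, 3, 5, 1])

# Node features: one 4-dim vector per node.
X = np.arange(24, dtype=np.float64).reshape(6, 4)

# Sort edges by destination so each node's incoming edges are contiguous.
order = np.argsort(dst, kind="stable")
src_s, dst_s = src[order], dst[order]

# Sum aggregation: out[j] = sum of X[i] over all incoming edges (i, j).
# np.add.at is an unbuffered scatter-add, the CPU analogue of the segmented
# reduction a GNN kernel would run over the destination-sorted edge list.
out = np.zeros_like(X)
np.add.at(out, dst_s, X[src_s])
```

Sorting is not needed for correctness of the scatter-add itself; its value is that contiguous destination segments let a GPU kernel assign one warp per destination and reduce within the segment.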

CSR: Row-Oriented Compression

Compressed Sparse Row (CSR) groups edges by source node. It replaces the row_indices array with a row_ptr array of length $n+1$. Edges from node $i$ sit contiguously in col_indices[row_ptr[i] .. row_ptr[i+1]].

This gives $O(1)$ access to any node's outgoing neighbors. For row-oriented SpMV computing $y = Ax$, CSR is optimal: iterate rows, gather column indices, accumulate.

$$\text{CSR storage} = (n + 1) + m \text{ index entries (plus } m \text{ values for weighted graphs)}$$

CSR is the default in SciPy, MATLAB, and most sparse linear algebra libraries. It is also the format that PyG’s documentation discusses most. The older SparseTensor API stores data in CSR form, and the docs cover CSR-based operations extensively.
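A minimal sketch of the CSR construction and row access, assuming NumPy and the same 6-node example (the helper names `coo_to_csr` and `csr_spmv` are my own, not a library API):

```python
import numpy as np

def coo_to_csr(row, col, n):
    """Group COO edges by source: returns (row_ptr, col_indices)."""
    counts = np.bincount(row, minlength=n)          # out-degree per node
    row_ptr = np.concatenate(([0], np.cumsum(counts)))
    order = np.argsort(row, kind="stable")          # sort edges by source
    return row_ptr, col[order]

def csr_spmv(row_ptr, col_idx, x):
    """y = A x for an unweighted adjacency: y[i] sums x over row i's columns."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        y[i] = x[col_idx[row_ptr[i]:row_ptr[i + 1]]].sum()
    return y

row = np.array([0, 0, 1, 1, 2, 2, 3, 4, 4, 5])
col = np.array([1, 2, 0, 3, 3, 4, 5, 3, 5, 1])
row_ptr, col_idx = coo_to_csr(row, col, 6)

# O(1) slice: outgoing neighbors of node 4.
neighbors_of_4 = col_idx[row_ptr[4]:row_ptr[5]]
```

With `x` set to all ones, `csr_spmv` returns each node's out-degree, which is a quick sanity check on the pointer array.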

CSC: Destination-Grouped Storage for Message-Passing

Compressed Sparse Column (CSC) is the transpose-dual of CSR. It uses a col_ptr array of length $n+1$ and a row_indices array. Edges arriving at node $j$ sit contiguously in row_indices[col_ptr[j] .. col_ptr[j+1]].

Column-oriented access gives $O(1)$ lookup of incoming edges. That is exactly what message-passing needs.

Consider the aggregation step. For each target node $j$, we gather messages from all sources $i$ where $(i, j) \in E$:

$$h_j^{(\ell+1)} = \phi\!\left(\bigoplus_{i \in \mathcal{N}(j)} \psi\left(h_i^{(\ell)}, h_j^{(\ell)}, e_{ij}\right)\right)$$

We need all $i \in \mathcal{N}(j)$, the incoming neighborhood of $j$. Under our convention ($A_{ij} = 1$ means $i \to j$), the incoming neighbors form column $j$ of $A$. The aggregation is $A^T X$: the transpose applied to the feature matrix.

In CSC, the source indices for column $j$ sit contiguously in row_indices[col_ptr[j] .. col_ptr[j+1]]. One pointer dereference gives the start and end; the neighbor indices follow sequentially. In CSR, finding incoming neighbors of $j$ means scanning every row's column list, an $O(n + m)$ operation.
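Here is a small NumPy sketch of CSC-style aggregation on the 6-node running example; `aggregate_mean` is an illustrative helper, not a framework function:

```python
import numpy as np

src = np.array([0, 0, 1, 1, 2, 2, 3, 4, 4, 5])
dst = np.array([1, 2, 0, 3, 3, 4, 5, 3, 5, 1])
n = 6

# CSC: group edges by destination column.
order = np.argsort(dst, kind="stable")
col_ptr = np.concatenate(([0], np.cumsum(np.bincount(dst, minlength=n))))
row_idx = src[order]              # edge sources, grouped by destination

X = np.arange(n * 3, dtype=np.float64).reshape(n, 3)

def aggregate_mean(j):
    """Mean over incoming messages: one pointer dereference, one slice."""
    sources = row_idx[col_ptr[j]:col_ptr[j + 1]]   # contiguous in memory
    return X[sources].mean(axis=0)

h3 = aggregate_mean(3)            # gathers from sources [1, 2, 4]
```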

[Interactive figure: toggling between CSC and CSR access when aggregating into target node 3. CSC: col_ptr[3] = 4 and col_ptr[4] = 6, so row_idx[4..6] = [1, 2]: an O(1) pointer lookup plus an O(degree) scan, with all sources for node 3 contiguous. CSR: every row's column list must be scanned.]

How PyG Implements This

In PyTorch Geometric, the MessagePassing base class expects the transposed sparse adjacency when you pass a SparseTensor. This is documented in the sparse tensor notes. The transposed adjacency adj_t is stored in CSR, but since CSR of $A^T$ is the same thing as CSC of $A$, the framework is doing destination-grouped access on the original graph.

From a PyG discussion: “the CSR representation of a transposed matrix is the CSC representation of the original matrix.” When PyG calls .csr() on adj_t, the semantic operation is CSC-style column access on the original adjacency.
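The identity is easy to check mechanically. In this NumPy sketch (the helper names are mine, not PyG's), building CSR of the edge-reversed graph and CSC of the original graph produces bit-identical arrays:

```python
import numpy as np

def to_csr(row, col, n):
    """CSR of the graph whose edges are (row[k], col[k])."""
    order = np.argsort(row, kind="stable")
    ptr = np.concatenate(([0], np.cumsum(np.bincount(row, minlength=n))))
    return ptr, col[order]

def to_csc(row, col, n):
    """CSC of the same graph: group by column instead of row."""
    order = np.argsort(col, kind="stable")
    ptr = np.concatenate(([0], np.cumsum(np.bincount(col, minlength=n))))
    return ptr, row[order]

src = np.array([0, 0, 1, 1, 2, 2, 3, 4, 4, 5])
dst = np.array([1, 2, 0, 3, 3, 4, 5, 3, 5, 1])

# CSR of the reversed graph (the adj_t idea) vs. CSC of the original graph:
ptr_t, idx_t = to_csr(dst, src, 6)
ptr_c, idx_c = to_csc(src, dst, 6)
```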

Newer PyG versions clean this up with the EdgeIndex class, which caches both CSR and CSC conversions internally and selects the right sorted representation depending on whether the operation needs source-grouped or destination-grouped access. No more confusing adj_t naming.

DGL does the same thing: its internal CSR structures support efficient destination-grouped aggregation in update_all and apply_edges. Both frameworks converged on this independently because the message-passing computation requires it.

[Figure: efficiency of A @ X for message-passing (scatter to destination) across formats.]

DCSR and DCSC: When Sparsity is Extreme

Standard CSR allocates a pointer entry for every row, even empty ones. For many real-world graphs this wastes space because they are hyper-sparse: the number of rows with at least one non-zero ($nnzr$, "number of non-zero rows") is much smaller than $n$.

Take a bipartite user-item graph with 10 million users where only 500,000 have recorded interactions. CSR allocates 10 million pointer entries; 9.5 million are redundant duplicates pointing to the same offset.

Doubly Compressed Sparse Row (DCSR)1 adds a row_idx array that stores only the indices of non-empty rows. The pointer array shrinks from $n + 1$ to $nnzr + 1$.

$$\text{DCSR storage} = nnzr + (nnzr + 1) + m \approx 2 \cdot nnzr + m$$

Toggle between CSR and DCSR below to see how empty-row overhead disappears:

[Interactive figure: an 8x8 matrix with 6 non-zeros and 4 empty rows. CSR stores row_ptr = [0, 2, 2, 3, 3, 3, 5, 5, 6] (9 entries) plus col_idx = [3, 7, 5, 1, 3, 6] (6 entries), 15 values total, including 4 redundant pointer pairs for the empty rows. DCSR keeps pointers only for the non-empty rows 0, 2, 5, and 7.]

DCSC is the column-oriented dual: it stores only non-empty column indices and compresses the column pointer. For message-passing on graphs where most nodes receive no messages (web link graphs, citation networks), DCSC cuts pointer overhead substantially.

The tradeoff is access time. Random row lookup in DCSR needs a binary search over row_idx: $O(\log nnzr)$ instead of $O(1)$. For sequential iteration this is negligible, but for random-access patterns during sampling it can matter.
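A compact NumPy sketch of both the compression and the binary-search lookup, using the 8x8 CSR example above (`dcsr_row` is an illustrative helper):

```python
import numpy as np

# CSR of the 8x8 example: 8 rows, 6 non-zeros, 4 empty rows.
row_ptr = np.array([0, 2, 2, 3, 3, 3, 5, 5, 6])
col_idx = np.array([3, 7, 5, 1, 3, 6])

# DCSR: keep a pointer pair only for rows with at least one non-zero.
lens = np.diff(row_ptr)                            # per-row non-zero counts
dcsr_row_idx = np.flatnonzero(lens)                # indices of non-empty rows
dcsr_ptr = np.concatenate(([0], np.cumsum(lens[dcsr_row_idx])))

def dcsr_row(i):
    """Neighbors of row i: O(log nnzr) binary search instead of O(1)."""
    k = np.searchsorted(dcsr_row_idx, i)
    if k == len(dcsr_row_idx) or dcsr_row_idx[k] != i:
        return col_idx[:0]                         # row i is empty
    return col_idx[dcsr_ptr[k]:dcsr_ptr[k + 1]]
```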

DCSR and DCSC work best where standard compressed formats waste the most: knowledge graphs with millions of entity types but sparse relations, temporal snapshots where most nodes are inactive, and large bipartite graphs where one side has far more nodes than edges.

Sliced ELLPACK: GPU-Native Blocked Storage

ELLPACK (ELL) takes the opposite approach to compression. Instead of removing zeros, it pads every row to the maximum degree $d_{\max}$ and stores the result as a dense $n \times d_{\max}$ matrix. No pointers, no indirection, perfectly regular memory access. GPUs love it.

The problem is padding waste. Real graphs have power-law degree distributions: a few hubs with thousands of neighbors, most nodes with single digits. Padding every row to $d_{\max}$ wastes 90%+ of memory.

Sliced ELLPACK (SELL-C-$\sigma$)2 fixes this in two phases. First, rows are sorted by degree over a scope of $\sigma$ rows ($\sigma = n$ for a global sort, $\sigma = C$ for a local one). Then the sorted rows are cut into slices of $C$ consecutive rows, each padded to the slice-local maximum:

[Interactive figure: plain ELLPACK pads every row of an 8-node example to max_deg = 4, so 15/32 cells are padding (47% waste). Sliced ELLPACK with slice height 4 and rows sorted by degree pads slice 0 to max_deg = 4 and slice 1 to max_deg = 3, leaving only 11/28 cells as padding (39%).]

Sorting puts rows of similar degree together, so the slice-local max stays close to the mean. On power-law graphs, this cuts padding from 80-90% down to 15-30%.

$C$ is usually set to the GPU warp size (32 on NVIDIA) so each warp processes one slice with coalesced memory access. For GNN aggregation, each thread block processes a slice of destination nodes, all threads accessing memory at the same stride. Good L2 cache use, good bandwidth.
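As a rough illustration, here is one way to build a SELL-C-$\sigma$ structure in NumPy for the 8-node example from the figure, with $\sigma = C = 4$; all names are mine, and a real implementation would use $C = 32$ and lay each slice out column-major for coalescing:

```python
import numpy as np

# Out-neighbor lists of the 8-node example from the figure above.
adj = {0: [1, 3, 5, 7], 1: [2], 2: [0, 4, 6, 7], 3: [1],
       4: [5], 5: [0, 2, 4], 6: [3, 7], 7: [1]}
C = 4            # slice height (would be the warp size, 32, on NVIDIA GPUs)
sigma = 4        # sorting scope: sigma = C gives a slice-local sort
PAD = -1         # sentinel for padded cells

# Phase 1: sort rows by degree (descending) within each sigma-row window.
nodes = list(adj)
rows_sorted = []
for s in range(0, len(nodes), sigma):
    rows_sorted += sorted(nodes[s:s + sigma],
                          key=lambda v: len(adj[v]), reverse=True)

# Phase 2: cut into C-row slices, padding each to its slice-local max degree.
slices, padding = [], 0
for s in range(0, len(rows_sorted), C):
    rows = rows_sorted[s:s + C]
    width = max(len(adj[v]) for v in rows)         # slice-local max degree
    block = np.full((len(rows), width), PAD)
    for r, v in enumerate(rows):
        block[r, :len(adj[v])] = adj[v]
    padding += int((block == PAD).sum())
    slices.append((rows, block))
```

On this input the two slices pad to widths 4 and 3, reproducing the figure's 11 padded cells out of 28.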

Beyond Static Training: Dynamic Graph Storage

Everything above (COO, CSR, CSC, DCSR/DCSC, SELL) targets static snapshots. These are the workhorses of GNN training on fixed datasets. But real graphs change: social networks add and drop connections, transaction graphs grow by millions of edges daily, interaction networks get revised. Modifying a CSR in place requires rebuilding the pointer array, which does not work for streaming workloads.

Several recent systems tackle this with mutable sparse formats. They are further from the typical GNN workflow, but they matter when you are training on evolving graphs or building real-time inference pipelines.

VCSR: Mutable CSR with Packed Memory Arrays

VCSR3 stores each vertex's adjacency list in a Packed Memory Array (PMA), a gapped array that keeps sorted order while leaving empty slots for future insertions. In standard CSR, inserting one edge means shifting all subsequent entries. VCSR absorbs insertions locally with amortized $O(\log^2 n)$ cost per update while keeping CSR-like sequential scan performance for analytics.
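To see the core idea, here is a deliberately simplified, illustrative Python sketch of gap-absorbing insertion into a sorted gapped array; it is not VCSR's actual code, and it omits the rebalancing step that gives a real PMA its amortized $O(\log^2 n)$ bound:

```python
GAP = None   # empty slot marker

def pma_insert(arr, value):
    """Insert `value` into a sorted array with gaps, shifting elements only
    up to the nearest gap instead of shifting the whole suffix. Illustrative
    only: a real PMA also rebalances gap density after insertions."""
    # Find where `value` belongs: before the first stored element > value.
    pos = len(arr)
    for i, x in enumerate(arr):
        if x is not GAP and x > value:
            pos = i
            break
    # Shift the run [pos, gap) one slot right into the nearest gap.
    gap = next(i for i in range(pos, len(arr)) if arr[i] is GAP)
    arr[pos + 1:gap + 1] = arr[pos:gap]
    arr[pos] = value

# One vertex's neighbor list stored with gaps; inserting 4 shifts only the
# single element between its position and the first gap after it.
adj = [2, GAP, 5, 9, GAP, GAP]
pma_insert(adj, 4)
```

In plain CSR, the same insertion would shift every later entry of the whole edge array; here only `5` moves, because the gap after `9` absorbs the displacement.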

DGAP: Dynamic Graphs on Persistent Memory

DGAP4 builds a single mutable CSR for persistent memory hardware (Intel Optane DCPMM). Three designs keep updates crash-consistent: a per-section edge log to reduce write amplification, a per-thread undo log for rebalancing, and a placement schema that minimizes in-place updates. The authors report up to 3.2x better update throughput and 3.77x better analytics performance over prior persistent-memory graph systems.

LSMGraph: Multi-Level CSR with LSM-Tree Compaction

LSMGraph5 combines LSM-tree write optimization with CSR-based reads. An in-memory MemGraph caches updates; a multi-level on-disk hierarchy stores immutable CSR structures that get flushed and compacted through levels (like RocksDB, but with vertex-grained version control). It targets workloads where the graph exceeds main memory and must support ingestion and analytics simultaneously.

RapidStore: Concurrent In-Memory Graph Storage

RapidStore6 separates read and write paths so analytical queries can run while the graph is being updated. Writes are coordinated with multiversion two-phase locking; reads operate on snapshots with no locks. Versions are tracked at the subgraph level, not per-edge, which keeps overhead low. The result is up to 3.46x speedup on analytics with 56% less memory, and read latency increases by at most 13% under heavy concurrent writes.

Choosing the Right Format

| Use Case | Recommended Format | Why |
| --- | --- | --- |
| GNN message-passing (general) | CSC / destination-sorted COO | $O(1)$ access to incoming neighbors |
| User-facing data representation | COO (edge_index) | Flexible, $O(1)$ edge insertion |
| Custom GPU kernels | Sliced ELLPACK (SELL-C-32) | Regular access, warp coalescing |
| Hyper-sparse / bipartite graphs | DCSR / DCSC | No empty row/column overhead |
| Dynamic graphs (in-memory) | VCSR / RapidStore | Mutable CSR with concurrent access |
| Dynamic graphs (persistent) | DGAP / LSMGraph | Crash-consistent or disk-backed |

For message-passing aggregation, destination-grouped storage is usually the right default. Whether that means true CSC, CSR of the transposed adjacency, or destination-sorted COO with segmented reduction, the requirement is the same: incoming neighbors of each destination node should be contiguous in memory. PyG, DGL, and related systems have broadly converged on this, even if they expose it through different APIs.


The sparse format does not change your model’s expressiveness. The same aggregation produces the same output regardless of layout. But it determines whether training takes hours or days, whether the graph fits in memory, and whether your kernel hits 10% or 80% of peak GPU throughput. Once you see these formats as a design space with clear tradeoffs rather than a black box, the systems-level decisions get a lot easier.

Footnotes

  1. Buluç, A. and Gilbert, J.R., “On the Representation and Multiplication of Hypersparse Matrices”, IEEE IPDPS, 2008. Introduced DCSR for matrices where the number of non-empty rows is much smaller than the matrix dimension.

  2. Kreutzer, M. et al., “A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiply on Modern Processors with Wide SIMD Units”, SIAM Journal on Scientific Computing, 2014. The SELL-C-$\sigma$ format and its performance analysis across CPU and GPU architectures.

  3. Islam, A.A.R. et al., “VCSR: Mutable CSR Graph Format Using Vertex-Centric Packed Memory Array”, IEEE/ACM CCGrid, 2022. Achieves up to 3.81x better edge insertion performance over PMA-based CSR extensions while maintaining competitive analytics throughput.

  4. Islam, A.A.R. et al., “DGAP: Efficient Dynamic Graph Analysis on Persistent Memory”, SC, 2024. A single mutable CSR design with per-section edge logs, per-thread undo logs, and persistent-memory-aware data placement.

  5. Sun, Y. et al., “LSMGraph: A High-Performance Dynamic Graph Storage System with Multi-Level CSR”, SIGMOD, 2025. Combines LSM-tree compaction with CSR-based read paths and vertex-grained version control.

  6. Sun, S. et al., “RapidStore: An Efficient Dynamic Graph Storage System for Concurrent Queries”, PVLDB, 2025. Decoupled read/write paths with subgraph-level versioning for concurrent graph analytics.