tl;dr
A graph self-supervised learning task is constructed in which a node’s text embedding is used to predict its neighborhood (e.g., its row of the adjacency matrix). Since the output dimensionality is the number of nodes, this is essentially an “extreme multi-label text classification” (XMC) problem, so they hierarchically cluster the graph neighborhoods and take a curriculum learning approach: first train to predict coarse graph structure, then incrementally increase the resolution of the neighborhood definition. Clustering is done by defining a feature vector for each label (here, the label space is given by the rows of the adjacency matrix, i.e., each label marks a potential neighbor) as an aggregation of the simple text features of the nodes that share that label, and then applying hierarchical clustering in this label-feature space, as sketched below.
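A minimal sketch of the label-clustering step, assuming a binary adjacency matrix `A` (undirected, so rows and columns coincide) and simple per-node text features `X` such as TF-IDF vectors; the function names and the recursive balanced k-means tree are illustrative, not the paper’s actual API:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def label_features(A: csr_matrix, X: np.ndarray) -> np.ndarray:
    """Feature for label j = normalized sum of the text features of all
    nodes whose neighborhood contains j (a column of A; for an
    undirected graph this equals the row aggregation)."""
    Z = A.T @ X          # (n_labels, d): aggregate features of nodes sharing each label
    return normalize(Z)  # unit-norm rows so clustering is cosine-like

def build_label_tree(Z: np.ndarray, branch: int = 2, depth: int = 3):
    """Recursively k-means the label features, yielding coarse-to-fine
    label clusterings (one assignment vector per tree level)."""
    assignments = []
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(depth):
        new_labels = np.empty_like(labels)
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            k = min(branch, len(idx))
            km = KMeans(n_clusters=k, n_init=10).fit(Z[idx])
            new_labels[idx] = c * branch + km.labels_  # unique ids per parent cluster
        labels = new_labels
        assignments.append(labels.copy())
    return assignments  # assignments[0] is coarsest, assignments[-1] finest
```

The curriculum then trains the encoder to predict a node’s neighborhood at the coarsest level (`assignments[0]`) first, moving to finer levels until (in the limit) it predicts the full adjacency row.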
Very large performance gains are achieved when the resulting features are fed directly into an MLP, compared to the default (“natural”) node features (e.g., accuracy goes from 63% to 73% on arxiv, and from 47% to 61% on papers100M). Improvements also exist for GNNs, but are much smaller in magnitude.
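For context, the “features into an MLP” baseline is just a standard node-classification MLP over the fixed per-node feature vectors; a minimal sketch, with illustrative hyperparameters:

```python
import torch.nn as nn

class NodeMLP(nn.Module):
    """Classify nodes from fixed feature vectors: either the learned
    self-supervised embeddings or the raw features (the baseline)."""
    def __init__(self, d_in: int, n_classes: int, d_hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(d_hidden, n_classes),
        )

    def forward(self, x):  # x: (n_nodes, d_in)
        return self.net(x)

# Train with cross-entropy on the labeled nodes; the graph structure is
# used only during feature learning, not at classification time.
```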