# Graph-Bert: Only Attention is Needed for Learning Graph Representations

Authors:

• Jiawei Zhang
• Haopeng Zhang
• Congying Xia
• Li Sun

## tl;dr

This paper applies a transformer-based architecture to node classification. To limit the receptive field of the attention mechanism to only “relevant” nodes, a subgraph around each target node is sampled by computing an “intimacy score” between all pairs of nodes and keeping the top-k. The context is therefore not restricted to directly connected nodes or to a fixed number of hops from the target node. The intimacy score is based on PageRank:

S = \alpha \left[ \mathbf{I} - (1 - \alpha) \bar{A} \right]^{-1}

where \alpha is a constant (set to 0.15), \mathbf I the identity matrix, and \bar A the column-normalized adjacency matrix.
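
A minimal NumPy sketch of this computation (the function names and the top-k helper are my own; the paper fixes \alpha = 0.15):

```python
import numpy as np

def intimacy_scores(A: np.ndarray, alpha: float = 0.15) -> np.ndarray:
    """PageRank-style intimacy matrix S = alpha * (I - (1 - alpha) * A_bar)^(-1)."""
    # Column-normalize the adjacency matrix so each column sums to 1.
    col_sums = A.sum(axis=0, keepdims=True)
    A_bar = A / np.maximum(col_sums, 1e-12)
    n = A.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * A_bar)

def top_k_context(S: np.ndarray, target: int, k: int) -> np.ndarray:
    """Indices of the k nodes most 'intimate' with the target (excluding itself)."""
    scores = S[:, target].copy()
    scores[target] = -np.inf  # drop the target itself from its own context
    return np.argsort(-scores)[:k]
```

Note that because S is a dense matrix inverse, the paper can precompute it once per graph; the context sampling then reduces to a sort per target node.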

After determining the relevant context nodes, positional encodings of three different types are used. Each constructs an integer index in a different way, and that integer is then fed into the standard transformer’s sinusoidal positional encoding: PE(i) = \left[ \sin\!\left(\frac{i}{10000^{2l/d}}\right), \cos\!\left(\frac{i}{10000^{2l/d}}\right) \right]_{l=0}^{d/2}

• Run the W-L algorithm to assign matching “colors” to structurally equivalent nodes, then map each color to an integer index
  • This doesn’t make sense to me, since W-L colors are hashes: adjacent integers are not more similar than distant ones
• Impose a rank order based on descending “intimacy” score, and use the rank as the integer index
• Use the shortest path length from the target node as the integer index
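
All three integer indices go through the same sinusoidal embedding; a sketch of that function (my naming, vectorized over dimensions):

```python
import numpy as np

def sinusoidal_pe(index: int, d: int) -> np.ndarray:
    """Map an integer index (WL color id, intimacy rank, or hop count)
    through the standard Transformer sin/cos positional encoding."""
    l = np.arange(d // 2)
    angles = index / (10000.0 ** (2 * l / d))
    pe = np.empty(d)
    pe[0::2] = np.sin(angles)  # even dimensions: sine
    pe[1::2] = np.cos(angles)  # odd dimensions: cosine
    return pe
```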

The raw input features and various PEs are summed to create the final input representation of each node. Between layers, residual connections of various types are considered. To construct the final representation of the target node, all output nodes’ representations are mean-pooled together.
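
The combination and pooling steps are simple; a sketch under the assumption that all tensors share shape (k_context, d):

```python
import numpy as np

def node_inputs(x_raw, pe_wl, pe_rank, pe_hop):
    """Sum the raw-feature embedding with the three positional encodings;
    all arguments have shape (k_context, d)."""
    return x_raw + pe_wl + pe_rank + pe_hop

def target_representation(h_out):
    """Mean-pool the final-layer representations of all context nodes
    to obtain the target node's representation."""
    return h_out.mean(axis=0)
```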

Beyond supervised learning, the authors also consider self-supervised pre-training via two tasks:

• Reconstruction of target node’s raw features
• For a pair of nodes, take cosine similarity in output representations and train to predict the pre-computed “intimacy” score
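
The second task can be sketched as a per-pair squared error (my formulation; the paper aggregates over sampled node pairs):

```python
import numpy as np

def intimacy_pretrain_loss(h_i, h_j, s_ij):
    """Squared error between the cosine similarity of two nodes' output
    representations and their pre-computed intimacy score s_ij."""
    cos = h_i @ h_j / (np.linalg.norm(h_i) * np.linalg.norm(h_j) + 1e-12)
    return (cos - s_ij) ** 2
```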

## Experimental results

The primary results compare accuracy on node classification using three small citation-graph datasets: Cora, Citeseer, and Pubmed.

• Improved upon GCN on 2 of the 3: 81.5 → 84.3 (Cora), 70.3 → 71.2 (Citeseer), and mostly matched on Pubmed: 79.0 → 79.3
• Positional Encoding methods provided a small benefit beyond the raw features, but little value without raw features
• The number of context nodes to include had a strong impact on results, where too few or too many caused severe degradations
• Using graph-smoothed raw features (A*X) as a residual to all layers was a little better than just raw features (X) or no residual connections.

## my take-aways

• The idea of using a threshold on a similarity metric to construct the context, rather than just restricting to a k-hop neighborhood, is interesting since it allows for important nodes that may not share an edge (e.g., because of incomplete data). This may have some connection with the intuition behind diffusion-based methods or graph rewiring approaches.
• Determining the relevant context from the graph will be a key choice in making Transformers work well on node classification tasks
• Some of the positional encoding details were questionable. Using WL to map structurally equivalent nodes to integers makes some sense, but using those as inputs to Transformer sin/cos-based embeddings doesn’t, as far as I can tell. It would be better to make these learnable.
• The datasets are too small to draw conclusions for real-life industrial problems, and I’d like to see more than citation networks