Global Self-Attention as a Replacement for Graph Convolution


At the time of writing, this is the top-performing method on OGB. It is similar to Graphormer (arXiv, summary), but there are edge-level tokens as well, which lets the model attend to edge features and make predictions at the edge level. There is an embedding not only for each real edge but for all N × N possible edges. Edge information enters the attention in two ways: (1) an edge encoding adds a bias term inside the attention module, and (2) the output of the softmax attention is multiplied by a sigmoid gating function that also depends on an edge encoding. They seem to assign positional encodings via an SVD of the adjacency matrix, giving each node the concatenation of its rows in the source and destination factors; the dot product of these node representations should then recreate the adjacency matrix, since they come from a matrix factorization.
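Both uses of the edge channel, plus the SVD-based positional encodings, can be sketched in a few lines of numpy. This is a toy single-head version: the projection matrices, the scalar projections `w_bias`/`w_gate`, and the rank `r` are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N, d = 5, 8  # N nodes, model width d (toy sizes)

# Node representations and the usual Q/K/V projections (single head).
h = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = h @ Wq, h @ Wk, h @ Wv

# One embedding per *possible* edge, i.e. an N x N grid of edge channels.
e = rng.normal(size=(N, N, d))
w_bias = rng.normal(size=(d,))  # edge channel -> scalar attention bias
w_gate = rng.normal(size=(d,))  # edge channel -> scalar gate logit

scores = (Q @ K.T) / np.sqrt(d)    # plain dot-product attention logits
scores = scores + e @ w_bias       # use 1: additive edge bias
attn = softmax(scores, axis=-1)
attn = sigmoid(e @ w_gate) * attn  # use 2: sigmoid gating of the attention
out = attn @ V                     # updated node representations, shape (N, d)

# SVD positional encodings: factor the adjacency A = U diag(S) Vt and give
# node i the concatenation of its "source" row and "destination" row, each
# scaled by sqrt(S). The dot product src[i] @ dst[j] then approximately
# recovers A[i, j] (exactly, at full rank).
A = (rng.random((N, N)) < 0.4).astype(float)
U, S, Vt = np.linalg.svd(A)
r = 3                                # truncation rank (assumption)
src = U[:, :r] * np.sqrt(S[:r])      # per-node "source" half
dst = Vt[:r, :].T * np.sqrt(S[:r])   # per-node "destination" half
```

Gating multiplies the already-normalized attention weights by a value in (0, 1), so rows no longer sum to one; the edge channel can effectively switch interactions off.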

They also describe a self-supervision target that they use in a multi-task learning setup to help regularize the model: an MLP on the edge embeddings predicts (a one-hot encoding of) the shortest-path distance in the graph between the two nodes. They claim this regularizes the model and imposes a structural inductive bias.
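A minimal sketch of that auxiliary task: BFS gives the shortest-path targets (with far or unreachable pairs clipped into one catch-all class), and a small MLP over the edge embeddings is trained with cross-entropy against them. Graph size, `MAX_DIST`, and the MLP shape are illustrative assumptions.

```python
import numpy as np
from collections import deque

def shortest_path_distances(adj, max_dist):
    """All-pairs BFS; distances beyond max_dist (incl. unreachable pairs)
    are clipped into one catch-all class max_dist + 1."""
    n = len(adj)
    INF = n + 1
    dist = np.full((n, n), INF, dtype=int)
    for s in range(n):
        dist[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in range(n):
                if adj[u][v] and dist[s, v] == INF:
                    dist[s, v] = dist[s, u] + 1
                    q.append(v)
    return np.minimum(dist, max_dist + 1)

def log_softmax(x):
    m = x.max(axis=-1, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
n, d, MAX_DIST = 6, 8, 4

# Toy path graph 0-1-2-3-4-5 and random edge embeddings.
adj = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
targets = shortest_path_distances(adj, MAX_DIST)  # classes 0 .. MAX_DIST+1

e = rng.normal(size=(n, n, d))                 # one embedding per node pair
W1 = rng.normal(size=(d, 16))                  # hypothetical 2-layer MLP
W2 = rng.normal(size=(16, MAX_DIST + 2))
logits = np.maximum(e @ W1, 0) @ W2            # per-edge class logits

# Cross-entropy on the one-hot distance targets; added to the main
# objective as an auxiliary loss in the multi-task setup.
logp = log_softmax(logits)
aux_loss = -np.mean(np.take_along_axis(logp, targets[..., None], axis=-1))
```

Because every node pair has an embedding, the target matrix covers all N × N pairs, not just the real edges.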

Their ablation studies suggest that both the gating mechanism and dropout on the attention weights are important (the latter, they claim, encourages long-range interactions).
