Do Transformers Really Perform Bad for Graph Representation?

tl;dr

They present the Graphormer model, which injects graph structural information into a standard Transformer architecture. Specifically, they add “centrality encodings”: learned embeddings indexed by each node’s in-degree and out-degree, which are added to the node features. They also use “spatial encodings” for each pair of nodes (v_i, v_j) by 1/ computing the shortest-path distance between v_i and v_j and 2/ mapping this distance to a learnable scalar that is used as a bias in the attention term:

A_{ij} = \frac{(h_iW_Q)(h_jW_K)^T}{\sqrt{d}} + b_{\phi(v_i, v_j)}
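A rough PyTorch sketch of how these two encodings could be implemented is below; the class name and the max_degree / max_dist clipping limits are my own illustrative assumptions, not the paper’s actual code.

```python
import torch
import torch.nn as nn

class GraphormerStructuralEncodings(nn.Module):
    """Sketch of the centrality and spatial encodings described above.

    The class name and the max_degree / max_dist limits are assumptions,
    not taken from the official implementation.
    """

    def __init__(self, hidden_dim, max_degree=64, max_dist=32):
        super().__init__()
        # Centrality encoding: a learned embedding per in-degree / out-degree value,
        # added to each node's features before the first Transformer layer.
        self.in_degree_emb = nn.Embedding(max_degree, hidden_dim)
        self.out_degree_emb = nn.Embedding(max_degree, hidden_dim)
        # Spatial encoding: one learnable scalar per shortest-path distance,
        # used as the additive bias b_{phi(v_i, v_j)} on the attention logits.
        self.spatial_bias = nn.Embedding(max_dist, 1)

    def node_features(self, x, in_degree, out_degree):
        # x: [num_nodes, hidden_dim]; in_degree, out_degree: [num_nodes] integer tensors
        return x + self.in_degree_emb(in_degree) + self.out_degree_emb(out_degree)

    def attention_bias(self, spd):
        # spd: [num_nodes, num_nodes] shortest-path distances, clipped to max_dist - 1
        return self.spatial_bias(spd).squeeze(-1)  # [num_nodes, num_nodes]
```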

They additionally incorporate edge features by 1/ finding the shortest path between a pair of nodes, 2/ taking the dot product of each edge feature vector along that path with a learnable vector of parameters, and 3/ averaging these dot products over the path:

c_{ij} = \frac{1}{N}\sum_{n=1}^N x_{e_n} (w_n^E)^T

where x_{e_n} is the feature vector of the n-th edge on the path and w_n^E is the learnable weight vector for the n-th position. This term is then added as another bias to the attention module:

A_{ij} = \frac{(h_iW_Q)(h_jW_K)^T}{\sqrt{d}} + b_{\phi(v_i, v_j)} + c_{ij}
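A minimal sketch of how c_{ij} and the full biased attention logits might be computed, assuming shortest paths are precomputed as lists of edge indices; the function names and tensor shapes are illustrative, not the paper’s API.

```python
import torch

def edge_encoding_bias(shortest_paths, edge_features, edge_weights):
    """Compute c_{ij} as the average of <x_{e_n}, w_n^E> over each shortest path.

    shortest_paths[i][j]: list of edge indices on the shortest path from node i to j
    edge_features:        [num_edges, edge_dim] tensor of edge feature vectors x_{e_n}
    edge_weights:         [max_dist, edge_dim] learnable vectors w_n^E, one per position
    """
    num_nodes = len(shortest_paths)
    c = torch.zeros(num_nodes, num_nodes)
    for i in range(num_nodes):
        for j in range(num_nodes):
            path = shortest_paths[i][j]
            if not path:
                continue
            dots = [edge_features[e] @ edge_weights[n] for n, e in enumerate(path)]
            c[i, j] = torch.stack(dots).mean()
    return c  # [num_nodes, num_nodes], added to the attention logits


def attention_logits(h, W_Q, W_K, spatial_bias, edge_bias):
    """Single-head attention logits with both biases, matching the equation above."""
    d = W_Q.shape[1]
    q, k = h @ W_Q, h @ W_K
    return q @ k.T / d ** 0.5 + spatial_bias + edge_bias
```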

Note that the model can be transferred to other molecular tasks since, at heart, the node features correspond to atoms, the edge features to bonds, and the shortest-path distances can be computed directly from the molecular structure. They indeed show that the model achieves SOTA performance when fine-tuned on other molecular graph tasks.