Attending to Graph Transformers


  • Luis Müller
  • Mikhail Galkin
  • Christopher Morris
  • Ladislav Rampášek


This paper is a survey of Graph Transformers (GTs), which use the Transformer architecture along with various techniques for injecting information from the graph. In doing, there are three major tasks: 1/ representing graph structure through positional encodings, 2/ tokenizing the graph into a sequence of items, and 3/ choosing an attention mechanism. This paper does some experimentation to show that GTs out-perform GNNs on tasks where GNNs are known to struggle: graph structure prediction that requires expressivity exceeding the 1-WL test (e.g., triangle counting), handling heterophilic data, and modeling long range relationships. Not much is said about GTs’ ability to perform in tasks where GNNs are strong, like efficiently modeling homophilic graph data.


1. What do you do for structural/positional encodings so that the model can effectively leverage graph topology? E.g.:
1. Calculate simple structural properties and add as features, or learn embeddings for them (degree, motif counting, random walk statistics, Ricci curvature)
2. Calculate shortest-path distance measures to anchor nodes
3. Calculate eigenvectors of the adjacency/Laplacian

With these, need to consider if this constrains your model to the transductive setting (e.g., as with eigenvectors of the Laplacian), and if they are permutation equivariant.

The help answer these, the paper compares GTs to a Graph Isomorphism Network (GIN) GNN, which is known to have an expressivity equivalent to 1-WL, on a handful of structural prediction tasks (e.g., triangle counting). For GTs, they compare two forms of positional encodings (Random Walk vs Laplacian-based) and Graphormer, which adds a bias to the attention mechanism based on shortest path length between tokens. Finally, they also consider a pure Transformer with no graph information.

The authors find that all the GTs match or out-perform GIN, particularly on the harder tasks. Among the GTs, there is no clear winner across tasks. All methods struggle generalizing triangle counting from small to large graphs.

2. How do you tokenize the graph into a sequence/set?
A few options:

  1. Each node is an element (attention scales as O(n^2))
  2. Each node and edge is an element (attention scales as O((n+m)^2))).
  3. Subgraphs are elements, where subgraphs result from graph coarsening/partitioning (attention scales as O(k^2), where k is the number of partitions)

3. What attention mechanism do we use?
The majority of approaches can be placed in the following categories:

  1. Standard Transformer attention that considers all pairs of tokens
  2. Modified to be structure aware (e.g., add a bias in the attention function based on whether or not the token pair shares a connection in the graph, or the shortest path length between the tokens, as in Graphformers.)
  3. Kernelized attention mechanisms that have linear scalability
  4. Hybrid approaches that combine GNN layers and attention mechanisms/layers

Over-smoothing and heterophilic datasets

If GNNs are low-pass filters that smooth over a neighborhood, this is expected to fail for heterophilic datasets, where dissimilar items are likely to share connections. It’s postulated that the full attention of GTs can overcome this problem and the authors test it on heterophilic benchmarks. They compare the following:

  1. A GCN with/without various positional encodings (e.g., Random Walk, Laplacian…etc)
  2. A hybrid GCN → Transformer, with/without positional encodings
  3. Transformer, with various positional encodings
  4. Graphormer (graph-aware attention mechanism)

The findings were as follows:

  • All transformer-based models outperform a 2-layer GCN
  • Augmenting GCNs with positional encodings had a minimal performance impact
  • Adding attention after the GCN universally improved the performance
  • Disabling the GCN prior to the Transformer (i.e., becoming the Transformer model), increases the performance even further
  • The attention bias of Graphormer had no impact on its performance

In sum, GNN-like methods did poorly on these heterophilic datasets, and the global attention of a transformer was more successful.

Over-squashing and long-range relationships

Here they consider the NeighborsMatch problem, which is a contrived dataset used to test GNN’s ability to search for long range matches, magnifying the “over-squashing” problem that comes with needing to represent an exponentially increasing receptive field size with a fixed-size vector. Unsurprisingly, Transformers perform perfectly on this task, however, it gives little empirical information on the problem of usefully representing giant receptive fields.


Overall, GTs compared favorable in the tasks presented here. It’s also noted that GTs are currently state of the art on large graph classification benchmarks, like PCQM4Mv2. It should be pointed out that all of these tasks emphasize known failure modes of GNNs: exceeding 1-WL in structure prediction, handling heterophilic data, and modeling long-distance relationships. While it’s valuable to see Transformers succeed where GNNs fail, I would like to also see a comparison on tasks where GNNs do well. When local smoothing of features is helpful (e.g., ogbn-arxiv), can Transformers learn to aggregate or otherwise handle the variance of tokens? This key benefit of GNNs is what makes them perform well on homophilic graphs, and from this survey, it’s unclear if Transformers might be a replacement for GNNs, or a compliment when they fail.

1 Like