Authors:
- Jiaxuan You
- Rex Ying
- Jure Leskovec
tl;dr
This method defines a notion of position by measuring distances to “anchor-sets”, which are sets of nodes and the distance metric is defined as the minimum path length among the path-lengths to each node in the anchor-set.
To apply P-GNN to a node, we first consider how it interacts with every node in all the anchor sets. This “message” computation includes essentially weighting by one over the distance to the node, and then concatenates the node features of source/target. Then for a single anchor set, we collapse all the node-to-node messages down to a fixed-length representation \in \mathbb R^r to represent the interaction with the full anchor set. These can be stacked to form \mathbf M_v \in \mathbb R^{k \times r}, where k is the number of anchor sets. Now, \mathbf M_v is used in two different ways to generate two different outputs. To get the position-aware embeddings, we simply take the inner-product with \mathbf w \in \mathbb R^r and apply a point-wise non-linearity, which converts these r-vectors for each anchor set into scalars, which are interpreted as distances. The other way to use \mathbf M_v is to construct node-level vector representations that can be used as input to additional P-GNN layers. To get these, we aggregate across the anchor-sets (rather than using a dot product to eliminate the r-vectors): \mathbf h_v = \text{AGG}_S(\{ \mathbf M_v [i] , \, \forall i \in [1,k] \}).
Interestingly, this means all nodes receive messages from the same nodes (the set of anchor-sets), and not from their connected neighbors. This is very different than normal message passing algorithms.
Summary
GNNs cannot distinguish topologically isomorphic neighborhoods. By imbuing nodes with a global position, GNNs can distinguish distant neighborhoods that are structurally isomorphic in an inductive way. The key idea is constructing an embedding that quantifies the distance to a chosen set of anchor nodes.
Node embeddings are considered “position-aware” if they can be used to re-construct the shortest-path distance between two nodes–i.e., node embedding \bf z_i = f_p( v_i) \, \forall v \in V is position-aware if there exists a function g_p such that d_{sp} = g(\bf z_i, \bf z_j) where d_{sp} is the shortest-path distance between v_i and v_j. Conversely, an embedding is “structurally-aware” if it’s a function of the k\text{-hop} neighborhood of a node. Proposition 1 says that a mapping can only exist between structure-aware embeddings and position-aware embeddings if there are no isomorphic k-hop neighborhoods. If there were, how could you distinguish among the identical structural embeddings that are in different positions?
The key design choice is that nodes are allowed to not only receive messages from their k-hop neighborhood, but also from an “anchor set” of nodes, which are a random subset of all the nodes.
We will represent positional encodings as distances to anchor sets. The distance to an anchor set is the minimum of the distances to all the nodes in the set. We will have multiple anchor sets of differing size, and each dimension in our positional encoding will represent the distance to a single anchor set.
How many anchor sets and of what size? An exponentially increasing number of nodes at an exponentially shrinking quantity. E.g., lots of size 1 anchor sets, half as many size 2, a quarter as many size 4…etc.
Algorithm is something like this (for a single layer), for node v:
- Create a set of anchor-sets {S_i} \text{ for } i=1...k
- For each anchor set i…
- Create a set of messages \in \mathbb R^{r} between the target node and the nodes in the anchor set: \mathcal M_i = \{F(u, v, \bf h_u, \bf h_v) | \, \forall u \in S_i\}
- In practice, F(u, v, \mathbf h_u, \mathbf h_v) = s(u, v) \text{concat}(\mathbf h_u, \mathbf h_v), where s(u,v) is some distance function between nodes u and v
- Aggregate these messages to get a vector for this anchor-set that’s independent of the number of nodes in the anchor-set: \bf{M}_v[i] = \text{AGG}_M(\mathcal M_i)
- Create a set of messages \in \mathbb R^{r} between the target node and the nodes in the anchor set: \mathcal M_i = \{F(u, v, \bf h_u, \bf h_v) | \, \forall u \in S_i\}
- Apply a non-linear transformation and transform these anchor-set vectors into scalars: \bf{z}_v = \sigma(\bf{M}_v \cdot \bf w) where \bf w \in \mathbb R^r. These are the positional embeddings.
- Get a representation of node v for use in down-stream layers by aggregating across the anchor-sets: \bf h_v = \text{AGG}_S(\{\bf M_v[i] | \forall i \in [1, k] \})
Interestingly, it appears there is no message passing on the normal graph structure, but rather that each node does message passing with the same set of anchor-sets. The graph structure information is encoded in the minimum-path distance.
Does that even make sense? That means the set of messages is very similar, with the exception that the target node’s representation is concatenated and the distance function applies a scaling factor.
Experimental Setting
When talking about inductive learning, they describe using “order-invariant node attributes”. I think this rules out things like OHE and include things like using the degree or simply a constant value. They also note, “Inductive position-aware node classification is not well-defined due to permutation of labels in different graphs. However pairwise node classification, which only decides if nodes are of the same class, is well defined in the inductive setting.” I think they are specifically talking about community assignment, as one could imagine having different labels for a given community.
In any case, they compare both Link Prediction tasks, and a form of “inductive node classification” in which the make pairwise decisions on whether or not two nodes belong to the same community.
Datasets
- Grid: a 2D grid of size 20x20 with no features
- Community: A “connected caveman” graph with some rewiring and 20 known communities of 20 nodes each
- PPI: 24 Protein-protein interaction networks, each of which have an average of 3000 nodes of degree 29 and there are 50-dimensional node features.
Baselines
- GCN
- GraphSAGE
- GAT
- GIN
Evaluation metrics
ROC-AUC.
Results
Link Prediction was much, much better than baseline methods on the “Grid” and “Community” datasets. However, since these datasets don’t have node features, it seems that the authors created some by assigning scalar values to the nodes. I’m not sure if they used degree, or simply labeled the nodes with integers. In any case, that failed miserably (as expected) and their method had comparable performance to the transductive setting. On the PPI dataset, which has node features, there was no real performance benefit for their method.
Node classification wasn’t done normally, but instead was “pairwise node classification” in which a pair of nodes were input and a prediction of shared community was output.
Conclusions
This paper is confusing to me. It’s not clear that their evaluation method makes sense if they’re essentially using a constant as “features” and then using a model that generates the same prediction for every isomorphic 3-hop neighborhood. It demonstrates the problem of position-awareness, but it’s not clear how well this method will work in real-life settings.
Note that to get the shortest-path distances as they suggest, it’s an O(n^2) to O(n^3) computation with O(n^2) space complexity. This is a one-time operation, but it’s very costly.