Scalable Graph Neural Networks for Heterogeneous Graphs

Authors:

  • Lingfan Yu
  • Jiajun Shen
  • Jinyang Li
  • Adam Lerer

tl;dr

SIGN is an architecture in which graph-based features are computed in a one-time preprocessing step and the model is simply an MLP on top of those features. Concretely, it concatenates \{ \mathbf H_k \mid \mathbf{H}_k = \mathbf{A}^k\mathbf X,\ k = 1, \dots, K \}, where multiplying the node features \mathbf X by the k-th power of the adjacency matrix \mathbf A aggregates information from each node's k\text{-hop} neighborhood. None of this depends on model parameters, so it can be computed once as preprocessing; the model can then simply be \mathbf{Z} = \sigma(\mathbf{H \Theta}), where \mathbf{H} = \|_{k=1}^K \mathbf H_k. In other words, we compute features at several message-passing depths and concatenate the results, which keeps information from different neighborhood sizes separate instead of updating node representations through stacked GNN layers, where depth can lead to over-smoothing.
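
A minimal sketch of this preprocessing-plus-MLP split (not the authors' code; the toy graph, dimensions, and function names are mine, and row-normalizing \mathbf A is my assumption so repeated products act as neighbor averaging):

```python
# Hypothetical sketch of SIGN-style preprocessing: compute A^k X for
# k = 1..K once, offline; the trainable model is just a linear map
# (or MLP) on the concatenation.
import numpy as np
import scipy.sparse as sp

def sign_preprocess(adj: sp.spmatrix, feats: np.ndarray, num_hops: int) -> np.ndarray:
    """Return the concatenation [A^1 X, ..., A^K X]; parameter-free."""
    hops, h = [], feats
    for _ in range(num_hops):
        h = adj @ h                      # one more hop of aggregation
        hops.append(h)
    return np.concatenate(hops, axis=1)  # shape: (N, K * D)

# Toy example: 4 nodes on a path graph, D = 2 features, K = 3 hops.
edges = [(0, 1), (1, 2), (2, 3)]
rows = [u for u, v in edges] + [v for u, v in edges]
cols = [v for u, v in edges] + [u for u, v in edges]
adj = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4)).tocsr()
deg = np.asarray(adj.sum(axis=1)).ravel()
adj = sp.diags(1.0 / np.maximum(deg, 1)) @ adj   # row-normalize: neighbor averaging

X = np.random.randn(4, 2)
H = sign_preprocess(adj, X, num_hops=3)          # runs once, before training

# The "model" is then sigma(H @ Theta); here a single linear layer + ReLU.
Theta = np.random.randn(H.shape[1], 2)           # e.g. 2 output classes
logits = np.maximum(H @ Theta, 0)
print(H.shape, logits.shape)                     # (4, 6) (4, 2)
```

The important property is that `sign_preprocess` has no trainable parameters and runs once; only `Theta` is learned.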

NARS does the same thing, but generalizes to heterogeneous graphs by taking the different relation types into account rather than treating the graph as homogeneous, as SIGN would. Given the set of edge types \mathcal R in a heterogeneous graph, the key proposal is to form subsets of edge types R_i \subseteq \mathcal R, each of which defines an adjacency matrix \mathbf A_i containing only edges whose type is in R_i.

For each hop k (which represents k\text{-hop} message passing on the subgraph), the per-subset features \mathbf H^i_k are aggregated across the relation subsets with a learnable vector applied element-wise (the paper's "1D convolution"): \mathbf H^{\text{agg}}_{k} = \sum_{i} \mathbf a^i_k \odot \mathbf H_{k}^i, where i ranges over the sampled relation subsets. We then concatenate across the message-passing depths: \mathbf{H} = \|_{k=1}^K \mathbf H_k^\text{agg}.
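
A hedged sketch of this aggregation step (my own construction, not the released code): the per-subset, per-hop features are assumed to be precomputed, the learnable vectors act element-wise per feature dimension, and the per-hop aggregates are concatenated.

```python
# Sketch of aggregating precomputed per-subset features with learnable
# vectors a_k^i (elementwise over the D feature dimensions), then
# concatenating across hops. Shapes and names are assumptions.
import torch
import torch.nn as nn

class SubsetAggregator(nn.Module):
    def __init__(self, num_hops: int, num_subsets: int, dim: int):
        super().__init__()
        # One weight vector per (hop, relation subset).
        self.a = nn.Parameter(torch.ones(num_hops, num_subsets, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_hops, num_subsets, N, D), computed offline.
        # H_k^agg = sum_i a_k^i * H_k^i  (elementwise product over D).
        agg = (self.a.unsqueeze(2) * feats).sum(dim=1)   # (num_hops, N, D)
        return torch.cat(list(agg), dim=1)               # (N, num_hops * D)

# Toy usage: K = 2 hops, 3 relation subsets, 5 nodes, D = 4 features.
feats = torch.randn(2, 3, 5, 4)
agg = SubsetAggregator(num_hops=2, num_subsets=3, dim=4)
print(agg(feats).shape)   # torch.Size([5, 8]); feed this to an MLP classifier
```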

Summary

There’s some evidence that the main power of GNNs comes from using the graph for local feature smoothing, not necessarily from building graph-based hierarchical features. How do you exploit this with heterogeneous graphs? This paper develops a method for that.

We propose Neighbor Averaging over Relation Subgraphs (NARS), which computes neighbor averaged features for random subsets of relation types, and combines them into a single set of features for a classifier using a 1D convolution.

Core Idea

  1. Choose a subset of possible relation types, R_i \subseteq \mathcal{R}
  2. Create a subgraph that only contains edge-types in R_i: G_i
    • Model this subgraph as homogeneous
  3. Aggregate the neighbor features to get updated representations for each relation subset:
H^i_{v,l} = \frac{1}{| \mathcal{N}_i(v)|} \sum_{u \in \mathcal{N}_i(v)} H^i_{u,l-1}
  • Note that there are no learnable parameters here; this is simply smoothing the raw features via message passing, and it can be repeated for L iterations.
  • LIMITATION: This assumes feature dimensions are the same across node-types. Recall that this subgraph is treated as a homogeneous graph even though it can have multiple node- and edge-types.
  4. Aggregate across the relation subsets to get a single representation of each node, where K is the number of sampled subsets and the learnable vector a_{i,l} is applied element-wise (the "1D convolution" from the abstract):
H^{\text{agg}}_{v,l} = \sum_{i=1}^{K} a_{i,l} \odot H_{v,l}^i
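
Steps 1-3 amount to parameter-free sparse-matrix preprocessing. Below is a small illustrative sketch (toy relation names and graph; row-normalizing by degree so each hop is a neighbor average is my choice); step 4 is the learnable aggregation sketched earlier.

```python
# Hypothetical sketch of steps 1-3: sample a relation subset, build that
# subgraph's (homogeneous) adjacency, run L rounds of neighbor averaging.
import random
import numpy as np
import scipy.sparse as sp

def subset_subgraph(edges_by_rel: dict, rel_subset, num_nodes: int):
    """Row-normalized adjacency keeping only edges whose type is in rel_subset."""
    rows, cols = [], []
    for rel in rel_subset:
        for u, v in edges_by_rel[rel]:
            rows += [u, v]                       # treated as undirected here
            cols += [v, u]
    adj = sp.coo_matrix((np.ones(len(rows)), (rows, cols)),
                        shape=(num_nodes, num_nodes)).tocsr()
    deg = np.asarray(adj.sum(axis=1)).ravel()
    return sp.diags(1.0 / np.maximum(deg, 1)) @ adj

def neighbor_average(adj, feats: np.ndarray, num_hops: int):
    """H_{v,l} = mean over neighbors of H_{u,l-1}; one matrix per hop l."""
    out, h = [], feats
    for _ in range(num_hops):
        h = adj @ h
        out.append(h)
    return out

# Toy heterogeneous graph: 6 nodes, two edge types, 4-d features everywhere.
edges_by_rel = {"cites": [(0, 1), (1, 2)], "writes": [(3, 0), (4, 1), (5, 2)]}
rel_subset = random.sample(sorted(edges_by_rel), k=1)             # step 1
adj_i = subset_subgraph(edges_by_rel, rel_subset, num_nodes=6)    # step 2
H_i = neighbor_average(adj_i, np.random.randn(6, 4), num_hops=3)  # step 3
print(len(H_i), H_i[0].shape)                                     # 3 (6, 4)
```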

Handling featureless nodes (Sec. 5.4)

Tried four approaches:

  1. padding with zeros
  2. taking the average of features from neighboring paper nodes
  3. using pre-trained metapath2vec embeddings
  4. using pre-trained TransE relational graph embeddings with L2 norm (Bordes et al., 2013)

All models achieve their best performance with the pre-trained TransE graph embedding features. Featurization is especially important for the neighbor-averaging approaches (SIGN and NARS).
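
A small sketch of approach 2 (with approach 1's zero padding as the fallback when a node has no paper neighbors); the author-to-paper mapping and all names are illustrative, not the paper's code.

```python
# Give a featureless node (e.g. an author) the mean of its neighboring
# papers' features; fall back to zeros if it has no paper neighbors.
import numpy as np

def average_paper_features(node_to_papers: dict, paper_feats: np.ndarray,
                           dim: int) -> dict:
    feats = {}
    for node, papers in node_to_papers.items():
        feats[node] = (paper_feats[papers].mean(axis=0) if papers
                       else np.zeros(dim))       # approach 1: zero padding
    return feats

paper_feats = np.random.randn(3, 4)              # 3 papers, 4-d features
author_to_papers = {"a0": [0, 2], "a1": [1], "a2": []}
print(average_paper_features(author_to_papers, paper_feats, dim=4)["a0"])
```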

Details

  • If there are |\mathcal{R}| edge types, then the maximum number of possible (non-empty) subsets is K_\text{max} = 2^{|\mathcal{R}|}-1
    • Each relation is either in or out of a subset; subtract 1 for the empty subset.
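
A quick sanity check of that count, enumerating the non-empty subsets of a hypothetical three-relation set:

```python
# 2^|R| - 1 non-empty subsets of the relation set (hypothetical relations).
from itertools import combinations

relations = ["cites", "writes", "affiliated_with"]
subsets = [set(c) for r in range(1, len(relations) + 1)
           for c in combinations(relations, r)]
assert len(subsets) == 2 ** len(relations) - 1   # K_max = 7 here
print(len(subsets))
```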

Experimental Setting

Datasets

Academic graphs:

  • ACM
    • Only “Paper” node-types have bag of words features
  • OGB-MAG
    • Only “Paper” node-types have word2vec features
  • OAG
    • Not 100% sure, but seems it’s the same as others with only “Paper” node-types having text-based features
    • Details in HGT code

Baselines

  • Heterogeneous graph modeling methods:
    • R-GCN
    • HAN
    • HGT
  • SIGN

Evaluation metrics

  • Test micro-F1 / accuracy (these are identical in single-label multi-class classification, where each node has exactly one true label)

Results

NARS ties the baselines on two datasets (ACM, OAG-L1-Field) and beats them on two (OGB-MAG, OAG-Venue).

Insights

  • R-GCN still performs really well
  • Seems that most lift from GNNs is coming from graph-based feature smoothing, not complicated “deep” features

Questions

  • Does using TransE to get relational embeddings for featureless nodes make this transductive?
    • Yes, as TransE learns embeddings for the entities and relationships.
    • Could we do this, but essentially use an average of the entities of same type as features? Or could we modify TransE to train for this explicitly, so that we learn embeddings merely for the type and not the individuals?
    • Idea: could we do normal TransE so that we have embeddings for V_\text{train} \subseteq V and all relation types \mathcal{R}, and then simply compute features for a new node v \notin V_\text{train} (see the sketch after these questions) as: $$\mathbf{h}_v = \frac{1}{|\mathcal N(v) \cap V_\text{train}|} \sum_{r \in \mathcal R} \sum_{u \in \mathcal N^r(v) \cap V_\text{train}} \left ( \mathbf{h}_u + \mathbf{l}_r \right )$$
  • Why does SIGN have D^2 L parameters?
    • I think this assumes the MLP output dimension is also D: the concatenated features have dimension DL, so \Theta \in \mathbb{R}^{DL \times D}, which has D^2 L entries (equivalently, one \Theta_k \in \mathbb{R}^{D \times D} per hop).
  • Is there a reasonably cheap way to do “relation set feature selection” to narrow down the set of important relations?
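
Returning to the TransE idea from the first question: a minimal sketch of that inductive featurization, entirely hypothetical (made-up embeddings and names), averaging \mathbf h_u + \mathbf l_r over the new node's training-set neighbors.

```python
# Hypothetical inductive featurization from pretrained TransE embeddings:
# h_v = sum over training-set neighbors of (h_u + l_r), normalized by the
# number of distinct training neighbors, as in the formula above.
import numpy as np

def inductive_transe_feature(neighbors_by_rel: dict, ent_emb: dict,
                             rel_emb: dict, dim: int) -> np.ndarray:
    total, seen = np.zeros(dim), set()
    for rel, neighbors in neighbors_by_rel.items():
        for u in neighbors:
            if u in ent_emb:                 # only neighbors seen in training
                total += ent_emb[u] + rel_emb[rel]
                seen.add(u)
    return total / max(len(seen), 1)

# Toy usage with made-up embeddings for two training entities and two relations.
ent_emb = {0: np.ones(4), 1: np.full(4, 2.0)}
rel_emb = {"cites": np.zeros(4), "writes": np.full(4, 0.5)}
print(inductive_transe_feature({"cites": [0], "writes": [1]}, ent_emb, rel_emb, dim=4))
```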