Authors:

- Maria Kalantzi
- George Karypis

## tl;dr

How do you handle nodes without features? This paper creates two types of embeddings. The first encode relative node position and partition the graph into components, and then learn embeddings for each partition, which are shared by all the nodes in that partition. This can also be done hierarchically, so that partitions are further partitions, and each node therefore gets multiple embeddings that become increasingly localized. This turns out to not add much value in the benchmark datasets. These partition embeddings seem to out-perform learning unique embeddings for each node despite having many fewer parameters.

The other method is to essentially use the hashing trick, where nodes get put into random buckets, which then have learned embeddings. This has performance that is sometimes better than, sometimes worse than learning full embeddings.

## Summary

Embedding tables grow with the vocabulary size. There are hashing tricks from NLP that allow collisions to make smaller embedding tables. This paper wants to do something similar, but instead of randomly grouping nodes wants to instead exploit homophily and group nearby nodes. This apparently works even better than using full embeddings, suggesting this over-parameterizes our models. They create an embedding based on two different components: 1/ a position-specific component and 2/ a node-specific component.

There are “single-level” and “hierarchical” approaches for the position-specific component. For single level, the position-specific component of node i is \bf{p}_i and is based on graph partitioning into k partitions. There’s an embedding table \bf{P} \in \mathbb{R}^{k \times d} that maps each partition to an embedding, and all nodes within a partition share the same embedding.

The hierarchical approach to the position-specific component is similar, but recursively partitions the partitions. For L levels of partitioning, we start by partitioning as in the single-level case, but then partition each of those k partitions into k partitions for a total of k^2 partitions. For L levels of $k-$partitioning, you will have m=\sum_{j=1}^{L} k^j partitions. Since higher level partitions have many more nodes, the embedding vectors are larger. Somehow, these various partition vectors are summed (perhaps by padding the smaller ones with zeros).

The node-specific component is simply a result of the hashing trick, but where multiple different hash functions are used and the results are concatenated. Further, there are node-specific weights applied to the results of each hash function, i.e., \bf{v}_i = \bf{W} (y_i^1 \bf{u}_i^1 + \dots + y_i^h \bf u_i^h). The total number of parameters is B \times d + n \times h where B are the number of hash buckets, d is the embedding size, and there are n \times h node-level weights for each of the h hash functions.

There are two versions for these node-specific embeddings that either leverage the partition information or not. With “intra-partition shared embeddings”, each embedding table is split into m sub-tables where m is the number of partitions, and all the nodes within a partition use the same embedding sub-table. The result is that hash collisions only occur between nodes that are within the same partition. Conversely, “inter-partition shared embeddings” ignore partitioning and simply have a global embedding table.

## Experimental Setting

Node classification, ignore any features if they exist.

### Datasets

- ogbn-arxiv: homogeneous graph where nodes are papers and edges are citations. Predict 1 of 40 subject areas.
- ogbn-proteins: homogeneous graph where nodes are proteins, edges represent interactions. Edges have features, and they’re ignored here. Predict the binary outcome of 112 different tasks (i.e., whether or not the protein has that property/behavior).
- ogbn-products: homogeneous “products” graph where edges are co-purchases, product feature information is ignored. Predict the product category among the 47 options

### Baselines

- GCN and GAT models
- For baseline methods to get features:
- Learning embeddings for each node
- Various hashing mechanisms

### Evaluation metrics

- Accuracy for classification tasks, average ROC AUC for multi-label proteins dataset.

## Conclusions

Interestingly, Positional Embeddings consistently outperformed full embeddings, despite having far less capacity. When deciding number of partitions k=n^\alpha, using \alpha=0.5 seemed to work well across model types and datasets (i.e., the square root of number of nodes). Using hierarchical method sometimes improved and sometimes hurt–ultimately multi-level didn’t seem to move the needle although the impact depended on the dataset.

Node-specific embeddings were less impactful, and the intra- and inter- variations didn’t seem to have a consistent impact. There were some cases where these improved over full embeddings, and some cases where it was worse.

## Follow-up ideas

**Idea 1**: To get “inductive” partitions, could we simply do a “partition-voting” mechanism within the neighborhood and then simply do a weighted average of the partition embeddings, where the weight is equal to the proportion of votes? For example, if a node u had a neighborhood of partitions like \{a, b, b\}, then it’s position embedding would be \bf{p}_u = \frac{1}{3}\bf{p}_a + \frac{2}{3}\bf{p}_b. Question: is this any different than simply averaging the \{\bf{p}_v | \forall v \in \mathcal N(u)\}?

**Idea 2**: Would using a hyperbolic geometry for these hierarchical embeddings make more efficient use of the embedding dimension?