## tl;dr

Nodes are one-hot encoded based on 1) their distance from the target user/item pair and 2) their node type (user or item). This relies only on local information contained within the subgraph, not on any global graph information. As the node encoder, they use an L-layer R-GCN and concatenate the embeddings from each layer to form the final node representation. To encode the full subgraph, they concatenate the embeddings of the target user and item node pair: $\mathbf{g} = \text{concat}(\mathbf{h}_u, \mathbf{h}_v)$. This subgraph-level embedding is passed through an MLP to produce a rating prediction, and the parameters are trained to minimize the MSE of the rating.
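A minimal sketch of the node labeling scheme described above. The exact layout is an assumption: here a node's one-hot index interleaves hop distance and node type, so users and items at each hop get distinct slots.

```python
def node_feature(hop, node_type, max_hop):
    """One-hot feature from (hop distance, node type).

    hop: distance from the target pair (0 for the target nodes themselves).
    node_type: 0 for user nodes, 1 for item nodes.
    max_hop: radius of the enclosing subgraph.
    """
    dim = 2 * (max_hop + 1)       # one slot per (hop, type) combination
    index = 2 * hop + node_type   # interleave user/item slots per hop
    feature = [0.0] * dim
    feature[index] = 1.0
    return feature

# Target user and target item of a 1-hop subgraph:
print(node_feature(0, 0, 1))  # [1.0, 0.0, 0.0, 0.0]
print(node_feature(0, 1, 1))  # [0.0, 1.0, 0.0, 0.0]
```

Because the labels depend only on position within the extracted subgraph, the same encoding transfers to unseen users and items, which is what makes the approach inductive.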

They argue that modeling the subgraph, rather than individual nodes as typical GNNs do, addresses a real problem. Specifically, aggregating the L-hop neighborhood of each node independently and then comparing the user and item embeddings with a decoder means the two embeddings are computed in isolation. Consequently, there is no way to tell whether the two nodes come from highly overlapping subgraphs or from completely disconnected ones.

They also argue that restricting the subgraph size independently of the number of message-passing layers allows them to scale to relatively deep GNNs without suffering from the information bottleneck problem.

### Experiments

In practice, they use 1-hop enclosing subgraphs; 2-hop subgraphs performed better but took much longer.

A new regularization method is introduced to account for the fact that different edge types represent different ratings (e.g., rating=1, ..., rating=5), each of which gets a dedicated projection matrix in the R-GCN layers. The method adds a loss term equal to the sum, over pairs of adjacent ratings (e.g., rating=1 and rating=2), of the Frobenius norms (root of the sum of squares) of the differences between their parameter matrices. This encourages the parameter matrices of similar ratings to stay similar. In the ablation studies, this regularizer had a small or non-existent impact.
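The regularizer above can be sketched in a few lines. This is a plain-Python illustration of the stated formula (Frobenius norm of differences between adjacent ratings' matrices), not the paper's code; the function name is an assumption.

```python
import math

def adjacent_rating_regularizer(weights):
    """Sum over adjacent rating pairs of ||W_{r+1} - W_r||_F.

    weights: list of projection matrices (as lists of lists),
    ordered by rating value, e.g. [W_1, W_2, ..., W_5].
    """
    total = 0.0
    for w_prev, w_next in zip(weights, weights[1:]):
        # Squared Frobenius norm of the element-wise difference.
        sq = sum((a - b) ** 2
                 for row_p, row_n in zip(w_prev, w_next)
                 for a, b in zip(row_p, row_n))
        total += math.sqrt(sq)
    return total

# Identical matrices incur no penalty; diverging ones are penalized.
w1 = [[0.0, 0.0]]
w2 = [[3.0, 4.0]]
print(adjacent_rating_regularizer([w1, w1]))  # 0.0
print(adjacent_rating_regularizer([w1, w2]))  # 5.0
```

Note that only adjacent ratings are compared, so rating=1 and rating=5 are never directly pulled together; the similarity constraint propagates through the intermediate ratings.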