Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks

05/21/2024

torch_geometric.nn.conv.GraphConv

keywords to search: graph kernels, Weisfeiler-Lehman algorithm, message passing

Introduction

Graph-structured data is ubiquitous across application domains ranging from chemo- and bioinformatics to image and social network analysis. To develop successful machine learning models in these domains, we need techniques that can exploit the rich information inherent in graph structure, as well as the feature information contained within a graph's nodes and edges. For example, one of the most successful kernel approaches, the Weisfeiler-Lehman subtree kernel, which is based on the 1-dimensional Weisfeiler-Leman graph isomorphism heuristic, generates node features through an iterative relabeling, or coloring, scheme. GNNs can be viewed as a neural version of the WL algorithm, where colors are replaced by continuous feature vectors and neural networks are used to aggregate over node neighborhoods. In effect, the GNN framework can be viewed as implementing a continuous form of graph-based "message passing", where local neighborhood information is aggregated and passed on to the neighbors.

Since standard GNNs are no more powerful than the 1-WL algorithm at distinguishing non-isomorphic graphs, the paper proposes k-GNNs. Based on the k-WL algorithm, a k-GNN passes messages between subgraph structures rather than between individual nodes. (The mental image: just as a CNN builds feature maps by striding a filter over an image, a k-GNN aggregates over k-element subsets of the graph's nodes.)

Preliminaries

Notation and Background

A graph \(G\) is a pair \((V, E)\) with a finite set of nodes \(V\) and a set of edges \( E \subseteq \{ \{u, v\} \subseteq V \mid u \neq v \} \). We denote the set of nodes and the set of edges of \(G\) by \(V(G)\) and \(E(G)\), respectively. \(N(v)\) denotes the neighborhood of \(v\) in \(G\), i.e., \(N(v) = \{ u \in V(G) \mid \{u, v\} \in E(G) \} \).

We say that two graphs \(G\) and \(H\) are isomorphic if there exists an edge-preserving bijection \(\phi : V(G) \to V(H) \), i.e., \( \{u, v\} \in E(G) \) if and only if \( \{\phi(u), \phi(v)\} \in E(H) \). Let \(S \subseteq V(G) \); then \(G[S] = (S, E_S) \) is the subgraph induced by \(S\) with \(E_S = \{ \{u, v\} \in E(G) \mid u, v \in S \} \).

A node coloring is a function \(V(G) \to \Sigma \) with arbitrary codomain \(\Sigma\). A node-colored or labeled graph \((G, l) \) is a graph \(G\) endowed with a node coloring \(l: V(G) \to \Sigma \). We say that \(l(v)\) is a label or color of \(v \in V(G) \).

Graph Neural Networks

Let \((G, l)\) be a labeled graph with an initial node coloring \(f^{(0)} : V(G) \to \mathbb{R}^{1 \times d} \) that is consistent with \(l\). This means that \(f^{(0)}(u) = f^{(0)}(v) \) if and only if \(l(u) = l(v) \); beyond that, \(f^{(0)}(v)\) can be an arbitrary real-valued feature vector. (Since there is one \(1 \times d\) feature vector for each of the \(n\) nodes, stacking them gives an \(n \times d\) feature matrix.)

A GNN model consists of a stack of neural network layers, where each layer aggregates local neighborhood information, i.e., features of neighbors, around each node and then passes this aggregated information on to the next layer. A basic GNN model can be implemented as follows. In each layer \(t > 0\), we compute a new feature

\[ f^{(t)}(v) = \sigma\left( f^{(t-1)}(v) \cdot W^{(t)}_1 + \sum_{w \in N(v)} f^{(t-1)}(w) \cdot W^{(t)}_2 \right) \]

in \(\mathbb{R}^{1 \times e} \) for \(v\), where \(W^{(t)}_1 \) and \(W^{(t)}_2 \) are parameter matrices from \(\mathbb{R}^{d \times e} \), and \(\sigma\) denotes a non-linear function, e.g., a \(\text{sigmoid}\) or a \(\text{ReLU}\). (Each layer maps the \(n \times d\) feature matrix to an \(n \times e\) one; this is the node embedding.)
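This update is what the torch_geometric.nn.conv.GraphConv layer referenced above computes when sum aggregation is used; the non-linearity \(\sigma\) is applied outside the layer. A minimal sketch, where the toy graph, the feature sizes, and the choice of ReLU are illustrative:

```python
import torch
from torch_geometric.nn import GraphConv

# Toy graph: 4 nodes on a path; undirected edges are stored in both directions.
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])
x = torch.randn(4, 8)  # initial features f^(0): n = 4 nodes, d = 8

# With aggr='add', GraphConv computes W1 f(v) + W2 * sum_{w in N(v)} f(w).
conv = GraphConv(in_channels=8, out_channels=16, aggr='add')
out = torch.relu(conv(x, edge_index))  # new features f^(1), shape n x e = 4 x 16
```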
The functions in the update above can be replaced by arbitrary differentiable functions, as long as the aggregation over the neighbors \(N(v)\) is permutation-invariant. In full generality, a new feature \(f^{(t)}(v) \) is computed as

\[ f^{(t)}(v) = f^{W_1}_{merge} \left( f^{(t-1)}(v), \, f^{W_2}_{aggr}\left(\{\{ f^{(t-1)}(w) \mid w \in N(v) \}\}\right) \right). \]

A vector representation \(f_{GNN}\) of the whole graph can be computed by summing over the vector representations computed for all nodes, i.e.,

\[ f_{GNN}(G) = \sum_{v \in V(G)} f^{(T)}(v), \]

where \(T > 0\) denotes the last layer.

k-dimensional Graph Neural Networks

For a given \(k\), we consider the set \([V(G)]^k \) of all \(k\)-element subsets over \(V(G)\). Let \(s = \{s_1, \dots, s_k\} \) be a \(k\)-set in \([V(G)]^k\); then we define the neighborhood of \(s\) as

\[ N(s) = \{ t \in [V(G)]^k \mid |s \cap t| = k - 1 \}. \]

(The neighborhood \(N(s)\) of the set \(s\) consists of all size-\(k\) subsets \(t\) of \(V(G)\) that share exactly \(k - 1\) nodes with \(s\).)

Let's say \(V(G) = \{a,b,c,d,e\}\) and \(k = 3 \). Then the \(k\)-element subsets are

\[ [V(G)]^k = \{ \{a, b, c\}, \{a, b, d\}, \{a, b, e\}, \{a, c, d\}, \{a, c, e\}, \{a, d, e\}, \{b, c, d\}, \{b, c, e\}, \{b, d, e\}, \{c, d, e\} \}. \]

For \(s = \{a, b, c\}\), \(N(s)\) consists of the subsets that share \(k - 1 = 2\) nodes with \(s\):

\[ N(s) = \{ \{a, b, d\}, \{a, b, e\}, \{a, c, d\}, \{a, c, e\}, \{b, c, d\}, \{b, c, e\} \}. \]

The local neighborhood \(N_{L}(s)\) is defined as

\[ N_{L}(s) = \{ t \in N(s) \mid \{v, w\} \in E(G) \text{ for some } v \in s \setminus t \text{ and } w \in t \setminus s \}, \]

i.e., the two nodes in which \(s\) and \(t\) differ must be connected by an edge. For \( s = \{a, b, c\} \) and \( t = \{a, b, d\} \), we have \(s \setminus t = \{c\}\) and \(t \setminus s = \{d\}\); if the nodes \(c\) and \(d\) are connected by an edge, then \(t\) belongs to \(N_{L}(s) \).

The global neighborhood \(N_{G}(s) \) is then defined as

\[ N_{G}(s) = N(s) \setminus N_{L}(s). \]

Let \((G, l)\) be a labeled graph. In each k-GNN layer \(t \geq 0 \), we compute a feature vector \(f^{(t)}_k(s) \) for each \(k\)-set \(s\) in \([ V(G) ]^k \). For \(t = 0\), \(f^{(0)}_k(s)\) encodes the isomorphism type of the labeled subgraph induced by \(s\). In each layer \(t > 0\), we compute new features by

\[ f^{(t)}_k(s) = \sigma\left( f^{(t-1)}_k(s) \cdot W^{(t)}_1 + \sum_{u \in N_L(s) \cup N_G(s)} f^{(t-1)}_k(u) \cdot W^{(t)}_2 \right). \]

Moreover, one could split the sum into two sums ranging over \(N_L(s) \) and \(N_G(s) \), respectively, using distinct parameter matrices to enable the model to learn the importance of local and global neighborhoods:

\[ f^{(t)}_k(s) = \sigma\left( f^{(t-1)}_k(s) \cdot W^{(t)}_1 + \sum_{u \in N_L(s)} f^{(t-1)}_k(u) \cdot W^{(t)}_{2,L} + \sum_{u \in N_G(s)} f^{(t-1)}_k(u) \cdot W^{(t)}_{2,G} \right). \]

To scale k-GNNs to larger datasets and to prevent overfitting, we propose local k-GNNs, where we omit the global neighborhood of \(s\), i.e.,

\[ f^{(t)}_{k, L}(s) = \sigma\left( f^{(t-1)}_{k, L}(s) \cdot W^{(t)}_1 + \sum_{u \in N_L(s)} f^{(t-1)}_{k, L}(u) \cdot W^{(t)}_2 \right). \]
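To make the neighborhood definitions concrete, here is a small brute-force sketch that splits \(N(s)\) into \(N_L(s)\) and \(N_G(s)\). The function name and the path-graph edge set are made up for illustration, and enumerating all \(\binom{n}{k}\) subsets is only feasible for small graphs:

```python
from itertools import combinations

def k_set_neighborhoods(nodes, edges, s, k):
    """Split N(s) into the local and global neighborhoods of the k-set s.

    nodes: iterable of node ids; edges: set of frozenset({u, v}) edges.
    A k-set t is in N(s) if |s & t| = k - 1; it is *local* if the two
    differing nodes (v in s - t, w in t - s) are joined by an edge.
    """
    s = frozenset(s)
    neighbors = [t for t in map(frozenset, combinations(nodes, k))
                 if len(s & t) == k - 1]
    local = [t for t in neighbors
             if any(frozenset({v, w}) in edges
                    for v in s - t for w in t - s)]
    global_ = [t for t in neighbors if t not in local]
    return local, global_

# Worked example from the text, with a hypothetical path graph a-b-c-d-e:
nodes = "abcde"
edges = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]}
local, global_ = k_set_neighborhoods(nodes, edges, {"a", "b", "c"}, k=3)
# local  == [{a, b, d}]  (s \ t = {c}, t \ s = {d}, and c-d is an edge)
# global_ == the remaining five sets of N(s): {a, b, e}, {a, c, d}, ...
```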
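Since the local k-GNN update has the same form as the basic GNN layer, only on the graph whose "nodes" are the k-sets and whose edges are given by \(N_L\), one way to prototype it is to build an edge index over k-sets and reuse GraphConv. A sketch reusing nodes, edges, and k_set_neighborhoods from the block above; k_set_edge_index is a made-up helper name, and the random features stand in for the isomorphism-type encoding of \(f^{(0)}_k\):

```python
import torch
from itertools import combinations
from torch_geometric.nn import GraphConv

def k_set_edge_index(nodes, edges, k):
    """Build the 'k-set graph': one node per k-set, one directed edge t -> s
    for every t in the local neighborhood N_L(s)."""
    sets = [frozenset(t) for t in combinations(nodes, k)]
    index = {t: i for i, t in enumerate(sets)}
    src, dst = [], []
    for s in sets:
        local, _ = k_set_neighborhoods(nodes, edges, s, k)
        for t in local:
            src.append(index[t])  # message flows from the neighbor t ...
            dst.append(index[s])  # ... into the k-set s
    return sets, torch.tensor([src, dst])

sets, edge_index = k_set_edge_index(nodes, edges, k=3)
f0 = torch.randn(len(sets), 8)        # stand-in for the isomorphism-type features
conv = GraphConv(8, 16, aggr='add')   # W1 f(s) + W2 * sum over N_L(s)
f1 = torch.relu(conv(f0, edge_index))
```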