Tutorial on baseline and evaluation procedures
In the following, we give a short overview of how to use the dataset collection together with the graph kernel and GNN baselines and the standardized evaluation methods. First, follow the instructions on github.com/chrsmrrs/tudataset to install the TUDataset Python package.
Throughout this tutorial, we assume that your base directory is tudataset/tud_benchmark/.
Kernel baselines and 10-CV using SVMs
We provide Python-wrapped C++ implementations of the following kernels:
- Weisfeiler-Lehman subtree kernel (1-WL) [1],
- Graphlet kernel (GR) [2],
- Shortest-path kernel (SP) [3],
- Weisfeiler-Lehman optimal assignment kernel (WL-OA) [4].
For the first three kernels, we provide both Gram matrix output (numpy.array of shape [n, n]) and sparse feature vector output (scipy.sparse.csr_matrix of shape [n, d]).
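For kernels with an explicit feature map, the two outputs are directly related: the Gram matrix is the product of the feature matrix with its transpose. A minimal sketch with a toy scipy.sparse matrix (hypothetical feature counts, not an actual kernel computation):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy feature matrix: 3 graphs, 4 features (e.g., subtree pattern counts).
features = csr_matrix(np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [2, 1, 1, 0],
]))

# Corresponding Gram matrix: K[i, j] = <phi(G_i), phi(G_j)>.
gram = features.dot(features.T).toarray()

print(gram.shape)  # (3, 3) -- one entry per pair of graphs
```

This is why the sparse output is preferable for large datasets: the [n, d] feature matrix is typically much cheaper to store than the dense [n, n] Gram matrix.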
Weisfeiler-Lehman subtree kernel
The following code downloads the ENZYMES dataset, computes the 1-WL for 3 iterations using the discrete node labels of the dataset, and outputs the Gram matrix as a numpy.array. Since the dataset does not provide edge labels, we set use_edge_labels = False.
import kernel_baselines as kb
import auxiliarymethods.datasets as dp
use_labels, use_edge_labels = True, False
dataset = "ENZYMES"
# Download dataset.
classes = dp.get_dataset(dataset)
iterations = 3
gram_matrix = kb.compute_wl_1_dense(dataset, iterations, use_labels, use_edge_labels)
Instead of computing the Gram matrix, we can output a scipy.sparse.csr_matrix holding one feature vector per graph by replacing the last line with
feature_vectors = kb.compute_wl_1_sparse(dataset, iterations, use_labels, use_edge_labels)
Graphlet kernel
Similarly, the Gram matrix for the Graphlet kernel can be computed by
gram_matrix = kb.compute_graphlet_dense(dataset, use_labels, use_edge_labels)
The sparse feature vectors can be computed by
feature_vectors = kb.compute_graphlet_sparse(dataset, use_labels, use_edge_labels)
Shortest-path kernel
The Gram matrix for the Shortest-path kernel can be computed by
gram_matrix = kb.compute_shortestpath_dense(dataset, use_labels, use_edge_labels)
The sparse feature vectors can be computed by
feature_vectors = kb.compute_shortestpath_sparse(dataset, use_labels, use_edge_labels)
Weisfeiler-Lehman optimal assignment kernel
The Gram matrix for the WL-OA kernel can be computed by
gram_matrix = kb.compute_wloa_dense(dataset, use_labels, use_edge_labels)
SVM evaluation
Here, we show how to jointly optimize the number of 1-WL iterations and the C parameter of the (dual) SVM using 10-CV. See the paper for details on the evaluation procedure.
The following code computes the 1-WL for 1 to 5 iterations and applies cosine normalization. The number of iterations is then jointly optimized with the C parameter (C = [10**3, 10**2, 10**1, 10**0, 10**-1, 10**-2, 10**-3]) using 10-CV.
The experiment is repeated 10 times, and we output the average accuracy together with the standard deviation over the 10 10-CV runs and over all 100 folds.
import auxiliarymethods.auxiliary_methods as aux
from auxiliarymethods.kernel_evaluation import kernel_svm_evaluation

all_matrices = []
num_reps = 10
# 1-WL kernel, number of iterations in [1:6].
for i in range(1, 6):
    gm = kb.compute_wl_1_dense(dataset, i, use_labels, use_edge_labels)
    # Cosine normalization.
    gm = aux.normalize_gram_matrix(gm)
    all_matrices.append(gm)
accuracy, std_10, std_100 = kernel_svm_evaluation(all_matrices, classes, num_repetitions=num_reps, all_std=True)
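The idea behind this joint selection can be sketched in a simplified, self-contained form: for each outer fold, pick the Gram matrix (i.e., the number of iterations) and the C value that score best on an inner validation split, then evaluate the selected model on the test fold. The sketch below uses random toy data and scikit-learn's SVC with kernel="precomputed"; it illustrates the scheme, it is not the repository's exact procedure:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Toy stand-in data: 40 "graphs" with random features and binary labels.
features = rng.rand(40, 8)
classes = rng.randint(0, 2, size=40)

# Candidate Gram matrices, e.g., one per number of WL iterations.
all_matrices = [features @ features.T, (features @ features.T) ** 2]
C_values = [10.0 ** e for e in range(-3, 4)]

accuracies = []
for train, test in StratifiedKFold(n_splits=10, shuffle=True,
                                   random_state=42).split(features, classes):
    # Inner split of the training fold for hyperparameter selection.
    tr, val = train_test_split(train, test_size=0.1, random_state=42,
                               stratify=classes[train])
    best = (-1.0, 0, C_values[0])
    for k, gram in enumerate(all_matrices):
        for C in C_values:
            clf = SVC(C=C, kernel="precomputed").fit(
                gram[np.ix_(tr, tr)], classes[tr])
            acc = clf.score(gram[np.ix_(val, tr)], classes[val])
            if acc > best[0]:
                best = (acc, k, C)
    # Retrain with the selected Gram matrix and C on the full training fold.
    _, k, C = best
    gram = all_matrices[k]
    clf = SVC(C=C, kernel="precomputed").fit(gram[np.ix_(train, train)],
                                             classes[train])
    accuracies.append(clf.score(gram[np.ix_(test, train)], classes[test]))

print(np.mean(accuracies), np.std(accuracies))
```

Note that with kernel="precomputed", fit expects the [n_train, n_train] block of the Gram matrix and score the [n_test, n_train] block, which is what the np.ix_ indexing extracts.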
Similarly, we can evaluate the sparse feature vectors using a linear SVM.
from auxiliarymethods.kernel_evaluation import linear_svm_evaluation

feature_vectors = []
# 1-WL feature vectors, number of iterations in [1:6].
for i in range(1, 6):
    fv = kb.compute_wl_1_sparse(dataset, i, use_labels, use_edge_labels)
    # \ell_2 normalization.
    fv = aux.normalize_feature_vector(fv)
    feature_vectors.append(fv)
accuracy, std_10, std_100 = linear_svm_evaluation(feature_vectors, classes, num_repetitions=num_reps, all_std=True)
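The ℓ2 normalization applied above rescales each feature vector to unit Euclidean norm. A hypothetical equivalent using scikit-learn (an illustration, not the repository's helper):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Toy sparse feature vectors for 2 graphs.
fv = csr_matrix(np.array([[3.0, 4.0, 0.0],
                          [0.0, 0.0, 2.0]]))

# Rescale each row to unit Euclidean norm; the first row (3, 4, 0) has
# norm 5 and becomes (0.6, 0.8, 0.0).
fv_norm = normalize(fv, norm="l2")
```

Without this step, graphs with many substructure occurrences (e.g., large graphs) would dominate the linear kernel regardless of the labels.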
See main_kernel.py
for more examples.
GNN baselines and 10-CV evaluation
Here, we show how to optimize the hyperparameters of the GIN layer [5] (number of layers in {1,2,3,4,5}, hidden dimension in {32,64,128}) using 10-CV.
We set the maximum number of epochs to 200, the batch size to 64, the starting learning rate to 0.01, and the number of repetitions of the 10-CV to 10.
import auxiliarymethods.datasets as dp
from auxiliarymethods.gnn_evaluation import gnn_evaluation
from gnn_baselines.gnn_architectures import GIN
dataset = "ENZYMES"
# Download dataset.
dp.get_dataset(dataset)
num_reps = 10
accuracy, std_10, std_100 = gnn_evaluation(GIN, dataset, [1, 2, 3, 4, 5], [32, 64, 128], max_num_epochs=200, batch_size=64, start_lr=0.01, num_repetitions=num_reps, all_std=True)
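For intuition, the core GIN update h_v ← MLP((1 + ε) · h_v + Σ_{u ∈ N(v)} h_u) can be sketched in plain NumPy. This is a toy illustration of the aggregation step only; the actual GIN layer in gnn_baselines.gnn_architectures is built on PyTorch Geometric:

```python
import numpy as np

def gin_aggregate(adj, h, eps=0.0):
    """One GIN neighborhood aggregation step (before the MLP):
    (1 + eps) * h_v plus the sum of the neighbors' features."""
    return (1.0 + eps) * h + adj @ h

# Toy graph: a triangle (every node adjacent to the other two).
adj = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]], dtype=float)
h = np.eye(3)  # one-hot node features

out = gin_aggregate(adj, h)
# Each row is the node's own one-hot plus the sum of its neighbors'
# features; for the triangle with identity features this is all ones.
print(out)
```

The sum aggregation (rather than mean or max) is what makes GIN as powerful as the 1-WL test on graphs with discrete labels.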
In case the dataset includes edge labels or (continuous) attributes, you can use the GINE layer.
import auxiliarymethods.datasets as dp
from auxiliarymethods.gnn_evaluation import gnn_evaluation
from gnn_baselines.gnn_architectures import GINE
dataset = "Yeast"
# Download dataset.
dp.get_dataset(dataset)
num_reps = 10
accuracy, std_10, std_100 = gnn_evaluation(GINE, dataset, [1, 2, 3, 4, 5], [32, 64, 128], max_num_epochs=200, batch_size=64, start_lr=0.01, num_repetitions=num_reps, all_std=True)
See main_gnn.py
for more examples. Further compatible GNN layers are available in PyTorch Geometric.
Loading the graphs as NetworkX graphs
import auxiliarymethods.datasets as dp
from auxiliarymethods.reader import tud_to_networkx
dataset = "PROTEINS"
# Download dataset.
dp.get_dataset(dataset)
# Output dataset as a list of graphs.
graph_db = tud_to_networkx(dataset)
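Once loaded, the graphs are ordinary networkx.Graph objects, so the full NetworkX API is available for analysis. A small sketch on toy graphs (standing in for graph_db, since the real call requires the downloaded dataset):

```python
import networkx as nx

# Toy stand-ins for the graphs returned by tud_to_networkx.
graph_db = [nx.path_graph(4), nx.cycle_graph(5), nx.complete_graph(3)]

num_nodes = [g.number_of_nodes() for g in graph_db]
num_edges = [g.number_of_edges() for g in graph_db]

print(sum(num_nodes) / len(graph_db))  # average number of nodes: 4.0
print(sum(num_edges) / len(graph_db))  # average number of edges
```

Such per-dataset statistics (average node and edge counts) are the same quantities reported in the dataset overview tables.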
TODO: Add details on how to access label information.
Bibliography
[1] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539-2561, 2011.
[2] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, K. M. Borgwardt. Efficient graphlet kernels for large graph comparison. AISTATS, 2009.
[3] K. M. Borgwardt, H.-P. Kriegel. Shortest-path kernels on graphs. ICDM, 2005.
[4] N. M. Kriege, P.-L. Giscard, R. C. Wilson. On valid optimal assignment kernels and applications to graph classification. NeurIPS, 2016.
[5] K. Xu, W. Hu, J. Leskovec, S. Jegelka. How powerful are graph neural networks? ICLR, 2019.