Hyper-Parameter Tuning for Graph Kernels via Multiple Kernel Learning
This research focuses on hyper-parameter tuning for graph kernels using Multiple Kernel Learning, emphasizing the importance of kernel methods in learning on structured data like graphs. It explores techniques applicable to various domains and discusses different graph kernels and their sub-structures, highlighting the challenges in selecting the most suitable kernel for predictive performance.
Presentation Transcript
Hyper-parameter tuning for graph kernels via Multiple Kernel Learning Carlo M. Massimo, Nicolò Navarin and Alessandro Sperduti, Department of Mathematics, University of Padova, Italy
Machine learning on graphs Motivations: many real-world problems require learning on data that can be naturally represented in structured form (e.g. XML, parse trees), such as predicting the mutagenicity of chemical compounds. In these domains, kernel methods are an interesting approach because they do not require an explicit transformation of the data into vectorial form. Note: we focus on graphs, but the proposed techniques can be applied to virtually every domain.
Kernel methods Learning algorithms such as SVM, SVR and K-means access the examples only via dot products. A kernel function is, informally, a similarity measure between two examples; formally, it is a symmetric positive semidefinite function which, by Mercer's theorem, can be written as k(x, x') = ⟨φ(x), φ(x')⟩: φ projects the entities into a feature space and the kernel computes the dot product between these projections. This mapping from the input space to the feature space may be done implicitly or explicitly.
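A minimal sketch (not from the slides) of the implicit/explicit distinction: the same similarity is computed once through an explicit feature map φ and once through the kernel function alone, using a degree-2 polynomial kernel on R^2 as a stand-in example.

```python
# Minimal sketch: the same similarity via an explicit feature map and via an
# implicit kernel function (homogeneous degree-2 polynomial kernel on R^2).
import numpy as np

def phi(x):
    # Explicit feature map: phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k_implicit(x, z):
    # Implicit computation: k(x, z) = <x, z>^2, never materialising phi
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
explicit = np.dot(phi(x), phi(z))      # <phi(x), phi(z)>
implicit = k_implicit(x, z)
assert np.isclose(explicit, implicit)  # both equal 16.0
```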
Graph kernels Following Haussler's convolution framework, graph kernels count matching subgraphs in the two input graphs. In the explicit feature space representation, each graph is mapped to a vector counting the occurrences of its sub-structures, and the kernel is the dot product between the two feature vectors. [Figure: two labelled example graphs with their explicit feature vectors of subgraph counts.]
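To make the "count matching sub-structures, then take a dot product" idea concrete, here is a toy illustration; the "features" are just node labels and edge label pairs, not the sub-structures of any kernel discussed in the paper, and the graph encoding (a dict mapping node to (label, neighbours)) is an arbitrary choice for the example.

```python
# Toy illustration of an explicit-feature-space graph kernel.
from collections import Counter

def feature_vector(graph):
    """graph: dict node -> (label, list of neighbour nodes).
    Features are node labels and sorted label pairs of edges."""
    counts = Counter()
    for node, (label, neighbours) in graph.items():
        counts[('node', label)] += 1
        for n in neighbours:
            pair = tuple(sorted((label, graph[n][0])))
            counts[('edge', pair)] += 1  # each edge is counted from both ends
    return counts

def kernel(g1, g2):
    f1, f2 = feature_vector(g1), feature_vector(g2)
    return sum(f1[f] * f2[f] for f in f1)  # dot product of the count vectors

# Two tiny labelled graphs: A-B-A and A-B
g1 = {0: ('A', [1]), 1: ('B', [0, 2]), 2: ('A', [1])}
g2 = {0: ('A', [1]), 1: ('B', [0])}
print(kernel(g1, g2))
```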
The kernel (Gram) matrix Given a set of examples S = {x_1, x_2, ..., x_m} and a kernel function k_θ depending on some parameters θ, the kernel matrix is the m × m matrix defined entrywise as [K_S]_{ij} = k_θ(x_i, x_j), for i, j = 1, ..., m.
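A small sketch of how such a matrix is built in practice; `gram_matrix` and the toy RBF kernel below are illustrative placeholders, not code from the paper.

```python
# Illustrative only: building K_S for a list of examples and a kernel k_theta.
import numpy as np

def gram_matrix(examples, kernel, **theta):
    """Returns the matrix with entries [K_S]_ij = k_theta(x_i, x_j)."""
    m = len(examples)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(i, m):  # exploit symmetry
            K[i, j] = K[j, i] = kernel(examples[i], examples[j], **theta)
    return K

# Toy RBF kernel on scalars; gamma plays the role of the parameter theta
rbf = lambda x, y, gamma: np.exp(-gamma * (x - y) ** 2)
print(gram_matrix([0.0, 1.0, 2.0], rbf, gamma=0.5))
```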
Overview of graph kernels Different graph kernels consider different sub-structures (e.g. random walks, shortest paths): Weisfeiler-Lehman (WL), based on subtree patterns, 1 parameter [Shervashidze and Borgwardt, 2011]; WLC, subtree patterns with contextual information (i.e. the structure of the graph surrounding the feature), derived from [Navarin et al., ICONIP 2015]; ODD_ST, based on sub-DAGs, 2 parameters [Da San Martino et al., SDM 2012]; TCK_ST, sub-DAGs with contextual information [Navarin et al., ICONIP 2015]. Problem: different kernels induce different similarity measures, hence different biases and different predictive performances, and there is no way to know a priori which kernel will perform better.
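As a rough illustration of what "subtree patterns with h iterations" means for WL, here is a heavily simplified sketch (uncompressed labels, plain dictionaries); it is not the implementation of Shervashidze and Borgwardt (2011) nor the one used in the paper.

```python
# Heavily simplified WL-subtree sketch: h relabelling iterations, explicit counts.
from collections import Counter

def wl_features(adjacency, labels, h):
    """adjacency: dict node -> list of neighbours; labels: dict node -> label."""
    counts = Counter(labels.values())      # iteration 0: original node labels
    current = dict(labels)
    for _ in range(h):
        new = {}
        for v, neigh in adjacency.items():
            # new label = own label plus the sorted multiset of neighbour labels
            new[v] = (current[v], tuple(sorted(current[u] for u in neigh)))
        current = new
        counts.update(current.values())    # accumulate subtree-pattern counts
    return counts

def wl_kernel(adj1, lab1, adj2, lab2, h=2):
    f1, f2 = wl_features(adj1, lab1, h), wl_features(adj2, lab2, h)
    return sum(f1[f] * f2[f] for f in f1)  # dot product of the count vectors

# Path graph A - B - A compared with itself
adj = {0: [1], 1: [0, 2], 2: [1]}
lab = {0: 'A', 1: 'B', 2: 'A'}
print(wl_kernel(adj, lab, adj, lab, h=2))
```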
The parameter selection problem Given a task, we have to choose among several (graph) kernels, each one depending on different parameters, as well as the learning algorithm parameter(s) (e.g. the C of an SVM). How can we select the best kernel/parameters? Grid-search approach: fix a priori a set of values for each parameter and select the best-performing combination; this may be costly.
Grid-search approach The error estimation procedure (e.g. k-fold CV) wraps a hyper-parameter selection procedure (e.g. nested k-fold CV). One Gram matrix is needed for each point in the kernel parameter grid, and for each classifier parameter Grid_size error estimation procedures must be run; with a nested k-fold this means k * Grid_size trainings.
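A back-of-envelope sketch of that cost, with made-up grid sizes used purely for illustration:

```python
# Made-up numbers, for illustration only: cost of a nested grid search.
kernel_grid_size = 100   # e.g. ~100 Gram matrices for a 2-parameter graph kernel
c_grid_size = 5          # values tried for the SVM parameter C
k = 10                   # folds of the inner (nested) cross-validation

gram_matrices_needed = kernel_grid_size
trainings_needed = c_grid_size * k * kernel_grid_size   # k * Grid_size per C value
print(gram_matrices_needed, trainings_needed)           # 100 matrices, 5000 trainings
```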
The parameter selection problem (continued) Grid search fixes a priori a set of values for each parameter and selects the best-performing combination, which may be costly. A possible alternative is randomized search in the parameter grid, but it still depends on the number of considered trials and is not optimal. Is there a different alternative, possibly faster and able to lead to better results? Our proposal: MKL.
Multiple Kernel Learning [Aiolli and Donini, Neurocomputing 2015] Goal: learn a combination of weak kernels that hopefully performs better than each single one: K = η_1 K_1 + ... + η_R K_R, with η_r ≥ 0 for r = 1, ..., R. EasyMKL computes the kernel combination and solves an SVM-like learning problem at the same cost of a single SVM (linear in the number of kernels). Warning: we are not selecting kernels but combining them; as an edge case a single kernel may end up being selected (all the weight on one kernel), but in general a few important kernels will carry most of the weight. Another perspective: kernel hyper-parameter optimization can be avoided by combining all the kernels in the grid together. Computational benefit: a single learning procedure has to be computed. Predictive performance: in principle, information coming from different kernels can be combined.
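The following sketch shows the combination step K = η_1 K_1 + ... + η_R K_R with non-negative weights; the weights are chosen with a simple kernel-target-alignment heuristic purely for illustration, not with the EasyMKL optimiser of Aiolli and Donini (2015).

```python
# Illustrative MKL-style combination: K = eta_1*K_1 + ... + eta_R*K_R, eta_r >= 0.
import numpy as np

def combine_kernels(kernel_list, y):
    """kernel_list: list of m x m Gram matrices; y: labels in {-1, +1}.
    Weights come from a simple (un-centred) kernel-target alignment heuristic."""
    yy = np.outer(y, y)                                     # ideal target kernel
    scores = np.array([np.sum(K * yy) / (np.linalg.norm(K) + 1e-12)
                       for K in kernel_list])
    scores = np.clip(scores, 0.0, None)                     # enforce eta_r >= 0
    eta = scores / (scores.sum() + 1e-12)                   # normalise to sum to 1
    K = sum(e * K_r for e, K_r in zip(eta, kernel_list))
    return K, eta

# Tiny demo with two toy Gram matrices
y = np.array([1, 1, -1, -1])
K1 = np.outer(y, y).astype(float)   # perfectly aligned with the labels
K2 = np.ones((4, 4))                # uninformative
K, eta = combine_kernels([K1, K2], y)
print(eta)                          # most of the weight goes to K1
```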
Proposed approach The error estimation procedure (e.g. k-fold CV) still wraps a hyper-parameter selection procedure (e.g. nested k-fold CV), but now, for each classifier parameter, a single error estimation (MKL) procedure is needed, i.e. k trainings for a nested k-fold, instead of the Grid_size error estimation procedures (k * Grid_size trainings) required by grid search.
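A hedged usage sketch of this pipeline: all Gram matrices of the kernel grid are merged into one (here by a plain uniform average standing in for the learned MKL combination), and only the classifier parameter C remains to be searched. The synthetic random PSD matrices and labels exist only to make the snippet self-contained.

```python
# Self-contained sketch (synthetic data): the proposed pipeline in miniature.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
m = 60
y = np.array([1, -1] * (m // 2))                  # toy labels

def random_psd(m):
    # Stand-in for one Gram matrix of the kernel-parameter grid
    A = rng.normal(size=(m, m))
    return A @ A.T

kernel_grid = [random_psd(m) for _ in range(10)]  # "all the kernels in the grid"

# One combination reused for every classifier parameter; a uniform average is
# used here as a placeholder for the weights an MKL solver would learn.
K = sum(kernel_grid) / len(kernel_grid)

for C in [0.1, 1.0, 10.0]:                        # only the SVM parameter is searched
    clf = SVC(C=C, kernel='precomputed').fit(K, y)
    print(C, clf.score(K, y))                     # training accuracy, as a smoke test
```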
Experimental setting Every example is a graph: the datasets contain chemical compounds and protein secondary structures.
Experimental results (1) Comparable performances! Legend: hs = grid search, AUROC with an SVM classifier; c = kernel combination, AUROC with EasyMKL. Kernels: ODD_ST (2 parameters, ~100 matrices), TCK_ST (2 parameters, ~100 matrices), ODD_ST + TCK_ST (~200 matrices). [Results table not reproduced in this transcript.]
Kernel orthogonalization MKL performs better when the information provided by each kernel is orthogonal with respect to the others. Idea: partition the feature space of a given kernel so that the resulting subsets have non-overlapping features [Aiolli et al., CIDM 2015]; the kernel then decomposes as K(x,y) = K_1(x,y) + K_2(x,y), and the feature map splits correspondingly as φ(x) = (φ_1(x), φ_2(x)). The ODD_ST and WL feature spaces have an inherent hierarchical structure that makes such a partition natural; the WL orthogonalization is a minor contribution of the paper.
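A toy sketch of the partitioning idea: if the explicit feature space is split by the level/depth of each feature into non-overlapping subsets, the per-level dot products sum to the full kernel, i.e. K(x,y) = K_1(x,y) + K_2(x,y) + ... The feature dictionaries below are placeholders, not the ODD_ST or WL feature spaces.

```python
# Toy orthogonalisation sketch: partition features by level, sum per-level kernels.
from collections import Counter

# Toy explicit feature maps: (level, feature_name) -> count
phi_x = Counter({(0, 'A'): 2, (0, 'B'): 1, (1, 'AB'): 3})
phi_y = Counter({(0, 'A'): 1, (0, 'B'): 2, (1, 'AB'): 1, (1, 'BB'): 4})

def dot(f, g):
    return sum(f[k] * g[k] for k in f)

levels = {lvl for (lvl, _) in list(phi_x) + list(phi_y)}
per_level = {lvl: dot(Counter({k: v for k, v in phi_x.items() if k[0] == lvl}),
                      Counter({k: v for k, v in phi_y.items() if k[0] == lvl}))
             for lvl in levels}

assert sum(per_level.values()) == dot(phi_x, phi_y)   # K = K_1 + K_2
print(per_level)                                       # {0: 4, 1: 3}
```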
Experimental results (2) Improved performances! (Only one value, 90.38 ± 0.10, is preserved from the results table in this transcript.) For this dataset, the grid was reduced for computational constraints. Legend: hs = grid search, AUROC with an SVM classifier; c = kernel combination, AUROC with EasyMKL; oc = orthogonalized combination. Kernels: ODD_ST (2 parameters, ~100 matrices), TCK_ST (2 parameters, ~100 matrices), ODD_ST + TCK_ST (~200 matrices).
Experimental results (3) Kernels: WL (1 parameter, 10 matrices), WLC (1 parameter, 10 matrices), WL + WLC (20 matrices). c: slightly improved performances! oc: no significant improvement in this case. Legend: hs = grid search, AUROC with an SVM classifier; c = kernel combination, AUROC with EasyMKL; oc = orthogonalized combination.
Experimental results (4) Combining different kernel functions gives consistent results: the method can be applied also in this case. Legend: hs = grid search, AUROC with an SVM classifier; c = kernel combination, AUROC with EasyMKL; oc = orthogonalized combination.
Runtimes The runtime of the proposed method depends on the size of the Gram matrices and on the number of matrices to consider: compare the TCK_ST kernel (~100 matrices) with the WLC kernel (10 matrices). Legend: hs = grid search with an SVM classifier; c = kernel combination with EasyMKL; oc = orthogonalized combination.
Conclusions We presented an alternative to kernel hyper-parameter selection that combines kernels according to their effectiveness. It is faster than grid search when the number of kernels is large, since its complexity grows linearly in the number of kernels. Two possible approaches were considered: combining the different kernel matrices as-is, or dividing the Reproducing Kernel Hilbert Space in a principled way. The method achieves results comparable to grid search. Future work: application to other (similar) domains, e.g. kernels on graph nodes, commonly used in bioinformatics applications; speedup of the MKL computation; adopting SVM instead of KOMD as the classifier; selection of the kernel matrices to use (instead of combination), which may be beneficial according to the weight analysis in the paper.
Weights analysis [Plots of the learned kernel weights: ODD_ST on the AIDS dataset; WLC on the AIDS dataset. Not reproduced in this transcript.]