Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
In: NIPS 2008 (2009).
For supervised and unsupervised learning, positive definite kernels make it possible to use
large and potentially infinite-dimensional feature spaces at a computational cost
that depends only on the number of observations. This is usually done through
the penalization of predictor functions by Euclidean or Hilbertian norms. In this
paper, we explore penalizing by sparsity-inducing norms such as the ℓ1-norm or
the block ℓ1-norm. We assume that the kernel decomposes into a large sum of
individual basis kernels which can be embedded in a directed acyclic graph; we
show that it is then possible to perform kernel selection through a hierarchical
multiple kernel learning framework, in polynomial time in the number of selected
kernels. This framework naturally applies to nonlinear variable selection; our
extensive simulations on synthetic datasets and datasets from the UCI repository
show that efficiently exploring the large feature space through sparsity-inducing
norms leads to state-of-the-art predictive performance.
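To make the penalty concrete, the following minimal sketch computes a block ℓ1-norm (a sum of Euclidean norms over groups of coefficients), the quantity whose penalization drives whole groups, and hence whole basis kernels, to zero. The grouping, weights, and names below are illustrative assumptions, not the paper's implementation.

import numpy as np

# Hypothetical setup: a predictor's coefficients are split into blocks,
# one block per basis kernel. The block l1-norm sums the Euclidean (l2)
# norms of the blocks; penalizing it sets entire blocks exactly to zero,
# which deselects the corresponding basis kernels.

def block_l1_norm(coef_blocks, weights=None):
    """Weighted sum of l2-norms over blocks: sum_v d_v * ||w_v||_2."""
    if weights is None:
        weights = np.ones(len(coef_blocks))
    return sum(d * np.linalg.norm(w) for d, w in zip(weights, coef_blocks))

# Illustrative example: three basis kernels, the second one inactive.
blocks = [np.array([0.5, -0.2]), np.zeros(3), np.array([1.0])]
print(block_l1_norm(blocks))  # only the two nonzero blocks contribute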