A tree-based distance between distributions: application to classification of neurons
The usual strategy for computing a distance between two distributions is to model the distributions in feature space and then compute the distance between the models. We propose here to model the distributions of points with unsupervised trees. Our main contribution is a tree-based approximation of the Kullback-Leibler (KL) divergence for very large feature spaces, from which we derive a symmetric distance. To compute the tree-based KL divergence, we first build a balanced tree for each set of samples. Then, for any pair of sample sets, we compute the KL divergence between the empirical distribution at the leaves of the set used to build the tree and the empirical distribution at the leaves of the other set. We show experimentally on synthetic data that this quantity is consistent with the exact KL divergence, and demonstrate its efficiency for both unsupervised and supervised classification on multiple standard real-world datasets. Our main application is the characterization of abnormal neuron development.
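The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the balanced tree is grown, how empty leaves are handled, or how the divergence is symmetrized. The sketch therefore assumes median splits on a randomly chosen coordinate, additive smoothing of the leaf counts (the hypothetical parameter `alpha`), and a symmetric distance obtained by summing the two directed divergences. The function names `build_tree`, `leaf_distribution`, `tree_kl`, and `tree_distance` are illustrative only.

```python
import numpy as np


def build_tree(x, depth, rng):
    """Grow a balanced tree on samples x (shape [n, dim]) by median splits.

    Assumption: each node splits a randomly chosen coordinate at its median.
    Returns (root, n_leaves); internal nodes are tuples, leaves are int ids.
    """
    n_leaves = 0

    def grow(pts, levels):
        nonlocal n_leaves
        if levels == 0 or len(pts) < 2:
            n_leaves += 1
            return n_leaves - 1  # id of the new leaf
        d = int(rng.integers(pts.shape[1]))
        t = float(np.median(pts[:, d]))
        below = pts[:, d] <= t
        return (d, t, grow(pts[below], levels - 1), grow(pts[~below], levels - 1))

    return grow(x, depth), n_leaves


def leaf_distribution(root, n_leaves, x, alpha=1.0):
    """Empirical distribution of x over the leaves, with additive smoothing.

    Assumption: alpha > 0 keeps every leaf probability positive so the
    logarithm in the KL divergence stays finite.
    """
    counts = np.zeros(n_leaves)
    for point in x:
        node = root
        while isinstance(node, tuple):  # descend until a leaf id is reached
            d, t, lo, hi = node
            node = lo if point[d] <= t else hi
        counts[node] += 1
    return (counts + alpha) / (counts.sum() + alpha * n_leaves)


def tree_kl(a, b, depth, rng, alpha=1.0):
    """Directed divergence: the tree is built on a, then KL(p_a || p_b) at its leaves."""
    root, n_leaves = build_tree(a, depth, rng)
    p = leaf_distribution(root, n_leaves, a, alpha)
    q = leaf_distribution(root, n_leaves, b, alpha)
    return float(np.sum(p * np.log(p / q)))


def tree_distance(a, b, depth, rng, alpha=1.0):
    """Symmetric distance, assumed here to be the sum of both directed divergences."""
    return tree_kl(a, b, depth, rng, alpha) + tree_kl(b, a, depth, rng, alpha)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, size=(500, 5))
    b = rng.normal(3.0, 1.0, size=(500, 5))
    print(tree_distance(a, b, depth=5, rng=rng))
```

Because the tree is balanced on the set that built it, that set's leaf distribution is close to uniform, so the directed divergence mainly measures how unevenly the other set spreads over the same leaves. The depth controls the resolution: deeper trees give finer cells but fewer samples per leaf, which makes the smoothing assumption matter more.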