Binary Similarity Measures and their Applications in Machine Learning
PhD thesis, London School of Economics.
Measures which quantify the similarity between two vectors have been of interest in the machine learning community, where they are used in both supervised and unsupervised learning algorithms, and much attention has been paid to their general theory. In this thesis we consider, by contrast, measuring the similarity between a binary vector and a set of binary vectors. We explore a number of different such measures and investigate how these may be used in binary classification tasks.
We investigate mathematical properties of particular binary similarity measures, and the relationships among them. The measures studied build on a particular similarity measure initially investigated by Anthony and Hammer, whose paper gives characterisations of the similarity measure in terms of logical formulae in disjunctive normal form (DNF). We examine this relationship further in the context of binary classification tasks. We show that, under some assumptions on the parameters of a DNF representing the true classifications, high similarity of an example to the training set can ensure correct classification.
Work by Subasi et al. and Morrow has investigated the use of binary similarity measures for classification confidence. We use a different methodology for classifying and obtaining classification confidences using similarity measures, and report on experiments performed using these methods. We find that some of the similarity measures perform relatively well compared with standard classification algorithms, while others perform less well. We show how the parameters of a particular binary similarity measure can be optimised to improve its performance. We also introduce a new DNF learning algorithm intended to improve on the well-known ID3 algorithm, but find that it does not do so.