Exploring the link between bolstered classification error and dataset complexity for gene expression based cancer classification
Oleg Okun, Giorgio Valentini and H. Priisalu
New Signal Processing Research
Nova Science Publishers
Gene expression profiles have been shown to be useful in genomic signal processing for discriminating between cancer and normal (healthy) examples and/or between different types of cancer. K-nearest neighbors (k-NN) is one of the classification algorithms that has demonstrated good performance for gene expression based cancer classification.
Given that the distance metric is fixed, the conventional k-NN has a single parameter to set (k, the number of nearest neighbors for each example), which makes k-NN a very attractive choice, in addition to the fact that it requires no training.
The classification performance of any classifier, including k-NN, is typically characterized by the classification error achieved on independent examples, which are often unavailable for the task at hand. Unbiased, low-variance error estimation is therefore of ultimate importance. We found that the bolstered error satisfies these requirements, and it was therefore chosen for our study. Bolstered error estimation is based on random sampling in the neighborhood of each example (with an example-dependent neighborhood radius) and counting the errors made on the artificially generated data. Because of the random sampling, all examples can be used to assess the error, unlike in cross-validation or bootstrap procedures.
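The estimation procedure described above can be sketched as follows. This is a minimal illustration, not the chapter's exact implementation: it assumes a spherical Gaussian bolstering kernel with a fixed radius `sigma` shared by all examples (the chapter uses example-dependent radii), a plain Euclidean k-NN, and an arbitrary Monte Carlo sample count `n_mc`:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Plain k-NN with Euclidean distance and majority vote."""
    preds = []
    for q in X_query:
        d = np.linalg.norm(X_train - q, axis=1)
        nn = y_train[np.argsort(d)[:k]]
        preds.append(np.bincount(nn).argmax())  # majority vote
    return np.array(preds)

def bolstered_resub_error(X, y, k=3, sigma=0.5, n_mc=20, rng=None):
    """Bolstered resubstitution error: sample Monte Carlo points around
    every training example and count misclassifications.  Every example
    contributes to the estimate, unlike in held-out schemes such as
    cross-validation or the bootstrap."""
    rng = np.random.default_rng(rng)
    errors = 0
    for xi, yi in zip(X, y):
        # spherical Gaussian bolstering kernel centered on the example
        samples = rng.normal(loc=xi, scale=sigma, size=(n_mc, X.shape[1]))
        preds = knn_predict(X, y, samples, k=k)
        errors += np.sum(preds != yi)
    return errors / (len(X) * n_mc)
```

Because the classifier is evaluated on perturbed copies of the training points rather than on the points themselves, the estimate is less optimistically biased than plain resubstitution while retaining its low variance.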
In this work, we investigate the link between the k-NN bolstered error and dataset complexity, which characterizes how difficult a given dataset is to classify. Our measure of dataset complexity is the normalized Wilcoxon rank sum statistic. Through extensive simulation, coupled with the copula method for analyzing association in bivariate data, we show that dataset complexity and bolstered error are related through several types of dependence, such as positive quadrant dependence, tail monotonicity, and stochastic monotonicity.
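As an illustration, the normalized Wilcoxon rank sum statistic for a single gene can be computed directly from ranks. The aggregation of per-gene statistics into one dataset-level score below (mean distance from perfect separation, rescaled to [0, 1]) is an assumption for the sketch, not necessarily the chapter's exact definition, and ties are ignored:

```python
import numpy as np

def normalized_wilcoxon(x, y):
    """Normalized Wilcoxon rank-sum statistic for one feature.
    x : 1-D array of feature values; y : binary class labels (0/1).
    Returns a value in [0, 1]: values near 0.5 mean heavy class
    overlap (a hard feature); values near 0 or 1 mean good separation."""
    ranks = np.argsort(np.argsort(x)) + 1.0  # 1-based ranks (no tie handling)
    n1 = np.sum(y == 1)
    n0 = np.sum(y == 0)
    w = ranks[y == 1].sum()                  # rank sum of class-1 examples
    u = w - n1 * (n1 + 1) / 2.0              # Mann-Whitney U statistic
    return u / (n1 * n0)                     # normalize to [0, 1]

def dataset_complexity(X, y):
    """Aggregate per-gene statistics into a single complexity score
    (an assumed aggregation: 0 = every gene separates the classes
    perfectly, 1 = pure class overlap on every gene)."""
    scores = np.array([normalized_wilcoxon(X[:, j], y)
                       for j in range(X.shape[1])])
    return np.mean(0.5 - np.abs(scores - 0.5)) * 2.0
```

The rank-based statistic is distribution-free, which makes it a robust complexity measure for noisy, small-sample gene expression data.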
As a result, we propose a new scheme for generating ensembles of k-NN classifiers that selects low-complexity feature subsets for the k-NNs in the ensemble; according to the dependence relation found, this amounts to choosing accurate k-NNs. The candidate subsets are randomly sampled from the full set of original features so that the predictions of the individual k-NNs are diverse.
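A minimal sketch of such a complexity-guided ensemble follows. The parameter values (candidate count, subset size, ensemble size), the per-subset complexity defined as the mean normalized Wilcoxon statistic over the subset's features, and majority voting as the combiner are all illustrative assumptions:

```python
import numpy as np

def subset_complexity(X, y):
    """Complexity of a feature subset via the normalized Wilcoxon
    rank-sum statistic, averaged over the subset's features
    (0 = perfectly separable, 1 = full class overlap)."""
    n1, n0 = np.sum(y == 1), np.sum(y == 0)
    scores = []
    for j in range(X.shape[1]):
        ranks = np.argsort(np.argsort(X[:, j])) + 1.0
        u = ranks[y == 1].sum() - n1 * (n1 + 1) / 2.0
        s = u / (n1 * n0)
        scores.append(1.0 - 2.0 * abs(s - 0.5))
    return float(np.mean(scores))

def knn_predict(X_tr, y_tr, X_q, k=3):
    """Plain Euclidean k-NN with majority vote."""
    out = []
    for q in X_q:
        d = np.linalg.norm(X_tr - q, axis=1)
        out.append(np.bincount(y_tr[np.argsort(d)[:k]]).argmax())
    return np.array(out)

def complexity_guided_ensemble(X, y, X_test, n_candidates=50, subset_size=5,
                               ensemble_size=7, k=3, rng=None):
    """Sample random feature subsets, keep the lowest-complexity ones,
    and combine the resulting k-NNs by majority vote."""
    rng = np.random.default_rng(rng)
    subsets = [rng.choice(X.shape[1], size=subset_size, replace=False)
               for _ in range(n_candidates)]
    subsets.sort(key=lambda s: subset_complexity(X[:, s], y))
    votes = np.array([knn_predict(X[:, s], y, X_test[:, s], k=k)
                      for s in subsets[:ensemble_size]])
    # majority vote across ensemble members
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

Random subset sampling supplies the diversity among ensemble members, while the complexity-based selection filters out subsets that the dependence relation predicts would yield inaccurate k-NNs.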
Experiments carried out on eight gene expression datasets containing different types of cancer demonstrate that our ensemble generation scheme is superior (in terms of bolstered resubstitution error) to the single best classifier in the ensemble and to the traditional ensemble construction scheme that ignores dataset complexity. It also outperforms the redundancy-based filter specifically designed to remove irrelevant genes.