Boosting and Noisy Data - Outlier Detection and Removal
Learning binary classifiers from examples is one of the basic tasks in machine learning. A very successful algorithm for such tasks is AdaBoost by Freund and Schapire. AdaBoost applies a weak learner repeatedly to re-weighted versions of the training data and predicts with a weighted vote of the resulting hypotheses. The weighting scheme forces the weak learner to concentrate on examples that have not yet been learned, so examples that still carry high weights after several rounds of boosting can be considered hard to learn. We develop methods that use AdaBoost's example weights to detect outliers in a noisy regime; we remove them, learn again, and thereby try to improve prediction. Our approach combines the boosted hypotheses from several runs of AdaBoost, both for classification and for removal. In thorough experiments on artificial data we achieved up to 30% lower prediction error on many tasks than cross-validation-selected AdaBoost hypotheses; on one learning task the error was even halved by our approach. The recognition rate of faulty examples was also very good, and only a few correct examples were removed. We further propose slightly more time-consuming variants that we expect to yield even better results, and we suggest enhancements and extensions of our design that may allow additional improvement.
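The weight-based detection step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the decision-stump weak learner, the number of boosting rounds, and the fixed removal fraction `frac` are all assumptions made for the sketch.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weak learner: weighted-error-optimal threshold on one feature.

    Returns (feature index, threshold, polarity, weighted error).
    """
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(s * (X[:, j] - t) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def adaboost_example_weights(X, y, n_rounds=10):
    """Run AdaBoost and return the final example weights.

    Labels y must be in {-1, +1}. Examples that end up with high
    weight resisted learning and are candidate outliers.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)
    for _ in range(n_rounds):
        j, t, s, err = fit_stump(X, y, w)
        pred = np.where(s * (X[:, j] - t) >= 0, 1, -1)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # hypothesis weight
        w = w * np.exp(-alpha * y * pred)      # up-weight mistakes
        w = w / w.sum()
    return w

def remove_suspected_outliers(X, y, frac=0.125, n_rounds=10):
    """Drop the `frac` highest-weighted examples; one can then retrain."""
    w = adaboost_example_weights(X, y, n_rounds)
    k = max(1, int(frac * len(y)))
    suspects = np.argsort(w)[-k:]  # highest final weights
    keep = np.setdiff1d(np.arange(len(y)), suspects)
    return X[keep], y[keep], suspects

# Toy data: x >= 4 means class +1, except example 1 has a flipped label.
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([-1, 1, -1, -1, 1, 1, 1, 1])
X_clean, y_clean, suspects = remove_suspected_outliers(X, y)
```

On this toy set, boosting concentrates weight on the mislabeled example, so it is the one removed; retraining on `X_clean, y_clean` then corresponds to the "remove and learn again" step. The combination of hypotheses from several AdaBoost runs used in the paper is not shown here.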