Significance Tests for Bizarre Measures in 2-Class Classification Tasks
Mikaela Keller, Johnny Mariéthoz and Samy Bengio
Statistical significance tests are often used in machine learning to compare the performance of two learning algorithms or two models. In most cases, however, these tests rely on the assumption that the error measure used to assess a model's or algorithm's performance is computed as a sum of the errors obtained on each example of the test set. This assumption does not hold for several well-known measures, such as F1, used in text categorization, or DCF, used in person authentication. We propose a practical methodology to either adapt the existing tests or develop non-parametric solutions for such bizarre measures, and we assess the quality of these tests on a large real-life dataset.
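The abstract's central point is that F1 is a ratio of counts over the whole test set, not an average of per-example losses, so classical paired tests do not directly apply. The methodology itself is developed in the paper; purely as an illustration of the non-parametric direction it mentions, the sketch below (not from the paper; all names are ours) applies a percentile bootstrap to the F1 difference between two classifiers evaluated on the same test set:

```python
import random

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN). Note it is a ratio of set-level
    counts, not a sum of per-example errors, which is why standard
    significance tests cannot be applied to it directly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_f1_test(y_true, pred_a, pred_b, n_boot=10000, seed=0):
    """Percentile bootstrap on the F1 difference of two models.

    Resamples test examples with replacement (keeping the two models
    paired on the same examples), recomputes the F1 difference on each
    resample, and estimates a two-sided p-value as twice the smaller
    tail mass of the bootstrap distribution around zero.
    """
    rng = random.Random(seed)
    n = len(y_true)
    observed = f1_score(y_true, pred_a) - f1_score(y_true, pred_b)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(f1_score(t, a) - f1_score(t, b))
    tail = min(sum(d <= 0 for d in diffs), sum(d >= 0 for d in diffs))
    return observed, 2 * tail / n_boot
```

This is only a generic bootstrap sketch under the assumption of i.i.d. test examples; the paper's actual adapted and non-parametric tests, and their behavior on measures like DCF, are specified in the full text.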