Generalization error bounds using unlabeled data
The 18th Annual Conference on Learning Theory (COLT '05)
We present two new methods for obtaining generalization error bounds in
a semi-supervised setting. Both methods are based on approximating the
disagreement probability of pairs of classifiers using unlabeled data.
The first method works in the realizable case. It suggests how the
ERM principle can be refined using unlabeled data and has provable
optimality guarantees when the number of unlabeled examples is large.
Furthermore, the technique extends easily to cover active learning.
A downside is that the method is of little use in practice due to its
limitation to the realizable case.
The idea in our second method is to use unlabeled data to transform
bounds for randomized classifiers into bounds for simpler deterministic
classifiers. As a concrete example of how the general method works in
practice, we apply it to a bound based on cross-validation. The result
is a semi-supervised bound for classifiers trained on all the
labeled data. The bound is easy to implement and apply, and should be
tight whenever cross-validation makes sense. Applying the bound to
SVMs on the MNIST benchmark data set gives results that suggest that
the bound may be tight enough to be useful in practice.
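
The idea behind the second method can be illustrated with a minimal sketch.
It is not the bound derived in the paper, only the elementary
triangle-inequality argument that any such transformation can rest on: for
a deterministic classifier f and a randomized classifier g, err(f) is at
most E[err(g)] + E[d(f, g)], where d(f, g) is the disagreement probability.
The second term needs no labels, so it can be estimated from unlabeled
data. The function names, the choice of a uniform mixture over
cross-validation fold classifiers, and the Hoeffding confidence margin are
all assumptions made for this illustration.

```python
import math
import numpy as np


def expected_disagreement(final_preds, member_preds, delta=0.05):
    """Estimate E_g[d(f, g)] on unlabeled data, with an upper confidence bound.

    final_preds:  shape (m,), predictions of the deterministic classifier f
                  on m unlabeled points.
    member_preds: shape (k, m), predictions of the k classifiers whose
                  uniform mixture is the randomized classifier g
                  (for example, the k cross-validation fold classifiers).
    Returns (empirical estimate, upper bound holding with prob. >= 1 - delta).
    """
    final_preds = np.asarray(final_preds)
    member_preds = np.asarray(member_preds)
    m = final_preds.shape[0]
    # For each unlabeled point: fraction of mixture members disagreeing with f.
    # These per-point values lie in [0, 1] and are i.i.d. across points.
    per_point = np.mean(member_preds != final_preds[None, :], axis=0)
    d_hat = float(np.mean(per_point))
    # One-sided Hoeffding margin for the mean of m values in [0, 1].
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return d_hat, min(1.0, d_hat + margin)


def transfer_bound(randomized_error_bound, final_preds, member_preds, delta=0.05):
    """Turn a bound on E_g[err(g)] into a bound on err(f).

    Uses err(f) <= E_g[err(g)] + E_g[d(f, g)]. If the supplied bound on the
    randomized classifier holds with probability >= 1 - delta_1, the result
    holds with probability >= 1 - delta_1 - delta by the union bound.
    """
    _, d_upper = expected_disagreement(final_preds, member_preds, delta)
    return min(1.0, randomized_error_bound + d_upper)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, k = 10_000, 10
    f = rng.integers(0, 2, size=m)
    # Hypothetical fold classifiers that each flip about 2% of f's predictions.
    flips = rng.random((k, m)) < 0.02
    folds = np.where(flips, 1 - f, f)
    print(transfer_bound(0.05, f, folds))
```

The disagreement term shrinks toward its true value at rate 1/sqrt(m) in
the number m of unlabeled points, so with plentiful unlabeled data the
deterministic bound is loose only to the extent that f and the mixture
members actually disagree.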