Concentration Bounds for Unigrams Language Model
Evgeny Drukh and Yishay Mansour
In: COLT 2004, 1-4 July, 2004, Banff, Alberta, Canada.

We show several PAC-style concentration bounds for learning a unigram language model. One quantity of interest is the total probability of all words appearing exactly $k$ times in a sample of size $m$. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error gives a PAC bound of approximately $O \left( \frac k {\sqrt m} \right)$. We improve the dependency on $k$ to $O\left( \frac {\sqrt[4] k} {\sqrt m} + \frac k m\right)$. We also analyze the empirical-frequencies estimator, showing that its PAC error bound is approximately $O\left( \frac 1 k + \frac {\sqrt k} {m} \right)$. We derive a combined estimator whose error is approximately $O\left(m^{-\frac 2 5}\right)$, for any $k$. A standard measure of the quality of a learning algorithm is its expected per-word log-loss. We show that the leave-one-out method can be used to estimate the log-loss of the unigram model with a PAC error of approximately $O\left( \frac 1 {\sqrt m} \right)$, for any distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution.
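As a concrete illustration (not part of the paper itself), the two estimators compared in the abstract can be sketched in a few lines of Python. The Good-Turing estimate of the total probability mass of words appearing exactly $k$ times is $G_k = (k+1)\,n_{k+1}/m$, where $n_{k+1}$ is the number of distinct words seen $k+1$ times, while the empirical-frequencies estimate is simply $k\,n_k/m$; the function names below are illustrative.

```python
from collections import Counter

def good_turing_mass(sample, k):
    """Good-Turing estimate G_k = (k + 1) * n_{k+1} / m of the total
    probability mass of all words appearing exactly k times."""
    m = len(sample)
    counts = Counter(sample)
    # n_{k+1}: number of distinct words occurring exactly k + 1 times
    n_k1 = sum(1 for c in counts.values() if c == k + 1)
    return (k + 1) * n_k1 / m

def empirical_mass(sample, k):
    """Empirical-frequencies estimate k * n_k / m: the observed
    relative-frequency mass of words appearing exactly k times."""
    m = len(sample)
    counts = Counter(sample)
    # n_k: number of distinct words occurring exactly k times
    n_k = sum(1 for c in counts.values() if c == k)
    return k * n_k / m

# Example: in "a a b b b c", the word "c" appears once, "a" twice,
# "b" three times; m = 6.
sample = "a a b b b c".split()
print(good_turing_mass(sample, 1))  # (1+1) * n_2 / 6 = 2 * 1 / 6
print(empirical_mass(sample, 1))    # 1 * n_1 / 6 = 1 / 6
```

The paper's bounds quantify how far each of these estimates can deviate from the true mass, as a function of $k$ and $m$.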