## Abstract

Assessing the significance of word frequencies in texts is of broad importance to both linguistics and data mining. For instance, corpus linguists commonly study language use with tools such as collocations and key words, which are based on word frequencies and use a model of word frequency distributions to assess statistical significance. A standard approach to computing word frequency probabilities in corpus linguistics is the likelihood computation introduced by Dunning (1993). This approach is based on the assumption that a word can occur at each position in a text with equal probability, i.e., that word occurrences can be described as a Bernoulli process. Recent work by Altmann et al. (2009) shows that the recurrence patterns of words, that is, the distribution of inter-arrival times of lexical items, can be better described using the Weibull distribution, although that work does not address the significance of word frequencies. By analyzing word frequencies and the distribution of words in a large diachronic corpus of English, we show that the Weibull distribution is a significant improvement on de facto standard methods in both linguistics and information retrieval, especially for assessing the significance of under-represented words. In addition, we show that the parameters of the Weibull distribution (as well as other distributions that have been used to model inter-arrival times of words in large corpora) cannot be estimated with high confidence for typical word frequencies and text lengths.
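To make the Bernoulli-based baseline concrete, the following minimal sketch computes a log-likelihood ratio statistic of the kind associated with Dunning (1993), comparing one word's count in two corpora under a binomial model. The function names, the two-corpus setup, and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def binom_loglik(k, n, p):
    """Binomial log-likelihood of k occurrences in n tokens at rate p.

    The binomial coefficient is omitted because it cancels in the ratio.
    Terms with a zero count are skipped, which treats 0 * log(0) as 0.
    """
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll


def log_likelihood_ratio(k1, n1, k2, n2):
    """G2 statistic for a word seen k1 times in n1 tokens vs. k2 in n2 tokens.

    Null hypothesis: both corpora share one occurrence probability.
    Alternative: each corpus has its own occurrence probability.
    """
    p1 = k1 / n1                  # rate in corpus 1
    p2 = k2 / n2                  # rate in corpus 2
    p = (k1 + k2) / (n1 + n2)     # pooled rate under the null hypothesis
    return 2.0 * (
        binom_loglik(k1, n1, p1) + binom_loglik(k2, n2, p2)
        - binom_loglik(k1, n1, p) - binom_loglik(k2, n2, p)
    )


# Hypothetical counts: 30 occurrences vs. 10 occurrences in 1000 tokens each.
g2 = log_likelihood_ratio(30, 1000, 10, 1000)
print(f"G2 = {g2:.2f}")
```

Under the Bernoulli assumption this statistic is usually compared against a chi-squared distribution with one degree of freedom (critical value of about 3.84 at the 5% level). The sketch ignores how occurrences are spaced within the text, which is the aspect that the inter-arrival time modeling discussed above is meant to capture.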