A Comparative Study on Word Co-occurrence for Term Clustering in Language Model-based Sentence Retrieval
Saeedeh Momtazi, Sanjeev Khudanpur and Dietrich Klakow
In: NAACL 2010, 1 June - 6 June 2010, Los Angeles, CA, USA.
Sentence retrieval is an essential part of
question answering systems. Term clustering,
in turn, is an effective approach for improving
sentence retrieval performance: the more similar
the terms in each cluster, the better the performance
of the retrieval system. A key step in
obtaining appropriate word clusters is accurate
estimation of pairwise word similarities, based
on their tendency to co-occur in similar contexts.
In this paper, we compare four different
methods for estimating word co-occurrence
frequencies from two different corpora. The results
show that commonly used contexts
for defining word co-occurrence differ
significantly in retrieval performance. Using an
appropriate co-occurrence criterion and corpus
is shown to improve the mean average precision
of sentence retrieval from 36.8% to 42.1%.
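
To make the notion of a co-occurrence context concrete, the following is a minimal sketch of how the same corpus can yield different co-occurrence counts under two context definitions. The two contexts shown (whole sentence and a fixed token window), the function names, and the toy corpus are assumptions chosen for illustration; they are not necessarily the four methods or the corpora compared in the paper.

```python
from collections import Counter
from itertools import combinations


def sentence_cooccurrence(sentences):
    """Count unordered word pairs that appear in the same sentence."""
    counts = Counter()
    for tokens in sentences:
        # Use distinct words so each pair is counted once per sentence.
        for w1, w2 in combinations(sorted(set(tokens)), 2):
            counts[(w1, w2)] += 1
    return counts


def window_cooccurrence(sentences, window=2):
    """Count unordered word pairs at most `window` tokens apart."""
    counts = Counter()
    for tokens in sentences:
        for i, w1 in enumerate(tokens):
            # Look only at the next `window` tokens to the right.
            for w2 in tokens[i + 1 : i + 1 + window]:
                if w1 != w2:
                    counts[tuple(sorted((w1, w2)))] += 1
    return counts


# Toy corpus (made up for illustration), already tokenized.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

by_sentence = sentence_cooccurrence(corpus)
by_window = window_cooccurrence(corpus, window=2)
```

On this toy corpus, "cat" and "mat" co-occur once under the sentence context but never under the two-token window, so any similarity measure built on these counts would treat the pair differently depending on the context chosen.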