Inferring Document Similarity from Hyperlinks
David Grangier and Samy Bengio
In: Proceedings of the Conference on Information and Knowledge Management, CIKM, 2005.
Assessing semantic similarity between text documents is a crucial
aspect of Information Retrieval systems. In this work, we propose
to use hyperlink information to derive a similarity measure that can
then be applied to compare any text documents, with or without
hyperlinks. As linked documents are generally semantically closer
than unlinked documents, we use a training corpus with hyperlinks
to infer a function (a, b) -> sim(a, b) that assigns a higher value
to linked documents than to unlinked ones. Two sets of experiments
on different corpora show that this function compares favorably
with OKAPI matching on document retrieval tasks.
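
The training idea described above, a similarity function that scores linked pairs higher than unlinked ones, can be illustrated with a minimal sketch. The abstract does not specify the model or the training procedure, so everything below is an assumption made for illustration: the weighted dot-product form of sim, the pairwise hinge-loss update, the helper names (bag_of_words, sim, train), and the toy documents are all hypothetical and are not the authors' method.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercased term-frequency vector, stored as a Counter."""
    return Counter(text.lower().split())


def sim(weights, a, b):
    """Weighted dot product over the terms shared by a and b.

    Terms without a learned weight default to 1.0 (plain term overlap).
    """
    return sum(weights.get(t, 1.0) * a[t] * b[t] for t in a if t in b)


def train(triples, epochs=20, lr=0.1, margin=1.0):
    """Learn per-term weights from (doc, linked_doc, unlinked_doc) triples.

    Assumed objective: a pairwise hinge loss that asks for
    sim(doc, linked) >= sim(doc, unlinked) + margin.
    """
    weights = {}
    for _ in range(epochs):
        for a, pos, neg in triples:
            if sim(weights, a, pos) - sim(weights, a, neg) >= margin:
                continue  # this triple is already ranked correctly
            for t in a:
                # Gradient of sim(a, pos) - sim(a, neg) with respect to w_t.
                grad = a[t] * (pos[t] - neg[t])
                if grad:
                    # Keep weights non-negative so sim stays an overlap score.
                    weights[t] = max(0.0, weights.get(t, 1.0) + lr * grad)
    return weights


# Toy corpus (hypothetical): doc_a links to doc_b but not to doc_c.
doc_a = bag_of_words("the jaguar is a big cat")
doc_b = bag_of_words("jaguar and lion belong to one cat family")
doc_c = bag_of_words("the car is a big purchase and the dealer is happy")

weights = train([(doc_a, doc_b, doc_c)])

print("untrained:", sim({}, doc_a, doc_b), sim({}, doc_a, doc_c))
print("trained:  ", sim(weights, doc_a, doc_b), sim(weights, doc_a, doc_c))
```

With unit weights, the unlinked pair scores higher because it shares frequent function words with doc_a. After training, the weights of those words shrink and the weights of the topical terms grow, so the linked pair ranks first. Because the learned weights are attached to terms rather than to documents, the resulting sim can be applied to any pair of texts, with or without hyperlinks.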