PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Automatic Meaning Discovery Using Google
Rudi Cilibrasi and Paul Vitányi
Unpublished, 2005.

Abstract

We present a new theory of relative semantics between objects, based on information distance and Kolmogorov complexity. We then apply this theory to construct a method that automatically extracts the meaning of words and phrases from the world-wide-web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. The world-wide-web is the largest database on earth, and the latent semantic context information entered by millions of independent users averages out to provide automatic meaning of useful quality. We give examples that distinguish between colors and numbers, cluster the names of paintings by 17th-century Dutch masters, and show an understanding of electrical terms and of primes, and we demonstrate a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification, using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert-crafted WordNet categories.
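The abstract does not spell out how a distance is computed from page counts. As a minimal sketch, the function below uses the normalized Google distance in the form usually quoted from the published version of this work: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). Here f(x) and f(y) are the page counts for each term, f(x, y) is the count for pages containing both, and N is a normalizing constant such as the number of indexed pages. The function name, the zero-count handling, and all counts in the usage lines are assumptions made for illustration. They are not measured data, and no search API is called.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google distance computed from page counts.

    fx, fy -- page counts for the two search terms
    fxy    -- page count for pages containing both terms
    n      -- normalizing constant (for example, total pages indexed)

    Assumption: when any count is zero the distance is reported as
    infinity, since the logarithm is undefined there.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical counts, chosen only to show the behavior of the formula.
always_together = ngd(1000, 1000, 1000, 10**6)  # terms always co-occur
partly_together = ngd(1000, 100, 100, 10**6)    # rarer term always appears with the other
never_together = ngd(100, 100, 0, 10**6)        # terms never co-occur
print(always_together, partly_together, never_together)
```

Terms that always appear together get a distance of zero, and the distance grows as co-occurrence becomes rarer relative to the individual counts. A matrix of such pairwise distances is the kind of input the clustering and classification experiments mentioned in the abstract would work from.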

EPrint Type: Article
Additional Information: this paper's focus is still unfolding
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing; Multimodal Integration; Information Retrieval & Textual Information Access
ID Code: 1826
Deposited By: Rudi Cilibrasi
Deposited On: 29 November 2005