Automatic Meaning Discovery Using Google
Rudi Cilibrasi and Paul Vitányi
We present a new theory of relative semantics between objects,
based on information distance and Kolmogorov complexity.
This theory is then applied to construct a method to
automatically extract the meaning
of words and phrases from the world-wide-web using Google.
The approach is
novel in its unrestricted problem domain, simplicity of implementation,
and manifestly ontological underpinnings.
The world-wide-web is the largest database on earth,
and the latent semantic context information
entered by millions of independent users
averages out to provide automatic meaning of useful quality.
We give examples
that distinguish between colors and numbers,
cluster names of paintings by 17th-century Dutch masters,
and show an understanding of
electrical terms and primes;
we also demonstrate a simple
automatic English-Spanish translation.
Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87\% with
the expert-crafted WordNet categories.