Semantic relatedness

Computational measures of semantic relatedness are publicly available methods for approximating the relative meaning of words or documents. They have been used for essay grading by the Educational Testing Service, in search engine technology, and for predicting which links people are likely to click on.


 * LSA (Latent semantic analysis) (+) vector-based, adds vectors to measure multi-word terms; (-) non-incremental vocabulary, long pre-processing times


 * PMI (Pointwise Mutual Information) (+) large vocabulary, because it can use any search engine (such as Google); (-) cannot measure relatedness between whole sentences or documents


 * GLSA (Generalized Latent Semantic Analysis) (+) vector-based, adds vectors to measure multi-word terms; (-) non-incremental vocabulary, long pre-processing times


 * ICAN (Incremental Construction of an Associative Network) (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (-) cannot measure relatedness between multi-word terms, long pre-processing times


 * NGD (Normalized Google Distance; see below) (+) large vocabulary, because it can use any search engine (such as Google); (-) cannot measure relatedness between whole sentences or documents


 * WordNet (+) humanly constructed; (-) humanly constructed (not automatically learned), cannot measure relatedness between multi-word terms, non-incremental vocabulary


 * ESA (Explicit Semantic Analysis) based on Wikipedia and the Open Directory Project (ODP)

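Several of the measures above are built from co-occurrence counts. As a minimal illustration of the idea behind PMI, the score can be estimated from document counts; the counts in the example below are invented for illustration, not drawn from any real corpus:

```python
import math

def pmi(count_x, count_y, count_xy, total):
    """Pointwise mutual information estimated from raw counts.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with each probability
    estimated as a document count divided by the total number of documents.
    A positive score means x and y co-occur more often than chance.
    """
    p_x = count_x / total
    p_y = count_y / total
    p_xy = count_xy / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: out of 1,000 documents, "car" appears in 100,
# "automobile" in 50, and both together in 40.
print(pmi(100, 50, 40, 1000))  # -> 3.0 (co-occur 8x more than chance)
```

Using a search engine simply replaces these document counts with hit counts, which is what gives PMI-based measures their large vocabulary.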
Google distance
Google distance is a measure of semantic interrelatedness derived from the number of hits returned by the Google search engine for a given set of keywords. Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of Google distance, while words with dissimilar meanings tend to be farther apart.

Specifically, the normalized Google distance between two search terms x and y is



$$\operatorname{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}}$$

where M is the total number of web pages searched by Google; f(x) and f(y) are the number of hits for search terms x and y, respectively; and f(x, y) is the number of web pages on which both x and y occur.

If the two search terms x and y never occur together on the same web page, but do occur separately, the normalized Google distance between them is infinite. If both terms always occur together, their NGD is zero.
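The definition translates directly into code. A minimal sketch, assuming the caller supplies the hit counts f(x), f(y), f(x, y) and the index size M (the numbers in the example are invented, not real search-engine counts):

```python
import math

def ngd(f_x, f_y, f_xy, M):
    """Normalized Google Distance from raw hit counts.

    f_x, f_y : number of pages containing x (respectively y)
    f_xy     : number of pages containing both x and y
    M        : total number of pages indexed by the search engine
    """
    if f_xy == 0:
        return math.inf  # terms never co-occur: infinite distance
    lx, ly, lxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))

# If both terms always occur together, f(x) = f(y) = f(x, y),
# the numerator vanishes and the distance is zero.
print(ngd(1000, 1000, 1000, 10**8))  # -> 0.0
```

Note that the measure is not symmetric in the counts alone: the max in the numerator and the min in the denominator are what keep the distance normalized when one term is much more frequent than the other.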

Google distance references

 * Rudi Cilibrasi and Paul Vitanyi (2005). "Automatic Meaning Discovery Using Google".
 * "Google's search for meaning" at Newscientist.com.
 * Jan Poland and Thomas Zeugmann (2006). "Clustering the Google Distance with Eigenvectors and Semidefinite Programming".
 * Aarti Gupta and Tim Oates (2007). "Using Ontologies and the Web to Learn Lexical Semantics" (includes a comparison of NGD to other algorithms).
 * Wilson Wong, Wei Liu and Mohammed Bennamoun (2007). "Tree-Traversing Ant Algorithm for term clustering based on featureless similarities", Journal of Data Mining and Knowledge Discovery (uses NGD for term clustering).