Normalized Google distance

The normalized Google distance (NGD for short) is a measure of the semantic proximity of two terms or concepts. It is determined from the number of hits that the Google search engine returns for each of the two terms and for both terms together, i.e. the number of documents that contain both terms. The NGD usually lies between 0 and 1; the lower it is, the more closely two terms are related.

Calculation

If you enter a term such as “horse” into the Google search engine, it returns around 12,300,000 indexed pages (as of September 2007). Another term, such as “rider”, returns 13,900,000 pages. Searching for both terms together finds about 1,690,000 pages. The terms “horse” and “beard” still occur together on 262,000 pages, but “horse” and “rider” are clearly more closely related. These counts give an estimate of the probability that two terms appear together. Relating them to the total number of indexed pages (around 8,000,000,000) yields the NGD.

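The NGD of two terms x and y is defined by the following formula:

$$\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}$$

Here f(x) denotes the number of hits for a term x, f(x, y) the number of hits for pages that contain both x and y, and M the total number of indexed pages. NGD(x, y) is not defined for the special case f(x) = f(y) = 0. The NGD of “horse” and “rider” is about 0.307, the NGD of “horse” and “beard” about 0.700.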
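
The calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard definition from the paper by Cilibrasi and Vitányi, NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log M − min(log f(x), log f(y))). The function name `ngd` and its parameter names are hypothetical, and the handling of zero hit counts (raising an error, or returning infinity when the terms never co-occur) is an assumption made for this sketch. The hit counts are passed in directly; no search engine is queried.

```python
import math


def ngd(fx: int, fy: int, fxy: int, total_pages: int) -> float:
    """Normalized Google distance computed from raw hit counts.

    fx, fy      -- hits for each term on its own
    fxy         -- hits for pages containing both terms
    total_pages -- total number of indexed pages (M)
    """
    if fx <= 0 or fy <= 0:
        # Assumption for this sketch: treat a term without hits as an error,
        # since the logarithm of zero is not defined.
        raise ValueError("NGD is not defined when a term has no hits")
    if fxy <= 0:
        # Assumption for this sketch: terms that never co-occur
        # are treated as infinitely far apart.
        return math.inf

    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(total_pages) - min(log_fx, log_fy)
    return numerator / denominator


# Rounded counts quoted in the text for "horse" and "rider" (September 2007).
horse_rider = ngd(12_300_000, 13_900_000, 1_690_000, 8_000_000_000)
print(f"NGD(horse, rider) = {horse_rider:.3f}")
```

The base of the logarithm does not matter, because it cancels in the ratio. With the rounded counts quoted above, the sketch yields a value of roughly 0.33 rather than exactly 0.307; the result is sensitive to the hit counts and especially to the index size M that is assumed.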

Practical areas of application

The Dutch scientist Paul Vitányi and the American scientist Rudi Cilibrasi believe that this method can be used to teach the meaning of terms to an artificial intelligence automatically. CompLearn, open-source software developed by Cilibrasi, has already used the NGD to separate colors from numbers and to group Dutch painters based on the titles of their works.

Translation software is another possible area of application.

Related procedures

Another method for measuring the distance between two pieces of information, the normalized information distance (NID for short), had previously been described by Paul Vitányi. It assesses the proximity of the objects being compared on the basis of their properties.

References

  1. Vitányi, Cilibrasi: Automatic Meaning Discovery Using Google (arXiv:cs/0412098v3)
