Jaccard coefficient

The Jaccard coefficient, or Jaccard index, named after the Swiss botanist Paul Jaccard (1868–1944), is a measure of the similarity of sets.

Figure: Intersection (above) and union (below) of two sets A and B.

History

Jaccard introduced the coefficient in his 1902 publication Lois de distribution florale dans la zone alpine (p. 72), where he called it the "coefficient de communauté florale".

The Jaccard coefficient has since become established in mathematics, where it serves as a similarity measure for sets, vectors and, more generally, objects. It is used in particular for automatic text recognition and interpretation.

Definition

To calculate the Jaccard coefficient of two sets, one divides the number of common elements (intersection) by the size of the union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

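More generally, for $n$ sets $A_1, \dots, A_n$ the following applies:

$$J(A_1, \dots, A_n) = \frac{|A_1 \cap \dots \cap A_n|}{|A_1 \cup \dots \cup A_n|}.$$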

The closer the Jaccard coefficient is to 1, the greater the similarity of the sets. The minimum value of the Jaccard coefficient is 0.
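
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `jaccard` and the handling of the all-empty case (where the ratio would be 0/0) are assumptions made here.

```python
def jaccard(*sets):
    """Jaccard coefficient of two or more sets: |intersection| / |union|."""
    if len(sets) < 2:
        raise ValueError("need at least two sets")
    sets = [set(s) for s in sets]
    union = set.union(*sets)
    if not union:
        # Assumption: if every set is empty the ratio is 0/0; this sketch
        # treats identical empty sets as fully similar and returns 1.0.
        return 1.0
    intersection = set.intersection(*sets)
    return len(intersection) / len(union)


print(jaccard({1, 2, 3}, {2, 3, 4}))             # 0.5
print(jaccard({1, 2}, {3, 4}))                   # 0.0
print(jaccard({1, 2, 3}, {2, 3, 4}, {3, 4, 5}))  # 0.2
```

Disjoint sets give the minimum value 0, and identical non-empty sets give 1.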

Example

The two sets and have the Jaccard coefficient
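
As a hypothetical illustration (the sets below are assumptions chosen here, not values from the source), take $A = \{a, b, c, d\}$ and $B = \{c, d, e\}$. The intersection $\{c, d\}$ has 2 elements and the union $\{a, b, c, d, e\}$ has 5, so

$$J(A, B) = \frac{|\{c, d\}|}{|\{a, b, c, d, e\}|} = \frac{2}{5} = 0.4.$$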

Jaccard metric

The Jaccard metric can be derived from the Jaccard coefficient. This metric is calculated using the formula

$$J_\delta(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}.$$

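In general, for $n$ sets:

$$J_\delta(A_1, \dots, A_n) = 1 - \frac{|A_1 \cap \dots \cap A_n|}{|A_1 \cup \dots \cup A_n|}.$$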
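
A small sketch of the two-set Jaccard metric in Python follows. The name `jaccard_distance` and the convention that two empty sets have distance 0 are assumptions of this example. The triangle-inequality check at the end is only a spot check on three sample sets, not a proof.

```python
def jaccard_distance(a, b):
    """Jaccard metric of two sets: (|A ∪ B| - |A ∩ B|) / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are identical, so their distance is 0.
        return 0.0
    return (len(union) - len(a & b)) / len(union)


A, B, C = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
print(jaccard_distance(A, A))  # 0.0
print(jaccard_distance(A, B))  # 0.5
print(jaccard_distance(A, C))  # 0.8

# Spot check of the triangle inequality on these three sets.
print(jaccard_distance(A, C) <= jaccard_distance(A, B) + jaccard_distance(B, C))  # True
```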

Applications

In text mining, and in duplicate detection in particular, the Jaccard similarity is a well-known measure of the similarity of two elements. The two strings are first decomposed into tokens (e.g. by splitting at spaces, or by using n-grams). The resulting sets of substrings are then compared as described above to obtain the similarity of the two strings.
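
The tokenization step might look like the following Python sketch. The helper names `word_tokens` and `char_ngrams`, the lowercasing, and the default `n=3` are all assumptions made for illustration; the source does not prescribe a particular tokenizer or n-gram length.

```python
def word_tokens(text):
    """Split a string at whitespace and return the set of lowercase tokens."""
    return set(text.lower().split())


def char_ngrams(text, n=3):
    """Return the set of character n-grams (n is an assumed parameter)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets; 1.0 for two empty sets (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


s1 = "the quick brown fox"
s2 = "the quick red fox"

# Word-level comparison: 3 shared tokens out of 5 distinct tokens.
print(jaccard(word_tokens(s1), word_tokens(s2)))  # 0.6

# Character-level comparison with bigrams.
print(jaccard(char_ngrams("night", 2), char_ngrams("nacht", 2)))
```

Word tokens are sensitive to any spelling difference within a word, whereas character n-grams still register partial overlap between near-duplicates. Which one fits better depends on the data.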

References

  1. Paul Jaccard: Lois de distribution florale dans la zone alpine, Bulletin de la Société Vaudoise des Sciences Naturelles, Volume 38 (1902), p. 72. Retrieved November 23, 2018.
  2. Similarity measures for vectors at Fraunhofer. Retrieved November 23, 2018.
  3. Jaccard coefficient in Hans Friedrich Eckey, Reinhold Kosfeld, Martina Rengers: Multivariate Statistics, Betriebswirtschaftlicher Verlag Dr. Th. Gabler GmbH, Wiesbaden, 2002, ISBN 3-409-11969-8, p. 219. Retrieved November 23, 2018.
  4. Jaccard coefficient in seo-suedwes. Retrieved November 23, 2018.
  5. Bing Liu: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. 2nd Edition. Springer-Verlag, Berlin / Heidelberg 2011, ISBN 978-3-642-19459-7, pp. 231 f.