Cosine similarity

from Wikipedia, the free encyclopedia

Cosine similarity is a measure of the similarity between two vectors. It is defined as the cosine of the angle between them. The cosine of a zero angle is one; for any other angle, the cosine is less than one. It is therefore a measure of whether two vectors point in roughly the same direction.

Typical applications include the comparison of documents and multimedia objects, text mining, data mining, plagiarism detection, search engines, and cryptography, where it is used in decrypting encrypted texts. In 2011 the Codex Copiale, a document written in ciphertext, was deciphered by determining the cosine similarity of the character placement vectors.

Calculation

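The angle $\theta$ between two vectors $\mathbf{a}$ and $\mathbf{b}$ is related to the standard scalar product:

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta$$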

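The cosine similarity of two vectors $\mathbf{a}$ and $\mathbf{b}$ is the cosine of the included angle $\theta$:

$$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$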

The cosine similarity lies between −1 (exactly opposite directions) and 1 (exactly the same direction). A value of 0 usually means independence (orthogonality). Intermediate values indicate varying degrees of similarity or dissimilarity.
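
The definition above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `cosine_similarity` and the sample vectors are assumptions chosen for this example, not part of any particular library. It rejects zero vectors, for which the angle is undefined.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Standard scalar product of a and b
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norms of a and b
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Illustrative vectors (hypothetical data)
same = cosine_similarity([1, 2, 3], [2, 4, 6])         # same direction, close to 1
orthogonal = cosine_similarity([1, 0], [0, 1])         # right angle, 0
opposite = cosine_similarity([1, 2, 3], [-1, -2, -3])  # opposite direction, close to -1
```

Because of floating-point rounding, the results for parallel and antiparallel vectors may differ from exactly 1 and −1 in the last digits, so comparisons are better made with a tolerance, for example via `math.isclose`.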

When comparing texts, the attribute vectors $\mathbf{a}$ and $\mathbf{b}$ are usually frequency vectors of the documents, whose weights can never be negative. The cosine similarity in this case therefore always lies between 0 and 1.
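
As a rough illustration of the text case, the following Python sketch builds word-count vectors for two short strings and compares them. The function name `text_cosine_similarity`, the whitespace tokenization, and the sample sentences are assumptions made for this example; practical systems typically use more careful tokenization and weighting.

```python
import math
from collections import Counter

def text_cosine_similarity(text_a, text_b):
    """Cosine similarity of the word-count vectors of two texts."""
    # Naive tokenization (assumption): lowercase and split on whitespace
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # A word absent from a text has count 0, so only shared words
    # contribute to the scalar product
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an empty text")
    return dot / (norm_a * norm_b)

# Hypothetical sample sentences
score = text_cosine_similarity("the cat sat on the mat", "the dog sat on the log")
unrelated = text_cosine_similarity("red green blue", "one two three")
```

Since all counts are non-negative, the scalar product cannot be negative, and the result stays within the range from 0 to 1. Two texts with no words in common yield 0.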

See also