N-grams

from Wikipedia, the free encyclopedia

N-grams are the result of breaking a text into fragments: the text is decomposed, and n successive fragments at a time are grouped together as an n-gram. The fragments can be letters, phonemes, words, and the like. N-grams are used in cryptology and corpus linguistics, especially in computational linguistics, quantitative linguistics and computer forensics. Individual words, whole sentences or complete texts are broken down into n-grams for analysis or statistical evaluation.

Types of N-grams

Figure: Bigram frequencies. Distribution of the bigrams in a German text.
Figure: Trigram frequencies. Distribution of the trigrams in a German text. The trigrams ER_ and EN_ are the most common ("_" stands for the space).

Important N-grams are the monogram, the bigram (sometimes also referred to as the digram) and the trigram. The monogram consists of one character, for example a single letter, the bigram of two and the trigram of three characters. In general, one can also speak of multigrams when a group consists of "many" characters.

The prefixes of scientific names are often formed with the help of Greek numerals. Examples are mono for "alone" or "only", tri for "three", tetra for "four", penta for "five", hexa for "six", hepta for "seven", octo for "eight" and so on. Bi and multi are prefixes of Latin origin and stand for "two" and "many" respectively.

The following table gives an overview of the names of the N-grams, sorted by the number of characters, together with an example in which letters of the alphabet are used as characters. The examples are German words (HAUS "house", HEUTE "today", SCHIRM "umbrella", TELEFON "telephone", BEOBACHTUNGSLISTE "observation list"), because the number of letters has to match N:

N-gram name    N     Example
Monogram       1     A
Bigram         2     AB
Trigram        3     UNO
Tetragram      4     HAUS
Pentagram      5     HEUTE
Hexagram       6     SCHIRM
Heptagram      7     TELEFON
Octogram       8     COMPUTER
...            ...   ...
Multigram      N     BEOBACHTUNGSLISTE (N = 17)

Formal definition

Let $\Sigma$ be a finite alphabet and $n$ a positive integer. Then an $n$-gram is a word $w$ of length $n$ over the alphabet $\Sigma$, that is, $w \in \Sigma^n$.
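The definition can be illustrated with a minimal Python sketch that lists all contiguous n-grams of a sequence. The function name `ngrams` and the sample inputs are assumptions chosen for illustration. The symbols of the alphabet may be letters or words, depending on what the sequence contains.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def ngrams(seq: Sequence[T], n: int) -> List[Tuple[T, ...]]:
    """Return all contiguous n-grams of seq as tuples, in order of occurrence."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence shorter than n yields an empty list.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Character level: the symbols are letters.
char_trigrams = ["".join(g) for g in ngrams("COMPUTER", 3)]

# Word level: the symbols are whitespace-separated tokens.
word_bigrams = ngrams("to be or not to be".split(), 2)

print(char_trigrams)
print(word_bigrams)
```

A sequence of length l contains l - n + 1 such n-grams (or none if l < n).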

Analysis

N-gram analysis is used to answer the question of how likely it is that a particular character or word will follow a given letter or phrase. For a particular sample of English, the conditional probabilities for the next letter after the sequence "for ex ..." are, in descending order, approximately: a = 0.4, b = 0.00001, c = 0, ..., summing to 1. On the basis of the N-gram frequencies, continuing the fragment with "a" → "for exa(mple)" therefore seems much more likely than the alternatives.
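Such conditional probabilities can be estimated from N-gram counts by relative frequency. The following is a minimal sketch under that assumption. The function name, the tiny sample text and the context length of two characters are illustrative and not taken from the article.

```python
from collections import Counter, defaultdict
from typing import Dict


def next_char_probabilities(text: str, context_len: int) -> Dict[str, Dict[str, float]]:
    """Estimate P(next character | preceding context_len characters)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for i in range(len(text) - context_len):
        context = text[i:i + context_len]
        counts[context][text[i + context_len]] += 1
    # Normalize the follower counts of each context so that they sum to 1.
    return {
        ctx: {ch: c / sum(followers.values()) for ch, c in followers.items()}
        for ctx, followers in counts.items()
    }


# Hypothetical toy sample; a realistic estimate needs a large corpus.
sample = "for example, for exams, for extra, for example"
model = next_char_probabilities(sample, context_len=2)
probs = model["ex"]
print(probs)  # "ex" is followed by "a" three times and by "t" once
```

With a longer context the estimates become more specific but need far more text, which is why large corpora are used in practice.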

The language used is not important for the analysis, but its statistics are: n-gram analysis works in every language and every alphabet. The analysis has therefore proven itself in the field of language technology: numerous approaches to machine translation are based on data obtained with this method.

N-gram analysis is particularly important when large amounts of data, for example e-mails, are to be examined for a specific topic. Clusters can be formed on the basis of similarity to a reference document, such as a technical report on atom bombs or polonium: the closer the word frequencies in an e-mail are to those in the reference document, the more likely it is that its content deals with the same topic and, in this example, might be relevant to terrorism, even if keywords that clearly indicate terrorism do not appear in it.
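The article does not say which similarity measure is used for this comparison. As one possible sketch, the word-frequency vectors of two texts can be compared with cosine similarity. The measure, the function names and the sample texts below are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lower-cased words (word-level 1-grams) in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical reference document and two hypothetical e-mails.
reference = word_counts("the reactor uses enriched uranium fuel")
mail_1 = word_counts("uranium fuel for the reactor was enriched")
mail_2 = word_counts("lunch menu for the office party")

sim_1 = cosine_similarity(reference, mail_1)
sim_2 = cosine_similarity(reference, mail_2)
print(sim_1 > sim_2)  # the first e-mail is closer to the reference
```

The same comparison works with bigram or trigram counts in place of single words, which makes it more tolerant of spelling variations.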

Commercially available programs that take advantage of this fault-tolerant and extremely fast method include spell checkers and forensics tools. For the Java programming language, the Apache OpenNLP library provides tools for N-gram analysis; for Python, NLTK is available.

Google corpus

Web indexing

In 2006, Google published six DVDs with English-language N-grams of one to five words, which had been created during the indexing of the web. Here are some examples from the Google corpus for 3-grams and 4-grams at the word level (i.e. n is the number of words), together with the frequencies with which they occur:

3-grams:

  • ceramics collectables collectibles (55)
  • ceramics collectables fine (130)
  • ceramics collected by (52)
  • ceramics collectible pottery (50)
  • ceramics collectibles cooking (45)

4-grams:

  • serve as the incoming (92)
  • serve as the incubator (99)
  • serve as the independent (794)
  • serve as the index (223)
  • serve as the indication (72)
  • serve as the indicator (120)

Example

A string to be searched is $T$ = "Welcome to come". It is decomposed into N-grams of length 2 (so-called bigrams); a space, written as "_", is placed in front of the string. The frequency of occurrence of each bigram is then determined. The "frequency vector" for the string $T$ is thus:

_W: 1
We: 1
el: 1
lc: 1
co: 2
om: 2
me: 2
e_: 1
_t: 1
to: 1
o_: 1
_c: 1

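That is, $v = (1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1)$. The length of the vector is bounded above by $\binom{l}{2}$, where $l$ is the length of the string $T$ and $\binom{\cdot}{\cdot}$ is the binomial coefficient.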
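The frequency vector of the example above can be reproduced with a short Python sketch. It assumes, as the listed bigrams suggest, that a single "_" is placed in front of the string and that every space is written as "_". The function name is chosen for illustration.

```python
from collections import Counter


def bigram_frequencies(text: str, pad: str = "_") -> Counter:
    """Count character bigrams after prefixing one pad symbol and writing spaces as pad."""
    s = pad + text.replace(" ", pad)
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


freq = bigram_frequencies("Welcome to come")

# Counter keeps the order of first occurrence, so the values form the frequency vector.
vector = list(freq.values())

for bigram, count in freq.items():
    print(f"{bigram}: {count}")
print(vector)
```

The string yields 15 bigrams in total, of which 12 are distinct; "co", "om" and "me" each occur twice.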

Google Books corpus

A data set from Google Books with a reference date of July 2009 was given a web interface with graphical analysis in the form of charts and put online under the name Google Books Ngram Viewer. By default, it shows the frequency of n-grams of up to five words, normalized by the number of books available for each year. Operators can be used to combine several terms into one graph (+), apply a multiplier to terms of very different frequency (*), show the relationship between two terms (-, /) or compare different corpora (:). The charts may be "freely used for any purpose"; attribution of the source and a link are requested. The underlying data can be downloaded in individual packages for one's own evaluations and is available under a Creative Commons Attribution license. In addition to English in general, there are separate corpora for American English and British English (distinguished by place of publication), for English Fiction (based on library classification) and for English One Million. For the latter, up to 6000 books per year were randomly selected, in proportion to the number of books published and scanned, for the years 1500 to 2008. There are also corpora for German, Simplified Chinese, French, Hebrew, Russian and Spanish. Tokenization was done simply by splitting at spaces. N-grams were formed across sentence boundaries, but not across page boundaries. Only words that occur at least 40 times in the corpus were included.

A new corpus dated July 2012 was made available at the end of that year. Italian was added as a new language, and English One Million was not regenerated. The new corpus is based on a larger number of books, improved OCR technology and improved metadata. Tokenization was done according to a set of hand-written rules, except for Chinese, where a statistical segmentation method was used. N-grams now end at sentence boundaries but extend across page boundaries. With sentence boundaries taken into account, new functions were introduced for the 2012 corpus that allow the position in the sentence to be determined with high probability for 1-, 2- and 3-grams. This makes it possible, for example, to distinguish homographic nouns and verbs in English, although this works better for modern language.

Dice coefficient

The Dice coefficient indicates how similar two terms are. To do this, it determines the proportion of N-grams that are present in both terms. The formula for two terms $a$ and $b$ is

$$d(a, b) = \frac{2\,|T(a) \cap T(b)|}{|T(a)| + |T(b)|}$$

where $T(x)$ is the set of N-grams of the term $x$. $d$ is always between 0 and 1.

Example

  • Term a = "wirk"
  • Term b = "work"

When using trigrams, the decomposition looks like this ("§" marks the padding at the beginning and end of a term):

  • T(a) = {§§w, §wi, wir, irk, rk§, k§§}
  • T(b) = {§§w, §wo, wor, ork, rk§, k§§}
  • T(a) ∩ T(b) = {§§w, k§§, rk§}

That means $d(\text{wirk}, \text{work}) = \frac{2 \cdot 3}{6 + 6} = 0.5$. The Dice coefficient (one can also say the similarity) is therefore 0.5 (50%).
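The calculation can be sketched in Python as follows. The trigram sets listed above correspond to the terms "wirk" and "work", each padded with two "§" symbols on both sides. The function names are assumptions, and the formula used is the common form of the Dice coefficient: twice the number of shared N-grams divided by the total number of N-grams in both sets.

```python
from typing import Set


def padded_ngrams(term: str, n: int = 3, pad: str = "§") -> Set[str]:
    """Return the set of n-grams of term, padded with n - 1 pad symbols on each side."""
    s = pad * (n - 1) + term + pad * (n - 1)
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient: 2 * |T(a) & T(b)| / (|T(a)| + |T(b)|)."""
    ta, tb = padded_ngrams(a, n), padded_ngrams(b, n)
    total = len(ta) + len(tb)
    # Both sets are empty only for two empty terms with n = 1.
    return 2 * len(ta & tb) / total if total else 0.0


print(sorted(padded_ngrams("wirk") & padded_ngrams("work")))
print(dice("wirk", "work"))
```

Because the result depends only on shared N-grams, small spelling differences reduce the similarity gradually rather than turning a match into a mismatch.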

Application areas

Because it is largely language-neutral, this algorithm can be used in the following areas:

Statistics

N-gram statistics are statistics on the frequency of N-grams, sometimes also of word combinations of N words. Special cases are bigram statistics and trigram statistics. N-gram statistics are used in cryptanalysis and in linguistics, especially in speech recognition systems. During recognition, the system checks the various hypotheses together with the context and can thus differentiate between homophones. In quantitative linguistics, the ranking of the N-grams by frequency, and the question of which laws it follows, are of interest. Statistics of digrams (and trigrams) in German, English and Spanish can be found in Meier and Beutelspacher.

For meaningful statistics, sufficiently large text bases of several million letters or words should be used. As an example, the statistical evaluation of a German text base of around eight million letters results in "ICH" as the most common trigram, with a relative frequency of 1.15 percent. The following table gives an overview of the ten most frequent trigrams found in this text base:

Trigram    Frequency
ICH        1.15%
EIN        1.08%
UND        1.05%
DER        0.97%
NDE        0.83%
SCH        0.65%
DIE        0.64%
DEN        0.62%
END        0.60%
CHT        0.60%
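A ranking of this kind can be computed with a short Python sketch. It assumes that all non-letter characters are dropped before counting, so that trigrams may span word boundaries. The article does not state how the text base was preprocessed, and the function name and sample text are illustrative.

```python
from collections import Counter
from typing import List, Tuple


def top_ngrams(text: str, n: int = 3, k: int = 10) -> List[Tuple[str, float]]:
    """Return the k most frequent letter n-grams with their relative frequencies."""
    # Assumption: keep letters only and ignore case.
    letters = "".join(ch for ch in text.upper() if ch.isalpha())
    counts = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(counts.values())
    return [(gram, count / total) for gram, count in counts.most_common(k)]


# Hypothetical toy input; meaningful statistics need millions of letters.
sample = "Ich sehe ein Licht und ich sehe den Schein"
for gram, share in top_ngrams(sample, n=3, k=5):
    print(f"{gram}  {share:.2%}")
```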

Literature

  • Wolfgang Schönpflug: N-gram frequencies in the German language. I. Monograms and digrams. In: Journal for experimental and applied psychology XVI, 1969, pp. 157-183.
  • Pia Jaeger: Changing social justice. An idealistic construct and/or a means of ensuring political acceptance. Nomos, Baden-Baden 2017, ISBN 978-3-8452-8440-8, pp. 25-56: presentation of N-grams and their application to the expression "social justice" as a worked example.

Web links

  • Wiktionary: N-gram (explanations of meanings, word origins, synonyms, translations)

References

  1. Dan Jurafsky (Stanford University) and James H. Martin (University of Colorado Boulder): Speech and Language Processing - An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Chapter 3: Language Modeling with N-Grams. Accessed April 3, 2020.
  2. How to use NGram features for Document Classification in OpenNLP. In: TutorialKart. Accessed April 3, 2020.
  3. Generate the N-grams for the given sentence. In: Python Programming. May 3, 2019, accessed April 4, 2020.
  4. Web 1T 5-gram version 1
  5. Alex Franz and Thorsten Brants: All Our N-gram are Belong to You. In: Google Research Blog. 2006. Accessed December 16, 2011.
  6. Google Books Ngram Viewer
  7. Google Books Ngram Viewer - Info
  8. Google Books Ngram Viewer - Datasets
  9. Helmut Meier: German language statistics. Second enlarged and improved edition. Olms, Hildesheim 1967, pp. 336-339.
  10. Albrecht Beutelspacher: Kryptologie. 7th edition, Vieweg, Wiesbaden 2005, ISBN 3-8348-0014-7, pp. 230-236; also: trigrams.