Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a generative probabilistic model for "documents" introduced by David Blei, Andrew Ng and Michael I. Jordan in 2003. The model is identical to one published in 2000 for genetic analysis by J. K. Pritchard, M. Stephens and P. Donnelly. Here, documents are grouped, discrete and unordered observations (referred to below as “words”). In most cases, the documents are texts whose words are treated as a group, regardless of word order. Other data can be processed as well, for example pixels from images.

Generative process

LDA models documents through the following generative process:

First, the number of topics $K$ is specified by the user.

The document collection contains $V$ different terms that make up the vocabulary. First, $K$ multinomial distributions over all $V$ terms are drawn from Dirichlet distributions; these distributions are called "topics".

For each document, a distribution over the $K$ topics is drawn from a Dirichlet distribution. A document therefore contains several topics. A generating Dirichlet distribution with parameters $<1$ can be used to express the assumption that each document contains only a few topics. This assumption is the only new feature of LDA compared to previous models and helps to resolve ambiguities (such as the word "bank"). The improvement in topic quality from the assumed Dirichlet distribution over the topics is clearly measurable.
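The effect of Dirichlet parameters below 1 can be illustrated with a small simulation. This is only a sketch: the helper name `mean_largest_share`, the choice of 10 topics, and the values 0.1 and 10.0 are assumptions made for illustration and do not come from the LDA papers.

```python
import numpy as np

def mean_largest_share(alpha, n_topics=10, n_docs=2000, seed=0):
    """Average weight of the most probable topic over sampled documents.

    Each sampled row is one document's topic distribution, drawn from a
    symmetric Dirichlet with concentration `alpha` (illustrative helper).
    """
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    return theta.max(axis=1).mean()

# Parameter < 1: most of the mass falls on a few topics per document.
sparse_share = mean_largest_share(0.1)
# Parameter > 1: topic weights are spread almost evenly.
dense_share = mean_largest_share(10.0)

print(f"alpha=0.1  -> mean largest topic share: {sparse_share:.2f}")
print(f"alpha=10.0 -> mean largest topic share: {dense_share:.2f}")
```

With the smaller parameter, the dominant topic typically takes a much larger share of each document, which is the "few topics per document" assumption expressed numerically.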

Then, for each word in a document, a topic is drawn from that document's topic distribution, and a term is drawn from the term distribution of this topic.
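The steps above can be sketched as a short sampling routine. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `generate_corpus`, the fixed document length, the symmetric Dirichlet parameters `alpha` and `eta`, and their default values are all hypothetical choices.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, K, V, alpha=0.5, eta=0.1, seed=0):
    """Sample a toy corpus following the generative process of LDA.

    K: number of topics, V: vocabulary size. `alpha` and `eta` are assumed
    symmetric Dirichlet parameters for documents and topics respectively.
    """
    rng = np.random.default_rng(seed)

    # Draw K topics, each a multinomial distribution over the V terms.
    topics = rng.dirichlet(np.full(V, eta), size=K)        # shape (K, V)

    # Draw one distribution over the K topics for each document.
    theta = rng.dirichlet(np.full(K, alpha), size=n_docs)  # shape (n_docs, K)

    docs = []
    for d in range(n_docs):
        # For each word position, draw a topic from the document's mixture ...
        z = rng.choice(K, size=doc_len, p=theta[d])
        # ... then draw a term (an index into the vocabulary) from that topic.
        words = np.array([rng.choice(V, p=topics[k]) for k in z])
        docs.append(words)
    return topics, theta, docs

topics, theta, docs = generate_corpus(n_docs=5, doc_len=30, K=3, V=20)
print(docs[0])  # term indices of the first generated document
```

Each document is returned as an array of term indices. Because LDA ignores word order, only the counts of these indices matter for inference.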

Properties

In LDA, each document is viewed as a mixture of latent topics. Each word in the document is assigned to a topic. These topics, the number of which is fixed at the outset, explain the co-occurrence of words in documents. In newspaper articles, for example, the words “euro, bank, economy” or “politics, election, parliament” often appear together. Each such set of words then has a high probability in one topic. A word can also have a high probability in several topics.
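The mixture view can be written as a formula. The symbols $\theta$ and $\beta$ are not used in the text above; they are assumed here, following common notation, for the per-document topic distribution and the per-topic term distribution.

```latex
% Probability that a word position in document d carries term w:
% each of the K topics contributes its term probability, weighted by
% the share of that topic in the document.
\[
  p(w \mid d) \;=\; \sum_{k=1}^{K} \theta_{d,k}\, \beta_{k,w},
  \qquad \sum_{k=1}^{K} \theta_{d,k} = 1,
  \qquad \sum_{w=1}^{V} \beta_{k,w} = 1 .
\]
```

A term such as "bank" can have a large $\beta_{k,w}$ in more than one topic; which topic explains a given occurrence then depends on the document's weights $\theta_{d,k}$.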

LDA is used, among other things, to analyze large amounts of text, to classify text, to reduce dimensionality, or to discover new content in text corpora. Other applications can be found in bioinformatics, for modeling gene sequences.

Literature

David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. In: Journal of Machine Learning Research, Vol. 3 (2003), pp. 993-1022, ISSN 1532-4435.
David M. Blei: Probabilistic Topic Models. In: Communications of the ACM, Vol. 55, No. 4 (2012), pp. 77-84.

Web links

LDA implementation in C by David Blei.

References

David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. In: Journal of Machine Learning Research. 3, No. 4-5, January 2003, pp. 993-1022. doi:10.1162/jmlr.2003.3.4-5.993.
J. K. Pritchard, M. Stephens, P. Donnelly: Inference of population structure using multilocus genotype data. In: Genetics. 155, No. 2, June 2000, ISSN 0016-6731, pp. 945-959.
Mark Girolami: On an Equivalence between PLSI and LDA. In: Proceedings of SIGIR 2003. Association for Computing Machinery, 2003. ISBN 1-58113-646-3.