The lemma (ancient Greek λῆμμα lēmma, literally "the taken", "the accepted"; plural: lemmas or lemmata) is the basic form of a word in lexicography and linguistics, i.e. the word form under which a term can be found in a reference work (naming form, citation form).

Lemma, lexeme and citation form

The lemma is the headword or keyword of an entry in a dictionary, lexicon, or encyclopedia. It is described both as the basic form of a word and as the citation form of a lexeme. The process of determining the exact lemmas is called lemma selection or lemmatization.

A lexeme, a linguistic unit of meaning, could in principle be named in any way: it is an abstraction over various word forms and has no specific form of its own that sets it apart from those forms. By convention, lexemes are named after one fixed form, which is then called the citation form (also: basic form, keyword) of the lexeme:

  • In German, the citation form for nouns is usually the nominative singular (e.g. Traum, "dream"), and for verbs the present active infinitive (e.g. träumen, "to dream").
  • In Latin, the citation form for verbs is a paradigm (a model set of forms) that gives a fixed sequence of certain moods (infinitive, indicative, subjunctive) and tenses (present, perfect, ...), which is particularly helpful with irregular verbs. Most dictionaries use this order: 1st person singular present indicative active, 1st person singular perfect indicative active, supine I or neuter perfect passive participle (PPP), and finally the present active infinitive. For example, the paradigm for “bring, carry” is: fero, tuli, latum, ferre. Textbooks, by contrast, put the present active infinitive first.

Linguistic reference works organized around the word (lexicons, thesauri, etymological works) use all lexemes as lemmas, whereas reference works that select lemmas by concept (specialist lexicons, specialist glossaries, encyclopedias, and the like) prefer the simplest noun as the citation form, especially in German. For example, der Traum ("the dream"), träumen ("to dream"), das Träumen ("dreaming"), and das Geträumte ("what was dreamed") are grouped under the common lemma Traum ("dream"), provided they concern the same subject matter. Here the lemma usually serves as a descriptor.

The following example shows that the choice of citation form depends on the type of reference work:

  • The word "mice" is classified under the lemma mouse .
    An ordinary dictionary takes this approach, since "mouse" is the basic form of the plural "mice".
  • In biology, the word "mouse" is classified under the lemma mice.
    A biology textbook uses the genus, mice, as the umbrella term. The taxonomic citation form mice expresses that there are many different species of mice and not just “the mouse”. The biological view differs from colloquial language, which calls everything that looks like a mouse a "mouse".
  • For computer mice, the lemma in a specialist textbook is mouse; in a universal dictionary, the entry could be, for example, Mouse (Computer).
    Computer mice can look different and differ in details, but when they are classified in a dictionary, their similarities are considered more important than their differences. Therefore, unlike in biology, the lemma is given in the singular.
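
The dependence of the citation form on the type of reference work can be pictured as a lookup keyed by both the word form and the kind of work. The following is only an illustrative sketch: the table `CITATION_FORMS`, the work-type labels, and the function `lemma_for` are hypothetical names chosen here, and the entries simply restate the mouse example above.

```python
from typing import Optional

# Hypothetical table: (word form, kind of reference work) -> lemma.
# The entries restate the mouse example; the labels are assumptions.
CITATION_FORMS = {
    ("mice", "general dictionary"): "mouse",
    ("mouse", "biology textbook"): "mice",
    ("computer mice", "universal dictionary"): "Mouse (Computer)",
}


def lemma_for(word_form: str, work_type: str) -> Optional[str]:
    """Return the lemma used for a word form in a given kind of work.

    Returns None when the table has no entry for that combination.
    """
    return CITATION_FORMS.get((word_form.lower(), work_type))


print(lemma_for("mice", "general dictionary"))
print(lemma_for("mouse", "biology textbook"))
```

The point of keying on the pair rather than on the word form alone is that the same form has no single "correct" lemma; the lemma is a convention of the individual reference work.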


The lexicographical reduction of the inflected forms of a word to a basic form, i.e. determining the basic form of a lexeme and arranging the lemmas, is also called lemmatization. A run of directly consecutive lemmas forms a lemma stretch (German: Lemmastrecke).

Lemmatization also refers to mapping a full (inflected) form back to its corresponding lemma. Depending on the application, this process is important in language technology. With statistical models, for example, lemmatizing a very small text corpus can be a suitable way to increase the frequency of individual lexemes and thereby reduce statistical noise. The full forms in the corpus are replaced by their lemma before the statistical evaluation. For example, if the word forms “met”, “meet”, “meets” and “meeting” each occurred once in the corpus, after lemmatization only the lemma “meet” remains, but with a frequency of four. The lexeme “meet” thus potentially carries much more weight in the corpus than the individual full forms did before lemmatization.
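
The frequency effect described above can be sketched in a few lines of Python. This is a minimal illustration, not a real lemmatizer: the lookup table `LEMMA_TABLE` and the helper `lemmatize` are hypothetical, and the corpus consists of four illustrative inflected forms of "meet". A practical system would rely on a full morphological lexicon or a trained model instead of a hand-written table.

```python
from collections import Counter

# Hypothetical lookup table mapping full forms to their lemma.
LEMMA_TABLE = {
    "met": "meet",
    "meet": "meet",
    "meets": "meet",
    "meeting": "meet",
}


def lemmatize(token: str) -> str:
    """Return the lemma of a token; unknown tokens are only lowercased."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


# A tiny corpus in which every full form occurs exactly once.
corpus = ["met", "meet", "meets", "meeting"]

raw_counts = Counter(corpus)                           # counts per full form
lemma_counts = Counter(lemmatize(t) for t in corpus)   # counts per lemma

print(raw_counts)
print(lemma_counts)  # Counter({'meet': 4})
```

Before lemmatization each of the four forms has a frequency of one; afterwards the single lemma "meet" has a frequency of four, which is the pooling of evidence that reduces noise in small corpora.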

Lemma selection

Before lemmatization, lemma selection is carried out: it is decided which types of lemmas are to be included in the lexicon. Lemma selection is necessary because a complete lemmatization of all words, parts of words, or phrases of a language is difficult. One criterion for including a lemma in a lexicon is the length of time for which the term has existed in the language concerned.

Lemma selection is closely tied to the indexing of the underlying texts and to the question of synonymy, homonymy, and polysemy. Such indexing is unnecessary for works that cover a language as a whole, because the complete vocabulary is to be recorded, but it is quite relevant for lexicons of the language of a particular period or group, and for specialist lexicons as well.

