Morphological Analysis (Computational Linguistics)

In computational linguistics, morphological analysis is the process of determining the morphological, syntactic and possibly semantic properties of words. Specifically, methods of morphological analysis address the following subtasks:

  1. Segmentation, i.e. the division of complex words into free and bound morphemes. The latter include prefixes and suffixes.
  2. Lemmatization: reducing a simple or complex word to its lemma and determining its syntactic properties. Example: the German word "Häusern" ("houses", dative plural) is reduced to its lemma "Haus" ("house") with the properties {noun, plural, dative}.
  3. Determination of the word structure; this is often done in conjunction with a lexical-semantic analysis.
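
The first two subtasks can be illustrated with a minimal sketch for the German noun form "Häusern" (dative plural of "Haus", "house"). The tables `STEMS` and `ENDINGS` and the function `analyze` are hypothetical and hand-written for this one example; they are not taken from any existing analyzer, and a real system would use a far larger lexicon.

```python
# Toy ending table (assumption): ending -> (morpheme segments, features)
ENDINGS = {
    "": ([], {"singular"}),
    "er": (["er"], {"plural"}),
    "ern": (["er", "n"], {"plural", "dative"}),
}

# Toy stem table (assumption): surface stem -> (lemma, endings it may take)
STEMS = {
    "Haus": ("Haus", {""}),
    "Häus": ("Haus", {"er", "ern"}),  # umlauted stem allomorph
}


def analyze(word):
    """Return all (segments, lemma, features) analyses the toy tables license."""
    results = []
    for cut in range(1, len(word) + 1):
        stem, ending = word[:cut], word[cut:]
        if stem not in STEMS:
            continue
        lemma, allowed = STEMS[stem]
        if ending in allowed:
            segments, features = ENDINGS[ending]
            results.append(([stem] + segments, lemma, {"noun"} | features))
    return results


print(analyze("Häusern"))
```

The result combines a segmentation ("Häus" + "er" + "n"), the lemma "Haus" and a feature set. Listing the permitted endings per stem is one simple way to keep the toy tables from accepting non-words such as "Hausern".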

Problems

  • Regular and irregular allomorphy
    • Regular allomorphy includes, for example, the insertion of e in German verb forms with certain stems: "rechnen" ("to calculate") gives "rechnest", whereas "lieben" ("to love") gives "liebst". It also includes the umlauting of vowels in noun plurals ("Wald" - "Wälder", "forest" - "forests") and in comparative and superlative forms of adjectives ("rot" - "röter", "red" - "redder").
    • Irregular allomorphy occurs, for example, with ablaut ("singen" - "sang" - "gesungen", "sing" - "sang" - "sung") or stem changes ("denken" - "gedacht", "think" - "thought").
  • Unrestricted derivation and compounding: In German, words of almost any length can be formed through compounding and derivation, for example "Grundstücksverkehrsgenehmigungszuständigkeitsübertragungsverordnung" (roughly, "regulation on the transfer of competence for real estate transaction permits") or "Ururururgroßvater" ("great-great-great-great-grandfather"). Since there is no upper bound on the number of such words, a static lexicon listing all word forms is not sufficient. Instead, the word must be actively segmented into its parts so that its properties can be determined from regularities of word syntax (in German, for example, the part that determines the basic properties is the rightmost one).
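
Active segmentation of a compound, with the rightmost part supplying the basic properties, might be sketched as follows. The lexicon entries, the linking element "s" and the function names are assumptions chosen for illustration. The sketch handles compounding only and ignores derivation and inflection.

```python
# Toy lexicon of simple nouns (assumption): entry -> basic properties
LEXICON = {
    "grundstück": {"pos": "noun", "gender": "neuter"},
    "verkehr": {"pos": "noun", "gender": "masculine"},
    "genehmigung": {"pos": "noun", "gender": "feminine"},
}

# Linking elements allowed between parts (assumption: none or "s")
LINKERS = ("", "s")


def segment(word):
    """Return every split of `word` into lexicon entries.

    Linking elements are consumed but not listed in the result.
    """
    word = word.lower()
    results = []
    for end in range(1, len(word) + 1):
        part, rest = word[:end], word[end:]
        if part not in LEXICON:
            continue
        if not rest:
            results.append([part])
            continue
        for linker in LINKERS:
            # Something must remain after the linker to form the next part.
            if rest.startswith(linker) and len(rest) > len(linker):
                for tail in segment(rest[len(linker):]):
                    results.append([part] + tail)
    return results


def head_properties(parts):
    """The rightmost part determines the properties of the whole compound."""
    return LEXICON[parts[-1]]


for parts in segment("Grundstücksverkehrsgenehmigung"):
    print(parts, head_properties(parts))
```

Because the parts are looked up recursively, the same small lexicon covers compounds of any length, which a static list of full word forms cannot do.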

Methods

Most methods of morphological analysis are based on finite automata, more precisely finite-state transducers. The theoretical model used is usually the so-called two-level model (Koskenniemi), in which quasi-context-sensitive rules mediate between the lexical form of a morpheme and its surface form (morph). Such a rule for German could, for example, look like this:

  • ε → e / (ppn | chn | tm | d | tt) {VERBSTEM} _ (n | t | st) {VERBFLEX}

This rule allows the empty string to be replaced by e (in effect, inserting e) after a verb stem ending in ppn, chn, tm, d or tt ("wappn", "rechn", "atm", "gründ", "rett", the stems of the verbs meaning "to arm", "to calculate", "to breathe", "to found" and "to save") before the verbal inflectional endings n, t or st. Example: "rechn" + "n" → "rechnen".
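
The effect of this rule in the generation direction (lexical form to surface form) can be sketched procedurally as below. This is not a two-level or finite-state implementation, which would compile such rules into transducers usable in both directions. The names `STEM_CONTEXT`, `VERB_ENDINGS` and `attach` are hypothetical.

```python
import re

# Left context of the rule: the verb stem ends in ppn, chn, tm, d or tt
STEM_CONTEXT = re.compile(r"(ppn|chn|tm|d|tt)$")

# Right context of the rule: the verbal inflectional endings n, t, st
VERB_ENDINGS = {"n", "t", "st"}


def attach(stem, ending):
    """Join a verb stem and an ending, inserting e where the rule applies."""
    if ending in VERB_ENDINGS and STEM_CONTEXT.search(stem):
        return stem + "e" + ending
    return stem + ending


for stem, ending in [("rechn", "n"), ("rechn", "st"), ("rett", "t"), ("lieb", "st")]:
    print(stem, "+", ending, "->", attach(stem, ending))
```

The stem "lieb" does not match the left context, so "liebst" is formed without an inserted e, while "rechn" yields "rechnen" and "rechnest".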

Literature

  • K.-U. Carstensen et al. (2004): Computerlinguistik und Sprachtechnologie. Chapters 3.1, 3.2.
  • D. Jurafsky & J. H. Martin (2000): Speech and Language Processing. Prentice Hall.