Machine translation
Machine translation (MT) refers to the automatic translation of texts from one language into another by a computer program. While human translation is a subject of applied linguistics, machine translation is studied as a subfield of artificial intelligence within computational linguistics.
Human dream
Understanding a language without having learned it is an old dream of mankind (the Tower of Babel, J. Becher's numerical interlingua, Timerio, the Babel fish, the miracle of Pentecost, science fiction stories). The invention of the computer, combined with the study of language as a scientific discipline (linguistics), has for the first time opened a way to make this dream come true.
History
To this day, military interests have decisively shaped the path of MT. One of the earliest projects was a Russian-English translation program for the US military. Despite its anecdotally poor quality, the program enjoyed great popularity among the US military, who for the first time were able to get at least an impression of the content of Russian documents without going through third parties (interpreters and translators).
The ALPAC report, drawn up in 1966 for the United States Department of Defense, concluded that MT was fundamentally unfeasible and brought research to a virtual standstill for almost 20 years. It was not until the 1980s that electronics companies such as Siemens AG (METAL project) resumed research. These projects also include the research carried out in the Collaborative Research Center "Electronic Language Research" at Saarland University. This is where the "SUSY" system was developed, which was able to translate from and into German. Another system of the Collaborative Research Center was ASCOF, which used semantic as well as morpho-syntactic information for translation.
At the same time, the Japanese government initiated the Fifth Generation project, in which MT from English into Japanese was initially implemented in the Prolog programming language. The close collaboration between universities, electronics companies and government resulted in the world's first commercial MT programs for personal computers and put Japan at the forefront of MT research worldwide.
In the 1990s, the BMBF flagship project Verbmobil ran in Germany; its aim was to interpret spoken dialogue in German, English and Japanese. The Verbmobil system was designed to recognize spontaneous speech, analyze the input, translate it, generate a sentence and speak it aloud.
In the 2000s, statistical methods were increasingly used. Google has been offering a statistical translation system since 2006. Rule-based approaches have also been developed further. One of the best-known research projects of this type is the free software Apertium, which is financed by the Spanish government and the government of Catalonia and is being developed further at the University of Alicante.
In 2010, the state of MT was judged by many to be unsatisfactory. Fundamentally, science still did not understand human language sufficiently. Most linguists even assumed that machine translation runs into fundamental limits unless automatic systems acquire competencies that go far beyond pure language understanding, since many translations also require large amounts of conceptual knowledge, meta-knowledge, knowledge about the human environment in general, and knowledge of the conventions of social interaction.
Since 2016, translation programs have increasingly used artificial neural networks, i.e. artificial intelligence, which has led to rapid progress. Examples are DeepL, Google Translate, Yandex.Translate and Bing Translator, which have since achieved significantly better results.
In March 2018, Microsoft announced that it had used AI to achieve Chinese-English translations of the quality of a professional human translator. Microsoft described this as a breakthrough in machine translation that it had not expected so early.
The need for MT applications continues to grow:
- Many texts are now available digitally (i.e. easy to process for the computer).
- Globalization requires more and more texts to be translated into more and more languages (the translation market has been doubling every four years), while the popularity of the translator and interpreter professions is stagnating.
- Languages that few Western Europeans or Americans speak, or that are difficult for them to learn, and that come from regions whose residents hardly speak any Western languages, are becoming increasingly important:
- commercially important: the East Asian languages Chinese, Korean and Japanese, as well as Thai.
- militarily important: languages of international conflict regions, especially those in which the US military is involved. In 2003, several US software companies launched translation programs for Arabic and even Pashto (one of the languages of Afghanistan and the border regions of Pakistan). Also in 2003, DARPA held a blind competition for an unknown source language. In 2011, the BOLT program was launched with the aim of promoting research into the translation of Chinese and Arabic texts into English.
Translation methods
Rule-based methods
Direct machine translation
The words of the source text are translated into the target language word for word and in the same order, using a dictionary. Subsequently, word order and inflection are adjusted according to the rules of the target language. This is the oldest and simplest MT method; it was, for example, also the basis of the Russian-English system mentioned above.
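The word-for-word step can be illustrated with a minimal Python sketch. The dictionary entries and the function name `translate_direct` are invented for this illustration and do not come from any real system; the later adjustment of word order and inflection is not modelled.

```python
# Toy German-English dictionary (hypothetical entries, for illustration only).
TOY_DICTIONARY = {
    "ich": "I",
    "habe": "have",
    "das": "the",
    "buch": "book",
    "gelesen": "read",
}


def translate_direct(sentence, dictionary):
    """Translate word for word, keeping the source word order.

    Words missing from the dictionary are copied unchanged, which is one
    simple fallback among several possible ones.
    """
    translated = []
    for word in sentence.split():
        translated.append(dictionary.get(word.lower(), word))
    return " ".join(translated)


print(translate_direct("Ich habe das Buch gelesen", TOY_DICTIONARY))
```

The result keeps the German verb-final order ("I have the book read"), which is exactly what the subsequent reordering step would have to repair.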
Transfer method
The transfer method is the classic MT method and has three steps: analysis, transfer and generation. The second step gave the whole method its name. First, the grammatical structure of the source text is analyzed, often as a tree structure (= analysis). Depending on the variant of the method, a semantic structure is often derived from this. Then the structures are carried over into the target language (= transfer). Finally, grammatical rules are used to generate sentences in the target language from these structures, producing the target text (= generation).
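The three steps can be sketched in Python as follows. This is a deliberately tiny illustration under strong assumptions: the analysis only handles German perfect-tense clauses of the fixed pattern subject, auxiliary, object phrase, participle, and the lexicon and the function names `analyze`, `transfer` and `generate` are hypothetical. Real systems use full grammars and parsers.

```python
# Hypothetical toy lexicon for the transfer step.
LEXICON = {"ich": "I", "habe": "have", "das": "the", "buch": "book", "gelesen": "read"}


def analyze(sentence):
    """Analysis: map a clause of the pattern
    subject, auxiliary, object phrase, participle to a small structure."""
    words = sentence.lower().split()
    return {
        "subject": words[0],
        "auxiliary": words[1],
        "object": words[2:-1],
        "participle": words[-1],
    }


def transfer(structure):
    """Transfer: replace source-language words with target-language words."""
    return {
        "subject": LEXICON[structure["subject"]],
        "auxiliary": LEXICON[structure["auxiliary"]],
        "object": [LEXICON[word] for word in structure["object"]],
        "participle": LEXICON[structure["participle"]],
    }


def generate(structure):
    """Generation: English places the participle right after the auxiliary."""
    words = [structure["subject"], structure["auxiliary"], structure["participle"]]
    words += structure["object"]
    return " ".join(words)


print(generate(transfer(analyze("Ich habe das Buch gelesen"))))
```

Because the word order is decided only in the generation step, the output is "I have read the book" rather than a word-for-word rendering.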
Interlingua method
The interlingua method first analyzes the grammatical information of the source text and converts it, according to predefined rules, into an "intermediate language" (= interlingua). The grammatical information in the target language is then generated from this intermediate language. The interlingua method is helpful for ambiguous expressions. For example, the colloquial German sentence "Wenn ich arbeiten würde, würde ich mir ein Auto kaufen." (in the standard language with the subjunctive: "Wenn ich arbeitete, kaufte ich mir ein Auto.") cannot be translated with a transfer rule würde → would ("If I would work, I would buy a car."), because English does not allow would in the if-clause of a conditional sentence. In the interlingua, the würde information would be passed on abstractly as "unreal conditional" and, depending on the sentence context, realized in English with or without would.
Example-based MT
(Example-Based Machine Translation, EBMT)
The core of an example-based MT system is a translation memory, in which frequently recurring sentences or phrases are stored together with their translations. Using information retrieval methods, the system calculates statistically how similar each entry in the translation memory is to a sentence of the source text. The translation is generated by combining the translations of the most similar sentences.
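The retrieval step can be sketched in Python. This sketch makes several assumptions: the translation memory entries are invented, the character-based similarity from the standard library's `difflib.SequenceMatcher` merely stands in for a proper information retrieval measure, and only the single closest entry is returned, so the combination of several partial matches is not shown.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source sentences with stored translations.
TRANSLATION_MEMORY = {
    "Wie spät ist es?": "What time is it?",
    "Wo ist der Bahnhof?": "Where is the station?",
    "Ich möchte einen Kaffee.": "I would like a coffee.",
}


def most_similar_entry(sentence, memory):
    """Return (similarity, source, translation) for the closest memory entry."""
    scored = [
        (SequenceMatcher(None, sentence.lower(), source.lower()).ratio(), source, target)
        for source, target in memory.items()
    ]
    return max(scored)


score, source, translation = most_similar_entry("Wo ist der Flughafen?", TRANSLATION_MEMORY)
print(source, "->", translation)
```

For the query above, the closest entry is "Wo ist der Bahnhof?". A full system would then replace the differing fragment ("Bahnhof" versus "Flughafen") with material from other entries.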
Statistical MT
(Statistics-Based Machine Translation, SBMT)
Before the actual translation, a program analyzes the largest possible corpus of bilingual texts (often parliamentary records, for example the Canadian Hansard corpus). Words and grammatical forms in the source and target languages are assigned to one another based on their frequency and mutual proximity, and a dictionary and grammar transfer rules are extracted in this way. The texts are then translated on this basis. Statistical MT is very popular because it does not require any knowledge of the languages involved. By analyzing real texts, statistical MT can therefore in theory also capture rules that have not yet been precisely described in linguistic terms.
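How a dictionary can be extracted from co-occurrence frequencies is sketched below in Python. The three-sentence corpus is invented, and the Dice coefficient is used here only as a simple stand-in for the probabilistic alignment models of real statistical systems; the proximity of words within a sentence is not taken into account.

```python
from collections import Counter
from itertools import product

# Tiny hypothetical parallel corpus (German-English sentence pairs).
CORPUS = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]


def extract_dictionary(corpus):
    """Pair each source word with the target word it co-occurs with most
    consistently, scored by the Dice coefficient."""
    source_counts, target_counts, pair_counts = Counter(), Counter(), Counter()
    for source_sentence, target_sentence in corpus:
        source_words = set(source_sentence.split())
        target_words = set(target_sentence.split())
        source_counts.update(source_words)
        target_counts.update(target_words)
        # Count every source/target word pair seen in the same sentence pair.
        pair_counts.update(product(source_words, target_words))

    def dice(s, t):
        return 2 * pair_counts[(s, t)] / (source_counts[s] + target_counts[t])

    return {s: max(target_counts, key=lambda t: dice(s, t)) for s in source_counts}


print(extract_dictionary(CORPUS))
```

Although "das" also co-occurs with "house" and "book", it is paired with "the" because the two words appear together in every sentence in which either of them occurs.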
Neural MT
(Neural Machine Translation, NMT)
Like statistical MT, neural MT is based on the analysis of bilingual texts. An artificial neural network is trained on these texts and captures the relationships between the source and target languages. However, it is not possible to trace how a particular translation came about, although neural MT appears to translate many texts more accurately than competing approaches.
MT with human help
(Human-Aided Machine Translation, HAMT)
With human-aided MT, the user has to translate ambiguous or difficult-to-translate constructions themselves, or avoid them. This can be done in advance, for example by breaking long sentences into short ones, or interactively, for example by choosing the correct meaning of a word.
Demarcation
Computer-aided translation (Machine-Aided Human Translation, MAHT; also called Computer-Aided Translation, CAT), in which a computer program supports the human translator, does not count as machine translation.
Quality
Rating
MT research uses evaluation, i.e. the graded assessment of translation quality. MT output is first rated sentence by sentence; the normalized sum of the sentence ratings gives the quality of the whole text. In most cases, the assessment is carried out manually by a native speaker of the target language and expressed as a score. In Japan, a five-point scale from 0 to 4 points is often used:
- 4 points: Very easy to understand to perfect; no obvious mistakes.
- 3 points: One or two wrong words; otherwise easy to understand.
- 2 points: With some goodwill, one can roughly guess what was originally meant.
- 1 point: The sentence is understood (if at all) in a different sense from the one intended. This is often due to a partly or completely incorrect translation of the grammar (sentence structure).
- 0 points: The sentence makes no sense; looks like a random, chaotic arrangement of words.
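The step from sentence ratings to a text-level score can be written as a short Python function. The source does not specify how the sum is normalized; this sketch assumes division by the maximum achievable sum, so that the result lies between 0 and 1. The function name `text_quality` is chosen for illustration.

```python
def text_quality(sentence_scores, max_score=4):
    """Normalized sum of per-sentence ratings, in the range 0.0 to 1.0.

    Assumption: the sum is divided by the highest possible total
    (max_score points for every sentence).
    """
    if not sentence_scores:
        raise ValueError("at least one sentence rating is required")
    return sum(sentence_scores) / (max_score * len(sentence_scores))


# Five sentences rated on the 0-4 scale: 14 of 20 possible points.
print(text_quality([4, 3, 2, 4, 1]))
```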
For the automatic assessment of translation quality, algorithms such as the BLEU score are used, which measure the similarity of the machine translation to a human reference translation. BLEU and other evaluation measures have been criticized as unreliable and, especially at sentence level, as distinguishing only to a limited extent between good and bad translations. Nevertheless, automatic evaluation measures correlate relatively well with human evaluations, especially when entire text documents with several thousand sentences are evaluated.
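The idea behind such similarity measures can be illustrated with a simplified, BLEU-style score in Python. This is a sketch under stated assumptions rather than the reference BLEU definition: it scores a single sentence against a single reference, uses only unigrams and bigrams (`max_n=2`) and applies no smoothing. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU-style score with one reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # The brevity penalty punishes candidates shorter than the reference.
    brevity_penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the cat sat on a mat", "the cat sat on the mat"))
```

A single wrong word already lowers the score noticeably, and a sentence without any matching bigram scores zero, which hints at why such measures are considered unreliable at sentence level.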
An effective method for evaluating the quality of a machine pre-translation is based on the so-called hit rate: "the number of terms, relative to all terms in the document, that the translator can use unchanged (without manual intervention) during manual post-translation (with regard to inflection and the position of the clause, or of the term, in the sentence)".
- Terms represent single words or fixed groups of words.
- Depending on the quality, machine pre-translation is worthwhile or hinders the translator.
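The hit rate can be computed as in the following Python sketch. It assumes that the machine pre-translation and the translator's final version are available as aligned lists of terms of equal length, which is a simplification made for this illustration; the function name `hit_rate` is hypothetical.

```python
def hit_rate(pre_translated_terms, post_edited_terms):
    """Share of machine-translated terms the translator kept unchanged.

    Assumption: both lists are aligned, i.e. position i in the
    pre-translation corresponds to position i in the final version.
    """
    if not pre_translated_terms:
        raise ValueError("the document must contain at least one term")
    unchanged = sum(
        1 for machine, final in zip(pre_translated_terms, post_edited_terms)
        if machine == final
    )
    return unchanged / len(pre_translated_terms)


# Three of four terms survive post-translation unchanged.
print(hit_rate(["the", "house", "is", "big"], ["the", "house", "is", "large"]))
```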
Practical problems
The fact that MT quality is often perceived as unsatisfactory also has more tangible, partially remediable causes:
- User knows target language
- Especially when translating between Western languages, the user often understands the target language to a certain extent and is more sensitive to deviations than someone who is solely dependent on the translation.
- Linguistic style
- Every style of language has its own characteristics, some of which have not even been described in linguistics. MT systems are mostly designed for written newspaper language. They deliver particularly poor results on text types for which they were not developed, i.e. mostly on literary texts, spoken language and occasionally technical texts.
- Dictionary too small or incorrect
- As society and science change, the vocabulary of a language grows rapidly every day. In addition, many words have multiple meanings (see homonym) that could be disambiguated by context analysis. Dictionary deficiencies, as in the Russian-English example, are responsible for poor translation quality to a surprisingly large extent. The largest MT programs have dictionaries with several million entries and many times that number of meanings.
- Lack of transfer rules
- Many grammatical phenomena differ greatly from language to language or exist only in certain languages. Solving these problems often requires basic linguistic research; MT companies try to avoid this effort.
- Computational linguistic problems
- In addition, MT shares many problems with other computational linguistic applications, for example the handling of world knowledge.
Grammatical problem areas of rule-based methods
No MT system analyzes or applies every grammatical rule. Instead, it is often assumed that a grammatical phenomenon that has not been analyzed happens to occur in a similar form in the other language, so that only the words need to be translated. One example is the German article der, die, das, which is almost always translated as the and almost never as a in English. An analysis as "definite article" can therefore be dispensed with. The conditional sentence with "would" discussed above shows that such simple translations can fail even between German and English. Between less closely related or unrelated languages, for example Latin and German or Chinese and German, such direct translations are often not a safe choice even at the word level.
Many complex grammatical phenomena have not yet been researched in MT, or only partially. In these cases such "free rides", in which a phenomenon is carried over without analysis, are often the only solution. Such phenomena include (selection):
- Articles
- The Germanic and Romance languages have articles, but many other languages do not. When translating from another language, the correct article has to be generated “out of nowhere” - but not in all cases.
- Compound nouns
- In languages such as German or Japanese, the exact relationship between nouns can be "concealed" by simply placing them next to each other. In other languages the relationship has to be made explicit. Example: Donaudampfschifffahrtsgesellschaftskapitän = "A captain who works for a company that operates steamers on the Danube".
- Compound sentence components
- In the Welsh language, a very long noun phrase can form a single word, e.g. Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch = "St. Mary's Church in a hollow of white hazel near a swift whirlpool and near the Church of St. Tysilio, which is located by a red cave".
- Relative pronouns
- Most languages have no relative pronoun or only a single one. When translating into German (der, die, das) or English, a distinction must be made.
- Tense / modality
- Each language has its own system for expressing that an event happened in the past or that a sentence is a command. In European languages this is often done with verbs and adverbs.