Corpus Linguistics

from Wikipedia, the free encyclopedia

Corpus linguistics is a field of linguistics. It gains new knowledge about language in general or about individual languages, or tests existing hypotheses, on the basis of quantitative or qualitative data obtained from the analysis of specially compiled text corpora or (less often) corpora of spoken language. Corpus linguistics became widespread in German-speaking countries from the second half of the 1990s. Epistemologically, it stands in opposition to generativism. Whether corpus linguistics is a method or a branch of linguistics in its own right is still disputed.

Data material and subject of research

The subject of corpus linguistics is language in its various manifestations. Corpus linguistics is characterized by the use of authentic language data documented in large corpora. Such text corpora are collections of linguistic utterances compiled according to specific criteria and with a specific research goal. The findings of corpus linguistics are thus based on natural utterances of a language, i.e. on language as it is actually used. These utterances can be written, or they can be spontaneous or elicited spoken language. Most corpora are now available in digital form and can be used for linguistic research with specialized software.

The aim of corpus linguistics is to use this data either to test (confirm or refute) existing linguistic hypotheses or to arrive at new hypotheses and theories about the subject through exploratory data analysis. In the first case one speaks of “corpus-based” linguistic analysis, in the second of “corpus-driven” linguistic analysis.

Corpus linguistic questions concern both the linguistic system itself (“langue” according to Ferdinand de Saussure or “competence” according to Noam Chomsky) and the use of language (“parole” according to de Saussure or “performance” according to Chomsky). Corpus linguistics is in the process of overcoming the dichotomous view of language that dominates linguistics.

A typical question regarding the language system is, for example:

  • Can the prefield (Vorfeld) of a German sentence be occupied by more than one constituent? If so, by which constituents? Are there rules that describe the possibilities of multiple prefield occupation?

Typical questions relating to language use include:

  • Are there more typographical errors in the texts of e-mails than in traditional letters? What types of errors are characteristic of emails?
  • Which mistakes do learners of German (with different source languages) make particularly often at a certain level? Are certain words or grammatical constructions avoided by these learners?

For many of the research questions that corpus linguistics tries to answer, however, it is not possible to decide clearly which of the two domains, langue or parole, a phenomenon belongs to. Examples are the questions:

  • Which adjectives does the noun “hair” typically appear with?
  • Are modal particles used more often, less often, or differently in spoken language than in written language?

The distribution of adjectives with “hair” and the use of modal particles can, on the one hand, be seen as a phenomenon of a particular language or, after comparison with other languages, as a characteristic of language in general. On the other hand, they can also be seen as the result of a specific linguistic usage.
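
A question such as the one about adjectives occurring with “hair” is typically approached by counting co-occurrences in a corpus annotated with parts of speech. The following is only a minimal sketch: the toy corpus, its hand-assigned tags (ADJ, NOUN, and so on) and the function name adjective_collocates are assumptions made for illustration, and the sketch only looks at adjectives directly preceding the noun.

```python
from collections import Counter

# Toy corpus, hand-tagged for illustration only: (token, part-of-speech) pairs.
# The tag names are assumptions, not the tagset of any particular corpus.
TAGGED_SENTENCES = [
    [("she", "PRON"), ("has", "VERB"), ("long", "ADJ"), ("hair", "NOUN")],
    [("his", "DET"), ("blond", "ADJ"), ("hair", "NOUN"), ("shone", "VERB")],
    [("long", "ADJ"), ("hair", "NOUN"), ("needs", "VERB"), ("care", "NOUN")],
    [("a", "DET"), ("long", "ADJ"), ("road", "NOUN")],
]

def adjective_collocates(sentences, noun):
    """Count adjectives that stand immediately before the given noun."""
    counts = Counter()
    for sentence in sentences:
        # Walk over adjacent token pairs within one sentence.
        for (word, tag), (next_word, next_tag) in zip(sentence, sentence[1:]):
            if tag == "ADJ" and next_tag == "NOUN" and next_word == noun:
                counts[word] += 1
    return counts

print(adjective_collocates(TAGGED_SENTENCES, "hair").most_common())
# prints [('long', 2), ('blond', 1)]
```

Raw counts like these only become informative when compared with how often each adjective occurs with other nouns, which is why real studies usually add an association measure on top of them.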

(The works of Lemnitzer / Zinsmeister (2010) for German and McEnery / Xiao / Tono (2006) for English offer an insight into the facets of corpus linguistic research.)

Methodological problems

A significant methodological problem in corpus linguistics is the relationship between the data basis, i.e. the corpus, and the object under investigation. Theoretically, the data could cover the object completely if the language in question were no longer in use. However, a corpus cannot be regarded as a valid sample in the sense of inferential statistics, since the object to which the sample relates, i.e. a particular language or a particular linguistic usage, cannot in practice be recorded as a whole. Today one therefore avoids calling a corpus “representative” of the examined object in the statistical sense (as was originally demanded) and regards findings gained from corpora only as provisionally plausible. Large corpora should instead be compiled in a “balanced” way, i.e. consist of different text types in a fixed ratio.
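
The idea of a “balanced” compilation can be illustrated as a quota draw per text type. This is a minimal sketch under stated assumptions: the three text types, the 5:3:2 ratio, the placeholder text names and the function balanced_sample are all hypothetical and are not taken from any actual corpus design.

```python
import random

# Hypothetical pools of candidate texts, keyed by text type (illustrative names only).
POOLS = {
    text_type: [f"{text_type}_{i:02d}" for i in range(20)]
    for text_type in ("newspaper", "fiction", "academic")
}

def balanced_sample(pools, parts, total, seed=0):
    """Draw texts so that each text type contributes the share given by `parts`.

    `parts` maps a text type to an integer weight, e.g. 5:3:2. Quotas are
    rounded down, so the result can be slightly smaller than `total`.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    weight_sum = sum(parts.values())
    sample = []
    for text_type, weight in parts.items():
        quota = total * weight // weight_sum
        if quota > len(pools[text_type]):
            raise ValueError(f"not enough texts of type {text_type!r}")
        # Sample without replacement inside each text type.
        sample.extend((text_type, text) for text in rng.sample(pools[text_type], quota))
    return sample

corpus = balanced_sample(POOLS, {"newspaper": 5, "fiction": 3, "academic": 2}, total=10)
```

Fixing the ratio controls the composition of the corpus, but it does not make the corpus a statistical sample of the language, which is the limitation described above.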

The basic assumption of corpus linguistics that knowledge about language can be gained or checked on the basis of real linguistic utterances brings with it two further methodological problems or objections:

  1. Misleading positive evidence: Deviations from the linguistic norm occur to some extent in spontaneous spoken utterances and even in carefully formulated written ones. When examining a corpus, it can therefore be difficult in individual cases to decide whether a (usually small) number of attestations of a linguistic phenomenon reflects a systematic use of language that actually exists, and thus supports a linguistic thesis, or whether these attestations must be regarded as non-standard or incorrect usage.
  2. Negative evidence: Many statements about linguistic phenomena cannot be substantiated even in very large corpora if the constructions in question are very rare. However, the absence of a sought-after construction from the corpus does not necessarily mean that it does not exist or that it is ungrammatical.

In the first case, one can try to support the results of the corpus analysis with a parallel speaker survey. In the second case, only the investigation of further data or, as a last resort, a speaker survey will help.
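
Why absence from a corpus is weak evidence can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that occurrences of a construction are independent of one another (a Poisson model); real language data violate this assumption because rare constructions tend to cluster in particular texts. The function name prob_unattested and the example rates are hypothetical.

```python
import math

def prob_unattested(rate_per_million, corpus_tokens):
    """Probability of finding zero hits in the corpus.

    Assumes independent occurrences (Poisson model), which is a simplification.
    """
    expected_hits = rate_per_million * corpus_tokens / 1_000_000
    return math.exp(-expected_hits)

# A construction occurring once per 100 million tokens on average:
print(round(prob_unattested(0.01, 1_000_000), 3))    # 1 million tokens
print(round(prob_unattested(0.01, 100_000_000), 3))  # 100 million tokens
```

Under these assumptions the construction is missing from a corpus of 1 million tokens with a probability of about 0.99, and even from a corpus of 100 million tokens with a probability of about 0.37.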

Corpus linguistics vs. generative grammar

Corpus linguistics is based on the actual use of natural languages. It is an inductive, empirical method of gaining knowledge about language: the observation of as many concrete examples as possible leads to the formulation of a general statement about the subject. This approach (“from the specific to the general”) belongs to empiricism, which assumes that all knowledge is based on experience. It contrasts with the deductive method, which derives from the philosophical tradition of rationalism: starting from a reasoned assumption about what a certain linguistic phenomenon is like, one looks for evidence in languages that confirms it (“from the general to the specific”).

This fundamentally distinguishes corpus linguistics from the generative transformational grammar founded by Noam Chomsky and from its successors, whose declared aim is also to investigate the linguistic ability of the competent speaker as a cognitive achievement. Chomsky himself clearly denied the value of authentic language data for gaining linguistic knowledge. He held that authentic language data, such as those available in text corpora, are unsuitable for investigating competence, since errors always occur in language production; no valid statements about the linguistic system can therefore be made on the basis of such data. Methodologically, Chomsky instead relied on introspection and on speaker judgments elicited from competent native speakers under laboratory conditions. Corpus linguistics, by contrast, does not draw the distinction between linguistic competence and linguistic performance that Chomsky considers essential.

Recently, however, a convergence between these two positions has been observed. Researchers in both camps now view their own data more critically and are prepared to use the data preferred by the other side at least as a check on their own findings.

History and areas of application

The widespread use and great importance of the English language and a generally high affinity for empirical research in linguistics are two reasons why computer-aided data analysis, and with it corpus linguistics, first developed in the Anglo-American world.

Modern corpus linguistics was founded there in 1967 by Henry Kučera (1925-2010) and Nelson Francis with their work “Computational Analysis of Present-Day American English”. Their results were obtained using the “Brown Corpus” (in full: “Brown University Standard Corpus of Present-Day American English”), which originally comprised around 1 million words. Other English-language corpora followed, such as the “Lancaster-Oslo/Bergen Corpus” (LOB) of the same size in the 1980s. A new milestone was reached with the creation of a text corpus far exceeding this size as part of the lexicographical work at the English publishing house Collins. The result was the first edition of the “Collins Cobuild Dictionary of English”. This was followed, again on a new scale, by the non-commercial creation of the balanced “British National Corpus”, comprising 100 million running words, which is still used today as the reference corpus for linguistic research into British English. Today it is joined by the “American National Corpus”. Other regional varieties of English are recorded in the International Corpus of English (ICE).

The pioneers of German corpus linguistics were the Institute for Communication Science and Phonetics (IKP) at the University of Bonn and the Institute for the German Language in Mannheim. Today the following German-language corpora deserve particular mention:

  • the “German Reference Corpus” (DeReKo) at the Institute for the German Language in Mannheim, which comprises several billion running words
  • the core corpus of the “Digital Dictionary of the German Language” (DWDS) at the Berlin-Brandenburg Academy of Sciences
  • the corpus of the “German Vocabulary” project at the University of Leipzig (mainly texts from online media)
  • the “Swiss Text Corpus” at the University of Basel (with 20 million running words)

In addition to these corpora, which are maintained on a guaranteed long-term basis and are accessible to the public free of charge, there are a large number of special corpora for many historical stages and varieties of German. (Lemnitzer / Zinsmeister (2010) provide an overview.)

As the example of the Collins Cobuild project shows, and also that of the American Heritage Dictionary (1969), corpora are used by a lexicography that wants to offer the user not only prescriptive descriptions (how a word should be used) but also descriptive ones (how a word is actually used). Quantitative surveys of word frequency can guide the selection of lemmas for many types of dictionaries and make it more objective. Today the use of corpora is also established at German dictionary publishers. Some types of lexical information can only be obtained by analyzing large text corpora (e.g. frequency profiles over time); others can be verified more reliably with corpora than through the language competence of individual lexicographers.
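
How word frequency can inform the selection of lemmas can be sketched as a simple threshold on corpus counts. This is a minimal illustration, not a description of how any dictionary publisher works: the sample texts, the threshold and the function name headword_candidates are assumptions, and the sketch counts word forms rather than lemmas, since lemmatization would require an additional tool.

```python
import re
from collections import Counter

def headword_candidates(texts, min_count):
    """Return word forms that occur at least `min_count` times in the texts."""
    counts = Counter()
    for text in texts:
        # Letters only; lowercasing merges "Words" and "words".
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return sorted(word for word, n in counts.items() if n >= min_count)

# Invented sample texts, for illustration only.
TEXTS = [
    "The corpus contains a million words.",
    "A corpus is a collection of texts.",
    "Words are counted in every corpus.",
]

print(headword_candidates(TEXTS, min_count=2))
# prints ['a', 'corpus', 'words']
```

A threshold of this kind replaces an individual judgment about which words are "common enough" with a criterion that can be checked against the data.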

Corpora are now increasingly used as a research basis in language teaching. Teaching materials are designed on the basis of findings about how a language is actually used, and so-called learner corpora show which errors in language production predominate at which stages of learning.

For special linguistic questions, further special corpora are increasingly being developed; these are naturally much smaller in scope than reference corpora intended to cover a language as a whole. Examples are studies of language use in politics and the media.

Corpus linguistics - method or discipline?

The question of whether corpus linguistics is a method of general or applied linguistics or whether it represents a linguistic discipline of its own has not yet been conclusively answered.

The fact that many branches of linguistics, from theoretical to forensic linguistics, make methodologically reflective, though mostly not exclusive, use of empirical corpus analysis speaks in favor of regarding it as a method. Moreover, corpus linguistics has no recognizable object of study that is genuinely its own, which would be necessary for it to be given the status of an independent scientific discipline.

The view that corpus linguistics is an independent discipline is supported by the fact that it specifically defines language use as its object of knowledge and thus sets itself apart from schools of linguistics whose object is the human language faculty or the general structures of language as a semiotic system.

Regardless of this fundamental consideration, corpus linguistics has established itself as a branch of science in academic life. This is indicated by the existence of several dedicated journals, a two-volume handbook (Lüdeling / Kytö 2008, 2009) and two dedicated chairs, at the University of Birmingham and at the Humboldt University of Berlin.

Literature

Print editions
  • Andrea Abel, Renata Zanin: Corpora in teaching and research. Bozen-Bolzano University Press, Bozen 2011, ISBN 978-88-6046-040-0.
  • Noah Bubenhofer: Linguistic usage patterns. Corpus linguistics as a method of discourse and cultural analysis. de Gruyter, Berlin / New York 2009, ISBN 978-3-11-021584-7.
  • Noam Chomsky: Knowledge of Language. Praeger, New York 1986.
  • Reinhard Fiehler, Peter Wagener: The Spoken German Database (DGD). In: Conversation Research - Online Journal for Verbal Interaction. Volume 6, 2005, pp. 136-147.
  • Hagen Hirschmann: Corpus Linguistics. An introduction. Metzler Verlag, Stuttgart 2019, ISBN 978-3-476-05493-7.
  • Werner Kallmeyer, Gisela Zifonun (eds.): Language corpora - amount of data and progress in knowledge. (= IDS yearbook. 2006). de Gruyter, Berlin / New York 2007.
  • András Kertész, Csilla Rákosi: Data and Evidence in Linguistic Theories: A Research Review. In: A. Kertész, Cs. Rákosi (eds.): New Approaches to Linguistic Evidence. Pilot Studies. Lang, Frankfurt am Main et al. 2008, pp. 21-60.
  • Reinhard Köhler: Corpus Linguistics. On the theoretical principles and methodological perspectives. In: LDV Forum 20/2. 2005, pp. 1-16.
  • Snježana Kordić: The relative clause in Serbo-Croatian (= Lincom Studies in Slavic Linguistics. Volume 10). Lincom Europa, Munich 1999, ISBN 3-89586-573-7, LCCN 2005-530314, OCLC 47905097, DNB 963264087, p. 330.
  • Lothar Lemnitzer, Heike Zinsmeister: Corpus linguistics. 2nd, revised edition. Gunter Narr Verlag, Tübingen 2010.
  • Winfried Lenders: Computational lexicography and corpus linguistics until approx. 1970/1980. In: R. H. Gouws, U. Heid, W. Schweickard, H. E. Wiegand (eds.): Dictionaries - An International Encyclopedia of Lexicography. Supplementary Volume: Recent Developments with Focus on Electronic and Computational Lexicography. de Gruyter Mouton, Berlin 2013, ISBN 978-3-11-214665-1, pp. 982-1000.
  • Anke Lüdeling, Merja Kytö: Corpus Linguistics. An International Handbook. Vol. 1, de Gruyter, Berlin / New York 2008; Vol. 2, 2009.
  • Tony McEnery, Andrew Wilson: Corpus linguistics: an introduction. 2nd edition. Edinburgh University Press, 2001.
  • Tony McEnery, Richard Xiao, Yukio Tono: Corpus-Based Language Studies: An advanced resource book. Routledge, New York 2006, ISBN 0-415-28622-0.
  • Rainer Perkuhn, Holger Keibel, Marc Kupietz: Corpus Linguistics. Fink / UTB, Paderborn 2012, ISBN 978-3-8252-3433-1.
  • Carmen Scherer: Corpus Linguistics. (= Short introductions to German linguistics. Volume 2). Winter, Heidelberg 2006.
  • P. Wagener, K.-H. Bausch (eds.): Sound recordings of spoken German. Documentation of the holdings of linguistic research projects and archives. (= Phonai. Volume 40). Niemeyer, Tübingen 1997.
Web links

Software
  • CorpusExplorer - open source software for easy preparation (over 100 file formats), automatic annotation (over 60 languages) and evaluation (over 40 different analyses) of corpora. In addition, annotated reference corpora (plenary minutes, historical language stages, written / oral corpora, etc.) with over 5.5 billion tokens are available for the CorpusExplorer.

References

  1. Snježana Kordić: Words in the border area of lexicon and grammar in Serbo-Croatian (= Lincom Studies in Slavic Linguistics. Volume 18). Lincom Europa, Munich 2001, ISBN 3-89586-954-6, OCLC 42422661, DNB 956417647, p. 280.
  2. Burghard Rieger: Representativity: from the inappropriateness of a term to characterize a problem of linguistic corpus formation. In: H. Bergenholtz, B. Schaeder (eds.): Empirische Textwissenschaft. Structure and evaluation of text corpora. (= Monographs on linguistics and communication studies. 39). Scriptor, Königstein / Taunus 1979, pp. 52-70.
  3. See Chomsky 1986.
  4. Kertész / Rákosi 2008 and Lenders 2013 provide a historical overview.