Formal language

A formal language is an abstract language in which, in contrast to natural languages, the focus is often not on communication but on mathematical use. A formal language consists of a certain number of symbol strings (generally character strings ) ("words" of the language), which can be put together from a set of characters / symbols ("alphabet", basic symbols). Formal languages are used in linguistics , logic and theoretical computer science .

Formal languages are suitable for (mathematically) precise description of the handling of character strings. For example, data formats or entire programming languages can be specified. Together with formal semantics , the defined character strings are given a (mathematical) meaning. In a programming language, a programming instruction (as part of the formal language) can be assigned a unique machine behavior (as part of the semantics ).

On the basis of formal languages, however, logic calculations can also be defined with which mathematical conclusions can be drawn. In connection with formally defined programming languages, calculi can help to check programs for their correctness .

definition

A formal language over an alphabet is a subset of clover ash shell of the alphabet: . ${\ displaystyle L}$ ${\ displaystyle \ Sigma}$ ${\ displaystyle L \ subseteq \ Sigma ^ {*}}$

An alphabet defines the characters from which a " word " of the language can be formed. For example, you can create the decimal representation of any natural number from the alphabet . ${\ displaystyle \ Sigma}$ ${\ displaystyle \ Sigma = \ {0,1,2,3,4,5,6,7,8,9 \}}$

All words from a given alphabet with a finite length (length 0 or longer), each of which is an element of each letter , this largest possible amount of words for the alphabet , is called the Kleen's shell of the alphabet , in short . A formal language over an alphabet is therefore a certain subset of the Kleene envelope of your alphabet - in general, not every arbitrary combination of characters is a valid word in the language. ${\ displaystyle \ Sigma}$ ${\ displaystyle \ Sigma}$ ${\ displaystyle \ Sigma}$ ${\ displaystyle \ Sigma}$ ${\ displaystyle \ Sigma ^ {*}}$ ${\ displaystyle \ Sigma}$

Formal languages can be empty, finite, or infinite; at most they can encompass the entire Kleenesche shell of their alphabet. They can be defined by a mathematical condition on their words: "The language ... is the set of all words for which ..." applies.

The languages used in theoretical computer science, however, are usually more specifically defined by a certain replacement procedure - rules on how the alphabet characters can be / may be combined. There are different types of replacement methods: Semi-Thue systems , Chomsky grammars , Lindenmayer systems and the like. a. With such replacement methods, one starts with a specific start character string, which is gradually converted into word structures by repeated ("recursive") application of the rules (text replacements), which then as a whole, or only a specified section thereof, as words apply to the language. One speaks here of generative grammars , because the words of a language are generated step by step through such text substitutions . Conversely, languages can also be defined as the set of all words from which a certain predefined word or one of several predefined words can be generated (via the language replacement process). ("Everything belongs to language that can be traced back to ... through the rules.")

Differentiation from natural languages

With the help of formal languages, natural languages can also be modeled, especially their syntax. When comparing formal languages with natural languages, however, it should be noted that natural languages have at least the two superimposed hierarchical levels of the word and the sentence above the elementary phonetic signs . The rules for their structure are usually divided into morphology on the one hand and syntax on the other. In formal languages, on the other hand, there is often only one level of the hierarchy of the formal word above the elementary alphabet sign ; with regard to the structure of the words, one speaks of syntax in formal language. If a natural language is modeled using a formal one, then the sentences of the natural language are called words from a formal point of view .

In addition, utterances in natural language have a natural meaning, while the meaning of formal languages must always be defined in a formal way.

Examples

The programming language C is a formal language. The words from C are the respective programs. The alphabet of C are the keywords and characters set out in the definition of C.

The natural numbers in unary representation:

{\ displaystyle \ mathbb {N} _ {\ rm {un}}: = \ {\ varepsilon, 1,11,111,1111, \ ldots \}}

The unary language above , which contains only square-length words:

{\ displaystyle \ {a \}}

{\ displaystyle {\ rm {quad \ _count}}: = \ {a ^ {(n ^ {2})} \ mid n \ in \ mathbb {N} \}}

{\ displaystyle {\ rm {count}} _ {2}: = \ {a ^ {n} b ^ {n} \ mid n \ in \ mathbb {N} \}}

{\ displaystyle {\ rm {count}} _ {3}: = \ {a ^ {n} b ^ {n} c ^ {n} \ mid n \ in \ mathbb {N} \}}

The language of all palindromes : , where the reflection of the word is. ${\ displaystyle {\ rm {pal}}: = \ {w \ in \ {0.1 \} ^ {*} \ mid w = w ^ {R} \}}$
${\ displaystyle w ^ {R}}$ ${\ displaystyle w}$

The decimal coding of prime numbers: . Here the coding of the natural numbers denotes in the decimal system and PRIM stands for the set of prime numbers . ${\ displaystyle {\ rm {prim}} _ {\ rm {dec}}: = \ {{\ rm {dec}} (p) \ mid p \ in {\ rm {PRIM}} \}}$
${\ displaystyle {\ rm {dec}} \ colon \ mathbb {N} \ rightarrow \ {0,1,2,3,4,5,6,7,8,9 \} ^ {*}}$

The Morse or Thue sequence : , wherein a homomorphism is defined as follows: and , . So the first elements of the Thue sequence are: 0, 01, 0110, 01101001, 0110100110010110 ... ${\ displaystyle {\ rm {thue}}: = \ {h_ {t} ^ {n} (0) \ mid n \ in \ mathbb {N} \}}$
${\ displaystyle h_ {t}}$ ${\ displaystyle h_ {t} (\ varepsilon) = \ varepsilon}$ ${\ displaystyle h_ {t} (w0): = h_ {t} (w) 01}$ ${\ displaystyle h_ {t} (w1): = h_ {t} (w) 10}$

Operations on formal languages

Two languages above the alphabet and above the alphabet are banal, both languages are also above , that is, sets of words . Therefore are also ${\ displaystyle L_ {1}}$ ${\ displaystyle \ Sigma _ {1}}$ ${\ displaystyle L_ {2}}$ ${\ displaystyle \ Sigma _ {2}}$ ${\ displaystyle \ Sigma _ {1} \ cup \ Sigma _ {2}}$ ${\ displaystyle (\ Sigma _ {1} \ cup \ Sigma _ {2}) ^ {*}}$

the union ${\ displaystyle L_ {1} \ cup L_ {2}}$
the average ${\ displaystyle L_ {1} \ cap L_ {2}}$
the difference ${\ displaystyle L_ {1} \ setminus L_ {2}}$

Languages about . ${\ displaystyle \ Sigma _ {1} \ cup \ Sigma _ {2}}$

Further operations on languages are:

Concatenation

The concatenation of two languages and is the language of words that by cascading letters ( concatenation ) per any word from and from arises: ${\ displaystyle L_ {1}}$ ${\ displaystyle L_ {2}}$ ${\ displaystyle u}$ ${\ displaystyle L_ {1}}$ ${\ displaystyle v}$ ${\ displaystyle L_ {2}}$

{\ displaystyle L_ {1} \ circ L_ {2}: = \ {uv \ mid u \ in L_ {1}, v \ in L_ {2} \}}

.

For example, the concatenations of different languages above the alphabet are : ${\ displaystyle \ Sigma = \ {a, \, b \}}$

{\ displaystyle \ {a \} \ circ \ {ab \} = \ {aab \}}

{\ displaystyle \ {a, \, bb \} \ circ \ {aa, \, b \} = \ {aaa, \, ab, \, bbaa, \, bbb \}}

{\ displaystyle \ {abb, \, bab \} \ circ \ {\ varepsilon, \, aab, \, bb \} = \ {fig, \, bab, \, abbaab, \, babaab, \, abbbb, \ , babbb \}}

The neutral element of concatenation is language, which only contains the empty word . The following applies to any language : ${\ displaystyle L}$

{\ displaystyle L \ circ \ {\ varepsilon \} = \ {\ varepsilon \} \ circ L = L}

The absorbing element of the concatenation is empty language, so that for every language : ${\ displaystyle L \ subseteq \ Sigma ^ {*}}$

{\ displaystyle L \ circ \ {\} = \ {\} \ circ L = \ {\}}

The concatenation of languages, like the concatenation of words, is associative , but not commutative . For example:

{\ displaystyle (\ {a, \, bab \} \ circ \ {a, \, b \}) \ circ \ {ab \} = \ {a, \, bab \} \ circ (\ {a, \ , b \} \ circ \ {ab \}) = \ {aaab, \, abab, \, babaab, \, babbab \}}

but:

{\ displaystyle \ {a, \, bab \} \ circ \ {a, \, b \} = \ {aa, \, ab, \, baba, \, babb \} \ not = \ {aa, \, abab, \, ba, \, bbab \} = \ {a, \, b \} \ circ \ {a, \, bab \}}

In addition, since the power set of Kleen's envelope of any alphabet (which is equal to the set of all languages that can be formed from ) is closed with respect to concatenation, it forms a monoid together with the concatenation as the operator and the language of the empty word as the neutral element . ${\ displaystyle \ Sigma}$ ${\ displaystyle \ Sigma}$

power

The power of a language is the -fold concatenation of this language with itself. It is defined recursively with: ${\ displaystyle L ^ {n}}$ ${\ displaystyle n}$

{\ displaystyle L ^ {0}: = \ {\ varepsilon \}}

{\ displaystyle L ^ {n + 1}: = L ^ {n} \ circ L}

(for )

{\ displaystyle n \ in \ mathbb {N} _ {0}}

For example:

{\ displaystyle \ lbrace aa, \, abab, \, bbab, \, ba \ rbrace ^ {0} = \ lbrace \ varepsilon \ rbrace}

{\ displaystyle \ lbrace a, \, b \ rbrace ^ {2} = \ lbrace a, \, b \ rbrace ^ {1} \, \ circ \, \ lbrace a, \, b \ rbrace = (\ lbrace a , \, b \ rbrace ^ {0} \, \ circ \, \ lbrace a, \, b \ rbrace) \, \ circ \, \ lbrace a, \, b \ rbrace = (\ lbrace \ varepsilon \ rbrace \ , \ circ \, \ lbrace a, \, b \ rbrace) \, \ circ \, \ lbrace a, \, b \ rbrace = \ lbrace aa, \, ab, \, ba, \, bb \ rbrace}

{\ displaystyle \ lbrace a \ rbrace ^ {4} = \ lbrace a \ rbrace \, \ circ \, \ lbrace a \ rbrace \, \ circ \, \ lbrace a \ rbrace \, \ circ \, \ lbrace a \ rbrace = \ lbrace aaaa \ rbrace}

In particular, the following applies to each single-element, formal language (with ) and each : ${\ displaystyle L = \ lbrace w \ rbrace}$ ${\ displaystyle w \ in \ Sigma ^ {\ ast}}$ ${\ displaystyle n \ in \ mathbb {N} _ {0}}$

{\ displaystyle L ^ {n} = \ lbrace w \ rbrace ^ {n} = \ lbrace w ^ {n} \ rbrace}

Kleene - * - degree and Kleene - + - degree

The Kleene - * - closure (Kleene envelope, also called iteration ) and the Kleene - + - closure (positive envelope) of a formal language are defined by the union of the power languages of : ${\ displaystyle L ^ {*}}$ ${\ displaystyle L ^ {+}}$ ${\ displaystyle L}$ ${\ displaystyle L}$

{\ displaystyle L ^ {*}: = \ bigcup _ {i \ in \ mathbb {N} _ {0}} L ^ {i}}

{\ displaystyle L ^ {+}: = \ bigcup _ {i \ in \ mathbb {N}} L ^ {i}}

Important formal language classes

In 1956 Noam Chomsky established a hierarchy of formal grammars that produce different types of formal languages. This is known today as the Chomsky hierarchy . A distinction is made here between type 0, type 1, type 2 and type 3: recursively enumerable , context-sensitive , context-free or regular languages.

Aristid Lindenmayer proposed a rule system in which replacement steps are carried out in parallel at every point in every step. These systems are called Lindenmayer systems .

With semi-Thue systems , languages may be set, which are derived from starting words.

With Church-Rosser systems languages explains let their words be reduced to a terminal word.

Term rewriting systems generate the set of terms that are equivalent to a starting term.

We obtain generalizations of formal languages with graph grammars with which we can generate graph languages.

Hypergraph grammars produce hypergraphs , a generalization of graphs.

Historical

Gottlob Frege's conceptual writing is considered to be one of the first formal languages, as Frege wrote "the language of formulas of pure thought". Axel Thue's Semi-Thue system , which was introduced in 1914 and can be used to transform strings, also influenced the development of formal grammars.

Quote

Today's basic research has been mastered

“[…] From the spirit of mathematics. [...] It has been mathematized through to the extreme limits of what can be achieved today on the basis of an advanced formalization technique. The aim of this research is a fairly ambitious goal. It is the mastery of the greatest possible number of deep-seated problems from the field of basic research with a kind of precision that can be described as 'precision in the smallest parts'. [...]

As in mathematics, the desired accuracy can only be achieved through the creation of precision languages , the desired accuracy in the smallest parts only through the creation of precision languages, the degree of accuracy of which the degree of accuracy of the most highly developed mathematical precision language of the present age, the language of set theory and the Language far surpasses modern algebra . [...] Such a precision language is a formalized scientific language. [...] a tool whose performance can be compared with the resolution of an electron microscope . [...] Leibniz was the first to demand precision languages with this level of accuracy. "

- Heinrich Scholz in 1941: A new form of basic research

Heinrich Scholz met Konrad Zuse in 1944 , who was working on his plan calculus as part of his doctoral thesis . In March 1945, Scholz expressed his appreciation for the application of his logic calculus.

literature

Lars Peter Georgie: Predictability, Complexity, Logic . Vieweg, Braunschweig Wiesbaden,
- A third edition appeared in 1995.
- English edition: Computability, Complexity, Logic . Published in the series: Studies in logic and the foundations of mathematics . North Holland, Amsterdam 1985.

A presentation of formal languages in the context of computability theory, logic and complexity theory. Makes high demands on the reader, but provides deep insights.

Michael A. Harrison: Introduction to Formal Language Theory . In the series: Series in Computer Science , Addison-Wesley, 1978.

A very detailed and highly praised introduction.

John E. Hopcroft and Jeffrey D. Ullman : Introduction to Automata Theory, Formal Languages, and Complexity Theory . Addison-Wesley, 1988.
- English original: Introduction to Automata Theory, Languages and Computation . Addison-Wesley, 1979.
- A revised third edition in German was published in 1994 by Oldenbourg R. Verlag GmbH. In 2004, Addison-Wesley published a second revised edition.

The English original is the most cited book in theoretical computer science. The evidence is occasionally misrepresented in older German translations. This book has been translated into numerous languages.

Grzegorz Rozenberg and Arto Salomaa: The Mathematical Theory of L-Systems . Academic Press, New York, 1980.

The most detailed book on L-systems.

Grzegorz Rozenberg and Arto Salomaa (editors): Handbook of Formal Languages . Volume I-III, Springer, 1997, ISBN 3-540-61486-9 .

A detailed overview of the most important areas of formal languages is presented by academics who are active in this area.

Arto Salomaa: Formal Languages . Springer, 1978.
- English original: Formal Languages . Academic Press, 1973.
Ingo Wegener: Theoretical Computer Science . Teubner Stuttgart, 1993. ISBN 3-519-02123-4 .

In the representation of the formal languages, the complexity of the formal language constructions is always dealt with. Otherwise this can only be found in the original literature.

U. Hedtstück: Introduction to Theoretical Computer Science - Formal Languages and Automata Theory . Oldenbourg Verlag, Munich 2000, ISBN 3-486-25515-0
S. Abramsky, Dov M. Gabbay, TSE Maibaum (eds.): Handbook of logic in computer science. Vol. 5: Logical and algebraic methods. Oxford University Press 2000, ISBN 0-19-853781-6
Mogens Nielsen, Wolfgang Thomas: Computer Science Logic. Springer 1998, ISBN 3-540-64570-5

Individual evidence

↑ Chomsky, Noam (1956). "Three models for the description of language". IRE Transactions on Information Theory (2): 113-124.
↑ Martin Davis (1995). Influences of Mathematical Logic on Computer Science. In Rolf Herken. The universal Turing machine: a half-century survey. Jumper. P. 290. ISBN 978-3-211-82637-9 .
↑ Ronald V. Book and Friedrich Otto, String-rewriting Systems, Springer, 1993, ISBN 0-387-97965-4 . P. 36.
^ In: Research and Progress No. 35/36, year 1941, p. 382ff.
↑ Hartmut Petzold , Modern Calculators. The industrialization of computer technology in Germany. Munich, CH Beck Verlag, 1992.

[1] Chomsky, Noam (1956). "Three models for the description of language". IRE Transactions on Information Theory (2): 113-124.

[2] Martin Davis (1995). Influences of Mathematical Logic on Computer Science. In Rolf Herken. The universal Turing machine: a half-century survey. Jumper. P. 290. ISBN 978-3-211-82637-9 .

[3] Ronald V. Book and Friedrich Otto, String-rewriting Systems, Springer, 1993, ISBN 0-387-97965-4 . P. 36.

[4] In: Research and Progress No. 35/36, year 1941, p. 382ff.

[5] Hartmut Petzold , Modern Calculators. The industrialization of computer technology in Germany. Munich, CH Beck Verlag, 1992.