Alphabet (computer science)
In computer science, an alphabet is a finite set of distinguishable symbols, which are also called characters or letters .
Alphabets are often referred to with the formula symbol ( sigma ), more rarely is used as a formula symbol as an abbreviation for vocabulary ( English vocabulary ). The Kleen's shell of the alphabet describes the set of all words (ie finite sequences) above the alphabet that can be formed from symbols . Formally, this is given by the disjoint union
- .
(The countable, infinite sequences of characters from the alphabet are denoted by, see: = . Denotes the entire set of finite sequences and infinite sequences of characters.)
To every word of unique one of with is length of the word . Operations on words are concatenation (chaining), power (setting several times in a row), mirroring; A word can be part of another word (infix, English substring ) - for more details see word (Theoretical Computer Science) : Alphabets thus provide the character inventory for words and thus form the basis for formal languages .
A distinction must be made between the alphabet made up of single characters and the words of different lengths that are formed using this alphabet. In particular, the empty word (the trivial word with length 0) never belongs to the alphabet because it is in every word set. Set of non-empty words:
- .
definition
An alphabet is a finite set. Often it is also required that the set is not empty. The elements of an alphabet are called letters , symbols, or characters . According to this definition, the alphabet is a set of characters, equivalent to a character set . The word character set , however, often also means a character encoding . Alphabets, on the other hand, are independent of any coding.
According to DIN 44300, on the other hand, an alphabet is a totally ordered, finite set of distinguishable symbols. More precisely, it is a set of characters together with a total order , see also lexicographical order .
Differentiation from natural language
In computer science, the alphabet is a generalization of the common alphabets of natural languages . For example, the alphabet of the Latin letters is also an alphabet in the sense of computer science. In theoretical computer science, however, there are often alphabets whose elements are symbols that are represented with several letters . For example, there is an alphabet with three elements. They can be put together in any order, such as to .
Here the arbitrariness of the symbols is particularly important: which characters are used for the elements of the alphabet is irrelevant as long as they can be distinguished from one another. The character string can stand for a tone sequence, for example, but also for a program control with three different commands.
In this context, it should also be noted that in computer science, any sequence of characters in an alphabet is called a word . In many computer languages, the English term string is used for this . In computer science, the character string is also a word above the Latin alphabet .
Examples
- With the help of the alphabet , all natural numbers can be formed in the decimal system . In numerical theory, a distinction is made between digits and numerical representations according to the distinction between characters of an alphabet and words above this alphabet . A number is then an abstraction, namely the meaning ( semantics , here: numerical value) of a syntactically correct number representation.
- The basic form of the Roman number system is based on the alphabet Σ = {I, V, X, L, C, D, M} (with various extensions for large numbers ). Here, however, the rules as to how the character string must be designed to be considered a word of the Roman number system are complex (IV instead of IIII, larger units further to the left than smaller ones, ...). However, they can be represented by a formal grammar . The character strings 13 and XIII are different representations of the same (abstract) number.
- Two different alphabets can be specified for the Morse code , which describe the Morse code on different levels: First there is the alphabet or , from which the set of Morse symbols is formed on the basis of the individual letter frequencies . In addition to letters and numbers, SOS ( ) is also a Morse code, as it is morse without a break between dit and dah . Apart from this exception, the characters in a message are generally not simply morsed one after the other, rather a short pause is inserted between the individual characters. This is necessary because some characters also start other characters; see prefix code . So Morse code itself consists entirely of the characters and the break between the characters: .
These examples are intended to make it clear that the structure of a complex communication system can be described by possibly hierarchically structured pairs of alphabets and assigned languages.
Individual references and comments
- ↑ In connection with a formal grammar and the formal language generated by it , the character set of the formal language is called the terminal alphabet and is often referred to by the sign (instead of ). In addition, a formal grammar needs a disjoint , non-empty set of non-terminals ( variables ), often designated with (less often with < ); formally, this is also an alphabet. Nonterminal must not appear in words from . The (disjoint) union of terminals and non-terminals is then the entire vocabulary , often referred to as (or even ).
- ^ Christian Wagenknecht, Michael Hielscher: Formal languages, abstract automata and compilers: textbook and workbook for basic studies and advanced training. ISBN 3-8348-0624-2 , p. 20. (online)
- ↑ Klaus Reinhardt: Priority payment machines and the synchronization of half-track languages ( Memento from January 17, 2018 in the Internet Archive ), Faculty of Computer Science at the University of Stuttgart; PhD thesis 1994
- ↑ Katrin Erk, Lutz Priese: Theoretical Computer Science: A Comprehensive Introduction . 2., ext. Edition. Springer-Verlag, 2002, ISBN 3-540-42624-8 , pp. 27 .
- ^ Juraj Hromkovič : Theoretical Computer Science . Formal languages, predictability, complexity theory, algorithms, communication and cryptography (= Teubner guidelines for computer science ). 2. about. u. exp. Edition. BG Teubner Verlag, 2004, ISBN 3-519-10332-X , p. 33 ( limited preview in Google Book search).
- ^ Susanna S. Epps: Discrete Mathematics with Applications . 4th edition. Brooks Cole Pub Co, 2010, ISBN 0-495-39132-8 , pp. 781 ( limited preview in Google Book search).
- ↑ Alexandru Mateescu, Arto Salomaa: Formal Languages: an Introduction and a Synopsis . In: Grzegorz Rozenberg, Arto Salomaa (Ed.): Handbook of formal languages . Word, Language, Grammar. Vol. 1. Springer-Verlag, 1997, ISBN 3-540-60420-0 , pp. 10 ( limited preview in Google Book search).
- ^ Manfred Broy : System structures and theoretical computer science . In: Computer Science . A basic introduction. 2. revised Edition. tape 2 . Springer, Berlin 1998, ISBN 3-540-64392-3 , pp. 191 ( limited preview in Google Book search).
- ↑ Hans-Jürgen Appelrath, Jochen Ludewig: Computer science script . A conventional introduction. 5th edition. BG Teubner, Stuttgart / Leipzig 2000, ISBN 3-519-42153-4 , p. 24 ( limited preview in Google Book search).
- ^ Gerhard Goos, Friedrich L. Bauer: Informatik 1 . An introductory overview. 4. verb. Edition. Springer, Berlin / Heidelberg 1991, ISBN 3-540-52790-7 , pp. 29 ( limited preview in Google Book search).