Regular expression

from Wikipedia, the free encyclopedia

In theoretical computer science, a regular expression ( English regular expression , abbreviation RegExp or Regex ) is a character string that is used to describe sets of character strings with the help of certain syntactic rules. Regular expressions are mainly used in software development . In addition to implementations in many programming languages , many text editors also process regular expressions in the " Find and Replace " function". Wildcards are a simple use case of regular expressions .

Regular expressions can be used as filter criteria in the text search by matching the text with the regular expression pattern. This process is also called pattern matching . For example, it is possible to search for all words from a word list that begin with S and end with D without having to explicitly specify the letters in between or their number.

The concept of the regular expression goes back essentially to the mathematician Stephen Kleene , who used the similar term regular set .

Regular expressions in theoretical computer science

Theoretical foundations

Regular expressions describe a family of formal languages and thus belong to theoretical computer science . These languages, the regular languages , are at the lowest and thus the weakest level of expression in the Chomsky hierarchy (type 3). They are generated by regular grammars .

For every regular expression there is a finite automaton that accepts the language specified by the expression. A corresponding (nondeterministic) finite automaton can be constructed from a regular expression with the Thompson construction . This makes regular expressions relatively easy to implement. Conversely, there is a regular expression for every finite automaton that describes the language accepted by the automaton. A corresponding regular expression can be constructed from a nondeterministic finite automaton using Kleene's algorithm . Kleene's algorithm usually produces very long regular expressions. The state elimination (German actually: "state elimination") usually provides shorter regular expressions in practice. In the highest case ("worst case"), however, both algorithms return regular expressions of the length , with the number of characters in the underlying alphabet and the number of states in the machine.

syntax

The syntax precisely defines what regular expressions look like.

Regular expressions are always defined using a given set of characters , the so-called alphabet . Regular expressions are based on exactly three operations: alternative, concatenation and repetition. The formal definition looks like this:

  1. (the special symbol for the empty set ) is a regular expression.
  2. for all is (a character from the underlying alphabet representation) a regular expression.
  3. If and are regular expressions, then (alternative), ( concatenation ) and ( Kleene shell , Kleene star) are also regular expressions.

The symbol is used instead for alternative . Then you write . As an alternative, there is also an operator symbol for concatenation; one then writes .

You can also allow additional constants and operations, provided that their effect can also be described using the basic rules mentioned above. In the literature one can find, among other things, the regular expression or the positive Kleenesche shell , which can be regarded as an abbreviation of .

If you give a precedence of the operators , you can do without some brackets. The order of precedence is usually Kleene star over concatenation over alternative . Instead , the spelling is sufficient .

The number of nested * operators is called the star height .

semantics

The semantics of regular expressions precisely define what formal meaning the syntax of regular expressions has.

A regular expression describes a formal language, i.e. a set of words (strings). The definition of the semantics can be described analogously to the syntax definition. It denotes the formal language that is specified by the regular expression .

  1. The special symbol for the empty set specifies the empty language.
  2. for all true
    Each representative of a character from the alphabet specifies the language that contains only that character.
  3. are and regular expressions, the following applies:
    The alternative between two terms describes the language that results from the union of the two languages ​​described by the two terms.
    The concatenation of two expressions describes the language that only contains the words that have a word from the language described by the first expression as a prefix and whose immediately following residual suffix is ​​a word from the language described by the second expression.
    The Kleen's shell of regular expressions describes the Kleen's shell of the language described by .

If the syntax definition of regular expressions also contains the constant , then its meaning is defined as , i.e. the language that only contains the empty word .

The empty word is a word of a formal language ( ) and is therefore not a regular expression. The language contains only the empty word, but can also be without the constant by a regular expression to describe, for example: . However, a visual distinction is not always made between a regular expression and the associated language, so that instead of being used as a regular expression for the language , the distinction between and and between and can be omitted.

Examples

If the alphabet from the letter , and is, therefore , then the following languages with the corresponding regular expressions can be described:

  • The language of all words, which consists of any number or any number :
    Syntax: . Semantics:
  • The language of all words that start with:
    Syntax: . Semantics:
  • The language of all words that begin and end with and only consist of any number in between :
    Syntax: . Semantics:
  • The language of all words that consist of two characters but do not contain any :
    Syntax: . Semantics:

Use of regular expressions

Ken Thompson used this notation in the 1960s to build qed (a previous version of the Unix editor ed ) and later to write the grep tool . Since then, very many programs and libraries of programming languages ​​have implemented functions to use regular expressions to search for and replace strings. Examples are the programs sed , grep, emacs , the programming languages Perl and Tcl and standard libraries for the programming languages C , C ++ , Java , JavaScript , Python , PHP , Ruby and the .NET framework. The word processor and the spreadsheet of the Office package OpenOffice.org offer the ability to search using regular expressions in the text.

There are differences in functionality and syntax between different Regexp implementations. In programming languages, the Perl Compatible Regular Expressions (PCRE) have prevailed, which are based on the implementation in Perl 5.0. Besides, in POSIX.2 between "basic" regular expressions (basic regular expressions) and "extended" regular expressions (extended regular expressions) distinguished.

Some programs, for example the text editor Vim , offer the possibility to switch back and forth between different regexp syntaxes.

Regular expressions play an important role in the lexical analysis of source texts , for example in compilers or for syntax highlighting in editors. A lexical scanner breaks the source text down into so-called tokens (keywords, operators, ...) using regular expressions . Since most programming languages are context-free languages , regular expressions are not powerful enough to describe their syntax. Therefore, the syntactic analysis that follows with compilers is usually done by a separate program, the parser .

Regular expressions also play a role in bioinformatics . They are used in protein databases to describe protein motifs. The regular expression

W-x{9,11}-[VFY]-[FYW]-x{6,7}-[GSTNE]-[GSTQCR]-[FYW]-R-S-A-P

describes, for example, a protein domain in PROSITE . The regular expression above says the following: At the beginning choose the amino acid tryptophan (one letter code W), then choose 9 to 11 amino acids freely, then choose either V, F or Y, then choose either F, Y or W, then again 6 to 7 Amino acids free, then either G, S, T, N or E, then either G, S, T, Q, C or R, then F, Y or W, then R then S then A then P.

Regular Expressions in Practice

Most implementations today support extensions such as backreferences. These are no longer regular expressions in the sense of theoretical computer science, because the expressions expanded in this way no longer necessarily describe languages ​​of type 3 of the Chomsky hierarchy .

The following syntax descriptions relate to the syntax of common implementations with extensions, so they only partially correspond to the above definition from theoretical computer science.

A common use of regular expressions is to find special strings in a set of strings. The description given below is a (often used) convention to specifically implement concepts such as character class , quantification , linking and summarizing . Here, a regular expression is formed from the characters of the underlying alphabet in combination with the metacharacters [ ] ( ) { } | ? + - * ^ $ \ . (sometimes context-dependent), and in some implementations too : ! < =. The meta-property of a character can be canceled by a preceding backslash . All other characters in the alphabet stand for themselves.

Character literals

Those characters that have to match directly (literally, literally) are also noted directly. Depending on the system, there are also options for specifying the character using the octal or hexadecimal code ( or ) or the hexadecimal Unicode position ( ). \ooo\xhh\uhhhh

One character from a selection

With square brackets allows a character selection define ( [and ]). The expression in square brackets then stands for exactly one character from this selection. Within these character class definitions, some symbols have different meanings than in their normal context. The meaning of a symbol depends in part on the context in which it appears within the brackets.

For example, a circumflex ^ at the beginning of a character class definition means that the character class is negated or inverted (in the sense of complement formation ). However, if a circumflex is found anywhere else in the definition, it is to be understood literally. The meaning of the hyphen ( -) is also context-dependent . In addition, the regex engines (for example POSIX and PCRE) differ from one another in some points. If there is a hyphen -between two characters in the class definition, for example [a-g], it is to be understood as a bis-dash, i.e. as a description of a character interval or character range in relation to the ASCII table. The example given would be equivalent to [abcdefg]. Hyphens at the beginning or end of a character class are interpreted as the character itself.

Examples of character selection
[egh] one of the characters "e", "g" or "h"
[0-6] a number from "0" to "6" (dashes are indicators for a range)
[A-Za-z0-9] any Latin letter or number
[^a] any character except "a" ("^" at the beginning of a character class negates the same)
[-A-Z], [A-Z-](or [A-Z\-a-z], however, not according to POSIX) Selection also contains the hyphen "-" if it is the first or last character in the list of a character class or, in the PCRE, if its metafunction within a selection is canceled by a preceding "\" character

Predefined character classes

There are predefined character classes that are not supported in the same way by all implementations, since they are only short forms and can also be described by a selection of characters . Important character classes are:

\d d igit a number, so [0-9] (and possibly also other numbers in Unicode , e.g. Bengali digits )
\D no d igit a character that is not a digit, so [^ \ d]
\w w ord character a letter, a number or the underscore, so [a-zA-Z_0-9] (and possibly also non-Latin letters, e.g. umlauts)
\W no w ord character a character that is neither a letter nor a number nor an underscore, so [^ \ w]
\s white s pace usually at least the space and the class of the control characters \ f, \ n, \ r, \ t and \ v
\S no white s pace a character that is not whitespace, so [^ \ s]

A period ( .) means that (almost) any character can be in its place. Most RegExp implementations do not see line breaks as any character by default, but in some programs this can be achieved using the so-called single-line modifier s(for example in /foo.bar/s).

In many newer implementations, classes can also be specified within the square brackets after POSIX , which in turn contain square brackets. For example, they are:

Examples of character classes, sorted hierarchically
  • [:cntrl:]- control characters. In ASCII these are the characters 00 to 1F and 7F (DEL).
  • [:print:] - Printable characters: [: alnum:], [: punct:] and spaces
    • [:space:]- Whitespace : Horizontal and vertical tab characters, line and page feed , carriage return and space ZK1
    • [:graph:] - Graphic characters: [: alnum:] or [: punct:]
      • [:punct:]- Punctuation marks like:! "# $% & '() * +, -. /:; <=>? @ [\] ^ _` {|} ~
      • [:alnum:]- Alphanumeric characters : [: alpha:] or [: digit:]
        • [:xdigit:]- Hexadecimal digits: 0 to 9, A to F, a to f.
          • [:digit:] - The digits 0 to 9
        • [:alpha:] - Letters: [: lower:] or [: upper:]
          • [:lower:]- Lower case letters ZK2 : not necessarily only from a to z
          • [:upper:]- Capital letters ZK2 : not necessarily only from A to Z
Remarks:
ZK1The character with the Unicode number 160 (hex: A0) (corresponds to the HTML Entity & nbsp;), also known as “ protected space ”, may not be found by the class [: space:] and must be identified separately using the coding point .
ZK2What letters are depends on the locale in common operating systems, i.e. it depends on the region and language set.

Quantifiers

Quantifiers (English quantifiers , also quantifiers or repetition factors ) allow the previous expression to be allowed in various multiples in the character string.

? The preceding expression is optional, it can occur once, but does not need it, that is, the expression occurs zero or once. (This corresponds to ) {0,1}
+ The preceding expression must appear at least once, but can also appear multiple times. (This corresponds to ) {1,}
* The preceding expression can appear as often as you like (not even once). (This corresponds to ) {0,}
{n} The preceding expression must appear exactly n times. (This corresponds to ) {n,n}
{min,} The foregoing expression must at least min times occur.
{min,max} The preceding expression must occur at least a minimum of and a maximum of a maximum of times.
{0,max} The preceding expression may appear a maximum of times.

The quantifiers relate to the preceding regular expression, but not necessarily to the match it found. For example, a+an “a” or “aaaa” is used, but [0-9]+not only corresponds to repeated identical digits, but also to sequences of mixed digits, for example “072345”.

Further examples are:

  • [ab]+ corresponds to "a", "b", "aa", "bbaab" etc.
  • [0-9]{2,5}corresponds to two, three, four or five digits in a row, e.g. B. "42" or "54072", but not the character strings "0", "1.1" or "a1a1".

If a character string should only consist of the searched pattern (and not just contain it), in most implementations it must be explicitly defined that the pattern should extend from the beginning ( \Aor ^) QF1 to the end of the character string ( \Z, \zor $) QF1 . Otherwise, for example, [0-9]{2,5}also recognizes the substring “12345” for the character string “1234507”. For the same reason, for example a*, there would always be a hit, since every character string - even the empty word "" - contains at least 0 times the character "a".

Quantifiers are (English by default "greedy" greedy implemented). That is, a regular expression is resolved to the closest possible match. However, since this behavior is not always wanted, quantifiers can be declared as "frugal" or "cautious" (English non-greedy , reluctant ) in many newer implementations . For example, a question mark is placed ?after the quantifier in Perl or tcl . The implementation of frugal quantifiers is comparatively complex and slow during the search process (requires backtracking ), which is why some implementations expressly avoid this e.g. B. sed.

Example (Perl syntax)
Assuming the regular expression is A.*Bapplied to the string "ABCDEB", it would find it as "ABCDEB". With the help of the “non-greedy” quantifier *?, the now modified expression - ie A.*?B - only the character string “AB”, thus aborts the search for the first “B” found. An equivalent regular expression for interpreters that do not support this quantifier would be A[^B]*B.
QF1The characters ^and $match in multiline mode, i.e. if the m modifier is set, the beginnings and ends of lines also.

Possessive behavior

A variant of the greedy behavior described above is possessive matching . However, since backtracking is prevented here, characters that match are not released again. Therefore, the synonymous terms atomic grouping , independent subexpression or non-backtracking subpattern can also be found in the literature . The syntax for these constructs varies with the different programming languages. Such partial expressions ("subpattern") were originally formulated in Perl . In addition, since Perl 5.10, there are the equivalent, already available in Java possessive quantifiers , , and . (?>Ausdruck)++*+?+{min,max}+

example
Assuming the regular expression is A.*+Bapplied to the string "ABCDEB" , it would not find a match. When processing the regular expression, the part would .*+match up to the end of the character string. However, in order to find the entire expression, one character - here the "B" - would have to be released again. The possessive quantifier forbids this due to the suppressed backtracking, which is why no successful match can be found.

Groupings and backward references

Expressions can be summarized with round brackets (and : For example, an "abc" or an "abcabc" etc. ) (abc)+

Some implementations save the found matches of groupings and allow them to be reused in regular expressions or text replacement. These are backreferences (English back references called). The notation or is often used for this, where n corresponds to the nth grouping. A special position is n = 0, which usually stands for the match of the entire regular expression. \n$n

example
A search and replace with AA(.*?)BBas a regular search expression and \1as a replacement replaces all character strings enclosed by AA and BB with the text contained between AA and BB . That means AA and BB and the text in between are replaced by the text that was originally between AA and BB , so AA and BB are missing in the result.

Regular expression interpreters that allow backward references no longer conform to type 3 of the Chomsky hierarchy . The pumping lemma can be used to show that a regular expression that determines whether 1the same number of 0is in a character string is not a regular language .

There are also groupings that do not generate a backward reference (English non-capturing ). The syntax for this in most implementations is (?:) . Regexp documentation indicates that the creation of backward references should always be avoided if they are not later accessed. The creation of the references costs execution time and occupies space to store the match found. In addition, the implementations only allow a limited number of backward references (often only a maximum of 9).

example

The regular expression \d+(?:-\d+)*can be used to find sequences of number sequences separated by hyphens without receiving the last number sequence separated by a hyphen as a back reference.

example

A date in the format MM/DD/YYYYshould be converted into the format YYYY-MM-DD.

  1. With the help of the expression ([0-1]?[0-9])\/([0-3]?[0-9])\/([0-9]{4}), the three groups of numbers are extracted.
  2. The \3-\1-\2individual groups are converted into the correct format with the replacement expression .

Alternatives

You can |allow alternative expressions with the symbol.

example
ABC|abcmeans "ABC" or "abc", but z. B. not "Abc".

More characters

In order to support the applications on the computer that are often related to character strings, the following characters are usually defined in addition to those already mentioned:

^ stands for the beginning of the line (not to be confused with ^the character selection using [and ]).
$ can stand for the end of the line or the end of a character string, depending on the context, although in some implementations a “\ n” may follow. The actual ending matches \z.
\ if necessary, cancels the meta-meaning of the next character (see masking characters ). For example, the expression (A\*)+allows the character strings "A *", "A * A *" and so on. In this way, a period “.” Can also be searched \.for while searching for \with \\.
\b empty string at the beginning or at the end of a word
\B empty string that does not form the beginning or the end of a word
\< empty string at the beginning of the word
\> empty string at the end of the word
\n a line break in Unix format
\r a line break in the (old, i.e. before 1999) Mac format
\r\n a line break in DOS and Windows format
\t a horizontal tab character
example
[^ ]$ means: The character string must consist of at least one character and the last character cannot be a space.

Look-around assertions

Perl Version 5 also introduced look-ahead and look-behind assertions in addition to the usual regular expressions (such as “forward-looking” or “backward-looking” assumptions or assertions), which are summarized under the term look-around assertions . These constructs extend the regular expressions by the possibility of formulating context-sensitive conditions without finding the context itself suitable. That is, you want to find all the strings "Sport", where the string "simplistic" follows, but without the string found "simplistic" contains the string itself, this would be a look-ahead assertion possible Sport(?=verein). In the example sentence “An athlete participates in sports in a sports club.”, The regular expression would therefore match the last occurrence of “Sport”, since only this is followed by the string “club”; however, it would not match the substring “sports club”.

Due to the property that the specified context (in the example “club”) is specified, but is not an explicit part of the appropriate character string (here “sport”), the zero-width attribute is usually also mentioned in connection with assertions . The full designations are thus - depending on whether a certain context is required (positive) or forbidden (negative) - zero-width positive / negative look-ahead / behind assertions. The names of the directions come from the fact that Regexp parsers always process a character string from left to right.

definition designation Explanation Notation
(?=Ausdruck) positive look-ahead assertion Expression mustfollow theabove expression Ausdruck(?=Ausdruck)
(?!Ausdruck) negative look-ahead assertion Expression must notfollow theabove expression Ausdruck(?!Ausdruck)
(?<=Ausdruck) positive look-behind assertion Expression mustprecede thefollowing expression (?<=Ausdruck)Ausdruck
(?<!Ausdruck) negative look-behind assertion Expression mustnot precede thefollowing expression (?<!Ausdruck)Ausdruck

Look-arounds are not only supported by Perl and PCRE , but also by Java , .NET and Python , among others . From version 1.5, JavaScript interprets positive and negative look-aheads.

example
\s(?=EUR)stands for a "whitespace" character (i.e., space or tab) followed by the string EUR. In contrast to \sEUR, here EURdoes not belong to a matching character string ("matched character string").

Conditional expressions

Conditional expressions are relatively uncommon. These can be used in Perl, PCRE and the .NET framework, among others. Python offers only limited functionality for such expressions in connection with look-around assertions .

(?(Bedingung)wahr-Ausdruck|falsch-Ausdruck) If the given expression “condition” is found, the “true expression” is used. If the search term cannot be found, the “wrong term” is used.

example

Expressing (\()?\d+(?(1)\))strings are like 1, (2), 34or (567), but not 3)found.

literature

Regular expressions

Regular expressions and natural languages

  • Kenneth R. Beesley, Lauri Karttunen: Finite-State Morphology. Distributed for the Center for the Study of Language and Information. 2003. 2003 Series: (CSLI-SCL) Studies in Computational Linguistics.

Regular expressions and automata theory

Research literature

  • Stephen C. Kleene: Representation of Events in Nerve Nets and Finite Automata. In: Claude E. Shannon, John McCarthy (Eds.): Automata Studies. Princeton University Press, 1956, pp. 3-42.

Web links

software

Individual evidence

  1. a b Stephen C. Kleene : Representation of Events in Nerve Nets and Finite Automata . In: Claude E. Shannon , John McCarthy (Eds.): Automata Studies . Princeton University Press, 1956, pp. 3-42 .
  2. ^ Alfred V. Aho , Ravi Sethi, Jeffrey Ullman : Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986
  3. ^ A b c John E. Hopcroft , Jeffry D. Ullman : Introduction to Automata Theory, Formal Languages, and Complexity Theory . Addison-Wesley, Bonn 1994, ISBN 3-89319-744-3 .
  4. Jacques Sakarovitch: The Language, the expression, and the (Small) Automaton . In: LNCS . 3845, 2006, pp. 15-30. doi : 10.1007 / 11605157_2 .
  5. POSIX specifications
  6. RE Bracket Expression , IEEE Std 1003.1, The Open Group Base Specifications, 2004
  7. Look Ahead and Look Behind Zero-Width Assertions . Regular-Expressions.info
  8. ^ Mozilla Developer Network: JavaScript Reference
  9. If-Then-Else Conditionals in Regular Expressions . Regular-Expressions.info