Backus-Naur form

from Wikipedia, the free encyclopedia

The Backus-Naur form or Backus normal form ( BNF for short ) is a compact, formal metalanguage for representing context-free grammars (type 2 grammars in the Chomsky hierarchy ). This includes the syntax of common high-level programming languages . It is also used for the notation of instruction sets and communication protocols.

Originally it was named after John W. Backus , later it was also named after Peter Naur (at the suggestion of Donald E. Knuth ) . Both were pioneers of computer science who dealt with the creation of the Algol 60 rules and in particular with the art of compiler construction . The Backus-Naur form in the Algol 60 Report made it possible for the first time to represent the syntax of a programming language in a formally exact manner, i.e. without the inaccuracies of natural languages.

There are many variants of the Backus-Naur form. The extended Backus-Naur-Form (EBNF) is a common variant which, among other things, allows a compact notation of repetitive elements. The enhanced Backus-Naur form (ABNF) is predominantly used for syntax definitions in Internet standards.

Basics

A program initially consists of characters that are visible on the screen or on paper. In addition, there are spaces and line separators. The visible signs are the terminal symbols (English terminal symbols / terminals counted).

The BNF uses so-called derivation rules ( productions ) , in which non-terminal symbols (English nonterminal symbols / nonterminals ) are defined. The character |(vertical line) serves as an alternative, the character string ::=is used for definition and the non-terminal symbols (also called syntactic variables) are <…>enclosed in angle brackets .

Alternative:

<Ziffer ausser Null> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

A digit other than zero is either a 1, or a 2, or a 3, etc. Terminal sequences can also be defined, i.e. a sequence. Terminal symbols and non-terminal symbols may appear as elements .

Sequence:

<Ziffer>              ::= 0 | <Ziffer ausser Null>
<Zweistellige Zahl>   ::= <Ziffer ausser Null> <Ziffer>
<Zehn bis Neunzehn>   ::= 1 <Ziffer>
<Zweiundvierzig>      ::= 42

So a digit is a 0 or a digit other than zero. A two-digit number is a non-zero digit followed by a digit. Forty-two is a 4 followed by a 2.

Repetitions must be defined in BNF via recursions . A derivation rule can contain the symbol on the left on the right side, for example:

<Ziffernfolge> ::= <Ziffer> | <Ziffer> <Ziffernfolge>

Read: A digit sequence is a digit or a digit followed by a digit sequence .

A sequence of digits therefore matches the symbol sequences 0, 1, 2, 10, 9870, 8970635 etc., but also to 00, 000, ... A positive number must not begin with 0. The following rule does this:

<Positive Zahl> ::= <Ziffer ausser Null> | <Ziffer ausser Null> <Ziffernfolge>

BNF and Context Free Languages

The production rules of the BNF (according to Backus) are exactly the rules allowed in context-free grammars (according to Chomsky ); so it is clear that both formalisms produce the same languages. They were created at the same time, namely at the end of the 1950s. However, there has only been a reference to the connection since 1961, namely in an overview article on metalanguages ​​by Saul Gorn, there still presented as a connection between BNF and general phrase structure grammars and only later - more precisely - limited to context-free grammars. In the following year there was an exchange of letters between Gorn and Knuth on this subject in the letters to the editor of Comm. ACM . It is plausible to assume that Chomsky and Backus developed their formalisms independently of one another and that Gorn was the first to know both approaches and thus to establish the connection.

BNF and programming languages

In order to display the syntax of programming languages ​​such as ALGOL , Pascal or Java in BNF, the key words ( IF , SWITCH ) must also be included in the terminal symbols. In a compiler they are recognized in a preliminary phase, the lexical analysis , and passed on as special characters. Comments are recognized (and often removed) by the lexical analysis, sometimes also other elements such as floating point numbers, identifiers and character strings.

This allows the entire syntax e.g. B. Represent a PASCAL program in BNF (partially abbreviated):

 <Programm>               ::= 'PROGRAM' <Bezeichner> 'BEGIN' <Satzfolge> 'END' '.'
 <Bezeichner>             ::= <Buchstabe> <Restbezeichner>
 <Restbezeichner>         ::= | <Buchstabe oder Ziffer> <Restbezeichner>
 <Buchstabe oder Ziffer>  ::= <Buchstabe> | <Ziffer>
 <Buchstabe>              ::= A | B | C | D | … | Z | a | b | … | z
 <Ziffer>                 ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 <Satzfolge>              ::= …
 …

A syntax analysis consists of tracing a program text back to the non-terminal symbol <program> . A program must therefore begin with the word PROGRAM followed by an identifier. Identifiers begin with a letter, followed by any number of letters or numbers.

The return to <Program> succeeds in

  PROGRAM Ggt BEGIN … END .

  PROGRAM DiesisteinlangerBezeichnermit123 BEGIN … END .

but not with

  Ggt BEGIN … END .         (beginnt nicht mit PROGRAM)

  PROGRAM 123 BEGIN … END . (123 ist kein Bezeichner, Bezeichner müssen mit einem Buchstaben beginnen)

example

Here is a BNF for a German postal address :

  <Post-Anschrift>  ::= <Personenteil> <Straße> <Stadt>
  <Personenteil>    ::= <Titelteil> <Namensteil> <EOL>
  <Titelteil>       ::= <Titel> |
  <Namensteil>      ::= <Vornamenteil> <Nachname> | <Vornamenteil> <Namensteil>
  <Vornamenteil>    ::= <Vorname> | <Initial> .
  <Straße>          ::= <Straßenname> <Hausnummer> <EOL>
  <Stadt>           ::= <Postleitzahl> <Stadtname> <EOL>

The formulation is:

  • A mailing address consists of a person part, followed by a street, followed by the city.
  • The person part consists of a title part and a name part, followed by a line end.
  • The title part consists of a title or is empty.
  • The name part consists of a first name part, a surname or a first name part and in turn a name part. (This rule shows the use of recursion in BNFs and represents the case where a person has several first names and / or initials.)
  • The first name part consists of a first name or an initial followed by a period.
  • A street consists of a street name, followed by a house number, followed by the end of a line.
  • A city consists of a zip code, followed by a city name, followed by the end of a line.

Note that some things (like postcode or house number) are not specified any further. It is believed that these lexical details are context sensitive or otherwise specified.

Often the title is put in square brackets, the title part is omitted. This means that the title can be empty:

Option:

 <Personenteil>    ::= [ <Titel> ] <Namensteil> <EOL>

This example is not a pure form from the Algol 60 report . The square brackets […] represent an option. They were introduced a few years later in the definition of IBM's PL / I , but are generally only recognized in EBNF .

Option:

 <Zahl> ::= [ - ] <Positive Zahl>

The minus sign is optional. The definition is equivalent to

 <Zahl> ::=  <Positive Zahl> | - <Positive Zahl>

A number is a positive number, or a minus sign followed by a positive number.

Modifications of the BNF

Syntax diagrams of the modified BNF

The alternative and the sequence are basically suitable for representing the BNF. However, the characters |, [,] cannot be distinguished from the BNF characters. Often, characters such as full stop or minus can only be recognized with difficulty.

The BNF is therefore usually slightly modified and supplemented:

  • No angle brackets <…> for non-terminals.
  • Characters as terminal symbols are placed in quotation marks ("0" | "1" ...)
  • Non-terminal symbols in lower case.
  • Keywords in capital letters.
  • Only = instead of :: =.
  • A period at the end of a rule. Multi-line rules are possible.
  ziffer            = "0" | "1" | "2" | "3" | … | "9" .
  zifferaussernull  = "1" | "2" | "3" | … | "9" .
  ziffernfolge      = ziffer | ziffer ziffernfolge .
  zahl              = [ "-" ] zifferaussernull [ ziffernfolge ] | "0" .
  programm          = PROGRAM bezeichner
                      BEGIN satzfolge END "." .

The option is sometimes not shown in square brackets, but by an appended question mark. Repetition by recursion is often cumbersome:

  • Options are represented by an appended question mark.
  • Repetitions (one or more times) are represented by an appended plus sign.
  • Optional repetitions (none, one or more times) are indicated by an appended asterisk.
  • Brackets are used for grouping
  ziffernfolge ::= ziffer+ .
  zahl         ::= ( "-" )? zifferaussernull ( ziffernfolge )? | "0" .
  bezeichner   ::= buchstabe ( buchstabe | ziffer )* .

The extended Backus-Naur form takes a different approach. It uses square brackets […] for the option, but curly brackets {…} for the optional repetition. There is no strict distinction between terminals and non-terminals. Here the example above would be shown like this:

  Ziffernfolge = Ziffer { Ziffer } ;
  Zahl         = [ "-" ] ZifferAusserNull [ Ziffernfolge ] | "0" ;
  Bezeichner   = Buchstabe { Buchstabe | Ziffer } ;

Self-definition of a (modified) BNF

A modified BNF can define itself:

  Modifiziertebnf   ::= | Satz Modifiziertebnf .
  Satz              ::= Nichtterminal ":" ":" "=" Elementliste "." .
  Elementliste      ::= | Element Elementliste .
  Element           ::= Terminal | Nichtterminal .
  Nichtterminal     ::= Kleinbuchstabe | Kleinbuchstabe Nichtterminal .
  Terminal          ::= Schluesselwort | Anf Sichtbareszeichen Anf .
  Schluesselwort    ::= Grossbuchstabe | Grossbuchstabe Schluesselwort
  Grossbuchstabe    ::= "A" | "B" | … | "Z" .
  Kleinbuchstabe    ::= "a" | "b" | … | "z" .
  Sichtbareszeichen ::= "!" | "$" | "%" | … (''alle sichtbaren Zeichen'') .
  Anf               ::= '"' .

In this version, keywords are shown as uppercase letters and nonterminals as lowercase letters. Repetitions must be defined via recursions. Of it is also exercised in their own definition ( modifiziertebnf , element list , not terminal , keyword ).

BNF and parser generators

Some parser generators use their own form of the BNF as input and use this to generate a parser for the underlying programming language.

The operating system Unix is enclosed with yacc is such a program. It generates a table-controlled parser from a BNF definition, whereby only productions (: instead of :: =) and alternatives (|) are permitted. This is necessary because yacc enables S-attribution , but no meaningful semantic type of the attribute can be assigned to an optional part. The output is a subroutine in programming language C. The underlying grammar must meet the LALR property .

See also

literature

Web links

Individual evidence

  1. Saul Gorn: Specification languages ​​for mechanical languages ​​and their processors - a baker's dozen . In: Communications of the ACM 4, 1961, pp. 336-371.
  2. ^ Anton Nijholt: Computers and Languages . North-Holland, Amsterdam 1988, ISBN 0-444-70463-9 , pp. 207-210.
This version was added to the list of articles worth reading on March 3, 2006 .