Markup language

from Wikipedia, the free encyclopedia

A markup language (abbreviated ML) is a machine-readable language for structuring and formatting text and other data. The best-known representative is the Hypertext Markup Language (HTML), the core language of the World Wide Web.

Markup languages are used to describe properties, affiliations and forms of representation of sections of a text (characters, words, paragraphs, etc., the "elements") or of a data set. This is usually done by marking them with tags.

The article deals in particular with the “separation of structure and representation” recommended by the Standard Generalized Markup Language (SGML).

Word origin and history

The typographic term markup comes from printers' jargon. Originally, it meant only the practice of setting parts of a text in type that differs from the body type, e.g. by different font sizes and typefaces, but also by underlining, letter-spacing or other printing colors. For the typesetter, the passages concerned were first marked by hand on the manuscript; this, too, was called marking up. With the further development of typography for digital texts, these marks became machine-readable languages, and the concept was expanded to include footnotes, bibliographical references, paragraphs, headings, etc. Then the idea of separating content and form (originally a catchphrase in formal sociology) became popular, so that instructions for formatting parts of the text in the source texts of documents were increasingly replaced by labels for the kind of information to be conveyed. This led in 1986 to SGML as the international markup standard (ISO 8879) and in 1998 to the specification of XML by the World Wide Web Consortium. In the years that followed, XML was also used for purposes other than marking up text documents, such as data formats ("data serialization").

What marked-up text looks like

Main features

Typical markup languages identify parts of text or other data with tags. The source texts are written using a computer-readable character set, usually ASCII or UTF-8. Often the language also offers a way to describe special characters, mostly by means of a numeric assignment (Unicode) or by naming (named character entities), for example µ as \mu in LaTeX and &micro; in HTML.
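The two ways of describing a special character can be tried out with a short script. This is only a sketch that uses Python's standard html module as a stand-in for an HTML parser; the variable names are chosen for illustration.

```python
import html
from html.entities import name2codepoint

# A named character entity and its numeric equivalents (decimal and
# hexadecimal) all denote the same character, U+00B5 (micro sign).
named = html.unescape("&micro;")
decimal = html.unescape("&#181;")
hexadecimal = html.unescape("&#xB5;")

print(named, decimal, hexadecimal)
print(name2codepoint["micro"])  # code point assigned to the name "micro"
```

All three spellings decode to the same character, so the choice between naming and numbering is a matter of readability of the source text.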

Result and code in examples

Example for …     Representation   HTML                  LaTeX                 MediaWiki wikitext
Section heading   Abschnitt        <h2>Abschnitt</h2>    \section{Abschnitt}   == Abschnitt ==
List              • Punkt 1        <li>Punkt 1</li>      \item Punkt 1         * Punkt 1
                  • Punkt 2        <li>Punkt 2</li>      \item Punkt 2         * Punkt 2
                  • Punkt 3        <li>Punkt 3</li>      \item Punkt 3         * Punkt 3
Hyperlink         W3C              <a href="">W3C</a>    \href{}{W3C}          [ W3C]
Bold text         fett             <b>fett</b>           \textbf{fett}         '''fett'''
Italic text       kursiv           <i>kursiv</i>         \textit{kursiv}       ''kursiv''

The hyperlink does not work in LaTeX in general, but it does at least with the additional package hyperref and when the output is generated in PDF format.

Examples of “representational” versus “descriptive” markup

“Bold” and “italic” in the previous table denote a particular representation (formatting, here specifically the choice of a font style), while “heading” is a semantic characteristic and does not in general prescribe bold type. In printed works, headings are also set in small caps or italics instead of bold.

For HTML and LaTeX, the previous table therefore gives the code for the font styles “bold” and “italic”; in fact, Wikipedia's MediaWiki software generates the HTML code <b>fett</b> from the wikitext '''fett'''. In contrast to this, HTML offers the semantic markup strong, which is meant to express "importance", for example:

HTML Result with default settings
<strong>wichtig!</strong> important!

The HTML element strong is usually rendered as bold text (in the browser's default settings).

The relationship between the HTML element b and the HTML element strong is analogous to the relationship between the HTML elements i and em: the latter stands for "emphasis", and its default display is in italics. In LaTeX, too, there is a "semantic variant" \emph of the representational markup \textit:

HTML                           eine <em>Betonung</em> in normaler Umgebung
LaTeX                          eine \emph{Betonung} in normaler Umgebung
Result with default settings   eine Betonung in normaler Umgebung (with “Betonung” in italics)

However, italics are not suitable for emphasis within italicized text; LaTeX takes this into account:

HTML                           <i>eine <em>Betonung</em> in kursiver Umgebung</i>
Result with default settings   eine Betonung in kursiver Umgebung (everything in italics)
LaTeX                          \textit{eine \emph{Betonung} in kursiver Umgebung}
Result                         eine Betonung in kursiver Umgebung (in italics, with “Betonung” upright)

Wikitext behaves here somewhat like LaTeX; in HTML, the behavior of LaTeX can be achieved (rudimentarily) through the CSS declaration i em { font-style: normal; }. According to the HTML5 specification, nesting of em elements is supposed to express increased emphasis (which apparently has hardly been implemented yet). LaTeX, on the other hand, switches back and forth between italic and upright type when \emph is nested, so that a reader cannot distinguish triple emphasis from single emphasis. Ultimately, common and at the same time meaningful implementations of "emphasis" are known in HTML only for the simplest cases, and in LaTeX only for the simplest and the second-simplest cases.

With HTML, efforts have long been made to get rid of representational elements (the jargon term in this case is “presentational”; compare the HTML4 variants “strict” and “transitional”). This goal was meant to be achieved with HTML5, although b and i still exist, for cases in which bold or italics are “urgently required”. A list of cases in which the specific choice of font style is appropriate is regarded as the “semantic definition” of the two elements.
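The toggling behavior attributed to nested emphasis in LaTeX can be modeled in a few lines. The function below is a simplified model written for illustration (its name and parameters are assumptions); it is not how LaTeX is actually implemented.

```python
def emph_is_italic(depth: int, surrounding_italic: bool = False) -> bool:
    # Simplified model: every additional level of nested emphasis flips
    # the font shape between upright and italic.
    italic = surrounding_italic
    for _ in range(depth):
        italic = not italic
    return italic


# Single and triple emphasis end up with the same font shape.
print(emph_is_italic(1), emph_is_italic(2), emph_is_italic(3))
```

Because only the parity of the nesting depth matters in this model, the first and third levels look alike, which is the limitation described in the text.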

Internal systematics - levels of abstraction

“Representational” versus “descriptive” - overview

In 1981, at a conference (the "Lausanne Conference") and in an influential article, Charles Goldfarb distinguished between procedural markup and descriptive markup of documents. In 1987, presentational markup was named as a further type of text markup in the context of WYSIWYG word processors. Quite soon, however, “presentational” was used synonymously with “procedural” (or as a generic term, see the section “Procedural” and “presentational” below); here we call this “representational”. Representational markup determines the formatting of the text, for example the type design through the choice of a typeface, a font style, the font size, a font color or a background; also the alignment of text (relative spacing, absolute position on the page). Other synonyms came into more frequent use later:

for "representational" / "presentational":
visual, physical, specific;
for "descriptive":
structural, declarative, generalized, generic, content, logical, conceptual, semantic.

The term generic coding instead of descriptive markup comes from William W. Tunnicliffe. In 1992, Furuta, in presenting Goldfarb's (and Brian Reid's, see below) distinction, used not “procedural” and “descriptive” but “presentational” and “generic” (also separation between content specification and format specification, and “generic” logical structure rather than its physical appearance). In the specifications (and drafts) for HTML 4.0 and HTML 4.01, the predominant pair of opposites is “presentational” (also presentation elements and attributes) versus “structural” (separate structure and presentation), and there is also talk of visual formatting.

At the beginning of his article, Goldfarb explains that markup separates the logical elements from each other and ("typically", probably with reference to the procedural markup known until then) specifies the processing functions that are to be applied to these elements.

Popularity of descriptive markup (advantages, historical development)

Goldfarb, William W. Tunnicliffe and Brian Reid recommended at that time that authors mark up documents only "descriptively", e.g. that they mark phrases and blocks only as "title", "section heading", "block quotation" etc. The aims were to enable typographically high-quality typesetting even without typographical expertise or programming knowledge on the author's part, to be able to change the style of presentation with little effort, to avoid dependence on particular typesetting providers, and to facilitate the automatic retrieval of information, e.g. by giving more weight to occurrences in headings when searching documents for keywords. Goldfarb points out, for example, that merely marking words as centered loses the information as to whether they form a heading or the caption of a table or figure. Descriptive markup also facilitates output in different formats and on different devices, such as HTML, PDF and screen readers (accessibility, see Accessible Rich Internet Applications). In the case of HTML, the use of presentational attributes instead of style sheets can also “bloat” the HTML files. Accordingly, HTML later worked towards offering only “structural” or “semantic” elements and attributes and handing the presentation over completely to Cascading Style Sheets (separation of content and form).

William W. Tunnicliffe advocated the separation of content and form in word processing at a conference as early as 1967, but this initially had little effect (though Goldfarb, at least, states that he was influenced by it). In 1981, Brian Reid presented his Scribe typesetting system at the same session of the “Lausanne Conference” at which Goldfarb presented his ideas. Scribe's separation of content and form(atting) made a particular impression. Over the next few years, Leslie Lamport developed the LaTeX macro package for the TeX program, motivated in particular by the wish to offer authors a descriptive markup language. It was released in 1985. By 1992 LaTeX was already very popular, initially among North American mathematicians, and in the following years in science and academia and in industry. In those years, an almost purely European development team took over the further development of LaTeX from Lamport and improved its flexibility with regard to the use of different “stylesheets” (macro definition files with the extensions .sty for “style”, as with Lamport, and .cls for the “document class” declared with \documentclass) and with regard to use with languages other than English, which made LaTeX even more important.

Tunnicliffe and Goldfarb, on the other hand, led the further development of IBM Generalized Markup Language into SGML as the basis for defining purely descriptive markup languages, from which XML later emerged, which also plays an important role in book typesetting.
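The retrieval advantage mentioned above can be illustrated with a small scorer that counts keyword occurrences and weights those found inside heading elements more heavily. This is a minimal sketch using Python's standard html.parser; the class name, the function name and the weight of 3 are assumptions made for the example.

```python
from html.parser import HTMLParser


class KeywordScorer(HTMLParser):
    """Count keyword hits, weighting hits inside h1..h6 more heavily."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, keyword, heading_weight=3):
        super().__init__()
        self.keyword = keyword.lower()
        self.heading_weight = heading_weight  # hypothetical weighting factor
        self.heading_depth = 0                # > 0 while inside a heading
        self.score = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.heading_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self.heading_depth > 0:
            self.heading_depth -= 1

    def handle_data(self, data):
        hits = data.lower().count(self.keyword)
        weight = self.heading_weight if self.heading_depth else 1
        self.score += hits * weight


def keyword_score(document, keyword):
    scorer = KeywordScorer(keyword)
    scorer.feed(document)
    scorer.close()
    return scorer.score


print(keyword_score("<h2>Markup</h2><p>Markup and more markup</p>", "markup"))
```

Such weighting is only possible because the heading is marked as a heading; text that was merely marked as bold or centered would give the scorer nothing to go on.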

Definition as "language"

A markup language should be a language that is also machine-readable . For this purpose, the syntax and semantics must be specified, which applies in the following cases:

  1. The source code of a document is a program with instructions in a (domain-specific) programming language. As with other programming languages, syntax and semantics are defined and form a formal language whose syntax is specified, for example, by production rules (for instance in Backus-Naur form). This applies to PostScript, troff and TeX (for the latter at token level after the expansion of macros, among other things).
  2. In the case of markup languages defined in accordance with SGML or XML , the syntax is represented precisely by a document type definition . Under certain circumstances, the World Wide Web Consortium also specifies (informal) semantics consisting of recommendations directed at users and developers.

Things are a little more difficult in the case of TeX and LaTeX, where macro definitions (mainly before the code that represents the content of a document is read in) create a very extensive “procedural” language (this anticipates a later section). The choice of “telling” macro names creates the “illusion” of purely “descriptive” markup. By concealing (in the manual) or forbidding the fully available options of “procedural” or “presentational” markup, one can obtain a “purely descriptive” markup language. Similarly, HTML 4.01 Strict was a purely descriptive markup language because it “prohibited” presentational elements and attributes that browsers still interpreted.

"Procedural" and "presentational"

In an important article from 1987, other types of markup were described in addition to “procedural” and “descriptive”; of these, XML co-author Tim Bray adopted “presentational” in his blog. The latter meant markup that WYSIWYG word processors inserted into the source document when users typed certain keystrokes (WordStar is named as an example). Instead of the source code, the user sees only a preview of the print output. "Presentational" obviously has a different, more specific meaning here than in the HTML specifications, which do not speak of WYSIWYG editors. One thing the two have in common, however, is that the markup code is more concise than that of "conspicuously procedural markup" in the following sense:

In the example given by Goldfarb, a list, such as one introduced in HTML with <ol>, is preceded by the following code:

.tb 4
.of 4
.sk 1

The first two lines are value assignments for parameters that control the hanging indentation of the following paragraph; the third line produces its vertical spacing from the previous paragraph. The markup language used is the (troff-like) SCRIPT. This is obviously part of a computer program in an imperative programming language. <ol> in HTML is shorter and dispenses with the details of formatting. The example serves only to hint at Goldfarb's idea of “procedural markup” and merely illustrates the difference from “descriptive markup”.

Bray illustrates “procedural markup” with the PostScript commands gsave and grestore. These two commands relate to each other like \begingroup and \endgroup in TeX. The instruction \begingroup causes the previous parameter value to be stored on a stack whenever a parameter value is subsequently changed. The corresponding command \endgroup restores the parameter values that were in effect before the matching \begingroup. Neither command has any direct effect on the formatting; the effect depends on which parameter values are changed between them.

In PostScript there is also the command selectfont, which is reminiscent of the LaTeX command \selectfont:
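The save-and-restore behavior described for \begingroup and \endgroup can be sketched as a small stack of saved values. The class below is an illustrative model with invented names; it is not TeX's actual save stack and ignores global assignments.

```python
_MISSING = object()  # marks parameters that did not exist before the group


class Parameters:
    """Toy model of grouped parameter assignments."""

    def __init__(self, **initial):
        self.values = dict(initial)
        self.save_stack = []  # one dict of saved old values per open group

    def begingroup(self):
        self.save_stack.append({})

    def set(self, name, value):
        # Inside a group, remember the value from before the first change.
        if self.save_stack and name not in self.save_stack[-1]:
            self.save_stack[-1][name] = self.values.get(name, _MISSING)
        self.values[name] = value

    def endgroup(self):
        # Restore every parameter that was changed inside the group.
        for name, old in self.save_stack.pop().items():
            if old is _MISSING:
                del self.values[name]
            else:
                self.values[name] = old


params = Parameters(shape="upright", size=10)
params.begingroup()
params.set("shape", "italic")
inside = params.values["shape"]
params.endgroup()
outside = params.values["shape"]
print(inside, outside)
```

As in the text, opening and closing a group changes nothing by itself; only the assignments made in between determine what is undone at the end.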

20 selectfont
72 500 moveto
(Hallo Welt!) show

Overall, the previous observations suggest the following example:

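Italics with HTML and LaTeX, the latter with high-level versus low-level commands
Representation   HTML                    LaTeX high-level          LaTeX with \begingroup                           LaTeX with { instead of \begingroup
kursiv gesetzt   <i>kursiv</i> gesetzt   \textit{kursiv} gesetzt   \begingroup\itshape kursiv\/\endgroup\ gesetzt   {\itshape kursiv\/} gesetzt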

The two low-level examples on the right come very close to how LaTeX actually implements the high-level command \textit. \endgroup gesetzt would yield “kursivgesetzt” without a space, so \endgroup\ is used. The need for this trick is avoided in the rightmost example, where the curly braces stand for the commands \begingroup and \endgroup, whereas after \textit they merely delimit its scope. The command \/ prevents the space between “kursiv” and “gesetzt” from appearing too narrow because of the rightward slant of the “v” (the so-called italic correction).

All four examples are representational markup that varies the font style. One of the disadvantages of procedural markup that Goldfarb mentions is that it requires mastery of a large number of programming commands; as an example he specifically names Knuth's TeX. Italic correction, too, is a typographical subtlety whose necessity is not self-evident to authors using TeX. The LaTeX command \textit spares the user the knowledge of several low-level commands and of italic correction. The i element in HTML is just as easy to master. Goldfarb's point of criticism addressed here (in contrast to others) is obviously not directed against representational markup of every kind, but only against programming-language-like markup as in the two examples on the right and in the PostScript commands above.

In the case of the HTML example, calling the markup with <i> and </i> "procedural" also seems inappropriate. While in the "cumbersome" examples individual commands are addressed to the text processor (Goldfarb: "processing functions") and achieve the desired display only in combination, the i element merely represents an abstract interface to the web browser, whose procedural processing of the left-hand example is not accessible at all to authors of HTML documents. The difference resembles that between imperative programming (“genuinely procedural” in the examples on the right) and declarative programming, in which the algorithms for reaching a described state (here: italics) are not stated explicitly.

“Procedural” and “descriptive” markup languages

The literature also speaks of “descriptive markup languages” as opposed to “procedural markup languages”; when a markup language is “procedural” or “descriptive” is perhaps supposed to be self-evident after the explanations of “procedural markup” and “descriptive markup”. A “descriptive markup language” would be a markup language that permits neither “procedural” nor “presentational” markup, i.e. is “purely descriptive”, as was the intention or “philosophy” of SGML. This applies to DocBook and TEI. The predicate “procedural markup language” seems to apply to markup languages in which value assignments and other similarities with imperative programming languages are “unmistakable”, and perhaps also to markup languages that give formatting instructions in a more declarative way, such as HTML (before HTML5). In any case, PostScript, TeX and troff could be counted among them.

This interpretation, however, is contradicted by the fact that, according to Furuta (and, "to a large extent", the LaTeX Companion of 1994), LaTeX is a "generic markup language", despite the representational \textit (in LaTeX 2e, as described in the LaTeX Companion) or \it (in LaTeX 2.09, current in 1992). Perhaps a generic (or descriptive) markup language is a language that offers a “certain amount” of generic markup in addition to presentational markup.

Levels of representation

Referring to a work from 1988 in which he was involved, Furuta speaks of three " representations " of a document:

  1. an abstract one that is changed by editing with an editor ( abstract representation ),
  2. a physical one that arises from an abstract one through formatting ( physical representation ), and
  3. a page appearance that is required for a specific output device ( page representation ).

Furuta's article is structured accordingly.

By means of "representational markup", as explained above (starting with the examples), the font style, colors and text alignment can be determined; a corresponding section in the specifications for HTML 4.0 and 4.01 describes this "physical" aspect reasonably comprehensively. In HTML5, the style attribute remains as one way of, e.g., choosing font styles (using CSS code); tables also produce a reasonably strict "physical" layout that comes closer to a page-oriented representation than the choice of font styles does. This type of markup corresponds to the original, narrower term (text formatting or traditional markup as described at the beginning of the article).

What such markup generally does not determine is the line breaking within a paragraph of running text. For a word from the middle of a longer paragraph, it comes as a “surprise” whether it ends up on the left or on the right of the displayed paragraph on the screen or printed page, or whether it is split by a line break. This is also the case with the usual use of LaTeX, ConTeXt and plain TeX. If necessary (and with somewhat more advanced knowledge), the lines of a paragraph can be fixed manually (on websites with white-space: nowrap and <br />, in LaTeX with \makebox and \linebreak). More often, one is dissatisfied with the automatic line breaking in an individual case and sets a line break manually or prevents a line break within a phrase. Like the line breaks, the line spacing is typically determined automatically (it should be even, but is often larger around mathematical formulas containing fractions). TeX also stood out by setting the characters in mathematical formulas in different sizes and arranging them relative to one another in such a way that the proportions meet high typographical standards.

In contrast to websites, print typesetting must also determine the page breaks. This, too, is usually left to the typesetting program, and the automatic result is occasionally corrected manually. In contrast, when designing the title page of a book, “nothing is left to chance”.

File formats that fix all line breaks of running text on an output page, as well as the exact position of text elements and graphics on it, are called (or correspond to) page description languages. Examples are PostScript and PDF from Adobe, TeX's original output format DVI, and the XML Paper Specification from Microsoft. However, PDF and DVI cannot be viewed, changed or written in a text editor. This is possible with PostScript: in principle you can write a book in PostScript and determine the exact positions of all characters on the individual pages, much as on a typewriter. In practice, PostScript files are more likely to be generated from source text files marked up with LaTeX, by converting the DVI file generated by TeX into PostScript with another program (dvips).

In general, the author only provides the text with descriptive or "physical" markup (in an editor) without specifying line / page breaks; these (and other types of arrangement) are rather generated automatically and possibly stored in a page description file. Page description files can be viewed as a preview on the screen and printed out with a viewer such as Ghostview (Postscript), Adobe Reader (PDF) or YAP (for DVI under Windows) or xdvi (for DVI under Linux - see DVI viewer ) . They are also advantageous for the electronic exchange of documents or their dissemination ( online publication ) compared to the source formats, since they save the recipient from having to create the new version of the document (which can even fail) (" exchange formats ").

However, the "page appearance" or "page representation" of a document does not have to exist as a separate page description file. With some "editors" it can (or could) be viewed "directly" on the screen or printed out. troff has been extended to ditroff, which can generate a page description file of its own; other word processing programs have been given the ability to generate PDF.

In web browsers (more precisely: HTML renderers ) and e-book readers (which display HTML or EPUB, for example), the page display (the break of running text paragraphs) is quickly adapted to changing window widths or font sizes.
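The adaptation of line breaks to the available width can be imitated in a few lines. The following is only a sketch using Python's textwrap module, a simple greedy line filler; it is not the algorithm of any particular browser, e-book reader or of TeX, and the sample paragraph and widths are invented for the example.

```python
import textwrap

paragraph = (
    "Markup languages describe properties of sections of a text; "
    "where the lines break is decided only when the text is displayed."
)


def render(text, width):
    # Break the same source text into lines of at most `width` characters.
    return textwrap.wrap(text, width=width)


narrow = render(paragraph, 30)
wide = render(paragraph, 60)

for line in narrow:
    print(line)
print()
for line in wide:
    print(line)
```

The source text stays the same in both cases; only the computed line breaks differ, which is why the position of a given word on the screen cannot be predicted from the markup alone.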

Implementation of style variation for generic markup

Implementation of a representation

For the formatting of generically marked text, general rules for handling the individual tags (possibly depending on " attributes " in SGML-like markup languages) are specified in a formal language (in a kind of program). Corresponding "rule files" are called "stylesheets" in the SGML environment (not with LaTeX). In part or as a first step, the formatting consists of “translating” the generic language into a presentational one.

In the case of HTML, the formatting of individual elements is determined by corresponding instructions in CSS code. For example, the CSS line body { color: blue; background-color: yellow; } says that an HTML file should be displayed with blue text on a yellow background, and em { color: red; } says that the text in em elements should be red. In the following sample document

<html>
<head>
  <title>Hallo Welt!</title>
  <style type="text/css">
    body { color: blue; background-color: yellow; }
    em   { color: red; }
  </style>
</head>
<body>
<p><em>Hallo,</em> Welt!</p>
<p><em>Hörst</em> du?</p>
</body>
</html>

this CSS code appears in a style element within the head element. The result should look something like

Hallo, Welt!

Hörst du?

(blue text on a yellow background, with “Hallo,” and “Hörst” in red) and be the same as with

<html>
<head>
  <title>Hallo Welt!</title>
</head>
<body style="color: blue; background-color: yellow; ">
<p><em style="color: red; ">Hallo,</em> Welt!</p>
<p><em style="color: red; ">Hörst</em> du?</p>
</body>
</html>


In the second file, <body> has been replaced with <body style="color: blue; background-color: yellow;">, and each em start tag (generic markup) of the first file has been replaced with the presentational <em style="color: red; ">. The CSS statement tag { stil } works like inserting style="stil" into all tag start tags.

What HTML renderers actually do to merge CSS and HTML cannot be shown here. After all, the example files are even XHTML, i.e. code in an XML language, and the “translation” is a transformation that (again, somewhat improperly) could be expressed as an XSL transformation (XSLT). XSL stands for Extensible Stylesheet Language. In the case of XML, the “puristic” use of XSL and XSLT consists in translating generic XML languages, according to XSL stylesheets, into the presentational language XSL-FO (“XSL Formatting Objects”). In simple cases this means, as above, replacing generic tags with presentational tags. More details can be found in the articles on these topics. XSL-FO is not itself a page description language and must first be converted, for example, into a PDF file.
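The replacement of generic start tags by presentational ones can be imitated with a short script. This is a deliberately naive sketch: the function name and the rule dictionary are assumptions, only plain element-name selectors are handled, existing style attributes are ignored, and real HTML renderers do not work this way.

```python
import re


def inline_styles(document: str, rules: dict) -> str:
    """Insert style="..." into every start tag that has a rule."""

    def add_style(match):
        tag, rest = match.group(1), match.group(2)
        declarations = rules.get(tag.lower())
        if declarations is None:
            return match.group(0)  # no rule for this element: leave as is
        return f'<{tag} style="{declarations}"{rest}>'

    # Matches start tags only: end tags begin with "</" and are skipped.
    return re.sub(r"<([A-Za-z][A-Za-z0-9]*)([^<>]*)>", add_style, document)


rules = {
    "body": "color: blue; background-color: yellow;",
    "em": "color: red;",
}
generic = "<body><p><em>Hallo,</em> Welt!</p></body>"
print(inline_styles(generic, rules))
```

The rule dictionary plays the role of the stylesheet: changing one entry changes every matching element, whereas the presentational output repeats the declarations in each start tag.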

An XSL transformation actually creates a file in a different text format from the generic source code of a document. In the case of LaTeX, however, things are similar to HTML renderers: generic commands are translated into presentational or (finally) procedural ones, albeit internally, at token level. \emph{Hallo,} becomes the token chain \emph {₁ H₁₁ a₁₁ l₁₁ l₁₁ o₁₁ ,₁₂ }₂ (the subscripts are TeX category codes); then the first two tokens and the last one are gradually replaced by a few others. If some tests have been passed and the slant of the surrounding font is not positive, the result is a token chain similar to \begingroup \itshape H₁₁ a₁₁ l₁₁ l₁₁ o₁₁ ,₁₂ \/ \endgroup, where \itshape behaves under default settings like \fontshape {₁ i₁₁ t₁₁ }₂ \selectfont. The result would be the same as that of the “procedural version” of \textit in the section “Procedural” and “presentational”. In contrast to the Document Object Model, in which the document is translated only after it has been represented completely in memory, the TeX engine processes data streams such as the source code, the token chains and other internal lists in sections that are as short as possible, and after a printed page has been output it largely discards the memory contents needed for it (which made it possible to typeset thick volumes decades ago).
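The first step, turning source characters into tokens with category codes, can be sketched as follows. This is a strongly simplified tokenizer written for illustration: it assumes TeX's default category codes for braces, spaces, ASCII letters and other characters, and it ignores comments, the skipping of spaces after control words and all other details of TeX's input processor.

```python
def is_letter(ch: str) -> bool:
    # Assumption: only ASCII letters have category code 11.
    return ch.isascii() and ch.isalpha()


def tokenize(source: str):
    """Return control sequences as ("cs", name) and characters as (char, catcode)."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            j = i + 1
            while j < len(source) and is_letter(source[j]):
                j += 1
            if j == i + 1:  # control symbol such as \/ : exactly one character
                j = min(j + 1, len(source))
            tokens.append(("cs", source[i + 1:j]))
            i = j
            continue
        if ch == "{":
            code = 1   # begin group
        elif ch == "}":
            code = 2   # end group
        elif ch == " ":
            code = 10  # space
        elif is_letter(ch):
            code = 11  # letter
        else:
            code = 12  # other
        tokens.append((ch, code))
        i += 1
    return tokens


print(tokenize(r"\emph{Hallo,}"))
```

Macro expansion then operates on such a token list rather than on the characters of the source file, which is where the replacement of the generic command by lower-level commands takes place.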

In the case of LaTeX (as of TeX in general and also of ConTeXt ), the search and replace that implements the formatting is done by an internal macro processor . The generic markup language IBM Generalized Markup Language , introduced by Charles Goldfarb in 1981, also translated macros into the procedural, Troff- like language SCRIPT .

The examples should also show two advantages of generic markup compared to procedural markup: Generically marked source code takes up less storage space than presentational markup (as soon as the number of corresponding text elements exceeds a number that depends on the complexity of the replacement rule - which is not yet the case in the example ), and in a text editor the actual text to be displayed is easier to find again with generic markup than with procedural markup, it is more intuitive to read. (See also Don't repeat yourself and abstraction (computer science) .)

This storage space effect is increased if the style definitions (unlike in the previous example) are not in the "head" of the text source file (the head element of an HTML file, or the part above \begin{document} in a LaTeX source file, called the "document preamble" there) but in separate style files that are included from the text source files (transclusion). On websites that host a large number of separate, uniformly designed documents, these are CSS files with the extension .css. In the case of LaTeX, the style files originally had the extension .sty, for "style". Today, files with the extension .cls, which are read in by \documentclass, also determine the manner of presentation:

HTML:

  <title>Hallo Welt!</title>
  <link rel="stylesheet" type="text/css" href="style.css" />

<em>Hallo,</em> Welt!

<em>Hörst</em> du?

LaTeX:

  \emph{Hallo,} Welt!

  \emph{H\"orst} du?

The two CSS lines from before could now be in the file style.css, which would look like this:

body { color: blue; background-color: yellow; }
em   { color: red; }

Change of display

In the previous pair of examples you can now change the display of the selected text source code by changing the "head":

HTML:

  <title>Hallo Welt!</title>
  <style type="text/css">
    em { text-decoration: underline;
         font-style:      normal;    }
  </style>

<em>Hallo,</em> Welt!

<em>Hörst</em> du?

LaTeX:

  \emph{Hallo,} Welt!

  \emph{H\"orst} du?

In the case of LaTeX, the formatting style of the fictional The ABC-Journal was replaced by a LaTeX standard class, and the transclusion of the file ulem.sty was added. The latter redefines the token emph that results from \emph (it puts a different macro replacement rule into effect) in such a way that emphasis is represented by underlining instead of italics. The changed CSS code for the em element has the same effect. Apart from the typeface, the result should look like this with HTML as with LaTeX:

Hallo, Welt!

Hörst du?

(with “Hallo,” and “Hörst” underlined). Alternatively, the CSS code in style.css could be changed. For journal issues, the parts of the source texts submitted by the individual authors that are enclosed by \begin{document} and \end{document} can be combined with the journal's document preamble so that they are all formatted in the “house style”.

The representation of XML documents can be changed by using a different XSL transformation.

The generic text of the document does not have to be changed at all to change the presentation. The World Wide Web Consortium pointed this out in the specification of HTML5 as the second disadvantage of presentational markup, and Goldfarb spoke of “inflexibility” with regard to changes in the manner of presentation as the second disadvantage of procedural markup. In practice, LaTeX, for example, does not always find the best line or page breaks, so that the editors of a journal issue occasionally have to insert a presentational \pagebreak or something similar.

(Instead of \"o, ö can be used in LaTeX if the document preamble contains, for example, \usepackage[utf8]{inputenc}. The file inputenc.sty imported in this way is an example of the fact that the extension .sty unfortunately no longer stands only for "style"; rather, such packages often offer ways of making work easier, mostly by extending the command set.)

Conclusion: What is the “separation of content and presentation”?

In the case of LaTeX and HTML, the source code of the document contains information on formatting; with purely descriptive or generic markup, however, this information is found only in a "header" of the source file, in the head element or in the "document preamble". The text to be displayed, with its generic markup, is located in a different part of the source document, the body element or the {document} environment. The separation of structure and presentation then consists in the fact that source documents have two components, one of which only specifies formatting rules while the other only contains the document text with generic markup.

The formatting rules do not have to be located directly in the header; the header usually incorporates most of the formatting rules from other files ( transclusion ). In the case of LaTeX, the file with the information on the formatting (the "control file") does not have to contain the entire text to be displayed; this is often - especially in the case of books - also included from other (generically labeled) files.

In other cases, the source document contains no formatting information at all and does not even include files with formatting rules (e.g. XML/XSL). The "separation of content and form" (or, to distinguish it from formal sociology, of "content" and "formatting") is then achieved even more clearly than in the previous case: the content with its logical markup is kept in different files than the formatting rules. To choose a display style, you do not need to change the files that contain the text to be displayed (the "content").
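The idea can be illustrated with a minimal Python sketch. It is not taken from any of the systems discussed here: the element names and both "style sheets" (STYLE_HTML, STYLE_PLAIN) are hypothetical and only serve to show that the content stays untouched while the formatting rules are swapped.

```python
# Content with purely generic markup: (element type, text) pairs.
# The element names are hypothetical and chosen only for illustration.
CONTENT = [
    ("heading", "Markup language"),
    ("emphasis", "separation of structure and presentation"),
]

# Two hypothetical sets of formatting rules, kept apart from the content.
# Each maps an element type to a format string.
STYLE_HTML = {"heading": "<h2>{}</h2>", "emphasis": "<em>{}</em>"}
STYLE_PLAIN = {"heading": "== {} ==", "emphasis": "*{}*"}


def render(content, style):
    """Apply the formatting rules in `style` to every content element."""
    return [style[kind].format(text) for kind, text in content]


if __name__ == "__main__":
    # The same content is rendered twice; only the rules differ.
    for line in render(CONTENT, STYLE_HTML):
        print(line)
    for line in render(CONTENT, STYLE_PLAIN):
        print(line)
```

Changing the presentation here means passing a different rule set to render; CONTENT itself is never edited.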

Automatic code generation and original source code

It has already been mentioned that marked-up text which forms the basis for displaying a document on output devices (printer, screen) can be generated automatically from another form of marked-up text. Insofar as the fixed, page-oriented form of representation can still be regarded as encoded in a markup language (is PostScript a markup language? PDF?), it is practically always generated automatically, either from a markup language somewhere between purely physical (without semantic-structural information) and purely generic (without references to the method of presentation, as in HTML5 without the style attribute), or from one that mixes physical and semantic-structural information (as with the "non-purist" use of LaTeX). It can be generated directly from a purely physical markup of the document (PDF from XSL-FO), and a purely physical, non-page-oriented form can in turn be generated automatically from a purely structural markup (XHTML), e.g. by XSL transformation.

Once the work has been published or sent to an addressee, or once the printout required for an archive is available, the underlying files in certain markup formats are often forgotten, and some users delete them. If the document is to be (partially) reused, e.g. for a new, revised edition of a book, or if an article printed years ago is also to be published online as HTML, it is good if the original (partly) semantic-structural markup, the original source code, is still available and does not have to be laboriously "reconstructed" from a purely descriptive format (e.g. unnumbered section headings and subsection headings).

Authors (or typists) usually do not look at the automatically generated code. When using WYSIWYG editors, one typically does not even look at the original source code. The same applies to LyX, a front end for LaTeX, with which one can mark up text semantically and structurally and recognize the resulting structure on the screen without seeing the source code.

(In view of the different ways in which text characters are encoded, Unicode or ..., one could also say that the original source code consists of a hex dump that nobody looks at; the text editor presents a "user-friendly version" of it, which is WYSIWYG-like with regard to the characters to be read on the output device.)

HTML was once a format in which the "original source code" of documents was written. In the meantime it has also become a target format, generated for example from databases (which can be written in XML) using scripting languages such as JavaScript and PHP, or from other source formats with Pandoc. To mark up a text such as a Wikipedia article, however, there is no alternative: the plain text (as it can be extracted from the browser window by copy and paste) must be typed and marked up. The characters < and > are cumbersome to type, and XML requirements increase their number. Sometimes attribute names have to be typed as well, which worsens the ratio of characters displayed in the output to characters used for markup. LaTeX is sometimes easier to type and easier to read because (in the running text) it mainly uses positional parameters instead of key-value specifications. In addition, thanks to the built-in macro processor, LaTeX users can introduce abbreviation commands (in the document preamble or in .sty files) for character combinations that occur frequently in a document (tags, for instance; which ones occur frequently varies from document to document). To simplify the generation of (X)HTML documents, the following options have been devised:

  • HTML editors with autocompletion;
  • TeX4ht converts the DVI output from TeX into HTML or XML;
  • Website Meta Language: tools for programmers, using the m4 macro processor (cf. LaTeX);
  • content management systems for non-programmers, cf. Editorial system, which is more general in that target formats other than (X)HTML are also supported, and which includes the WYSIWYG editors already mentioned several times;
  • simplified markup languages, which are described in more detail below. In wikis they represent the "original source format", from which mainly XHTML is generated; from this, good-quality print output (PDF) can then also be produced, for example via XSL.
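The remark above about the ratio of displayed characters to markup characters can be made concrete with a small measurement. The following Python sketch is only an illustration under simple assumptions: it uses the standard library's html.parser, counts every character that is not text content as markup, and the function name markup_overhead is made up for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def markup_overhead(source):
    """Return (displayed, markup) character counts for an HTML fragment.

    Everything that is not text content (tags, attribute names and
    values, angle brackets) is counted as markup.
    """
    collector = TextCollector()
    collector.feed(source)
    collector.close()
    displayed = len("".join(collector.parts))
    return displayed, len(source) - displayed


if __name__ == "__main__":
    print(markup_overhead("<b>fett</b>"))
    print(markup_overhead('<a href="#top">W3C</a>'))
```

For <b>fett</b> the sketch counts 4 displayed characters against 7 markup characters; a simplified notation such as __fett__ needs only 4 markup characters for the same 4 displayed ones.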

Simplified markup languages


Posts in wikis, blogs, and internet forums are typically entered in web form fields. The design options can be very limited, which can benefit a tidy appearance of the resulting pages. Although the target format (in which the posts are presented to readers) is HTML or XHTML, HTML input in the form is accepted only to a limited extent (the rest is filtered out). The markup (apart from the URLs for hyperlinks) often uses only (unusual combinations of) punctuation marks, or at least characters that are not letters; or a few HTML tags are shortened and the corresponding elements are not closed (similar to SGML), for example

Textile              Translation into XHTML    Example rendering
h3. Unterabschnitt   <h3>Unterabschnitt</h3>   Subsection

(similar to Haml). As a result, the markup disturbs the flow of reading as little as possible while the post is being written in the form field. To present the documents, this markup is then converted on the server side into the more complex markup language required for display, such as HTML or XHTML, for example by Pandoc or, as in the case of Wikipedia, by the MediaWiki software.
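The conversion step can be sketched for the single Textile rule shown in the example above. This is a minimal illustration in Python, not the way Pandoc or any Textile implementation actually works; the function name textile_heading_to_xhtml is hypothetical, and only the pattern "hN. text" is handled.

```python
import re
from html import escape

# A heading line in the style of the example: "h3. Unterabschnitt".
HEADING = re.compile(r"^h([1-6])\.\s+(.*)$")


def textile_heading_to_xhtml(line):
    """Convert one Textile-style heading line to an XHTML heading.

    Returns None if the line is not a heading in this simplified sense.
    """
    match = HEADING.match(line)
    if match is None:
        return None
    level, text = match.groups()
    # Escape <, > and & so that the heading text cannot inject markup.
    return f"<h{level}>{escape(text)}</h{level}>"


if __name__ == "__main__":
    print(textile_heading_to_xhtml("h3. Unterabschnitt"))
```

Note that the opening and closing tags are both produced by the converter; the author only types the shortened, unclosed form.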

Markup examples with two simplified markup languages

MediaWiki wikitext   Markdown, either ...   ... or:      Resulting XHTML      Rendering
== Abschnitt ==                                          <h2>Abschnitt</h2>   Section

* Punkt 1            - Punkt 1              * Punkt 1    <li>Punkt 1</li>     • Point 1
* Punkt 2            - Punkt 2              * Punkt 2    <li>Punkt 2</li>     • Point 2
* Punkt 3            - Punkt 3              * Punkt 3    <li>Punkt 3</li>     • Point 3

[ W3C]               [W3C](                              <a href="">W3C</a>   W3C
                                            __fett__     <b>fett</b>          bold
''kursiv''           *kursiv*               _kursiv_     <i>kursiv</i>        italic

Furthermore, simplified markup languages typically do not allow plain line breaks and indentation in the code to be used for the sole purpose of structuring it (in the interests of legibility and comprehensibility). Rather, in MediaWiki, for example, a line break ends an indented paragraph ("hanging indent") of a list or a block quote, and an asterisk (*) immediately following it begins a (new) list item and is displayed as a typographical bullet. A disadvantage of this method is that the characters concerned can collide with another function, which can cause errors. In Markdown, for example, italicized text begins with an asterisk (ich rufe *laut* um Hilfe), which at the beginning of a line (*laut* rufe ich um Hilfe) can conflict with its use for a list entry. Indented code (i.e. the line break in the code is followed by at least one space) is rendered "verbatim" in wikitext as "code" (without syntax highlighting). The articles Wikitext and Markdown, as well as the other articles in the category Simplified markup language, offer further and more precise examples.
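One common way to resolve the asterisk collision is to require whitespace after a list marker. The following Python sketch assumes this CommonMark-style rule; older Markdown implementations may behave differently, and the sketch ignores other block types such as thematic breaks. The function name starts_list_item is made up for this example.

```python
import re

# Assumed rule: a bullet list item starts with up to three spaces,
# one of the markers *, + or -, then at least one space or tab,
# then the item text.
LIST_ITEM = re.compile(r"^ {0,3}[*+-][ \t]+\S")


def starts_list_item(line):
    """Return True if `line` opens a bullet list item under the assumed rule."""
    return LIST_ITEM.match(line) is not None


if __name__ == "__main__":
    for sample in ("* Punkt 1", "*laut* rufe ich um Hilfe", "- Punkt 2"):
        print(repr(sample), starts_list_item(sample))
```

Under this rule, "*laut* rufe ich um Hilfe" is not a list item because no whitespace follows the first asterisk, so the asterisks can be read as emphasis markers instead.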

In addition to purely logical ("descriptive") markup such as headings and pure font markup such as bold, other functions can be fulfilled:

  • Tags for annotating a data set with additional information and for categorization;
  • Transclusions to include parts of other documents by reference.

Although the main target format of such languages is HTML or XHTML, thanks to Pandoc many of them can even be used (to a limited extent) as a front end for LaTeX and ConTeXt, and thus ultimately have PDF as the target format, or be converted into word processing formats, e-books and documentation formats (DocBook, man pages).

Historical development

Simplified markup has always been used in purely text-based systems (e.g. readme files or e-mails) to indicate emphasis such as italic or bold, without being converted any further. The syntax of Markdown in particular, which is converted, is closely based on this historical practice.

Most of these markup languages evolved through use in different software; there are hardly any standardized or uniform solutions, although the functions are often similar.

Probably the first simplified markup language with conversion was developed by Ward Cunningham in 1994 and published in 1995 as WikiWikiWeb together with the Portland Pattern Repository; see also Chronology of Hypertext Technologies.


YAML and its subset JavaScript Object Notation (JSON) are simplified markup languages for data serialization.

External systematics: classification as programming language or data format

Filename extensions and MIME types of selected markup languages

Markup language    File extension   MIME type
HTML               .htm, .html      text/html
PostScript         .ps              application/postscript
Rich Text Format   .rtf             text/rtf
TeX / LaTeX        .tex             text/x-tex
XML                .xml             text/xml

Contradictory statements can be found as to whether a markup language is a programming language at all, or whether a particular markup language such as HTML is a programming language (and an HTML file a "program"). In 2001, the W3C declared that XML was not a programming language but offered rules for defining text formats for structuring data, i.e. for defining data formats (which is not its only use). In fact, the development from SGML to XML made it possible to use markup languages for purposes completely different from the original one, the formatting of texts. For example, the configuration of the Linux window manager Openbox is stored in an XML file; instead of lines like key=value, as in the configuration files of other programs, one finds <key>value</key> here (see other example), and superordinate elements like mouse are used to structure the approximately 900 lines of the file. This configuration file is by no means intended to be typeset as a "document". The article XML gives further examples of such originally unintended uses of XML. As a data format, the markup language used in a (document) file can be recognized by the file name extension (see table). Those markup languages that are still intended for creating documents (HTML, PostScript, troff, LaTeX, RTF) constitute document formats. Binary document formats (.doc, .pdf, TeX's output format DVI) are not markup languages.
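The difference between the two configuration styles mentioned above can be shown with a short Python sketch that turns key=value lines into <key>value</key> elements. It is only an illustration: the keys, the root element name "config" and the function name config_to_xml are hypothetical and do not reproduce the actual Openbox configuration schema. Keys are assumed to be valid XML element names.

```python
import xml.etree.ElementTree as ET


def config_to_xml(lines, root_name="config"):
    """Convert key=value lines into an XML string of <key>value</key> elements.

    Blank lines and lines starting with '#' are skipped.
    """
    root = ET.Element(root_name)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Each key becomes a child element; the value becomes its text.
        ET.SubElement(root, key.strip()).text = value.strip()
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(config_to_xml(["theme=dark", "# a comment", "desktops = 4"]))
```

The enclosing root element plays the role of the superordinate elements described above: it groups related settings, which a flat key=value file cannot express directly.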

The paradigmatic procedural markup languages PostScript, TeX and troff, the descendant of the original RUNOFF (on which Goldfarb's GML presumably also built), are known to be Turing-complete. In this respect, they can express algorithms of any complexity and thus fulfill an essential, generally recognized criterion for programming languages. XSLT is another Turing-complete programming language, whose "commands", however, like those of the aforementioned "languages", are designed for presenting documents, in this case documents marked up "descriptively" with XML, and which, curiously, is itself written in an XML data format. The XQuery language for XML databases is also Turing-complete.


Literature

  • James H. Coombs, Allen H. Renear, Steven J. DeRose: Markup Systems and the Future of Scholarly Text Processing. In: Communications of the ACM. Vol. 30, no. 11, November 1987, ISSN 0001-0782, pp. 933-947, doi:10.1145/32206.32209 (accessed July 7, 2015).
  • Robin Cover: SGML: A Textual Representation for Information Structure. In: Summer Institute of Linguistics, Inc. (Ed.): Notes on Computing. Vol. 16 (September/October), 1997 (memento of April 22, 2003 in the Internet Archive; accessed July 27, 2015).
  • Michael Downes: TeX and LaTeX 2e. In: Notices of the AMS. Vol. 49, no. 11, December 2002, pp. 1384–1391 (PDF; 822 kB; accessed July 26, 2015).
  • Richard Furuta: Important papers in the history of document preparation systems: basic sources. In: Electronic Publishing: Origination, Dissemination & Design. Vol. 5, no. 1. John Wiley & Sons, Chichester UK, March 1992, pp. 19–44 (accessed July 7, 2015; relevant sections: 4, 5, 6.1, 6.2).
  • Charles Goldfarb: A Generalized Approach to Document Markup. In: Proceedings of the ACM SIGPLAN SIGOA Symposium on Text Manipulation (= SIGPLAN Notices). Vol. 16, no. 6, June 1981, pp. 68–73 (PDF; accessed July 9, 2015).
  • Michel Goossens, Frank Mittelbach, Alexander Samarin: The LaTeX companion. 1st edition. Addison-Wesley, Bonn et al. 1994, ISBN 3-89319-646-3, Section 1.3 (Generic Markup) and 1.4 (The Need for Visual Markup), pp. 7-10 (English original: The LaTeX Companion. 1994. Translated by Claudia Kraft and Rebecca Stiels. Motivates LaTeX through the contributions of Goldfarb and Reid, which are also explained. Section 1.3.3 is entitled "The separation of content and form". Remarkably, the second edition (Mittelbach and Goossens 2004 ff., see below) only contains remarks on the relationship between LaTeX and Reid's Scribe (p. 2) and a sentence at the beginning of the second chapter, both in completely different terminology.).
  • Dmitry Kirsanov: Chapter 3: SGML and HTML DTD. Procedural and Descriptive Markup. In: Rick Darnell et al. (Ed.): HTML Unleashed. 1st edition. Indianapolis 1997, ISBN 1-57521-299-4 (memento of June 30, 2015 in the Internet Archive; accessed July 23, 2015).
  • Frank Mittelbach and Michel Goossens, with Johannes Braams, David Carlisle and Chris Rowley, as well as contributions by Christine Detig and Joachim Schrod: The LaTeX Companion, Second Edition. 4th, revised edition. Addison-Wesley, Boston MA et al. 2005, ISBN 0-201-36299-6, Section 1.1: A brief history, pp. 1-6.
  • A. L. Oakley, A. C. Norris: Page description languages: development, implementation and standardization. In: Electronic Publishing: Origination, Dissemination & Design. Vol. 1, no. 2. John Wiley & Sons, Chichester UK, September 1988, pp. 79–96 (PDF; 122 kB; accessed August 3, 2015. On pp. 79 f., eight definitions of page description language from previous publications are cited and summarized. The section Schemes for the description of printed pages, pp. 89-92, describes relationships between page description languages and [other] markup languages.).
  • Eric Steven Raymond: The Art of Unix Programming. Addison-Wesley Professional, Boston 2004, ISBN 0-13-142901-9, Chapter 8: Minilanguages (among other things on the Turing completeness of individual markup languages; chapter start page of the HTML version dated September 23, 2003).

Web links

  • Tim Bray: On Semantics and Markup. April 9, 2003 (English; presentation of document markup types in a few lines, critical of "semantic").

Individual evidence

  1. HTML5 - A vocabulary and associated APIs for HTML and XHTML. W3C Recommendation, October 28, 2014. W3C, accessed June 10, 2015 (English): "the core language of the World Wide Web: the Hypertext Markup Language (HTML)"
  2. Meyer's encyclopaedic lexicon. Mannheim 1971. Volume 3, p. 188.
  3. HTML 4.01 Specification - W3C Recommendation. 15.2 Fonts. December 24, 1999, accessed July 8, 2015.
  4. More precisely, the "ASCII apostrophes" in MediaWiki wikitext do not actually frame elements, and they do not allow elements to be nested, but they do allow overlapping markup: the first triple of apostrophes in a source-text paragraph generates a <b>, the next a </b>, the third again a <b>, etc. The first pair of apostrophes that is not followed by another apostrophe generates an <i>, the next an </i>, the next an <i>, etc. At the end of the paragraph, open tags are automatically supplemented by closing tags.
  5. HTML5 - A vocabulary and associated APIs for HTML and XHTML - W3C Recommendation. 4.5.3 The strong element. (No longer available online.) October 28, 2014, archived from the original on August 1, 2015; accessed October 6, 2018 (English).
  6. <strong>: The Strong Importance element. In: Mozilla Developer Network. Accessed August 11, 2019: "Browsers typically render the contents in bold type."
  7. <em>: The Emphasis element. In: Mozilla Developer Network. Accessed August 11, 2019: "Typically this element is displayed in italic type."
  8. LaTeX/Fonts#Emphasizing text in Wikibooks (English)
  9. Mittelbach and Goossens (#Literature) pp. 341 ff.
  10. HTML5 - A vocabulary and associated APIs for HTML and XHTML - W3C Recommendation. 4.5.2 The em element. (No longer available online.) October 28, 2014, archived from the original on August 1, 2015; accessed October 6, 2018 (English).
  11. In HTML 3.2 of January 14, 1997, nothing of this could be seen yet, but the foundation stone had been laid on December 17, 1996 with CSS1. In the working draft for HTML 4.0 of July 8, 1997, it was announced that "presentational" elements and attributes should gradually be replaced by style sheets.
  12. Document type definition
  13. HTML5 - A vocabulary and associated APIs for HTML and XHTML - W3C Recommendation. 1.10.1 Presentational markup. October 28, 2014, accessed July 8, 2015.
  14. HTML5 - A vocabulary and associated APIs for HTML and XHTML - W3C Recommendation. 4.5 Text-level semantics. October 28, 2014, accessed October 6, 2018.
  15. Markup Technologies '98 Conference. Agenda and Schedule - Annotated. In: The CoverPages. January 11, 1998, accessed July 28, 2015.
  16. Richard Furuta: Important papers in the history of document preparation systems: basic sources. P. 20.
  17. Richard Furuta: Important papers in the history of document preparation systems: basic sources. Section 4.1.
  18. Goldfarb (#Literature)
  19. Coombs, Renear and DeRose (#Literature)
  20. Richard Furuta: Important papers in the history of document preparation systems: basic sources. P. 30.
  21. Tim Bray: On Semantics and Markup. Taxonomy of Markup. April 9, 2003, accessed July 28, 2015.
  22. Richard Furuta: Important papers in the history of document preparation systems: basic sources. As early as 1992 (Section 4.1).
  23. HTML 4.01 Specification - W3C Recommendation. 15 Alignment, font styles, and horizontal rules. December 24, 1999, accessed July 8, 2015.
  24. Robin Cover: SGML: A Textual Representation for Information Structure.
  25. Downes (#Literature) p. 1368.
  26. Richard Furuta: Important papers in the history of document preparation systems: basic sources. P. 19.
  27. HTML 4.01 Specification - W3C Recommendation. 2.3.5 Style sheets. December 24, 1999, accessed July 28, 2015.
  28. HTML 4.01 Specification - W3C Recommendation. 2.4.1 Separate structure and presentation. December 24, 1999, accessed July 28, 2015.
  29. Goldfarb (#Literature) p. 68.
  30. Goossens / Mittelbach / Samarin 1994 and Mittelbach and Goossens 2004 p. 2 (#Literature).
  31. Mittelbach and Goossens (#Literature) p. 2.
  32. Mittelbach and Goossens 2004 (#Literature) pp. 2-4.
  33. Donald E. Knuth: The TeXbook. Illustrations by Duane Bibby. Addison-Wesley, Reading MA et al. 1986, ISBN 0-201-13447-0, pp. 267 ff. (paperback ISBN 0-201-13448-9; in addition to macros, there are other expansion constructs such as conditionals and reading out register contents, see Chapter 20).
  34. See the information package l2tabu.
  35. Tim Bray: On Semantics and Markup. Procedural markup. April 9, 2003, accessed July 28, 2015.
  36. Learn Postscript. gsave ... grestore. September 19, 2007, accessed July 30, 2015 (English; mini-program as an example).
  37. David Maxwell: Graphics State PostScript Commands. gsave. In: UBC Math Computing Lab Documentation. University of British Columbia, accessed July 30, 2015.
  38. The example is a mixture of PostScript # A program example and PostScript # "Hello world" and was tested with Ghostscript.
  39. Goldfarb (#Literature) p. 69.
  40. Goossens / Mittelbach / Samarin 1994 (#Literature) p. 8.
  41. Richard Furuta: Important papers in the history of document preparation systems: basic sources. P. 25.
  42. Richard Furuta: Important papers in the history of document preparation systems: basic sources. Section 6.2: Page description languages. Quote: "Page description languages describe the positioning of graphical marks on a printed page."
  43. Oakley / Norris (#Literature) pp. 91 f.
  44. HTML5 - A vocabulary and associated APIs for HTML and XHTML - W3C Recommendation. 1.10.1 Presentational markup. October 28, 2014, accessed August 12, 2015 (English): "Presentational markup tends to be much more redundant, and thus results in larger document sizes."
  45. File ulem.sty. Accessed July 17, 2018.
  46. HTML5 - A vocabulary and associated APIs for HTML and XHTML - W3C Recommendation. 1.10.1 Presentational markup. October 28, 2014, accessed August 12, 2015 (English): "It is significantly easier to maintain a site written in such a way that the markup is style-independent. For example, changing the color of a site that uses <font color=""> throughout requires changes across the entire site, whereas a similar change to a site based on CSS can be done by changing a single file."
  47. Goldfarb (#Literature) pp. 68 f.
  48. Dmitry Kirsanov (#Literature): HTML Unleashed. SGML and the HTML DTD. Introduction. (No longer available online.) June 16, 1997, archived from the original on June 30, 2015; accessed October 6, 2018: "SGML [...] think of it as a programming language to build working programs (HTML being one of them) [...]"
  49. Jukka Korpela (Tampere University of Technology): Programs vs. markup, or why HTML authoring is not programming. November 16, 2015, accessed July 13, 2014.
  50. XML in 10 points. (No longer available online.) W3C, September 22, 2014, archived from the original on December 20, 2016; accessed October 6, 2018 (English): "Note: This document is no longer maintained but is left for historical purposes."
  51. Christoph Prevezanos: Technical Writing: For computer scientists, professionals, technicians and professional life. Section 2.1.5: XML Environments. Carl Hanser, Munich 2013, ISBN 978-3-446-43721-0, p. 13 (limited preview in Google Book Search, accessed July 16, 2015; eBook ISBN 978-3-446-43759-3). Quote: "XML is not a word processor, not a programming language and not a specific program. Instead, it is a markup language with which texts can be structured and the elements can be declared."
  52. Stephan Kepser (University of Tübingen, SFB 441): A Simple Proof for the Turing-Completeness of XSLT and XQuery. In: Extreme Markup Languages 2004® (Montréal, Québec) (= Proceedings of Extreme Markup Languages). 2004 (HTML version of the text of the lecture, accessed July 19, 2015).