Entity (markup language)

from Wikipedia, the free encyclopedia

In markup languages such as SGML, XML, HTML, XHTML and HTML5, entities are used to manage and reuse recurring units of information.

The entity syntax in wide use today derives from SGML. When XML and HTML5 were developed, parts of SGML were carried over, including some of the options for defining entities.

The most common kind is the character entity, which is replaced by a single character. In particular, a mnemonic abbreviation (named entity) is replaced by the decimal or hexadecimal character code (numeric entity, character reference).

Named Entity

Names (named character entities) are for people; numbers are for machines. Computers handle five-digit codes easily; only humans have trouble with them.

Named entities make documents easier for human readers and editors to read.

A named entity with name (entity name) and content (entity content) is declared by means of a document type definition (DTD). If the entity name is referenced in the document text, the parser replaces the reference with the entity content.

Examples:
  • This declaration specifies that every &amp; is to be replaced by the character with decimal code 38:
      <!ENTITY amp CDATA "&#38;"> <!-- ampersand ("et"): & -->
(DTD format: HTML)
  • Document text with a clear meaning:
He is 6&foot; 2&inch; tall.
Three different DTDs can be used with this document:
  • DTD for 7-bit ASCII environment
            <!ENTITY foot   "&#39;"> <!-- ' -->
            <!ENTITY inch   "&#34;"> <!-- " -->
  • DTD for multibyte Unicode environment
            <!ENTITY foot   "&#8242;"> <!-- ′ -->
            <!ENTITY inch   "&#8243;"> <!-- ″ -->
  • DTD for audiobook environment
            <!ENTITY foot   " foot ">
            <!ENTITY inch   " inch ">
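
The effect of swapping DTDs can be sketched in Python. This is a minimal illustration, not a definitive implementation: it assumes the declarations are placed in an internal DTD subset and parsed with the standard-library xml.etree.ElementTree module, which expands internally declared entities. The helper name render and the wrapper element p are hypothetical.

```python
import xml.etree.ElementTree as ET

# Three alternative sets of entity declarations (taken from the example above).
DTDS = {
    "ascii": '<!ENTITY foot "&#39;"> <!ENTITY inch "&#34;">',
    "unicode": '<!ENTITY foot "&#8242;"> <!ENTITY inch "&#8243;">',
    "audiobook": '<!ENTITY foot " foot "> <!ENTITY inch " inch ">',
}

# The document text stays the same for every environment.
BODY = "<p>He is 6&foot; 2&inch; tall.</p>"


def render(environment: str) -> str:
    """Parse BODY with the chosen declarations and return the expanded text."""
    document = f"<!DOCTYPE p [{DTDS[environment]}]>{BODY}"
    return ET.fromstring(document).text


for name in DTDS:
    print(name, "->", render(name))
```

Only the declarations change between the three runs; the document text itself is never edited.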

Character Reference (Numeric Entity)

The SGML standard introduced numeric entities as character references, and XML defines them in the same way. A numeric entity is written into the document as the character code, in one of two forms:

  • &#nnn;, where nnn is the decimal coding of the character to be used, or
  • &#xhhhh;, where hhhh is the hexadecimal coding of the character to be used.

The parser replaces the reference with the character it encodes.
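
Both forms can be tried out with a short Python sketch. The two helper functions are hypothetical names introduced for illustration; html.unescape from the standard library is assumed here as a stand-in for the parser's replacement step.

```python
import html


def to_decimal_reference(char: str) -> str:
    """Return the decimal character reference for a single character."""
    return f"&#{ord(char)};"


def to_hex_reference(char: str) -> str:
    """Return the hexadecimal character reference for a single character."""
    return f"&#x{ord(char):X};"


print(to_decimal_reference("é"))  # decimal form
print(to_hex_reference("é"))      # hexadecimal form

# html.unescape resolves both forms back to the same character.
print(html.unescape("&#233;") == html.unescape("&#xE9;") == "é")
```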

Replacement of entities with characters

A character entity in the source text need not be replaced 1:1 by a single character. In European scripts (Latin, Greek), diacritical marks are common.

Example:
The "é" character can optionally be defined as
  1. <!ENTITY eacute "&#233;">
  2. <!ENTITY eacute "&#xE9;">  -  (hexadecimal)
  3. <!ENTITY eacute "é">
  4. <!ENTITY eacute "e&#x0301;">
  5. <!ENTITY Small_E_mit_Strich_drüber_nach_rechts_oben "e&#x02CA;">
In the first two definitions the named entity is replaced by a numeric entity, in the third by a single Unicode/ANSI character, and in the fourth by a combination of two characters: the base letter e followed by a combining acute accent.
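
The difference between the precomposed and the combined definition can be made visible in Python. The sketch below is an illustration under the assumption that html.unescape stands in for the parser; it uses the standard unicodedata module to show that the two results only compare equal after Unicode normalization.

```python
import html
import unicodedata

precomposed = html.unescape("&#xE9;")   # definitions 1 to 3: one code point
combined = html.unescape("e&#x0301;")   # definition 4: "e" + combining acute

# Number of code points in each result.
print(len(precomposed), len(combined))

# The raw strings differ, although they render alike.
print(precomposed == combined)

# NFC normalization composes the pair into the single code point.
print(unicodedata.normalize("NFC", combined) == precomposed)
```

Software that compares or searches such text therefore usually has to normalize it first.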

A base letter need not carry exactly one diacritical mark; several marks can be placed above, below and beside it.

In non-European writing systems there are also various ligatures, i.e. combinations of individual letters that merge into a single form; Devanagari and Tamil are examples. In other cases (for example in Arabic) the shape of the resulting character depends on the context and on the linguistic meaning, not just on the sequence of numerically coded individual characters, which software could convert easily. A corresponding example in German is the correct use of the long s and round s, or the prohibition of ff, fi and fl ligatures across syllable boundaries.

However, not every combination of several elements that forms a character is registered with its own Unicode number. For this reason, users must still be able to declare specific characters as their own character entities. An entity can also be a reference to a graphic (a bitmap as well as SVG).

Example:
The entity &ko_37; is used in a collection of texts in Korean script. The publisher distributes the documents together with the following four DTDs.
  1. <!ENTITY ko_37 " &#12629;">
    <!ENTITY Encoding "UCS">  -  Unicode
  2. <!ENTITY ko_37 " yeo ">
    <!ENTITY Encoding "romanization">  -  Romanization
  3. <!ENTITY ko_37 "¤Å">
    <!ENTITY Encoding "EUC-KR">  -  EUC-KR
  4. <!ENTITY ko_37 "&#60;img src='ko_37.png'&#62;">
    <!ENTITY Encoding "graphic glyphs">  -  Replacement graphics
Throughout the body text, the characters are then written as &ko_nn;. At the beginning of each text there can be a note such as:
This document view is shown in &Encoding; (version: &koTXT-Version; - required: 1.2).
This tells readers which DTD is currently in use and can help with display problems.

Future of character entities

With the gradual spread of UTF-8, UTF-16, UCS-2 and UCS-4 in international IT applications, the need to encode characters as character entities is gradually decreasing. However, it will be many years before every communication protocol and software application worldwide can handle multi-byte characters without errors.

There therefore remains a need to fall back to 7-bit US-ASCII for data exchange by using numeric entities. The conversion is lossless in both directions, provided that the general entities are left untouched and that the character has a code point in the Universal Character Set at all.
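
A minimal Python sketch of this fallback, assuming the standard "xmlcharrefreplace" error handler for the outbound direction and html.unescape for the return direction (the sample text is made up for illustration):

```python
import html

text = "Русский café"

# Characters outside 7-bit ASCII are replaced by decimal character references.
ascii_only = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_only)

# Decoding the references restores the original text without loss.
restored = html.unescape(ascii_only)
print(restored == text)
```

Note that html.unescape also resolves named references such as &amp;. Text that contains general entities which must stay untouched would need a decoder restricted to numeric references.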

In the long term, representing well-defined individual characters as named entities will matter only when human editors read and write XML source text that contains characters from outside their own language environment (whether foreign-language or mathematical). It is to be expected that, when source text is viewed and edited, codes from problematic number ranges will be converted on the fly into named entities and, on saving, encoded again as numeric entities or directly as characters.

The naming scheme is then local to the processing tool and never leaves it; in addition to the common English names defined by SGML, German, French or Russian entity names could also be displayed.

Named character entities were a meaningful and necessary concept in SGML in 1986, under the conditions of the time. As conditions slowly change and user-friendly graphical input aids become available, this need no longer exists on modern systems, provided that the characters are defined in Unicode. This is the case with HTML, the most common application.

ISO standardized character names

SGML (1986)
Latin letters
isolat1   Added Latin 1
isolat2   Added Latin 2
isodia    Diacritical Marks
Graphics and Symbols
isonum    Numeric and Special Graphic
isopub    Publishing (Typographic)
isotech   General Technical
isobox    Box and Line Drawing
Added Mathematical Symbols
isoamsa   Arrow Relations
isoamsb   Binary Operators
isoamsc   Delimiters
isoamsn   Negated Relations
isoamso   Ordinary
isoamsr   Relations
Greek characters
isogrk1   Greek Letters
isogrk2   Monotoniko Greek
isogrk3   Greek Symbols
isogrk4   Alternative Greek Symbols
Cyrillic Characters
isocyr1   Russian Cyrillic
isocyr2   Non-Russian Cyrillic
Only the names and a description of each character were specified; code points could be assigned only later, using Unicode.
HTML 2 (1995)
  • Replacement characters for the HTML syntax: amp , lt , gt , quot
  • Named characters for ISO 8859-1 (i.e. codes 160 ... 255)
Their definitions are identical to SGML isolat1 (published as www.w3.org/TR/REC-html40/HTMLlat1.ent).
HTML 4 (1999)
Like HTML 2, but with 152 additional definitions for codes above 255; Unicode (UTF-8) is required to represent them.
Definitions available at
These URLs give the impression that an HTML browser would have to reload the definitions from the Internet constantly. That is not the case; the standard characters are hard-coded, and every HTML display program should "know" them.
XML (1998)
Only five general entities (amp, lt, gt, apos, quot) are predefined, as replacements for characters used in the XML syntax.
Users can define any entities themselves or include the SGML or HTML DTDs mentioned above.
XHTML (2000)
Like HTML 4, but also &apos; inherited from XML.
(see below)
MathML
Hundreds of special characters are defined, as required for mathematical formulas. MathML mostly uses its own names, which are almost always longer than those in HTML and SGML.
XML (2010)
From 2007 to 2010 all names in common use were collected and combined in one draft. A DTD maps 2237 names to character codes.
In particular, SGML (1986) and MathML are covered; this also includes HTML in its entirety. Where different sources mapped a name for the same purpose to several character codes, the most practicable variant was standardized.

Multiple names can be used for the same character:

Decimal  Character  Unicode  Entity definition
168      ¨          U+00A8   "die" SGML: isodia
                             "Dot" SGML: isotech
                             "uml" HTML 2, SGML: isodia
913      Α          U+0391   "Agr" SGML: isogrk1
                             "Alpha" HTML 4
8598     ↖          U+2196   "nwarr" SGML: isoamsa (north west arrow)
                             &#x2196; HTML
                             "UpperLeftArrow" MathML
                             "nwarrow" MathML

The glyph "Α" alone does not reveal whether it is a Greek capital alpha or a Latin A.
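
Such synonyms can be listed programmatically. The following Python sketch assumes the HTML5 name table shipped in the standard library (html.entities.html5) as the data source; it is not the SGML entity sets themselves, so names that exist only in SGML may not appear. The function name names_for is hypothetical.

```python
from html.entities import html5


def names_for(char: str) -> list[str]:
    """Return all HTML5 entity names that map to the given character."""
    return sorted(
        {name.rstrip(";") for name, value in html5.items() if value == char}
    )


# The three characters discussed above: diaeresis, capital alpha, NW arrow.
for char in ("\u00A8", "\u0391", "\u2196"):
    print(f"U+{ord(char):04X}", names_for(char))
```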

Annotation

Occasionally the objection is raised that mnemonic entities make work unnecessarily complicated, because the corresponding DTDs have to be agreed on and provided, and that one should instead type the correct characters directly or work only with numeric entities.

An example using SGML isocyr1, for comparison:

Р у с с к и й
&Rcy; &ucy; &scy; &scy; &kcy; &icy; &jcy;
Russky
&#1056; &#1091; &#1089; &#1089; &#1082; &#1080; &#1081;
&#x0420; &#x0443; &#x0441; &#x0441; &#x043A; &#x0438; &#x0439;
= Русский
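
That the named, decimal and hexadecimal spellings denote the same text can be checked with a short Python sketch. It assumes html.unescape, whose HTML5 name table happens to include the isocyr1 names used here; an SGML system would take them from the DTD instead.

```python
import html

named = "&Rcy;&ucy;&scy;&scy;&kcy;&icy;&jcy;"
decimal = "&#1056;&#1091;&#1089;&#1089;&#1082;&#1080;&#1081;"
hexadecimal = "&#x0420;&#x0443;&#x0441;&#x0441;&#x043A;&#x0438;&#x0439;"

# Decode all three spellings; a set collapses identical results.
decoded = {html.unescape(s) for s in (named, decimal, hexadecimal)}
print(decoded)
```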

It can make sense to convert the named entities automatically into numeric form after editing and to pass the text on to others in this format, but to display the numeric entities mnemonically again the next time human editors change the text.
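
Such a round trip might look like the following Python sketch. It is an illustration under stated assumptions: the name table is the HTML 4 set from html.entities, the function names are hypothetical, and the syntax entities (amp, lt, gt, quot) are deliberately left alone.

```python
import re
from html.entities import codepoint2name, name2codepoint

# Entities that protect markup syntax are never converted.
SYNTAX_NAMES = {"amp", "lt", "gt", "quot"}


def named_to_numeric(text: str) -> str:
    """Replace known named references with decimal character references."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in name2codepoint and name not in SYNTAX_NAMES:
            return f"&#{name2codepoint[name]};"
        return match.group(0)  # unknown or syntax names stay untouched
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


def numeric_to_named(text: str) -> str:
    """Replace decimal references with a mnemonic name where one exists."""
    def replace(match: re.Match) -> str:
        name = codepoint2name.get(int(match.group(1)))
        if name is not None and name not in SYNTAX_NAMES:
            return f"&{name};"
        return match.group(0)  # no mnemonic available
    return re.sub(r"&#([0-9]+);", replace, text)


edited = "caf&eacute; &amp; &alpha;"
exchanged = named_to_numeric(edited)   # form passed on to others
print(exchanged)
print(numeric_to_named(exchanged) == edited)
```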

Representation as entities also has the advantage that characters with different meanings that look very similar when rendered (e.g. apostrophe, accent, single quote, quotation marks) can be clearly distinguished.

XHTML

XHTML contains all of the definitions from HTML 4.0, and every implementation must know all named entities (they are usually hard-coded). This further development of HTML affects the internal format and structure of the elements (tags), but not the body text and not the entities.

However, in the mid-2000s problems in communicating with web servers increased: servers no longer delivered documents with the MIME type text/html, but as application/xml, text/xml and others. At the time, this did lead to display problems when (older) browsers no longer recognized the text as HTML.

There are also XML applications that work with text passages and are modeled on the comparable and familiar HTML elements. The most common current example is RSS web feeds (news). Like HTML, they contain <p>, <span>, <div> and also <head> / <body>. The source text therefore looks as if it were HTML. However, since it is not an HTML document at all, named entities cannot be used, unless the relevant DTD has been included or the display software (usually a web browser) applies the well-known definitions of its own accord.
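
The behaviour of a plain XML parser can be observed with a small Python sketch. It assumes the standard-library xml.etree.ElementTree parser as a representative non-HTML processor; the helper name try_parse is hypothetical.

```python
import xml.etree.ElementTree as ET


def try_parse(source: str):
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(source).text
    except ET.ParseError as error:
        print("rejected:", error)
        return None


# Predefined XML entities and numeric references need no declaration.
print(try_parse("<p>caf&#233; &amp; bar</p>"))

# An HTML name such as &eacute; is undefined in plain XML.
print(try_parse("<p>caf&eacute;</p>"))

# Declaring the entity in an internal DTD subset makes the same text parse.
print(try_parse('<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>caf&eacute;</p>'))
```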

Parameter entities

Parameter entities are a special case in SGML, XML and related languages. They may not be used in documents, only within the DTD. Otherwise they have the same syntax, except that % takes the place of & at the beginning.

Syntax of the declaration:

<!ENTITY % Name SYSTEM "externe.datei" >

Syntax of the reference (calling the entity):

%Name;

References

  1. Goldfarb et al.: XML in Office 2003, Pearson, 2004, pp. 320-322.
  2. ISO 8879:1986-10. In: www.din.de. Retrieved December 4, 2016.
  3. Extensible Markup Language (XML) 1.0 (Fifth Edition). In: www.w3.org. Retrieved December 4, 2016.
  4. HTMLlat1.ent (English, ENT). www.w3.org/TR/REC-html40/HTMLlat1.ent. Retrieved March 29, 2019.
  5. A more easily readable resource is available under Character entity references in HTML 4 (also W3C).
  6. Most recent version: April 10, 2014, W3C Recommendation. The document thus had the status of a recommendation.