Special character

from Wikipedia, the free encyclopedia

In typography and digital data processing, a special character is a character that is neither a letter nor a digit.

Special characters include punctuation marks (sentence marks and word marks) and scientific and technical symbols. Diacritics such as the acute or the breve (é, ă) are also special characters.

Differing and fluctuating meanings

Non-printing characters that serve as an orientation aid when designing a print template, such as spaces, are sometimes counted among the special characters and sometimes not.

It is somewhat unclear whether umlauts, for example, are special characters. Under the definition given, this depends on whether “Ä” is an independent letter distinct from “A” (cf. German alphabet, section “Controversial number of letters”). In Finnish and Estonian, by contrast, Ä is a separate letter. In some cases, digits are also counted among the special characters.

Greek letters can be symbols if they are used not to form Greek words but as variables (e.g. σ for the standard deviation in statistics) or constants (e.g. π for the circle constant).

Input methods (on computer keyboards) are often described as “entering special characters” (for example on websites titled “special characters”, see Web links). In that usage the term covers the input of every character that has no ASCII code, including the letters of languages other than German. The Danish lowercase letter ø is often explicitly mentioned as an example.

Special characters and technology

In the early days of information technology, character sets were limited to 7 or 8 bits for technical reasons. To avoid the many associated problems (for example, a character had to be dropped in ISO 8859-15, an 8-bit extension of ASCII, to make room for the euro sign), a larger number of bits per character is increasingly used today.

However, there is no clear connection between the term special character and advances in encoding technology. Of the 94 printable ASCII characters, 32 are special characters, i.e. roughly a third. Symbols for simple mathematical statements are already available in ASCII. As far as punctuation is concerned, Unicode (see below) adds, from a German perspective, only typographic variants of characters already encoded in ASCII: the horizontal strokes (hyphen, en dash, em dash, minus sign), the quotation marks and the ellipsis (all of which could previously be produced with TeX from 7-bit character sets). The terminology is unclear on whether the majority of the characters newly encoded compared to ASCII are special characters (e.g. umlauts, see above).
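The count can be checked with a short Python sketch. It assumes that "printable ASCII" means code positions 33 to 126 (the space at position 32 is left out) and that a special character is anything that is neither an ASCII letter nor an ASCII digit. The variable names are illustrative only.

```python
import string

# Printable ASCII without the space: code positions 33..126.
printable = [chr(code) for code in range(33, 127)]

# Assumption: "special" means neither an ASCII letter nor an ASCII digit.
alphanumeric = set(string.ascii_letters + string.digits)
specials = [ch for ch in printable if ch not in alphanumeric]

share = len(specials) / len(printable)
print(len(printable), len(specials), round(share, 2))  # 94 32 0.34
print("".join(specials))
```

Under these assumptions the share is about 34 percent, slightly more than a third.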

Compared to ASCII letters and digits, the use of ASCII special characters does not require any special technology. Most ASCII special characters (punctuation marks, mathematical characters) can be embedded in the source code of digital texts just as easily as letters and digits. However, in various technologies (file names, programming, URL encoding, further examples below), certain ASCII special characters have a special syntactic function (they are called “reserved characters”, for example), which makes it somewhat harder to output them literally. ASCII special characters are chosen for such purposes so that entering ordinary text is hampered as little as possible.

Another aspect is the keyboard layout. Even in the days of the typewriter, German and American keyboards differed mainly in the arrangement and presence of special characters. On computer keyboards, the major operating systems use keyboard shortcuts to extend the set of characters that can be inserted directly into the source text. Whether all additional characters available in this way are special characters is a question of terminology.
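URL encoding is one place where the reserved role of some ASCII special characters becomes visible. The following is a minimal sketch using Python's standard `urllib.parse` module. The sample string is invented for illustration.

```python
from urllib.parse import quote, unquote

# "&", "=" and "?" have a syntactic function in URLs, so they are
# percent-encoded when they are meant literally.
text = "a&b=c?d"
encoded = quote(text, safe="")
print(encoded)           # a%26b%3Dc%3Fd
print(unquote(encoded))  # a&b=c?d
```

Letters and digits pass through unchanged, and only the reserved characters need the longer notation.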

Regardless of the concept of special characters, it should be noted that some technologies were originally designed for ASCII characters only, albeit more for programmers than for users.

In the 80-character code of the IBM punch card, digits, letters and special characters were represented in different ways.

Unicode

On modern systems, even very obscure special characters can be used without much effort. Various methods have developed (out of necessity).

Unicode is considered the most modern and most general approach. Every character in the world, whether a recycling symbol or a Chinese character, has a place in the Unicode tables and is stored on a computer as a sequence of one or more bytes. Each Unicode character has its own number. The character tables include, for example:

  • U+0935 for the character व.
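The relationship between the Unicode number and the bytes in memory can be illustrated with a small Python sketch. The choice of UTF-8 as the byte encoding is an assumption made for the example, as other Unicode encodings produce different byte sequences.

```python
import unicodedata

ch = "\u0935"  # the character from the list above

print(f"U+{ord(ch):04X}")            # U+0935
print(unicodedata.name(ch))          # DEVANAGARI LETTER VA
print(ch.encode("utf-8").hex())      # e0a4b5 (three bytes in UTF-8)
print(len(ch.encode("utf-16-be")))   # 2 (two bytes in UTF-16)
```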

HTML

Character entities

Character entities make it possible to represent thousands of different characters in HTML files encoded in ASCII. In particular, letter variants, symbols and punctuation marks for which 7 bits are not sufficient can be displayed. The topic is dealt with more generally in the article Entities in markup languages.

Numeric character entities

In HTML, a character with the Unicode position NUM can be shown in the browser view by the code &#NUM; (NUM written in decimal) or alternatively by &#xHNUM;, where HNUM is the hexadecimal notation of NUM. For example, &#60; or &#x3C; stands for the mathematical “less than” character “<”, which has position 60 in ASCII as in Unicode. In this case one speaks of numeric character entities. They start with &# (the ampersand followed by the hash mark) and end with ; (semicolon). Both ASCII characters and practically all characters that could be called “special characters” can be represented in this way.
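The two notations can be sketched in Python as follows. The helper `numeric_entity` is a hypothetical name introduced for this illustration, while `html.unescape` comes from the Python standard library and resolves such references.

```python
import html

def numeric_entity(ch, hexadecimal=False):
    """Return the numeric character reference for a single character."""
    code = ord(ch)
    return f"&#x{code:X};" if hexadecimal else f"&#{code};"

print(numeric_entity("<"))                    # &#60;
print(numeric_entity("<", hexadecimal=True))  # &#x3C;

# Both notations resolve to the same character.
print(html.unescape("&#60;") == html.unescape("&#x3C;") == "<")  # True
```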

Named character entities and “HTML native” characters

Named character entities have been introduced for individual, particularly frequently used characters; their “names” are easy to remember. For example, the “less than” sign can also be represented by &lt;, where the “name” lt is an abbreviation for “less than”. The code again starts with & and ends with ;, but the hash mark is missing.

The above mainly concerns characters not encoded in ASCII. Of the 32 ASCII special characters, only three have to be treated like this:

  • the “less than” sign, see above
  • the “greater than” sign, the counterpart to the previous one; together they form the HTML “tags” (<ELTNAME ATTR>TEXT</ELTNAME>); it can be represented by &gt;
  • the &, which itself introduces an entity; it is represented by &amp;.

These characters are referred to as “HTML-specific” characters; they could also be called “reserved characters” (as in URL encoding ).

In attribute values, it can also be useful to replace " (the “makeshift double quotation mark”) with &quot; and ' (the “makeshift single quotation mark”) with &apos; (“apostrophe”). However, if high-quality typography is sought, these measures are not sufficient.

In any case, named character entities make it easier to create HTML files with a text editor. The characters represented in this way include letter variants (with diacritical marks), mathematical symbols (which can also be arrows and Greek letters), and typographic variants of punctuation marks (see punctuation marks). In 1995, “names” were introduced for the characters beyond ASCII in ISO 8859-1, and in 1999 further names for individual Unicode characters; see Named character entities in the article Entities in Markup Languages.
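As a sketch of how this replacement is typically automated, Python's standard library offers `html.escape`. Note that this particular function writes the single quotation mark as the numeric reference &#x27; rather than as &apos;. The sample string is invented.

```python
import html

raw = 'a < b & "c"'

# Only the three HTML-specific characters are replaced.
print(html.escape(raw, quote=False))  # a &lt; b &amp; "c"

# With quote=True (the default), quotation marks are replaced as well.
print(html.escape(raw))               # a &lt; b &amp; &quot;c&quot;
print(html.escape("'"))               # &#x27;
```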
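The mapping from entity names to Unicode positions can be inspected with Python's `html.entities` module, which ships a table of the HTML 4 names. The four names below are chosen only as examples of an HTML-specific character, a letter variant, a mathematical symbol and a currency sign.

```python
import html
from html.entities import name2codepoint

# Print each entity name with its Unicode position and the character itself.
for name in ("lt", "auml", "cup", "euro"):
    code = name2codepoint[name]
    print(f"&{name};  U+{code:04X}  {chr(code)}")

# A named entity and its numeric counterpart denote the same character.
print(html.unescape("&auml;") == html.unescape("&#228;"))  # True
```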

Specification of the source code coding

In addition, HTML viewers (browsers) can be instructed to interpret text that is not encoded in ASCII as intended by explicitly specifying the encoding of the source text in the file header:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

Either UTF-8 or one of the ISO 8859 variants can be specified. In both cases, character entity references are unnecessary; only &, <, > (and " or ') need attention.

Both methods - using entities and specifying the character encoding - can be used simultaneously without any problems.
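Why the declared encoding has to match the actual encoding of the file can be seen in a short Python sketch. The word "für" is an arbitrary example, and the sketch only models the byte level, not an actual browser.

```python
text = "für"

as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("iso-8859-1")
print(as_utf8)    # b'f\xc3\xbcr'  (ü takes two bytes)
print(as_latin1)  # b'f\xfcr'      (ü takes one byte)

# UTF-8 bytes read as ISO 8859-1 yield garbled text.
print(as_utf8.decode("iso-8859-1"))  # fÃ¼r

# ISO 8859-1 bytes read as UTF-8 are not even valid.
try:
    as_latin1.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```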

What is better?

The article Entities in Markup Languages discusses the two options presented for handling non-ASCII characters (be they letters, digits or special characters) in its sections Future of Character Entities and Note. (As of mid-February 2016.)

LaTeX

LaTeX, originally developed by the computer scientists Donald E. Knuth (TeX, for the American Mathematical Society) and Leslie Lamport (LaTeX), is popular for creating scientific documents.

Special characters without ASCII code

Character encoding

As with HTML, you can specify the character encoding of the source text in order to include umlauts and diacritical marks directly in the source code of a document, here with the help of a preamble line

\usepackage[utf8]{inputenc}

or alternatively with latin1 instead of utf8 when working with older source files encoded according to ISO 8859-1. Without the inputenc package, files using ASCII extensions cannot be processed (by default, LaTeX treats source files as encoded in ASCII), at least with Knuth's original TeX engine or with pdfTeX (pdflatex). XeTeX (xelatex) and LuaTeX interpret source files as UTF-8 in their default setting. With UTF-8 (i.e. Unicode), any symbol required in the various subject areas can in principle be inserted directly as a single character in the source code of a LaTeX document. Mathematical symbols (mathematics being the field for which TeX was originally created) make up a particularly large share of these “special characters” (characters not encoded by a single ASCII position). Typographic variants of the punctuation characters encoded in ASCII are also available (among 8-bit encodings, only the manufacturer-specific, non-standardized Windows-1252 offered typographic dashes).

Coding using ASCII combinations

Typographic quality has always been achievable with LaTeX without extending the character encoding. The en dash is obtained with the ASCII code --, the em dash with ---, and a typographically satisfactory ellipsis with \dots. The character originally intended as a grave accent (`) is used to represent an opening single quotation mark; for double quotation marks, the single quotation marks are doubled. Letter variants with diacritics were originally produced by shifting letters and diacritical glyphs that were provided separately in the fonts. The latter appear in the code (outside of formulas) as a combination of a leading backslash \ (hexadecimal 5C in ASCII) and another character, so that, for example, “Ä” is generated by \"{A}. With the additional macro package german you could type "A instead, which is shorter and more legible; the dots are then also placed a little lower than in English, which is typographically correct.

It is precisely such letter variants that can easily be inserted into a source file with keyboards designed for Latin alphabets, so these combination commands may have become obsolete thanks to ASCII extensions. On the other hand, source files have to be exchanged when texts are written jointly, and source files are sent to English-language journals or publishers. Since files encoded in ASCII, ISO 8859-1 and UTF-8 can still get mixed up in such cases, it may be advisable to continue using the combination commands.

LaTeX also automatically uses ligatures, which, however, are often unsuitable in German texts and must then be specifically suppressed.
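Converting directly typed letter variants back into combination commands, for instance before sending a file to a publisher, could be sketched in Python as follows. The function name `to_ascii_latex` and the small accent table are assumptions made for this illustration. The table covers only a few diacritics, and special cases such as the dotless i are ignored.

```python
import unicodedata

# Combining diacritic -> LaTeX accent command (deliberately incomplete).
ACCENT_COMMANDS = {
    "\u0300": "\\`",  # grave
    "\u0301": "\\'",  # acute
    "\u0302": "\\^",  # circumflex
    "\u0303": "\\~",  # tilde
    "\u0306": "\\u",  # breve
    "\u0308": '\\"',  # diaeresis / umlaut
}

def to_ascii_latex(text):
    """Replace base letter + single diacritic by a LaTeX accent command."""
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in ACCENT_COMMANDS:
            out.append(ACCENT_COMMANDS[decomposed[1]] + "{" + decomposed[0] + "}")
        else:
            out.append(ch)  # unchanged, including characters not in the table
    return "".join(out)

print(to_ascii_latex("Ärger"))  # \"{A}rger
print(to_ascii_latex("é ă"))    # \'{e} \u{a}
```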

In addition, LaTeX fonts with a total of thousands of subject-specific symbols are available from the Comprehensive TeX Archive Network or via TeX distributions, combined with macro packages that offer a command for each symbol, consisting of a leading backslash and ASCII letters (see Web links). These symbols therefore have a position in a font managed by a single creator (or a small team), not (necessarily) in a system managed by a standards body. For some individual Unicode code points, several TeX or LaTeX packages offer different glyph designs (e.g. for the euro symbol). Like the “named entities” in HTML, the letter sequences are chosen according to mnemonic criteria; in some cases the “names” match those in HTML, e.g. \cup and &cup; for the set union symbol.

One advantage sometimes cited for entering symbols in ASCII, as opposed to inserting Unicode characters directly via keyboard shortcuts, a character table or a toolbar, is that the author can concentrate largely on the content of the text while the fingers move over the keyboard in an uninterrupted flow and without conscious control, as in touch typing or when playing the piano. For frequently needed commands, you can introduce a shorter alias command with \newcommand or \renewcommand, unlike in HTML with its rigidly prescribed syntax.

ASCII special characters

To make typing easier and to improve the legibility of the code, 10 of the ASCII special characters, namely \ { } $ & # ^ _ ~ %, are “misappropriated” or “reserved” (function characters), e.g. in m$^2$ (result “m²”), for which you type m&sup2; or m<sup>2</sup> in HTML. In order to display them as in plain ASCII, you can “mask” them with the backslash, except for \ and ~ (which can be generated by longer commands depending on the context); for example, you type \$ for the dollar symbol $.

In LaTeX, some commands check for a following left square bracket [ or star *. In special cases this causes difficulties, for example if you want to start a new line with a square bracket. Instead of \\[ it is better to type \\{}[.
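Masking the function characters in arbitrary text can be sketched in Python as follows. The function name `escape_latex` is hypothetical. The longer commands \textbackslash, \textasciitilde and \textasciicircum are one common choice and are used here as an assumption, including for ^, since \^ on its own acts as an accent command.

```python
# Function character -> LaTeX code that prints it literally.
LATEX_ESCAPES = {
    "\\": "\\textbackslash{}",
    "~": "\\textasciitilde{}",
    "^": "\\textasciicircum{}",
    "{": "\\{",
    "}": "\\}",
    "$": "\\$",
    "&": "\\&",
    "#": "\\#",
    "_": "\\_",
    "%": "\\%",
}

def escape_latex(text):
    """Mask LaTeX function characters, one input character at a time."""
    # Working per character avoids escaping backslashes that were
    # themselves introduced by an earlier replacement.
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)

print(escape_latex("50% of $10"))  # 50\% of \$10
print(escape_latex("a_b & c"))     # a\_b \& c
```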

Punycode

In order to be able to represent umlauts and other special characters in domain names, the Punycode procedure was developed, which together with Nameprep forms the standard for internationalized domain names (IDN). Non-ASCII characters are left out of the name, and an ASCII representation of them is appended to the end of the word after a hyphen.
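Python's standard library includes a punycode codec and an idna codec (the latter following the older IDNA 2003 rules), which allows a minimal illustration. The domain name below is an invented example.

```python
label = "bücher"

# Punycode alone: the ü is dropped and its encoding follows the hyphen.
puny = label.encode("punycode").decode("ascii")
print(puny)  # bcher-kva

# The idna codec applies Nameprep and adds the "xn--" prefix per label.
domain = "bücher.example".encode("idna").decode("ascii")
print(domain)  # xn--bcher-kva.example

# The conversion is reversible.
print(domain.encode("ascii").decode("idna"))  # bücher.example
```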

Web links

Wiktionary: special characters  - explanations of meanings, word origins, synonyms, translations

LaTeX

Wikibooks: LaTeX Compendium: Special Characters  - Learning and Teaching Materials