American Standard Code for Information Interchange

from Wikipedia, the free encyclopedia

The American Standard Code for Information Interchange (ASCII, alternatively US-ASCII, often pronounced [ˈæski]) is a 7-bit character encoding; it corresponds to the US version of ISO 646 and serves as the basis for later encodings of character sets that use more bits.

The ASCII code was first approved by the American Standards Association (ASA) on June 17, 1963, as standard ASA X3.4-1963. It was substantially revised in 1967/1968, was last updated by the ASA's successor institutions in 1986 (ANSI X3.4-1986), and is still in use today. The character encoding defines 128 characters: 33 non-printable characters and the following 95 printable characters, starting with the space:

 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~

The printable characters comprise the Latin alphabet in upper and lower case, the ten Arabic numerals, and a number of punctuation marks and other special characters. The character set largely corresponds to that of a keyboard or typewriter for the English language. Computers and other electronic devices that display text usually store it in ASCII or in a backwards-compatible encoding (ISO 8859, Unicode).

The non-printable control characters include output control characters such as line feed or horizontal tab, protocol characters such as end of transmission or acknowledge, and separators such as the record separator.

Coding

Letters as 7-bit code
ASCII Dec Hex Binary
A 65 41 (0) 100 0001
B 66 42 (0) 100 0010
C 67 43 (0) 100 0011
... ... ... ...
Z 90 5A (0) 101 1010

A bit pattern of 7 bits is assigned to each character. Since each bit can take on two values, there are 2^7 = 128 different bit patterns, which can also be interpreted as the integers 0 to 127 (hexadecimal 00h to 7Fh).
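
As a small illustration of this mapping, the following Python sketch prints the decimal, hexadecimal, and 7-bit binary form of a few letters. The helper name `seven_bit` is hypothetical and chosen only for this example.

```python
def seven_bit(ch: str) -> str:
    """Return the 7-bit pattern of an ASCII character as a string of 0s and 1s."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")


# Print character, decimal code, hexadecimal code, and 7-bit pattern.
for ch in "ABCZ":
    print(ch, ord(ch), format(ord(ch), "02X"), seven_bit(ch))

# Seven bits give 2**7 = 128 patterns, numbered 0 to 127 (00h to 7Fh).
NUMBER_OF_PATTERNS = 2 ** 7
```

The printed rows should agree with the letter table at the start of this section.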

The eighth bit, which ASCII does not use, can serve for error detection (as a parity bit) on communication lines or for other control tasks. Today it is almost always used to extend ASCII to an 8-bit code. These extensions are largely compatible with the original ASCII, so that all characters defined in ASCII are encoded with the same bit pattern in the various extensions. The simplest extensions are encodings with language-specific characters that are not part of the basic Latin alphabet; see the section on extensions below.
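
The use of the eighth bit as a parity bit can be sketched as follows. This is a minimal illustration that assumes even parity stored in the most significant bit; real communication lines may use odd parity or other conventions, and the function names are hypothetical. A single parity bit can reveal a one-bit error but cannot locate or repair it.

```python
def add_even_parity(code: int) -> int:
    """Store an even-parity bit in bit 7 of a 7-bit ASCII code."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit code")
    ones = bin(code).count("1")
    parity = ones % 2  # 1 if the 7-bit code has an odd number of one-bits
    return code | (parity << 7)


def parity_ok(byte: int) -> bool:
    """Return True if the 8-bit value contains an even number of one-bits."""
    return bin(byte).count("1") % 2 == 0


sent = add_even_parity(ord("C"))  # 'C' is 100 0011, three one-bits
received = sent ^ 0b0000_0100     # simulate a single flipped bit in transit
print(hex(sent), parity_ok(sent), parity_ok(received))
```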

Composition

ASCII character table, hexadecimal numbering
Code …0 …1 …2 …3 …4 …5 …6 …7 …8 …9 …A …B …C …D …E …F
0… NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1… DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2… SP ! " # $ % & ' ( ) * + , - . /
3… 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4… @ A B C D E F G H I J K L M N O
5… P Q R S T U V W X Y Z [ \ ] ^ _
6… ` a b c d e f g h i j k l m n o
7… p q r s t u v w x y z { | } ~ DEL

The first 32 ASCII character codes (00 hex to 1F hex) are reserved for control characters; the abbreviations in the table above are their names. These codes do not represent written symbols but serve (or served) to control devices that use ASCII, such as printers. Examples are the carriage return used for line breaks and the bell (BEL); their definitions have historical origins.

Code 20 hex (SP) is the space (also called blank), which separates words in a text and is produced by the space bar on the keyboard.

The codes 21 hex to 7E hex stand for printable characters: letters, digits, and punctuation marks. The only letters are the upper-case and lower-case letters of the Latin alphabet. Letter variants used in non-English languages, for example the German umlauts, are not included in the ASCII character set. Typographically correct dashes and quotation marks are also missing; the typography is limited to what a typewriter offers. The purpose was information interchange, not printing.

Code 7F hex (all seven bits set to one) is a special character known as the delete character (DEL). In the past, this code was used like a control character to delete a character already punched into paper tape or punched cards: setting all bits, i.e. punching out all seven positions, overwrote it. This was the only way to erase, since holes cannot be undone once punched. Areas without holes (i.e. with code 00 hex, NUL) were mainly found at the beginning and end of a punched tape.

For this reason the original ASCII effectively had only 126 characters, because the bit patterns 0 (0000000) and 127 (1111111) did not correspond to any character. Code 0 was later interpreted in the C programming language as the end of a character string; various graphic symbols have been assigned to code 127.
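
The code ranges described in this section can be summarized in a short Python sketch. The function name `classify` and the category labels are hypothetical; the boundaries are the ones given above (00 to 1F hex, 20 hex, 21 to 7E hex, 7F hex).

```python
from collections import Counter


def classify(code: int) -> str:
    """Assign a 7-bit code to one of the ranges described in the text."""
    if not 0 <= code <= 0x7F:
        return "not ASCII"
    if code <= 0x1F:
        return "control"
    if code == 0x20:
        return "space"
    if code <= 0x7E:
        return "printable"
    return "delete"  # only 0x7F remains


counts = Counter(classify(code) for code in range(128))
print(counts)
```

Counting the space among the printable characters and DEL among the non-printable ones gives the 95 printable and 33 non-printable characters mentioned in the introduction.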

History

Teletype

Morse code was an early form of character encoding. With the introduction of teleprinters it was displaced from the telegraph networks and replaced by the Baudot code and the Murray code. From the 5-bit Murray code it was only a small step to 7-bit ASCII; ASCII was also first used in certain American teleprinter models, such as the Teletype ASR33.

Dec Hex ASCII 1963 ASCII 1965 ASCII today
0-63 00-3F see the character table above
64 40 @ ` @
65-91 41-5B see the character table above
92 5C \ ~ \
93 5D see the character table above
94 5E ↑ ^ ^
95 5F ← _ _
96 60 unassigned @ `
97-122 61-7A unassigned a-z a-z
123 7B unassigned { {
124 7C unassigned ¬ |
125 7D unassigned } }
126 7E ESC | ~
127 7F see the character table above

The first version, still without lowercase letters and with small deviations from today's ASCII for the control and special characters, was created in 1963.

The second version of the ASCII standard followed in 1965. Although the standard was approved, it was never published and therefore never applied. The reason was that the ASA had been informed that the ISO (International Organization for Standardization) was standardizing a character set that was similar to this standard but slightly incompatible with it.

The version of the ASCII standard that is still valid today was established in 1968. This version is the basis of the Caesar cipher ROT47, an extension of ROT13. While ROT13 rotates only the Latin alphabet by half its length, ROT47 rotates all ASCII characters between 33 (!) and 126 (~).
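
A minimal Python sketch of ROT47 as described here: the 94 characters from 33 to 126 are rotated by 47 positions, and everything else is left unchanged. The function name `rot47` is chosen for this example.

```python
def rot47(text: str) -> str:
    """Rotate the ASCII characters 33 to 126 by 47 positions; keep all others."""
    result = []
    for ch in text:
        code = ord(ch)
        if 33 <= code <= 126:
            # 94 characters in the range, so rotating by 47 twice is the identity.
            result.append(chr(33 + (code - 33 + 47) % 94))
        else:
            result.append(ch)
    return "".join(result)


message = "Hello, World!"
scrambled = rot47(message)
print(scrambled)
print(rot47(scrambled))
```

Because 47 is half of 94, applying the function twice restores the original text, just as with ROT13 and the 26 letters.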

Computers

At the beginning of the computer age, ASCII became the standard code for characters. For example, many terminals (such as the VT100) and printers were controlled solely with ASCII.

For encoding Latin characters, mainframes use almost exclusively EBCDIC, an 8-bit encoding incompatible with ASCII that IBM developed in parallel with ASCII for its System/360; at that time it was a serious competitor. Handling the alphabet is more difficult in EBCDIC because the letters are spread over several non-contiguous code ranges. IBM itself used ASCII for internal documents. ASCII was strengthened by President Lyndon B. Johnson's 1968 order to use it in government agencies.

Use for other languages

With the International Alphabet 5 (IA5), a 7-bit encoding based on ASCII was standardized as ISO 646 in 1963. The reference version (ISO 646-IRV) corresponds to ASCII except for one position. To make it possible to represent letters and special characters of different languages (for example the German umlauts), 12 character positions were set aside for redefinition (#$@[\]^`{|}~). The national characters and the ASCII characters they replace cannot be displayed at the same time. When software was not adapted to the variant used for display, the results were often unintentionally funny; for example, when an Apple II was switched on, "APPLE ÜÄ" appeared instead of "APPLE ][".
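
The effect of such a national variant can be imitated with a small translation table. The sketch below assumes the replacements of the German variant (DIN 66003); the table and the function name are written out by hand for this illustration and do not come from a standard library codec.

```python
# Assumed replacements of the German 7-bit variant (DIN 66003):
# the same code positions are simply drawn as different glyphs.
GERMAN_VARIANT = str.maketrans({
    "@": "§",
    "[": "Ä",
    "\\": "Ö",
    "]": "Ü",
    "{": "ä",
    "|": "ö",
    "}": "ü",
    "~": "ß",
})


def show_on_german_device(ascii_text: str) -> str:
    """Return what a device using the German variant would display."""
    return ascii_text.translate(GERMAN_VARIANT)


print(show_on_german_device("APPLE ]["))
print(show_on_german_device("a[i] = {0};"))
```

The second call hints at why program source code containing brackets and braces became hard to read on such devices.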

Since some of these characters are used in programming, in particular the various brackets, programming languages were retrofitted for internationalization with substitute combinations (digraphs). Only characters from the invariant part of ISO 646 were used for them. The combinations are language-specific. For example, in Pascal (* and *) correspond to the curly brackets ({ }), while C provides <% and %> for them.

Extensions

Use of the remaining 128 positions in the byte

To overcome the incompatibilities of national 7-bit variants of ASCII, various manufacturers first developed their own ASCII-compatible 8-bit codes (i.e. codes that match ASCII in the first 128 positions). Code page 437 was long the most widely used; it was used on the IBM PC under English-language MS-DOS and is still used in the DOS window of English-language Microsoft Windows. In German installations, the Western European code page 850 has been the standard since MS-DOS 3.3.

Eight bits were also used in later standards such as ISO 8859. There are several variants, for example ISO 8859-1 for the Western European languages, which was adopted in Germany as DIN 66303. German-language versions of Windows (except in DOS windows) use the Windows-1252 encoding, which is based on ISO 8859-1. This is why, for example, the German umlauts are displayed incorrectly when text files created under DOS are viewed under Windows.
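
The umlaut problem can be reproduced with Python's built-in codecs `cp437` and `cp1252`. The sketch below is only an illustration: the function name is hypothetical, and `errors="replace"` is used because a few byte values have no assigned character in Python's `cp1252` codec.

```python
def view_dos_text_in_windows(text: str) -> str:
    """Encode text as a DOS program would, then decode it as Windows-1252."""
    dos_bytes = text.encode("cp437")
    return dos_bytes.decode("cp1252", errors="replace")


original = "Käse"
dos_bytes = original.encode("cp437")
shown = view_dos_text_in_windows(original)
print(dos_bytes, shown)
```

The ASCII letters survive unchanged, while the byte that stands for the umlaut in code page 437 is shown as a different character under Windows-1252.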

Beyond 8 bits

Many older programs that used the eighth bit for their own purposes could not handle these 8-bit extensions. Over time, they were often adapted to the new requirements.

Even 8-bit codes, in which one byte stands for one character, offered too little space to accommodate the characters of all the world's writing systems at the same time. This made several different specialized extensions necessary. In addition, there are some ASCII-compatible codes, especially for the East Asian region, which either switch between different code tables or require more than one byte for each non-ASCII character. However, none of these 8-bit extensions is "ASCII", because that term denotes only the uniform 7-bit code.

To meet the requirements of the various languages, Unicode (identical in its character set to ISO 10646) was developed. It uses up to 32 bits per character and could thus distinguish more than four billion characters, but it is restricted to about one million permitted code points. This allows all characters used by humans so far to be represented, provided they have been included in the Unicode standard. UTF-8 is an 8-bit encoding of Unicode that is backwards compatible with ASCII; one character occupies one to four bytes. Seven-bit variants no longer need to be used, but Unicode can also be encoded in seven bits with UTF-7. UTF-8 has become the standard on many operating systems. For example, Apple's macOS and some Linux distributions use UTF-8 by default, and more than 90% of websites are created in UTF-8.
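
The variable length of UTF-8 and its compatibility with ASCII can be observed directly in Python. The sample characters below are arbitrary choices for this illustration.

```python
# One sample character for each possible UTF-8 length (1 to 4 bytes).
samples = ["A", "ä", "€", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} takes {len(encoded)} byte(s): {encoded.hex(' ')}")

# Pure ASCII text yields identical bytes in both encodings.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)
```

Every byte of a multi-byte UTF-8 sequence has its highest bit set, so such bytes can never be mistaken for ASCII characters.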

Formatting marks compared to markup languages

ASCII contains only a few characters that are generally used for formatting or structuring text; these emerged from the control commands of teleprinters. They include in particular the line feed, the carriage return, the horizontal tab, the form feed, and the vertical tab. In typical ASCII text files, apart from the printable characters, there is usually only the carriage return or the line feed to mark the end of a line. DOS and Windows systems usually use both in succession, older Apple and Commodore computers (except the Amiga) use only the carriage return, and Unix-like and Amiga systems use only the line feed. The use of further characters for text formatting varies. Markup languages such as HTML are now more commonly used to format text.
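
Because of these different line-end conventions, programs that exchange text files often normalize them. The following sketch converts all three conventions to a single line feed; the function name is hypothetical, and it works on bytes so that no encoding beyond ASCII compatibility is assumed.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR LF (DOS/Windows) and lone CR line ends to LF (Unix)."""
    # Replace the two-byte sequence first so that it does not become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


mixed = b"DOS line\r\nold Mac line\rUnix line\n"
print(normalize_newlines(mixed))
```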

Compatible character encodings

Most character encodings are designed so that they use the same codes as ASCII for the values 0 to 127 and use the range above 127 for other characters.
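
Whether a given codec follows this design can be checked by decoding the byte values 0 to 127 and comparing the result with ASCII. The sketch below uses codec names from Python's standard library; the function name is hypothetical, and the check covers only the decoding direction.

```python
def is_ascii_compatible(codec: str) -> bool:
    """Return True if the codec decodes the bytes 0 to 127 exactly like ASCII."""
    ascii_bytes = bytes(range(128))
    try:
        return ascii_bytes.decode(codec) == ascii_bytes.decode("ascii")
    except UnicodeDecodeError:
        return False


for name in ("cp437", "cp850", "cp1252", "latin-1", "utf-8", "cp037", "utf-16-le"):
    print(name, is_ascii_compatible(name))
```

In this list, `cp037` (an EBCDIC code page) and `utf-16-le` serve as counterexamples that do not keep the ASCII byte values.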

Fixed length codings (selection)

Each character occupies a fixed number of bytes. In most encodings this is one byte per character, known as a single-byte character set (SBCS). For the East Asian scripts there are two or more bytes per character, which means that these encodings are no longer ASCII-compatible. The compatible SBCS character sets correspond to the ASCII extensions discussed above:

MS-DOS code pages
437 English
708 Arabic (ASMO)
720 Arabic (Microsoft)
737 Greek
775 Baltic
850 Western European
852 Central European
855 Cyrillic
857 Turkish
858 Western European with euro
860 Portuguese
861 Icelandic
862 Hebrew
863 Canadian French
864 Arabic (IBM)
865 Nordic
866 Russian
869 Greek
Windows code pages
0874 Thai
0932 Japanese
0936 Simplified Chinese
0949 Korean
0950 Traditional Chinese
1250 Central European
1251 Cyrillic
1252 Western European
1253 Greek
1254 Turkish
1255 Hebrew
1256 Arabic
1257 Baltic
1258 Vietnamese
ISO 8859
-1 Latin-1, Western European
-2 Latin-2, Central European
-3 Latin-3, Southern European
-4 Latin-4, Northern European
-5 Cyrillic
-6 Arabic
-7 Greek
-8 Hebrew
-9 Latin-5, Turkish
-10 Latin-6, Nordic
-11 Thai
-12 (does not exist)
-13 Latin-7, Baltic
-14 Latin-8, Celtic
-15 Latin-9, Western European
-16 Latin-10, Southeast European

Variable length codings

To be able to encode more characters, the characters 0 to 127 are encoded in one byte, while other characters are encoded as several bytes with values greater than 127, as in UTF-8.

ASCII table

In addition to the hexadecimal codes, the following table also shows the decimal and octal codes.

Dec Hex Oct ASCII
0 00 000 NUL
1 01 001 SOH
2 02 002 STX
3 03 003 ETX
4 04 004 EOT
5 05 005 ENQ
6 06 006 ACK
7 07 007 BEL
8 08 010 BS
9 09 011 HT
10 0A 012 LF
11 0B 013 VT
12 0C 014 FF
13 0D 015 CR
14 0E 016 SO
15 0F 017 SI
16 10 020 DLE
17 11 021 DC1
18 12 022 DC2
19 13 023 DC3
20 14 024 DC4
21 15 025 NAK
22 16 026 SYN
23 17 027 ETB
24 18 030 CAN
25 19 031 EM
26 1A 032 SUB
27 1B 033 ESC
28 1C 034 FS
29 1D 035 GS
30 1E 036 RS
31 1F 037 US
Dec Hex Oct ASCII
32 20 040 SP
33 21 041 !
34 22 042 "
35 23 043 #
36 24 044 $
37 25 045 %
38 26 046 &
39 27 047 '
40 28 050 (
41 29 051 )
42 2A 052 *
43 2B 053 +
44 2C 054 ,
45 2D 055 -
46 2E 056 .
47 2F 057 /
48 30 060 0
49 31 061 1
50 32 062 2
51 33 063 3
52 34 064 4
53 35 065 5
54 36 066 6
55 37 067 7
56 38 070 8
57 39 071 9
58 3A 072 :
59 3B 073 ;
60 3C 074 <
61 3D 075 =
62 3E 076 >
63 3F 077 ?
Dec Hex Oct ASCII
64 40 100 @
65 41 101 A
66 42 102 B
67 43 103 C
68 44 104 D
69 45 105 E
70 46 106 F
71 47 107 G
72 48 110 H
73 49 111 I
74 4A 112 J
75 4B 113 K
76 4C 114 L
77 4D 115 M
78 4E 116 N
79 4F 117 O
80 50 120 P
81 51 121 Q
82 52 122 R
83 53 123 S
84 54 124 T
85 55 125 U
86 56 126 V
87 57 127 W
88 58 130 X
89 59 131 Y
90 5A 132 Z
91 5B 133 [
92 5C 134 \
93 5D 135 ]
94 5E 136 ^
95 5F 137 _
Dec Hex Oct ASCII
96 60 140 `
97 61 141 a
98 62 142 b
99 63 143 c
100 64 144 d
101 65 145 e
102 66 146 f
103 67 147 g
104 68 150 h
105 69 151 i
106 6A 152 j
107 6B 153 k
108 6C 154 l
109 6D 155 m
110 6E 156 n
111 6F 157 o
112 70 160 p
113 71 161 q
114 72 162 r
115 73 163 s
116 74 164 t
117 75 165 u
118 76 166 v
119 77 167 w
120 78 170 x
121 79 171 y
122 7A 172 z
123 7B 173 {
124 7C 174 |
125 7D 175 }
126 7E 176 ~
127 7F 177 DEL
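
A table of this kind can be regenerated with a few lines of Python, which is a convenient way to check individual rows. The sketch below is one possible approach; the names `CONTROL_NAMES` and `ascii_row` are chosen for this example, and the abbreviations are those used in the table above.

```python
# Abbreviations of the 32 control characters, in code order (0 to 31).
CONTROL_NAMES = (
    "NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI "
    "DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US"
).split()


def ascii_row(code: int) -> str:
    """Format one table row: decimal, hexadecimal, octal, and character name."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit code")
    if code < 32:
        name = CONTROL_NAMES[code]
    elif code == 32:
        name = "SP"
    elif code == 127:
        name = "DEL"
    else:
        name = chr(code)
    return f"{code} {code:02X} {code:03o} {name}"


print("Dec Hex Oct ASCII")
for code in range(128):
    print(ascii_row(code))
```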

Eponyms

The asteroid (3568) ASCII, discovered in 1936, was named after the character encoding in 1988.

Editions

  • American Standards Association: American Standard Code for Information Interchange. ASA X3.4-1963. American Standards Association, New York 1963 (PDF, 11 pages; archived May 26, 2016, in the Internet Archive)
  • American Standards Association: American Standard Code for Information Interchange. ASA X3.4-1965. American Standards Association, New York 1965 (approved but not published)
  • United States of America Standards Institute: USA Standard Code for Information Interchange. USAS X3.4-1967. United States of America Standards Institute, 1967.
  • United States of America Standards Institute: USA Standard Code for Information Interchange. USAS X3.4-1968. United States of America Standards Institute, 1968.
  • American National Standards Institute: American National Standard for Information Systems. ANSI X3.4-1977. 1977.
  • American National Standards Institute: American National Standard for Information Systems. Coded character sets. 7-bit American National Standard Code for Information Interchange (7-bit ASCII). ANSI X3.4-1986. 1986.
  • Further revisions:
    • ANSI X3.4-1986 (R1992)
    • ANSI X3.4-1986 (R1997)
    • ANSI INCITS 4-1986 (R2002)
    • ANSI INCITS 4-1986 (R2007)
    • ANSI INCITS 4-1986 (R2012)

Literature

  • Jacques André: Caractères numériques: introduction. In: Cahiers GUTenberg. Volume 26, May 1997, ISSN 1257-2217, pp. 5-44 (in French).
  • Yannis Haralambous: Fonts & encodings. From Unicode to advanced typography and everything in between. Translated by P. Scott Horne. O'Reilly, Beijing et al. 2007, ISBN 978-0-596-10242-5 (in English).
  • Peter Karow: Digital Fonts. Presentation and formats. 2nd improved edition. Springer, Berlin et al. 1992, ISBN 3-540-54917-X.
  • Mai-Linh Thi Truong, Jürgen Siebert, Erik Spiekermann (Eds.): FontBook. Digital Typeface Compendium (= FontBook 4). 4th revised and expanded edition. FSI FontShop International, Berlin 2006, ISBN 3-930023-04-0 (in English).
