American Standard Code for Information Interchange
The American Standard Code for Information Interchange (ASCII, alternatively US-ASCII, often pronounced [ˈæski]) is a 7-bit character encoding; it corresponds to the US version of ISO 646 and serves as the basis for later character encodings that use more bits.
ASCII was first approved by the American Standards Association (ASA) on June 17, 1963 as the standard ASA X3.4-1963. It was substantially revised in 1967/1968, last updated by the ASA's successor institutions in 1986 (ANSI X3.4-1986), and is still in use today. The character encoding defines 128 characters: 33 non-printable characters and the following 95 printable characters, starting with the space:
! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
The printable characters comprise the Latin alphabet in upper and lower case, the ten Arabic numerals, and some punctuation marks and other special characters. The character set largely corresponds to that of a keyboard or typewriter for the English language. In computers and other electronic devices that display text, the text is usually stored in ASCII or in a backward-compatible encoding (ISO 8859, Unicode).
The non-printable control characters include output control characters such as line feed or tab, protocol characters such as End of Transmission or Acknowledge, and separators such as the Record Separator.
Coding
ASCII | Dec | Hex | Binary |
---|---|---|---|
A | 65 | 41 | (0)100 0001 |
B | 66 | 42 | (0)100 0010 |
C | 67 | 43 | (0)100 0011 |
... | ... | ... | ... |
Z | 90 | 5A | (0)101 1010 |
Each character is assigned a bit pattern of 7 bits. Since each bit can take two values, there are 2^7 = 128 different bit patterns, which can also be read as the integers 0 to 127 (hexadecimal 00h to 7Fh).
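The mapping between a character and its number can be illustrated with a minimal Python sketch. The helper name `describe` is chosen here purely for illustration; it prints the decimal, hexadecimal and 7-bit binary form of a character.

```python
def describe(ch: str) -> str:
    """Return 'char dec hex binary' for a single ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError(f"{ch!r} is not an ASCII character")
    # '07b' pads the binary form to the seven bits that ASCII uses.
    return f"{ch} {code} {code:02X} {code:07b}"


for ch in "ABCZ":
    print(describe(ch))

# Seven bits allow exactly 2**7 = 128 distinct patterns, numbered 0..127.
assert 2 ** 7 == 128
```

The rows printed for A, B, C and Z should agree with the table above, apart from the leading (0) that stands for the unused eighth bit.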
The eighth bit, which ASCII does not use, can serve for error detection on communication lines (parity bit) or for other control tasks. Today it is almost always used to extend ASCII to an 8-bit code. These extensions are largely compatible with the original ASCII, so that all characters defined in ASCII are encoded with the same bit pattern in the various extensions. The simplest extensions are encodings with language-specific characters that are not part of the basic Latin alphabet (see below).
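How the eighth bit can carry a parity bit may be sketched as follows. This is only an illustration assuming even parity; the function names `with_even_parity` and `parity_ok` are hypothetical, and real transmission equipment may use odd parity or other conventions.

```python
def with_even_parity(code: int) -> int:
    """Store an even-parity bit in bit 7 of a 7-bit ASCII code."""
    if not 0 <= code <= 0x7F:
        raise ValueError("expected a 7-bit code")
    ones = bin(code).count("1")
    parity = ones % 2  # 1 if the number of one-bits is odd
    return code | (parity << 7)


def parity_ok(byte: int) -> bool:
    """True if the 8-bit value contains an even number of one-bits."""
    return bin(byte).count("1") % 2 == 0


sent = with_even_parity(ord("C"))   # 'C' has three one-bits, so bit 7 is set
print(f"{sent:08b}", parity_ok(sent))

corrupted = sent ^ 0b0000_0001      # flip one bit "on the line"
print(f"{corrupted:08b}", parity_ok(corrupted))
```

A single flipped bit changes the parity and is therefore noticed, but it cannot be located or repaired; two flipped bits cancel out and go undetected.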
Composition
Code | …0 | …1 | …2 | …3 | …4 | …5 | …6 | …7 | …8 | …9 | …A | …B | …C | …D | …E | …F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0… | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI |
1… | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US |
2… | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
3… | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
4… | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
5… | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
6… | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
7… | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | DEL |
The first 32 ASCII character codes (00 hex to 1F hex) are reserved for control characters; see that article for an explanation of the abbreviations in the table above. These codes do not represent written symbols but serve (or served) to control devices that use ASCII, such as printers. Examples of control characters are the carriage return for the line break and Bell (the bell); their definitions are historical in origin.
Code 20 hex (SP) is the space (also called blank), which is used in text to separate words and is produced on the keyboard by the space bar.
The codes 21 hex to 7E hex stand for printable characters, which comprise letters, digits and punctuation marks. The letters are only the upper- and lower-case letters of the Latin alphabet. Letter variants used in non-English languages, for example the German umlauts, are not included in the ASCII character set. Typographically correct dashes and quotation marks are also missing; the typography is limited to what a typewriter offers. The purpose was information exchange, not printing.
Code 7F hex (all seven bits set to one) is a special character known as the delete character (DEL). In the past, this code was used like a control character to delete an already punched character on punched tape or punched cards by setting all the bits, that is, by punching out all seven positions. This was the only way to erase, because holes, once punched, cannot be filled in again. Areas without holes (that is, with the code 00 hex, NUL) were found mainly at the beginning and end of a punched tape.
For this reason ASCII originally had only 126 actual characters, because the bit patterns 0 (0000000) and 127 (1111111) did not correspond to any character. Code 0 was later interpreted in the C programming language as the end of a character string; various graphic symbols have been assigned to code 127.
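The ranges described in this section can be summarized in a short Python sketch. The function name `classify` and the range labels are chosen here for illustration only.

```python
def classify(code: int) -> str:
    """Name the ASCII range a 7-bit code falls into."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 0x1F:
        return "control"    # 00 hex to 1F hex
    if code == 0x20:
        return "space"      # SP
    if code <= 0x7E:
        return "printable"  # 21 hex to 7E hex
    return "delete"         # 7F hex, all seven bits set


counts = {}
for code in range(128):
    kind = classify(code)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)  # {'control': 32, 'space': 1, 'printable': 94, 'delete': 1}
```

Counting the space among the printable characters and DEL among the control characters gives the 95 printable and 33 non-printable characters.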
History
Teleprinters
An early form of character encoding was Morse code. With the introduction of teleprinters it was displaced from the telegraph networks and replaced by the Baudot code and the Murray code. From the 5-bit Murray code it was only a small step to the 7-bit ASCII, which was also first used for certain American teleprinter models such as the Teletype ASR-33.
Dec | Hex | ASCII 1963 | ASCII 1965 | ASCII today |
---|---|---|---|---|
0-63 | 00-3F | see Composition | see Composition | see Composition |
64 | 40 | @ | ` | @ |
65-91 | 41-5B | see Composition | see Composition | see Composition |
92 | 5C | \ | ~ | \ |
93 | 5D | see Composition | see Composition | see Composition |
94 | 5E | ↑ | ^ | ^ |
95 | 5F | ← | _ | _ |
96 | 60 | unassigned | @ | ` |
97-122 | 61-7A | unassigned | a-z | a-z |
123 | 7B | unassigned | { | { |
124 | 7C | unassigned | ¬ | \| |
125 | 7D | unassigned | } | } |
126 | 7E | ESC | \| | ~ |
127 | 7F | see Composition | see Composition | see Composition |
The first version, still without lowercase letters and with small deviations from today's ASCII for the control and special characters, was created in 1963.
The second version of the ASCII standard followed in 1965. Although the standard was approved, it was never published and therefore never applied. The reason was that the ASA learned that the ISO (International Organization for Standardization) was standardizing a character set that was similar to this standard but slightly incompatible with it.
In 1968 the version of the ASCII standard that is still valid today was established. This version gave rise to the Caesar cipher ROT47 as an extension of ROT13. While ROT13 only rotates the Latin alphabet by half its length, ROT47 rotates all ASCII characters between 33 (!) and 126 (~).
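A minimal Python sketch of ROT47 as described above: each of the 94 characters from code 33 to code 126 is rotated by 47 positions, and everything else is left unchanged. The function name `rot47` is chosen here for illustration.

```python
def rot47(text: str) -> str:
    """Rotate every character in the range 33..126 by 47 positions."""
    out = []
    for ch in text:
        code = ord(ch)
        if 33 <= code <= 126:
            # 94 characters in the range; 47 is exactly half of that.
            out.append(chr(33 + (code - 33 + 47) % 94))
        else:
            out.append(ch)  # space, control and non-ASCII characters
    return "".join(out)


message = "Hello, ASCII!"
scrambled = rot47(message)
print(scrambled)
print(rot47(scrambled))  # applying ROT47 twice restores the input
```

Because 47 is half of 94, the function is its own inverse, just as ROT13 is for the 26 letters.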
Computers
At the beginning of the computer age, ASCII developed into the standard code for characters. Many terminals (for example the VT100) and printers were controlled exclusively with ASCII.
On mainframes, however, Latin characters are encoded almost exclusively in EBCDIC, an 8-bit encoding that is incompatible with ASCII and that IBM developed in parallel to ASCII for its System/360; at that time it was a serious competitor. Working with the alphabet is more difficult in EBCDIC because the letters are spread over several separate code ranges. IBM itself used ASCII for internal documents. ASCII was promoted by President Lyndon B. Johnson's 1968 order to use it in government agencies.
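The difference in alphabet layout can be checked with Python's bundled codecs. The sketch below assumes that the `cp037` codec, one of several EBCDIC code pages, is representative; the helper names are hypothetical.

```python
import string


def code_points(letters: str, encoding: str) -> list[int]:
    """Return the byte value of each letter in a single-byte encoding."""
    return list(letters.encode(encoding))


def is_contiguous(codes: list[int]) -> bool:
    """True if every code is exactly one greater than its predecessor."""
    return all(b - a == 1 for a, b in zip(codes, codes[1:]))


upper = string.ascii_uppercase
print(is_contiguous(code_points(upper, "ascii")))  # True
print(is_contiguous(code_points(upper, "cp037")))  # False

# In cp037 there is a gap between I and J:
print([hex(c) for c in code_points("IJ", "cp037")])
```

In ASCII, a test such as `'A' <= ch <= 'Z'` is therefore sufficient to recognize an upper-case letter, whereas the same comparison on EBCDIC codes also accepts non-letters that lie in the gaps.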
Use for other languages
With the International Alphabet 5 (IA5), a 7-bit encoding based on ASCII was standardized as ISO 646 in 1963. The reference version (ISO 646-IRV) corresponds to ASCII except for one position. In order to be able to represent letters and special characters of different languages (for example the German umlauts), 12 character positions were designated for redefinition ( #$@[\]^`{|}~ ). The national characters and the ASCII characters they replace cannot be displayed at the same time. Failure to adapt software to the variant used for display often led to unintentionally funny results; for example, when the Apple II was switched on, "APPLE ÜÄ" appeared instead of "APPLE ][".
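The Apple II effect can be reproduced with a small Python sketch. The mapping below is an assumption not spelled out in the text: it follows the German national variant of ISO 646 (DIN 66003); the names `DIN_66003` and `show_as_german_variant` are hypothetical.

```python
# Assumed replacement table of the German ISO 646 variant (DIN 66003):
# eight of the redefinable positions carry German characters instead.
DIN_66003 = str.maketrans({
    "@": "§",
    "[": "Ä",
    "\\": "Ö",
    "]": "Ü",
    "{": "ä",
    "|": "ö",
    "}": "ü",
    "~": "ß",
})


def show_as_german_variant(ascii_text: str) -> str:
    """Render ASCII-coded text the way a German ISO 646 device would."""
    return ascii_text.translate(DIN_66003)


print(show_as_german_variant("APPLE ]["))    # APPLE ÜÄ
print(show_as_german_variant("a[i] = {x}"))  # aÄiÜ = äxü
```

The bytes stay the same in both cases; only the glyphs that the output device associates with the redefined positions differ.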
Because some of these characters are used in programming, in particular the various brackets, programming languages were retrofitted for internationalization with substitute combinations (digraphs). Only characters from the invariant part of ISO 646 were used for these. The combinations are language-specific. In Pascal, for example, (* and *) correspond to the curly brackets ( {} ), while C provides <% and %> for them.
Extensions
Use of the remaining 128 positions in the byte
To overcome the incompatibilities of the national 7-bit variants of ASCII, various manufacturers first developed their own ASCII-compatible 8-bit codes (that is, codes that match ASCII in the first 128 positions). Code page 437 was long the most widely used; it was used on the IBM PC under English-language MS-DOS and is still used in the DOS window of English-language Microsoft Windows. In German installations, the Western European code page 850 has been the standard since MS-DOS 3.3.
Later standards such as ISO 8859 also used eight bits. There are several variants, for example ISO 8859-1 for the Western European languages, which was adopted in Germany as DIN 66303. German-language versions of Windows (except in DOS windows) use the Windows-1252 encoding, which is based on ISO 8859-1. This is why the German umlauts, for example, are displayed incorrectly when text files created under DOS are viewed under Windows.
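The umlaut problem can be reproduced with Python's bundled codecs `cp850` and `cp1252`. This is a minimal sketch; the helper name `misread` and the sample word are chosen for illustration.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one code page and decode the bytes in another."""
    return text.encode(written_as).decode(read_as)


# A file written under DOS (code page 850) and opened under Windows-1252:
print(misread("Käse", "cp850", "cp1252"))   # K„se

# The ASCII part of the text is unaffected by the mismatch:
print(misread("ASCII", "cp850", "cp1252"))  # ASCII
```

Only the positions above 127 differ between the two code pages, so plain ASCII text survives while the umlaut turns into a different symbol.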
Beyond 8 bits
Many older programs that used the eighth bit for their own purposes could not handle these extensions. Over time, they have often been adapted to the new requirements.
Even 8-bit codes, in which one byte stands for one character, offered too little space to accommodate the characters of all the world's writing systems at the same time. This made several different specialized extensions necessary. In addition, there are some ASCII-compatible codes, especially for the East Asian region, which either switch between different code tables or require more than one byte for each non-ASCII character. However, none of these 8-bit extensions is "ASCII", because that term only describes the uniform 7-bit code.
To meet the requirements of the various languages, Unicode (identical in its character set to ISO 10646) was developed. It uses up to 32 bits per character and could thus distinguish over four billion different characters, but is restricted to around one million permitted code points. This means that all characters used by humans so far can be represented, provided they have been included in the Unicode standard. UTF-8 is an 8-bit encoding of Unicode that is backward compatible with ASCII. One character can take up one to four bytes. Seven-bit variants no longer have to be used, but Unicode can also be encoded in seven bits with the help of UTF-7. UTF-8 has become the standard for many operating systems. For example, Apple's macOS and some Linux distributions use UTF-8 by default, and more than 90% of websites are created in UTF-8.
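The variable length of UTF-8 and its compatibility with ASCII can be observed directly in Python. The following is a minimal sketch; the sample characters and the helper name `utf8_bytes` are chosen for illustration.

```python
def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")


for ch in ["A", "ä", "€", "😀"]:
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), {encoded.hex(' ')}")

# Pure ASCII text has the same bytes in ASCII and in UTF-8.
assert "ASCII".encode("utf-8") == "ASCII".encode("ascii")

# UTF-7 represents non-ASCII characters using 7-bit values only.
assert all(b < 0x80 for b in "ä".encode("utf-7"))
```

The four sample characters need one, two, three and four bytes respectively; every byte of a multi-byte sequence has its eighth bit set, so it cannot be confused with an ASCII character.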
Formatting marks compared to markup languages
ASCII contains only a few characters that are generally used for formatting or structuring text; these emerged from the control commands of the teleprinters. They include in particular the line feed, the carriage return, the horizontal tab, the form feed and the vertical tab. In typical ASCII text files, apart from the printable characters, there is usually only the carriage return or the line feed to mark the end of a line. DOS and Windows systems usually use both in succession, older Apple and Commodore computers (except the Amiga) use only the carriage return, and Unix-like and Amiga systems use only the line feed. The use of further characters for text formatting is handled inconsistently. Today, markup languages such as HTML are more commonly used to format text.
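Programs that read text from several of these systems often normalize the line endings first. A minimal Python sketch, with the function name `normalize_newlines` chosen for illustration:

```python
def normalize_newlines(text: str) -> str:
    """Convert CR LF (DOS/Windows) and bare CR (older Apple and
    Commodore computers) to LF (Unix-like systems)."""
    # Replace the two-character sequence first so that its CR is not
    # turned into a second line feed by the following step.
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "dos line\r\nold mac line\runix line\n"
print(repr(normalize_newlines(mixed)))
```

The order of the two replacements matters: handling the bare carriage return first would turn every CR LF pair into two line feeds and thus insert empty lines.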
Compatible character encodings
Most character encodings are designed so that they use the same codes as ASCII for the values 0 to 127 and use the range above 127 for further characters.
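Whether a given encoding behaves this way can be probed with Python's bundled codecs. The sketch below only checks the decoding direction for the byte values 0 to 127; the helper name `is_ascii_compatible` and the selection of codecs are assumptions made for illustration.

```python
ASCII_BYTES = bytes(range(128))


def is_ascii_compatible(encoding: str) -> bool:
    """True if the bytes 0..127 decode to the same text as in ASCII."""
    try:
        return ASCII_BYTES.decode(encoding) == ASCII_BYTES.decode("ascii")
    except UnicodeDecodeError:
        return False


for name in ["latin-1", "cp437", "cp850", "cp1252", "koi8_r",
             "mac_roman", "utf-8", "cp037", "utf-16"]:
    print(f"{name:10} {is_ascii_compatible(name)}")
```

Of the codecs listed, only the EBCDIC code page `cp037` and UTF-16 are expected to fail the check.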
Fixed length codings (selection)
Each character occupies a fixed number of bytes. In most encodings this is one byte per character (single-byte character set, SBCS for short). For the East Asian scripts, two or more bytes per character are used, which means that these encodings are no longer ASCII-compatible. The compatible SBCS character sets correspond to the ASCII extensions discussed above:
- ISO 8859 with 15 different character encodings covering all European languages, Turkish, Arabic, Hebrew and Thai
- MacRoman, MacCyrillic and other proprietary character sets for Apple Mac computers prior to Mac OS X
- DOS code pages (e.g. 437, 850) and Windows code pages (e.g. Windows-1252)
- KOI8-R for Russian and KOI8-U for Ukrainian
- ARMSCII-8 and ARMSCII-8a for Armenian
- GEOSTD for Georgian
- ISCII for all Indian languages
- TSCII for Tamil
Variable length codings
In order to be able to encode more characters, the characters 0 to 127 are encoded in one byte, while other characters are encoded by several bytes with values greater than 127:
- UTF-8 and GB 18030 for Unicode
- ISO 6937 for European languages with Latin script
- Big5 for Traditional Chinese (Republic of China (Taiwan), overseas Chinese)
- EUC (Extended UNIX Coding) for several East Asian languages
- GB (Guojia Biaozhun) for Simplified Chinese (PRC)
ASCII table
In addition to the hexadecimal codes, the following table also shows the decimal and octal codes .
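Rows of such a table can also be generated programmatically. The following Python sketch is only an illustration; the column layout and the function name `ascii_rows` are assumptions and do not reproduce any particular published table.

```python
def ascii_rows(start: int = 0x20, stop: int = 0x7E):
    """Yield 'dec | oct | hex | char' rows for the codes start..stop."""
    for code in range(start, stop + 1):
        yield f"{code:3d} | {code:03o} | {code:02X} | {chr(code)}"


print("Dec | Oct | Hex | Char")
for row in ascii_rows(0x41, 0x45):
    print(row)
```

Calling `ascii_rows()` without arguments covers the space and all printable characters; the control characters are left out by default because they have no visible glyph.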
Eponyms
The asteroid (3568) ASCII, discovered in 1936, was named after the character encoding in 1988.
Editions
- American Standards Association: American Standard Code for Information Interchange. ASA X3.4-1963. American Standards Association, New York 1963 (PDF, 11 pages; archived May 26, 2016 in the Internet Archive)
- American Standards Association: American Standard Code for Information Interchange. ASA X3.4-1965. American Standards Association, New York 1965 (approved but not published)
- United States of America Standards Institute: USA Standard Code for Information Interchange. USAS X3.4-1967. United States of America Standards Institute, 1967.
- United States of America Standards Institute: USA Standard Code for Information Interchange. USAS X3.4-1968. United States of America Standards Institute, 1968.
- American National Standards Institute: American National Standard for Information Systems. ANSI X3.4-1977. 1977.
- American National Standards Institute: American National Standard for Information Systems. Coded character sets. 7-bit American National Standard Code for Information Interchange (7-bit ASCII). ANSI X3.4-1986. 1986.
- Further revisions:
- ANSI X3.4-1986 (R1992)
- ANSI X3.4-1986 (R1997)
- ANSI INCITS 4-1986 (R2002)
- ANSI INCITS 4-1986 (R2007)
- ANSI INCITS 4-1986 (R2012)
Literature
- Jacques André: Caractères numériques: introduction. In: Cahiers GUTenberg. Volume 26, May 1997, ISSN 1257-2217, pp. 5-44 (French).
- Yannis Haralambous: Fonts & encodings. From Unicode to advanced typography and everything in between. Translated by P. Scott Horne. O'Reilly, Beijing et al. 2007, ISBN 978-0-596-10242-5 (English).
- Peter Karow: Digital Fonts. Presentation and formats. 2nd improved edition. Springer, Berlin et al. 1992, ISBN 3-540-54917-X.
- Mai-Linh Thi Truong, Jürgen Siebert, Erik Spiekermann (Eds.): FontBook. Digital Typeface Compendium (= FontBook 4). 4th revised and expanded edition. FSI FontShop International, Berlin 2006, ISBN 3-930023-04-0 (in English).
Web links
- RFC 20: ASCII format for Network Interchange. October 16, 1969 (ANSI X3.4-1968, English)
- ITU T.50 (09/1992) International Alphabet No. 5 (English)
- ISO/IEC 646:1991 (English)
- ASA X3.4-1963 (English)
- Notes on the control characters (English)
- ASCII table with explanations (German)
- Conversion from and to decimals, octals, hexadecimal and binary ASCII notation (English)