Punycode
Punycode is an encoding method standardized in RFC 3492 for converting Unicode strings into ASCII-compatible strings, which consist of the characters a to z, 0 to 9 and the hyphen (-). Punycode was designed to represent internationalized domain names consisting of Unicode characters uniquely and reversibly using ASCII characters.
Reason for introduction
The main reason for introducing Punycode was that the established Domain Name System allows only names consisting of the 26 Latin letters, the digits 0 to 9 and the hyphen-minus ("keyboard hyphen"). This was sufficient for English, but most other languages contain additional characters; German, for example, has the umlauts ä, ö and ü as well as ß. To make it possible to process text from such languages, the procedure Internationalizing Domain Names in Applications (IDNA) was introduced in 2003; it uses Punycode as its encoding.
If a text is to be passed to a system that can only handle ASCII, it is first converted into ASCII using Punycode. In many cases the resulting text is longer than the original. Conversely, when the text is read back from the ASCII system, it is converted to its original form using Punycode. Under IDNA, a text that contains no special characters is left unchanged.
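The round trip described above can be tried out with a minimal sketch, assuming Python 3 and its standard-library `punycode` codec. The helper name `roundtrip` and the sample words are illustrative choices, not part of any standard.

```python
def roundtrip(text):
    """Encode text to Punycode and decode it again."""
    encoded = text.encode("punycode")      # ASCII-only bytes
    decoded = encoded.decode("punycode")   # back to the original string
    return encoded, decoded

for word in ["schön", "bücher", "ж"]:
    encoded, decoded = roundtrip(word)
    # Show the lengths to make the growth of the encoded form visible.
    print(word, len(word), "->", encoded.decode("ascii"), len(encoded))
    assert decoded == word
```

Note that this uses raw Punycode only; the `xn--` prefix and the rule that pure ASCII names stay unchanged belong to IDNA.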
The Punycode conversion procedure was designed with the following aspects in mind:
- Completeness: any name can be converted.
- Uniqueness: exactly one encoded form is assigned to each name.
- Reversibility: any converted name can be converted back.
- Efficiency: the converted name is not much longer than the source name.
- Simplicity: the procedure is relatively easy to implement.
- Readability: names made up mostly of Latin letters often remain legible, because the characters a to z are not changed.
Rules of Conversion
String | Punycode | IDNA | Note
---|---|---|---
abcdef | abcdef- | abcdef | *
abæcdöef | abcdef-qua4k | xn--abcdef-qua4k |
schön | schn-7qa | xn--schn-7qa |
ยจฆฟคฏข | 22cdfh1b8fsa | xn--22cdfh1b8fsa |
☺ | 74h | xn--74h |
74h | 74h- | 74h | *
xn-- | xn--- | not defined |

\* Punycode is not used in the IDNA form.
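The Punycode column of the table can be reproduced with a short sketch, assuming Python 3 and its standard-library `punycode` codec. The function name `punycode` and the sample list are illustrative.

```python
def punycode(text):
    """Return the raw Punycode form of text (no IDNA prefix, no normalization)."""
    return text.encode("punycode").decode("ascii")

# Strings from the table; "schön" is a mixed case, "☺" has no base characters.
samples = ["abcdef", "abæcdöef", "schön", "☺", "74h", "xn--"]

for s in samples:
    print(f"{s} | {punycode(s)}")
```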
In the following, the letters a to z and the digits 0 to 9 are the base characters. Together with the hyphen-minus (-) as a separator, these 37 characters are the only valid characters in a text encoded according to Punycode.
If the string to be converted contains
- only base characters, a hyphen-minus is appended;
- both base and non-base characters, all base characters are copied in their original order, and the encoded non-base characters are then appended, separated by a hyphen-minus;
- only non-base characters, the result consists solely of their code sequence, without a separator.
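The three cases can be made visible by splitting an encoded string at its last hyphen-minus. This is a minimal sketch, assuming Python 3 and its standard-library `punycode` codec; `split_encoded` is a hypothetical helper, and it treats every ASCII character as a base character.

```python
def split_encoded(text):
    """Return (copied base characters, encoded non-base part) for text."""
    encoded = text.encode("punycode").decode("ascii")
    has_base = any(ord(c) < 128 for c in text)
    if not has_base:
        # Only non-base characters: no separator is written at all.
        return "", encoded
    # The last hyphen-minus is the separator; earlier ones belong to the text.
    base, _, code = encoded.rpartition("-")
    return base, code

for s in ["abcdef", "schön", "☺"]:
    print(s, split_encoded(s))
```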
To keep the resulting string as compact as possible, the special characters are not encoded "one-to-one" but according to the Punycode method. The non-base characters are processed in ascending order of their numerical value (Unicode code point, e.g. "ä" → 228, "ж" → 1078). For each character, the difference from the previously handled value is combined with the character's position in the original string into a single number. This number is then represented with the 36 base characters and appended to the text. The details of this procedure are specified in RFC 3492, which also contains a reference implementation in C for encoding and decoding as well as numerous examples.
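The encoding step can be sketched in Python as follows. This is an illustrative transcription of the encoding procedure and parameter values given in RFC 3492 (base 36, tmin 1, tmax 26, skew 38, damp 700, initial bias 72, initial n 128), not the reference implementation itself. The names `adapt`, `encode` and `DIGITS` are choices made here, and overflow checks are omitted because Python integers do not overflow.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"  # digit values 0..35

def adapt(delta, numpoints, first):
    """Bias adaptation: keeps the variable-length numbers short."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)

def encode(text):
    out = [c for c in text if ord(c) < 128]   # copy base characters in order
    basic = handled = len(out)
    if basic:
        out.append("-")                       # separator after base characters
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    while handled < len(text):
        # Next-smallest code point that has not been handled yet.
        m = min(ord(c) for c in text if ord(c) >= n)
        delta += (m - n) * (handled + 1)
        n = m
        for c in text:
            cp = ord(c)
            if cp < n:
                delta += 1                    # skip an already placed position
            elif cp == n:
                q, k = delta, BASE            # write delta as base-36 digits
                while True:
                    t = min(max(k - bias, TMIN), TMAX)
                    if q < t:
                        break
                    out.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                out.append(DIGITS[q])
                bias = adapt(delta, handled + 1, handled == basic)
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(out)

print(encode("schön"))   # schn-7qa
print(encode("☺"))       # 74h
```

A quick sanity check is to compare the result with Python's built-in codec, for example `encode(s) == s.encode("punycode").decode("ascii")`.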
When domain names are created according to the standard Internationalizing Domain Names in Applications (IDNA), the prefix “xn--” is placed in front of the Punycode result if non-base characters are present; otherwise (only base characters), Punycode is not used.
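The prefix rule can be illustrated with a minimal sketch, assuming Python 3.7 or later and its standard-library `punycode` codec. The helpers `to_ace_label` and `to_ace_domain` are hypothetical names; the sketch applies only the prefix rule and deliberately leaves out the normalization and validity checks that IDNA also requires.

```python
ACE_PREFIX = "xn--"

def to_ace_label(label):
    """Convert one label of a domain name to its ASCII form."""
    if label.isascii():
        return label  # only base characters: Punycode is not used
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def to_ace_domain(name):
    """Apply the rule to each dot-separated label."""
    return ".".join(to_ace_label(part) for part in name.split("."))

print(to_ace_domain("schön.example"))  # xn--schn-7qa.example
print(to_ace_domain("abcdef.example")) # abcdef.example
```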
Note also that when an IDNA domain name is created, the name is normalized according to certain rules before it is encoded with Punycode (e.g. it is converted to lowercase letters, and certain Unicode characters are mapped to others that are considered "equivalent"). This normalization is not part of Punycode and is usually not uniquely reversible.
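The effect of normalization can be observed with Python's standard-library `idna` codec. As an assumption to keep in mind, that codec implements the 2003 version of IDNA (RFC 3490 with Nameprep), so other IDNA implementations may map some characters differently. The helper name `idna_roundtrip` and the sample names are illustrative.

```python
def idna_roundtrip(name):
    """Encode a domain name with the 'idna' codec and decode it again."""
    ascii_form = name.encode("idna")
    return ascii_form.decode("ascii"), ascii_form.decode("idna")

for name in ["schön.example", "Schön.example", "SCHÖN.example", "straße.example"]:
    ascii_form, back = idna_roundtrip(name)
    # Different inputs can share one ASCII form, so the original
    # spelling cannot be recovered from it.
    print(name, "->", ascii_form, "->", back)
```

In this sketch the first three names all yield the same ASCII form, and decoding returns the lowercase spelling only.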
Web links
- A. Costello: RFC 3492, Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA). March 2003 (English; see also the errata for RFC 3492).
- Online punycode converter . Retrieved March 13, 2017.
- Online punycode and bootstring converter . Retrieved April 28, 2019.