UTF-32

UTF-32 is a method for encoding Unicode characters in which each character is encoded with four bytes (32 bits). It can therefore be described as the simplest encoding, since all other UTF encodings use variable byte lengths. In the current Unicode Standard 5.1, UTF-32 is a subset of UCS-4.
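The fixed width described above can be illustrated with a minimal Python sketch. The helper name `encode_utf32_be` is hypothetical, and the sketch assumes big-endian byte order without a byte order mark; it simply writes each code point as a four-byte integer.

```python
def encode_utf32_be(text: str) -> bytes:
    """Encode text as UTF-32 (big-endian, no BOM): 4 bytes per code point."""
    # ord() yields the numeric code point; to_bytes() pads it to 32 bits.
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)


sample = "A\u20ac\U0001F600"  # 'A', euro sign, and an emoji outside the BMP
encoded = encode_utf32_be(sample)

# Every code point occupies exactly four bytes, whatever its value.
print(len(sample), len(encoded))
# The result matches Python's built-in codec of the same name.
print(encoded == sample.encode("utf-32-be"))
```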

Advantages

UTF-32 shows its particular advantage in random access to a given code point, since the address of the nth code point can be determined with simple pointer arithmetic in $\mathcal{O}(1)$. The size of a document in bytes also immediately yields the number of code points it contains (simply divide by 4). However, this property is put into perspective by the fact that a user-perceived Unicode character (an extended grapheme cluster) often does not correspond to a single code point (e.g. for ligatures or Korean).
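The two calculations mentioned here, the offset of the nth code point and the code point count, can be sketched as follows. The function names are hypothetical, and the buffer is assumed to be big-endian UTF-32 without a byte order mark.

```python
def code_point_count(data: bytes) -> int:
    """Number of code points in a UTF-32 buffer: byte length divided by 4."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes long")
    return len(data) // 4


def code_point_at(data: bytes, n: int) -> int:
    """Return the n-th code point (0-based) in constant time."""
    offset = 4 * n  # the only arithmetic needed to locate the code point
    if n < 0 or offset + 4 > len(data):
        raise IndexError("code point index out of range")
    return int.from_bytes(data[offset:offset + 4], "big")


data = "a\u03b2\U0001F600".encode("utf-32-be")
print(code_point_count(data))
print(hex(code_point_at(data, 2)))
```

No scanning of earlier bytes is required, which is the difference from UTF-8 or UTF-16, where the position of the nth code point depends on the widths of all preceding ones.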

Disadvantages

The main disadvantage of UTF-32 is its high memory requirement. Texts that consist mainly of Latin letters take up about four times as much storage space as in the widespread UTF-8 or in the ISO 8859 character sets. It is therefore hardly used for external storage. Another disadvantage is the lack of backward compatibility with ASCII, which UTF-8, for example, provides.
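Both points can be checked with a short Python sketch. The sample text is an arbitrary assumption, and the UTF-32 variant used is big-endian without a byte order mark.

```python
text = "Hello, world"  # arbitrary sample consisting only of ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32-be")

# Storage: one byte per character in UTF-8, four in UTF-32.
size_ratio = len(utf32_bytes) / len(utf8_bytes)
print(len(utf8_bytes), len(utf32_bytes), size_ratio)

# ASCII compatibility: UTF-8 reproduces the ASCII bytes exactly,
# while UTF-32 pads every character with three zero bytes.
print(utf8_bytes == ascii_bytes)
print(utf32_bytes == ascii_bytes)
```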

Strictly speaking, no UTF encoding encodes characters; they all encode so-called Unicode code points. Unicode contains composite characters that require more than one code point (e.g. characters with unusual or multiple accents, such as those found in Vietnamese). If such characters are to be processed correctly, random access to individual characters is not possible even in a UTF-32-encoded string.
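The following Python sketch illustrates the problem with a Vietnamese word in decomposed form. The helper `combining_sequences` is hypothetical and deliberately simplified: it only attaches combining marks to the preceding base character and is not a full implementation of extended grapheme cluster segmentation.

```python
import unicodedata


def combining_sequences(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it.

    Simplified assumption: a code point with a non-zero canonical
    combining class belongs to the preceding character.
    """
    groups: list[str] = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch  # attach the mark to the previous base
        else:
            groups.append(ch)
    return groups


# "Việt" in canonical decomposition: the accented letter becomes
# 'e' followed by two combining marks.
word = unicodedata.normalize("NFD", "Vi\u1ec7t")
utf32 = word.encode("utf-32-be")

print(len(utf32) // 4)                 # code points in the UTF-32 data
print(len(combining_sequences(word)))  # characters a reader perceives

# Indexing the fourth code point lands on a combining mark, not a letter.
fourth = chr(int.from_bytes(utf32[12:16], "big"))
print(unicodedata.name(fourth))
```

Finding the nth perceived character therefore requires scanning from the start of the string, regardless of the fixed code point width.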