Byte Order Mark

from Wikipedia, the free encyclopedia

A byte order mark (BOM) is a characteristic byte sequence at the start of a data stream that encodes the Unicode character U+FEFF (zero width no-break space). This byte sequence serves as a signature that identifies the byte order and encoding form of UCS/Unicode character strings, especially in text files.

In UTF-16 and UTF-32

In the UTF-16 and UTF-32 encodings, the byte order has to be specified, because each character is encoded in units of 16 or 32 bits and therefore occupies several bytes (UTF-16: at least 2 bytes, UTF-32: 4 bytes). The byte order mark indicates the order in which the bytes are to be evaluated. This marking is particularly important when data is exchanged between different systems.
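As a minimal sketch of why the order matters, the following Python snippet encodes one character with both byte orders and then deliberately misreads it. The variable names are illustrative only; the codec names are those of the Python standard library.

```python
# Encode the same character, U+0041 ("A"), with both byte orders.
text = "A"

big_endian = text.encode("utf-16-be")     # most significant byte first
little_endian = text.encode("utf-16-le")  # least significant byte first

print(big_endian.hex(" "))     # 00 41
print(little_endian.hex(" "))  # 41 00

# Reading big-endian bytes as little-endian yields a different character.
misread = big_endian.decode("utf-16-le")
print(f"U+{ord(misread):04X}")  # U+4100
```

Without a marker or prior agreement, a receiver cannot tell which of the two interpretations the sender intended.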

In UTF-16, the BOM consists

  • in big-endian notation of the byte sequence FE FF
  • in little-endian notation of FF FE.

In UTF-32, the BOM consists

  • in big-endian notation of the byte sequence 00 00 FE FF
  • in little-endian notation of FF FE 00 00.

Since the code point U+FFFE is always defined as invalid, the order of the first bytes unambiguously determines the evaluation order for all subsequent bytes.
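The detection described above can be sketched in Python as follows. The function name detect_bom and the table _BOMS are hypothetical names chosen for this illustration; the BOM constants come from the standard codecs module.

```python
import codecs

# Longer signatures are listed first: the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


# Usage: decode the payload that follows the BOM.
data = b"\xff\xfeH\x00i\x00"
encoding, skip = detect_bom(data)
if encoding is not None:
    print(encoding, data[skip:].decode(encoding))  # utf-16-le Hi
```

One caveat of this sketch: UTF-16 little-endian text whose first character after the BOM is U+0000 starts with the same four bytes as the UTF-32 little-endian BOM and would be classified as UTF-32 here.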

In UTF-8

The UTF-8 encoding of the BOM is the byte sequence EF BB BF, which usually appears as the ISO-8859-1 characters ï»¿ in text editors and browsers that do not support UTF-8. The byte order problem does not arise with UTF-8, but a BOM at the beginning of the string or file is permitted in order to mark the encoding as UTF-8.

A BOM does not guarantee a reliable distinction between UTF-8 and the ISO 8859 character sets, since all byte sequences are permitted in the 8-bit character sets, including the UTF-8 encoding of the BOM. If, however, the choice is specifically between UTF-8 and ISO 8859-1, it is common to assume pragmatically that the character string ï»¿ is not intended and that the data is therefore UTF-8 encoded.
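The pragmatic assumption can be written down as a small heuristic. This is only a sketch under the stated premise that the input is either UTF-8 with a BOM or ISO 8859-1; the function name decode_utf8_or_latin1 is hypothetical.

```python
import codecs


def decode_utf8_or_latin1(data: bytes) -> str:
    """Decode as UTF-8 if data starts with EF BB BF, else as ISO 8859-1."""
    if data.startswith(codecs.BOM_UTF8):
        # Assume the three bytes are a signature, not the text "ï»¿".
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode("iso-8859-1")


# The BOM bytes are also valid ISO 8859-1 and decode to three characters
# there, which is why the distinction is a heuristic and not a guarantee.
as_latin1 = codecs.BOM_UTF8.decode("iso-8859-1")
print(len(as_latin1))  # 3
```

A file that really is ISO 8859-1 and really begins with those three characters would be misclassified, which is the residual risk the paragraph above describes.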

In Java, the byte order mark is not automatically recognized when reading UTF-8 text. It is up to the application software to remove the resulting U+FEFF character if necessary.
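The same situation can be illustrated in Python, which is used here purely as an example language and is not the Java API discussed above. Python's plain "utf-8" codec likewise keeps the BOM as a leading U+FEFF character, while its "utf-8-sig" codec removes it.

```python
data = b"\xef\xbb\xbfhello"  # UTF-8 BOM followed by ASCII text

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
kept = data.decode("utf-8")
print(len(kept))  # 6

# The "utf-8-sig" codec removes a leading BOM if one is present.
stripped = data.decode("utf-8-sig")
print(len(stripped))  # 5

# Manual removal, as an application would otherwise have to do it.
manual = kept[1:] if kept.startswith("\ufeff") else kept
print(manual == stripped)  # True
```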

Miscellaneous

If a byte order mark is used, problems can arise with programs that do not expect or do not recognize a BOM:

  • In Unix-like environments, script files often rely on the shebang mechanism, which requires the character sequence "#!" to appear at the very beginning of the file. If a BOM precedes it, the shebang line is not recognized.
  • Compilers such as gcc (before version 4.4) report stray characters at the beginning of the file when a BOM is present.
  • In PHP with default settings, the BOM is sent to the browser as output, so that HTTP headers can no longer be modified unless output buffering is used.
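A common workaround for such tools is to remove the BOM before the file is processed. The following Python sketch shows the idea for the shebang case; the helper name strip_utf8_bom and the sample script content are assumptions made for this illustration.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A script saved with a BOM no longer starts with "#!".
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"
print(script.startswith(b"#!"))                  # False
print(strip_utf8_bom(script).startswith(b"#!"))  # True
```

The function leaves data without a BOM unchanged, so it can be applied unconditionally.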

Table overview

Encoding     Hexadecimal                          Decimal                                 Windows-1252 representation
UTF-8        EF BB BF                             239 187 191                             ï»¿
UTF-16 (BE)  FE FF                                254 255                                 þÿ
UTF-16 (LE)  FF FE                                255 254                                 ÿþ
UTF-32 (BE)  00 00 FE FF                          0 0 254 255                             ␀␀þÿ
UTF-32 (LE)  FF FE 00 00                          255 254 0 0                             ÿþ␀␀
UTF-7        2B 2F 76 then one of 38, 39, 2B, 2F  43 47 118 then one of 56, 57, 43, 47    +/v then one of 8, 9, +, /
UTF-1        F7 64 4C                             247 100 76                              ÷dL
UTF-EBCDIC   DD 73 66 73                          221 115 102 115                         Ýsfs
SCSU         0E FE FF                             14 254 255                              ␎þÿ
BOCU-1       FB EE 28, optionally followed by FF  251 238 40, optionally followed by 255  ûî(, optionally followed by ÿ
GB 18030     84 31 95 33                          132 49 149 51                           „1•3

For SCSU, other possible byte sequences are not recommended.

References

  1. http://bugs.sun.com/view_bug.do?bug_id=4508058
  2. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415
  3. http://bugs.php.net/bug.php?id=22108#1067598726
  4. STD 63: UTF-8, a transformation format of ISO 10646, section "Byte Order Mark (BOM)"
  5. Only the most significant 6 bits of the fourth byte belong to the BOM. The two lowest bits are determined by the following character.
  6. UTS #6: Signature Byte Sequence for SCSU
  7. UTN #6: Signature Byte Sequence