Byte Order Mark

A byte order mark (BOM) is a characteristic byte sequence at the start of a data stream that encodes the Unicode character U+FEFF (zero width no-break space). This byte sequence serves as a signature that identifies the byte order and encoding form of UCS/Unicode character strings, especially text files.

In UTF-16 and UTF-32

In the UTF-16 and UTF-32 encodings, a byte order must be specified, because each character is encoded in code units of 16 or 32 bits and therefore needs several bytes (UTF-16: 2 bytes, UTF-32: 4 bytes per code unit). The byte order mark indicates the order in which these bytes are to be evaluated. This marking is particularly important when data is exchanged between different systems.

In UTF-16, the BOM consists of

the two-byte sequence FE FF in big-endian notation,
the reversed sequence FF FE in little-endian notation.

In UTF-32, the BOM consists of

the sequence 00 00 FE FF in big-endian notation,
the sequence FF FE 00 00 in little-endian notation.

Because the code point U+FFFE is permanently defined as invalid, the order of the first bytes unambiguously determines the evaluation order for all subsequent bytes.
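The rule above can be sketched in Python. This is a minimal illustration and not a complete decoder: the names `utf16_byte_order` and `decode_utf16` are hypothetical, and the sketch assumes the input is already known to be UTF-16 (the bytes FF FE 00 00 could otherwise also be the UTF-32 little-endian BOM).

```python
BOM_BE = b"\xfe\xff"  # U+FEFF stored big-endian
BOM_LE = b"\xff\xfe"  # U+FEFF stored little-endian


def utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on the first two bytes."""
    head = data[:2]
    if head == BOM_BE:
        return "big"
    if head == BOM_LE:
        return "little"
    return "unknown"


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 data that starts with a BOM, dropping the BOM itself."""
    order = utf16_byte_order(data)
    if order == "unknown":
        raise ValueError("no UTF-16 byte order mark found")
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    return data[2:].decode(codec)


# Reading the little-endian BOM in the wrong order gives the invalid
# value U+FFFE, which is why the two byte orders cannot be confused.
misread = int.from_bytes(BOM_LE, "big")
```

Here `misread` equals `0xFFFE`, the value a reader would see if it evaluated a little-endian BOM as big-endian.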

In UTF-8

The UTF-8 encoding of the BOM is the byte sequence EF BB BF, which usually appears as the ISO-8859-1 characters ï»¿ in text editors and browsers that do not support UTF-8. UTF-8 has no byte order problem, but a BOM at the start of a string or file is permitted in order to mark the encoding as UTF-8.

A BOM does not guarantee a reliable distinction between UTF-8 and the ISO 8859 character sets, because the 8-bit character sets permit every byte sequence, including the UTF-8 encoding of the BOM. However, if the only alternatives are UTF-8 and ISO 8859-1, it is common to assume pragmatically that the character string ï»¿ is not intended and that the text is therefore UTF-8.
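The pragmatic assumption described above might look like this in Python. It is only a sketch under the stated assumption that the input is either UTF-8 or ISO 8859-1. The function name `decode_guess` is hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def decode_guess(data: bytes) -> str:
    """Decode data assumed to be either UTF-8 or ISO 8859-1.

    A leading EF BB BF is taken as a UTF-8 signature. If the rest is
    not valid UTF-8 after all, or if there is no signature, the data
    is decoded as ISO 8859-1, which accepts every byte sequence.
    """
    if data.startswith(UTF8_BOM):
        try:
            return data[len(UTF8_BOM):].decode("utf-8")
        except UnicodeDecodeError:
            pass  # the signature was misleading, fall through
    return data.decode("iso-8859-1")


# What a reader without UTF-8 support would display for the BOM bytes:
bom_as_latin1 = UTF8_BOM.decode("iso-8859-1")
```

The last line shows why the heuristic is not strictly reliable: the three bytes are also a valid ISO 8859-1 string, namely the characters ï»¿.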

In Java, the byte order mark is not automatically recognized when reading UTF-8 text. It is up to the application software to remove the resulting U+FEFF character if necessary.
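The clean-up step can be illustrated in Python, whose plain `utf-8` codec likewise leaves the BOM in the decoded text as U+FEFF. This is a sketch of the idea and not Java code. The helper name `strip_bom` is hypothetical.

```python
def strip_bom(text: str) -> str:
    """Remove a single leading U+FEFF left over from decoding, if present."""
    return text[1:] if text.startswith("\ufeff") else text


raw = b"\xef\xbb\xbfhello"

# The plain codec keeps the BOM as the first character of the string.
decoded = raw.decode("utf-8")

# The application removes it explicitly ...
cleaned = strip_bom(decoded)

# ... or uses Python's "utf-8-sig" codec, which skips a leading BOM.
cleaned_by_codec = raw.decode("utf-8-sig")
```

Only the first character is removed, because U+FEFF elsewhere in the text is an ordinary zero width no-break space and not a signature.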

Miscellaneous

If a byte order mark is used, problems can arise with programs that do not expect or do not recognize a BOM:

In Unix-like environments, script files often rely on the shebang mechanism, which requires the character sequence “#!” to appear at the very start of the file. If a BOM precedes it, the shebang is not recognized.

Compilers such as gcc (before version 4.4) report stray characters at the start of the file when a BOM is present.

In PHP with default settings, the BOM is sent to the browser as output, so HTTP headers can no longer be modified unless output buffering is used.
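A tool that prepares files for such programs could check for the problem before it occurs. The following Python sketch covers only the UTF-8 BOM and the shebang case. The names `shebang_hidden_by_bom` and `without_utf8_bom` are hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def shebang_hidden_by_bom(data: bytes) -> bool:
    """True if a '#!' line exists but is preceded by a UTF-8 BOM."""
    return data.startswith(UTF8_BOM + b"#!")


def without_utf8_bom(data: bytes) -> bytes:
    """Return the data with a leading UTF-8 BOM removed, if there is one."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


script = UTF8_BOM + b"#!/bin/sh\necho hi\n"
fixed = without_utf8_bom(script)
```

After the fix, `fixed` begins with `#!`, so the first two bytes of the file are again the ones the shebang mechanism expects.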

Table overview

Encoding	Hexadecimal representation	Decimal representation	Representation in Windows-1252
UTF-8	`EF BB BF`	`239 187 191`	`ï»¿`
UTF-16 (BE)	`FE FF`	`254 255`	`þÿ`
UTF-16 (LE)	`FF FE`	`255 254`	`ÿþ`
UTF-32 (BE)	`00 00 FE FF`	`0 0 254 255`	`␀␀þÿ`
UTF-32 (LE)	`FF FE 00 00`	`255 254 0 0`	`ÿþ␀␀`
UTF-7	`2B 2F 76` and one byte from: `[ 38 | 39 | 2B | 2F ]`	`43 47 118` and one byte from: `[ 56 | 57 | 43 | 47 ]`	`+/v` and one character from: `[ 8 | 9 | + | / ]`
UTF-1	`F7 64 4C`	`247 100 76`	`÷dL`
UTF-EBCDIC	`DD 73 66 73`	`221 115 102 115`	`Ýsfs`
SCSU	`0E FE FF` (other possible byte sequences are not recommended)	`14 254 255`	`␎þÿ`
BOCU-1	`FB EE 28` optionally followed by `FF`	`251 238 40` optionally followed by `255`	`ûî(` optionally followed by `ÿ`
GB 18030	`84 31 95 33`	`132 49 149 51`	`„1•3`
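The table can be turned into a simple signature lookup. The Python sketch below is an illustration under assumptions: the name `detect_bom` is hypothetical, only the byte sequences listed in the table are used, and a match is a hint rather than proof, since any of these sequences can also occur at the start of data in another encoding.

```python
from typing import Optional

# (signature bytes, encoding name), taken from the table above.
SIGNATURES = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16 (BE)"),
    (b"\xff\xfe", "UTF-16 (LE)"),
    (b"\x00\x00\xfe\xff", "UTF-32 (BE)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (LE)"),
    (b"+/v8", "UTF-7"),
    (b"+/v9", "UTF-7"),
    (b"+/v+", "UTF-7"),
    (b"+/v/", "UTF-7"),
    (b"\xf7\x64\x4c", "UTF-1"),
    (b"\xdd\x73\x66\x73", "UTF-EBCDIC"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfb\xee\x28", "BOCU-1"),
    (b"\x84\x31\x95\x33", "GB 18030"),
]

# Try the longest signatures first, so that FF FE 00 00 (UTF-32 LE)
# is checked before its two-byte prefix FF FE (UTF-16 LE).
SIGNATURES.sort(key=lambda item: len(item[0]), reverse=True)


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding name suggested by a leading BOM, or None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

One consequence of the longest-first order: UTF-16 (LE) text whose first character after the BOM is U+0000 would be reported as UTF-32 (LE). The sketch accepts this ambiguity, since the two readings cannot be told apart from the first four bytes alone.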

Web links

The Unicode Standard, chapter 2.6 Encoding Schemes (English, PDF, 1.10 MiB)
The Unicode Standard, chapter 2.13 Special Characters and Noncharacters, section Byte Order Mark (BOM) (English, PDF, 1.10 MiB)
The Unicode Standard, chapter 16.8 Specials, section Byte Order Mark (BOM): U+FEFF (English, PDF, 415 KiB)
Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM (English)

References

http://bugs.sun.com/view_bug.do?bug_id=4508058
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415
http://bugs.php.net/bug.php?id=22108#1067598726
STD 63: UTF-8, a transformation format of ISO 10646, section Byte Order Mark (BOM)
Only the most significant 6 bits of the fourth byte. The two lowest bits are determined by the following character.
UTS #6: Signature Byte Sequence for SCSU
UTN #6: Signature Byte Sequence
