Byte Order Mark
A byte order mark (BOM) is a characteristic byte sequence at the start of a data stream that encodes the Unicode character U+FEFF (zero width no-break space). This byte sequence serves as a signature that identifies the byte order and the encoding form of UCS/Unicode character strings, especially in text files.
In UTF-16 and UTF-32
The UTF-16 and UTF-32 encodings require the byte order to be specified, because each character is encoded in units of 16 or 32 bits, and every such unit occupies several bytes (UTF-16: 2 bytes, UTF-32: 4 bytes). The byte order mark indicates the order in which these bytes are to be evaluated. This marking is particularly important when data is exchanged between different systems.
In UTF-16, the BOM consists of
- the two-byte sequence FE FF in big-endian notation,
- the reversed sequence FF FE in little-endian notation.
In UTF-32, the BOM consists of
- the sequence 00 00 FE FF in big-endian notation,
- the sequence FF FE 00 00 in little-endian notation.
Since the code point U+FFFE is permanently defined as a noncharacter, the order of the first bytes can be used to determine unambiguously the evaluation order for all subsequent bytes.
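The detection described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete encoding detector: the function names `detect_byte_order` and `decode_with_bom` are hypothetical, while the `codecs.BOM_*` constants come from the Python standard library.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so it has to be tested before it.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_byte_order(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found.

    Hypothetical helper for illustration only.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_with_bom(data: bytes) -> str:
    """Decode UTF-16/UTF-32 data in the byte order given by its BOM."""
    encoding, skip = detect_byte_order(data)
    if encoding is None:
        raise ValueError("no UTF-16/UTF-32 byte order mark found")
    # Skip the BOM itself so that U+FEFF does not end up in the text.
    return data[skip:].decode(encoding)
```

For example, `decode_with_bom(b"\xff\xfeA\x00")` takes the little-endian path and returns `"A"`. Note that the sketch would take a UTF-16 LE text that begins with the character U+0000 for UTF-32 LE, because both start with FF FE 00 00.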
In UTF-8
The UTF-8 encoding of the BOM is the byte sequence EF BB BF, which usually appears as the ISO 8859-1 characters ï»¿ in text editors and browsers that do not support UTF-8. UTF-8 has no byte order problem, but a BOM at the beginning of a string or file is permitted in order to mark the encoding as UTF-8.
A BOM does not guarantee a reliable distinction between UTF-8 and the ISO 8859 character sets, since every byte sequence is permitted in the 8-bit character sets, including the UTF-8 encoding of the BOM. However, if the only alternatives are UTF-8 and ISO 8859-1, it is quite common to assume pragmatically that the character string ï»¿ is not intended, and consequently that the text is UTF-8 encoded.
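The pragmatic assumption can be written down as a short Python sketch. The function name `decode_utf8_or_latin1` is hypothetical, and the sketch deliberately assumes that the only two candidates are UTF-8 with a BOM and ISO 8859-1; UTF-8 text without a BOM would be misread here.

```python
import codecs


def decode_utf8_or_latin1(data: bytes) -> str:
    """Decode as UTF-8 if the data starts with EF BB BF, else as ISO 8859-1.

    Hypothetical helper: it treats a leading BOM as proof of UTF-8, which
    is a convention and not a guarantee.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode("iso-8859-1")


# Read as ISO 8859-1, the three BOM bytes turn into three visible characters.
bom_as_latin1 = codecs.BOM_UTF8.decode("iso-8859-1")
```

Here `bom_as_latin1` holds the three characters ï»¿ that the text above describes.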
In Java, the byte order mark is not automatically recognized when reading UTF-8 text. It is up to the application software to remove the resulting U+FEFF character if necessary.
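The same situation can be reproduced in Python, which is used here only for illustration: the plain `utf-8` codec leaves the decoded U+FEFF in the string, so the application has to drop it. The helper name `strip_leading_bom` is hypothetical.

```python
BOM_CHAR = "\ufeff"  # the byte order mark after decoding


def strip_leading_bom(text: str) -> str:
    """Remove a single U+FEFF from the start of an already decoded string."""
    if text.startswith(BOM_CHAR):
        return text[len(BOM_CHAR):]
    return text


raw = b"\xef\xbb\xbfhello"
decoded = raw.decode("utf-8")         # plain UTF-8 decoding keeps the BOM
cleaned = strip_leading_bom(decoded)  # the application removes it
```

Python's standard `utf-8-sig` codec performs the same removal while decoding, so `raw.decode("utf-8-sig")` gives the cleaned text directly.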
Additional notes
If a byte order mark is used, problems can arise with programs that do not expect or do not recognize a BOM:
- In Unix-like environments, script files often rely on the shebang mechanism, which requires the character sequence “#!” to appear at the very beginning of the file. If an unexpected BOM is there instead, the mechanism fails.
- Compilers such as gcc (before version 4.4) report stray characters at the beginning of the file when a BOM is present.
- In PHP with default settings, the BOM is sent to the browser as output, so that HTTP headers can no longer be changed unless output buffering is used.
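The shebang problem in the first item is easy to reproduce on the byte level. The following Python sketch uses the hypothetical helpers `has_shebang` and `strip_utf8_bom`; it only inspects the bytes and does not model how an operating system actually launches a script.

```python
import codecs


def has_shebang(data: bytes) -> bool:
    """True if the file content starts with the two bytes '#!'."""
    return data.startswith(b"#!")


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM (EF BB BF), if any."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script

print(has_shebang(script))                           # file starts with "#!"
print(has_shebang(script_with_bom))                  # BOM hides the "#!"
print(has_shebang(strip_utf8_bom(script_with_bom)))  # "#!" is first again
```

Stripping the BOM before the file is saved or executed restores the expected first two bytes.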
Table overview
Encoding | Hexadecimal representation | Decimal representation | Representation in Windows-1252 |
---|---|---|---|
UTF-8 | EF BB BF | 239 187 191 | ï»¿ |
UTF-16 (BE) | FE FF | 254 255 | þÿ |
UTF-16 (LE) | FF FE | 255 254 | ÿþ |
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | ␀␀þÿ |
UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ␀␀ |
UTF-7 | 2B 2F 76 followed by one byte from: 38, 39, 2B, 2F | 43 47 118 followed by one byte from: 56, 57, 43, 47 | +/v followed by one character from: 8, 9, +, / |
UTF-1 | F7 64 4C | 247 100 76 | ÷dL |
UTF-EBCDIC | DD 73 66 73 | 221 115 102 115 | Ýsfs |
SCSU | 0E FE FF (other possible byte sequences are not recommended) | 14 254 255 | ␎þÿ |
BOCU-1 | FB EE 28 optionally followed by FF | 251 238 40 optionally followed by 255 | ûî( optionally followed by ÿ |
GB 18030 | 84 31 95 33 | 132 49 149 51 | „1•3 |
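The UTF-8, UTF-16 and UTF-32 rows of the table can be reproduced by encoding the character U+FEFF and formatting the resulting bytes. The following Python sketch uses a hypothetical helper named `bom_row`; it covers only these five encodings and shows NUL bytes with the symbol ␀, as the table does.

```python
def bom_row(encoding: str):
    """Return (hex, decimal, Windows-1252 text) for U+FEFF in `encoding`.

    Hypothetical helper for illustration only.
    """
    raw = "\ufeff".encode(encoding)
    hex_repr = " ".join(f"{b:02X}" for b in raw)
    dec_repr = " ".join(str(b) for b in raw)
    # Interpret the same bytes as Windows-1252 and make NUL bytes visible.
    text_repr = raw.decode("cp1252").replace("\x00", "\u2400")
    return hex_repr, dec_repr, text_repr


for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(name, *bom_row(name), sep=" | ")
```

The codec names with an explicit `-be` or `-le` suffix are used on purpose: they encode U+FEFF in the given byte order without adding a BOM of their own.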
External links
- The Unicode Standard, chapter 2.6 Encoding Schemes (English, PDF, 1.10 MiB)
- The Unicode Standard, chapter 2.13 Special Characters and Noncharacters, section Byte Order Mark (BOM) (English, PDF, 1.10 MiB)
- The Unicode Standard, chapter 16.8 Specials, section Byte Order Mark (BOM): U+FEFF (English, PDF, 415 KiB)
- Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM (English)
References
- http://bugs.sun.com/view_bug.do?bug_id=4508058
- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415
- http://bugs.php.net/bug.php?id=22108#1067598726
- STD 63: UTF-8, a transformation format of ISO 10646, section Byte Order Mark (BOM)
- UTF-7: only the most significant 6 bits of the fourth byte belong to the BOM. The two lowest bits are determined by the following character.
- UTS #6: Signature Byte Sequence for SCSU
- UTN #6: Signature Byte Sequence