BOM of Unicode: Byte Order Marks
Always prefix a Unicode plain text file with a byte order mark, which informs an application receiving the file that the file is byte-ordered. Available byte order marks are listed in the following table. Because Unicode plain text is a sequence of 16-bit code values, it is sensitive to the byte ordering used when the text is written.
Note A byte order mark is not a control character that selects the byte order of the text.
Byte order mark Description
EF BB BF UTF-8
FF FE UTF-16, little endian
FE FF UTF-16, big endian
FF FE 00 00 UTF-32, little endian
00 00 FE FF UTF-32, big-endian
Note Microsoft uses UTF-16, little endian byte order.
The byte order mark (BOM) is a Unicode character, U+FEFF byte order mark (BOM), whose appearance as a magic number at the start of a text stream can signal several things to a program consuming the text:
What byte order, or endianness, the text stream is stored in;
The fact that the text stream is Unicode, to a high level of confidence;
Which of several Unicode encodings that text stream is encoded as.
BOM use is optional, and, if used, should appear at the start of the text stream.
The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser misinterpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
The preferred place to specify byte order is in a file header, but text files do not have headers. Therefore, Unicode has defined a character (U+FEFF) and a noncharacter (U+FFFE) as byte order marks. They are mirror byte images of each other.
Since the sequence U+FEFF is exceedingly rare at the beginning of a regular non-Unicode text file, it can serve as an implicit marker or signature to identify the file as a Unicode file. Applications that read both Unicode and non-Unicode text files should use the presence of this sequence as an indicator that the file is most likely a Unicode file. Compare this technique to using the MS-DOS EOF marker to terminate text files.