Reading UTF-8 encoded documents in Java
Reading the contents of a file in Java is a very straightforward operation. The Java tutorial explains in detail how to read the different types of I/O streams.
However, when you read a UTF-8 encoded file, your fight will start. Many UTF-8 and UTF-16 encoded files begin with a BOM (byte-order mark). The BOM is the character U+FEFF, placed at the beginning of the stream to signal the byte order and the encoding of the stream. UTF encoded files may or may not contain a BOM.
The problem with Java comes when you read a UTF-8 file with a byte-order mark. Java's UTF-8 decoder doesn't strip the BOM by itself, so when you use the data contained in the file, the BOM contaminates it as a stray U+FEFF character at the start of the text. There is an open bug in Sun's bug database with the explanation and some workarounds users have made.
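To make the problem concrete, here is a minimal sketch that decodes a few bytes the same way a UTF-8 reader would. The class name `BomProblemDemo` and the helper `decodeUtf8` are made up for this illustration; the only library call involved is the standard `String(byte[], Charset)` constructor.

```java
import java.nio.charset.StandardCharsets;

public class BomProblemDemo {
    // Hypothetical helper: decodes raw bytes as UTF-8 using the standard decoder.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The text "hi" preceded by the UTF-8 BOM (EF BB BF).
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String text = decodeUtf8(withBom);

        System.out.println(text.length());              // 3, not 2
        System.out.println(text.charAt(0) == '\uFEFF'); // true: the BOM survived decoding
        System.out.println(text.equals("hi"));          // false
    }
}
```

The extra character is invisible when printed, which is why this kind of bug tends to surface later, for example when comparing the first line of a file against an expected string.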
The solution is as simple as reading the first bytes of a UTF encoded file and checking whether they are a BOM. If they are, just remove them and continue reading the file. This is not very complicated, because there are only five types of BOM and they are easy to identify.
Bytes | Encoding Form |
---|---|
00 00 FE FF | UTF-32, big-endian |
FF FE 00 00 | UTF-32, little-endian |
FE FF | UTF-16, big-endian |
FF FE | UTF-16, little-endian |
EF BB BF | UTF-8 |
When one of these byte sequences is identified at the start of the file, you just have to remove it so your program can interpret the file as it should. A user has contributed two helper classes that can do the job for you.
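The detection step can be sketched roughly as follows. This is not the contributed helper classes, just a minimal illustration built on the table above; `BomStripper`, `detect`, `open` and `decode` are hypothetical names. It assumes the JDK provides the `UTF-32BE` and `UTF-32LE` charsets, which common JDKs do but the Java specification does not guarantee.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomStripper {
    // BOM table from the article. The 4-byte marks come first because the
    // UTF-32LE mark (FF FE 00 00) starts with the UTF-16LE mark (FF FE).
    private static final String[] NAMES = {"UTF-32BE", "UTF-32LE", "UTF-16BE", "UTF-16LE", "UTF-8"};
    private static final int[][] MARKS = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xFE, 0xFF},
        {0xFF, 0xFE},
        {0xEF, 0xBB, 0xBF},
    };

    // Returns the index into MARKS of the BOM found in the first n bytes of head, or -1.
    static int detect(byte[] head, int n) {
        for (int i = 0; i < MARKS.length; i++) {
            int[] mark = MARKS[i];
            if (n < mark.length) continue;
            boolean match = true;
            for (int j = 0; j < mark.length; j++) {
                if ((head[j] & 0xFF) != mark[j]) { match = false; break; }
            }
            if (match) return i;
        }
        return -1;
    }

    // Wraps the stream in a Reader, skipping a leading BOM if there is one.
    // The fallback charset is used when no BOM is found.
    static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 4);
        byte[] head = new byte[4];
        int n = 0;
        while (n < head.length) {              // read() may return fewer bytes than asked
            int r = pin.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        int hit = detect(head, n);
        int skip = hit < 0 ? 0 : MARKS[hit].length;
        pin.unread(head, skip, n - skip);      // push back everything after the BOM
        Charset cs = hit < 0 ? fallback : Charset.forName(NAMES[hit]);
        return new InputStreamReader(pin, cs);
    }

    // Convenience wrapper for in-memory data; wraps IOException for brevity.
    static String decode(byte[] data, Charset fallback) {
        try (Reader r = open(new ByteArrayInputStream(data), fallback)) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals("hi")); // true
    }
}
```

For a real file, `open(new FileInputStream(path), StandardCharsets.UTF_8)` would be the equivalent call. One known limitation of this approach: a UTF-16LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE, so the sketch would treat it as UTF-32LE.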