Monday, 12 November 2007
Reading UTF-8 encoded documents in Java
Reading the contents of a file in Java is a very straightforward operation. The Java tutorial explains in detail how to read different types of streams.
However, when you read a UTF-8 encoded file, the fight begins. Many UTF-8 and UTF-16 encoded files start with a byte-order mark (BOM). The BOM is the character U+FEFF placed at the very beginning of the stream; its encoded bytes reveal the byte order and the encoding of the stream. UTF-encoded files may or may not contain a BOM.
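To make the idea concrete, here is a small sketch that encodes U+FEFF with three of the standard charsets and prints the resulting bytes. The class name `BomBytes` and the helper `hex` are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class BomBytes {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF"; // the byte-order mark as a single character
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16LE))); // FF FE
    }
}
```

The same character produces a different byte sequence in each encoding, which is why the first bytes of a file can identify how it was encoded.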
The problem with Java arises when you read a Unicode file that has a byte-order mark. Java's UTF-8 decoder does not detect or skip the BOM by itself, so the BOM shows up as a stray U+FEFF character at the start of the data you read and contaminates it. There is an open bug in Sun's bug database with an explanation and some workarounds that users have written.
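The effect is easy to reproduce without a real file. The sketch below feeds an in-memory UTF-8 byte array that starts with a BOM through a plain `InputStreamReader`; the class name `BomProblem` and the method `firstLine` are invented for this example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomProblem {
    // Reads the first line of UTF-8 data with no BOM handling at all.
    static String firstLine(byte[] data) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 BOM, followed by the text "hi".
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String line = firstLine(data);
        System.out.println(line.length());              // 3, not 2
        System.out.println(line.charAt(0) == '\uFEFF'); // true
    }
}
```

The extra character is invisible when printed, which is what makes this problem hard to spot: string comparisons and parsers fail on text that looks correct.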
The solution is as simple as reading the first bytes of a UTF-encoded file and checking whether they are a BOM. If they are, just remove them and continue reading the file. This is not complicated, because there are only five BOMs and they are easy to identify. Check the longer marks first, because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.
| Bytes | Encoding Form |
|---|---|
| 00 00 FE FF | UTF-32, big-endian |
| FF FE 00 00 | UTF-32, little-endian |
| FE FF | UTF-16, big-endian |
| FF FE | UTF-16, little-endian |
| EF BB BF | UTF-8 |
When one of these byte sequences is identified, you just have to remove it so that your program can interpret the file as intended. A user has contributed two helper classes that can do the job for you.
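As a rough illustration of the approach (this is not the contributed helper classes, just a minimal sketch), the code below peeks at up to four bytes with a `PushbackInputStream`, matches them against the table above, and pushes back whatever is not part of a BOM. The names `BomSkipper`, `open`, `matches` and `decode` are made up for this example, and it assumes the JVM provides the `UTF-32BE` and `UTF-32LE` charsets, which the Java specification does not guarantee.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSkipper {
    // Returns a Reader over `in` with any leading BOM consumed.
    // `fallback` is the charset used when no BOM is found.
    static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 4);
        byte[] head = new byte[4];
        int n = 0;
        // A single read() may return fewer bytes than requested, so loop.
        while (n < head.length) {
            int r = pin.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        Charset charset = fallback;
        int bomLength = 0;
        // Longest marks first: FF FE 00 00 also starts with FF FE.
        if (n >= 4 && matches(head, 0x00, 0x00, 0xFE, 0xFF)) {
            charset = Charset.forName("UTF-32BE"); // assumed to be available
            bomLength = 4;
        } else if (n >= 4 && matches(head, 0xFF, 0xFE, 0x00, 0x00)) {
            charset = Charset.forName("UTF-32LE"); // assumed to be available
            bomLength = 4;
        } else if (n >= 3 && matches(head, 0xEF, 0xBB, 0xBF)) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && matches(head, 0xFE, 0xFF)) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && matches(head, 0xFF, 0xFE)) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }

        // Return the bytes that were read but are not part of the BOM.
        if (n > bomLength) {
            pin.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pin, charset);
    }

    // True if the first bom.length bytes of head equal the given values.
    static boolean matches(byte[] head, int... bom) {
        for (int i = 0; i < bom.length; i++) {
            if ((head[i] & 0xFF) != bom[i]) {
                return false;
            }
        }
        return true;
    }

    // Convenience for the demo: decodes a whole byte array to a String.
    static String decode(byte[] data, Charset fallback) {
        try (Reader reader = open(new ByteArrayInputStream(data), fallback)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        System.out.println(decode(utf8, StandardCharsets.UTF_8));    // hi
        System.out.println(decode(utf16le, StandardCharsets.UTF_8)); // hi
    }
}
```

For a real file, `open(new FileInputStream(path), StandardCharsets.UTF_8)` would be the equivalent call. One caveat of this sketch: a UTF-16 little-endian file whose first character after the BOM is U+0000 is indistinguishable from UTF-32 little-endian and would be decoded as the latter.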
Posted by at 9:55 AM in Java