Reading UTF-8 encoded documents in Java

Reading the contents of a file in Java is a straightforward operation. The Java Tutorial explains in detail how to read the different types of streams.

However, when you read a UTF-8 encoded file, the fight begins. Many UTF-8 and UTF-16 encoded files start with a byte-order mark (BOM): the character U+FEFF placed at the beginning of the stream to signal the byte order and the encoding of the stream. UTF encoded files may or may not contain a BOM.

The problem in Java appears when you read a Unicode file that carries a byte-order mark. Java does not detect the BOM by itself, so when you process the data contained in the file, the BOM contaminates it. There is an open bug in Sun's bug database with an explanation and some workarounds users have contributed.

The solution is as simple as reading the first bytes of a UTF encoded file and checking whether they are a BOM. If they are, just skip them and continue reading the file. This is not complicated, because there are only five kinds of BOM and they are easy to identify.
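For the common UTF-8 case, the check-and-skip can be sketched with a PushbackInputStream: read the first three bytes, and push them back if they are not the UTF-8 BOM (EF BB BF). This is a minimal sketch, not the helper classes mentioned below; the class name `UnicodeReader` and the method `openUtf8` are made up for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader {

    // Opens a UTF-8 file, skipping the BOM (EF BB BF) if one is present.
    public static Reader openUtf8(String path) throws IOException {
        PushbackInputStream in =
                new PushbackInputStream(new FileInputStream(path), 3);
        byte[] head = new byte[3];
        int read = in.read(head, 0, 3);
        boolean hasBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        // Not a BOM: push the bytes back so the caller sees the whole file.
        if (!hasBom && read > 0) {
            in.unread(head, 0, read);
        }
        return new InputStreamReader(in, "UTF-8");
    }
}
```

The reader returned behaves the same whether or not the file had a BOM, which is exactly the behaviour you would expect Java to provide out of the box.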

Bytes        Encoding Form
EF BB BF     UTF-8
00 00 FE FF  UTF-32, big-endian
FF FE 00 00  UTF-32, little-endian
FE FF        UTF-16, big-endian
FF FE        UTF-16, little-endian

Once one of these sequences is identified, you just have to skip it so your program can interpret the file as it should. A user has contributed two helper classes that do the job for you.



2 thoughts on "Reading UTF-8 encoded documents in Java"

  • Rehansang

    Yes, it is debatable, but I was considering this because:
    * it is simpler to convert an existing ASCII application to UTF-8 than to UTF-32
    * sometime in the (far) future UTF-32 could turn out to be not enough; it happened with UTF-16 before
    * with some luck you can convince ASCII applications to work with UTF-8 strings without breaking them, which would be clearly impossible with UTF-32
    * on average UTF-8 consumes far less space than UTF-32.
    And about processing: I wouldn't even try to write my own UTF-8 string parsing routines; there are good, free and open-source solutions for this.