Reading UTF-8 encoded documents in Java

Reading the contents of a file in Java is a straightforward operation. The Java Tutorial explains in detail how to read the different types of streams.

However, when you read a UTF-8 encoded file, the fight begins. Many UTF-8 and UTF-16 encoded files start with a byte-order mark (BOM): the character U+FEFF placed at the beginning of the stream to signal the byte order and the encoding of the stream. UTF encoded files may or may not contain a BOM.

The problem in Java appears when you read a Unicode file that carries a byte-order mark. Java does not detect the BOM by itself, so when you process the data contained in the file, the BOM contaminates it. There is an open bug in Sun's bug database with an explanation and some workarounds users have contributed.

The solution is as simple as reading the first bytes of a UTF encoded file and checking whether they are a BOM. If they are, just skip them and continue reading the file. This is not complicated, because there are only five kinds of BOM and they are easy to identify.
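For the common UTF-8 case, the check-and-skip can be sketched with a PushbackInputStream: read the first three bytes, and push them back if they are not the UTF-8 BOM (EF BB BF). This is a minimal sketch, not the helper classes mentioned below; the class name `UnicodeReader` and the method `openUtf8` are made up for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader {

    // Opens a UTF-8 file, skipping the BOM (EF BB BF) if one is present.
    public static Reader openUtf8(String path) throws IOException {
        PushbackInputStream in =
                new PushbackInputStream(new FileInputStream(path), 3);
        byte[] head = new byte[3];
        int read = in.read(head, 0, 3);
        boolean hasBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        // Not a BOM: push the bytes back so the caller sees the whole file.
        if (!hasBom && read > 0) {
            in.unread(head, 0, read);
        }
        return new InputStreamReader(in, "UTF-8");
    }
}
```

The reader returned behaves the same whether or not the file had a BOM, which is exactly the behaviour you would expect Java to provide out of the box.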

Bytes        Encoding Form
EF BB BF     UTF-8
00 00 FE FF  UTF-32, big-endian
FF FE 00 00  UTF-32, little-endian
FE FF        UTF-16, big-endian
FF FE        UTF-16, little-endian

Once one of these sequences is identified, you just have to skip it so your program can interpret the file as it should. A user has contributed two helper classes that do the job for you.



2 thoughts on "Reading UTF-8 encoded documents in Java"

  • Rehansang

    Yes, it is debatable, but I was considering this because:
    * it is simpler to convert an existing ASCII application to UTF-8 than to UTF-32
    * sometime in the (far) future UTF-32 could turn out to be not enough; it happened with UTF-16 before
    * with some luck you can convince ASCII applications to work with UTF-8 strings without breaking them, which would be clearly impossible with UTF-32
    * on average UTF-8 consumes far less space than UTF-32.
    And about processing: I wouldn't even try to write my own UTF-8 string parsing routines; there are good, free and open-source solutions for this.