A logo showing the text blog.marcnuri.com
Español
Home»Java»Reading UTF-8 encoded documents in java

Recent Posts

  • Fabric8 Kubernetes Client 7.2 is now available!
  • Connecting to an MCP Server from JavaScript using AI SDK
  • Connecting to an MCP Server from JavaScript using LangChain.js
  • The Future of Developer Tools: Adapting to Machine-Based Developers
  • Connecting to a Model Context Protocol (MCP) Server from Java using LangChain4j

Categories

  • Artificial Intelligence
  • Front-end
  • Go
  • Industry and business
  • Java
  • JavaScript
  • Legacy
  • Operations
  • Personal
  • Pet projects
  • Tools

Archives

  • May 2025
  • April 2025
  • March 2025
  • February 2025
  • January 2025
  • December 2024
  • November 2024
  • August 2024
  • June 2024
  • May 2024
  • April 2024
  • March 2024
  • February 2024
  • January 2024
  • December 2023
  • November 2023
  • October 2023
  • September 2023
  • August 2023
  • July 2023
  • June 2023
  • May 2023
  • April 2023
  • March 2023
  • February 2023
  • January 2023
  • December 2022
  • November 2022
  • October 2022
  • September 2022
  • August 2022
  • July 2022
  • June 2022
  • May 2022
  • March 2022
  • February 2022
  • January 2022
  • December 2021
  • November 2021
  • October 2021
  • September 2021
  • August 2021
  • July 2021
  • January 2021
  • December 2020
  • November 2020
  • October 2020
  • September 2020
  • August 2020
  • July 2020
  • June 2020
  • May 2020
  • February 2020
  • January 2020
  • December 2019
  • October 2019
  • September 2019
  • July 2019
  • March 2019
  • November 2018
  • July 2018
  • June 2018
  • May 2018
  • April 2018
  • March 2018
  • February 2018
  • December 2017
  • July 2017
  • January 2017
  • December 2015
  • November 2015
  • December 2014
  • March 2014
  • February 2011
  • November 2008
  • June 2008
  • May 2008
  • April 2008
  • January 2008
  • November 2007
  • September 2007
  • August 2007
  • July 2007
  • June 2007
  • May 2007
  • April 2007
  • March 2007

Reading UTF-8 encoded documents in java

2007-11-12 in Java / Legacy tagged Encoding / Java / Parsing / UTF by Marc Nuri | Last updated: 2021-02-18

Reading the contents of a file in Java is a very straightforward operation. The java tutorial fully explains how you can find your way to read different types of IO streams.

However, when you read an UTF-8 encoded file your fight will start. Most of the UTF-8 and UTF-16 encoded files contain a character at the beginning called BOM (byte-order mark). The BOM consists of a character (U+FEFF) at the beginning of the stream used to define the byte order and the encoding of the stream. UTF encoded files may or may not contain the BOM.

The problem with Java comes when you read a Unicode file with a byte-order mark. Java doesn't detect the BOM by itself so when you represent the data contained in the file, the BOM contaminates it. There is an open BUG in Sun's bug database with the explanation and some workaround users have made.

The solution is as simple as as reading the first bytes of an UTF encoded file and check if those a are a BOM or not. If they are, just remove them and continue reading the file. This is not so complicated because there are not so many types of BOM (5) and they are easy to identify.

BytesEncoding Form
00 00 FE FFUTF-32, big-endian
FF FE 00 00UTF-32, little-endian
FE FFUTF-16, big-endian
FF FEUTF-16, little-endian
EF BB BFUTF-8

When one of these characters is identified you just have to remove it so your program can interpret the file as it should be. A user has contributed with two helper classes that can do the job for you.

Related Links

  • Unicode FAQ
  • BOM at wikipedia
  • Helper Classes
Twitter iconFacebook iconLinkedIn iconPinterest iconEmail icon

Comments in "Reading UTF-8 encoded documents in java"

  • Avatar for Pierre
    Pierre
    2009-03-09 18:36
    Just what I needed to fix a problem I was having. Thanks!
  • Avatar for Rehansang
    Rehansang
    2013-01-04 04:21
    Yes is debatable, but I was considering this because:* it simpler to convert an existing ASCII application to UTF-8 that to UTF-32* sometime in the, far, future UTF-32 could be not enough, it did happened with UTF-16 before * with some luck you can convince ASCII applications to work with UTF-8 strings without breaking them. This would be clearly impossible with UTF-32* on average UTF-8 is consumes far less space than UTF-32.And about processing: I wouldn't even try to write my own UTF-8 string parsing routines there are good, free and open-source solutions for this.

Post navigation
How to check open TCP/IP ports in WindowsViewing and storing images from an IP Camera
© 2007 - 2025 Marc Nuri