StUF – receiving data from a provider where UTF-8 is in fact ISO-8859

Recently, when receiving information from a StUF web service created by a large Dutch provider of government IT systems, we had an issue with characters that have their high bit set.

Although the web service claimed to send its information as UTF-8, it was in fact encoding it with a form of ISO-8859.

The most likely character set they used is ISO-8859-1 (historically the default charset for text content in HTTP/1.1), but it might also be ISO-8859-15, an adaptation of ISO-8859-1 that trades some rarely used symbols for the euro sign, some French letters, and some letters used in Finnish and Estonian and for transliterating Russian.
(Note that the Windows-1252 code page contains all printable characters of both ISO-8859-1 and ISO-8859-15, although the ISO-8859-15 additions sit at different byte values there.)
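
To make the difference between the candidate character sets concrete, here is a small Python sketch (Python rather than C#, purely for illustration). It decodes the eight byte values at which ISO-8859-15 deviates from ISO-8859-1, and shows that the byte for ä is the same in all three code pages. The variable names are my own.

```python
import unicodedata

# The eight byte values at which ISO-8859-15 differs from ISO-8859-1.
DIFFERING = bytes([0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE])

latin1 = DIFFERING.decode("iso-8859-1")
latin9 = DIFFERING.decode("iso-8859-15")

for byte, a, b in zip(DIFFERING, latin1, latin9):
    print(f"{byte:02X}: ISO-8859-1 = {unicodedata.name(a)}, "
          f"ISO-8859-15 = {unicodedata.name(b)}")

# The byte E4 means the same letter in all three code pages.
a_umlaut = {codec: b"\xe4".decode(codec)
            for codec in ("iso-8859-1", "iso-8859-15", "cp1252")}

# Windows-1252 can represent the ISO-8859-15 additions,
# but it stores them at different byte values.
latin9_in_cp1252 = latin9.encode("cp1252")
```

So for a name like the one in this post the choice between the three does not matter, but for a euro sign it does.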

Since it is not possible to reliably “guess” the right encoding (there are far too many possibilities; even IsTextUnicode, which Notepad uses, fails, see below), the only way is to use a fixed re-encoding that depends on the StUF data provider.

Links to posts that describe problems with IsTextUnicode:

Raymond Chen in The Old New Thing:

Michael Kaplan in Sorting it all Out:

If the XML specified the right encoding, it would be possible to reliably detect and use it: http://stackoverflow.com/questions/637855/how-to-best-detect-encoding-in-xml-file. However, that is not the case here: the providing party lies.
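
A fixed re-encoding per provider could look roughly like the Python sketch below. It is only an illustration of the idea, not the C#/.NET solution: the provider name, the ACTUAL_ENCODING table and the decode_payload function are all hypothetical.

```python
# Hypothetical table: provider id -> the encoding that provider really uses,
# regardless of what its XML declaration claims.
ACTUAL_ENCODING = {
    "provider-x": "iso-8859-1",  # declares UTF-8, sends ISO-8859-1
}


def decode_payload(provider: str, raw: bytes, declared: str = "utf-8") -> str:
    """Decode raw bytes, preferring the known real encoding of the provider."""
    encoding = ACTUAL_ENCODING.get(provider, declared)
    return raw.decode(encoding)


raw = b"<voornamen>Aart Iza\xe4k</voornamen>"
text = decode_payload("provider-x", raw)

# Once decoded correctly, the text can be re-encoded as real UTF-8.
repaired = text.encode("utf-8")
```

Providers that are not in the table are simply decoded with the encoding they declare.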

So let's look at the actual data: how the StUF provider sends it over the line, and what is actually meant.

This is a snippet of the content we received (it is a SOAP response, but I cut down all the non-essential stuff):

000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 <?xml version="1
000010: 2E 30 22 20 65 6E 63 6F  64 69 6E 67 3D 22 55 54 .0" encoding="UT
000020: 46 2D 38 22 3F 3E 0D 0A  3C 76 6F 6F 72 6E 61 6D F-8"?>..<voornam
000030: 65 6E 3E 41 61 72 74 20  49 7A 61 E4 6B 3C 2F 76 en>Aart Iza.k</v
000040: 6F 6F 72 6E 61 6D 65 6E  3E 0D 0A                oornamen>..

What they meant to pass is the name Aart Izaäk, in which the ä has character code E4 in both ISO-8859-1 and ISO-8859-15.
But instead, because the data is labelled as UTF-8, the E4 byte reads as the lead byte of a three-byte UTF-8 sequence, so they effectively passed the byte sequence E4 6B 3C.
That is an invalid sequence, because every byte after the first in a multi-byte sequence must have the high bit set and the second-highest bit clear (see the table of valid bytes in this UTF-8 wikipedia article), and 6B and 3C both have the high bit clear.
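
The bit patterns can be checked with a few lines of Python. This is a simplified sketch under my own naming: it only looks at the lead byte and the continuation-byte pattern, and it ignores the finer rules (overlong forms, surrogates) that a full validator also enforces.

```python
def sequence_length(lead: int) -> int:
    """Length implied by a UTF-8 lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx
    return 0


def is_continuation(byte: int) -> bool:
    """True if the byte looks like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def first_sequence_valid(data: bytes) -> bool:
    """Simplified check of the first UTF-8 sequence in data."""
    length = sequence_length(data[0])
    if length == 0 or len(data) < length:
        return False
    return all(is_continuation(b) for b in data[1:length])


received = bytes([0xE4, 0x6B, 0x3C])
for b in received:
    print(f"{b:02X} = {b:08b}")

received_ok = first_sequence_valid(received)       # E4 6B 3C
intended_ok = first_sequence_valid(b"\xc3\xa4")    # C3 A4
```

E4 announces a three-byte sequence, but the two bytes that follow start with a 0 bit, so the check fails for the received bytes and succeeds for C3 A4.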

What they should have done is pass the bytes as C3 A4, which is the valid UTF-8 encoding for ä:

000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 <?xml version="1
000010: 2E 30 22 20 65 6E 63 6F  64 69 6E 67 3D 22 55 54 .0" encoding="UT
000020: 46 2D 38 22 3F 3E 0D 0A  3C 76 6F 6F 72 6E 61 6D F-8"?>..<voornam
000030: 65 6E 3E 41 61 72 74 20  49 7A 61 C3 A4 6B 3C 2F en>Aart Iza..k</
000040: 76 6F 6F 72 6E 61 6D 65  6E 3E 0D 0A             voornamen>..
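
Where C3 A4 comes from can be worked out by hand: the code point U+00E4 is split into its top bits and its low six bits, which are then packed into a 110xxxxx lead byte and a 10xxxxxx continuation byte. A minimal Python sketch of that arithmetic (the function name is mine, and it only covers the two-byte range):

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as a two-byte UTF-8 sequence."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = 0xC0 | (code_point >> 6)            # 110xxxxx: the top five bits
    continuation = 0x80 | (code_point & 0x3F)  # 10xxxxxx: the low six bits
    return bytes([lead, continuation])


encoded = encode_two_byte(0xE4)  # U+00E4 is the letter a with diaeresis
print(" ".join(f"{b:02X}" for b in encoded))
```

The result agrees with what Python's built-in UTF-8 encoder produces for the same character.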

Actually, when you write a UTF-8 encoded stream in .NET (for instance with Encoding.UTF8), it will typically prepend a BOM (Byte Order Mark) indicating what kind of Unicode the file contains.
The BOM used here is EF BB BF, indicating the UTF-8 encoding.
You do not strictly need a BOM for XML files, as the encoding of the file should be the same as the encoding specified in the XML header. But it does no harm either:

000000: EF BB BF 3C 3F 78 6D 6C  20 76 65 72 73 69 6F 6E ...<?xml version
000010: 3D 22 31 2E 30 22 20 65  6E 63 6F 64 69 6E 67 3D ="1.0" encoding=
000020: 22 55 54 46 2D 38 22 3F  3E 0D 0A 3C 76 6F 6F 72 "UTF-8"?>..<voor
000030: 6E 61 6D 65 6E 3E 41 61  72 74 20 49 7A 61 C3 A4 namen>Aart Iza..
000040: 6B 3C 2F 76 6F 6F 72 6E  61 6D 65 6E 3E 0D 0A 00 k</voornamen>...
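
For comparison, this is how the same BOM shows up in Python, where the "utf-8-sig" codec plays the role of a BOM-writing UTF-8 encoder. This is a Python illustration only and says nothing about the .NET API itself.

```python
import codecs

xml = ('<?xml version="1.0" encoding="UTF-8"?>\r\n'
       '<voornamen>Aart Iza\u00e4k</voornamen>\r\n')

with_bom = xml.encode("utf-8-sig")   # prepends EF BB BF
without_bom = xml.encode("utf-8")    # no BOM

starts_with_bom = with_bom.startswith(codecs.BOM_UTF8)

# "utf-8-sig" strips a BOM when decoding and also accepts input without one.
same_text = (with_bom.decode("utf-8-sig")
             == without_bom.decode("utf-8-sig")
             == xml)

# Plain "utf-8" keeps the BOM as a U+FEFF character at the start of the text.
leaked = with_bom.decode("utf-8")[0]
```

A reader that is not aware of the BOM ends up with a stray U+FEFF in front of the XML declaration, which is worth keeping in mind when passing the text on to a parser.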

An important thing to note is that the .NET StreamReader does not reject invalid UTF-8 by default; instead, it processes the input and replaces each invalid UTF-8 sequence with a U+FFFD code point.
This is the Unicode special character called “replacement character”. It marks a character that the Unicode decoder could not decode correctly.
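
Python's decoder can show both behaviours side by side: strict decoding rejects the bytes, while decoding with errors="replace" substitutes U+FFFD, similar to the lenient behaviour described above. Again, this is an analogy in Python, not the .NET code.

```python
raw = b"Aart Iza\xe4k"  # ISO-8859-1 bytes that are claimed to be UTF-8

# Strict decoding refuses the input and reports where it went wrong.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Lenient decoding silently swaps the bad byte for U+FFFD.
lenient = raw.decode("utf-8", errors="replace")
```

The lenient variant never fails, which is exactly why the damage goes unnoticed until someone looks at the names.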

A really great reference for Unicode code points is utf8-chartable.de, which lists all the Unicode version 5.1.0 code points together with information such as the UTF-8 byte sequence, HTML encodings, etc.
It shows the U+FFFD code point at this page.

I will go into more detail on how to work with these encoding issues in C#/.NET later, but the conversions below show what I mean.

First the conversion from UTF-8 to UTF-16:

Original: ISO-8859-1 in a UTF-8 disguise:
000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 <?xml version="1
000010: 2E 30 22 20 65 6E 63 6F  64 69 6E 67 3D 22 55 54 .0" encoding="UT
000020: 46 2D 38 22 3F 3E 0D 0A  3C 76 6F 6F 72 6E 61 6D F-8"?>..<voornam
000030: 65 6E 3E 41 61 72 74 20  49 7A 61 E4 6B 3C 2F 76 en>Aart Iza.k</v
000040: 6F 6F 72 6E 61 6D 65 6E  3E 0D 0A                oornamen>..
Converted from UTF-8 to UTF-16
Note the FD FF byte sequence at offset 000078 that marks the U+FFFD code point:
000000: FF FE 3C 00 3F 00 78 00  6D 00 6C 00 20 00 76 00 ..<.?.x.m.l. .v.
000010: 65 00 72 00 73 00 69 00  6F 00 6E 00 3D 00 22 00 e.r.s.i.o.n.=.".
000020: 31 00 2E 00 30 00 22 00  20 00 65 00 6E 00 63 00 1...0.". .e.n.c.
000030: 6F 00 64 00 69 00 6E 00  67 00 3D 00 22 00 55 00 o.d.i.n.g.=.".U.
000040: 54 00 46 00 2D 00 38 00  22 00 3F 00 3E 00 0D 00 T.F.-.8.".?.>...
000050: 0A 00 3C 00 76 00 6F 00  6F 00 72 00 6E 00 61 00 ..<.v.o.o.r.n.a.
000060: 6D 00 65 00 6E 00 3E 00  41 00 61 00 72 00 74 00 m.e.n.>.A.a.r.t.
000070: 20 00 49 00 7A 00 61 00  FD FF 6B 00 3C 00 2F 00  .I.z.a...k.<./.
000080: 76 00 6F 00 6F 00 72 00  6E 00 61 00 6D 00 65 00 v.o.o.r.n.a.m.e.
000090: 6E 00 3E 00 0D 00 0A 00                          n.>.....

Then the conversion from UTF-8 to UTF-8:

Original: ISO-8859-1 in a UTF-8 disguise:
000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 <?xml version="1
000010: 2E 30 22 20 65 6E 63 6F  64 69 6E 67 3D 22 55 54 .0" encoding="UT
000020: 46 2D 38 22 3F 3E 0D 0A  3C 76 6F 6F 72 6E 61 6D F-8"?>..<voornam
000030: 65 6E 3E 41 61 72 74 20  49 7A 61 E4 6B 3C 2F 76 en>Aart Iza.k</v
000040: 6F 6F 72 6E 61 6D 65 6E  3E 0D 0A                oornamen>..
Converted from UTF-8 to UTF-8
Note the EF BF BD byte sequence at offset 00003E that marks the U+FFFD code point:
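000000: EF BB BF 3C 3F 78 6D 6C  20 76 65 72 73 69 6F 6E ...<?xml version
000010: 3D 22 31 2E 30 22 20 65  6E 63 6F 64 69 6E 67 3D ="1.0" encoding=
000020: 22 55 54 46 2D 38 22 3F  3E 0D 0A 3C 76 6F 6F 72 "UTF-8"?>..<voor
000030: 6E 61 6D 65 6E 3E 41 61  72 74 20 49 7A 61 EF BF namen>Aart Iza..
000040: BD 6B 3C 2F 76 6F 6F 72  6E 61 6D 65 6E 3E 0D 0A .k</voornamen>..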
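
The two offsets quoted for the converted data can be reproduced with a short Python sketch. It rebuilds the received bytes, decodes them leniently as UTF-8, and writes the result back out with a BOM in both encodings; the variable names are mine.

```python
import codecs

# The bytes as received: ISO-8859-1 content behind a UTF-8 declaration.
original = (b'<?xml version="1.0" encoding="UTF-8"?>\r\n'
            b'<voornamen>Aart Iza\xe4k</voornamen>\r\n')

# Lenient decode: the lone E4 byte becomes U+FFFD.
text = original.decode("utf-8", errors="replace")

# Re-encode with a BOM, once as UTF-8 and once as little-endian UTF-16.
as_utf8 = text.encode("utf-8-sig")
as_utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

utf8_offset = as_utf8.find(b"\xef\xbf\xbd")   # U+FFFD in UTF-8
utf16_offset = as_utf16.find(b"\xfd\xff")     # U+FFFD in UTF-16 LE

print(f"UTF-8 offset:  {utf8_offset:06X}")
print(f"UTF-16 offset: {utf16_offset:06X}")
```

Both conversions succeed without any error, and in both the original E4 byte is gone for good.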

Above you can see that the UTF-16 encoded XML is also prepended with a BOM (Byte Order Mark) indicating what kind of Unicode the file contains.
In this case it is FF FE, indicating the little-endian byte ordering used on the Intel x86 instruction set architecture.
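
How the UTF-16 BOM encodes the byte order can be shown with a small Python sketch. The detect_utf16_order helper is a hypothetical name of my own, and it only looks at the first two bytes.

```python
import codecs


def detect_utf16_order(data: bytes) -> str:
    """Return the UTF-16 byte order announced by the BOM, if there is one."""
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return "unknown"


letter = "\u00e4"  # the letter a with diaeresis
little = codecs.BOM_UTF16_LE + letter.encode("utf-16-le")  # FF FE E4 00
big = codecs.BOM_UTF16_BE + letter.encode("utf-16-be")     # FE FF 00 E4

# The generic "utf-16" codec reads the BOM and picks the byte order itself.
decoded_little = little.decode("utf-16")
decoded_big = big.decode("utf-16")
```

Both byte orders decode to the same text, as long as the BOM is there to tell them apart.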

In a future blog post, I’ll show how to repair this wrong encoding in C#/.NET.

–jeroen

Posted in Development, Encoding, ISO-8859, ISO8859, Unicode, UTF-8, UTF8, XML, XML/XSD
