Recently, when receiving information from a StUF web service created by a large Dutch provider of government IT systems, we had an issue with characters having their high bit set.
Although the web service claimed to send its information as UTF-8, it was in fact encoding it using a form of ISO-8859.
The most likely character set is ISO-8859-1 (since that was historically the default character set for text content in the HTTP protocol), but it might also be ISO-8859-15, an adaptation of ISO-8859-1 that trades some typographic characters for the euro sign, some characters from French, and some characters used for the transliteration of Russian, Finnish and Estonian.
(Note that the printable characters of both ISO-8859-1 and ISO-8859-15 can be displayed by the Windows-1252 code page, although the characters that ISO-8859-15 adds sit at different code points there.)
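To make the difference between these code pages concrete, here is a small sketch. It is in Python rather than C#/.NET, purely as an illustration, and it relies on Python's built-in codec names (`iso-8859-1`, `iso-8859-15`, `cp1252`); the variable names are mine.

```python
# Byte E4 decodes to the same character (a with diaeresis, U+00E4)
# in all three code pages.
e4 = {codec: b"\xe4".decode(codec)
      for codec in ("iso-8859-1", "iso-8859-15", "cp1252")}
print({codec: ascii(char) for codec, char in e4.items()})

# Byte A4 is one of the places where ISO-8859-1 and ISO-8859-15 disagree.
a4_latin1 = b"\xa4".decode("iso-8859-1")   # currency sign, U+00A4
a4_latin9 = b"\xa4".decode("iso-8859-15")  # euro sign, U+20AC
print(ascii(a4_latin1), ascii(a4_latin9))

# Windows-1252 also has the euro sign, but at a different code point.
euro_cp1252 = "\u20ac".encode("cp1252")
print(euro_cp1252.hex())
```

For the name in this post it does not matter which of the three the provider used, because E4 means the same in all of them; it starts to matter as soon as a euro sign shows up.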
Since it is not possible to reliably “guess” the right encoding (there are way too many possibilities; even IsTextUnicode, which is used by Notepad, fails, see below), the only way is to use a fixed re-encoding that depends on the StUF data provider.

Links to posts that describe problems with IsTextUnicode:
Raymond Chen in The Old New Thing:
Michael Kaplan in Sorting it all Out:
- Why I don’t like the IsTextUnicode API
- Behind ‘How to break Windows Notepad’
- More on that which breaks Windows Notepad
If the XML specified the right encoding, it would be possible to reliably detect and use it (see http://stackoverflow.com/questions/637855/how-to-best-detect-encoding-in-xml-file). That is not the case here: the providing party lies.
So let's look at the actual data: how the StUF provider sends it over the line, and what is actually meant.
This is a snippet of the content we received (it is a SOAP response, but I cut out all the non-essential stuff):
    000000: 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 <?xml version="1
    000010: 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54 .0" encoding="UT
    000020: 46 2D 38 22 3F 3E 0D 0A 3C 76 6F 6F 72 6E 61 6D F-8"?>..<voornam
    000030: 65 6E 3E 41 61 72 74 20 49 7A 61 E4 6B 3C 2F 76 en>Aart Iza.k</v
    000040: 6F 6F 72 6E 61 6D 65 6E 3E 0D 0A                oornamen>..
What they meant to pass is the name Aart Izaäk, in which the ä has character code E4 in both ISO-8859-1 and ISO-8859-15.
Instead, a UTF-8 decoder sees E4 as the lead byte of a three-byte character and reads the byte sequence E4 6B 3C.
That is an invalid sequence, because every byte after the first in a multi-byte sequence must have the high bit set and the second-highest bit clear (see the table of valid bytes in the UTF-8 Wikipedia article); 6B and 3C both have the high bit clear.
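The bit patterns involved can be checked with a few lines of code. This is only a sketch in Python (not the C#/.NET this post is about), and the helper names `expected_length` and `is_continuation` are made up for the illustration.

```python
def expected_length(lead: int) -> int:
    """Sequence length announced by a lead byte; 0 if it is not a lead byte."""
    if (lead & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxxxxxx: plain ASCII
    if (lead & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx
    if (lead & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx
    if (lead & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx
    return 0      # 10xxxxxx is a continuation byte, not a lead byte


def is_continuation(byte: int) -> bool:
    """True when the byte has the pattern 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


received = bytes([0xE4, 0x6B, 0x3C])
print(expected_length(received[0]))              # E4 announces 3 bytes
print([is_continuation(b) for b in received[1:]])

# A strict decoder refuses the sequence outright.
try:
    received.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)
```

E4 announces a three-byte character, but neither 6B nor 3C is a continuation byte, so a strict decoder has no choice but to give up at that point.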
What they should have done is pass the bytes as C3 A4, which is the valid UTF-8 encoding for ä:
    000000: 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 <?xml version="1
    000010: 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54 .0" encoding="UT
    000020: 46 2D 38 22 3F 3E 0D 0A 3C 76 6F 6F 72 6E 61 6D F-8"?>..<voornam
    000030: 65 6E 3E 41 61 72 74 20 49 7A 61 C3 A4 6B 3C 2F en>Aart Iza..k</
    000040: 76 6F 6F 72 6E 61 6D 65 6E 3E 0D 0A             voornamen>..
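As a quick sanity check of the C3 A4 claim, the sketch below interprets the received bytes as ISO-8859-1 and encodes the result as UTF-8. It is Python, not the C#/.NET repair this series is heading towards, and it assumes the provider really used ISO-8859-1 (or ISO-8859-15, which agrees on E4).

```python
# The element as received: the name carries a raw E4 byte.
received = b"<voornamen>Aart Iza\xe4k</voornamen>"

# Interpret the bytes with the encoding the provider actually used
# (assumed here to be ISO-8859-1)...
text = received.decode("iso-8859-1")

# ...and encode them the way the XML declaration promised.
repaired = text.encode("utf-8")
print(repaired)
```

The single E4 byte becomes the two bytes C3 A4, which is why the corrected dump is one byte longer than the one we received.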
Actually, in .NET, when you write a UTF-8 encoded stream (for instance with a StreamWriter created with Encoding.UTF8), it will be prepended with a BOM (Byte Order Mark) indicating what kind of Unicode the file contains.
The BOM used here is EF BB BF, indicating the UTF-8 encoding.
You do not strictly need a BOM for XML files, as the encoding of the file should be the same as the encoding specified in the XML declaration. But it does no harm either:
    000000: EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E ...<?xml version
    000010: 3D 22 31 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D ="1.0" encoding=
    000020: 22 55 54 46 2D 38 22 3F 3E 0D 0A 3C 76 6F 6F 72 "UTF-8"?>..<voor
    000030: 6E 61 6D 65 6E 3E 41 61 72 74 20 49 7A 61 C3 A4 namen>Aart Iza..
    000040: 6B 3C 2F 76 6F 6F 72 6E 61 6D 65 6E 3E 0D 0A 00 k</voornamen>...
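The BOM itself is easy to inspect. The sketch below uses Python's `utf-8-sig` codec, which writes the BOM when encoding and strips it when decoding; that is a Python convention and only an analogy for what the .NET writer does, not a description of it.

```python
import codecs

xml = ('<?xml version="1.0" encoding="UTF-8"?>\r\n'
       '<voornamen>Aart Iza\u00e4k</voornamen>\r\n')

without_bom = xml.encode("utf-8")      # plain UTF-8
with_bom = xml.encode("utf-8-sig")     # UTF-8 preceded by EF BB BF

print(with_bom[:3].hex(" "))
print(len(with_bom) - len(without_bom))

# Decoding with the same codec drops the BOM again.
round_trip = with_bom.decode("utf-8-sig")
```

Apart from the three extra bytes at the start, the two encodings are byte-for-byte identical.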
An important thing to note is that the .NET StreamReader does not reject invalid UTF-8; instead, it keeps processing the stream and replaces each invalid byte sequence with a U+FFFD code point.
This is the Unicode special character called “replacement character”. It marks a character that the Unicode decoder could not decode correctly.
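The same replacement behaviour can be reproduced outside .NET. The sketch below uses Python's `errors="replace"` mode, which I am using as a stand-in for the lenient decoding described above; it is not the StreamReader itself.

```python
received = b"Aart Iza\xe4k"

# Lenient decoding: the byte that cannot be decoded becomes U+FFFD.
decoded = received.decode("utf-8", errors="replace")
print(ascii(decoded))

# The original E4 is gone for good: encoding again yields EF BF BD.
reencoded = decoded.encode("utf-8")
print(reencoded)
```

Once the replacement has happened, there is no way to tell from the decoded text which byte was originally there, so any repair has to start from the raw bytes.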
A really great reference for Unicode code points is utf8-chartable.de, which lists all the Unicode version 5.1.0 code points, including information like the UTF-8 byte sequence, HTML encodings, etc.
It shows the U+FFFD code point on this page.
I will go into more detail on how to work with these encoding issues in C#/.NET, but the conversions below show what I mean.
First the conversion from UTF-8 to UTF-16:
Original: ISO-8859-1 in a UTF-8 disguise:
    000000: 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 <?xml version="1
    000010: 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54 .0" encoding="UT
    000020: 46 2D 38 22 3F 3E 0D 0A 3C 76 6F 6F 72 6E 61 6D F-8"?>..<voornam
    000030: 65 6E 3E 41 61 72 74 20 49 7A 61 E4 6B 3C 2F 76 en>Aart Iza.k</v
Converted from UTF-8 to UTF-16
Note the FD FF byte sequence at offset 000078 that marks the U+FFFD code point:
    000000: FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 ..<.?.x.m.l. .v.
    000010: 65 00 72 00 73 00 69 00 6F 00 6E 00 3D 00 22 00 e.r.s.i.o.n.=.".
    000020: 31 00 2E 00 30 00 22 00 20 00 65 00 6E 00 63 00 1...0.". .e.n.c.
    000030: 6F 00 64 00 69 00 6E 00 67 00 3D 00 22 00 55 00 o.d.i.n.g.=.".U.
    000040: 54 00 46 00 2D 00 38 00 22 00 3F 00 3E 00 0D 00 T.F.-.8.".?.>...
    000050: 0A 00 3C 00 76 00 6F 00 6F 00 72 00 6E 00 61 00 ..<.v.o.o.r.n.a.
    000060: 6D 00 65 00 6E 00 3E 00 41 00 61 00 72 00 74 00 m.e.n.>.A.a.r.t.
    000070: 20 00 49 00 7A 00 61 00 FD FF 6B 00 3C 00 2F 00  .I.z.a...k.<./.
    000080: 76 00 6F 00 6F 00 72 00 6E 00 61 00 6D 00 65 00 v.o.o.r.n.a.m.e.
    000090: 6E 00 3E 00 0D 00 0A 00                         n.>.....
Then the conversion from UTF-8 to UTF-8:
Original: ISO-8859-1 in a UTF-8 disguise:
    000000: 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 <?xml version="1
    000010: 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54 .0" encoding="UT
    000020: 46 2D 38 22 3F 3E 0D 0A 3C 76 6F 6F 72 6E 61 6D F-8"?>..<voornam
    000030: 65 6E 3E 41 61 72 74 20 49 7A 61 E4 6B 3C 2F 76 en>Aart Iza.k</v
    000040: 6F 6F 72 6E 61 6D 65 6E 3E 0D 0A                oornamen>..
Converted from UTF-8 to UTF-8
Note the EF BF BD byte sequence at offset 00003E that marks the U+FFFD code point:
    000000: EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E ...<?xml version
    000010: 3D 22 31 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D ="1.0" encoding=
    000020: 22 55 54 46 2D 38 22 3F 3E 0D 0A 3C 76 6F 6F 72 "UTF-8"?>..<voor
    000030: 6E 61 6D 65 6E 3E 41 61 72 74 20 49 7A 61 EF BF namen>Aart Iza..
    000040: BD 6B 3C 2F 76 6F 6F 72 6E 61 6D 65 6E 3E 0D 0A .k</voornamen>..
Above you can see that the UTF-16 encoded XML is also prepended with a BOM (Byte Order Mark) indicating what kind of Unicode the file contains.
In this case it is FF FE, indicating the little-endian byte ordering used on the Intel x86 instruction set architecture.
In a future blog post, I’ll show how to repair this wrong encoding in C#/.NET.
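Both conversions can be replayed in a few lines to double-check the offsets mentioned with the dumps. This is a Python sketch standing in for the .NET conversion; the BOMs are added by hand via the `codecs` constants so the byte order does not depend on the machine running it.

```python
import codecs

# The snippet as received: declared as UTF-8, but with a raw E4 byte.
received = (b'<?xml version="1.0" encoding="UTF-8"?>\r\n'
            b'<voornamen>Aart Iza\xe4k</voornamen>\r\n')

# Lenient UTF-8 decoding turns the invalid byte into U+FFFD.
text = received.decode("utf-8", errors="replace")

# Write the result back out as UTF-16 (little-endian) and as UTF-8,
# each preceded by its BOM.
utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf8 = codecs.BOM_UTF8 + text.encode("utf-8")

# Locate the encoded replacement character in both outputs.
utf16_offset = utf16.find(b"\xfd\xff")
utf8_offset = utf8.find(b"\xef\xbf\xbd")
print(f"{utf16_offset:06X} {utf8_offset:06X}")  # prints "000078 00003E"
```

The offsets match the ones in the dumps: 000078 for FD FF in the UTF-16 output and 00003E for EF BF BD in the UTF-8 output.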
–jeroen
