A while ago (more than a year, in fact), I posted “Encoding is hard…” to G+ with the picture below.
ftfy (fixes text for you) fixes it, but:
How did the single quote become “â€™”?
Because of a common “beautification” done by many Office suites (Microsoft and Open alike), the single quote was a special one: the Unicode Character ‘RIGHT SINGLE QUOTATION MARK’ (U+2019), which in UTF-8 is encoded as the three bytes 0xE2 0x80 0x99.
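A minimal Python sketch to check that byte sequence yourself (the variable names are only for illustration):

```python
import unicodedata

# The "beautified" quote that Office suites substitute for a plain apostrophe.
quote = "\u2019"

# Encode the single code point to UTF-8 and show the resulting bytes.
encoded = quote.encode("utf-8")

print(unicodedata.name(quote))  # RIGHT SINGLE QUOTATION MARK
print(encoded.hex(" "))         # e2 80 99
```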
The three characters in “â€™” are these (a full text file with all Unicode code points is at http://unicode.org/Public/UNIDATA/UnicodeData.txt):
- â: Unicode Character ‘LATIN SMALL LETTER A WITH CIRCUMFLEX’ (U+00E2).
- €: Unicode Character ‘EURO SIGN’ (U+20AC).
- ™: Unicode Character ‘TRADE MARK SIGN’ (U+2122).
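If you do not want to search UnicodeData.txt by hand, Python's standard `unicodedata` module can look up the same names. A small sketch, where `garbled` and `names` are illustrative names:

```python
import unicodedata

# The three characters that showed up instead of the quote,
# written as escapes so the example survives any editor encoding.
garbled = "\u00e2\u20ac\u2122"

# Pair each character's code point (U+XXXX notation) with its Unicode name.
names = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in garbled]

for code_point, name in names:
    print(code_point, name)
```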
It becomes much clearer when you look at these characters in single-byte encodings: not the various ISO-8859 based ones, but the Windows code pages:
- â: 0xE2 in Windows-1250, Windows-1252, Windows-1254, Windows-1256 and Windows-1258.
- €: 0x80 in Windows-1250, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257 and Windows-1258.
- ™: 0x99 in Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257 and Windows-1258.
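The lists above can be reproduced with the code page tables that ship with Python. This is only a sketch: the helper name `code_pages_mapping` is made up for this post, and it assumes Python's `cp1250` through `cp1258` codecs match the Windows code pages.

```python
# Python codec names for the Windows code pages 1250 through 1258.
WINDOWS_CODE_PAGES = [f"cp{number}" for number in range(1250, 1259)]


def code_pages_mapping(byte_value, char):
    """Return the code pages in which the single byte decodes to char."""
    matches = []
    for code_page in WINDOWS_CODE_PAGES:
        try:
            decoded = bytes([byte_value]).decode(code_page)
        except UnicodeDecodeError:
            # The byte is undefined in this code page.
            continue
        if decoded == char:
            matches.append(code_page)
    return matches


for byte_value, char in [(0xE2, "\u00e2"), (0x80, "\u20ac"), (0x99, "\u2122")]:
    print(f"0x{byte_value:02X}", char, code_pages_mapping(byte_value, char))
```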
What most likely happened is that the ‘RIGHT SINGLE QUOTATION MARK’ was encoded as UTF-8, those bytes were then interpreted as a Windows code page, and the result was output (in this case on an Android screen, so most likely with another intermediate Unicode step).
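That chain is easy to imitate in Python. The sketch below assumes Windows-1252 as the wrongly applied code page (the real culprit could be any of the matching ones) and uses a made-up sample string:

```python
# Sample text containing the "beautified" quote (U+2019).
original = "It\u2019s"

# Step 1: the text is correctly encoded as UTF-8.
utf8_bytes = original.encode("utf-8")

# Step 2: somewhere the bytes are wrongly interpreted as Windows-1252.
garbled = utf8_bytes.decode("cp1252")

# Undoing the damage: back to the bytes, then decode as UTF-8 as intended.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

The repair works here only because every byte involved has a character in Windows-1252, so no information was lost along the way.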
My conclusion is that someone whose Windows was configured for one of the regions below had a development infrastructure that did not support all the round trips between Unicode and single-byte character sets:
- Windows-1250: Central Europe or Eastern Europe.
- Windows-1252: Western European languages (the default Latin alphabet).
- Windows-1254: Turkish.
- Windows-1256: Arabic.
- Windows-1258: Vietnamese.
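As a cross-check of this list, the whole three-byte sequence can be decoded with each Windows code page to see which ones produce exactly the three garbled characters. A sketch, with `pages_producing` as a hypothetical helper name and again assuming Python's `cp125x` codecs match Windows:

```python
# UTF-8 bytes of RIGHT SINGLE QUOTATION MARK and the garbled text seen on screen.
UTF8_BYTES = "\u2019".encode("utf-8")
TARGET = "\u00e2\u20ac\u2122"


def pages_producing(target, raw):
    """Return the Windows code pages that decode raw into exactly target."""
    found = []
    for number in range(1250, 1259):
        name = f"cp{number}"
        # errors="replace" turns undefined bytes into U+FFFD instead of raising.
        if raw.decode(name, errors="replace") == target:
            found.append(name)
    return found


print(pages_producing(TARGET, UTF8_BYTES))
```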
This shows how many regions can get into trouble when there is no proper end-to-end testing in place to catch these errors.
The UTF-8 Character Debug Tool can help big time here, but is a bit cryptic and only covers Windows-1252, hence my explanation above.
–jeroen
