Channel: Encoding – The Wiert Corner – irregular stream of stuff

Encoding is hard…so how did the single quote become a circumflexed a followed by Euro sign and trade mark?


A while ago (more than a year, in fact), I posted Encoding is hard… to G+ with the picture below.

ftfy (fixes text for you) fixes it, but:

How did the single quote become “â€™”?

Actually, because of a common “beautification” in many office suites (Microsoft and Open alike), the single quote was a special one: the Unicode character ‘RIGHT SINGLE QUOTATION MARK’ (U+2019), which UTF-8 encodes as the three bytes 0xE2 0x80 0x99.
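As a quick check of the byte sequence mentioned above, here is a minimal Python sketch (standard library only; the variable names are mine, not from the original post) that encodes U+2019 as UTF-8 and prints the resulting bytes:

```python
import unicodedata

# U+2019 is the "curly" apostrophe that office suites substitute for '
quote = "\u2019"
utf8_bytes = quote.encode("utf-8")

print(unicodedata.name(quote))                    # RIGHT SINGLE QUOTATION MARK
print(" ".join(f"0x{b:02X}" for b in utf8_bytes)) # 0xE2 0x80 0x99
```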

The characters in “â€™” are these (a full text file with all Unicode code points is at http://unicode.org/Public/UNIDATA/UnicodeData.txt):
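If you would rather not search UnicodeData.txt by hand, Python's `unicodedata` module exposes the same names. A small sketch (the `mojibake` and `rows` names are just for illustration) that lists the three characters in the garbled text:

```python
import unicodedata

# The three characters that showed up instead of the quote:
# a with circumflex, Euro sign, trade mark sign.
mojibake = "\u00e2\u20ac\u2122"

rows = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in mojibake]
for code_point, name in rows:
    print(code_point, name)
# U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
# U+20AC EURO SIGN
# U+2122 TRADE MARK SIGN
```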

It becomes much clearer when you look at a different encoding: not the various ISO-8859 based ones, but the Windows based code pages:
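To see the difference between the two families, this sketch decodes the same three bytes once as ISO-8859-1 and once as Windows-1252 (using Python's `iso-8859-1` and `cp1252` codecs as representatives of each family, which is my choice for illustration):

```python
utf8_bytes = b"\xe2\x80\x99"  # UTF-8 encoding of U+2019

as_latin1 = utf8_bytes.decode("iso-8859-1")
as_cp1252 = utf8_bytes.decode("cp1252")

# ISO-8859-1 maps 0x80 and 0x99 to invisible C1 control characters...
print([f"U+{ord(ch):04X}" for ch in as_latin1])  # ['U+00E2', 'U+0080', 'U+0099']
# ...while Windows-1252 maps them to the Euro sign and the trade mark sign.
print([f"U+{ord(ch):04X}" for ch in as_cp1252])  # ['U+00E2', 'U+20AC', 'U+2122']
```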

What most likely happened is that the ‘RIGHT SINGLE QUOTATION MARK’ was encoded as UTF-8, those bytes were then interpreted as a Windows code page, and the result was output as is (in this case on an Android screen, so most likely with another intermediate Unicode step).
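That chain of events can be reproduced in a few lines. This is only a sketch of the likely mechanism, assuming Windows-1252 as the code page; the sample word and variable names are made up for the example. It also shows that the damage is reversible as long as no bytes were lost along the way:

```python
original = "It\u2019s"  # "It's" with a curly apostrophe

# Step 1: the text is written out as UTF-8.
# Step 2: the bytes are wrongly read back as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # Itâ€™s

# Undoing the mistake: turn the characters back into the original bytes,
# then decode them with the encoding they were really written in.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```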

My conclusion is that someone with a Windows installation configured for one of the regions below had a development infrastructure that did not fully support all the round trips between Unicode and single-byte character sets:
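One way to narrow down the candidates yourself is to try every Windows code page and keep the ones that turn the three UTF-8 bytes into exactly “â€™”. The sketch below assumes the `cp1250` to `cp1258` codecs that ship with Python; it only identifies code pages, not the regions that use them:

```python
utf8_bytes = b"\xe2\x80\x99"       # UTF-8 encoding of U+2019
target = "\u00e2\u20ac\u2122"      # a-circumflex, Euro sign, trade mark sign

candidates = [f"cp{n}" for n in range(1250, 1259)]

matches = []
for codec in candidates:
    try:
        decoded = utf8_bytes.decode(codec)
    except UnicodeDecodeError:
        # Some code pages leave certain byte values undefined.
        continue
    if decoded == target:
        matches.append(codec)

print(matches)
```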

This shows how many regions can run into trouble when there is no proper end-to-end testing in place to catch these errors.
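What such an end-to-end test could look like, in its simplest form: push a handful of non-ASCII samples through the whole chain and check that they come out unchanged. The `pipeline` function here is a hypothetical stand-in for a real system (it just writes UTF-8 and reads it back with a configurable encoding), and the sample strings are made up:

```python
def pipeline(text, read_as="utf-8"):
    """Hypothetical pipeline: write text as UTF-8, read it back as `read_as`."""
    return text.encode("utf-8").decode(read_as)


def failing_samples(samples, read_as):
    """Return the samples that do not survive the pipeline unchanged."""
    return [s for s in samples if pipeline(s, read_as) != s]


samples = ["plain ASCII", "It\u2019s", "caf\u00e9", "\u20ac 5"]

print(failing_samples(samples, "utf-8"))   # [] : everything survives
print(failing_samples(samples, "cp1252"))  # every non-ASCII sample fails
```

Note that the plain ASCII sample passes in both cases, which is exactly why tests with ASCII-only data do not catch this class of bug.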

The UTF-8 Character Debug Tool can help a lot here, but it is a bit cryptic and only covers Windows-1252, hence my explanation above.

–jeroen

A single quote becoming “â€™”


Filed under: Development, Encoding, ISO-8859, ISO8859, Software Development, Unicode, UTF-8, UTF8, Windows-1252
