Encoding is hard…so how did the single quote become a circumflexed a followed by Euro sign and trade mark?

A while ago (more than a year, in fact), I posted Encoding is hard… to G+ with the picture below.

ftfy (fixes text for you) fixes it, but:

How did the single quote become “â€™”?

Actually, because of a common “beautification” feature in many office suites (Microsoft and Open alike), the single quote was a special one: the Unicode character ‘RIGHT SINGLE QUOTATION MARK’ (U+2019), which is encoded in UTF-8 as 0xE2 0x80 0x99.
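
A quick way to check those three bytes is a minimal Python sketch like the one below. The variable names are just for illustration; only the standard library is used.

```python
import unicodedata

# The "smart" quote that office suites substitute for a plain apostrophe.
quote = "\u2019"
quote_name = unicodedata.name(quote)

# Encode it as UTF-8 and format each byte as hex.
utf8_bytes = quote.encode("utf-8")
formatted = " ".join(f"0x{b:02X}" for b in utf8_bytes)

print(quote_name)  # RIGHT SINGLE QUOTATION MARK
print(formatted)   # 0xE2 0x80 0x99
```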

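The characters in “â€™” are these (a full text file with all Unicode code points is at http://unicode.org/Public/UNIDATA/UnicodeData.txt):

â: LATIN SMALL LETTER A WITH CIRCUMFLEX (U+00E2).
€: EURO SIGN (U+20AC).
™: TRADE MARK SIGN (U+2122).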
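
If you do not want to search UnicodeData.txt by hand, Python's `unicodedata` module carries the same names. This is only a small sketch; the `mojibake` and `names` variables are my own naming.

```python
import unicodedata

# The three garbled characters, written as escapes so the source file's
# own encoding cannot interfere.
mojibake = "\u00e2\u20ac\u2122"

# Pair each character's code point with its official Unicode name.
names = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in mojibake]

for code_point, name in names:
    print(code_point, name)
# U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
# U+20AC EURO SIGN
# U+2122 TRADE MARK SIGN
```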

But if you look at a different family of encodings, it becomes much clearer: not the various ISO-8859 based ones, but the Windows based code pages:

â: 0xE2 in Windows-1250, Windows-1252, Windows-1254, Windows-1256 and Windows-1258.
€: 0x80 in Windows-1250, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257 and Windows-1258.
™: 0x99 in Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257 and Windows-1258.
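
The lists above can be cross-checked with a short Python sketch that decodes the three UTF-8 bytes in each Windows code page and keeps the ones that produce exactly “â€™”. It assumes Python's built-in `cp1250` to `cp1258` codecs match the Windows code pages, which they are meant to do.

```python
# UTF-8 encoding of U+2019 RIGHT SINGLE QUOTATION MARK.
UTF8_BYTES = b"\xe2\x80\x99"

# The garbled text we are looking for: a-circumflex, euro sign, trade mark sign.
TARGET = "\u00e2\u20ac\u2122"

# Python codec names for Windows-1250 through Windows-1258.
CODE_PAGES = [f"cp{n}" for n in range(1250, 1259)]

# Decode the same bytes in every code page; undefined bytes become U+FFFD.
decoded = {cp: UTF8_BYTES.decode(cp, errors="replace") for cp in CODE_PAGES}

# Keep only the code pages where all three bytes map to the target characters.
matching = [cp for cp, text in decoded.items() if text == TARGET]

print(matching)  # ['cp1250', 'cp1252', 'cp1254', 'cp1256', 'cp1258']
```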

What most likely happened is that the ‘RIGHT SINGLE QUOTATION MARK’ got encoded as UTF-8, those bytes were then interpreted as a Windows code page, and the result was output (in this case on an Android screen, so most likely with another intermediate Unicode step).
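
That chain is easy to reproduce. The sketch below assumes Windows-1252 as the misapplied code page (any of the five matching ones would give the same result for this character) and uses a made-up sample word; it also shows that reversing the two steps repairs the text, as long as nothing was lost in between.

```python
# Sample text containing the "smart" quote (hypothetical example word).
original = "don\u2019t"

# Step 1 and 2: encode as UTF-8, then wrongly decode as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")

# Reversing the mistake: re-encode as Windows-1252, decode as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)               # donâ€™t
print(repaired == original)  # True
```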

My conclusion is that someone with a Windows system configured for one of the regions below did not have a development infrastructure that fully supports all the round trips between Unicode and single-byte character sets:

Windows-1250: Central Europe or Eastern Europe.
Windows-1252: region using the default Latin Alphabet.
Windows-1254: Turkish.
Windows-1256: Arabic.
Windows-1258: Vietnamese.

This shows how many regions can get into trouble when proper end-to-end testing is not in place to catch these errors.

The UTF-8 Character Debug Tool can help big time here, but is a bit cryptic and only covers Windows-1252, hence my explanation above.

–jeroen

A single quote becoming “â€™”

