Channel: Encoding – The Wiert Corner – irregular stream of stuff

.NET/C# – converting UTF8 to ASCII (yes, you *can* lose information with this) using System.Text.Encoding


Quite a while ago, we needed to exchange some text files between a .NET application and a DOS application (yes, they still exist!).

Since the .NET app already exchanged text files with other applications using UTF8, we had to reencode those into plain ASCII.
(Yes, I am aware there are dozens of codepages we could have encoded to; we decided to stick with 7-bit ASCII and warned the client about possible information loss.)
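
To make that information loss concrete, here is a minimal sketch (the class and method names are my own, purely for illustration). It relies on the default behaviour of Encoding.ASCII, which replaces every character outside the 7-bit range with a question mark:

```csharp
using System;
using System.Text;

public static class AsciiLossDemo
{
    // Hypothetical helper: encodes the text as 7-bit ASCII and decodes it again,
    // so whatever got lost during encoding becomes visible in the result.
    public static string RoundTripThroughAscii(string text)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(asciiBytes);
    }

    public static void Main()
    {
        // \u00E9 is e-acute and \u20AC is the euro sign; neither exists in ASCII.
        Console.WriteLine(RoundTripThroughAscii("caf\u00E9 costs 2 \u20AC"));
        // prints: caf? costs 2 ?
    }
}
```

Once a character has become a question mark, there is no way to get the original back from the ASCII file, which is why we warned the client up front.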

A couple of months later, we needed to exchange information with an app using Windows-1252, and later still with a web app needing ISO 8859-1 (both are Western European encodings).
So I decided to refactor the UTF8 to ASCII conversion app into something more maintainable.

But first let me show you how you can dump all of the .NET supported encodings:

using System;
using System.Text;

namespace bo.ShowEncodings
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("System.Text.Encoding.GetEncodings();");
            EncodingInfo[] encodings = Encoding.GetEncodings();
            foreach (EncodingInfo encodingInfo in encodings)
            {
                Console.WriteLine("CodePage=0x{0:X4}({0}), Name={1}, DisplayName={2};",
                    encodingInfo.CodePage, encodingInfo.Name, encodingInfo.DisplayName);
            }
        }
    }
}

So each encoding has both a CodePage (which is sometimes easy to remember, like 1252) and a name (which is sometimes easier to remember, like iso-8859-1).
And then there are some encodings that are used so often that they have dedicated static properties on the Encoding class, such as Encoding.ASCII and Encoding.UTF8.
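
As a small sketch of how the number and the name relate (class and method names are made up for this example): looking up ISO 8859-1 by its code page or by its name should give you the same encoding, and the static properties have code pages of their own too.

```csharp
using System;
using System.Text;

public static class EncodingLookupDemo
{
    // Hypothetical helper: resolves an encoding name to its code page number.
    public static int CodePageOf(string name)
    {
        return Encoding.GetEncoding(name).CodePage;
    }

    // Hypothetical helper: resolves a code page number to its registered name.
    public static string NameOf(int codePage)
    {
        return Encoding.GetEncoding(codePage).WebName;
    }

    public static void Main()
    {
        Console.WriteLine(CodePageOf("iso-8859-1"));   // 28591
        Console.WriteLine(NameOf(28591));              // iso-8859-1
        Console.WriteLine(Encoding.ASCII.CodePage);    // 20127
        Console.WriteLine(Encoding.UTF8.CodePage);     // 65001
        // Note: on .NET Core and later, code pages such as 1252 are only available
        // after registering CodePagesEncodingProvider.Instance (from the
        // System.Text.Encoding.CodePages package) with Encoding.RegisterProvider.
        // On the classic .NET Framework they work out of the box.
    }
}
```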

Let’s move on to the UTF8 to ASCII encoding program.

using System;
using System.IO;
using bo.Text;

namespace bo.utf82ascii
{
    class MainProgram
    {
        [STAThread]
        static void Main(string[] args)
        {
            Stream input;
            Stream output;

            if (args.Length > 0)
                input = new FileStream(args[0], FileMode.Open, FileAccess.Read);
            else
                input = Console.OpenStandardInput();

            if (args.Length > 1)
                // FileMode.Create truncates an existing file, so no stale bytes remain after the new content
                output = new FileStream(args[1], FileMode.Create, FileAccess.Write);
            else
                output = Console.OpenStandardOutput();

            Utf8ToAsciiProcessor newUtf8ToAsciiProcessor = new Utf8ToAsciiProcessor();
            newUtf8ToAsciiProcessor.Process(input, output);
        }
    }
}

So, basically the Utf8ToAsciiProcessor does all the work, and that is what I refactored into a few smaller classes.
Let's go top down and dig deeper until we get to the core class.

First of all, a few processors that show the basic ways to obtain the correct encodings:

using System;
using System.Text;

namespace bo.Text
{
    public class Utf8ToAsciiProcessor : Utf8ToDefaultProcessor
    {
        public override Encoding GetOutputEncoding()
        {
            return Encoding.ASCII; // built-in construction of the Encoding
        }
    }

    public class Utf8ToWindows1252Processor : Utf8ToDefaultProcessor
    {
        public override Encoding GetOutputEncoding()
        {
            return Encoding.GetEncoding(1252);  // construction by CodePage number
        }
    }

    public class Utf8ToIso8859_1Processor : Utf8ToDefaultProcessor
    {

        public override Encoding GetOutputEncoding()
        {
            return Encoding.GetEncoding("iso-8859-1"); // construction by encoding name
        }

    }

}
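
All three encodings above silently replace characters they cannot represent. If you would rather detect the loss than accept it, one option is to ask for an encoding with an exception fallback. The sketch below is not part of the original app (the class and method names are hypothetical); it only shows that Encoding.GetEncoding has an overload taking fallback objects:

```csharp
using System;
using System.Text;

public static class StrictAsciiDemo
{
    // Hypothetical helper: returns true when the text survives ASCII encoding intact.
    public static bool CanEncodeAsAscii(string text)
    {
        // An ASCII encoding that throws instead of substituting '?'.
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strictAscii.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // At least one character has no ASCII representation.
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CanEncodeAsAscii("plain text"));  // True
        Console.WriteLine(CanEncodeAsAscii("caf\u00E9"));   // False
    }
}
```

A processor whose GetOutputEncoding returned such a strict encoding would presumably fail on the first problematic line instead of writing question marks; whether that is preferable depends on what the receiving application can cope with.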

All of these convert from UTF8 to another encoding, so they share the Utf8ToDefaultProcessor intermediate class:

using System;
using System.Text;

namespace bo.Text
{
    public class Utf8ToDefaultProcessor: DefaultToDefaultProcessor
    {
        public override Encoding GetInputEncoding()
        {
            return Encoding.UTF8;
        }
    }
}

Then most of the logic is in the DefaultToDefaultProcessor base class, which, besides the virtual methods GetInputEncoding and GetOutputEncoding, has some more virtual methods (such as ProcessLine, CreateStreamReader and CreateStreamWriter) that you can override to refine the behaviour even further.
The crux is in the creation of the StreamReader and StreamWriter: that is where the encodings are inserted.

using System;
using System.IO;
using System.Text;

namespace bo.Text
{
    public class DefaultToDefaultProcessor
    {
        public virtual void Process(Stream inputStream, Stream outputStream)
        {
            StreamReader inputStreamReader = CreateStreamReader(inputStream);
            StreamWriter outputStreamWriter = CreateStreamWriter(outputStream);
            Process(inputStreamReader, outputStreamWriter);
            outputStreamWriter.Flush();
        }

        public virtual void Process(StreamReader input, StreamWriter output)
        {
            string inputLine;
            inputLine = input.ReadLine();
            while (inputLine != null)
            {
                string outputLine = ProcessLine(inputLine);
                output.WriteLine(outputLine);
                inputLine = input.ReadLine();
            }
        }

        public virtual string ProcessLine(string inputLine)
        {
            return inputLine;
        }

        public virtual StreamReader CreateStreamReader(Stream input)
        {
            return new StreamReader(input, GetInputEncoding());
        }

        public virtual StreamWriter CreateStreamWriter(Stream output)
        {
            return new StreamWriter(output, GetOutputEncoding());
        }

        public virtual Encoding GetInputEncoding()
        {
            return Encoding.Default;
        }

        public virtual Encoding GetOutputEncoding()
        {
            return Encoding.Default;
        }
    }
}
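
To see the reader/writer pair at work without touching any files, here is a self-contained sketch that runs the same kind of line-by-line loop over in-memory streams. The class and method names are assumptions for this example only, not part of the bo.Text classes:

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReencodeDemo
{
    // Reads lines from input using inputEncoding and writes them to output
    // using outputEncoding, mirroring the line-by-line loop shown above.
    public static void Reencode(Stream input, Encoding inputEncoding,
                                Stream output, Encoding outputEncoding)
    {
        StreamReader reader = new StreamReader(input, inputEncoding);
        StreamWriter writer = new StreamWriter(output, outputEncoding);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            writer.WriteLine(line);
        }
        writer.Flush(); // push buffered text into the stream; the caller owns the streams
    }

    // Hypothetical convenience wrapper: UTF8 bytes in, ISO 8859-1 bytes out.
    public static byte[] Utf8ToLatin1(byte[] utf8Bytes)
    {
        using (MemoryStream input = new MemoryStream(utf8Bytes))
        using (MemoryStream output = new MemoryStream())
        {
            Reencode(input, Encoding.UTF8, output, Encoding.GetEncoding("iso-8859-1"));
            return output.ToArray();
        }
    }

    public static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("caf\u00E9"); // e-acute takes two bytes in UTF8
        byte[] latin1 = Utf8ToLatin1(utf8);
        Console.WriteLine("UTF8 bytes: {0}", utf8.Length);
        // The output also contains the line terminator added by WriteLine.
        Console.WriteLine("ISO 8859-1 bytes: {0}", latin1.Length);
        Console.WriteLine("Fourth byte: 0x{0:X2}", latin1[3]); // 0xE9, e-acute as a single byte
    }
}
```

One consequence of the line-based approach is that every line gets the platform's line terminator on output, and the last line gets one even if the input did not end with a newline.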

With that, the core is set.
Another refactoring has been done in order to create these really simple looking console apps:

using System;
using bo.Text;

namespace bo.Utf82Windows1252
{
    class Program : BaseReencoder
    {
        public override DefaultToDefaultProcessor CreateProcessor()
        {
            return new Utf8ToWindows1252Processor();
        }

        [STAThread]
        static void Main(string[] args)
        {
            Program program = new Program();
            program.Run(args);
        }
    }
}

 

using System;
using bo.Text;

namespace bo.Utf82Iso8859_1
{
    class Program : BaseReencoder
    {
        public override DefaultToDefaultProcessor CreateProcessor()
        {
            return new Utf8ToIso8859_1Processor();
        }

        [STAThread]
        static void Main(string[] args)
        {
            Program program = new Program();
            program.Run(args);
        }
    }
}

Both depend on the abstract BaseReencoder class, into which the core logic of the bo.utf82ascii program has been refactored:

using System;
using System.IO;

namespace bo.Text
{
    public abstract class BaseReencoder
    {
        public abstract DefaultToDefaultProcessor CreateProcessor();

        public virtual void Run(string[] args)
        {
            Stream input;
            Stream output;

            if (args.Length > 0)
                input = new FileStream(args[0], FileMode.Open, FileAccess.Read);
            else
                input = Console.OpenStandardInput();

            if (args.Length > 1)
                // FileMode.Create truncates an existing file, so no stale bytes remain after the new content
                output = new FileStream(args[1], FileMode.Create, FileAccess.Write);
            else
                output = Console.OpenStandardOutput();

            DefaultToDefaultProcessor processor = CreateProcessor();
            processor.Process(input, output);
        }
    }
}

So with all these tiny examples, you have a sound foundation for any text reencoding you want to perform.


Posted in .NET, ASCII, C#, Development, Encoding, Software Development, Unicode, UTF-8, UTF8
