.NET Converting HTML to Plain Text in C#
Thursday 22nd December 2016
I've got a requirement to process emails automatically and insert them into a legacy system. The system, being of questionable quality, only supports plain-text emails, and since the rise of stupid big company logos being included in emails, people are increasingly sending their email as HTML.
Some sort of conversion is in order. Seems like a simple requirement to start off with - just extract the text. Right?
The next solution would be to parse the HTML (or SGML) and have your system understand it. There are libraries that do this for various languages, and two popular approaches for .NET. The first, more "out-of-the-box" solution is to harness mshtml.dll - Internet Explorer's Trident engine. This will render the document and you should be able to access that output (without actually spawning IE). The first hurdle is that you have to interop with COM, which is made easier with Primary Interop Assemblies but is still a little messy to develop and, more importantly, to distribute.
The remaining popular solution is the Html Agility Pack. This is a handy library that allows you to use HTML documents in the same way as you would an XmlDocument. It's more forgiving than the XML parser in .NET and will load badly formed HTML and deal with non-XHTML syntax. There are similar libraries for other languages/frameworks, and they all work great for targeting specific bits of data on a web page. But for bulk conversion of text? That's tricky: whilst you can get your "//text()" nodes, you have to reassemble them into something useful.
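To show what that reassembly problem looks like, here is a minimal sketch using the Html Agility Pack. The class and method names (HtmlTextExtractor, ExtractText) are made up for illustration, and joining the pieces with newlines is just one naive choice, not a recommended approach.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class HtmlTextExtractor
{
    // Hypothetical helper: gathers every text node, decodes entities,
    // and joins the non-empty pieces with newlines.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//text()");
        if (nodes == null) return string.Empty;

        var pieces = new List<string>();
        foreach (var node in nodes)
        {
            // Skip the contents of <script> and <style>, which are not readable text.
            string parent = node.ParentNode.Name;
            if (parent == "script" || parent == "style") continue;

            // Turn entities such as &amp; back into characters and trim whitespace.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0) pieces.Add(text);
        }

        return string.Join(Environment.NewLine, pieces);
    }
}
```

Because every text node becomes its own line, an inline element such as the <b> in "Hello <b>world</b>" ends up split from the sentence around it. Getting that right means tracking which elements are block-level, which is exactly the reassembly work mentioned above.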
My solution to the problem is slightly different and has its own drawbacks: use the HtmlUtilities.ConvertToText() method, which "parses the HTML-formatted data, no scripts are run and no secondary downloads occur". Ideal, except that it's part of the WinRT API - geared towards Windows Store apps, not desktop (or server) applications.
Fortunately, whilst the WinRT API is for Windows Store apps, Microsoft does support it in desktop applications; you just have to jump through a few hoops to get there. This blog by Andrei Marukovich covers how. In summary:
You have to edit your *.csproj (or whatever proj) file in a text editor and add the line <TargetPlatformVersion>8.0</TargetPlatformVersion> to the top/main PropertyGroup node. You will then be able to add a reference to the "Windows" assembly in a new tab on the Add References dialogue. You will also need to reference C:/Program Files (x86)/Reference Assemblies/Microsoft/Framework/.NETCore/v4.5/System.Runtime.WindowsRuntime.dll.
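As a rough illustration of the project-file edit, the fragment below shows where the new line sits. MSBuild's element is called PropertyGroup; the comment stands in for whatever properties your project already has, which are left unchanged.

```xml
<PropertyGroup>
  <!-- Existing properties (OutputType, TargetFrameworkVersion, etc.) stay as they are. -->
  <TargetPlatformVersion>8.0</TargetPlatformVersion>
</PropertyGroup>
```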
You can then simply call Windows.Data.Html.HtmlUtilities.ConvertToText() and receive a text view of your HTML. Formatting such as paragraphs is preserved. It handles tables too, but once they're collapsed, adjacent columns are stuck together with no spaces, which can be confusing.
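A minimal sketch of the call, assuming the project has been set up with the references described above and is running on Windows 8 or later. The sample HTML string is made up for illustration.

```csharp
using System;
using Windows.Data.Html; // available once the "Windows" assembly is referenced

class Program
{
    static void Main()
    {
        // Made-up sample input; in practice this would be the HTML body of an email.
        string html = "<p>Hello <b>world</b></p><p>Second paragraph</p>";

        // ConvertToText takes the HTML as a string and returns the plain-text rendering.
        string text = HtmlUtilities.ConvertToText(html);

        Console.WriteLine(text);
    }
}
```

If the emails contain tables, it may be worth checking the output for columns that have been run together, as noted above.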