Converting RTF to HTML using C#

Creating a C# class that cleanly converts RTF clinical reports to HTML, whitelists valid HTML tags, and removes unwanted empty paragraphs.

Infographic depicting the clean conversion of RTF to HTML

Introduction

I’m developing a downtime solution that allows healthcare facilities to continue giving patient care when the main hospital ERP/EHR is unavailable, due to either planned or unplanned downtime.

To ensure the application runs as intended, I need to normalise all medical data & reports to a standard HTML format. Many clinical systems still output care plans & power chart forms as text formatted as RTF (Rich Text Format). In this article, we’ll write a C# class that handles the clean conversion of RTF to HTML.

Getting started

Before starting, you’ll need to install the following NuGet packages, as we’ll leverage their capabilities in our solution.

dotnet add package HtmlAgilityPack --version 1.12.4
dotnet add package HtmlSanitizer --version 9.1.949-beta
dotnet add package RtfPipe --version 2.0.7677.4303

Creating the class

Within your project, add a new class file called Bradley.Software.RTF.Processing.cs and create the stub of the class as shown below:

using HtmlAgilityPack;
using HtmlSanitizer = Ganss.Xss.HtmlSanitizer;
using RtfPipe;
using System.Text;

namespace Bradley.Software.RTF.Processing;

public class Bradley_RFT_to_HTML_Processor
{
    private readonly HtmlDocument _doc;

    public Bradley_RFT_to_HTML_Processor(string html)
    {
        if (string.IsNullOrWhiteSpace(html))
            throw new ArgumentException(
                "HTML cannot be null or empty", nameof(html));
        _doc = new HtmlDocument();
        _doc.LoadHtml(html);
    }

    public override string ToString() => _doc.DocumentNode.OuterHtml;
}

Next, we need to add a method that ingests the raw RTF content and converts it to HTML. We register the legacy code-page encoding provider first, because many RTF files declare an ANSI code page such as Windows-1252 to represent extended characters. With the provider registered, RtfPipe can decode that content correctly.

We also discard any text after the last } character. RTF uses curly braces to delimit groups, and the whole document is itself one group: it starts with something like {\rtf1\fbidis\ansi ... so the last } should be the closing brace of the document. Anything after it cannot be part of the document and is discarded.

    public static Bradley_RFT_to_HTML_Processor RTF_Raw_String(string rtfData)
    {
        if (string.IsNullOrWhiteSpace(rtfData))
            throw new ArgumentException(
                "RTF data cannot be null or empty", nameof(rtfData));

        // RTF content commonly declares an ANSI code page (for example
        // cp1252). Register the legacy code-page encodings so RtfPipe can
        // decode extended characters. The input is already a .NET string,
        // so it is not re-encoded here: round-tripping it through
        // Encoding.Default (UTF-8 on .NET Core and later) and cp1252 would
        // corrupt any non-ASCII characters.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Discard everything to the right of the last }
        var lastClosingBraceIndex = rtfData.LastIndexOf('}');
        if (lastClosingBraceIndex >= 0)
            rtfData = rtfData.Substring(0, lastClosingBraceIndex + 1);

        var html = Rtf.ToHtml(rtfData);
#if DEBUG
        Console.WriteLine("Converted HTML:\n" + html);
#endif
        return new Bradley_RFT_to_HTML_Processor(html);
    }
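The two preparation steps above can be tried in isolation. The following is a minimal sketch, assuming a .NET 6+ console project with top-level statements; the sample RTF string and its trailer text are made up for illustration. It shows that the code-page provider makes Windows-1252 available, and that trimming at the last } removes stray trailing text.

```csharp
using System;
using System.Text;

// Without this call, Encoding.GetEncoding(1252) throws NotSupportedException
// on .NET Core / .NET 5+, where code-page encodings are not built in.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Byte 0x93 is a left double quotation mark (U+201C) in Windows-1252.
var cp1252 = Encoding.GetEncoding(1252);
var decoded = cp1252.GetString(new byte[] { 0x93 });
Console.WriteLine(((int)decoded[0]).ToString("X4")); // 201C

// Hypothetical input: one complete RTF group followed by stray trailer text.
var sample = @"{\rtf1\ansi caf\'e9}" + "\r\nEND OF REPORT";
var lastBrace = sample.LastIndexOf('}');
var trimmed = lastBrace >= 0 ? sample[..(lastBrace + 1)] : sample;
Console.WriteLine(trimmed); // {\rtf1\ansi caf\'e9}
```

Note that the \'e9 escape stays as plain ASCII in the string; it is the RTF parser that later decodes it using the declared code page.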

Using the class

To use the class from your main program, you need to include the namespace with a using statement before calling the function as shown below. As you can see, we chain the functions together so that we convert the raw RTF string first, then sanitise, add a CSS class to all tables, and finally remove any empty paragraphs.

using Bradley.Software.RTF.Processing;

...

var html = Bradley_RFT_to_HTML_Processor
    .RTF_Raw_String(rtf)
    .Sanitize()
    .AddTableClass("table")
    .RemoveEmptyParagraphs()
    .ToString();
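The Sanitize, AddTableClass and RemoveEmptyParagraphs steps are not listed in this section, so here is one way they might look. This is a rough sketch under stated assumptions, not the article's implementation: the HtmlCleanupSketch class, the tag whitelist, and the definition of an "empty" paragraph (whitespace-only text and no child elements other than br) are all hypothetical. The methods work on HTML strings so the block stands alone; inside the processor class each would operate on _doc and return this to allow chaining.

```csharp
using System.Linq;
using HtmlAgilityPack;
using HtmlSanitizer = Ganss.Xss.HtmlSanitizer;

// Hypothetical helper; method names mirror the chained calls above.
public static class HtmlCleanupSketch
{
    // Assumed whitelist: adjust to the tags your reports actually need.
    private static readonly string[] AllowedTags =
    {
        "div", "p", "br", "strong", "em", "u", "ul", "ol", "li",
        "table", "thead", "tbody", "tr", "th", "td"
    };

    public static string Sanitize(string html)
    {
        var sanitizer = new HtmlSanitizer();
        // Replace the library's default tag list with our own whitelist.
        sanitizer.AllowedTags.Clear();
        foreach (var tag in AllowedTags)
            sanitizer.AllowedTags.Add(tag);
        return sanitizer.Sanitize(html);
    }

    public static string AddTableClass(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null when nothing matches.
        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables != null)
            foreach (var table in tables)
                table.AddClass(cssClass);
        return doc.DocumentNode.OuterHtml;
    }

    public static string RemoveEmptyParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            // Iterate over a copy so removals cannot disturb the loop.
            foreach (var p in paragraphs.ToList())
            {
                // DeEntitize turns &nbsp; into U+00A0, which counts as whitespace.
                var text = HtmlEntity.DeEntitize(p.InnerText);
                var hasContentElement = p.ChildNodes.Any(n =>
                    n.NodeType == HtmlNodeType.Element && n.Name != "br");
                if (string.IsNullOrWhiteSpace(text) && !hasContentElement)
                    p.Remove();
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

With this sketch, HtmlCleanupSketch.RemoveEmptyParagraphs("<div><p>&nbsp;</p><p>Plan</p></div>") should leave only the second paragraph inside the div. Running AddTableClass after Sanitize, as the chain above does, avoids having to check whether the sanitizer permits the class attribute.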

Next steps

The next step is to encapsulate this functionality in an API that can run as a serverless function on AWS Lambda.