Optimizing & Reducing Clinical Database Storage Requirements

Leveraging Golang to minimize storage costs by identifying and removing legacy HTML inefficiencies -- #TagSoup #Slop

Using Golang to remove HTML tag soup / slop

Table of Contents

Introduction

Reducing infrastructure costs should be a priority for everyone involved, including developers, managers, and the executive team. All parties should take storage and retention costs into account.

I recently reviewed and enhanced the storage requirements for an extensive clinical database containing over sixty million clinical letters in a Microsoft SQL Server database. By reducing the size of the database, we improved both backup and restore times, as there was less data to process. Before the reduction, backups took over twenty-four hours to complete, and we never attempted test restores due to the estimated five-day restoration time.

Each clinic letter was stored as an HTML file in the database using a VARBINARY(MAX) data type. On further inspection, I found that the HTML was of poor quality. Each HTML letter had the hospital logo embedded as a base64 encoded JPEG.

Storing the logo this way was very wasteful and provided the most significant opportunity for improvement. The system had saved the logo sixty million times instead of being stored once on a CDN (Content Delivery Network) and linked to from the letter using a standard IMG tag.

Other issues found with the stored HTML content included:

In this article, we will demonstrate, using the GoLang programming language, how we reduced the average letter size from 215,215 to 2096 bytes, saving 12.78 Terabytes on the whole database.

Getting started

The code samples below use Go version 1.24, if you would like help to adapt the code for your use case then please get in touch. Alternatively, you can grab the code from here. The code makes use of several packages; goquery to parse and interact with the HTML, Minify to remove unneeded whitespace & comments, and brotli to perform the final compression.

 1func main() {
 2	app := Application{}
 3	app.loadHTMLFile()
 4	app.parseHTML()
 5	app.replaceEmbeddedLogo()
 6	app.removeStyleAttributes()
 7	app.removeDuplicateMetaTags()
 8	app.removeEmptyParagraphTags()
 9	app.minifyHTML()
10	app.addDoctype()
11	app.writeHTMLOutput()
12	app.compressHTML()
13}

Let’s look at some of the above functions in more detail.

Replacing the embedded logo with a static reference

Every HTML clinic letter had the hospital logo embedded into the file using base64 encoding. As you can see from the code snippet shown below, the IMG tag also had unnecessary in-lined style tags.

However, the biggest saving was to replace this image reference with a static reference out to a ‘JPG’ file on a CDN.

1<img
2  width="718" height="89"
3  id="template_header"
4  style="margin: 0px; padding: 0px; border: 0px currentColor; border-image: none;"
5  src="data:image/jpg;base64,VBORw0KGgoAAAANSUhEUgAAFKAAAAKUCAIAAAAS9hOhAAAA ... ">

To achieve replacing the image, we first find the first image reference in the HTML and replace the HTML code with a simple IMG tag that loads the JPG file on demand. We also give the image the a shorter id of logo so that it can be further styled via a central CSS file.

1func (app *Application) replaceEmbeddedLogo() {
2	app.htmlDocument.Find (
3        "img[src^='data:image/jpg;']").First().ReplaceWithHtml (
4        `<img id="logo" src="https://nhs.io/logo.jpg" width="718" height="89" >`)
5}

Removing in-line styles

Nearly every paragraph and div within the letter had in-line styles, which when extrapolated over sixty million documents amounted to a lot of wasted space. These in-line styles should be moved to a central, external CSS file.

1<div id="template_department"
2     style="color: rgb(0, 114, 188); font-weight: bold;">
3        CardioThoracic Services
4</div>

As such, it was safe to remove all the in-line styles like:

1func (app *Application) removeStyleAttributes() {
2    app.htmlDocument.Find("*").Each(
3        func(i int, s *goquery.Selection) {
4        s.RemoveAttr("style")
5    })
6}

Removing duplicate meta tags

The save feature of the system must have been broken, as it would add an additional http-equiv tag on each save. Even worse, it was adding them in the wrong place. It was adding them to the body section instead of the head section (which already had the reference).

 1<html>
 2    <head>
 3        <meta http-equiv="X-UA-Compatible" content="IE=edge" />
 4        <meta charset="utf-8" />
 5    </head>
 6    <body>
 7        <meta http-equiv="X-UA-Compatible" content="IE=edge">
 8        <meta http-equiv="X-UA-Compatible" content="IE=edge">
 9        ...
10        ...
11        <meta http-equiv="X-UA-Compatible" content="IE=edge">
12        <!-- BEGIN STANDARD HEADER - DO NOT MODIFY -->

I therefore made the decision to remove all the tags as they weren’t needed for the visual rendering of the letters.

1func (app *Application) removeDuplicateMetaTags() {
2    app.htmlDocument.Find("meta[http-equiv]").Remove()
3}

Removing empty paragraphs

Each letter had multiple paragraphs that where essentially empty.

1<p style="margin: 0px;">&nbsp;</p>

So we remove them like:

1func (app *Application) removeEmptyParagraphTags() {
2    app.htmlDocument.Find("p").Each(
3        func(i int, s *goquery.Selection) {
4            if strings.TrimSpace(s.Text()) == "" {
5                s.Remove()
6            }
7        },
8    )
9}

Minifying the HTML

The final step before compressing the HTML file with Google’s brotli compression algorithm is to minify the HTML. Minification is the process of removing unnecessary or redundant data without affecting how a resource is processed by the browser. In our case, this was removing all the whitespace and embedded developer comments.

 1func (app *Application) minifyHTML() {
 2    var html string
 3
 4    app.configureHTMLMinifier()
 5
 6    html, app.errorEncountered = app.htmlDocument.Html()
 7    if app.errorEncountered != nil {
 8        panic(app.errorEncountered.Error())
 9    }
10
11    // remove all whitespace and developer comments
12    app.compressedHTML, app.errorEncountered =
13        app.htmlMinifier.String("text/html", html)
14
15    if app.errorEncountered != nil {
16        panic(app.errorEncountered.Error())
17    }
18
19    // remove invalid control code u+00A0
20    app.compressedHTML = strings.ReplaceAll(app.compressedHTML, "\u00A0", "")
21
22    log.Println("Smushed file size :", len(app.compressedHTML), "bytes")
23}

Grab the example code from GitHub.