Optimizing & Reducing Clinical Database Storage Requirements
Leveraging Golang to minimize storage costs by identifying and removing legacy HTML inefficiencies -- #TagSoup #Slop
Table of Contents
Introduction
Reducing infrastructure costs should be a priority for everyone involved, including developers, managers, and the executive team. All parties should take storage and retention costs into account.
I recently reviewed and enhanced the storage requirements for an extensive clinical database containing over sixty million clinical letters in a Microsoft SQL Server database. By reducing the size of the database, we improved both backup and restore times, as there was less data to process. Before the reduction, backups took over twenty-four hours to complete, and we never attempted test restores due to the estimated five-day restoration time.
Each clinic letter was stored as an HTML
file in the database using a VARBINARY(MAX)
data type.
On further inspection, I found that the HTML
was of poor quality.
Each HTML
letter had the hospital logo embedded as a base64 encoded JPEG
.
Storing the logo this way was very wasteful and provided the most significant opportunity for improvement.
The system had saved the logo sixty million times instead of being stored once on a CDN
(Content Delivery Network)
and linked to from the letter using a standard IMG
tag.
Other issues found with the stored HTML content included:
- Lots of tag soup – a pejorative term for HTML written that is syntactically or structurally incorrect. For example, the HTML contained many in-lined styles, adding bloat.
- Empty paragraphs tags.
- Invalid control characters.
- Embedded boiler plate developer comments.
In this article, we will demonstrate, using the GoLang programming language, how we reduced the average letter size from 215,215 to 2096 bytes, saving 12.78 Terabytes on the whole database.
Getting started
The code samples below use Go version 1.24, if you would like help to adapt the code for your use case then please get in touch. Alternatively, you can grab the code from here. The code makes use of several packages; goquery to parse and interact with the HTML
, Minify to remove unneeded whitespace & comments, and brotli to perform the final compression.
1func main() {
2 app := Application{}
3 app.loadHTMLFile()
4 app.parseHTML()
5 app.replaceEmbeddedLogo()
6 app.removeStyleAttributes()
7 app.removeDuplicateMetaTags()
8 app.removeEmptyParagraphTags()
9 app.minifyHTML()
10 app.addDoctype()
11 app.writeHTMLOutput()
12 app.compressHTML()
13}
Let’s look at some of the above functions in more detail.
Replacing the embedded logo with a static reference
Every HTML
clinic letter had the hospital logo embedded into the file using base64 encoding.
As you can see from the code snippet shown below, the IMG
tag also had unnecessary in-lined style tags.
However, the biggest saving was to replace this image reference with a static reference out to a ‘JPG’ file on a CDN
.
1<img
2 width="718" height="89"
3 id="template_header"
4 style="margin: 0px; padding: 0px; border: 0px currentColor; border-image: none;"
5 src=" ... ">
To achieve replacing the image, we first find the first image reference in the HTML
and replace the HTML
code with a simple IMG
tag that loads the JPG
file on demand. We also give the image the a shorter id
of logo
so that it can be further styled via a central CSS
file.
1func (app *Application) replaceEmbeddedLogo() {
2 app.htmlDocument.Find (
3 "img[src^='data:image/jpg;']").First().ReplaceWithHtml (
4 `<img id="logo" src="https://nhs.io/logo.jpg" width="718" height="89" >`)
5}
Removing in-line styles
Nearly every paragraph and div
within the letter had in-line styles, which when extrapolated over sixty million documents amounted to a lot of wasted space. These in-line styles should be moved to a central, external CSS
file.
1<div id="template_department"
2 style="color: rgb(0, 114, 188); font-weight: bold;">
3 CardioThoracic Services
4</div>
As such, it was safe to remove all the in-line styles like:
1func (app *Application) removeStyleAttributes() {
2 app.htmlDocument.Find("*").Each(
3 func(i int, s *goquery.Selection) {
4 s.RemoveAttr("style")
5 })
6}
Removing duplicate meta tags
The save feature of the system must have been broken, as it would add an additional http-equiv
tag on each save.
Even worse, it was adding them in the wrong place.
It was adding them to the body
section instead of the head
section (which already had the reference).
1<html>
2 <head>
3 <meta http-equiv="X-UA-Compatible" content="IE=edge" />
4 <meta charset="utf-8" />
5 </head>
6 <body>
7 <meta http-equiv="X-UA-Compatible" content="IE=edge">
8 <meta http-equiv="X-UA-Compatible" content="IE=edge">
9 ...
10 ...
11 <meta http-equiv="X-UA-Compatible" content="IE=edge">
12 <!-- BEGIN STANDARD HEADER - DO NOT MODIFY -->
I therefore made the decision to remove all the tags as they weren’t needed for the visual rendering of the letters.
1func (app *Application) removeDuplicateMetaTags() {
2 app.htmlDocument.Find("meta[http-equiv]").Remove()
3}
Removing empty paragraphs
Each letter had multiple paragraphs that where essentially empty.
1<p style="margin: 0px;"> </p>
So we remove them like:
1func (app *Application) removeEmptyParagraphTags() {
2 app.htmlDocument.Find("p").Each(
3 func(i int, s *goquery.Selection) {
4 if strings.TrimSpace(s.Text()) == "" {
5 s.Remove()
6 }
7 },
8 )
9}
Minifying the HTML
The final step before compressing the HTML
file with Google’s brotli compression algorithm is to minify the HTML
. Minification is the process of removing unnecessary or redundant data without affecting how a resource is processed by the browser. In our case, this was removing all the whitespace and embedded developer comments.
1func (app *Application) minifyHTML() {
2 var html string
3
4 app.configureHTMLMinifier()
5
6 html, app.errorEncountered = app.htmlDocument.Html()
7 if app.errorEncountered != nil {
8 panic(app.errorEncountered.Error())
9 }
10
11 // remove all whitespace and developer comments
12 app.compressedHTML, app.errorEncountered =
13 app.htmlMinifier.String("text/html", html)
14
15 if app.errorEncountered != nil {
16 panic(app.errorEncountered.Error())
17 }
18
19 // remove invalid control code u+00A0
20 app.compressedHTML = strings.ReplaceAll(app.compressedHTML, "\u00A0", "")
21
22 log.Println("Smushed file size :", len(app.compressedHTML), "bytes")
23}
Grab the example code from GitHub.