Using SoX to normalise and join audio segments for digital dictation

Using SoX, the Swiss Army knife of audio manipulation, to normalise, convert and join audio segments taken from digital dictation

An old Philips tape cassette used for audio dictation

The Problem Statement

Consultants and health professionals use a Philips SpeechMike microphone to dictate their discharge letters. The voice recognition software converts their speech to text, which is then inserted into the document creation system. When the user saves their letter, the document management system (DMS) needs to collect the audio segments generated by the voice recognition software and attach them to the document, so that medical secretaries can replay the audio and verify the speech-to-text transcription.

Currently, the collection of the audio segments can fail, which leaves the secretaries with no audio recording to validate the text translation against. The collection of the audio segments can fail because of one of the following reasons:

This article explains how we leveraged the SoX library to normalise the audio before attempting to join the segments. It also covers how we converted the audio from stereo to mono to reduce the overall storage requirements.

Using SoX to get audio metadata

As part of the solution I’m developing for this, I log some key audio metadata against each segment being submitted. With the --info command-line option, SoX returns metadata such as sample rate, precision and bit rate.

sox --info 91817201-202406100959-1.wav

Input File     : '91817201-202406100959-1.wav'
Channels       : 2
Sample Rate    : 48000
Precision      : 25-bit
Duration       : 00:00:08.91 = 427680 samples ~ 668.25 CDDA sectors
File Size      : 3.42M
Bit Rate       : 3.07M
Sample Encoding: 32-bit Floating Point PCM
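If only one or two of these values need to be logged, the text report can be picked apart with a small shell function. This is a minimal sketch rather than the code used in the solution: info_field is a hypothetical helper name, and it assumes the "Name : value" layout shown in the report above.

```shell
#!/bin/sh
# Sketch: print the value of one named field from a `sox --info` report
# read on standard input. info_field is a hypothetical helper name.
info_field() {
    awk -v key="$1" '
        {
            pos = index($0, ":")            # first colon separates name and value
            if (!pos) next                  # skip blank or unlabelled lines
            name = substr($0, 1, pos - 1)
            sub(/[ \t]+$/, "", name)        # drop the padding after the name
            if (name == key) {
                value = substr($0, pos + 1)
                sub(/^[ \t]+/, "", value)   # drop the space after the colon
                print value
                exit
            }
        }
    '
}
```

For example, piping the report into the function with `sox --info 91817201-202406100959-1.wav | info_field "Sample Rate"` would print just the sample rate. SoX's info mode also accepts options such as -r, -c and -D that print a single value directly, which may be simpler where they cover the field you need.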

Using SoX to normalise each audio segment

The following example demonstrates how to normalise the level of each audio segment and write every segment at the same sample rate, 48,000 Hz. The --norm option adjusts the gain, -r sets the output sample rate, and -c 1 converts the audio from stereo to mono by reducing the number of channels to one.

sox --norm 91817201-202406100959-1.wav -r 48000 -c 1 segment_1.wav
sox --norm 91817201-202406100957-2.wav -r 48000 -c 1 segment_2.wav
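With more than a couple of segments, the same command can be run in a loop. The sketch below is illustrative only: normalise_segments and the SOX variable are hypothetical names, and the segment_N.wav output naming simply follows the two commands above. Setting SOX to "echo" gives a dry run that prints the arguments instead of calling sox.

```shell
#!/bin/sh
# Sketch: normalise every file given as an argument, writing
# segment_1.wav, segment_2.wav, ... in argument order.
# SOX and normalise_segments are hypothetical names, not part of SoX.
SOX="${SOX:-sox}"

normalise_segments() {
    n=1
    for input in "$@"; do
        # --norm adjusts the level, -r sets the output sample rate,
        # -c 1 reduces the output to a single (mono) channel
        "$SOX" --norm "$input" -r 48000 -c 1 "segment_${n}.wav" || return 1
        n=$((n + 1))
    done
}
```

It would be called with the source files in the order they should later be joined, for example `normalise_segments 91817201-202406100959-1.wav 91817201-202406100957-2.wav`.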

Using SoX to combine the segments to a single final recording

To combine all the audio segments into a single WAV file, we provide the list of segment file names in the order we wish to join them.

sox --combine concatenate segment_1.wav segment_2.wav final.wav
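Because the join relies on every segment sharing the same format, it can help to check that before concatenating. The following is a rough sketch under assumptions: all_same, join_segments and the SOX variable are hypothetical names, and it relies on SoX's info mode printing a single value when given -r (sample rate) or -c (channels).

```shell
#!/bin/sh
# Sketch: join segments only when they all report the same sample rate
# and channel count. SOX, all_same and join_segments are hypothetical names.
SOX="${SOX:-sox}"

# Succeeds when every line on standard input is identical.
all_same() {
    awk 'NR == 1 { first = $0 } NR > 1 && $0 != first { exit 1 }'
}

# Usage: join_segments OUTPUT SEGMENT...
join_segments() {
    output="$1"
    shift
    for f in "$@"; do
        # One "rate channels" line per segment, e.g. "48000 1"
        printf '%s %s\n' "$("$SOX" --info -r "$f")" "$("$SOX" --info -c "$f")"
    done | all_same || {
        echo "segments differ in sample rate or channel count" >&2
        return 1
    }
    # Segments are concatenated in the order they were given
    "$SOX" --combine concatenate "$@" "$output"
}
```

With the segments produced earlier, `join_segments final.wav segment_1.wav segment_2.wav` would run the same concatenate command as above, but only after the check passes.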