Voice assistants are one of the hottest technologies right now. Siri, Alexa, and Google Assistant all aim to let you talk to computers, not just touch and type. Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU/NLP) are the key technologies enabling them. If you are a just-a-programmer like me, you might be itching to get a piece of the action and hack something. You are at the right place; read on.
These technologies are hard and the learning curve is steep, but they are becoming increasingly accessible. Last month, Mozilla released DeepSpeech along with models for US English. It has smaller and faster models than ever before, and even a TensorFlow Lite model that runs faster than real time on a single core of a Raspberry Pi 4. There are several interesting aspects, but right now I am going to focus on its refreshingly simple batch and streaming APIs in C, .NET, Java, JavaScript, and Python for converting speech to text. By the end of this blog post, you will build a voice transcriber. No kidding :-)
UPDATE: Mozilla DeepSpeech is no longer maintained; its new home is Coqui STT, which has the same API, so you can move to Coqui STT with minor changes in the code. The rest of this article has been updated to use Coqui STT. Models can be downloaded from the Coqui model repository, for example, English STT v1.0.0 (Large Vocabulary), which is used in this article.
Mozilla DeepSpeech Python
You need a computer with Python 3.7 installed, a good internet connection, and elementary Python programming skills. Even if you do not know Python, read along; it is not that hard. If you don’t want to install anything, you can try out the DeepSpeech APIs in the browser using this Google Colab.
Let’s do the needed setup:
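Here is a sketch of what the setup looks like; the model, scorer, and sample audio file names are placeholders for whatever you download from the Coqui model repository and the sample audio archive:

```bash
# Create a virtual environment and install Coqui STT (Python package and CLI)
python3 -m venv stt-venv
source stt-venv/bin/activate
pip install stt

# Download the acoustic model (e.g. model.tflite) and scorer (e.g. huge-vocabulary.scorer)
# for English STT v1.0.0 (Large Vocabulary), plus a few 16 kHz mono WAV samples to test with.

# Transcribe the sample audio files (file names are illustrative)
stt --model model.tflite --scorer huge-vocabulary.scorer --audio audio/2830-3980-0043.wav
stt --model model.tflite --scorer huge-vocabulary.scorer --audio audio/4507-16021-0012.wav
stt --model model.tflite --scorer huge-vocabulary.scorer --audio audio/8455-210777-0068.wav
```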
Examine the output of the last three commands, and you will see results “experience proof less”, “why should one halt on the way”, and “your power is sufficient i said” respectively. You are all set.
DeepSpeech Python API
The API is quite simple. You first need to create a model object using the model files you downloaded:
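Something along these lines, assuming the model file you downloaded is named model.tflite:

```python
from stt import Model

# Create the model object from the downloaded acoustic model
model = Model("model.tflite")
```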
You should add a language model for better accuracy:
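Assuming the scorer file sits next to the model:

```python
# Attach the external scorer (language model) downloaded alongside the model
model.enableExternalScorer("huge-vocabulary.scorer")
```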
Once you have the model object, you can use either the batch or the streaming speech-to-text API.
Batch API
To use the batch API, the first step is to read the audio file:
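A sketch using the standard-library wave module; the file name is one of the samples used earlier:

```python
import wave

# Open the WAV file and read all frames as raw bytes
wav = wave.open("audio/2830-3980-0043.wav", "rb")
rate = wav.getframerate()
frames = wav.getnframes()
buffer = wav.readframes(frames)
wav.close()

print(rate)                # 16000
print(model.sampleRate())  # 16000
print(type(buffer))        # <class 'bytes'>
```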
As you can see, the sample rate of the wav file is 16000 Hz, the same as the model’s sample rate. But the buffer is a byte array, whereas the model expects a 16-bit int array. Let’s convert it:
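NumPy makes this a one-liner:

```python
import numpy as np

# Reinterpret the raw bytes as a 16-bit int array
data16 = np.frombuffer(buffer, dtype=np.int16)
```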
Run speech-to-text in batch mode to get the text:
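The batch call is a single method on the model:

```python
# One call over the whole audio buffer returns the transcript
text = model.stt(data16)
print(text)
```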
Streaming API
Now let’s accomplish the same using the streaming API. It consists of three steps: open a session, feed data, close the session.
Open a streaming session:
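```python
# Create a streaming session on the model
stream = model.createStream()
```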
Repeatedly feed chunks of the speech buffer, and get interim results if desired:
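A sketch that feeds the same buffer in fixed-size chunks; the chunk size is an arbitrary choice:

```python
# Feed the audio in chunks and print interim transcripts as they improve
chunk_size = 8192  # samples per chunk, an arbitrary choice
offset = 0
while offset < len(data16):
    stream.feedAudioContent(data16[offset:offset + chunk_size])
    print(stream.intermediateDecode())
    offset += chunk_size
```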
Close the stream and get the final result:
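```python
# Finishing the stream returns the final transcript and ends the session
text = stream.finishStream()
print(text)
```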
Python Transcriber
A transcriber consists of two parts: a producer that captures voice from the microphone, and a consumer that converts this speech stream to text. The two execute in parallel. The audio recorder keeps producing chunks of the speech stream. The speech recognizer listens to this stream, consumes the chunks as they arrive, and updates the transcribed text.
To capture audio, we will use PortAudio, a free, cross-platform, open-source audio I/O library. You have to download and install it. On macOS, you can install it using brew:
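```bash
brew install portaudio
```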
PyAudio provides Python bindings for PortAudio, and you can install it with pip:
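```bash
pip install pyaudio
```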
PyAudio has two modes: blocking, where data has to be read (pulled) from the stream, and non-blocking, where a callback function is passed to PyAudio for feeding (pushing) the audio data stream. The non-blocking mechanism suits the transcriber. The data-buffer processing code using the streaming API has to be wrapped in a callback:
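A sketch of such a callback; the names stream_context and audio_callback are illustrative, and model is the Coqui STT model created earlier:

```python
import numpy as np
import pyaudio

# Streaming session that the callback keeps feeding
stream_context = model.createStream()

def audio_callback(in_data, frame_count, time_info, status):
    # PyAudio delivers raw bytes; convert to 16-bit ints and feed the STT stream
    data16 = np.frombuffer(in_data, dtype=np.int16)
    stream_context.feedAudioContent(data16)
    return (None, pyaudio.paContinue)
```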
Now you have to create a PyAudio input stream with this callback:
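Something along these lines, capturing 16-bit mono audio at the model’s sample rate; the buffer size is an arbitrary choice:

```python
audio = pyaudio.PyAudio()

# Input-only stream; the callback is invoked for every captured buffer
audio_stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=model.sampleRate(),
    input=True,
    frames_per_buffer=1024,
    stream_callback=audio_callback,
    start=False,
)
audio_stream.start_stream()
```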
Finally, you need to print the final result and clean up when the user ends the recording by pressing Ctrl-C:
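A minimal sketch: wait until the user interrupts, then shut everything down and print the transcript:

```python
import time

try:
    print("Recording... press Ctrl-C to stop")
    while audio_stream.is_active():
        time.sleep(0.1)
except KeyboardInterrupt:
    # Stop capturing, release PortAudio, and finish the STT stream
    audio_stream.stop_stream()
    audio_stream.close()
    audio.terminate()
    print("Transcript:", stream_context.finishStream())
```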
That’s all it takes, just 70 lines of Python code to put it all together: ds-transcriber.py.
Recap
In this article, you had a quick introduction to the batch and streaming APIs of Coqui STT (the successor of Mozilla DeepSpeech) and learned how to marry it with PyAudio to create a speech transcriber. The ASR model used here is for US English speakers, so accuracy will vary for other accents. By swapping in a model for another language or accent, the same code will work for that language or accent.