Voice assistants are one of the hottest technologies right now. Siri, Alexa, and Google Assistant all aim to let you talk to computers, not just touch and type. Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU/NLP) are the key technologies enabling them. If you are a just-a-programmer like me, you might be itching to get a piece of the action and hack something. You are at the right place; read on.
These technologies are hard and the learning curve is steep, but they are becoming increasingly accessible. Last month, Mozilla released DeepSpeech along with models for US English. It has smaller and faster models than ever before, and even a TensorFlow Lite model that runs faster than real time on a single core of a Raspberry Pi 4. There are several interesting aspects, but right now I am going to focus on its refreshingly simple batch and streaming APIs in C, .NET, Java, JavaScript, and Python for converting speech to text. By the end of this blog post, you will build a voice transcriber. No kidding :-)
UPDATE: Mozilla DeepSpeech is no longer maintained; its new home is Coqui STT, which has the same API, so you can move to Coqui STT with minor changes in the code. The rest of this article has been updated to use Coqui STT. Models can be downloaded from the Coqui model repository, for example, English STT v1.0.0 (Large Vocabulary), which is used in this article.
Mozilla DeepSpeech Python
You need a computer with Python 3.7 installed, a good internet connection, and elementary Python programming skills. Even if you do not know Python, read along; it is not that hard. If you don’t want to install anything, you can try out the DeepSpeech APIs in the browser using this Google Colab.
Let’s do the needed setup:
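Here is a sketch of what the setup looks like; the model, scorer, and sample audio file names are placeholders for whatever you download from the Coqui model repository and the sample audio archive:

```bash
# Create a virtual environment and install Coqui STT (Python package and CLI)
python3 -m venv stt-venv
source stt-venv/bin/activate
pip install stt

# Download the acoustic model (e.g. model.tflite) and scorer (e.g. huge-vocabulary.scorer)
# for English STT v1.0.0 (Large Vocabulary), plus a few 16 kHz mono WAV samples to test with.

# Transcribe the sample audio files (file names are illustrative)
stt --model model.tflite --scorer huge-vocabulary.scorer --audio audio/2830-3980-0043.wav
stt --model model.tflite --scorer huge-vocabulary.scorer --audio audio/4507-16021-0012.wav
stt --model model.tflite --scorer huge-vocabulary.scorer --audio audio/8455-210777-0068.wav
```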
Examine the output of the last three commands, and you will see results “experience proof less”, “why should one halt on the way”, and “your power is sufficient i said” respectively. You are all set.
DeepSpeech Python API
The API is quite simple. You first need to create a model object using the model files you downloaded:
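Something along these lines, assuming the model file you downloaded is named model.tflite:

```python
from stt import Model

# Create the model object from the downloaded acoustic model
model = Model("model.tflite")
```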
You should add a language model for better accuracy:
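Assuming the scorer file sits next to the model:

```python
# Attach the external scorer (language model) downloaded alongside the model
model.enableExternalScorer("huge-vocabulary.scorer")
```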
Once you have the model object, you can use either the batch or the streaming speech-to-text API.
Batch API
To use the batch API, the first step is to read the audio file:
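A sketch using the standard-library wave module; the file name is one of the samples used earlier:

```python
import wave

# Open the WAV file and read all frames as raw bytes
wav = wave.open("audio/2830-3980-0043.wav", "rb")
rate = wav.getframerate()
frames = wav.getnframes()
buffer = wav.readframes(frames)
wav.close()

print(rate)                # 16000
print(model.sampleRate())  # 16000
print(type(buffer))        # <class 'bytes'>
```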
As you can see, the sample rate of the wav file is 16000 Hz, the same as the model’s sample rate. But the buffer is a byte array, whereas the model expects a 16-bit int array. Let’s convert it:
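NumPy makes this a one-liner:

```python
import numpy as np

# Reinterpret the raw bytes as a 16-bit int array
data16 = np.frombuffer(buffer, dtype=np.int16)
```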
Run speech-to-text in batch mode to get the text:
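The batch call is a single method on the model:

```python
# One call over the whole audio buffer returns the transcript
text = model.stt(data16)
print(text)
```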
Streaming API
Now let’s accomplish the same using the streaming API. It consists of three steps: open a session, feed data, close the session.
Open a streaming session:
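```python
# Create a streaming session on the model
stream = model.createStream()
```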
Repeatedly feed chunks of the speech buffer, and get interim results if desired:
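A sketch that feeds the same buffer in fixed-size chunks; the chunk size is an arbitrary choice:

```python
# Feed the audio in chunks and print interim transcripts as they improve
chunk_size = 8192  # samples per chunk, an arbitrary choice
offset = 0
while offset < len(data16):
    stream.feedAudioContent(data16[offset:offset + chunk_size])
    print(stream.intermediateDecode())
    offset += chunk_size
```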
Close the stream and get the final result:
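```python
# Finishing the stream returns the final transcript and ends the session
text = stream.finishStream()
print(text)
```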
Python Transcriber
A transcriber consists of two parts: a producer that captures voice from the microphone, and a consumer that converts this speech stream to text. The two execute in parallel. The audio recorder keeps producing chunks of the speech stream. The speech recognizer listens to this stream, consumes the chunks as they arrive, and updates the transcribed text.
To capture audio, we will use PortAudio, a free, cross-platform, open-source audio I/O library. You have to download and install it. On macOS, you can install it using brew:
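```bash
brew install portaudio
```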
PyAudio provides Python bindings for PortAudio, and you can install it with pip:
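```bash
pip install pyaudio
```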
PyAudio has two modes: blocking, where data has to be read (pulled) from the stream, and non-blocking, where a callback function is passed to PyAudio for feeding (pushing) the audio data stream. The non-blocking mechanism suits the transcriber. The data-buffer processing code using the streaming API has to be wrapped in a callback:
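A sketch of such a callback; the names stream_context and audio_callback are illustrative, and model is the Coqui STT model created earlier:

```python
import numpy as np
import pyaudio

# Streaming session that the callback keeps feeding
stream_context = model.createStream()

def audio_callback(in_data, frame_count, time_info, status):
    # PyAudio delivers raw bytes; convert to 16-bit ints and feed the STT stream
    data16 = np.frombuffer(in_data, dtype=np.int16)
    stream_context.feedAudioContent(data16)
    return (None, pyaudio.paContinue)
```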
Now you have to create a PyAudio input stream with this callback:
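Something along these lines, capturing 16-bit mono audio at the model’s sample rate; the buffer size is an arbitrary choice:

```python
audio = pyaudio.PyAudio()

# Input-only stream; the callback is invoked for every captured buffer
audio_stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=model.sampleRate(),
    input=True,
    frames_per_buffer=1024,
    stream_callback=audio_callback,
    start=False,
)
audio_stream.start_stream()
```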
Finally, you need to print the final result and clean up when the user ends the recording by pressing Ctrl-C:
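A minimal sketch: wait until the user interrupts, then shut everything down and print the transcript:

```python
import time

try:
    print("Recording... press Ctrl-C to stop")
    while audio_stream.is_active():
        time.sleep(0.1)
except KeyboardInterrupt:
    # Stop capturing, release PortAudio, and finish the STT stream
    audio_stream.stop_stream()
    audio_stream.close()
    audio.terminate()
    print("Transcript:", stream_context.finishStream())
```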
That’s all it takes, just 70 lines of Python code to put it all together: ds-transcriber.py.
Recap
In this article, you had a quick introduction to the batch and streaming APIs of Coqui STT (the successor of Mozilla DeepSpeech) and learned how to marry it with PyAudio to create a speech transcriber. The ASR model used here is for US English speakers, so accuracy will vary for other accents. By swapping in a model for another language or accent, the same code will work for that language or accent.