One way in which artificial intelligence has become part of everyday life is through virtual assistants, thanks to native integration in the vast majority of smartphones on the market and the millions of smart devices sold in recent years, such as the Amazon Echo or Google Home.
Beyond the variations in the physical form they take, all virtual assistants that use voice as a mode of interaction share one feature: they activate when a keyword is detected, such as “Alexa” or “Hey, Siri”. Our assistants listen to the voices and noises in the surrounding environment and wait for the keyword to be spoken, activating when it is and then responding to our requests.
This technique is called keyword spotting, and it can be implemented in various ways. Let’s take a look at how.
From voice to model
To ensure that a device can be activated via voice command, you first need a system that allows real-time analysis of the audio picked up by the microphone. For this reason, the voice activation system is typically implemented inside the device itself.
The most common approach relies on a type of neural network called a CNN (Convolutional Neural Network), usually used in image analysis. CNNs work with data presented in the form of matrices (also called 2-dimensional arrays), so the audio signal is first converted into a spectrogram.
A spectrogram represents an audio signal in three dimensions: time, frequency, and amplitude. We can imagine it as a photograph of the audio signal, where the color of each pixel represents the amplitude of the signal at a certain frequency at a certain instant.
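To make the idea concrete, here is a minimal sketch of how a spectrogram can be computed from a raw signal, assuming NumPy; the frame size and hop length are illustrative values, not the ones any particular assistant uses:

```python
import numpy as np

def spectrogram(signal, frame_size=256, hop=128):
    """Magnitude spectrogram: rows are time steps, columns are frequency bins."""
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size] * window
        # The FFT of each windowed frame gives the frequency content at that instant
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# 1 second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(tone)
# The brightest frequency bin should sit near 440 Hz
peak_bin = spec.mean(axis=0).argmax()
print(peak_bin * sr / 256)  # ≈ 437.5 Hz (the bin closest to 440 Hz)
```

Stacking one spectrum per time step is what turns the one-dimensional audio stream into the two-dimensional “image” that a CNN can process.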
The neural network analyzes the sound and looks for patterns that correspond to the word we are seeking. It does this by applying a series of consecutive filters (called convolutions) to the signal, which is where the network gets its name. When the keyword is detected, the assistant activates and sends the audio to more complex speech recognition systems, which transcribe it into text. Such systems usually do not run directly on the device, but in the cloud systems of the respective vendors.
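The core operation of a convolution is simple to show in isolation. In this toy sketch (plain NumPy, not a real keyword model), a small filter slides over a spectrogram-like matrix and responds most strongly where the pattern it encodes appears:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a filter over a 2-D input ("valid" mode), as a CNN layer does."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy spectrogram containing a rising-frequency pattern (a diagonal of energy)
spec = np.zeros((6, 6))
for k in range(4):
    spec[k, k] = 1.0

# A filter tuned to that diagonal gives a strong response where it occurs
kernel = np.eye(3)
response = convolve2d(spec, kernel)
print(response.max())  # 3.0, where the filter fully overlaps the diagonal
```

A real CNN learns many such filters during training, and stacks several convolution layers so that later layers can combine simple patterns into ones complex enough to match a whole keyword.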
Broadly speaking, this is what happens. In practice, many techniques are used to improve the performance of the recognition system in real-world situations. We can measure a model’s performance by the number of false negatives (the assistant does not activate when it should) and false positives (the assistant activates when it is not being addressed). Anyone who uses a virtual assistant regularly will recognize this: better model performance translates directly into a better user experience with the product.
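On a labeled test set, both counts fall out of a simple comparison between the model’s decisions and the ground truth; a minimal sketch with invented data:

```python
import numpy as np

# Toy evaluation set: 1 means "keyword present", one entry per audio clip
truth = np.array([1, 1, 0, 0, 1, 0])
pred  = np.array([1, 0, 0, 1, 1, 0])  # the model's decisions

# False negative: the assistant does not activate when it should
false_negatives = int(np.sum((truth == 1) & (pred == 0)))
# False positive: the assistant activates when it is not addressed
false_positives = int(np.sum((truth == 0) & (pred == 1)))

print(false_negatives, false_positives)  # 1 1
```

Tuning the detection threshold trades one kind of error for the other, which is why both numbers need to be tracked together.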
Think, for example, of Alexa, which can be found in millions of homes around the world. The developers had to take steps to exclude mentions of the word “Alexa” in commercials, to prevent devices from activating while the TV is on.
Another example is the recognition system Apple built for Siri, which, because it runs on a smartphone, focuses on preserving battery life. Through the use of ad-hoc hardware, it was possible to achieve very low power consumption despite continuous listening. Everything is described in a very interesting article on the Apple Machine Learning team’s blog.
From model to product
A keyword spotting system can be useful in various application areas, not only in virtual assistants. Those wishing to integrate a voice activation system into their product can rely on a series of projects and datasets made available by the global community. In the world of machine learning, new experiments chasing the best possible performance are published every day. Taking a look at sites like Kaggle or Papers with Code, we can find plenty of example models for keyword spotting.
When choosing a framework, the first considerations are the technologies used and the target device. Based on these, one might lean towards one machine learning framework rather than another. For example, it is possible to run a TensorFlow model not only on a desktop or server, but also on less powerful devices such as smartphones and embedded systems (with TensorFlow Lite), or even in the browser with TensorFlow.js. This also lets us address an important issue: user privacy. The data necessary for training the model remains on the user’s device instead of being sent to a server.
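For the TensorFlow Lite route, the conversion step looks roughly like this. A minimal sketch assuming TensorFlow 2.x; the model architecture, input shape, and file name are all illustrative stand-ins for a trained keyword-spotting model:

```python
import tensorflow as tf

# Stand-in for a trained keyword model: spectrogram frames x frequency bins in,
# a keyword / not-keyword decision out
model = tf.keras.Sequential([
    tf.keras.Input(shape=(124, 129, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Convert to TensorFlow Lite for smartphones and embedded targets
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("keyword_model.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting `.tflite` file is a compact flat buffer that the TensorFlow Lite runtime can execute on-device, with no server round trip.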
During the development of Elly, our virtual assistant, we came across two systems that caught our attention. The first is Google’s speech-commands model, a TensorFlow.js model that can recognize 18 keywords from a set of English words such as ‘up’, ‘down’, ‘left’, ‘right’, and the numbers from ‘zero’ to ‘nine’. The words in the supplied dataset immediately suggest user interaction without a mouse or keyboard, but the model is not limited to this. Through a technique called transfer learning, it is possible to build your own model capable of recognizing a new word, adapting it to our purpose. In the project’s GitHub repository you will find everything you need to build a new model starting from new audio samples.
Collecting audio samples to train a new model is another point worth our attention. Gathering the recordings could slow down the early prototyping stages, when the recognition model does not yet need to be perfect.
To work around this problem, you can take a different approach, such as Microsoft’s Speech Studio. With this portal we can generate a model simply by choosing a word from the vocabulary list, with no need to provide audio samples for training. The model can be generated in two ways: one faster and free of charge, the other slower and paid, but promising greater reliability. The model generated by Speech Studio can be used via the SDK, which supports C#, Python, and Objective-C/Swift.
In a voice assistant, where voice is the main interaction mode, a keyword spotting system is typically the first of a series of tools that analyze the voice and interpret the commands given by the user. Being able to run the model directly on the user’s device lets us solve two main problems: privacy and latency.
From a privacy perspective, only the recordings in which the keyword was detected are sent, so the transmitted data is limited to interactions with the assistant, rather than everything the microphone picks up all day. This also allows us to avoid latency problems in activating the voice assistant, since no audio stream needs to be transmitted to a server first.
Response speed is an aspect that has a great impact on the user experience for those who interact with our assistant: a pleasant user experience begins from the first interaction.
Having a good keyword spotting model is therefore essential, and we can’t wait to share the results we are achieving with Elly!
Keep following us.