While developing our virtual assistant, the vocal component played a fundamental role right from the start. Voice interaction is the main ingredient of a more natural user experience, and it consists of two elements: speech synthesis (text-to-speech), with which we can say something to our interlocutor, and speech recognition (speech-to-text), with which we extract text from spoken audio.
Algorithms that can read text aloud or transcribe speech have been around for a long time. Today, artificial intelligence is improving these techniques, raising the quality of transcribed text and making generated voices sound more natural by mimicking the intonation and cadence of human speech.
Developing these technologies from scratch requires considerable effort, which is why the leading cloud providers on the market offer ready-to-use speech services, eliminating the need to build models that would require many hours of training before becoming reliable.
These services, which are similar in many ways, present some differences in terms of costs, supported languages, speed of execution, and more. So I thought it might be interesting to share the experiments done with Azure, Google Cloud, and AWS services with you, to give you an idea of what factors to consider when choosing between the three.
A fundamental aspect of usability is the response time of these services, especially for the real-time speech analysis and synthesis a voice assistant requires.
Regarding speech-to-text, each of the services supports recognition of several audio formats to meet different users' needs. Speech recognition can be performed in two ways: "batch" recognition, tailored for long audio files, or "streaming" recognition, used for real-time speech analysis.
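To give an idea of the difference, streaming APIs consume audio incrementally in small frames rather than as a single payload. Here is a minimal, provider-agnostic sketch of the chunking step; the 3200-byte frame size (100 ms of 16 kHz, 16-bit, mono PCM) is our assumption, not a requirement of any particular SDK:

```python
def audio_chunks(data: bytes, chunk_size: int = 3200):
    """Yield successive fixed-size chunks of raw PCM audio.

    Streaming speech-to-text endpoints accept audio incrementally;
    3200 bytes is 100 ms of 16 kHz, 16-bit, mono PCM, a plausible
    streaming frame size (adjust to your provider's guidance).
    """
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]
```

In batch mode you would instead upload the whole file in a single request and wait for the complete transcript.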
To measure the services' response times, we ran a test in which eight Italian phrases of similar length were first synthesized and then converted back to text, and the average response time was calculated.
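The measurement itself was nothing exotic; a minimal harness along these lines would do, where the `call` parameter standing in for the actual provider SDK request is our simplification:

```python
import time
from statistics import mean

def benchmark(call, inputs):
    """Time each invocation of `call` and return (total_seconds, average_seconds).

    `call` stands in for a synthesis or recognition request; in a real
    test it would wrap the provider SDK call for each phrase.
    """
    timings = []
    for item in inputs:
        start = time.perf_counter()
        call(item)
        timings.append(time.perf_counter() - start)
    return sum(timings), mean(timings)
```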
Here are the results:
| Synthesis (WAV) | Total (s) | Average (s) |
|---|---|---|

| Synthesis (MP3) | Total (s) | Average (s) |
|---|---|---|
For Amazon Transcribe, the streaming recognition function had to be implemented by following AWS's documentation, whereas the Azure and Google SDKs provide it out of the box.
| Recognition (WAV) | Total (min) | Average (min) |
|---|---|---|

| Recognition (MP3) | Total (min) | Average (min) |
|---|---|---|
Amazon Polly was the fastest speech synthesis service, whereas Google Text-to-Speech was the fastest in speech recognition. Notice that only the Azure SDK supports MP3 audio file recognition; for the other services, the files must first be decoded to WAV format.
One of the factors we need to take into consideration is certainly the number of supported languages, which helps reach a wider audience.
Although Italian is supported by all three providers, as of today only Microsoft and Google provide Italian neural voices, that is, voices generated with artificial intelligence that sound more natural to the listener. Moreover, Google provides models optimized for recognizing speech from specific sources, such as phone calls or videos.
| Service | STT Languages | TTS Languages | NTTS Languages |
|---|---|---|---|
During our tests, Azure's Italian neural voices sounded the most natural to us. Each provider offers a demo tool you can use to compare the voices, although you'll have to register first to try the AWS one. Try them for yourself:
Azure Cognitive Services
Google Cloud Text-to-Speech
The advantage of using a cloud service is that you pay only for what you actually use. All of the services we've seen follow a consumption-based pricing model, charging either per synthesized character or per second of transcribed audio.
Furthermore, each service can be used free of charge up to a certain threshold, which is especially handy during the early development phases.
Here is a table summing up the prices for speech recognition:
| Service | Price |
|---|---|
| Azure Speech-to-text | $0.000277 per second |
| Google Cloud Speech-to-text | $0.006 per 15 seconds ($0.004 with logging) |
| Amazon Transcribe | $0.0004 per second (15 s minimum) |
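To compare the rates on an equal footing, here is a small estimator based on the table above. The rounding rules (Google bills in 15-second increments, Amazon applies a 15-second minimum) reflect the providers' published pricing at the time of writing; verify them on the pricing pages before relying on this sketch:

```python
import math

def stt_cost(seconds: float) -> dict:
    """Estimated USD cost of transcribing `seconds` of audio per provider.

    Rates are taken from the comparison table; rounding rules are our
    reading of the published pricing and may change over time.
    """
    return {
        "azure": seconds * 0.000277,                 # billed per second
        "google": math.ceil(seconds / 15) * 0.006,   # billed in 15 s increments
        "amazon": max(seconds, 15) * 0.0004,         # per second, 15 s minimum
    }
```

For a one-minute clip, Google and Amazon come out identical while Azure is cheaper on paper; the gap only matters at volume.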
Speech synthesis prices are identical across the three providers; they differ only in how many free characters they offer per month:
| Voice type (per 1 million characters) | Price |
|---|---|
| Standard TTS voices | $4.00 |
| Neural TTS voices (NTTS) | $16.00 |
| Service | Free standard characters/month | Free neural characters/month |
|---|---|---|
| Azure Text-to-Speech | 5 million | 500,000 |
| Google Cloud Text-to-Speech | 4 million | 1 million |
| Amazon Polly | 4 million | 1 million |
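Given the shared rates and the per-provider free quotas above, a back-of-the-envelope monthly estimate looks like this; it's a sketch that assumes the free quota simply offsets the first characters synthesized each month:

```python
def tts_monthly_cost(chars: int, neural: bool, free_chars: int) -> float:
    """Estimated monthly USD cost of synthesizing `chars` characters.

    Uses the shared rates from the table ($4 per million standard,
    $16 per million neural) after subtracting the provider's free
    monthly quota; how quotas are applied is our assumption.
    """
    rate = 16.0 if neural else 4.0
    billable = max(chars - free_chars, 0)
    return billable / 1_000_000 * rate
```

For example, synthesizing 6 million standard characters on Azure (5 million free) would cost roughly one million billable characters at $4.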
In addition to the characteristics we’ve already discussed, there are other aspects to factor in when approaching a third-party service, and these are no exception. For example: which libraries are available for your preferred programming language?
For one thing, streaming recognition for Amazon Transcribe is only available in the Java SDK, although it can be implemented in any other language by following the documentation.
Another important aspect is the possibility of running these tools within your own infrastructure. Microsoft provides a Docker image (currently in preview) that offers the same speech functions found in the cloud. With Google, on the other hand, only speech-to-text is available on-premises, through platforms like Anthos or GKE.
Azure Speech Container How-to
Google Cloud Speech-to-Text On-Prem
You can find updated listings for prices and supported functions on the pages of the respective providers:
Azure Cognitive Services Speech-to-text
Azure Cognitive Services Text-to-speech
We took just a quick look at what is available in the world of artificial intelligence. These tools, which are constantly improving, simplify the use of speech and create a lot of opportunities. We just have to make the most of them when building the applications of the future.
If you’re interested in knowing how we use them, stay tuned!