{"id":2458,"date":"2021-02-08T05:00:00","date_gmt":"2021-02-08T04:00:00","guid":{"rendered":"https:\/\/ellycode.com\/?p=2458"},"modified":"2021-02-02T16:29:12","modified_gmt":"2021-02-02T15:29:12","slug":"have-a-chat-with-our-application","status":"publish","type":"post","link":"https:\/\/ellycode.com\/en\/blog-en\/have-a-chat-with-our-application\/","title":{"rendered":"Have a chat with our application"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">While developing our virtual assistant, the vocal component played a fundamental role right from the start. <strong>Vocal interaction is the main element in building a more natural User eXperience<\/strong>, which consists mainly of two elements: vocal synthesis (text-to-speech), with which we can say something to our interlocutor, and voice recognition (speech-to-text), to extract text from speech. We\u2019re hoping to do something better than this:<\/p>\n\n\n\n<figure class=\"wp-block-embed aligncenter is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<span class=\"embed-youtube\" style=\"text-align:center; display: block;\"><iframe loading=\"lazy\" class=\"youtube-player\" width=\"1080\" height=\"608\" src=\"https:\/\/www.youtube.com\/embed\/uyV0IVItlM4?version=3&#038;rel=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;fs=1&#038;hl=en-US&#038;autohide=2&#038;wmode=transparent\" allowfullscreen=\"true\" style=\"border:0;\" sandbox=\"allow-scripts allow-same-origin allow-popups allow-presentation allow-popups-to-escape-sandbox\"><\/iframe><\/span>\n<\/div><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Algorithms that can read the screen or transcribe text have been around for a long time. 
Today, artificial intelligence is improving these techniques, raising the quality of transcribed text and making synthesized voices sound more natural by mimicking the intonation and cadence of human ones.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Developing these technologies from scratch requires considerable effort, and that is why <strong>the leading Cloud providers currently on the market provide ready-to-use speech services<\/strong>, eliminating the need to build models that would require many hours of training in order to be reliable.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">These services, which are similar in many ways, differ in cost, supported languages, speed of execution, and more. So I thought it might be interesting to share the experiments we ran with <strong>Azure<\/strong>, <strong>Google Cloud<\/strong>, and <strong>AWS<\/strong> services, to give you an idea of which factors to consider when choosing among the three.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-performance\">Performance<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">A fundamental aspect of usability is the response time of these services, especially for real-time speech analysis and synthesis like the ones needed for a voice assistant.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" data-attachment-id=\"2482\" data-permalink=\"https:\/\/ellycode.com\/en\/blog-en\/have-a-chat-with-our-application\/attachment\/performance\/\" data-orig-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/performance.jpg?fit=1950%2C1300&amp;ssl=1\" data-orig-size=\"1950,1300\" data-comments-opened=\"0\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"performance\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/performance.jpg?fit=300%2C200&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/performance.jpg?fit=1024%2C683&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/performance.jpg?resize=1024%2C683&#038;ssl=1\" alt=\"\" class=\"wp-image-2482\" srcset=\"https:\/\/ellycode.com\/wp-content\/uploads\/2021\/01\/performance-980x653.jpg 980w, https:\/\/ellycode.com\/wp-content\/uploads\/2021\/01\/performance-480x320.jpg 480w\" sizes=\"auto, (min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1024px, 100vw\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Regarding speech-to-text, each of the services allows recognition in different audio formats to meet the needs of different users. Speech recognition can be done in two ways: \u201cbatch\u201d recognition, which is tailored for long audio files, or \u201cstreaming\u201d recognition, which is used for real-time speech analysis.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To verify the response time of the services, we did a test in which eight Italian phrases of similar length were first synthesized and then reconverted to text. An average response time was calculated. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here are the results:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th><strong>Synthesis (WAV)<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Total (s)<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Average (s)<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Amazon Polly<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>05.720<\/em><\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>0.715<\/em><\/td><\/tr><tr><td>Azure Text-to-speech<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>34.039<\/em><\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>4.254<\/em><\/td><\/tr><tr><td>Google Text-to-speech<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>08.753<\/em><\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>1.094<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th><strong>Synthesis (MP3)<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Total (s)<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Average (s)<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Amazon Polly<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>04.753<\/em><\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>0.594<\/em><\/td><\/tr><tr><td>Azure Text-to-speech<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>29.988<\/em><\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>3.748<\/em><\/td><\/tr><tr><td>Google Text-to-speech<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>07.312<\/em><\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>0.914<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p 
class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For Amazon Transcribe, we implemented \u201cstreaming\u201d recognition ourselves by following the company&#8217;s instructions, whereas the Azure and Google SDKs provide it out of the box.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Recognition (WAV)<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Total (min:s)<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Average (min:s)<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Amazon Transcribe<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>3:08.516<\/em><\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>0:23.560<\/em><\/td><\/tr><tr><td>Azure Speech-to-text<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>1:38.468<\/em><\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>0:12.308<\/em><\/td><\/tr><tr><td>Google Speech-to-text<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>0:39.995<\/em><\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>0:04.999<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Recognition (MP3)<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Total (min:s)<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Average (min:s)<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Amazon Transcribe<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>&#8211;<\/em><\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>&#8211;<\/em><\/td><\/tr><tr><td>Azure Speech-to-text<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>3:16.011<\/em><\/td><td class=\"has-text-align-center\" 
data-align=\"center\"><em>0:24.501<\/em><\/td><\/tr><tr><td>Google Speech-to-text<\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>&#8211;<\/em><\/td><td class=\"has-text-align-center\" data-align=\"center\"><em>&#8211;<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Amazon Polly was the fastest speech synthesis service, whereas Google Speech-to-Text was the fastest in speech recognition. Note that only the Azure SDK supports recognition of MP3 audio files; for the other services, the files must first be decoded into WAV format.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-language-support\">Language support<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">One of the factors we need to take into consideration is certainly <strong>the number of supported languages<\/strong>, which <strong>helps reach a wider audience<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"699\" data-attachment-id=\"2489\" data-permalink=\"https:\/\/ellycode.com\/en\/blog-en\/have-a-chat-with-our-application\/attachment\/multilanguage\/\" data-orig-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/multilanguage.jpg?fit=1933%2C1319&amp;ssl=1\" data-orig-size=\"1933,1319\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"multilanguage\" data-image-description=\"\" data-image-caption=\"\" 
data-medium-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/multilanguage.jpg?fit=300%2C205&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/multilanguage.jpg?fit=1024%2C699&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/multilanguage.jpg?resize=1024%2C699&#038;ssl=1\" alt=\"\" class=\"wp-image-2489\" srcset=\"https:\/\/ellycode.com\/wp-content\/uploads\/2021\/01\/multilanguage-980x669.jpg 980w, https:\/\/ellycode.com\/wp-content\/uploads\/2021\/01\/multilanguage-480x328.jpg 480w\" sizes=\"auto, (min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1024px, 100vw\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Although Italian is supported by all three providers, as of today, only <strong>Microsoft and Google provide Italian neural voices<\/strong>, that is to say voices <strong>generated with artificial intelligence and which sound more natural to the listener<\/strong>. 
Moreover, Google provides different optimized models for recognizing speech from specific sources, like phone calls or video.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th>Service<\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>STT Languages<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>TTS Languages<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>NTTS<\/strong> Languages<\/th><\/tr><\/thead><tbody><tr><td>Amazon<\/td><td class=\"has-text-align-center\" data-align=\"center\">31 (11)*<\/td><td class=\"has-text-align-center\" data-align=\"center\">29<\/td><td class=\"has-text-align-center\" data-align=\"center\">5<\/td><\/tr><tr><td>Azure<\/td><td class=\"has-text-align-center\" data-align=\"center\">86<\/td><td class=\"has-text-align-center\" data-align=\"center\">49<\/td><td class=\"has-text-align-center\" data-align=\"center\">54<\/td><\/tr><tr><td>Google<\/td><td class=\"has-text-align-center\" data-align=\"center\">136<\/td><td class=\"has-text-align-center\" data-align=\"center\">40<\/td><td class=\"has-text-align-center\" data-align=\"center\">25<\/td><\/tr><\/tbody><\/table><figcaption>* Languages that support streaming recognition.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">During our tests, Azure\u2019s Italian neural voices sounded the most natural to us. A demonstration tool for all of the providers is available and can be used to compare the voices, although you\u2019ll have to register first in order to get the AWS demo. 
Try it for yourself:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/azure.microsoft.com\/it-it\/services\/cognitive-services\/text-to-speech\/\" target=\"_blank\" rel=\"noreferrer noopener\">Azure Cognitive Services<\/a><br><a href=\"https:\/\/cloud.google.com\/text-to-speech?hl=it#section-2\" target=\"_blank\" rel=\"noreferrer noopener\">Google Cloud Text-to-Speech<\/a><br><a href=\"https:\/\/eu-west-1.console.aws.amazon.com\/polly\/home\/SynthesizeSpeech\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Amazon Polly<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-pricing\">Pricing<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The advantage of using a cloud service is definitely paying only for what is actually used. All of the services we\u2019ve seen so far follow a consumption-based pricing model, charging either by the number of synthesized characters or by seconds of transcribed audio. <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" data-attachment-id=\"2492\" data-permalink=\"https:\/\/ellycode.com\/en\/blog-en\/have-a-chat-with-our-application\/attachment\/pricing\/\" data-orig-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/pricing.jpg?fit=1950%2C1300&amp;ssl=1\" data-orig-size=\"1950,1300\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"pricing\" data-image-description=\"\" data-image-caption=\"\" 
data-medium-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/pricing.jpg?fit=300%2C200&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/pricing.jpg?fit=1024%2C683&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/pricing.jpg?resize=1024%2C683&#038;ssl=1\" alt=\"\" class=\"wp-image-2492\" srcset=\"https:\/\/ellycode.com\/wp-content\/uploads\/2021\/01\/pricing-980x653.jpg 980w, https:\/\/ellycode.com\/wp-content\/uploads\/2021\/01\/pricing-480x320.jpg 480w\" sizes=\"auto, (min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1024px, 100vw\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Furthermore, <strong>it is possible to use the services free of charge up to a certain threshold<\/strong>, which is handy during the early development phases.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here is a table summing up the prices for speech recognition:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th>Service<\/th><th class=\"has-text-align-right\" data-align=\"right\">Pricing<\/th><\/tr><\/thead><tbody><tr><td>Azure Speech-to-text<\/td><td class=\"has-text-align-right\" data-align=\"right\">$0.000277 per second<\/td><\/tr><tr><td>Google Cloud Speech-to-text<\/td><td class=\"has-text-align-right\" data-align=\"right\">$0.006 per 15 seconds ($0.004 with logging)<\/td><\/tr><tr><td>Amazon Transcribe<\/td><td class=\"has-text-align-right\" data-align=\"right\">$0.0004 per second (15 s minimum)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Speech synthesis prices are the same across the providers<\/strong>, which differ only in the number of free monthly characters they offer:<\/p>\n\n\n\n<figure class=\"wp-block-table 
is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th>Type<\/th><th class=\"has-text-align-right\" data-align=\"right\">Price<br>(per 1 million characters)<\/th><\/tr><\/thead><tbody><tr><td>Standard TTS voices<\/td><td class=\"has-text-align-right\" data-align=\"right\">$4.00<\/td><\/tr><tr><td>NTTS (neural) voices<\/td><td class=\"has-text-align-right\" data-align=\"right\">$16.00<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th>Service<\/th><th class=\"has-text-align-right\" data-align=\"right\">Standard voices<br>(<strong>free characters per month<\/strong>)<\/th><th class=\"has-text-align-right\" data-align=\"right\">Neural voices<br>(<strong>free characters per month<\/strong>)<\/th><\/tr><\/thead><tbody><tr><td>Azure Text-to-Speech<\/td><td class=\"has-text-align-right\" data-align=\"right\">5 million<\/td><td class=\"has-text-align-right\" data-align=\"right\">500,000<\/td><\/tr><tr><td>Google Cloud Text-to-Speech<\/td><td class=\"has-text-align-right\" data-align=\"right\">4 million<\/td><td class=\"has-text-align-right\" data-align=\"right\">1 million<\/td><\/tr><tr><td>Amazon Polly<\/td><td class=\"has-text-align-right\" data-align=\"right\">4 million<\/td><td class=\"has-text-align-right\" data-align=\"right\">1 million<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-further-considerations\">Further considerations<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In addition to the characteristics we\u2019ve already discussed, there are other aspects to factor in when approaching a third-party service, and these are no exception. 
For example: which libraries are available for your preferred programming language?<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"960\" height=\"640\" data-attachment-id=\"2497\" data-permalink=\"https:\/\/ellycode.com\/en\/blog-en\/have-a-chat-with-our-application\/attachment\/programming\/\" data-orig-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/programming.jpg?fit=960%2C640&amp;ssl=1\" data-orig-size=\"960,640\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"programming\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/programming.jpg?fit=300%2C200&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/programming.jpg?fit=960%2C640&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/01\/programming.jpg?resize=960%2C640&#038;ssl=1\" alt=\"\" class=\"wp-image-2497\" srcset=\"https:\/\/ellycode.com\/wp-content\/uploads\/2021\/01\/programming.jpg 960w, https:\/\/ellycode.com\/wp-content\/uploads\/2021\/01\/programming-480x320.jpg 480w\" sizes=\"auto, (min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 960px, 100vw\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For one thing, streaming recognition for Amazon Transcribe is only available in the Java SDK, but can be implemented in any programming language following the instructions.<\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\">Another important aspect is the possibility of using these tools within your own infrastructure. <strong>Microsoft has made a Docker image available (currently in preview) that provides all the speech service functions that are also found in the Cloud<\/strong>. Google, on the other hand, offers only speech-to-text on-premises, via platforms like Anthos or GKE.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/cognitive-services\/speech-service\/speech-container-howto\" target=\"_blank\" rel=\"noreferrer noopener\">Azure Speech Container How-to<\/a><br><a href=\"https:\/\/cloud.google.com\/speech-to-text\/on-prem\" target=\"_blank\" rel=\"noreferrer noopener\">Google Cloud Speech-to-Text On-Prem<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You can find updated listings for prices and supported functions on the pages of the respective providers:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/azure.microsoft.com\/it-it\/services\/cognitive-services\/speech-to-text\" target=\"_blank\" rel=\"noreferrer noopener\">Azure Cognitive Services Speech-to-text<\/a><br><a href=\"https:\/\/azure.microsoft.com\/it-it\/services\/cognitive-services\/text-to-speech\/\" target=\"_blank\" rel=\"noreferrer noopener\">Azure Cognitive Services Text-to-speech<\/a><br><a href=\"https:\/\/cloud.google.com\/speech-to-text\" target=\"_blank\" rel=\"noreferrer noopener\">Google Speech-to-text<\/a><br><a href=\"https:\/\/cloud.google.com\/text-to-speech\" target=\"_blank\" rel=\"noreferrer noopener\">Google Text-to-speech<\/a><br><a href=\"https:\/\/aws.amazon.com\/it\/transcribe\/\" target=\"_blank\" rel=\"noreferrer noopener\">Amazon Transcribe<\/a><br><a href=\"https:\/\/aws.amazon.com\/it\/polly\/\" 
target=\"_blank\" rel=\"noreferrer noopener\">Amazon Polly<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-conclusion\">Conclusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We took just a quick look at what is available in the world of artificial intelligence. These tools, which are constantly improving, simplify the use of speech and create a lot of opportunities. We just have to make the most of them when building the applications of the future.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you\u2019re interested in knowing how we use them, stay tuned!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While developing our virtual assistant, the vocal component played a fundamental role right from the start. Vocal interaction is the main element in building a more natural User eXperience, which consists mainly of two elements: vocal synthesis (text-to-speech), with which we can say something to our interlocutor, and voice recognition (speech-to-text), to extract text 
[&hellip;]<\/p>\n","protected":false},"author":195423238,"featured_media":2570,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","_et_gb_content_width":"","_coblocks_attr":"","_coblocks_dimensions":"","_coblocks_responsive_height":"","_coblocks_accordion_ie_support":"","_crdt_document":"","inline_featured_image":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[688637375],"tags":[],"class_list":["post-2458","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog-en"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/ellycode.com\/wp-content\/uploads\/2021\/02\/1105x656_blog_Applicazione-B.png?fit=1105%2C656&ssl=1","jetpack_shortlink":"https:\/\/wp.me\/pcuDuD-DE","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/ellycode.com\/en\/wp-json\/wp\/v2\/posts\/2458","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ellycode.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ellycode.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ellycode.com\/en\/wp-json\/wp\/v2\/users\/195423238"}],"replies":[{"embeddable":true,"href":"https:\/\/ellycode.com\/en\/wp-json\/wp\/v2\/comments?post=2458"}],"version-history":[{"count":38,"href":"https:\/\/ellycode.com\/en\/wp-json\/wp\/v2\/posts\/2458\/revisions"}],"predecessor-version":[{"id":2460,"href":"https:\/\/ellycode.com\/en\/wp-json\/wp\/v2\/posts\/2458\/revisions\/2460"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ellycode.com\/en\/wp-json\/wp\/v2\/media\/2570"}],"wp:attachment":[{"href":"https:\/\/ellycode.com\/en\/wp-json\/wp\/v2\/media?parent=2458"}],"wp
:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ellycode.com\/en\/wp-json\/wp\/v2\/categories?post=2458"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ellycode.com\/en\/wp-json\/wp\/v2\/tags?post=2458"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}