Speech recognition in Android apps: an overview and a French attempt

thibault ketterer

Sep 5, 2019

I was wondering how I could make good use of speech-to-text (STT) technologies to help my kid, who has autism, to speak.

So I tried a lot of them. Here is my overview.

TL;DR: use Android’s native STT if you can bear the *BIP* sound.

Here is the list of techs I tried:

Google free STT (included in any Android app)
Amazon STT (paid) (needs Java, TK github link)
Google Cloud API STT: https://cloud.google.com/speech-to-text/
Microsoft STT API: https://azure.microsoft.com/en-in/services/cognitive-services/speech-to-text/
PocketSphinx: no demo, but see https://cmusphinx.github.io/wiki/tutorialpocketsphinx/ and https://github.com/cmusphinx/pocketsphinx
Google Chrome embedded STT: https://www.google.com/chrome/demos/speech.html

First, my goal was to build an Android app with built-in STT that does single-word recognition only (no context, no phrases) and runs continuously.

The app is here; this demo uses Google’s free native STT: http://blog.lankou.org/premiere-demo.html (in French)

Google Android native speech to text

Google Android speech to text is REALLY easy to set up: it’s only about 10 lines of code, plus some promises and handlers.

It works OK, but:

there is a sound every time the STT engine starts
the online version has WAY better results than the offline one
sometimes it fails for no reason, so you have to handle that and relaunch
it’s not really made for continuous speech to text
it’s a bit slow

Anyway, for me it’s the best option for a native Android app: it also works well in French and is OK for most purposes.

For my special purpose it was not that good. Getting continuous recognition means relaunching speech recognition (BIP) every x seconds, handling many strange errors and timeouts, adding sleep(x) calls to get things running smoothly, and adding concurrency control on the resulting callbacks.
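The relaunch-and-retry idea can be sketched in plain Python, independent of Android. Everything here is hypothetical: `FlakyRecognizer` stands in for the real engine, and the names, the error type and the pause value are assumptions made for illustration, not the Android API. Sessions run one after another, which sidesteps the callback concurrency problem rather than solving it.

```python
import time


class RecognitionError(Exception):
    """Stand-in for the strange errors a real engine may raise."""


class FlakyRecognizer:
    """Hypothetical engine: replays a scripted list of outcomes, one per session."""

    def __init__(self, outcomes):
        self._outcomes = iter(outcomes)

    def listen_once(self):
        outcome = next(self._outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome  # a recognized word, or None on timeout/silence


def recognize_continuously(recognizer, max_sessions, pause_s=0.0):
    """Relaunch one-shot recognition max_sessions times, collecting words.

    Errors and empty results are not fatal: the loop pauses briefly and
    starts a new session.
    """
    words = []
    for _ in range(max_sessions):
        try:
            word = recognizer.listen_once()
        except RecognitionError:
            word = None  # treat an engine failure like a silent session
        if word is not None:
            words.append(word)
        time.sleep(pause_s)  # small delay before relaunching the engine
    return words


script = ["chat", RecognitionError("busy"), None, "chien"]
print(recognize_continuously(FlakyRecognizer(script), max_sessions=4))
# prints ['chat', 'chien']
```

On a real device each relaunch would also trigger the BIP, so the pause is a trade-off between responsiveness and noise.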

You can tune it using partial results (onPartialResults) to make it more permissive, if you know what you are looking for.
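One possible reading of “more permissive” is to accept a partial hypothesis as soon as it contains something close to the target word. The sketch below is an illustration in plain Python, not the Android API: on a device the hypotheses would arrive through onPartialResults, while here they are plain strings. The function name and the 0.8 similarity cutoff are assumptions.

```python
import difflib


def matches_target(partial_hypotheses, target, min_ratio=0.8):
    """Return True if any word in any partial hypothesis is close to target.

    min_ratio is an arbitrary similarity cutoff (an assumption); lowering it
    makes the matcher more permissive, at the price of more false positives.
    """
    target = target.lower()
    for hypothesis in partial_hypotheses:
        for word in hypothesis.lower().split():
            ratio = difflib.SequenceMatcher(None, word, target).ratio()
            if ratio >= min_ratio:
                return True
    return False


print(matches_target(["le chats"], "chat"))  # near match on "chats"
print(matches_target(["bonjour"], "chat"))   # nothing close to "chat"
```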

It gave me 70% recognition for my use case.

Amazon STT: AWS Transcribe

I only tried a Java demo of it (TK github link and screenshot).

It gave me good results, but for easier Android integration you have to use Amazon’s Android framework (https://aws.amazon.com/amplify/), which was a no-go for me.

I will post the results for French later (TK french results).

Anyway, when I compare the results from Alexa (I own one) with Google’s free API, Amazon is way behind Google in speech understanding (for now?).

Price: about the same as Google, $0.006 per 10 seconds.

Google Cloud API STT

This is simply the best; nobody matches Google’s results here.

The API:

is not free
is very, very fast
gave me no errors or strange results
has no BIP
is built for continuous recognition
lets you give hints about what you are looking for (like words)

In my tests the results reached about 80% of my goal, clearly better than any other option. Note, though, that I gave the exact word I was looking for as a hint.

Killer features:

It’s the only API that lets you give hints about what you are looking for.
It’s the only API that gave me a result when looking for a single letter (like “a”, with “a” given as a hint).
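To make the hint mechanism concrete, here is a minimal sketch of a request body, assuming the v1 REST `speech:recognize` format with its `speechContexts` field. The helper name, the audio encoding and the sample rate are assumptions for illustration; the sketch only builds the JSON and sends nothing.

```python
import base64
import json


def build_recognize_request(audio_bytes, hints, language_code="fr-FR"):
    """Build a JSON body for a v1 speech:recognize call with phrase hints.

    The encoding and sample rate below are assumptions about the recording;
    adjust them to match the actual audio.
    """
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": language_code,
            # Phrase hints: words the recognizer should favour.
            "speechContexts": [{"phrases": list(hints)}],
        },
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


body = build_recognize_request(b"\x00\x01", hints=["a", "chat"])
print(json.dumps(body, indent=2))
```

Keeping the hint list short and specific to the current exercise matches how I used it: one target word at a time.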

The only drawback is that Android integration is not that easy: the GitHub repository is not up to date, and you have to manually add a Java package to your build (a package that was not made for mobile devices at all).

The price: $0.006 per 15 seconds (every started recognition counts for 15 seconds)
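The effect of that 15-second granularity on many short recognitions can be worked out with a few lines of Python. This is a rough estimate under one assumption of mine: each request is rounded up to whole 15-second blocks. Check the current pricing page before relying on it.

```python
import math

PRICE_PER_BLOCK_USD = 0.006  # price quoted in the text
BLOCK_SECONDS = 15           # billing granularity quoted in the text


def estimate_cost_usd(request_durations_s):
    """Estimate cost when each request is rounded up to whole 15 s blocks."""
    blocks = sum(
        max(1, math.ceil(duration / BLOCK_SECONDS))
        for duration in request_durations_s
    )
    return blocks * PRICE_PER_BLOCK_USD


# 100 short recognitions of 2 seconds each are billed as 100 blocks...
print(round(estimate_cost_usd([2] * 100), 3))  # 0.6
# ...while the same 200 seconds sent as one stream need only 14 blocks.
print(round(estimate_cost_usd([200]), 3))      # 0.084
```

For single-word recognition this matters: restarting recognition for every word costs far more than one long streaming session covering the same audio.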

Anyway, it works really well. You can try Google’s Live Transcribe app (AI augmented) to get an idea: https://play.google.com/store/apps/details?id=com.google.audio.hearing.visualization.accessibility.scribe&hl=en

Microsoft Azure STT

There are specific recognition engines and models for many languages, which you can choose from for better results.

Anyway, I just wanted a pay-as-you-go API call, but the pricing does not work that way at all: a no-go for a small/free app.

Pricing is $1 per audio hour, but you need to keep an instance running for that, so it’s not really “pay as you go”.

If you have a continuous amount of speech to handle, it might be cheaper; give it a try.

Result quality: good, even in French

Pricing calculator: https://azure.microsoft.com/en-in/pricing/calculator/?service=cognitive-services

Pricing: https://azure.microsoft.com/en-in/pricing/details/cognitive-services/speech-services/

PocketSphinx

Integration was the hardest: there are some GitHub (TK) and YouTube video (for Sphinx 4) resources, and you have to piece them together to make it work.

But once it is running, it:

is the fastest speech recognition engine
does full offline recognition
works OK with French too
can work on very small dictionaries (if you are looking for trigger words)
has some strange tuning based on word length
gives lots of false positives

If you don’t really care about precision and false positives, which was the case for me, Sphinx is a really good thing to try.

For my special case, Sphinx was the solution, because it lets my app be fully offline and fast, with no *BIP*, and playing with the quality threshold accepted by the recognition engine may do the trick to get even better results.
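The trade-off behind “playing with the quality” can be shown with a toy example. This is not the PocketSphinx API: the confidence scores are made up and the function name is hypothetical. It only illustrates how a lower acceptance cutoff yields more detections and therefore more potential false positives.

```python
def accepted(hypotheses, min_confidence):
    """Keep the words whose confidence is at or above the chosen cutoff."""
    return [word for word, confidence in hypotheses if confidence >= min_confidence]


# Made-up engine output: (word, confidence between 0 and 1).
hypotheses = [("chat", 0.9), ("chien", 0.6), ("chat", 0.35), ("pain", 0.2)]

print(accepted(hypotheses, min_confidence=0.8))  # strict: ['chat']
print(accepted(hypotheses, min_confidence=0.3))  # permissive: ['chat', 'chien', 'chat']
```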

But that’s only because I can accept false positives.

Quality: 60%

Pricing: open source, yay

PocketSphinx can serve some purposes.

Chrome embedded speech recognition

At one point I thought about making a web app.

I tried STT in Chrome on both desktop and Android; it works great.

On Android it’s exactly the same thing as the free native Google STT API.

On a computer it talks to Google directly over the GQUIC protocol (thanks, Wireshark); it’s very fast and supports continuous speech recognition. The results are good, and you can even code voice-controlled games like this one.

About the Firefox Web Speech API

It used to exist: https://hacks.mozilla.org/2016/01/firefox-and-the-web-speech-api/

but not anymore; they are focusing their work on DeepSpeech.

The future of the Web Speech API looks very interesting and promising;

see https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API/Using_the_Web_Speech_API#Speech_recognition

There is interesting material for getting started, but it will be easier in English than in French: https://github.com/mozilla/DeepSpeech

Live demo here (Chrome only for now):

Speech color changer

mdn.github.io

There is ongoing work on DeepSpeech and the Web Speech API; it’s a long thread, but it has lots of info: https://bugzilla.mozilla.org/show_bug.cgi?id=1248897

Conclusion

In a perfect world, I would want the Google Cloud API to work offline and to offer a quality control, like “I accept false positives in exchange for more results”.

For a quick and simple app: Google’s free embedded Android API
For offline, low quality (fastest): PocketSphinx
For highest quality, long sentences and continuous recognition: Google Cloud API (but I would compare the price with the other options, AWS and Microsoft)
Web embedded speech seems promising!
For now, you cannot rely 100% on speech recognition without a retry loop or continuous recognition

Things I haven’t tried

IBM Watson
Other paid APIs