Vosk/Coqui STT ASR integration

Hi Michael, all

Reading here: https://rhasspy.readthedocs.io/en/latest/speech-to-text/, I see Vosk and Coqui STT are missing, and I’d suggest listing/integrating them.

  • Vosk
    I’m pretty enthusiastic because latencies are very low (~50 ms to ~500 ms on my PC). See some tests in my project voskjs, a simple Vosk-api Node.js wrapper. Language coverage is similar to DeepSpeech (now Coqui STT), but latency ranges from twice as fast to an order of magnitude faster (e.g. using small models with grammars, you get a few tens of milliseconds for short sentences)!

    In voskjs I implemented the simple HTTP demo server voskjshttp. I’d be glad to extend it to be used in Rhasspy. Can you confirm that the integration could be done with the “Remote HTTP Server”, as described here: https://rhasspy.readthedocs.io/en/latest/speech-to-text/#remote-http-server and here: https://rhasspy.readthedocs.io/en/latest/reference/#http-api ?
    All pretty clear, but I have a question: using the “Remote HTTP Server” way, the client cannot specify any parameters in the POST request. Right?

    If confirmed, with a minor update to voskjshttp, Rhasspy users could use it to test the Vosk ASR.

  • Coqui STT
    As you know, it’s a recent DeepSpeech fork, made by the DeepSpeech core team developers after the Mozilla “suspension” of DeepSpeech (no polemics intended). I made CoquiSTTjs, a simple/draft Node.js wrapper.

    BTW, Coqui STT, like DeepSpeech and pretty much all ASRs, is CPU-consuming and unfortunately works on a single thread, see: https://github.com/coqui-ai/STT/discussions/1870. I’ll try to implement a multi-thread/multi-process server architecture as part of my CoquiSTTjs project. If I succeed, I’ll try to implement a “Remote HTTP Server” interface.

Thoughts?
Thanks
giorgio


Here’s a draft solution to integrate the Vosk ASR into a Rhasspy home server, following the Remote HTTP Server interface:

Hi giorgio! Sorry for the delay in my response, as always :confused:

The “Remote HTTP Server” option should work just fine (it looks like you got it working). I’m also working on integrating Vosk directly into Rhasspy as a Hermes MQTT service, but there will be a limited set of supported models and options. So having your server as an option is great :slight_smile:

Coqui STT (and TTS) are definitely things I’d like to support. I would consider them as replacements for DeepSpeech and MozillaTTS down the road.

NVIDIA has also released v1 of their NeMo framework, which looks interesting. Their QuartzNet ASR model is what’s under the hood in @DANBER’s Scribosermo, which I’d also like to use in Rhasspy and train models for. The advantage I see in Scribosermo here is that Daniel has an excellent training infrastructure and can export TFLite models (including quantized versions).

Yes, the “Remote HTTP Server” option, following the “attribute-less” interface you proposed, delegates all settings to the server, such as:

  • the natural language (“en-US”, “it-IT”, etc.)
  • the specific ASR-dependent model
  • etc.

The server decides all transcription options. This has pros and cons.
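Concretely, the Rhasspy profile would then only carry the server URL, with no language or model attributes at all (a sketch based on the Rhasspy profile layout; the field names and the endpoint URL here are assumptions to be double-checked against the docs):

```json
{
  "speech_to_text": {
    "system": "remote",
    "remote": {
      "url": "http://localhost:3000/"
    }
  }
}
```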

I’ll try to develop and share a simple demo “Remote HTTP Server” for Coqui STT. The problem here is that the Coqui STT (= DeepSpeech) decoder architecture is single-threaded (at least in the CPU version), so I have to set up a multi-process server. I’ll share results here.

Yes, that’s interesting! BTW, as usual, NVIDIA supplied an English model first and very few other languages (no Italian), so my enthusiasm is on hold for now :slight_smile:

Thanks


If you have an NVIDIA GPU, you could train an Italian model yourself, if you are interested. Scribosermo already has some tools for Italian datasets (I did train an Italian DeepSpeech model before, with ~300 h of datasets) and training is quite fast (about 3 days on 2x1080Ti for 700 h of German). I would estimate you can reach 10-15% WER with it.


Thanks Daniel,

No, I don’t have an NVIDIA GPU and I’m not too much into the training side of the world. Thanks anyway for your interesting link. Huge work.

BTW, I confess that the page https://gitlab.com/Jaco-Assistant/Jaco-Master/-/blob/master/HowJacoWorks.md has been open in my browser for weeks! :slight_smile:

On my side, my possible contribution here is to set up a Coqui STT server, extending
https://github.com/solyarisoftware/coquisttjs with a multi-process architecture.

Keep in touch.

About using Vosk and Coqui STT ASRs as Rhasspy remote HTTP servers,


@solyarisoftware Scribosermo now has an Italian model too, reaching a WER of 11.5%.


Hi Daniel, 11.5% WER is great, also considering the low number of hours available for the Italian language (if I understand correctly)! Thanks for the update. I’ll dig in and test ASAP. :raised_hands:

Just a question: is the latest Italian .pbmm file this one: saved_model ?

BTW, I’m curious about how you calculated the WER; I’ll read your doc ASAP.

Not quite: you can use the linked file (together with the other files in the parent directory) with the full TensorFlow runtime, but I’d recommend using the .tflite files.
The models are not compatible with DeepSpeech; see the extras/exporting directory of the repo for a usage example.

Using the algorithm from DeepSpeech :) and in combination with a language model.
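For reference, the usual WER definition, word-level Levenshtein distance (substitutions + insertions + deletions) divided by the reference word count, can be sketched like this. This is the generic metric, not Daniel’s exact evaluation script:

```javascript
// Word error rate: word-level edit distance / reference length.
function wer (reference, hypothesis) {
  const ref = reference.split(/\s+/).filter(Boolean)
  const hyp = hypothesis.split(/\s+/).filter(Boolean)
  // dp[i][j] = edit distance between first i ref words and first j hyp words
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0)
  )
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,       // deletion
        dp[i][j - 1] + 1,       // insertion
        dp[i - 1][j - 1] + sub  // substitution or match
      )
    }
  }
  return dp[ref.length][hyp.length] / ref.length
}
```

For example, `wer('the cat sat', 'the hat sat')` gives 1/3: one substitution over three reference words.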

@solyarisoftware I found some errors in the shared models; they are fixed now, but you need to download them again.