Vosk
I’m pretty enthusiastic because latencies are very low (~50 ms to ~500 ms on my PC). See some tests in my project voskjs, a simple Vosk-api Node.js wrapper. Language coverage is similar to DeepSpeech (now Coqui STT), but Vosk is anywhere from twice as fast to an order of magnitude faster (e.g. using small models with grammars, you get a few tens of milliseconds for sentences of a few words)!
If confirmed, with a minor update to voskjshttp, Rhasspy users could use it to test Vosk ASR.
Coqui STT
As you know, it is a recent DeepSpeech fork, made by the DeepSpeech core team developers after Mozilla’s “suspension” of DeepSpeech (no polemics intended). I made CoquiSTTjs, a simple draft Node.js wrapper.
BTW, Coqui STT, like DeepSpeech and pretty much all ASRs, is CPU-intensive and unfortunately runs on a single thread; see: https://github.com/coqui-ai/STT/discussions/1870. I’ll try to implement a multi-thread/multi-process server architecture as part of my CoquiSTTjs project. If I succeed, I’ll try to implement a “Remote HTTP Server” interface.
Hi Giorgio! Sorry for the delay in my response, as always.
The “Remote HTTP Server” option should work just fine (it looks like you got it working). I’m also working on integrating Vosk directly into Rhasspy as a Hermes MQTT service, but there will be a limited set of supported models and options. So having your server as an option is great.
Coqui STT (and TTS) are definitely things I’d like to support. I would consider these as replacements for DeepSpeech and MozillaTTS down the road.
NVIDIA has also released v1 of their NeMo framework, which looks interesting. Their QuartzNet ASR model is what’s under the hood in @DANBER’s Scribosermo, which I’d also like to use in Rhasspy and train models for. The advantage I see of Scribosermo here is that Daniel has an excellent training infrastructure and can export TFLite models (including quantized versions).
Yes, the “Remote HTTP Server” option, following the “attribute-less” interface you proposed, delegates all features to the server, such as:
the natural language (“en-US”, “it-IT”, etc.)
the specific ASR-dependent model
etc.
The server decides all transcription options. This has pros and cons.
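To make the “attribute-less” idea concrete, here is a minimal sketch of what a client request could look like: the body is just WAV audio, with no language or model parameters, so every decoding choice is left to the server. The URL below is hypothetical, and the helper names are made up for illustration; this only builds the request and does not send it.

```python
import io
import urllib.request
import wave


def make_silent_wav(seconds=0.1, rate=16000):
    """Return a short mono 16-bit WAV clip of silence, as bytes (test data only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


def build_transcribe_request(url, wav_bytes):
    """Build a POST request carrying only the audio.

    No language or model attributes are sent: the server picks them.
    """
    return urllib.request.Request(
        url,
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


# Hypothetical endpoint; replace with the address of your own server.
request = build_transcribe_request(
    "http://localhost:3000/transcript", make_silent_wav()
)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` and reading the response body, whose format (plain text or JSON) depends on the server.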
I’ll try to develop and share a simple demo “Remote HTTP Server” for Coqui STT. The problem here is that the Coqui STT (= DeepSpeech) decoder architecture is single-threaded (at least in the CPU version), so I have to set up a multi-process server. I’ll share the results here.
Yes, that’s interesting! BTW, currently, as usual, NVIDIA has supplied an English language model first and very few other languages (no Italian), so my enthusiasm is on hold for now.
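The multi-process idea can be sketched roughly as below: a pool of worker processes, each loading its own model once, so several single-threaded decodes can run in parallel on different cores. This is only an illustration in Python, not the CoquiSTTjs design; the model loading and decoding are placeholders (no real STT library is called), and all function names are hypothetical.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

_model = None  # one model instance per worker process


def _load_model():
    """Worker initializer: runs once in each worker process."""
    global _model
    # Placeholder: a real server would load the STT model here.
    _model = {"name": "placeholder-model"}


def _transcribe(audio: bytes) -> str:
    """Placeholder decode: a real worker would run the single-threaded decoder."""
    return f"{len(audio)} bytes decoded by {_model['name']}"


def transcribe_all(clips, workers=2, start_method=None):
    """Decode clips in parallel, one single-threaded decode per worker process.

    start_method=None uses the platform default ("fork", "spawn", ...).
    """
    ctx = multiprocessing.get_context(start_method)
    with ProcessPoolExecutor(
        max_workers=workers, mp_context=ctx, initializer=_load_model
    ) as pool:
        # map() preserves input order in its results
        return list(pool.map(_transcribe, clips))
```

An HTTP front end would then hand each incoming audio body to the pool. On platforms that start workers with "spawn", call `transcribe_all` from under an `if __name__ == "__main__":` guard.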
If you have an NVIDIA GPU, you could train an Italian model yourself, if you are interested. Scribosermo already has some tools for Italian datasets (I trained an Italian DeepSpeech model before, with ~300h of data) and training is quite fast (about 3 days on 2x1080Ti for 700h of German). I would estimate you can reach 10-15% WER with it.
Hi Daniel, a WER of 11.5 is great, especially given the low number of hours available for Italian (if I understand correctly)! Thanks for the update. I’ll dig deeper and test asap.
Just a question: is this the Italian counterpart of the .pbmm file: saved_model?
BTW, I’m curious about how you calculated the WER, I’ll read your doc asap.
Not quite. You can use the linked file (with the other files in the parent directory) with the full TensorFlow runtime, but I’d recommend using the .tflite files.
The models are not compatible with DeepSpeech; see the extras/exporting directory of the repo for a usage example.
Using the algorithm from DeepSpeech :) and in combination with a language model.
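For readers unfamiliar with the metric, WER is commonly computed as the word-level edit (Levenshtein) distance between the reference transcript and the recognized text, divided by the number of reference words. The sketch below shows that standard definition; it is not copied from the DeepSpeech or Scribosermo code, which may differ in details such as text normalization or how results are aggregated over a test set.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, 1):
        cur = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, 1):
            cur.append(
                min(
                    prev[j] + 1,             # deletion
                    cur[j - 1] + 1,          # insertion
                    prev[j - 1] + (r != h),  # substitution (or match)
                )
            )
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of one hypothesis against one reference sentence."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


# One deleted word ("the") and one substitution ("light" -> "lights")
# out of five reference words.
score = wer("turn on the kitchen light", "turn on kitchen lights")
```

Here `score` is 2 errors over 5 reference words. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.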