New project from CMUSphinx (aka Vosk)

Hi!
I’ve played a little with this new API:
Looks promising.
I’d like to test it with Rhasspy, but I have no clue how to proceed.
Looking at the docs for the Hermes protocol and the Rhasspy source only leaves me more confused.

Any advice from you fellow devs?

Vosk seems to use Kaldi under the hood, and the acoustic models it provides are the same as the ones used by Rhasspy’s Kaldi service.

I do not think you’ll get any difference between Vosk and Rhasspy’s Kaldi implementation regarding accuracy.

@Aymux Regarding ASR, the Hermes protocol is pretty simple: start a decoding session when you receive hermes/asr/startListening, subscribe to hermes/audioServer/<siteId>/audioFrame to get the audio chunks to push into Vosk, and when Vosk detects an endpoint, send hermes/asr/textCaptured. I never really understood the hermes/asr/stopListening topic though.
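To make that flow concrete, here is a minimal sketch in Python, not a tested Rhasspy service. It assumes each audioFrame payload is a small, complete WAV chunk, that the recognizer exposes `AcceptWaveform`, `Result` and `FinalResult` the way Vosk’s `KaldiRecognizer` does, and that hermes/asr/stopListening can simply be treated as the end of the utterance. The `AsrSession` class and the `recognizer_factory` parameter are made-up names for illustration, and the MQTT wiring (subscribing and publishing) is left out.

```python
import io
import json
import wave


class AsrSession:
    """Hypothetical helper: one Hermes ASR session for a single site."""

    def __init__(self, site_id, recognizer_factory):
        self.site_id = site_id
        # Assumption: the factory returns an object with Vosk-style
        # AcceptWaveform / Result / FinalResult methods.
        self.recognizer_factory = recognizer_factory
        self.recognizer = None
        self.session_id = None

    def handle(self, topic, payload):
        """Process one MQTT message, return a list of (topic, payload) to publish."""
        if topic == "hermes/asr/startListening":
            msg = json.loads(payload)
            if msg.get("siteId", "default") == self.site_id:
                # Start a fresh decoding session.
                self.recognizer = self.recognizer_factory()
                self.session_id = msg.get("sessionId")
            return []

        if topic == f"hermes/audioServer/{self.site_id}/audioFrame":
            if self.recognizer is None:
                return []  # not listening, drop the audio
            # Strip the WAV header and keep the raw PCM samples.
            with wave.open(io.BytesIO(payload), "rb") as wav:
                pcm = wav.readframes(wav.getnframes())
            # A true return value means the recognizer detected an endpoint.
            if self.recognizer.AcceptWaveform(pcm):
                return [self._finish(self.recognizer.Result())]
            return []

        if topic == "hermes/asr/stopListening":
            msg = json.loads(payload)
            listening = self.recognizer is not None
            if listening and msg.get("siteId", "default") == self.site_id:
                # Assumption: treat a stop request like a detected endpoint.
                return [self._finish(self.recognizer.FinalResult())]
            return []

        return []

    def _finish(self, result_json):
        """Build the textCaptured message and close the session."""
        text = json.loads(result_json).get("text", "")
        message = {
            "text": text,
            "siteId": self.site_id,
            "sessionId": self.session_id,
        }
        self.recognizer = None
        self.session_id = None
        return ("hermes/asr/textCaptured", json.dumps(message))
```

With an MQTT client, each incoming message would be passed to `handle` and every returned pair published back to the broker.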

It may be a good idea to substitute Vosk for the bash Kaldi scripts for better integration (the documentation is not very clear though).

@synesthesiam What do you think?

The speaker recognition is based on Kaldi x-vectors. I’m wondering how accurate the predictions are… This is interesting :face_with_monocle:

The hermes/asr/stopListening topic seems to be there for timeout purposes in the dialogue manager. I treat it as if the STT system has detected an endpoint.

The relevant messages for speech recognition are here. It can get complicated because Rhasspy’s STT services also generate language models during training.

An easier option is to use the Command STT system and wrap your program in a script that takes in a WAV file (stdin) and outputs a transcription (stdout). The rest will be handled automatically by rhasspy-remote-http-hermes.
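As a rough illustration of such a wrapper, here is a sketch in Python that reads a WAV file from stdin and prints the transcription to stdout. It relies on Vosk’s `Model` and `KaldiRecognizer` classes and their `AcceptWaveform` and `FinalResult` methods. The model directory `model`, the 4000-frame chunk size and the helper names `transcribe_wav` and `make_recognizer` are assumptions made for the example, not anything Rhasspy prescribes.

```python
import io
import json
import sys
import wave


def transcribe_wav(wav_bytes, make_recognizer):
    """Decode a complete WAV file and return the recognized text.

    make_recognizer is a hypothetical callback that takes the sample rate
    and returns an object with Vosk-style AcceptWaveform / FinalResult.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        recognizer = make_recognizer(wav.getframerate())
        while True:
            pcm = wav.readframes(4000)  # chunk size is an arbitrary choice
            if not pcm:
                break
            recognizer.AcceptWaveform(pcm)
    # FinalResult returns a JSON string; the transcription is under "text".
    return json.loads(recognizer.FinalResult()).get("text", "")


def main():
    # Imported here so the helper above can be tried without Vosk installed.
    from vosk import Model, KaldiRecognizer

    model = Model("model")  # assumption: a Vosk model unpacked in ./model
    wav_bytes = sys.stdin.buffer.read()
    print(transcribe_wav(wav_bytes, lambda rate: KaldiRecognizer(model, rate)))
```

Calling `main()` at the bottom of the file turns this into a script that could be set as the Command STT program.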