Mycroft Precise models for sharing?

System: Raspberry Pi 4 + PlayStation Eye + Rhasspy 2.4.40.

Does anybody have any trained models for Mycroft Precise (wake word) that they are willing to share?

I want to try Precise because it seems that pocketsphinx isn’t working and I want to use a pure off-line system with no non-free software (I want to use only Free/Libre/Open Source software).

I don’t mind the chosen wake word, but I’d like to have something to test drive before getting too involved.

@ulno and I are in contact with Mycroft to develop a shared repo of sample wake word data so we can produce some good models :slight_smile:

I’d also like to bake in Precise wake word training into Rhasspy in the future, but I need to learn best practices with this first.

Really good news !!
Does Precise work on a Pi Zero? Even with training/custom model generation done on a Pi 4.

Not officially, unfortunately. Someone has supposedly compiled it with Tensorflow Lite, but Mycroft does not support it yet.

I’m working on remote wakeword detection for just this reason.

IMHO, “hey mycroft” currently works best with Mycroft Precise. I also tried out “athena”, which works reasonably well but not as well as the default “hey mycroft”.

?

I guess I would prefer to have Pi 3/4 satellites over continuous network streaming.
The Pi 4 is also so much faster to start and update, has wired networking etc. I will get a few more to test that.

But making a custom wake word with Precise seems to be a really hard way to go :face_with_monocle:

Does Mycroft Precise + Rhasspy 2.5 support multiple custom wakewords ?
I need one per family member :wink:

The goal is to have something like the snowboy website, where users could search for wakewords to contribute to (by language, etc.) as well as download existing models.

The GitHub process is probably a bit much for less technical users.

Something like a Rhasspy or Precise ‘Common Voice’ would help, as gender, age and dialect all likely need their own profiles.
Can you use completed intent?

@KiboOst yeah, I prefer the satellite mode, as you can invest in hardware in a single unit.
Also, when it comes to Average Joes and Average Homes, it’s quite likely a single central unit could service multiple satellite rooms.
The diversification of use means that for the most part processed intent is being processed, but wake word, STT and TTS are often idle.
For an ‘Average Home’ the use clashes could be quite small.

I am waiting for some Pi soundcards. I know the project is called R-Hass-Py and I love what the Pi is, but when it comes to audio processing and Tensorflow there are a lot of tempting alternatives.

I had a look at the streaming and dunno if it’s my Rhasspy noobness, but it just looked like a UDP named pipe.
Not that great for WiFi, and when you start adding multiple streams it quickly adds up.
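To put a rough number on "it quickly adds up", here is a back-of-envelope sketch. It assumes uncompressed 16 kHz, 16-bit mono PCM (the same format as the arecord command used in this thread) and ignores UDP/IP packet overhead; the function name is just made up for illustration.

```python
def stream_kbps(sample_rate_hz=16000, bits_per_sample=16, channels=1):
    """Raw PCM payload bitrate of one audio stream in kbit/s (no packet overhead)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# Total payload bitrate for a few satellite counts (assumed values, not measurements).
for satellites in (1, 4, 8):
    total = satellites * stream_kbps()
    print(f"{satellites} stream(s): {total:.0f} kbit/s")
```

So one raw stream is 256 kbit/s before any overhead, which is why only streaming above a VAD threshold, or using a codec, looks attractive on WiFi.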

But really interested in playing with and expanding on this.
https://roc-project.github.io/
Looks interesting for gaining reliability over WiFi; even Speex has libs that can help with UDP packet loss.

If satellites can do some form of VAD/AEC and stream only above a threshold… I am going Corona stir crazy as some hardware is taking an age to turn up.
Speex looks a good bet, as along with the codec it also has VAD/AEC, even if in separate libs (Speex/SpeexDSP).
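As a rough illustration of "stream above a threshold", here is a minimal energy gate in plain Python. It is not what SpeexDSP does (a real VAD is much smarter than an RMS threshold); the frame format (16-bit little-endian mono PCM), the threshold of 500 and the hangover of 5 frames are all assumptions picked for the sketch.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """RMS level of one frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def gate_frames(frames, threshold=500.0, hangover=5):
    """Yield frames at or above the threshold, plus `hangover` trailing frames.

    The hangover keeps the quiet tail of a word from being cut off.
    """
    remaining = 0
    for frame in frames:
        if frame_rms(frame) >= threshold:
            remaining = hangover
            yield frame
        elif remaining > 0:
            remaining -= 1
            yield frame
```

A satellite would then only send the frames that come out of gate_frames() over the network, and stay silent the rest of the time.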

Wide array microphones (satellites) can be more accurate than small array microphones and maybe could give a software solution.

The Rhasspy satellite mode is a really good idea, and I’m going to play with some ideas on preprocessors and network streaming.

PS can you run Deepspeech with a dedicated language model, with only those keywords?
Kaldi can apparently https://pypi.org/project/kaldi-spotter/

Here in Brazil a Raspberry Pi 4 will easily set you back USD 120, an esp32 you can find between 10 and 15 USD, an esp8266 for less than USD 5 (that can even serve its own dedicated wifi wakeword network for 8 clients). So, the option to use something to also stream the wakeword is well appreciated.

Yeah, even in the UK multiple satellites can get expensive for many, especially across multiple rooms.
A Pi 4 is £35, but that’s before you add the SD card, heatsink and PSU, and one critical thing for media playback is a soundcard with AEC/VAD.

The critical thing for media playback is being able to stop it, and without AEC/VAD your voice input will be packed with echo.
I have an ESP32-Audio-Kit on my desk that I have been staring blankly at for some time, and now realise that even if I did start dev, without AEC/VAD it will be no good for many purposes.
Should have probably got one of these:
https://docs.espressif.com/projects/esp-adf/en/latest/get-started/get-started-esp32-lyratd-msc.html

But then again, even if it’s half the cost of an equivalent Pi system, it can quickly add up and is still far from cheap. Espressif are doing much in this arena though.
https://www.espressif.com/en/products/software/esp-skainet/overview

The only cheap system I know of with AEC/VAD is the RK3308, and Radxa provide one.

With the built-in audio codec with VAD, 512 MB and WiFi/BT, at $13.99 it is the cheapest I know.
Also more flexible being able to run Debian.

Allwinner also do the R328 and an R329 with an NPU accelerator built in, but the R328, if it were available, prob would be the same price or less, as the chips are extremely low-cost.

I feel Radxa have fudged the release of the RK3308 by trying to make it general purpose when really it has one purpose.
There is a Mic hat in the pipeline but who knows what that will consist of.

There is some really cheap silicon dedicated purely to being a voice capture edge device, and I’ve been wondering if the RK2108 will be even cheaper and maybe also made available by someone.
But globally cost is a big factor and could lead to AI exclusion.

The Chinese have complete units available for domestic market for less than $10 that contain R328 solutions.
https://detail.tmall.com/item.htm?id=569134471494&skuId=4307095677709

On another note, with wake words, I read a really interesting article recently about using Keras.
https://blog.aspiresys.pl/technology/building-jarvis-nlp-hot-word-detection/

It does something really interesting and creates spectrogram images for recognition, which are better suited to convolutional networks that garner big improvements over a DNN.
I don’t know for sure, but I have a hunch that it prob would work with the Google Edge TPU accelerator from Coral, which apparently isn’t much good for Deepspeech, as these are essentially audio images.
The python is here.

You might not run the wake word locally but purely have AEC/VAD streamers, and do wake word detection, and even voice profile biometrics, centrally.

Or cheaper alternatives such as?


Doesn’t deliver to the UK, but the tools look like they should work and the price ain’t that bad.

I’ve finally found out how to download and install the default “hey mycroft” profile. But I’ve found the results… disappointing.

Maybe it was the expectation that it would just work wonders. Maybe it was because I’m not a native English speaker.

Anyway, I still struggle a lot to get Mycroft’s attention: it works around 1 out of 8 tries, more or less.

I’m using the playstation eye with an adjusted asound configuration file.

On the other front, I’m still trying to install Mozilla’s TTS system on the Raspberry Pi4. Yes, I know it is slow (around a few seconds to generate a phrase on a desktop computer). But the sound quality is awesome when compared to espeak, flite, marytts, etc. My idea here is to somehow cache the generated spoken words for reuse.
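The caching idea could look roughly like the sketch below. Everything here is hypothetical: cached_tts, the tts_cache folder and the synthesize callable (anything that takes text and returns WAV bytes) are names invented for the example, not part of Mozilla TTS or Rhasspy. Lower-casing the text before hashing is also an assumption that may not suit every language or voice.

```python
import hashlib
from pathlib import Path


def cached_tts(text, synthesize, cache_dir="tts_cache"):
    """Return the path of a WAV file for `text`.

    `synthesize(text) -> bytes` is only called on a cache miss, so a slow
    TTS engine runs once per distinct phrase.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Normalise so "Hello" and "hello " share one cache entry (assumption).
    key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
    wav_path = cache / f"{key}.wav"
    if not wav_path.exists():
        wav_path.write_bytes(synthesize(text))
    return wav_path
```

Since a voice assistant tends to repeat a small set of phrases, most requests would hit the cache after the first few days of use.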

The PS3 Eye is a funny device to recommend, as the hardware driver problems can make it hard to configure optimally.

What does arecord -r 16000 -f S16_LE test.wav sound like when you play it back with aplay test.wav?
On the Pi 4, alsamixer doesn’t display the PS3 Eye volume, and the Speex AGC that I like doesn’t get compiled because the installed version of SpeexDSP is too old for alsa-plugins.
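If listening to the recording is too subjective, a small script can put numbers on "very low volume". This is just a sketch using the Python standard library; it assumes test.wav is the 16-bit file produced by the arecord command above, and wav_levels is a made-up helper name.

```python
import math
import struct
import wave


def wav_levels(path):
    """Return (peak, rms) of a 16-bit PCM WAV as fractions of full scale (0.0 to 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples (S16_LE)")
        raw = wav.readframes(wav.getnframes())
    count = len(raw) // 2
    if count == 0:
        return 0.0, 0.0
    samples = struct.unpack("<%dh" % count, raw[: count * 2])
    peak = max(abs(s) for s in samples) / 32768
    rms = math.sqrt(sum(s * s for s in samples) / count) / 32768
    return peak, rms
```

Calling wav_levels("test.wav") after speaking at normal distance gives a quick before/after comparison when changing the asound configuration. A peak that stays at a tiny fraction of full scale would point at capture gain rather than the wake word model.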

As a sample, here is my /etc/asound.conf, and the recording volume is still very low:

pcm.array {
 type hw
 card 1
}

pcm.cap {
 type plug
 slave {
   pcm "array"
   channels 4
   }
 route_policy sum
}

pcm.!default {
    type asym
    playback.pcm "plughw:CARD=ALSA,DEV=0"
    capture.pcm  "cap"
}

To be honest, even though one is on my desk, I’m not at all keen on the PS3 Eye.
Pulseaudio with the PS3 Eye makes a much better job of things but can be a pain with docker.
Dunno why it’s recommended, as the ReSpeaker 2-mic is a couple of $ more and has much better audio out than the Pi, plus a few LEDs and whatnot.
The ebay clones are really cheap.

But I have an English voice, and on my Pi with a standard asound.conf it isn’t good unless I hold the mic really close or practically shout.
Mine is on my desk but prob destined for the box of bits.

Maybe this can be used to run a wake word CNN on a Pi 0 (or similar)

I am really not sure what the state is with Snips, as Snips was this great open source project with a big community and loads of skills that got sold to Sonos.
It’s prob great, but it’s unclear whether they or we know if it will remain open source with active development.

It’s why I posted https://blog.aspiresys.pl/technology/building-jarvis-nlp-hot-word-detection/ : the examples are .NET-centric, but they use generic libs and methods, not a named product.
It’s from a Google paper https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43969.pdf and because it converts audio into mel filterbank (MFCC-style) spectrograms, it is actually doing image detection.
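To make the "audio as an image" point concrete, here is a minimal log-mel spectrogram sketch using only numpy. It is not the code from the article or the paper; the 25 ms frames, 10 ms hop and 40 mel bands are common values that I am assuming, and all function names are invented for the example.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of the triangle
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[m - 1, k] = (right - k) / (right - center)
    return bank


def log_mel_spectrogram(signal, sample_rate=16000, frame_len=400, hop=160, n_mels=40):
    """Return a (n_frames, n_mels) array: a small 'image' of the audio."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, n=frame_len, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, frame_len, sample_rate).T
    return np.log(mel_energy + 1e-10)  # small offset avoids log(0) on silence
```

One second of 16 kHz audio comes out as a 98 x 40 array, which is the kind of 2D input a small CNN can be trained on like any other image.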

That was an Ooooh! That’s interesting, as what’s available seems to be heavily weighted towards image detection, and that might open up some avenues of generic use.
Keras is just a high-level API on top of Tensorflow 2.0 (used from .NET in the article), I think, but the point is what it does and what it can use: it’s generic and not an owned product that might be for sale.
Doesn’t have to be Keras, but it’s Tensorflow 2.0.
Rhasspy seems to be really good, both as an excellent working platform and as a brilliant in-depth working tutorial of many of the voice technologies available. As it says in the article:

…these are shipped as black boxes that cannot teach us a lot.

I don’t really have much knowledge about implementing NLP, but did wonder if Rhasspy should include a generic homebrew alternative to some of the ‘black box’ products?
All I know is that many of the available SoC NPU units are not capable of running the Tensorflow model of Deepspeech, so it’s a GPU or software choice.
I am making an assumption, but I think I might be on the right track: because it uses spectrograms (basically images) and a CNN, it should allow the use of far more commonly available SoC NPU/AI accelerators.

How good it is I just don’t know, but the reason I posted, and why I am replying with this again, is that it is generic and the opposite of certain products that are, or potentially could become, ‘black boxes’. It could maybe teach a lot, and it’s aimed at performance on the low-end hardware that we use.
“Convolutional Neural Networks for Small-footprint Keyword Spotting”: who knows, but it seems to fit Rhasspy’s direction.

I was wondering if…

The best performing system cnn-one-fstride4 gives a 27% relative improvement in clean and 29% relative improvement in noisy over the DNN at the operating point of 1 FA/hr.

The only benchmark I know that often quotes that 1 FA/hr is Porcupine, so is that saying this method is almost a 30% improvement over Porcupine?
[EDIT] Prob not, as after looking it’s 1 FA per 10 hr that Porcupine publishes.
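For anyone else comparing these numbers, a tiny sketch of the arithmetic. Two assumptions: I am reading the "relative improvement" as a relative reduction in the miss (false reject) rate at a fixed false alarm rate, and the 10% baseline below is a made-up figure purely for illustration, not a number from the paper or from Porcupine.

```python
def apply_relative_improvement(baseline_rate, relative_improvement):
    """Error rate left after a relative reduction (0.27 means 27%)."""
    return baseline_rate * (1.0 - relative_improvement)


def expected_false_alarms(rate_per_hour, hours):
    """Expected number of false activations over a listening period."""
    return rate_per_hour * hours


# Hypothetical baseline: a DNN that misses 10% of wake words.
improved = apply_relative_improvement(0.10, 0.27)
print(f"miss rate after 27% relative improvement: {improved:.3f}")

# Operating points are not interchangeable: compare a full day of listening.
print("1 FA/hr over 24 h:", expected_false_alarms(1.0, 24))
print("1 FA/10 hr over 24 h:", expected_false_alarms(0.1, 24))
```

The second part is why the two benchmarks can’t be compared directly: a system tuned to 1 FA per 10 hours is running at a much stricter operating point than one tuned to 1 FA per hour.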

…model did 94.44% in its best run

That’s pretty damn high, isn’t it, just for a generic model or black box?

Nice to hear that it works for you - it’s a pity it works so badly. I think if you train it yourself the results should be better.

Did you install Precise in a Rhasspy docker environment?
I haven’t been able to get Precise for Rhasspy to work. Can you share how you did it?

Just install precise from the web ui. Then you must follow those links for community pre-trained models and download both files. Then put them in the indicated folder and voilà, it should work.
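If it still doesn’t trigger after that, one quick sanity check is whether both files actually landed in the folder. The sketch below assumes the usual Precise naming, where a model is a pair like hey-mycroft.pb and hey-mycroft.pb.params; the helper name is made up, and the folder to pass in is whichever one the web UI indicates for your profile.

```python
from pathlib import Path


def missing_model_files(model_dir, model_name="hey-mycroft"):
    """Return the expected Precise model files that are not present in model_dir."""
    expected = [f"{model_name}.pb", f"{model_name}.pb.params"]
    return [name for name in expected if not (Path(model_dir) / name).is_file()]
```

An empty list means both files are in place; anything else names the file that was forgotten or saved under a different name by the browser.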

The project called “ProjectAlice” is using some snips open source components with their voice assistant implementation. For English, they perform really well, even for me! They also have a repository of skills that you can install from the web ui.

But there are many things still bothering me about all those attempts…Maybe we should be contributing more to each other.

The default TTS system for Portuguese just sucks. Nobody outside technical fields can understand a word it speaks. So I started to do some research. Mozilla’s TTS has a very good voice (it is based mainly on Tacotron 2 + a choice of vocoders). The problem is that it takes too long (up to 10s on my desktop computer) for inference (to generate the sound).

There is FastSpeech, from Microsoft Research, with at least two independent free software implementations on github. It sounds almost as good as Tacotron 2, but it is supposed to be 20x faster. Maybe it would be a good choice to run on a Raspberry Pi. I’m still trying to build it.

Really? I bought the PS3 Eye because of that Snips review (linked to from some post on this forum). There seems to be a way of increasing its gain. I’ll take a look at the configuration from ProjectAlice, because they have it working out of the box.

I thought that the ReSpeaker v2.0’s strength was in its AEC implementation, which only works when its output channel is limited to 16 kHz.