As some of you may know, I run my tests with snowboy, which works really well but will be shut down at the end of 2020. Of course it will still work, but we won’t be able to generate new custom wakewords.
I’m looking for an engine with three different custom wakewords. One per family member so I can know who is asking something.
snowboy : free, offline, custom wakeword. But EOL !
porcupine : must redo custom wakeword every 30 days
pocketsphinx : doesn’t seem very reliable
precise : seems to work nicely. Training a custom wakeword seems hard though, and it still doesn’t run on the Pi0?
Currently I still run my snips system in production (Pi3 master and Pi0 satellites). The only thing keeping me from running rhasspy in production is the wakeword engine. The Snips one works great with three custom wakewords (snowboy seems even a bit better).
So what is the current consensus regarding a wakeword engine for rhasspy with several custom wakewords?
Is there anything new on this side ? Or something to come ?
The KWS system is indeed the last major piece of the puzzle.
There have been some discussions to port the Snips personal wakeword detection system detailed in this post:
I made a Node.js module following these guidelines with a few tweaks and it is working pretty well in my homemade setup (not Rhasspy).
It should not be too complicated to port this to python but for more efficiency it would need C++ or Rust (at least the features extraction and DTW parts). I think @maxbachmann is looking into it but not sure…
Yes, I am looking into it, but since I have quite a bit to do at work right now I am not sure when I will find the time to implement it. As far as I remember, a couple of the third-party sources it uses are GPL-licensed and therefore need to be reimplemented.
Snips wakeword would be awesome yes.
I saw that ProjectAlice has the dpkg for it, so even if Sonos removes everything it can still be installed and used, with offline generated custom wakewords.
Regarding Porcupine, I contacted them to ask what the price would be for more than one custom wakeword, valid for longer than 30 days, for personal use. Their answer rules out this solution:
Our sales team has received your inquiry and based on the provided information, has determined that this is not, unfortunately, a good fit for us.
Given our limited resources, we have decided to focus on large enterprise prospects with the significant budgets dedicated to developing innovative voice experiences; Due to the high opportunity cost, we are not able to provide our services towards personal projects, early-stage startups, companies working in the ideation stage, or pure proof-of-concept efforts with no clear path to commercialization in the near term.
I’ve got a start on a Python version of the Snips Personal Wakeword Detector. I’m calling it Rhasspy Raven (Hermes service here).
I recorded myself saying “okay rhasspy” 3 times, trimmed up the audio, and exported the WAV files as 16-bit 16 kHz mono. It seems to work OK with a distance threshold of about 38 for me.
It’s a bit CPU hungry right now, so I’m not 100% confident it will run well on a Pi or Pi Zero. I’m not sure I’ve implemented everything correctly either, so there may be a lot of room for improvement. Obviously, a C++ version by the great @maxbachmann will blow it out of the water in terms of speed.
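For anyone preparing their own templates, here is a minimal sketch for checking that an exported file really is 16-bit, 16 kHz mono before using it. It only relies on Python's standard `wave` module. The function names `check_template_format` and `make_silent_wav` are made up for this example and are not part of Raven.

```python
import io
import wave


def check_template_format(wav_file) -> bool:
    """Return True if the WAV (path or file object) is 16-bit, 16 kHz, mono."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getsampwidth() == 2          # 2 bytes per sample = 16-bit
            and wav.getframerate() == 16000  # 16 kHz
            and wav.getnchannels() == 1      # mono
        )


def make_silent_wav(rate=16000, width=2, channels=1, seconds=0.1) -> io.BytesIO:
    """Hypothetical helper: build a short silent WAV in memory for testing."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (int(rate * seconds) * width * channels))
    buf.seek(0)
    return buf
```

Calling `check_template_format("okay-rhasspy-1.wav")` on each recording (the file name is just a placeholder) should catch a wrong sample rate or a stereo export early.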
If this works for anyone besides me, I can try to incorporate the template recording into the Rhasspy web UI and bundle it with the Docker image. Maybe we can make this like rhasspy-fuzzywuzzy and swap Raven out with an optimized C++ backend at some point in the future.
The most CPU hungry part is the MFCC feature extraction.
You can ease the resource consumption by calculating the MFCC features only for the new frame instead of for the whole buffer (10x improvement).
You can further improve by reducing the number of DTW calculations by averaging the keyword templates. Average template 1 and template 2. Then average avg 1-2 with template 3 etc. Do not average all the templates in one go (very bad for accuracy).
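A rough sketch of that pairwise averaging, in pure Python so it runs anywhere. It assumes a template is a list of feature frames (lists of floats) and that two templates are aligned with a plain DTW before averaging; the alignment details are my assumption, not necessarily what the Node.js module does. The names `dtw_path`, `average_pair` and `average_templates` are hypothetical.

```python
import math
from typing import List, Tuple

Template = List[List[float]]  # one feature vector per frame


def euclidean(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_path(a: Template, b: Template) -> List[Tuple[int, int]]:
    """Return the DTW alignment between a and b as (index_a, index_b) pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Walk back from the end, always taking the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def average_pair(a: Template, b: Template) -> Template:
    """Average b into a along the DTW alignment; the result keeps a's length."""
    aligned: List[List[List[float]]] = [[] for _ in a]
    for i, j in dtw_path(a, b):
        aligned[i].append(b[j])
    averaged = []
    for frame_a, frames_b in zip(a, aligned):
        mean_b = [sum(col) / len(frames_b) for col in zip(*frames_b)]
        averaged.append([(x + y) / 2 for x, y in zip(frame_a, mean_b)])
    return averaged


def average_templates(templates: List[Template]) -> Template:
    """Fold the templates pairwise: avg(1, 2), then avg(avg12, 3), and so on."""
    result = templates[0]
    for template in templates[1:]:
        result = average_pair(result, template)
    return result
```

With a single averaged template, each incoming window needs one DTW comparison instead of one per recorded template.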
Also, using the cosine similarity as the DTW distance function, together with the probability formula detailed in the blog post, helps to get a standardized score/threshold across all templates. The threshold is usually between 0.45 (more false positives) and 0.55 (more false negatives).
Otherwise (using the Euclidean distance) the length of the templates will add too much variation and will require the user to go through a trial and error process to determine the correct value for their specific keyword.
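To make the idea concrete, here is a small sketch of a DTW that uses the cosine distance between frames. Dividing the accumulated cost by the combined length of both sequences is my own assumption for keeping scores comparable between short and long templates. The probability formula from the blog post is deliberately not reproduced here, so the 0.45 to 0.55 range above does not apply directly to this raw distance.

```python
import math
from typing import List

Template = List[List[float]]  # one feature vector per frame


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # assumption: treat an all-zero frame as maximally uncertain
    return 1.0 - dot / norm


def dtw_cosine_distance(template: Template, candidate: Template) -> float:
    """Accumulated DTW cost with cosine distance, normalized by total length."""
    n, m = len(template), len(candidate)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(template[i - 1], candidate[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)
```

Because the cosine distance ignores the magnitude of each frame, a louder recording of the same word scores about the same as a quiet one, which is part of why a single threshold can work across templates.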
I think further improvement can be achieved by offloading the features normalization (as well as extraction) and DTW calculations to a lower level library.
Ok, I just tried this on my Raspberry Pi 3 Model B Rev 1.2 satellite and it’s excellent! Good job
CPU usage is fine: after the initial startup the Python process needs no more than a couple of percent of the CPU. The arecord process needs more, with a continuous 11% of the CPU. I haven’t tried it yet on the Raspberry Pi Zero W; this board will probably need the optimizations suggested by @fastjack.
Note that I had to increase the distance threshold from 38 to 47 and I learned that I had to pronounce Rhasspy in a specific way (I listened to the samples and heard that @synesthesiam pronounces it with shorter vowels than I did), but after these two changes wake word detection was excellent and I haven’t found any false positives yet after shouting a handful of other wake words to my Pi.
This was just a short test, but it already works better than Porcupine in my setup, so I’m sure if I record my own samples wake word detection with rhasspy-wake-raven will be more than good enough for production use for me.
@synesthesiam Oh, and can you publish rhasspy-silence 0.3.0 to PyPI? I had to temporarily change the line for this library in requirements.txt to git+git://github.com/rhasspy/rhasspy-silence@v0.3.0#egg=rhasspy-silence (if anyone else wants to try this in the meantime) so it could find this version, because it’s not on PyPI yet.
One thing I included in my Node version is that when a keyword is detected the current audio buffer is made available so it can be saved to disk as WAV.
Maybe rhasspy-raven-hermes can send it in a MQTT message like rhasspy/hotword/<siteId>/audioCaptured and save it to disk if a configuration option is set.
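A minimal sketch of what that could look like on the Python side, using only the standard library. It assumes the detection buffer is raw 16-bit 16 kHz mono PCM. The topic is the one proposed above and not an existing Hermes topic, and `buffer_to_wav` and `audio_captured_topic` are hypothetical names. Actually publishing the payload is left to whatever MQTT client the service already uses.

```python
import io
import wave


def buffer_to_wav(pcm: bytes, rate: int = 16000, width: int = 2, channels: int = 1) -> bytes:
    """Wrap raw PCM audio in a WAV container and return the file as bytes."""
    with io.BytesIO() as buf:
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(channels)
            wav.setsampwidth(width)
            wav.setframerate(rate)
            wav.writeframes(pcm)
        # The WAV header is finalized once the wave writer is closed.
        return buf.getvalue()


def audio_captured_topic(site_id: str) -> str:
    """Topic name as proposed in this thread (an assumption, not a released API)."""
    return f"rhasspy/hotword/{site_id}/audioCaptured"
```

The same bytes can be written to disk with a plain `open(path, "wb")` when the configuration option is set.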
The more you use the personal wakeword, the more data you collect for eventually training a CNN (like Precise) on that keyword.
I do not think that Rhasspy will ever run the original Snips hotword detector, as it is not open source and is not maintained anymore (if that was what you were saying… not sure).
What @synesthesiam, @maxbachmann, @koan and I are talking about is recreating a service that will handle the “personal” wakeword detection system that Snowboy and Snips used.
The personal wakeword templates could indeed be recorded directly from the Rhasspy GUI (like what you did with SnowboyCustomMaker).
This might lead to some kind of user management (Rhasspy 2.6?)…
So we will need to install and set up the former snips wakeword manually?
That’s why I said ProjectAlice has it as a dpkg, so if it is removed we can still install it. We need reliable and fully maintainable stuff to avoid situations like the snowboy EOL.
Of course you can install the snips-hotword package, but it’s not open source, so Rhasspy will not distribute nor promote it.
Our current priority (as it’s really the missing piece now) is having a completely open source wake word detector that performs well with low resource use. An optimized version of Rhasspy Raven can be that missing piece.
This “one day” may be sooner than you think. It seems to me that @synesthesiam’s hands are already itching to add the recording functionality to Rhasspy’s web interface:
Knowing his god-like productivity, this can’t take long
Thank you for the excellent suggestions, @fastjack! Can I use your code on GitHub as a reference for the MIT-licensed raven project?
I think I’m doing this now, if I understand you correctly. For each buffer, I do the sliding window over it and compute MFCC for each (smaller) window. Is that right?
Oops, that’s a good point. If @fastjack allows it, I’ll just port his code to Python for DTW. It’ll be slower for sure, but it will give us a place to start.
Both MFCC and DTW calculations are the slow parts. The template averaging @fastjack mentioned would be a huge boost by itself.
Got it, I’ll switch over to use cosine and probabilities.
Once we get Raven working here with properly licensed dependencies, I plan to add it to the Docker image and have the template recording happen in the web UI
So we’d have a fully open source wake word system that can have a custom wakeword recorded from within Rhasspy!
Do you mind if I use some of this code in Rhasspy to do recording/trimming in the web UI?
From what I can understand (not a Python dev…), it seems you extract MFCC features from an entire audio chunk (with an approximate length or the average length of the templates). I greatly improved the CPU consumption by only extracting the features from the new window in the audio buffer. The only restriction for this technique to work is that the window size must be a multiple of the shift size (30/10 for example). It’s a bit of a hack, but it reduces the MFCC calculations so much that I think it’s worth it.
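Here is how I would picture that in Python, as a sketch only. Audio is pushed in shift-sized chunks, the last `window_size // shift_size` chunks form the current window, and features are extracted once per new chunk instead of over the whole buffer. The class name `IncrementalFeatures` is hypothetical, and `log_energy` is just a cheap stand-in for a real MFCC function so the example stays dependency-free.

```python
import math
from collections import deque
from typing import Callable, List, Sequence


def log_energy(samples: Sequence[float]) -> List[float]:
    """Stand-in feature extractor; a real detector would compute MFCCs here."""
    return [math.log(1.0 + sum(s * s for s in samples))]


class IncrementalFeatures:
    """Keep a rolling list of feature frames, extracting only the newest one."""

    def __init__(
        self,
        window_size: int,
        shift_size: int,
        max_frames: int,
        extract: Callable[[Sequence[float]], List[float]] = log_energy,
    ) -> None:
        if window_size % shift_size != 0:
            raise ValueError("window size must be a multiple of the shift size")
        self.shift_size = shift_size
        # The most recent chunks that together make up one analysis window.
        self.chunks = deque(maxlen=window_size // shift_size)
        # Feature frames for the part of the audio compared with the template.
        self.frames = deque(maxlen=max_frames)
        self.extract = extract
        self.extract_calls = 0

    def push(self, chunk: Sequence[float]) -> List[List[float]]:
        """Add one shift-sized chunk and return the current feature frames."""
        if len(chunk) != self.shift_size:
            raise ValueError("chunk length must equal the shift size")
        self.chunks.append(list(chunk))
        if len(self.chunks) == self.chunks.maxlen:
            window = [s for c in self.chunks for s in c]
            self.frames.append(self.extract(window))  # one extraction per chunk
            self.extract_calls += 1
        return list(self.frames)
```

With a 30 ms window and a 10 ms shift at 16 kHz this would be `IncrementalFeatures(480, 160, max_frames=...)`, where `max_frames` matches the template length. The multiple-of-shift restriction is what lets the window be rebuilt from whole chunks.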