Hey everyone, I’m new to this forum, but I’ve been following the progress of Rhasspy and similar open-source digital assistant frameworks for a while. One area I’ve always found quite challenging is a good wake word/wake phrase framework with pre-trained models. I’ve seen a number of discussions in this forum about different options for this functionality (e.g., Picovoice Porcupine, Mycroft Precise, custom models, etc.), and wanted to share some work I have been doing in this area.
I just released the initial version of the library: openWakeWord.
You can also try a real-time demo right in your browser via HuggingFace Spaces.
By leveraging an impressive pre-trained model from Google (more details in the openWakeWord repo) and some of the text-to-speech advances from the last two years, I’ve been able to train models on 100% synthetic audio and still see good performance on real-world examples. For example, here are the false-accept/false-reject curves for Picovoice Porcupine and openWakeWord models based on the “alexa” wake word and the test audio clips (modified to be more challenging) from Picovoice’s wake-word-benchmark dataset.
I’m finding the openWakeWord models to work quite well in my testing, and the ability to create models for more complex phrases (e.g., the “timer” model) opens up some interesting options for end-to-end spoken language understanding without requiring repetitive activation with a base wake word.
If anyone finds this interesting or useful I would greatly appreciate feedback on how well the models work for different voices and environments, as well as general suggestions for new features and improvements.
What is the performance of this model running on a Raspberry Pi?
How does it handle streaming and attention in the spectrum with long gated words? I built 2 prototypes over the last month with TensorFlow and eventually landed on a model from Google Research. You can see this thread here: Suggestions for Dutch wake word detection for newbie - #32 by shellcode
@shellcode, performance is reasonable on a Raspberry Pi 3, using roughly 70% of a single core to run the 4 pre-trained models currently available. There is a script that will estimate how many models would fit on a given system and number of CPUs. Using this script, a Raspberry Pi 3 could run ~15-20 models on a single core in real-time.
However, I haven’t quantized the models yet (that’s planned for a future release), so efficiency of the models will hopefully get better.
And yes, that is a great thread! The streaming models from Google Research look very good, and are extremely efficient. However, I didn’t end up using that framework because, from my testing, the pre-trained model from Google that openWakeWord is based on is needed to obtain good performance when training on only synthetic data. But this is something I’d like to explore more.
As for streaming, openWakeWord uses a fairly simple approach. The melspectrograms and audio features are computed in streaming mode (one 80 ms frame at a time), but the trained models predict on a fixed time window that varies by model. For example, the “alexa” model looks at a window of the last ~1.3 seconds when making a prediction.
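To make the idea concrete, here is a minimal sketch of that frame-at-a-time approach: a ring buffer of feature frames with a fixed-width prediction window. This is an illustration only, not openWakeWord’s actual code; `dummy_model` is a stand-in for a trained classifier, and the 80 ms/1.3 s numbers come from the description above.

```python
from collections import deque

FRAME_MS = 80   # each feature frame covers 80 ms of audio
WINDOW_S = 1.3  # the "alexa" model's context window (~1.3 s)
N_FRAMES = round(WINDOW_S * 1000 / FRAME_MS)  # ~16 frames

def dummy_model(frames):
    """Stand-in for a trained classifier: returns a score in [0, 1]."""
    return sum(frames) / len(frames)

buffer = deque(maxlen=N_FRAMES)  # ring buffer of the most recent frames

def process_frame(frame):
    """Append one new feature frame and score the current fixed window."""
    buffer.append(frame)
    if len(buffer) < N_FRAMES:
        return 0.0  # not enough context accumulated yet
    return dummy_model(list(buffer))
```

Because a prediction is made on every incoming frame, the fixed window effectively slides over the stream, which is why it behaves much like a streaming model in practice.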
As for “long gated words”, I assume you mean words/phrases that are separated in time in the audio stream? If so, openWakeWord handles that by simply increasing the width of the time window. For example, the “timer” model uses a window size of about 2.75 seconds. When paired with the right type of classification model (e.g., a GRU/LSTM or self-attention layers), you should be able to use even larger temporal context windows, as the underlying features from the embedding model are fairly robust. In fact, the numbers reported in the Readme for the Fluent Speech Commands dataset are from a classification model with LSTM layers, as that seemed to perform best with larger time windows.
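As a quick back-of-the-envelope check of how window width translates into feature frames (assuming the 80 ms frame length mentioned earlier; this is my own arithmetic, not values from the library):

```python
FRAME_MS = 80  # assumed feature frame duration

def window_frames(window_seconds):
    """Number of 80 ms feature frames covered by a context window."""
    return round(window_seconds * 1000 / FRAME_MS)

# e.g., the "alexa" model (~1.3 s) vs. the "timer" model (~2.75 s)
```

So the “timer” model sees roughly twice as many feature frames per prediction as the “alexa” model, which is why a recurrent or self-attention head helps with the longer context.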
Think of long gated words, using something from your example, as “Aaalllleeeexxxa”. The model I am using from Google Research successfully detects the attention of “Alexa” in a 1-second window. Because it’s a streaming attention model, it can find the attention no matter where it happens in the frame. Whereas if it’s just a simple MFCC (+ whatever else), then it’s probably equivalent to the already-shipping Raven built-in model.
The models are trained to predict when the wake word is near the end of the temporal window, but random variation is included in its placement so that the model is not too sensitive to position. And since predictions are made every frame, in practice it behaves similarly to a streaming model.
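The placement jitter described above can be sketched roughly as follows. This is a hedged illustration of the general augmentation idea, not openWakeWord’s actual training code; the frame counts, jitter range, and `place_wake_word` helper are all assumptions for the example.

```python
import random

WINDOW_FRAMES = 16  # assumed total context window (e.g., ~1.3 s at 80 ms/frame)
MAX_JITTER = 3      # assumed max shift of the wake word away from the window end

def place_wake_word(wake_frames, background_frame=0.0, rng=random):
    """Place a wake-word clip near the end of a training window,
    with random jitter so the model isn't too sensitive to position."""
    jitter = rng.randrange(MAX_JITTER + 1)
    end = WINDOW_FRAMES - jitter
    start = end - len(wake_frames)
    window = [background_frame] * WINDOW_FRAMES
    window[start:end] = wake_frames  # overwrite background with the wake word
    return window
```

Each generated training window then has the wake word ending within a few frames of the right edge, so at inference time a prediction fires shortly after the phrase is spoken.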
While the model isn’t trained on phoneme targets, due to the pre-training and then fine-tuning on synthetic speech, the model learns end-to-end to find the right combination of phonemes in the window, regardless of exactly where and at what rate they are spoken. And while the input is a full melspectrogram, ultimately it’s the learned features from the embedding model that enable the performance. I haven’t done a comparison to Raven (that would be good to add), but given that it is based simply on dynamic time warping, I suspect openWakeWord models will be significantly better.
From the thread you linked before, I know that you and @rolyan_trauts have been working in this area as well. It would be great to do some comparisons between our approaches; I’m sure there is a lot we could learn and share.