openWakeWord - New library and pre-trained models for wakeword and phrase detection

Hey everyone, I’m a new poster to this forum, but have been following the progress of Rhasspy and similar open-source digital assistant frameworks for a while. One area that I’ve always found to be quite challenging is a good wake word/wake phrase framework and pre-trained models. I’ve seen a number of discussions in this forum about different options for this functionality (e.g., Picovoice Porcupine, Mycroft Precise, custom models, etc.), and wanted to share some work that I have been doing in this area.

I just released the initial version of the library: openWakeWord.

You can also try a real-time demo right in your browser via HuggingFace Spaces.

By leveraging an impressive pre-trained model from Google (more details in the openWakeWord repo) and some of the text-to-speech advances from the last two years, I’ve been able to train models with 100% synthetic audio and still show good performance on real-world examples. For example, here are the false-accept/false-reject curves for Picovoice Porcupine and openWakeWord models based on the “alexa” wakeword and the test audio clips (though modified to be more challenging) from Picovoice’s wake-word-benchmark dataset.
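
(For anyone curious how a curve like that is produced: a minimal sketch, not the actual benchmark script, that sweeps a detection threshold over per-clip scores. The score arrays here are just placeholders.)

```python
import numpy as np

# Hypothetical per-clip maximum model scores (0-1) for clips that contain
# the wake word and for negative audio that does not.
positive_scores = np.array([0.91, 0.75, 0.98, 0.42, 0.88])
negative_scores = np.array([0.05, 0.12, 0.61, 0.03, 0.30])
hours_of_negative_audio = 10.0

for threshold in np.linspace(0.1, 0.9, 9):
    false_reject_rate = np.mean(positive_scores < threshold)   # missed activations
    false_accepts_per_hour = np.sum(negative_scores >= threshold) / hours_of_negative_audio
    print(f"thresh={threshold:.1f}  FRR={false_reject_rate:.2f}  FA/hr={false_accepts_per_hour:.2f}")
```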

I’m finding the openWakeWord models to work quite well in my testing, and the ability to create models for more complex phrases (e.g., the “timer” model) opens up some interesting options for end-to-end spoken language understanding without requiring repetitive activation with a base wake word.

If anyone finds this interesting or useful I would greatly appreciate feedback on how well the models work for different voices and environments, as well as general suggestions for new features and improvements.

Thanks!

What is the performance of this model running on a raspberry pi?

How does it handle streaming and attention in the spectrum with long gated words? I built 2 prototypes over the last month with TensorFlow and eventually landed on a model from Google Research. You can see this thread here: Suggestions for Dutch wake word detection for newbie - #32 by shellcode

@shellcode, performance is reasonable on a Raspberry Pi 3, using ~70% of a single core to run the 4 pre-trained models currently available. There is a script that will estimate how many models would fit on a given system and number of CPUs. Using this script, a Raspberry Pi 3 could run ~15-20 models on a single core in real-time.
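
(If it helps, here’s a rough way to estimate this yourself, assuming the Model()/predict() interface shown in the openWakeWord README, where one 80 ms frame is 1280 samples at 16 kHz.)

```python
import time
import numpy as np
from openwakeword.model import Model

oww = Model()  # loads the default pre-trained models

frame = np.zeros(1280, dtype=np.int16)  # one 80 ms frame of 16 kHz audio
n_frames = 500

start = time.perf_counter()
for _ in range(n_frames):
    oww.predict(frame)
elapsed = time.perf_counter() - start

per_frame_ms = 1000 * elapsed / n_frames
print(f"{per_frame_ms:.1f} ms per 80 ms frame "
      f"(real-time factor ~{per_frame_ms / 80:.2f} for the loaded models)")
```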

However, I haven’t quantized the models yet (that’s planned for a future release), so efficiency of the models will hopefully get better.

And yes, that is a great thread! The streaming models from Google Research look very good and are extremely efficient. However, I didn’t end up using that framework, as from my testing the pre-trained model from Google that openWakeWord is based on is needed to obtain good performance when training only on synthetic data. But this is something I’d like to explore more.

As for streaming, openWakeWord uses a fairly simple approach. Basically, the melspectrograms and audio features are computed in streaming mode (that is, one 80 ms frame at a time), but the trained models then predict on a fixed-length time window whose size varies depending on the model. For example, the “alexa” model looks at a window of the last ~1.3 seconds when making a prediction.

As for “long gated words”, I assume you mean words/phrases that are separated in time in the audio stream? If so, openWakeWord handles that by simply increasing the width of the time window. For example, the “timer” model uses a window size of about ~2.75 seconds. When paired with the right type of classification model (e.g., a GRU/LSTM or self-attention layers), you should be able to use even larger temporal context windows as the underlying features from the embedding model are fairly robust. In fact, the numbers reported in the Readme for the Fluent Speech Commands dataset are from a classification model with LSTM layers as that seemed to perform the best with larger time windows.
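
(To make the windowing concrete, here is a rough numpy-only sketch of the buffering idea; the feature extractor and classifier below are placeholders, not the actual openWakeWord internals.)

```python
import collections
import numpy as np

FRAME_MS = 80                    # features are computed one 80 ms frame at a time
WINDOW_S = 1.3                   # e.g. the "alexa" model looks at ~1.3 s of features
frames_per_window = int(WINDOW_S * 1000 / FRAME_MS)

feature_buffer = collections.deque(maxlen=frames_per_window)

def embed(frame_audio):          # placeholder for melspectrogram + embedding model
    return np.zeros(96, dtype=np.float32)

def classify(window_features):   # placeholder for the trained wake word classifier
    return 0.0

def process_frame(frame_audio):
    feature_buffer.append(embed(frame_audio))
    if len(feature_buffer) == frames_per_window:
        # Predict on the fixed window covering the most recent ~1.3 s
        return classify(np.stack(feature_buffer))
    return 0.0
```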

Think of long gated words using something from your example, like “Aaalllleeeexxxa”. The model I am using from Google Research successfully detects the attention on “Alexa” in a 1 second window. Because it’s a streaming attention model, it can find the attention no matter where it happens in the frame. Whereas if it’s just a simple MFCC (+ whatever else) approach, then it’s probably equivalent to the built-in Raven model that already ships.

The models are trained to predict the wake word when it is near the end of the temporal window, but there is random variation included so that the model is not too sensitive to placement. And as predictions are made every frame, in practice it behaves similarly to a streaming model.
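
(As a rough illustration of that placement jitter, and not the actual training code, a training window might be assembled something like this:)

```python
import numpy as np

SR = 16000
WINDOW_S = 1.3

def make_training_window(wake_clip, background, rng, max_jitter_s=0.2):
    """Place the wake word near the end of the window, with random jitter."""
    window = background[: int(WINDOW_S * SR)].copy()
    jitter = rng.uniform(0.0, max_jitter_s)
    end = len(window) - int(jitter * SR)
    start = max(0, end - len(wake_clip))
    window[start:end] += wake_clip[: end - start]
    return window

rng = np.random.default_rng(0)
example = make_training_window(
    wake_clip=np.zeros(SR, dtype=np.float32),       # 1 s synthetic wake word clip
    background=np.zeros(2 * SR, dtype=np.float32),  # background noise/speech
    rng=rng,
)
```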

While the model isn’t trained on phoneme targets, due to the pre-training and then fine-tuning on synthetic speech, the model learns end-to-end to find the right combination of phonemes in the window, regardless of exactly where and at what rate they are spoken. And while the input is a full melspectrogram, ultimately it’s the learned features from the embedding model that enable the performance. I haven’t done a comparison to Raven (that would be good to add), but given that it is based simply on dynamic time warping, I suspect openWakeWord models will be significantly better.

From the thread you linked before, I know that you and @rolyan_trauts have been working in this area as well. It would be great to do some comparisons between our approaches, I’m sure there is a lot we could learn and share.

@dscripka I’m starting work on the wake word portion of the Year of Voice, and would be very interested to talk about your wake word models!

To start, I think I could train some higher quality TTS models for Larynx 2 to serve as input – especially for non-English voices. Larynx 2 (which is going to be renamed at some point) is similar under the hood to Mimic 3, but is much easier to train. Starting from an existing checkpoint, I can get decent audio for a new voice in an hour or two of training, even in a different language.

Nabu Casa is looking into the ESP32-S3 for satellite hardware, and it’s still an open question what kind of wake word models we could run on it. Espressif has their own models, but they’re not open source. An A.I. accelerator of some kind would be nice too, but everything we’ve come across so far has closed tooling, so we’re not interested.

I’d like to see if openWakeWord could be ported to the ESP32-S3, specifically the modules with included PSRAM (at least 4MB). The big questions are:

  1. Can the esp-dl quantization toolkit be used to convert the mel spec, speech embedding, and wake word ONNX models as-is?
  2. Can the Silero VAD model be converted or ported?
  3. Can the full model stack be run fast enough for low latency detection?
  4. Can the logistic speaker verifier models be used too (I’m guessing yes)?

I’m not worried about noise suppression, since our plan is to use esp-adf to do pre-processing of all audio from the two mics.

Let me know your thoughts, and thank you for creating openWakeWord :slightly_smiling_face:

You need to look at GitHub - espressif/esp-dl: Espressif deep-learning library for AIoT applications, but the biggest criterion is likely that Tensilica have licensed recurrent layers as part of their LX7 offering, which may be why they are missing or limited in esp-dl. So it needs a KWS without GRU or LSTM layers, which probably means a simple CNN or a DS-CNN (which provides more accuracy), as they are the only ‘common’ KWS model types apart from SVDF that don’t use recurrent layers.

You can not just use a voice of higher quality; it has nothing to do with quality, as if you feed a singular voice the model will become good for that voice alone (overfitted) and useless for others, as it will reject them.
Even with a GitHub - neonbjb/tortoise-tts: A multi-voice TTS system trained with an emphasis on quality (and several multi-voice TTS systems do exist), you’re still likely using them to supplement what you can get from datasets, and to augment to stop overfitting to specific patterns occurring in your data other than the keyword spectrogram. There are a few multi-voice TTS systems with many voices that could likely supplement a dataset a lot.

You do have the choice of using the KWS that Espressif have included; it’s not that great though, as they have seriously quantised that model down hard, and it’s a pay-for service for any different wakeword.
You don’t have to have a KWS at all: options range from always broadcasting, to VAD-triggered, and even the ESP32 ASR for dynamic KWS could be used as a trigger, where a secondary check is included upstream to increase accuracy.
The secondary check, if you have the processing available, is likely always a good idea, as is utilising the ESP32 OTA and registering an upstream OTA server that creates a model on the fly that is overfit to a user enrolment session, then continues to capture KW examples and train whilst idle, getting better through use.

Still, not much is known about the Espressif BSS, as usually an n-mic BSS will split into n streams, which stream has the better voice content is random, and maybe it’s inseparable from the KWS.

I guess I have a partial answer while trying to convert the mel spec model:

Constant is not supported on esp-dl yet
Unsqueeze is not supported on esp-dl yet
Pow is not supported on esp-dl yet
MatMul is not supported on esp-dl yet
Clip is not supported on esp-dl yet
Log is not supported on esp-dl yet
Div is not supported on esp-dl yet
ReduceMax is not supported on esp-dl yet
Cast is not supported on esp-dl yet

and for the speech embedding model:

BatchNormalization is not supported on esp-dl yet

Damn, even silero won’t go:

Equal is not supported on esp-dl yet
If is not supported on esp-dl yet
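
(For reference, a quick way to list which op types an ONNX file uses, so they can be checked against the esp-dl support list; this assumes the onnx Python package, and the filename is just a placeholder.)

```python
import onnx
from collections import Counter

model = onnx.load("melspectrogram.onnx")  # illustrative path
ops = Counter(node.op_type for node in model.graph.node)
for op, count in sorted(ops.items()):
    print(f"{op}: {count}")
```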

@rolyan_trauts Using a VAD-only broadcast model with OPUS wideband compression might be a good alternative. Espressif does have a VAD model available, though like everything else it’s not open source :frowning:

@dscripka mentioned that this takes ~70% of a core on an RPi 3 for 4 concurrent models. Maybe something like a Le Potato with Speex noise suppression and one of the better mics could be an alternative?

GitHub - fengfeng0328/esp32_speech-vad-demo: vad algorithm based on esp32 for mute detection? Again, via an enrollment OTA, ship a deliberately overfitted VAD model.
The AML-S905X-CC is not that much of a step up; it’s somewhere between a Pi 3 and a Pi 4 if I remember rightly, nearer the Pi 4 than the Pi 3.

I think quite a few of those are not supported in TFLite either, but you need to create a model based on what is supported, not discount the approach because a model you tried is not supported.

Hmmm, looks like they wrapped webrtcvad. Should work pretty well.
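
(If anyone wants to try the same idea on a Pi instead, here is a minimal sketch with the py-webrtcvad package, using 30 ms frames of 16 kHz, 16-bit mono PCM.)

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most)
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit samples

def frames(pcm: bytes):
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[i:i + FRAME_BYTES]

def has_speech(pcm: bytes) -> bool:
    return any(vad.is_speech(f, SAMPLE_RATE) for f in frames(pcm))
```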

Right, and you can actually buy them for $35 USD here on Amazon. If a Pi 3 level of performance is all we need, maybe it’s worth it to do that instead of trying to squeeze everything into an ESP32. This is assuming a device that needs to run a wake word locally, not a VAD + broadcast.

You sure? I never remember Potato boards being that cheap, but it could be the UK, as here it’s

£57.70

The cheapest board slightly faster than a Pi 3 that I have found is the

https://www.aliexpress.us/item/3256804171701489.html

You have no HAT, so you’re talking a Plugable USB sound card, as that’s the only stereo USB at HAT prices that I know of.

There is an interesting SoC, less powerful than a Pi 3, with an onboard 8-channel ADC and stereo DAC, but it’s still waiting for release.
Radxa tried with the RockPi S, but something didn’t go well with the ADC, as it’s noisy as hell, as if it has some sort of ground loop going on.

What mic would you plug into the audio adapter these days?

It depends, as most are single mic with Tip=signal, Ring=bias, Sleeve=gnd, but they expect a broadcast-style close-field input, so you really need a preamp to boost and extend into near & far field. (Think of the sort of standard PC mic you would expect.)

The Plugable is a stereo ADC: Tip=Left, Ring=Right (or the other way round, but 2x signals, no bias), Sleeve=Gnd. I think it’s expecting near line-in level, so you have to have a preamp board, e.g. the MAX9814, which also adds hardware AGC; it’s a product Adafruit stock, but the same module is cloned everywhere with a mic onboard.

Thanks for the comment, @synesthesiam!

I think Larynx 2 could certainly be used to generate synthetic training data, but @rolyan_trauts is correct, you need a large number of synthetic speakers to create enough diversity in the generated data. In practice, I use two multi-speaker TTS models (more details here) that are sampling based, so I get natural variation with each generation. One is trained on LibriTTS so it has >2000 voices, and the other is trained on VCTK with ~110 voices. Plus, I then mix voices together in the latent space to get “new” voices for more diversity. That will likely be the biggest challenge for non-English models. However, if it’s possible to use a dataset like CommonVoice to train Larynx 2 models (e.g., like [2210.06370] Can we use Common Voice to train a Multi-Speaker TTS system?), that could be interesting.
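
(The latent-space mixing itself is conceptually just interpolating speaker embeddings before synthesis; a hand-wavy sketch, where the embeddings and the synthesize() call are stand-ins for whatever multi-speaker TTS is used.)

```python
import numpy as np

def mix_speakers(emb_a, emb_b, alpha):
    """Blend two speaker embeddings to get a 'new' voice."""
    return alpha * emb_a + (1.0 - alpha) * emb_b

rng = np.random.default_rng(42)
emb_a, emb_b = rng.normal(size=256), rng.normal(size=256)  # stand-in embeddings

for alpha in (0.25, 0.5, 0.75):
    new_voice = mix_speakers(emb_a, emb_b, alpha)
    # audio = synthesize("alexa", speaker_embedding=new_voice)  # hypothetical TTS call
```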

Inference latency is another important question. As you’ve already seen, there are some more unique ops in these models that probably mean they’d be hard to get running on the ESP32-S3. Plus, from this page it seems like a model of roughly the same size as openWakeWord takes ~700 ms on the S3. You’d have to run the openWakeWord models at least every ~200 ms to get good detection performance, so it seems like it may be a bit too big for the S3.

The Libre Computer boards could definitely be a good choice, and in general anything close to RPi 3 single-core performance should be enough (especially with a bit of optimization). If only it were possible to actually buy RPi Zero 2 Ws, those would be perfect. A regular SBC does mean getting a decent microphone, as you discuss. I was lucky enough to get a few Acusis S (Acusis S — Antimatter Research, Inc.) boards before they stopped making them and of course those work very well, but I’ve also tried to make the openWakeWord models robust to noise/reverb, and when combined with basic Speex noise suppression (and Speex AEC, if needed), I’ve noticed that performance is often acceptable.

The datasets have always been a pain, especially with the bias towards English-based datasets, but for a while now I have wondered whether the standard wakeword concept is even the way to go.
I have a hunch Google strayed from what we see as multi-voice wake-word capture to something similar but simpler.
I think at least their newer units just have VoiceFilter-Lite running, which is a hybrid personal VAD + blind source separation model set up via the brief enrollment process.
KWS might even be omitted, as the KW might be processed by the ASR, but either way the combination of 2 lesser, separate processes creates a more accurate end result than multi-voice KW detection.

It’s the same with the Acusis S: it may have worked well in low noise or when fed a reference signal, but at less than 2:1 SNR of 3rd-party noise it generally results in failure with a standard-model KWS.
In the domestic context, other media playing has a high occurrence, and maybe the niche where the Acusis S worked well is too small.

The Hand Gesture Recognition on ESP32-S3 with ESP-Deep Learning | The ESP Journal model is nearly a million parameters, and that is pretty large for a KWS.
Parameter count is far from exact, but it often gives you a yardstick, and 870k params is much larger than, say, a bc_resnet_2 with 30k params and a supposed accuracy of 97.6%.
I would have to train a CNN or DS-CNN again (and check), as they are definitely larger than a bc_resnet_2, or go on a vague memory of approx 200k params for a non-streaming model.

Whatever is used, maybe some lateral thought is required, rather than the assumption that an exact KW is going to be used on a ‘satellite’ when it is merely the 1st stage, purely to initiate a broadcast and select a stream from a distributed array for 2nd-factor authentication upstream and central on more capable hardware.

The product of the enrollment VAD with a KWS or ASR result may well be far more accurate and lighter weight than a single-factor check downstream, whilst requiring simpler, far more cost-effective ears that push much of the load upstream.

There are some benchmark details for the ESP32-S3:
https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/benchmark/README.html

I am still as confused as ever, as yeah, the best-SNR of the 2 post-BSS streams uses MISO (Multiple In, Single Out), but also, if you look, WakeNet does have a channel reference, as if it is running on multiple channels, so I’m confused about MISO.
There is not a whole lot of info there on how MISO differentiates between signal and noise, and it is there for when WakeNet is not enabled; as you can see there are the 16-bit and the 2/3-channel 8-bit WakeNet KWS models, so I’m presuming MISO isn’t that great, otherwise the KWS would be single-channel.
I used the ESP32-S3-Box and the KWS was OK, but the 16-bit version might be better.

@synesthesiam I have always been bemused :slight_smile: that, irrespective of what KWS we use, we never drummed up support for a ‘Hey Rhasspy’ dataset.

Another idea might be to use the “voice conversion” feature of VITS. You provide a mel spec + src and dest speaker ids/embeddings, and it will convert the audio between speakers without passing through the text/phoneme stages. This may let us leverage English voices for non-English wake words (rough sketch after the list below), perhaps by:

  1. Enrolling a user (getting a speaker embedding for them relative to a large multi-speaker model)
  2. Collecting a small number of non-English examples
  3. Converting those examples into mel specs, and
  4. Passing them through voice conversion to get the non-English examples spoken with different voices
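
(Something like the following, as a rough outline; enroll(), to_mel() and convert() are hypothetical stand-ins rather than a real VITS API.)

```python
import numpy as np

# Hypothetical placeholders for the real components:
def enroll(user_wavs): return np.zeros(192)   # speaker embedding for the user
def to_mel(wav): return np.zeros((80, 100))   # mel spectrogram of a recording
def convert(mel, src, dst): return mel        # VITS-style voice conversion

def build_non_english_examples(user_wavs, wake_word_wavs, vctk_speaker_ids):
    speaker_embedding = enroll(user_wavs)      # 1. enroll the user against the multi-speaker model
    examples = []
    for wav in wake_word_wavs:                 # 2. small number of non-English recordings
        mel = to_mel(wav)                      # 3. convert each to a mel spectrogram
        for target in vctk_speaker_ids:        # 4. voice-convert to many different speakers
            examples.append(convert(mel, src=speaker_embedding, dst=target))
    return examples
```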

Here is an example using the VCTK voice from Mimic 3: voice_conversion - Google Drive
You can hear the results when converting directly between speaker 0 and 10 in the two directories. This is the “best” case, since both embeddings were trained into the model.

More interesting is the de_sample.wav and en_sample*.wav. I took a TTS sample generated from Mimic 3’s German voice (from @thorstenMueller), picked a male speaker with a somewhat deep voice (speaker 7), and had it convert the audio to several speakers from VCTK. This is the “worst” case, since I didn’t enroll Thorsten’s voice into the VCTK model and get a proper speaker embedding. But still, hopefully you can see the potential :slight_smile:

I’m going to order an Orange Pi Zero2 and see if I can get everything running well on there. It seems like it would be much less effort than to try and port everything to the ESP32.

One of the better mics I’ve tried has actually been the SJ201 from Mycroft’s Mark II. I’m trying to get them to sell the boards individually, because they have an XMOS chip (a later model than the Acusis S did) and two good mics.

Maybe if the synthetic data approach works, we won’t have to :laughing:
But yeah, it would still be nice to have had that dataset!

The stock situation with the RPi Zero 2 (due to my stupidity, and Farnell strangely sending me out emails when they do expect stock) is looking very much like it will be Feb ’24.
The OPi Zero 2 works fine with the headless server versions; some of the examples I have been doing of late are not on an RPi but an OPi Zero 2.

It’s a shame nobody has a skill set other than Python, as the ESP32-S3 is the only platform with a free and ready software offering.

Also, BSS bridges the gap with 3rd-party noise that Mycroft’s $399 assistants are not even capable of, and it’s probably why the Acusis S isn’t sold any more, as for the function it just wasn’t cost effective.
It’s likely to be one of the commercial units that @C64ever has tested, such as the Anker, as HATs are really problematic for cost and also for the lack of Pi stock to sit them on.
I have been shouting for a while about how cost effective an ESP32-S3 would be, and it’s a perfect device for a Hass model. Likely I could do it, but with my MS and fatigue it’s a big undertaking, and it’s the support: if I switch off from a project, each time I return I have to almost start from scratch again, and it infuriates me.

It’s a bit like beamforming and AEC with a smart assistant: generally it’s overrated, due to it having gaps in certain scenarios and being costly in training, load and implementation.
Even if you do get the synthetic data approach working, it’s likely there are solutions that are easier in training, load and implementation that will provide better results, and certainly at far less cost.
Maybe one of the home-assistant.io / ESPHome / Tasmota crowd will rise to the challenge some time, but it also needs upstream to provide how websockets will be implemented.

Those are impressive examples! Do you think it would be possible to train a Larynx 2 VITS model on LibriTTS? The speaker set there is relatively diverse (at least for English), and if the conversion to other languages is similarly decent, that’s worth some experiments.

The SJ201 boards from Mycroft would be pretty great as well; I assumed that since Mycroft is winding down they wouldn’t have stock of those to sell.

I’m happy to train a “Hey Rhasspy” model for the next openWakeWord release; if that’s one that people use a lot, it would help provide some more evidence as to whether the synthetic-only method is viable.

You are right, that’s relatively large for a KWS model. openWakeWord has ~400k parameters, so maybe it would be around ~500 ms on an ESP32-S3, which is still a bit slow. You can definitely get smaller models to work well assuming you have real data, but since in practice getting data is often not possible, I’ve found that you have to trade off model size to get things to work well with synthetic voices. Similarly, since array mics, beamforming and AEC are often really hard to manage via low-cost hardware and open-source software (and sadly, often beyond my abilities), I’m trying to see how well the models perform without those features, and that’s another area where the bigger models help. So far I get acceptable performance with openWakeWord models at even <5 dB SNR for non-cancelled background noise and “normal” USB mics, which might be good enough for many deployments.

There are still some ideas I have to continue improving performance as well, so I’ll take the ML approach as far as I can. BSS is another great option I’d love to explore more in the future.

It’s sort of strange, but the 30k bc_resnet_2 model has gone strangely full circle, as Espressif are using https://arxiv.org/pdf/1811.07684.pdf (Alice Coucke, Mohammed Chlieh, Thibault Gisselbrecht, David Leroy, Mathieu Poumeyrol, Thibaut Lavril; Snips, Paris, France) according to the Espressif documentation.
There is a working framework in the kws_streaming repo from Google Research.

Espressif seem to have a channel(s) input vector, which is near what I was planning myself; I was expecting 2x instances, but it does look like they use a single model with a channel dimension, which I hadn’t thought about.

You don’t have to make the assumption that you need a huge dataset of all voice types for the KW of choice, so that any utterance of that KW will trigger, or that you even need a KWS at all downstream.
Via enrollment you can create what I have previously called a ‘bad’ VAD on a small dataset of captured voice. I called it ‘bad’ as it’s overfitted to the enrollment voice, so it only accepts the enrollment voice, but in this case that isn’t ‘bad’, as it provides one route to a personalised VAD that, combined with an upstream KWS, provides more accuracy overall.
A VAD with a shorter frame/window has fewer parameters and is considerably lighter, but the same principle goes for a ‘bad’ KWS that is enrollment based and has a double-factor check with upstream models.
There are many ways to attack this, and lighter-weight schemes in conjunction with upstream methods can provide some of the cutting-edge targeted voice extraction that the likes of Google do, accomplished by lateral thought more than Google science :slight_smile:
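
(As a toy illustration of that ‘overfit to the enrollment voice’ idea, and not anyone’s actual implementation: a tiny classifier trained on enrollment features vs. everything else, with made-up feature vectors and scikit-learn.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for per-frame audio features (e.g. MFCCs) captured during a
# short enrollment session vs. other speakers / noise.
enrollment_features = rng.normal(loc=1.0, size=(200, 20))
other_features = rng.normal(loc=0.0, size=(200, 20))

X = np.vstack([enrollment_features, other_features])
y = np.concatenate([np.ones(200), np.zeros(200)])

personal_vad = LogisticRegression(max_iter=1000).fit(X, y)

# At runtime: only pass on frames that look like the enrolled voice.
new_frame = rng.normal(loc=1.0, size=(1, 20))
print(personal_vad.predict_proba(new_frame)[0, 1])
```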

It’s not low-cost hardware or open source that’s the problem, it’s non-RTOS / non-DSP application SoCs such as the Pi, where schedulers make it near impossible to guarantee exact timings. Low-cost microcontrollers, because they run an RTOS, don’t suffer the same way, as the state cycle should always be the same.
When Espressif added vector instructions with the ESP32-S3, it elevated a low-cost microcontroller very much into this realm, and it’s why Espressif have frameworks where the previous chips, minus vector instructions, struggled.

Google & Amazon run on low-cost microcontrollers, likely some form of Arm-M style, because they have created and researched their own DSP libs, as have XMOS, and they are no different or better; it’s just that you’re paying a premium for closed-source software embedded in silicon.
Even then, implementations often lack correlation, as KWS/VAD should be synced to direction, but most of the time it’s the Pi that is the problem, not the algorithms themselves.

Google likely doesn’t even employ beamforming at all, and from testing, the Nest products are slightly better in noise than the Gen 4 Alexa’s complex 6-mic array, which I presume Google have concrete patents around, deliberately forcing Amazon onto a more expensive hardware platform.

Speex AEC: and this is what I find so strange, I seem to be the only one in the community to tackle low-level DSP and have a working beamformer. I know the technology well, which is why I don’t really think it’s the way forward, but it can definitely be used.
But GitHub - voice-engine/ec: Echo Canceller, part of Voice Engine project did a great job on AEC, and @synesthesiam has created a fork for me so I can push a few changes to add a few more CLI parameters, to make configuring for the environment a tad easier.
For beamforming, have a look at GitHub - StuartIanNaylor/2ch_delay_sum: 2 channel delay sum beamformer (my name is in reverse there, as the gmail had already been taken).

It writes the current TDOA to /tmp/ds-out (watch -n 0.1 cat /tmp/ds-out to monitor), but it likely should store a buffer of TDOA values so that on a KW hit you can fix the beam by writing the average TDOA to /tmp/ds-in, then delete the file to clear it.
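
(On the host side, that TDOA buffering could look roughly like this; a sketch assuming /tmp/ds-out and /tmp/ds-in behave as described above.)

```python
import collections
import os

tdoa_history = collections.deque(maxlen=50)  # last ~5 s at 0.1 s polling

def poll_tdoa():
    """Read the beamformer's current TDOA estimate from /tmp/ds-out."""
    try:
        with open("/tmp/ds-out") as f:
            tdoa_history.append(float(f.read().strip()))
    except (OSError, ValueError):
        pass

def fix_beam_on_wake_word():
    """On a KW hit, lock the beam to the recent average TDOA."""
    if tdoa_history:
        with open("/tmp/ds-in", "w") as f:
            f.write(str(sum(tdoa_history) / len(tdoa_history)))

def release_beam():
    """Delete /tmp/ds-in to clear the fixed beam."""
    if os.path.exists("/tmp/ds-in"):
        os.remove("/tmp/ds-in")
```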

Generally, when going over 33% SNR, unfiltered recognition starts to degrade (the MFCC is less clear); you can get higher levels of SNR, but false positives/negatives start to climb steeply. There are some great filters though, and if the datasets are mixed with noise and then filtered before training, the filter signature will be recognised (you build the filter into the model).
This is something that has always been missing, as we do have that open source with DTLN and the awesome RTX Voice-like DeepFilterNet, but we have to train the filter into the dataset. If you do, especially with DeepFilterNet, you can achieve crazy levels of SNR; it’s just a shame its LADSPA plugin is single-thread only, but on a base with an RK3588 or above, as said, it’s pretty amazing. DTLN is less so, but it has much less load and is far more Pi 4 friendly.
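
(In training-data terms that roughly means something like the following, where denoise() is a stand-in for whatever filter, DTLN, DeepFilterNet or otherwise, will also run in front of the model at inference time.)

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix noise into speech at a target SNR (both float32, same length)."""
    speech_power = np.mean(speech ** 2) + 1e-9
    noise_power = np.mean(noise ** 2) + 1e-9
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def denoise(audio):
    return audio  # stand-in for DTLN / DeepFilterNet processing

def augment(clean_clip, noise_clip, snr_db=5):
    noisy = mix_at_snr(clean_clip, noise_clip[: len(clean_clip)], snr_db)
    # Train on the *filtered* audio so the filter's signature is baked into the model.
    return denoise(noisy)
```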

We have always had many solutions (AEC & beamforming), but as a community it has lacked ability, resources and will, generally leaving this open to any filter and bring-your-own DSP & mic. Unfortunately, for a voice assistant, a closed circuit of dictated input DSP audio is actually far superior as a solution and could exist, but we seem unable to co-ordinate. That closed loop of cheap hardware, which models are specifically trained for, adds a big advantage to commercial hardware.

They exist, as do others, but complementary ASR/KWS models do not exist, and in some cases without them they can degrade recognition.

I’m currently using the Orange Pi Zero 2 as my custom-made satellite boards.

It handles Google’s CRNN keyword spotting model (TensorFlow Lite, quantised) easily on ALSA and has lots of CPU capacity remaining.
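
(For anyone wanting to reproduce that setup, a minimal sketch of running a quantised TFLite model with tflite_runtime; the model path and input are placeholders.)

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="crnn_kws_quant.tflite")  # placeholder path
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Dummy input matching the model's expected shape and dtype.
audio_features = np.zeros(inp["shape"], dtype=inp["dtype"])

interpreter.set_tensor(inp["index"], audio_features)
interpreter.invoke()
scores = interpreter.get_tensor(out["index"])
print(scores)
```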

I’ve tried to plug a USB sound card into the USB pins the board provides (to reduce the form factor), but they generate lots of noise (which is not there when using the USB port).

The OS available is not recent, but it works.

My satellites have been up and running 24/7 for the last year without issue.

Hope this helps.
