DIY Alexa on ESP32 with INMP441

I agree with you. Anyway, it could be possible to provide more solutions for hotword detection, which a user could select by configuring the setting.ini file.

The only thing Atomic got wrong was the dataset, and it confuses everyone: the 'Google command dataset' is a benchmark dataset for testing accuracy, and hence contains a high proportion of bad samples and extremely varied accent content.
If you are making a custom dataset you should use the input voices only; why add accents from around the world when there is no chance those speakers will ever use your system.

Also he used Marvin, which is two syllables and fairly short, whilst something three-syllable that fills the frame, such as 'heymarvin', would also have been much better: it contains more phones and is more unique.

A couple of choice sentences can return 40 or more words, and the only hassle is repeating your KW the same number of times.
Record on the device of use and the mic of use, and then tools such as SoX can quickly augment the result into 1-2k KW & !KW samples.
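As a rough sketch of that augmentation with SoX called from Python (file names, paths and effect values here are only illustrative):

```python
# Rough sketch: augment each 1-second keyword recording with small pitch,
# tempo and volume shifts using SoX (assumes `sox` is installed and on PATH).
import itertools
import subprocess
from pathlib import Path

SRC = Path("kw_raw")        # original 1-second keyword clips (illustrative path)
DST = Path("kw_augmented")  # augmented output
DST.mkdir(exist_ok=True)

pitches = [-100, 0, 100]    # pitch shift in cents
tempos = [0.9, 1.0, 1.1]    # tempo factors
volumes = [0.8, 1.0]        # gain factors

for wav in SRC.glob("*.wav"):
    for i, (p, t, v) in enumerate(itertools.product(pitches, tempos, volumes)):
        out = DST / f"{wav.stem}_aug{i:02d}.wav"
        # SoX effects: pitch (cents), tempo (factor), vol (gain)
        subprocess.run(
            ["sox", str(wav), str(out),
             "pitch", str(p), "tempo", str(t), "vol", str(v)],
            check=True,
        )
```

Forty recorded keywords times eighteen small variations like these is already 700+ samples before any noise mixing.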

Atomic also followed the basic audio TensorFlow example, which is more of an introduction than a supposed working KWS, whereas Google have published a framework of current state-of-the-art KWS.

I have published a repo just to make it easier to get started and install TensorFlow & TensorFlow Addons (not needed for the ESP32, as delegation is not supported).

I also created a sample repo showing how to create datasets, as the last thing missing from Atomic's KWS was a silence classification, which acts as a catch-all between spoken KW and spoken !KW and greatly increases accuracy.

https://www.tensorflow.org/lite/microcontrollers supports the ESP32, and the CNN examples in the G-KWS are perfect to export to a microcontroller, using the front-end ops for microcontrollers.
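As a rough sketch of the export step (the SavedModel path and output names are only illustrative), it is just a few lines:

```python
# Minimal sketch: convert a trained KWS SavedModel to a .tflite file for
# TensorFlow Lite for Microcontrollers. Paths and file names are illustrative.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("kws_savedmodel")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()

with open("kws_model.tflite", "wb") as f:
    f.write(tflite_model)

# On the ESP32 side the model is then included as a C byte array, e.g.:
#   xxd -i kws_model.tflite > kws_model_data.h
```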

When you create your dataset, the quantities of samples greatly affect the classification weights, and with the three classifications of silence, !KW & KW, the quantities each contains can be tweaked to attain the results required.
Training is in four stages, with probably 2k steps each being a minimum and 8k starting to reach a point of no real return.
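As a trivial check of those class quantities before training (the folder names below are just an assumed layout, not anything fixed):

```python
# Quick sanity check of class balance before training.
# Folder names (silence / notkw / kw) are only an assumed dataset layout.
from pathlib import Path

dataset = Path("dataset")
classes = ["silence", "notkw", "kw"]

counts = {c: len(list((dataset / c).glob("*.wav"))) for c in classes}
total = sum(counts.values()) or 1

for c in classes:
    print(f"{c:8s} {counts[c]:5d} samples ({100 * counts[c] / total:.1f}%)")
```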

You can get an extremely noise-resilient KWS that far surpasses ASR, which is not tolerant of noise.

All very nice, but the DIY Alexa project uses Marvin as the keyword.
I might try and add it, but the keyword should be a custom keyword.
This involves training the model, and for some of us training it will do fine, but for most of us (users instead of tinkerers) a better tool is needed to train custom models.
If there were a tool users could easily use to train their custom keyword, it would be great. I could not find such a tool.

I will investigate that project and I might be able to incorporate the Marvin keyword.

As said, Marvin is a bad choice because of its low phonetic count, and Atomic got poor results with it. Even A-Lex-A is three syllables, and Hey-Goo-Gle is better; use Marvin if you wish, but the shorter the word and the fewer its phones, the less accurate your KWS will be.

There is no such thing as a better tool to train custom models, as it runs from a single command line and is by Google Research: state-of-the-art KWS for lightweight platforms.
Nothing needs to be done to train a model, as that is all done with TensorFlow or other frameworks; the work that is needed, and is critical to results, is the dataset builder.
You need to build a dataset where training/testing/validation sets are made with the right percentages for the three classifications 'silence', '!kw' & 'kw'.
You need to mix in background noise and not make the classic error of drowning the foreground, and it's that simple.
It's a button press to train from then on; just wait for step completion and the resultant .pb and .tflite.
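As a rough sketch of that split (the 80/10/10 percentages and the three class folders are an assumption, not a fixed rule):

```python
# Rough sketch: shuffle each class folder and copy the files into
# training/validation/testing subsets. The 80/10/10 split is an assumption.
import random
import shutil
from pathlib import Path

SRC = Path("dataset")        # assumed layout: dataset/silence, dataset/notkw, dataset/kw
DST = Path("dataset_split")
SPLITS = [("training", 0.8), ("validation", 0.1), ("testing", 0.1)]

random.seed(0)
for cls_dir in [d for d in SRC.iterdir() if d.is_dir()]:
    files = sorted(cls_dir.glob("*.wav"))
    random.shuffle(files)
    start = 0
    for i, (split, frac) in enumerate(SPLITS):
        # last split takes whatever is left so no file is dropped by rounding
        end = len(files) if i == len(SPLITS) - 1 else start + int(frac * len(files))
        out_dir = DST / split / cls_dir.name
        out_dir.mkdir(parents=True, exist_ok=True)
        for f in files[start:end]:
            shutil.copy2(f, out_dir / f.name)
        start = end
```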

I just knocked up the dataset builder for @synesthesiam so he could take a look at the rough requirements and maybe at one stage provide an interface through the web console.
It's extremely easy to augment samples, so a relatively small 20-40 word recording can be varied to quickly become 1-2k samples in each classification.

An on-screen prompter should provide some phonetic-language sentences that, for a given language, contain as many phones & allophones as possible; they don't have to make sense and could probably be contributed by the community.
For English, examples are at http://clagnut.com/blog/2380/#English_phonetic_pangrams

“With tenure, Suzie’d have all the more leisure for yachting, but her publications are no good.”
“Shaw, those twelve beige hooks are joined if I patch a young, gooey mouth.”
“Are those shy Eurasian footwear, cowboy chaps, or jolly earthmoving headgear?”
“The beige hue on the waters of the loch impressed all, including the French queen, before she heard that symphony again, just as young Arthur wanted.”

Four simple sentences can provide enough for an initial model; then it's choose your KW, repeat it approximately 40 times, and the starter model is good to go with the addition of a background_noise folder.

The G_KWS & Dataset_builder were just proofs of concept to test resultant accuracy, and the results are extremely high.

The Google KWS framework plus a dataset builder should reside in Rhasspy, as then the collection of KW and !KW command sentences from actual use can be used to create a highly accurate KWS far in excess of what ASR is capable of; the resultant .tflite model files can be shipped out OTA to any satellite, and the system will get better through use.
It just needs a web front end putting on it, but it would seem that, apart from Synesthesiam, who already provides a mass of code, there is little ability beyond merely copying projects and rebadging them. Yet the actual code is there and the process is fairly simple, as even a non-coder like me can create something in Python that works and proves the concept.


I’ve finally made some progress this week on KWS. The Google system @rolyan_trauts mentioned works pretty well, but it’s always bugged me that you need to start from scratch each time you train a new keyword.

Then I came across this paper: Few-Shot Keyword Spotting in Any Language. Their approach mixes aspects of an earlier proposal from @rolyan_trauts:

  1. Gather 1 million+ samples of hundreds of keywords from open speech data like Common Voice
  2. Augment the samples with time shifting, background noise, and SpecAugment
  3. Train a KWS model to classify each of the keywords (as well as a “background” category)

So far, this is a vanilla keyword spotter, but here’s what I think is the cool part:

  1. Freeze the model weights and put them on top of a second model that will only classify KW, not-KW, and silence/background
  2. Train on ~5 positive samples (augmented as before), as well as some negative/background – about 256 total

By choosing the initial keyword set carefully (multiple languages / broad phoneme coverage), you can pre-train a very robust system. The final customization can then be done on device with very few examples.
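Roughly, that second stage looks something like this in Keras (model names, sizes and paths here are placeholders, not the paper's actual code):

```python
# Minimal sketch of the few-shot stage: freeze a pre-trained keyword embedding
# model and train only a small 3-class head (kw / not-kw / background) on a
# handful of augmented samples. Model names and shapes are placeholders.
import tensorflow as tf

# Pre-trained multilingual keyword classifier, with its softmax layer removed.
embedding = tf.keras.models.load_model("multilingual_kws_embedding")
embedding.trainable = False  # freeze the pre-trained weights

head = tf.keras.Sequential([
    embedding,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # kw, not-kw, background
])

head.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# x_few / y_few would be the ~256 augmented positive + negative examples:
# head.fit(x_few, y_few, epochs=10, batch_size=32)
```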

I trained a model like the paper’s using augmentation techniques from @rolyan_trauts’s dataset builder, but with a lot of other corners cut (no attempt to balance categories, etc.). The results were surprisingly good, but the CPU usage was not (20% on my desktop). So I’m going to try this trick with the Google KWS CRNN model – pre-train on many keywords, freeze weights, add layer, train on one keyword.


Yes, but Marvin is used in that project

What I need is existing keyword models or a clear and simple tutorial on how to create them
You say:

“It's a button press to train from then on; just wait for step completion and the resultant .pb and .tflite.”

But it is the “from then” part which is totally unclear to me. Building a dataset? Mixing in background noise? Probably simple when you have built enough knowledge of the whole topic, but I do not have that.
I will have a look at the dataset builder repo, but it still requires effort from users to build their own keywords.

What I want for my project is a set of already existing keywords from which a user can choose and enter in settings. For TensorFlow, the only example I have is Marvin.

It would be extremely useful if it worked like Raven does in Rhasspy:
record a couple of samples and then download a model.

The difference between a universal prebuilt model and a custom-tailored model can be a difference in noisy WER (Word Error Rate) of almost 30%.

Marvin is used in that project purely because Atomic chose to use the 1-second 'Marvin' samples for a KW, but as said, the 'Google command set' is a benchmark dataset that contains up to 10% bad samples and has an extremely high proportion of non-native speakers.
It's a dataset to test the mettle of the best state-of-the-art KWS models and give a datum; no KWS will manage 100% on it, as otherwise we would not be able to differentiate accuracy.

You do not build keywords; keywords are just a collection of 1-second samples of the keyword being spoken, and you record them.
If you want pre-existing keywords, then they have to encompass all regional variations, gender and age profiles, all noise profiles and microphone responses, and the difference in final WER compared with a custom model built under the conditions of use is absolutely massive.

That is what I am saying: you can do what Raven does but create a highly noise-resilient, extremely accurate, lightweight KWS that scales to more voices without extra load.
You record approximately 20-40 KWs in a web GUI that merely asks you to emphasise a pause between words, then silence-strips the spoken recording into 1-second samples.
They are then augmented with pitch, tempo, volume and padding to create small variations and so a larger number of samples, because the 10% validation and testing parts of the dataset expect a minimum of 100 each, so we need a minimum of 1000; this is automated from the initial 20-40 words extracted from a recorded sentence (the different noise samples mixed in greatly vary the samples and create noise resilience).
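As a rough sketch of that silence-strip step with pydub (file names and thresholds are assumptions you would tune per mic):

```python
# Rough sketch: split a recorded keyword session on silence and pad/trim each
# word to exactly 1 second. File names and thresholds are assumptions to tune.
from pathlib import Path
from pydub import AudioSegment
from pydub.silence import split_on_silence

recording = AudioSegment.from_wav("kw_session.wav")

chunks = split_on_silence(
    recording,
    min_silence_len=300,                  # ms of pause between spoken keywords
    silence_thresh=recording.dBFS - 16,   # threshold relative to average level
    keep_silence=100,                     # keep a little leading/trailing context
)

out_dir = Path("kw")
out_dir.mkdir(exist_ok=True)

for i, chunk in enumerate(chunks):
    base = AudioSegment.silent(duration=1000, frame_rate=recording.frame_rate)
    clip = base.overlay(chunk[:1000])     # pad short words / trim long ones to 1 s
    clip.export(str(out_dir / f"kw_{i:03d}.wav"), format="wav")
```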

It's also better to record twice, @near & @far (usually 0.3m & 3m from your mic, or whatever the far sensitivity allows), so the natural room reverberation and proximity effects are captured for more accuracy; but you can just do near, as the accuracy increase is only fractional compared to the huge increase from a model based on the voices of actual use.

For TensorFlow, TensorFlow give examples; in fact they have created an MFCC front end for microcontrollers specifically for KWS, whilst Atomic just uses the casual introductory tutorial on audio spectrograms.

It's documented here, and I have now posted this a plethora of times, even to have it deleted.

TensorFlow Lite for Microcontrollers shows, step by step, the simple procedure for taking a tflite model: Get started with microcontrollers

The info and procedures have been there for 2 years, but the recent introduction of the audio "frontend" TensorFlow operations for feature generation was obviously a stumbling block, so they now provide the microcontroller code.

From code to tutorials to training framework, it's all provided 100% by Google, and it's state-of-the-art; all it needs is a web GUI to record and review KW, !KW & silence.

It's so simple that a brain-damaged MS sufferer with no Python experience can hack together extremely accurate working models, so surely any Rhasspy dev can automate it, as example proofs of concept have been forwarded by a community member above.

If you cannot work it out for yourself I can help and tell you how, as maybe my confusion is why this hasn't been implemented, and why code is just being copied and rebranded whilst the likes of Google have already provided it open source.

You just have to read what they have provided, as I have done, and keep posting here to no avail.

The more custom a model gets, from all voices, to native speakers, to a regional accent, to your own voice, the bigger the steps in accuracy, as the model becomes more defined. The reality is these models are little more than highly evolved tensor classification filters, and less cross entropy in the dataset means more clear-cut classification and higher accuracy; it's that simple.

What is so confusing about sox -m noise.mp3 voice.wav mixed.flac in terms of a dev providing a solution?
All I am stressing is that the noise samples need to be volume-matched to the dataset samples, otherwise the noise may become the foreground audio, turning the dataset sample into a garbage entry.
Check volumes and adjust before mixing; this is just simple code that has been missing for some time (always mix in noise at a lower volume than the sample, or the sample will be drowned out), as the Precise training methodology is broken and introduces large proportions of garbage and cross entropy.
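As a sketch of that volume check in plain numpy/soundfile terms (the -12 dB offset is just an assumed safe margin, and mono files are assumed):

```python
# Sketch: scale the noise well below the keyword sample before mixing so the
# noise never becomes the foreground. Mono files and a -12 dB offset assumed.
import numpy as np
import soundfile as sf

voice, sr = sf.read("voice.wav")
noise, noise_sr = sf.read("noise.wav")
assert sr == noise_sr, "resample the noise to the voice sample rate first"

# Loop/trim the noise to match the voice length
if len(noise) < len(voice):
    noise = np.tile(noise, int(np.ceil(len(voice) / len(noise))))
noise = noise[: len(voice)]

def rms(x):
    return float(np.sqrt(np.mean(np.square(x)))) + 1e-9

# Scale the noise so its RMS sits about 12 dB below the voice RMS
target_noise_rms = rms(voice) * 10 ** (-12 / 20)
mixed = voice + noise * (target_noise_rms / rms(noise))

sf.write("mixed.wav", np.clip(mixed, -1.0, 1.0), sr)
```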

It shouldn't end there: because we are on a local, private voice AI we should be able to easily capture KW and command sentences to create datasets of use, and a KWS that ships out models OTA which increase in accuracy through use, by using the recorded samples of use automatically in a two-stage, firmware-like delivery.

I just got a like on this topic and saw your reply.
@synesthesiam The new federated learning additions are really important, as a new keyword always needs to be trained from scratch because you are pattern matching on a keyword, which is far more accurate (or should be) than phonetic-dictionary-based ASR.

So it doesn't solve that, but whereas before, if you wanted to add to the dataset, you always had to have the full dataset and retrain, now with federated learning you can train a main model and then have a smaller local model, specific to the user and device hardware, that alters the weights of the main model.
So you can increase accuracy for a user and device, for accent and for the tonal character of how the device is used.

I haven't worked it out yet, but I know from the new Pixel phones that it is what they use, to such effect that their models are now offline.
The new Tensor TPU and cutting-edge phone obviously have some grunt, but currently Google's mobile models seem to be providing leading-edge ASR & KWS and a few other great AI-enhanced methods, as well as what has become normal in HDR photography.