Wake word creation -- Snowman anyone?

Porcupine looked great, but retraining every 30 days, paying $400 per month until I die, or using some very lame generic wake word… Nah, sorry, not an option.

However, now that kitt.ai no longer resolves and is thus gone forever, I looked for an alternative and found snowman.

Has anyone played with this?

Privacy for me is my right.
I have nothing to hide but nothing to share either.


Google and Arm have been playing with state-of-the-art KWS and publishing it as open source; it was kick-started with GitHub - ARM-software/ML-KWS-for-MCU: Keyword spotting on Arm Cortex-M Microcontrollers.
Google did some comparisons looking at low-latency streaming KWS with google-research/kws_streaming at master · google-research/google-research · GitHub, which Arm now also seems to be using as a base: GitHub - ARM-software/keyword-transformer.

A 20 ms streaming KWS model will run at less than 20% load on a single core with state-of-the-art accuracy. The old adage still applies, though: you need a dataset for a custom KW. I have a little utility to quickly create an augmented dataset, so you can create any custom KW for your voice, or anyone else's you wish to add to the mix.

Reader.py just shows a series of words on screen, based on your KW plus a few sentences containing as many phones & allophones as possible, and records each as an individual 2.5 sec wav, so you have a bit of reading time to err.
You then just run split0.py or split1.py, which refer to the split_data methods of the above Google-research training routine.
split0.py uses my method of mixing and matching background noise and creates fixed training, testing & validation folders for dataset split type 0 of G-kws; split1.py doesn't mix in any noise, as that is handled by G-kws, and produces KW & !KW folders, again for training by G-kws.
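For anyone curious, here is a minimal sketch of what the recording step amounts to, assuming the sounddevice and soundfile Python libraries; the word list, file names and prompt flow are illustrative, not Reader.py's actual code.

# Rough sketch of a Reader.py-style recorder: show a word, record a
# fixed 2.5 sec clip, save it as a 16 kHz mono wav. Illustrative only.
import sounddevice as sd
import soundfile as sf

RATE, SECONDS = 16000, 2.5
# Your KW repeated, plus sentences covering plenty of phones/allophones
words = ["kw0"] * 10 + ["the quick brown fox jumps over the lazy dog"]

for i, word in enumerate(words):
    input(f"[{i}] Press Enter, then say: {word}")
    audio = sd.rec(int(RATE * SECONDS), samplerate=RATE, channels=1)
    sd.wait()  # block until the 2.5 sec recording completes
    sf.write(f"rec_{i:03d}.wav", audio, RATE)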

G-kws is kicked off with a script of settings, and then it's just a case of waiting for it to complete:

#!/bin/bash

# Train KWT on Speech commands v2 with 12 labels

KWS_PATH=$PWD
DATA_PATH=/home/stuart/Dataset-builder/dataset
MODELS_PATH=$KWS_PATH/models2
CMD_TRAIN="python -m kws_streaming.train.model_train_eval"
export TF_ENABLE_ONEDNN_OPTS=1
#PATH="/usr/local/cuda/bin:$PATH"

$CMD_TRAIN \
--data_url '' \
--data_dir $DATA_PATH/ \
--train_dir $MODELS_PATH/crnn_state/ \
--mel_upper_edge_hertz 7600 \
--how_many_training_steps 2000,2000,2000,2000 \
--learning_rate 0.001,0.0005,0.0001,0.00002 \
--window_size_ms 40.0 \
--window_stride_ms 20.0 \
--mel_num_bins 40 \
--dct_num_features 20 \
--resample 0.0 \
--alsologtostderr \
--wanted_words silence,notkw,kw0 \
--split_data 0 \
--train 1 \
--lr_schedule 'exp' \
--use_spec_augment 1 \
--time_masks_number 2 \
--time_mask_max_size 10 \
--frequency_masks_number 2 \
--frequency_mask_max_size 5 \
--feature_type 'mfcc_op' \
--fft_magnitude_squared 1 \
crnn \
--cnn_filters '16,16' \
--cnn_kernel_size '(3,3),(5,3)' \
--cnn_act "'relu','relu'" \
--cnn_dilation_rate '(1,1),(1,1)' \
--cnn_strides '(1,1),(1,1)' \
--gru_units 256 \
--return_sequences 0 \
--dropout1 0.5 \
--units1 '128,256' \
--act1 "'linear','relu'" \
--stateful 1

Took me a while to work out that you need to up the dropout to cope with overfitting and false positives; the dropout1 0.5 I use seems to work well.

Sanebow has done a great wrapper in GitHub - SaneBow/tflite-kws: Keyword Spotting (KWS) API wrapper for TFLite streaming models.
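If you want a feel for what such a wrapper does underneath, here is a hedged sketch of driving a streaming KWS .tflite model with the plain TFLite interpreter. It assumes a model exported with internal streaming state (one audio frame per invoke); the path, shapes, label order and threshold are illustrative.

# Sketch: feed 20 ms frames to a streaming KWS tflite model one by one.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="stream_state_internal.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

frame_len = inp["shape"][-1]          # e.g. 320 samples = 20 ms @ 16 kHz
labels = ["silence", "notkw", "kw0"]  # must match your --wanted_words order

def detect(audio):  # audio: float32 mono 16 kHz numpy array
    for i in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[i:i + frame_len].reshape(inp["shape"]).astype(np.float32)
        interpreter.set_tensor(inp["index"], frame)
        interpreter.invoke()
        scores = interpreter.get_tensor(out["index"])[0]
        if labels[int(np.argmax(scores))] == "kw0" and scores.max() > 0.9:
            return True
    return False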

The Dataset-builder has quite an exhaustive list of datasets, from noise to keywords; as opposed to ASR sentence datasets, word datasets can be quite hard to find.

I should update g-kws, but it's really only an install guide for Arm, as it's the Google-research framework, which for some reason keeps everything in one repo, so:

mkdir g-kws
cd g-kws

git clone https://github.com/google-research/google-research.git
mv google-research/kws_streaming .

You can delete the rest of the google-research stuff after that and just keep the important kws_streaming dir.

@rolyan_trauts Does it generate Snowboy models or any other models supported by Rhasspy?

Hi @greg_dickson, welcome to the community. I’ve packaged up a snowboy wake word creator that you can use to train your own Snowboy wake words.

In a future version of Rhasspy, yes :slight_smile: The client code is functional, but the training needs to be done via command-line as @rolyan_trauts described. With his dataset builder, you can get very good KWS accuracy with only a few minutes of recording time.

My plan is to create a training Docker image that will use the dataset build and Tensorflow to generate a TFLite KWS model, which you can then load into the new Rhasspy KWS system.

No, it doesn't generate models; it generates datasets. But why use an obsolete, already-implemented KWS?

Dataset generator will do it.

@synesthesiam I need to get someone to do sentences to read in other languages; my horrid hacks are just proof of concept and need someone like yourself with more polish.


Er, because the user wants to use it with Rhasspy, I assume.
As long as it is not (yet) supported by Rhasspy, it is useless for the user.

@synesthesiam whilst you're about: I have been doing some rethinking about beamforming, as the pulseaudio beamforming does actually work; you just cannot auto-steer it.

But I am thinking that, because of the low load of the Google KWS models, you can run multiple instances off the same model, as the model is just loaded up into TensorFlow.

Pulseaudio is actually great at loading and unloading modules on the fly, and can even switch playing streams between sources, sinks and clients.
You can use the streaming envelope of the KWS to grab a section of input, feed it to a TDOA, and then load the beamformer with a target.
The rationale for two instances is that one is fed from that target, with a fallback of no beamforming as people can move, and the other is a wide-range fallback.
You can expose it to alsa with the pulse plugin, so for alsa it will all just be seamless.
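As a very rough sketch of the auto-steering idea, assuming the webrtc beamforming options that older PulseAudio/webrtc-audio-processing builds exposed; the mic_geometry and target_direction values here are purely illustrative:

# Reload PulseAudio's echo-cancel module pointed at a new target angle.
# Illustrative only: the beamforming options were dropped from newer
# webrtc-audio-processing releases.
import subprocess

def steer(azimuth_rad, module_index=None):
    if module_index is not None:  # unload the old beamformer first
        subprocess.run(["pactl", "unload-module", str(module_index)], check=True)
    aec_args = ("beamforming=1 mic_geometry=-0.03,0,0,0.03,0,0 "
                f"target_direction={azimuth_rad},0,0")
    result = subprocess.run(
        ["pactl", "load-module", "module-echo-cancel",
         "aec_method=webrtc", f"aec_args={aec_args}"],
        check=True, capture_output=True, text=True)
    return int(result.stdout.strip())  # pactl prints the new module index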

So the beamforming thing could be done which has always been a bugbear.

@romkabouter As two of us have already said, it can provide datasets for any KW quite quickly; it doesn't have to be supported by Rhasspy, as it's a dataset…

@synesthesiam It shouldn't be 'bemused' but 'voltron rover': client/server, external to Rhasspy, and also capable of feeding any ASR system.

So help me out here, exactly how is this user going to be able to use it with the current version of Rhasspy?

That is why I am talking to @synesthesiam, as with your level of understanding it is obviously pointless.

Please try and keep discussions civil.

For most users, I expect they will need a KWS system that is well integrated into Rhasspy – where they can just pick it from a list and click a few buttons. This is always the goal, as most people are not Linux command-line ninjas.

But it’s also fair to point out that there are great options outside of what’s currently supported in Rhasspy. And since @greg_dickson didn’t hint at their level of technical expertise, it’s possible that something like gKWS could be just what they’re looking for.


There is absolutely no point in this rude answer, man.

A question is asked from a USER perspective, and you post about something not supported by Rhasspy.
That, imo, is the really pointless thing here.

If you cannot even read what the user is actually asking, please do not respond with this KWS nonsense.

@synesthesiam
That is what I don’t get, as Rhasspy is web-based, so it doesn’t need to be forced into the horrid mess of Hermes audio to be part of the web control.
Keeping to the highly integrated Rhasspy protocol of a few users just negates what could be a more interoperable and modular system, one with the advantage of being used by a much larger herd.

G-kws is by Google, and the same datasets can run on an ESP32; that is another reason I keep mentioning it. It is pointless to keep supporting a whole load of KWS engines that all have hardware specifics, whilst a single framework can run from singular datasets and work on all hardware.

It’s the datasets that are important, and something we can do quite easily. Just as forcing highly integrated code makes a common function specific to a singular platform and a small number of users, offering a rake of alternatives makes those pools even smaller, which is poorer for the user.

Thanks guys. Sorry to trigger an argument. Yes, everyone has their own path.
My level of technical understanding is pretty high; however, I am new to voice and machine learning. We all have our own path to learning and our own passions. No one was born with the knowledge.
My goal would be to have a basic wake word that eventually would allow a voice-activated introduction.
It would have general but lowish accuracy at first, plus a training trigger so a new user could train it to their voice.
I thank you all for your input and will get a better understanding as I travel down this rabbit hole.
Thanks again,
Greg


This argument is nothing new and happens in quite a few threads concerning wakewords. Don’t be bothered by having started it again.

As for the original topic: since you said you have the technical understanding, I suggest training your own model. One starting point I can suggest is using Raven for a while; it gives pretty decent recognition, but its strongest point is saving the wakewords if you configure it right. That will help you get a small dataset started, and you can supplement it with the dataset builder mentioned above.

Once you have a dataset to play with, you can go the more involved route of a Google TensorFlow model like @rolyan_trauts advertises. This has the advantage of being lightweight and portable, since it also works on phones, so the resulting model might even work with the Rhasspy mobile app, but with the disadvantage of not being supported by Rhasspy (yet, if I read @synesthesiam’s post right). The other way would be to play around with Mycroft Precise, which can be used by Rhasspy, but the scripts are outdated and badly maintained, so if you want to use it, it is of the utmost importance to use an outdated Linux to play around with (I use an Ubuntu 18.04 VM; anything newer breaks stuff). I advise looking up the training thread in this forum; I remember correcting the guide in its first post to account for all the problems resulting from the bad maintenance. There is a fork/pull request that claims to be a working updated version, and while I could use it without encountering errors, the model the conversion script returned did not work with Rhasspy. The script might be broken, the resulting model too new, or I might have messed up, but I advise not putting too much time into that fork.

Whatever way you go, building a good dataset is always a good thing to do. To further augment my own dataset I built myself a python script that reads the audio buffer from the mqtt server and saves it once the wakeword is detected, so every activation augments the dataset. If you want to look it up you can find it here, but I am pretty sure it could be done better. The disadvantage of mqtt is that the audio traffic has to go through mqtt, and the option of using udp unless the wakeword was detected is ruled out. It would be great if rhasspy could save all wakewords and not just raven, but for now we have to make do.
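Not @Daenara’s actual script, but a minimal sketch of the same idea, assuming the standard Hermes MQTT topics; the buffer size, output path and naive chunk concatenation are illustrative:

# Keep a rolling buffer of Hermes audio frames from MQTT and dump it
# to disk when a wakeword detection arrives. Illustrative sketch only.
import collections, time, pathlib
import paho.mqtt.client as mqtt

frames = collections.deque(maxlen=100)  # roughly a few seconds of chunks
out_dir = pathlib.Path("captured_kw"); out_dir.mkdir(exist_ok=True)

def on_message(client, userdata, msg):
    if msg.topic.endswith("/audioFrame"):
        frames.append(msg.payload)       # each payload is a small WAV chunk
    elif msg.topic.endswith("/detected"):
        raw = b"".join(frames)           # naive: concatenated WAV chunks
        (out_dir / f"kw_{int(time.time())}.raw").write_bytes(raw)

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe([("hermes/audioServer/+/audioFrame", 0),
                  ("hermes/hotword/+/detected", 0)])
client.loop_forever()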


Raven uses a quite old best-fit type of algorithm that is not very accurate and pretty much instantly hits a ceiling in terms of accuracy, as adding more recordings adds load.

What it is great for is being able to record just a few KW samples and go, and then using it to store the KW & !KW of actual use.
Currently it can store KW, but unfortunately doesn’t keep the following command sentence, which could be used for !KW. Maybe the script @Daenara mentions could supplement that.

Most models based on custom datasets can provide really high accuracy; it’s the quest to provide a single model and KWS that fits all, like the commercial versions, that is probably beyond a small community.

It’s all about your dataset having enough variation to stop false negatives, and understanding that the dataset is the ultimate decider on accuracy. It can make huge differences: model technologies have moved from the early DNN of, say, Snowboy at approx 87% accuracy to the latest and greatest transformer models pushing 98%, but all will still return dross with a bad dataset.

GitHub - linto-ai/linto-desktoptools-hmg: GUI Tool to create, manage and test Keyword Spotting models using TF 2.0 was the best visual introduction for me, as it runs through the dataset and provides click links to the sample items producing false positives/negatives. It wasn’t much practical use, but it was extremely educational.

I champion the google-kws framework because it comes from a technical source far beyond any of our capabilities and is a working framework that just leaves you to provide a dataset.
TensorFlow uses a statically compiled model, as opposed to PyTorch’s dynamic model; PyTorch gets far more research work done with it because, unlike TensorFlow, new functionality doesn’t need to be hardcoded into layers in C.
Because TensorFlow is a static model it seems to produce smaller, more efficient models, and because Google & Google Research have been specifically focussing on low-end embedded hardware for KWS, for me it makes sense to use what they have released as open source.
From an ESP32 to a Cuda-GPU-armed x86 machine, the G-kws framework can produce a working model from the same dataset, with various model options.

The CRNN streaming model they mention in the framework docs at google-research/kws_experiments_paper_12_labels.md at master · google-research/google-research · GitHub seems to me the best bet for Arm64.
That is another consideration: really, any project that is heavily AI-based should be 64-bit, because a 64-bit core can operate on twice as many 16-bit tensors per clock tick, which results in real-world 2-3x speed improvements just by starting with a different image over Arm32.
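On the ESP32 point: kws_streaming has its own export path, but as a generic hedged sketch, the underlying step for microcontroller targets is TensorFlow post-training int8 quantization, roughly like this (the saved-model path and input shape are assumptions):

# Generic post-training int8 quantization for microcontroller targets.
import numpy as np
import tensorflow as tf

def representative_data():
    # Use real audio batches from your dataset here; random data is
    # only a placeholder for the calibration step.
    for _ in range(100):
        yield [np.random.randn(1, 16000).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("models2/crnn_state/non_stream")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
open("kws_int8.tflite", "wb").write(converter.convert())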

If you are going to create a custom dataset and custom KW, always record on the mic, and ideally the device, you will actually use, and you’re going the right way.

I think, because voice AI is by nature idle the majority of the time, it’s very easy to collate a usage dataset, train when idle, and provide a constant flow of model updates with accuracy improved through use.

I don’t think any KWS system should be constrained by the protocols and requirements of an ASR/STT/NLU system; it is a project that can feed all of them.
It just needs a Rhasspy connector module to convert an audio stream and metadata, so if you wish to stream audio over MQTT you can, whether it’s a bad idea or not (bad, but hey).
The overall opensource voice-AI herd is already far too diluted into small pools of specific systems doing the same task, with far too small communities repeating and redeveloping the same thing, where the only real unique feature is the branding applied.

I would have a look at Dataset-builder, as it merely uses sox to provide pitch, tempo and reverb augmentation, creating multiple dataset samples from a single recorded sample. That addresses the main problem for KWS: we have a huge wealth of recorded sentence datasets, but very few word-based ones for KW systems.
What I have done is concentrate on matching and mixing in background noise at high levels whilst always leaving a predominant foreground sample, but really the inbuilt augmentation of g-kws doesn’t fare much worse. Any dataset builder might be of interest, as recording and augmenting loads of samples is just a pain.
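For a flavour of what the sox augmentation amounts to, a hedged sketch; the parameter ranges are illustrative, not Dataset-builder’s actual values:

# Augment one recorded keyword sample with sox pitch/tempo/reverb jitter.
import random, subprocess, pathlib

src = pathlib.Path("kw0/sample.wav")
for i in range(10):
    pitch = random.randint(-200, 200)             # pitch shift in cents
    tempo = round(random.uniform(0.85, 1.15), 2)  # tempo factor
    reverb = random.randint(0, 40)                # reverberance percent
    dst = src.with_name(f"{src.stem}_aug{i}.wav")
    subprocess.run(
        ["sox", str(src), str(dst),
         "pitch", str(pitch),
         "tempo", str(tempo),
         "reverb", str(reverb)],
        check=True,
    )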


This looks like a nice, clean, simple, forward-looking solution.
It just needs work.
I will work with that.
I don’t plan to delve too deeply into this rabbit hole, but I will help where I can. Also, not wishing to start another argument, I will just say I am not a python coder. But once again, I will do what I can.
Integration seems to be key here.

Ok, just to be clear here, after a very brief look at the system:
If I choose local command in the wake word setup,
I write an executable that simply runs until it gets a valid wake trigger, then fully exits.

So rhasspy will start up, set itself up, run its web interface, then run my executable and effectively sleep (maintaining the web interface).
When my executable exits, rhasspy will then move to the next step, i.e. start the recording…

So any wake word / trigger executable could easily be incorporated into the system.

Have I got that correct?
That seems a little too easy. :smiley:
I must have something wrong…

Never tried it, but it looks like it, yes.

Someone will have to prompt me on how a stream on stdin should be handled, as I’ve just never done it.

import sys
data = sys.stdin.readline()

?

wakewordId to standard out and exit
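Putting those two replies together, a minimal sketch of a local-command wake program, assuming Rhasspy pipes raw 16-bit 16 kHz mono PCM to stdin (see the docs link below); detect() is a placeholder for whatever KWS you use:

#!/usr/bin/env python3
# Read raw PCM from stdin, print the wakewordId on detection, then exit.
import sys
import numpy as np

CHUNK = 2048  # bytes per read; 1024 16-bit samples

def detect(samples):
    # Placeholder: feed samples to your KWS model here.
    return False

stdin = sys.stdin.buffer
while True:
    chunk = stdin.read(CHUNK)
    if not chunk:
        break
    samples = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0
    if detect(samples):
        print("kw0", flush=True)  # wakewordId to standard out
        sys.exit(0)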

Thanks Rolyan.
Yes, after a little RTFM:
https://rhasspy.readthedocs.io/en/latest/wake-word/#command
Also, after looking at sleep.sh, it appears to be that easy.
jq tricked me for a moment, but for those that follow:

aptitude show jq

Package: jq
Version: 1.5+dfsg-2
State: not installed
Multi-Arch: foreign
Priority: optional
Section: universe/utils
Maintainer: Ubuntu Developers ubuntu-devel-discuss@lists.ubuntu.com
Architecture: amd64
Uncompressed Size: 90.1 k
Depends: libjq1 (= 1.5+dfsg-2), libc6 (>= 2.4)
Conflicts: jq:i386
Provides: jq:i386 (= 1.5+dfsg-2)
Provided by: jq:i386 (1.5+dfsg-2)
Description: lightweight and flexible command-line JSON processor
jq is like sed for JSON data – you can use it to slice and filter and map and transform
structured data with the same ease that sed, awk, grep and friends let you play with text.

It is written in portable C, and it has minimal runtime dependencies.

jq can mangle the data format that you have into the one that you want with very little effort,
and the program to do so is often shorter and simpler than you’d expect.
Homepage: GitHub - stedolan/jq: Command-line JSON processor

I am testing LinTO, thanks, it is a good project!!! How do I use the models it produces with Rhasspy or other external software?