Mycroft Precise - Installation and Use

That mostly happens to me with overtraining.
I also think my strategy is slightly different. I don't have any data in not-wake-word or test/not-wake-word when I train; I only have the data in the noise folder.
That's why I only train for 50-100 epochs in the beginning to get a starting point, but the resulting model will listen to anything, as it's trained without any non-wake-word data.
All the real training in my case happens in the incremental part against the randoms folder.
I also start from scratch with a fresh model every time I add data, as I found that continuing training with new data gave me worse results. It probably took me 15 complete training runs to get a good model in the beginning. I just train on a spare Raspberry Pi 4 and let it sit there for a few hours each time to do its thing while I do other things.

Just for informational value:
Right now I train against 1642 one-minute files of random audio.
That's about 27 hours.
My first usable models were probably trained on about 5 hours in the beginning.
I use about 100 recorded wake word samples, but I also duplicate each sample with random background noise added from the 1642 randoms.
All in all, a robust model takes 2000-3000 epochs of training with this.

I find that having not-wake-word data helps quite a lot. With just 200 epochs of training I have a model that does not react to quite a bit of what I say. I recorded myself saying random words and put about 10 of them into the folder. The end result was much better than training incrementally with data collected from music, random noises and a few recordings of sentences I had lying around.

As of now my model still reacts to me saying longer words, but with just a few samples it no longer reacts at all to short words or words similar to the samples, while still reacting to my wake word.

I plan on adding random audio of me talking on voice chats into data/random once I get around to recording it, but just for training it not to react to everything I say, not-wake-word data is great.

A factor in why this works so well for me, instead of using random data I did not record, might be that my Rhasspy mic has permanent noise in every recording, while my random data is noise and music in clear quality.

I use a lot of 5 minute recordings I made during normal household activities with the mics that, in my case, voice2json is running on. So I recorded hours of us watching television, cooking, vacuuming or talking.
I also really recommend using audio from videos on YouTube, like one hour of coffee shop or bar noises.
There are a lot of them. You can use youtube-dl to download just the audio and then convert and chop it with sox.
It's just an easy way to get more random audio bits with lots of variety.
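
A minimal sketch of that pipeline, assuming youtube-dl (with ffmpeg) and sox are installed; the filenames are just examples:

# download only the audio track of a video as wav
youtube-dl -x --audio-format wav -o "coffeeshop.%(ext)s" "https://www.youtube.com/watch?v=..."

# convert to the 16 kHz, 16 bit mono format Precise expects
sox coffeeshop.wav -r 16000 -c 1 -b 16 coffeeshop-16k.wav

# chop it into one minute chunks (sox numbers the output files)
sox coffeeshop-16k.wav coffeeshop-chunk.wav trim 0 60 : newfile : restart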

I am still in the collection phase; I just started recording audio yesterday after fighting with VMs and Precise for a week. I do plan on recording everyday sounds once I actually do more than watch videos on YouTube (I did record 30 min of that already, with various videos). I have to be somewhat careful about what I put in as random data because my wake word of choice is “computer” and I don't want that in the random data.

Is there any difference with the length of the random data? For now I just chucked 10 min+ recordings in there and they run through at a pretty decent speed. Are they actually handled differently when they are shorter, or is it just the time factor on the Pi 4 that has you keeping them short?

On the Pi it's also a memory factor, as it can actually crash with files that are too big during incremental training, but the biggest thing I noticed was that shorter files also gave me better results.
When using longer files, what I found was that it would train a lot on the first few long files during the incremental training, and then the model would be just good enough to not trigger false positives and would skip a lot of the random data.
By training on shorter random files I found it would train on much more variety of my random audio, as it would only train a limited amount on each file. This is also because the order in which precise-incremental uses the noise data to train on is random. So this resulted in a lot more epochs in which the model was trained on a bigger spread of my noise audio, and this way also a more robust model.
For example, training a model on the same random data in five minute chunks gave me below 1500 epochs, while splitting the same audio into one minute chunks gave me over 2000 and a more robust model.
Another factor is that I use those one minute chunks to add noise to duplicates of my wake word recordings, and as I do this with random pieces of audio from the noise folder with my own script (the precise-add-noise command didn't work well for me), I get a better spread in my noisy wake words and in what noise is added this way.
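
A rough sketch of what such a duplication step could look like with plain sox (this is an illustration, not the poster's actual script; the folder names and the quarter noise volume are just assumptions):

#!/bin/sh
# for every wake word sample, create a noisy duplicate by mixing in a
# random one-minute chunk from the noise folder
for f in wake-word/*.wav; do
	noise=$(shuf -n 1 -e noise/*.wav)    # pick a random noise chunk
	len=$(soxi -D "$f")                  # duration of the sample in seconds
	name=$(basename "$f" .wav)
	# sample at full volume, noise at a quarter volume, cut to sample length
	sox -m -v 1.0 "$f" -v 0.25 "$noise" "wake-word/$name-noisy.wav" trim 0 "$len"
done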

Zram and extending the dphys-swapfile will probably help much. Training on a PC or in a Colab notebook is likely much less painful, and I have wondered why we don't have a Colab notebook, as the free GPU access is pretty handy and you just upload your dataset to your Google Drive.

I am not a Precise fan and will never use it, so I haven't scripted a notebook, but for those who are, I have wondered: why not?
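
For anyone who wants to try, a sketch of what the shell cells of such a notebook might run (prefixed with ! in Colab; the paths are just examples, and the dataset is assumed to already be on Drive, mounted via a separate Python cell):

git clone https://github.com/MycroftAI/mycroft-precise
cd mycroft-precise
./setup.sh
. .venv/bin/activate
precise-train -e 300 computer.net /content/drive/MyDrive/precise-data/computer/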

Also, inputting audio in 5 minute chunks, if I am reading this right, is not going to produce a very efficient model unless your scripts further break it down into shorter pieces.

The image you feed sets the initial shape of your model and the number of parameters your model copes with.
Often with KWS and one second windows the time scale is 30-40 frames x 13 MFCC coefficients, which sets up the feature extraction for labels and the resultant model parameters.
Hopefully I am reading this wrong, but you're not inputting 5 minute images?!?

No, in incremental training mode the model so far is run against the chunks of random audio, which can in theory be any length.
Then, if a false positive is detected, the piece of audio that triggered it is saved to the training folder, and after a certain number of false positives this data is what gets trained against.
So the pieces used to train against are only very short bursts (less than a second mostly) that are saved out of the longer audio.
Training is never done directly with the random noise folder.
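
In command form, the flow described above looks roughly like this (model and folder names are examples; I believe -r is the flag for the random audio folder, defaulting to data/random):

# quick initial model from the recorded samples
precise-train -e 100 computer.net computer/
# run it against the random audio; false positives are saved as short
# not-wake-word clips and periodically trained against
precise-train-incremental computer.net computer/ -r data/random/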

Oh, so it's just a feed to grab further dataset images. I get it.

Not that I use Precise, but guessing from Keras, you don't need to run to thousands of epochs at the start.
The gains you make from approx 50 epochs onwards are minimal but create the final best model.
You can do relatively accurate testing on much less and then just do a final long run.

This might already be done, but what I also do is feed my dataset into a short epoch model and delete the false positives/negatives, as what a model thinks is a bad sample versus what a human can hear seems a mystery.
I often get a quick 3-5% accuracy gain from doing that alone.
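
A sketch of that pruning pass with Precise's own tools, assuming precise-test reports the misclassified files (names are examples):

# quick, cheap model just for scoring the dataset
precise-train -e 50 computer.net computer/
# test it against the dataset and note the misclassified samples
precise-test computer.net computer/
# listen to the reported files, delete the genuinely bad ones,
# then retrain from scratch on the cleaned dataset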

I am starting to look at KWS again, as a few things should become available.

All just hacks, but the ASR datasets are huge and just voice, so when you mix in noise you will always get a consistent SNR, one you control. With audio from videos you have no idea what is noise.
I did use Deepspeech to get the timing transcripts, but the timings are lousy in accuracy; I'm thinking that plus VAD could grab much cleaner word files.
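
For the controlled SNR part, a minimal sketch with sox, scaling the noise to the voice file's RMS level first (the 10 dB target and the filenames are just examples):

#!/bin/sh
snr_db=10
# RMS amplitude of each file (sox stat prints to stderr)
v_rms=$(sox voice.wav -n stat 2>&1 | awk '/RMS.*amplitude/ {print $3}')
n_rms=$(sox noise.wav -n stat 2>&1 | awk '/RMS.*amplitude/ {print $3}')
# gain that puts the noise snr_db below the voice: (v/n) / 10^(snr/20)
gain=$(echo "$v_rms / $n_rms / e(l(10) * $snr_db / 20)" | bc -l)
sox -m -v 1.0 voice.wav -v "$gain" noise.wav mixed.wav trim 0 "$(soxi -D voice.wav)"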

With Precise, as I use it right now, I need something between 200 and 500 epochs to get the accuracy up to nearly 1 with my initial data. After that I let it run incremental training until it runs out of data to train against (I don't have much collected yet, so I run out pretty quickly). But since my PC runs through the normal training pretty fast, even though I can't use my GPU and I use tensorflow without AVX2 instructions, I just let it run to 1000, because right now it is literally a difference of a few seconds.

Okay, so I should probably try splitting my files into shorter ones, at least the non-music files. In the music files I trained against so far it only triggered on a few specific notes, not even on any speech, and triggered on such a note 10 times per file, so I don't think I will split up music files into chunks.

Since I am not well versed in Linux at all, do you know of a script or a command that will split a file into one minute chunks while keeping the name and numbering them? I am pretty sure I could find something workable on Google, but since you are already doing this, you might be willing to share how, and it would then be collected in this thread, which I think is what most people with next to no knowledge about how to train a model will use as a reference.

You can use a simple bash command with sox for that:

for f in *.wav; do sox "$f" "split.$f" trim 0 60 : newfile : restart ; done

This will split all wav files in the current folder into 1 minute chunks. It will prefix "split." to each name and sox will number the chunks automatically.


@JGKK I have been using pysox, as more complex scripts in Python are easier for me than bash.

https://pysox.readthedocs.io/en/latest/

Bash is just really good for small automations like that. Especially for renaming, moving around files or manipulating them en masse, I find a little bit of bash very helpful.

Yeah, it's usually my first port of call. I did some audio manipulation to create that ‘hey marvin’ dataset; it started off in bash but turned into bash spaghetti.
I used the above and am just posting as I found it excellent, especially when it comes to multi line sox commands, as the sox CLI documentation is bad for me but the API of pysox is really easy.

Its in https://github.com/StuartIanNaylor/crispy-succotash/blob/main/heymarvin.py

Terrible hack stuff, but it's really easy to hack something together; the above might be good as an example.
PS: I used the silence command, as the vad one with sox seems pretty poor.


This does the job, but I decided to go over the top a bit with my limited shell script skills. I did not like the fact that it tried to split files that were less than a minute, and the naming wasn't what I wanted. This is the result:

#!/bin/sh

if [ -n "$1" ]; then
	SOURCE_DIR=$1
else
	SOURCE_DIR=.
fi

if [ -n "$2" ]; then
	DEST_DIR=$2
else
	DEST_DIR=split
fi

if [ ! -d "$DEST_DIR" ]; then
	mkdir -p "$DEST_DIR"
fi

for f in "$SOURCE_DIR"/*.wav; do
	echo "$f"
	length=$(sox --info -D "$f")
	name=$(basename "$f" .wav)
	if [ "$(echo "$length > 60" | bc -q)" = 1 ]; then
		# sox puts the chunk number before the extension, so
		# "$name..wav" becomes NAME.001.wav, NAME.002.wav, ...
		sox "$f" "$DEST_DIR/$name..wav" trim 0 60 : newfile : restart
	else
		cp "$f" "$DEST_DIR"
	fi
done

It takes the source folder as the first argument and the destination folder as the second. Files longer than a minute are split and saved into the destination folder as FILENAME.NUMBER.wav. Shorter files are just copied over.
