Porcupine V1.8 Feature Tour

The main problem I have with Porcupine is that it only handles English phonemes. Pronouncing “Alexa” or “Jarvis” with a French accent does not work very well.

Oh and it is not “open source” either… :wink:

I can confirm that the new models don’t work with the Porcupine version included in Rhasspy 2.5.5.
Did you manage to get it working?
Or is an upgrade already planned for one of the next versions? :smiley:

I agree with @fastjack, Porcupine does not work well with a French accent.
So I trained mycroft-precise with my voice on the phrase “maitre yoda”, but it has too many false positives. And it only detects my voice, due to a lack of other voices in the training data.

Dunno haven’t tried 2.5.9 update as thought it was supposed to be.

You really need a lot of samples, so that it’s like a label in the Google speech commands set (2k+).

If you can record 50 samples, you can quickly multiply that by applying some pitch and padding changes with a tool like sox (see the sketch after this paragraph).
Then add noise to your samples, but you would have to check how Precise deals with noise, as that may cause a double dose.
But yeah, with only your voice it will only detect voices so similar that they might as well be yours (do you do impressions :slight_smile: ).
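A minimal sketch of that sox multiplication, assuming WAV samples in the current folder; the pitch shifts (in cents) and padding amounts are example values I picked, not anything a particular KWS prescribes:

#!/bin/bash
# Create pitch-shifted and padded copies of every sample in the
# current folder. Shift amounts and padding are illustrative only.

for f in *.wav
do
    for cents in -200 -100 100 200
    do
        sox "$f" "pitch${cents}.$f" pitch "$cents"
    done
    # Add 200 ms of silence before and after the sample.
    sox "$f" "pad.$f" pad 0.2 0.2
done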

You would be better off just picking a dataset such as https://drive.google.com/open?id=1-kWxcVYr1K9ube4MBKavGFO1CFSDAWVG, using the Google speech commands set for an equal amount of non-keywords, then adding a lot of your own voice so the model is weighted to you but not uniquely you.

It’s a real problem, as even in English, word datasets are much rarer than ASR sentence ones.

I am not a fan of Precise so I don’t use it, but the above is a rough guide to training any word model.

If you can, use all your words under their label with a simple dummy model, run them through, have the model delete the really low-scoring ones, then retrain with Precise.

French Audio Datasets

Nijmegen Corpus of Casual French: 35 hours of high-quality recordings featuring 46 French speakers conversing among friends, orthographically annotated by professional transcribers.

French Single Speaker Speech Dataset: CSS10 is a collection of single speaker speech datasets for 10 languages. Each of them consists of audio files recorded by a single volunteer and their aligned text sourced from LibriVox.

Traitement de Corpus Oraux en Français (TCOF): Over 500 transcriptions of 124 hours of spoken French. The corpus is divided into two main categories: adult-child interactions (children up to 7 years old) and records of interactions between adults.

VoxForge: Set up to collect transcribed speech for use in Open Source Speech Recognition Engines, VoxForge contains 37.5 hours of oral recordings of texts in French.


Ask another French speaker if they will record 20-50 samples of your KW; again you can multiply those with sox, but don’t weight the model too much towards them.

I am having a go with https://github.com/prosodylab/Prosodylab-Aligner to extract words from as many ASR datasets as possible into word folders, hopefully collecting a quantity of each word. (Yoda may be a problem :slight_smile: ) I bet if you scoured some vids you could create a collection, then it’s just ( Master? apols my French )
https://github.com/prosodylab/Prosodylab-Aligner is OK, but I think I need to combine it with a good VAD; maybe I can use the NeMo VAD for that: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/v0.11.0/voice_activity_detection/tutorial.html

Haven’t got around to implementing the VAD yet; the aligner does work, but deciding where to cut still needs work, and maybe the VAD will help.


Definitely do duplicate your set with added noise; otherwise you will get a model that only works in a quiet room and not when there is any background noise. Duplicating with added noise will give you a much more robust model. There is a tool included in Precise to do this, but it’s not great (call precise-add-noise --help from the venv for how to use it).
I personally use a simple bash script:

#!/bin/bash
# Mix a random noise file into every wake word sample in the current folder.
# Usage: ./addnoise.sh /path/to/noise/folder

NOISEDIR="$1"

for f in *.wav
do
    # Pick one noise file at random from the noise folder.
    NOISEFILE=$(find "$NOISEDIR" -type f | shuf -n 1)

    # Mix sample and noise, trimming the result to the sample's duration.
    sox -m "$f" "$NOISEFILE" "noise.$f" trim 0 "$(soxi -D "$f")"
done

Prepare a folder with lots of pieces of random noise (they should be longer than the wake words). Save the bash script as addnoise.sh to the folder that has your wake words. Run the script with the path to the noise folder as an argument. This will create a copy of every wake word sample with added random noise from the noise folder.
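For example, assuming your noise clips live in a hypothetical ~/noise folder:

cd ~/wakewords
./addnoise.sh ~/noise

This leaves a noise.<sample>.wav next to every <sample>.wav.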


I will let you describe the SNR levels you should use, as you use Precise and I’m sure you do.

I actually use them one to one, as in my case a lot of the noise and the wake word samples are recorded on the same microphone, so I get a realistic representation of how a noisy wake word will sound in that room.

That is not very representative of the noise levels most KWS can cope with; usually the best manage down to an SNR of about 5 dB.
So at best, normalize your noise 5 dB lower than your KW.
I just noticed you were not normalizing in your script, so you could have a noise file louder than the KW, and that is just plain bad.
The predominant signal should be your KW, especially if you’re adding an equal duplication again, or your accuracy will obviously plummet.

I am not a Mycroft fan and don’t go near it, but a good KW system would automate this, and it would also have steps, maybe 5 dB and 15 dB below that, to create different levels of noise and save you the hassle and headache.
If you can, record the noise you actually have, and pick noise files that might be common noise, but hey, noise is random.
Also, just for others: do not add the noise files that you use for mixing to your ‘not keyword’ samples.


Well, it has worked for me so far, but I do go through the created noisy samples and pick out the ones that just plain don’t make sense SNR-wise.

I am not sure what that means, but you just mixed two signals: one is signal and one is noise, hence signal-to-noise ratio.
So when you mix, you should first normalise all your KW samples (-1 to -3 dB is usual), then your noise samples 5 dB below those or more, and maybe do half at that level and the other half at -15 dB of whatever you set your KW samples to.
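For what it’s worth, a minimal sketch of that normalise-first step with sox; the keywords/ and noise/ folder names and the exact -3 dB / -8 dB targets are just example values matching the levels above:

#!/bin/bash
# Normalise keyword samples to -3 dB and noise samples to -8 dB
# (i.e. 5 dB below the keyword level) before any mixing.
# "keywords/" and "noise/" are hypothetical folder names.

mkdir -p keywords-norm noise-norm

for f in keywords/*.wav
do
    sox "$f" "keywords-norm/$(basename "$f")" norm -3
done

for f in noise/*.wav
do
    sox "$f" "noise-norm/$(basename "$f")" norm -8
done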

I’m just saying it worked for me just plainly mixing them. If you normalize all samples, then the input audio would probably have to be normalized too when using the model. I actually use samples recorded at several distances with several microphones, which are not normalized at all.

The input is normalized, and yes it matters, as “garbage in, garbage out” is an old adage with models.

I just had a look, and the noise function for Precise looks pretty robust; I suggest you use that rather than the manner you have just described.


The problem with the Precise add-noise tool is that it doesn’t give a good spread of noise from your noise data when you have a lot of it. I’d rather add normalizing to the bash script.
Feel free to adapt the script and post how you think it should be.
So you would recommend also normalizing the clean sample set beforehand?
So for example normalize wake words to -10 dB and added noise to -15 dB?

You have to normalise to a set value so you have a datum to work from and know what levels you are mixing at.
Otherwise you just haven’t got a clue what their ratio is.
What do you mean it doesn’t give a good spread of noise from your noise data?
I haven’t used it, but is it not called automatically on train, as I thought it might be, or do you have to call it yourself?

-if, --inflation-factor int (default: 1)
    The number of noisy samples generated per single source sample
-nl, --noise-ratio-low float (default: 0.0)
    Minimum random ratio of noise to sample. 1.0 is all noise, no sample sound
-nh, --noise-ratio-high float (default: 0.4)
    Maximum random ratio of noise to sample. 1.0 is all noise, no sample sound

So at a guess, from a glance at the code, it’s dependent on how many noise samples you have and the inflation factor? Dunno, I haven’t used it.
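For anyone trying it, a hedged example using only the flags quoted above; I haven’t verified the positional arguments (sample folder, noise folder, output folder here are an assumption), so check precise-add-noise --help first:

# Hypothetical invocation; the positional argument order is my assumption.
# -if 3: three noisy copies per sample; noise ratio drawn from [0.1, 0.3].
precise-add-noise -if 3 -nl 0.1 -nh 0.3 wakewords/ noise/ noisy-out/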

But seriously, what you have been doing, mixing files 50:50 at blind dB levels, is just about the worst method of any, as it’s pure chance what your KW-to-noise SNR is.

It’s all a catch-22, but just adding more noise at higher levels isn’t going to make things better, and will likely make them worse.
What happens is that your KW image in the model becomes so blurred that the cross-entropy with non-keywords is likely to rise. That will produce a lower confidence level, and you could be doing the opposite of what you are trying to achieve.
Your KW always needs to be predominant, and clean samples should be the largest single block.
You should then split your noise into, say, 15/15/15 at 5 dB, 10 dB and 15 dB below the KW, with 55% clean, or somewhere around that (see the sketch below).
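A rough sketch of generating those levels with sox, assuming KW and noise files were already normalised to the same level; it creates one noisy copy per level per sample, so subsample the noisy copies afterwards to reach the 15/15/15/55 split:

#!/bin/bash
# For each keyword sample in the current folder, mix in a random noise
# file attenuated by 5, 10 and 15 dB relative to the keyword.
# Usage: ./addnoise-levels.sh /path/to/noise/folder

NOISEDIR="$1"

for f in *.wav
do
    for db in 5 10 15
    do
        NOISEFILE=$(find "$NOISEDIR" -type f | shuf -n 1)
        # Attenuate the noise by $db dB, then mix and trim to KW length.
        sox "$NOISEFILE" /tmp/noise_att.wav gain -"$db"
        sox -m "$f" /tmp/noise_att.wav "noise${db}db.$f" trim 0 "$(soxi -D "$f")"
    done
done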

You can do it your way, but I have noticed it only takes a small number of aberrations in your KW files to have a large effect on accuracy.

It will only use very little of the data, so the added noise will not be very diverse, as it comes from only a very limited number of the noise files.
That’s why I wrote the script above, which chooses the noise file randomly for each wake word file from the several thousand noise files I have.

What do you mean by this?

You can simply add a -v argument (for example 0.5) before the noise file in the sox command in the script to adjust the level of the noise file relative to the wake word file:

#!/bin/bash
# Same as addnoise.sh, but scales the noise input with -v before mixing.
# Usage: ./addnoise.sh /path/to/noise/folder

NOISEDIR="$1"

for f in *.wav
do
    # Pick one noise file at random from the noise folder.
    NOISEFILE=$(find "$NOISEDIR" -type f | shuf -n 1)

    # -v 0.5 halves the noise amplitude relative to the wake word sample.
    sox -m "$f" -v 0.5 "$NOISEFILE" "noise.$f" trim 0 "$(soxi -D "$f")"
done
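Note that -v is a linear amplitude factor, not a dB value: a factor v corresponds to 20*log10(v) dB, so -v 0.5 puts the noise roughly 6 dB below the wake word (assuming both files were at the same level to begin with).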

Yes, but as I said, when using only a clean set to train with Precise you will get a model which doesn’t react at all in noisy environments.

Look, go and start reading about cutting-edge KWS. I am not going to use Precise, but I have played with quite a few models, and what I am talking about is what I have seen from the likes of Google & Nvidia.

I have no idea why we have such a collection of crap KWS and methods when the cutting edge is open source and documented, but all I suggest is to follow their methods, not mine, Mycroft’s, or the dross we have.

There is just about every cutting-edge network in here, with results, example code and methods.

Check what they are doing, and also the scientific papers published.
They give you a headache but are extremely comprehensive; you seem to have some false assumptions about models and how things can work.

I agree, I’m not a general expert on keyword systems like you in any way. I don’t even claim to be intelligent enough to understand the machine learning parts in more than the broadest strokes.
My advice is simply what I found worked best for me to build models that are robust enough for daily use with Precise, and no other system. This advice comes from training over a hundred iterations on a few different models: trying what worked best, fine-tuning the Precise training settings, seeing what didn’t work at all, experimenting with what improved the outcome, and adjusting accordingly. That’s all. So it’s purely empirical and very focused on Precise alone, and within that, on what gave me the most versatile model in daily use through trial and error.
With the method described, I got a model which we now use 24/7, which in our household gives about one false positive an hour but at the same time lets me get a response even when the TV is running, with both the 2-mic and an electret per your method.
So that’s all I claim as my knowledge.
As I don’t understand what I’m doing and at the end of the day I am just another dumbass maker, take my experience with a grain of salt; I’m always happy to learn how I can improve my models further.

I am no expert either, but you wouldn’t mix a cocktail with the blind quantities you displayed when mixing noise into your KW model.

As I said, I actually listen to the outcome and sort out the files where the keyword is apparently overpowered. I will experiment with the -v volume factor to ensure that the wake word is always at the higher level. Maybe I’m just lucky with my set when it comes to Precise.