KWS: small, lean and just enough to be fit for purpose

@fastjack

Also, DeepSpeech KWS is single-threaded and on a Pi3 runs at less than 0.5x real-time.

However, as an authoritative server KWS it might be very useful, but yeah, KWS availability, accuracy and process load don't give much to choose from.
There are KWS options available, and the DeepSpeech one will eventually be released, so it's actually interesting: it allows some flexibility on accuracy, since being lightweight is also a big concern, and in many satellite situations a server can act as an authoritative KWS and override lower-tier errors.
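
Just to sketch that two-tier idea (the function names, threshold and server call are all made up for illustration):

    def two_tier_kws(light_score, audio, server_verify, threshold=0.6):
        # The satellite runs a small, cheap KWS; anything that clears its
        # threshold is re-checked by a heavier, authoritative server KWS
        if light_score < threshold:
            return False                # the satellite says no, stop here
        return server_verify(audio)     # the server model has the final say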

I am back on about what I have posted before.

https://blog.aspiresys.pl/technology/building-jarvis-nlp-hot-word-detection/

That much is already there, but I am playing with the simpler stuff.

Model creation is actually quite easy, as in the TensorFlow example it's done with one line of the input parser.

I don't know if it's intentional, but the TensorFlow example is extremely stinky in its choice and creation of spectrograms; whether intentional or not, it omits a one-line library call to librosa.
As the Aspiresys.pl blog shows:

    import librosa
    import numpy as np

    def extract_audio_features(self):
        audio, sample_rate = librosa.load(self.file_path, sr=self.sample_rate, res_type='kaiser_best')
        self.mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        # Collapse the time axis into a single 40-value feature vector
        self.mfccScaled = np.mean(self.mfcc.T, axis=0)

There is only a little bit more to it: spectrograms don't do well with noise, but at the creation of an MFCC you can raise the log-mel amplitudes to a suitable power (around 2 or 3) before taking the DCT. I was thinking this would have to happen as an audio expander filter, but actually it can be done right there.
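
As a rough sketch of what I mean with librosa (the function name, epsilon and sign-preserving boost are my own choices, not from the blog):

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def boosted_mfcc(path, sr=16000, n_mels=40, power=2.0):
        # Mel power spectrogram, then log-mel amplitudes
        audio, sr = librosa.load(path, sr=sr)
        mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
        log_mel = np.log(mel + 1e-6)
        # Raise to a power (sign preserved) so dominant features are
        # accentuated and low-level noise tends to drop out, then DCT
        boosted = np.sign(log_mel) * np.abs(log_mel) ** power
        return dct(boosted, axis=0, type=2, norm='ortho')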

So code-wise we already have the source; it just needs patching together, and then it's back to that single line of model creation:
default='yes,no,up,down,left,right,on,off,stop,go',

Models are often far too literal: there is basically the keyword, and further classification should split the rest by phonetic start and syllable count, which have a huge influence on an MFCC spectrogram, as you can deduce as much from the match with the other classifications as from the keyword itself.

Then we keep mentioning it's about the quality of the models and how we lack them, but actually in terms of datasets there is not really a shortage now, and that is all that models contain.

How you capture audio sets a window, and that is pretty consistent; with the audio tools we have available it is possible to recreate the same window around dataset items for use.

It's all here and it's not all programming, as there is quite a bit of 'filing' to be done, but given the current situation and lack of options, all we need is something that is fit for purpose.

I have only got a GTX780, but a model only takes a few hours; in fact I have another card on the way, as I am planning on knocking out a few models and sharing results.
It does take hours and there will probably be quite a few models, but it's not that huge an undertaking if shared by a few.
We just need to set some presentation standards for the labels used, accuracy and classification sensitivity, and I have a hunch we might get the start of a model that is fit for purpose, but also a great demonstration of much of what is involved, as this is just KWS and thankfully I don't have to think about ASR.

Have you ever used Google Colab?

That is a cloud service and I am not sure what that has to do with anything?
It sort of goes against why I want a private AI, and probably the same for many others.

As far as I've understood you're building an ML model with your graphics cards. In case it's not fast enough it's maybe worth a try. I'm using Google Colab a lot to work with NLP models at uni at the moment due to Covid.

Sure, it's not that private using this. It was just an idea. No offense.

Apologies Bozor, actually I agree with you, but that is called Cloud TPU; Colab is an actual service that, as far as I am aware, really sucks you into using Google services and APIs.

If it just comes to booking TensorFlow time per hour, yeah, no problem with that.
I have not been up for long and obviously haven't woken up yet.

It's actually about sharing the load: take Mozilla Common Voice, its organisation is awful and we should be putting these datasets into accessible databases, as the content they hold is immense.
I downloaded the 42 GB tar of Common Voice and there is so much in a single folder it just kills my 16 GB i7.
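
As an aside, a few lines of Python can at least shard that single folder into manageable chunks (just a sketch; the destination layout and prefix length are whatever suits you):

    import os, shutil

    def shard_folder(src, dst, prefix_len=2):
        # Spread a huge flat folder of clips into subfolders keyed on the
        # first couple of characters of each filename, so file managers
        # and scripts stop choking on one giant directory
        for name in os.listdir(src):
            bucket = os.path.join(dst, name[:prefix_len])
            os.makedirs(bucket, exist_ok=True)
            shutil.move(os.path.join(src, name), os.path.join(bucket, name))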

Maybe Colab, as I get where you were coming from now, but I am also thinking of some open space to share.
It just needs a simple agreement to do so, nothing more, and some web space.

If we get it then I can start using a database to provide queries for dataset selection, but the actual couple of hours to run a model that you are testing isn't that bad.
It's a couple of hours with a single GPU, and with relatively low-cost ones probably over an hour, maybe two.
Many can run a model and then go alien-kicking or whatever in a game of choice afterwards. :slight_smile:
I picked a GTX780 as it's a relatively low-cost, reasonably modern card for $60-80, but I guess many will have more than that available already.
The hindrance isn't running the model, it's creating models on data that differs from the capture you feed it.
That is central to the project and not just 'black box' sharing of completed models.

What we should be doing is windowing the datasets so they match our method of capture.
Then many can share and run models or use TPU cloud access if they wish.

If you look at the wav capture and gap spacing, datasets often differ in normalised volume, gap spacing and wav centring.
Often it's 16000 Hz, S16LE, 1 channel.
But the capture routines such as HermesAudio should also be able to parameterise dataset entries, along the lines of the sketch below.
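
Something like this with Sox is roughly what I mean; the rate, bit depth, window length and normalisation level are my assumptions here, not anything Rhasspy or HermesAudio actually enforces:

    import subprocess

    def window_clip(src, dst, seconds=1.0, norm_db=-3):
        # Force a dataset wav into the same window the capture would use:
        # 16 kHz, 16-bit, mono, normalised, padded/trimmed to a fixed length
        subprocess.run([
            "sox", src, "-r", "16000", "-b", "16", "-c", "1", dst,
            "norm", str(norm_db),
            "pad", "0", str(seconds),
            "trim", "0", str(seconds),
        ], check=True)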

PS: apologies, I dunno why but I completely got the wrong end of the stick. Actually we don't need to enforce collaboration, but the common data we have could really do with being in a better format than it is.

The VAD & Sox processing we use leaves a distinct signature on the capture that should be the same on the dataset.

Originally I thought 'keyword' / 'not keyword' would be the best way to classify, and if you apply only a single 'wanted-word' in the above that is what you will get:
'Keyword', 'Unknown' & 'Silence'.

I have been more interested in the 'Unknown' classification, in that it hardly ever changes substantially and leaves results based purely on the recognition score, which also changes little for newly recorded keyword wavs.
The reason is the recording window: the presentation of items within the dataset is often very similar, but not similar to newly recorded recognition keywords.
It's a pretty easy fix, as the datasets we have just need to be preprocessed with the audio capture tools we use.

This is what I am getting at as a project: there is a complete lack of audio_processing tools for recognition, and it's purely a black-box coding exercise in Python skills rather than the Voice AI project it supposedly is.

That there isn't an initiative or any work on providing a Rhasspy dataset speaks volumes about the reality of what is here.

With a project you have a set audio_processing capture and you know the exact methods and parameters, so you can massively increase accuracy by creating a dataset that has been processed by the project's own methods.
It will never be like a general-purpose commercial KWS that copes with all forms of input, but it doesn't have to be, because this is for Rhasspy.
You can quite easily match commercial KWS recognition levels by simply processing the dataset through the project's audio processors.

The only other thing is what counts as unknown, as there is no such thing with voice: it's purely a collection of phonetics and a syllable count, or at least to a large extent that is all an MFCC spectrogram is.
In English there are only 6 base phonetics.

It's not something we need to work out, as that is already done and documented. I have to admit I don't get the glottal column, as in English it would seem more of a 7th row, but hey :slight_smile:

The starting consonant and syllable count paint an extremely strong picture in a spectrogram, so with consonants there is no such thing as unknown.
Syllables often don't go past 3 (sure, there are 4-syllable words, but I am struggling to think of one), so with only 6 consonant classes, or 7 as I might argue, that is still only 18-21 classifications.
With 4 syllables that is 24-28, depending on how many consonant classes exist.

Using 'Unknown' to provide weight is absolutely pointless, but just like Google PageRank, where the weight of a returned page is more than just word occurrence and click count, the keyword's weight can be modified by the proximity of the surrounding classifications.

That isn't rocket science, and neither is a KWS specific to a project with specific audio processors.
It's extremely likely it could be done; it just isn't, and there isn't even an attempt.

Programming and creating audio processors without the extensions to manipulate datasets is a huge omission from the project.

Hey @rolyan_trauts!

I’ve been trying to train a CNN model (using Tensorflow.js) for both KWS and speaker identification (as I think this is the easier way to get both :wink: ) but without success.

I think my dataset is too small to achieve good enough accuracy. I've augmented it (added noise, normalization, tempo changes, etc.) but I cannot get the trained model to detect my keyword correctly.

Did you succeed in building a KWS model that works (even poorly)?

Yeah, but it will always work poorly, as the dataset wavs have an extremely consistent recorded format, so when presented with new recordings, or audio via Rhasspy, the recognition score drops.
You can test this by using a wav from the dataset vs one recorded yourself.

Even with the terrible spectrograms of the TensorFlow speech_commands example I get >0.85 / >0.65 recognition for a model wav / recorded wav respectively.
The spectrograms created in that example are pretty stinky, but what is even more stinky is the difference between dataset wavs and in-use wavs.

Really they all need to be fed through the Rhasspy audio processing and then collated into a dataset, otherwise it will always be bad.
But just off the bat, the KWS example in the TensorFlow GitHub works extremely well with the model wavs, though not as well with new recordings.

Wav (left) from model
left (score = 0.98248)
yes (score = 0.01425)
unknown (score = 0.00211)

Wav (left) not from model
left (score = 0.64037)
right (score = 0.26301)
unknown (score = 0.07466)

This is just doing what the tutorial says, vanilla, with stinky spectrograms and no audio processing of the dataset.

    python3 tensorflow/examples/speech_commands/freeze.py \
    --start_checkpoint=/tmp/speech_commands_train/conv.ckpt-18000 \
    --output_file=/tmp/my_frozen_graph.pb

    python3 tensorflow/examples/speech_commands/label_wav.py \
    --graph=/tmp/my_frozen_graph.pb \
    --labels=/tmp/speech_commands_train/conv_labels.txt \
    --wav=/tmp/speech_dataset/left/0a2b400e_nohash_0.wav

PS: the image data is MFCC, but I dunno why they just look really low-res and not detailed enough compared to other forms I have seen.
The above works OK with a single-syllable word, but should be even better with a multi-syllable keyword such as the audio concatenation of 'hey rhasspy' as an MFCC, or similar keywords.
I do think lumping many words together into 'unknown' is a bad idea, as it creates an unknown classification that is a blur of such variance that it's of little use.
I have a hunch you could do this by phonetics and syllable count, and use the complete return parameters of multiple categories to apply weight to the keyword score, which might not be an obvious standalone score:
Wav (left) from model
left (score = 0.98248)
yes (score = 0.01425)
unknown (score = 0.00211)
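
Just to illustrate that hunch (nothing more than a sketch; the function name and margin value are invented):

    def weighted_keyword_score(scores, keyword="left", margin=0.5):
        # Only accept the keyword if it clearly beats the best competing
        # classification, rather than trusting its standalone score
        kw = scores.get(keyword, 0.0)
        best_other = max(v for k, v in scores.items() if k != keyword)
        return kw if kw - best_other >= margin else 0.0

    # With the scores above: 0.98248 - 0.01425 easily clears the margin
    print(weighted_keyword_score({"left": 0.98248, "yes": 0.01425, "unknown": 0.00211}))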

The use of phonetics can again make things more accurate, but it also narrows scope, as in terms of dialect here we have Received Pronunciation or RP (TV English).

https://www.englishclub.com/pronunciation/phonemic-chart.htm

But based on how you label and classify your data, you can probably make it more accurate through the weights of the other classifications.
Maybe I have even got it the wrong way round about the MFCC images, and there is too much resolution, which creates difference.
Probably raising the log-mel amplitudes to a suitable power (around 2 or 3) before taking the DCT would help with pattern matching, but I am groaning when looking at the code.
I suppose in models.py it's easy enough, but my Python is pretty shoddy; I am wondering if it's as good as librosa, but it's certainly not as easy to read.

python3 tensorflow/examples/speech_commands/train.py --data_url '' --data_dir $HOME/speech-commands/speech_dataset --wanted_words "visual" --testing_percentage 30 --unknown_percentage 30

The dataset now has all wav files in subfolders of visual, OneSylable & TwoSylable.
'visual' was chosen because it was the only 3-syllable word; no phonetics have been used, just a split of the remaining words into one or two syllables, and the testing & unknown percentages have been upped to 30%. A rough sketch of the split is below.
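
Roughly how I did the split, as a sketch; the vowel-group syllable heuristic is crude and the folder names are just the ones above:

    import re, shutil
    from pathlib import Path

    def syllable_count(word):
        # Crude heuristic: count groups of vowels, good enough for sorting folders
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def split_dataset(root):
        # Move every word folder except the keyword under OneSylable/TwoSylable
        root = Path(root)
        keep = ("visual", "OneSylable", "TwoSylable", "_background_noise_")
        for folder in [d for d in root.iterdir() if d.is_dir() and d.name not in keep]:
            bucket = root / ("OneSylable" if syllable_count(folder.name) == 1 else "TwoSylable")
            bucket.mkdir(exist_ok=True)
            shutil.move(str(folder), str(bucket / folder.name))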

INFO:tensorflow:Step #400: rate 0.001000, accuracy 97.0%, cross entropy 0.072042
I0516 14:37:50.033674 140668330936128 train.py:260] Step #400: rate 0.001000, accuracy 97.0%, cross entropy 0.072042
INFO:tensorflow:Confusion Matrix:
 [[  8   0   6]
 [  0   0   0]
 [  2   0 137]]
I0516 14:37:51.808054 140668330936128 train.py:287] Confusion Matrix:
 [[  8   0   6]
 [  0   0   0]
 [  2   0 137]]
INFO:tensorflow:Step 400: Validation accuracy = 94.8% (N=153)
I0516 14:37:51.808295 140668330936128 train.py:288] Step 400: Validation accuracy = 94.8% (N=153)
INFO:tensorflow:Saving to "/tmp/speech_commands_train/conv.ckpt-400"
I0516 14:37:51.808451 140668330936128 train.py:296] Saving to "/tmp/speech_commands_train/conv.ckpt-400"

Classifying by the construction of the words provides far more correlation to the MFCC spectrogram, so the model accuracy is instantly high.
The 3-syllable 'visual' is extremely distinctive and is akin to the 'hey google' and 'alexa' keywords, which are also chosen for syllable complexity and phonetic uniqueness.

I will let that run and see how the end tests are.
The dataset was reorganised: 2 new folders were added (OneSylable, TwoSylable), then all the word folders were moved into the corresponding folder.
The testing percentage has been upped so this is going to take much longer. Generally the Google command set is specific to commands and probably not the best collection for KWS, but the default demo only used 10% of what was available, which is limited anyway.

I think this is how Snips trained their « Hey Snips » model. They trained a model to recognize « unknown », « Hey », « Sni » and « ps » sounds and used a specific post-processing algorithm to check if the correct succession of sounds was detected.

I have no knowledge of Snips; I have just been looking at and researching what an MFCC image looks like as an image.
So first, the sound and construction of words is the only logical structure for the time-based image of words that an MFCC creates.
But yeah, I think you can classify with high accuracy and then post-process against those classifications, in a sort of reverse Google PageRank, to provide further weight to recognition.

I think Mycroft Precise is a phonetic-splitting engine, so you can create your own combinations, but that increases load and reduces accuracy.
I have noticed with Precise that at times it will recognise 'Mycroft' without a 'Hey' as the KW, even though it shouldn't.
I think that is for the most part why Porcupine is so fast and light: it's a single-word trained model and not some post-processing system of concatenated phonetics.
It doesn't matter that as literal language 'hey snips' is two words; as a model word it's a single 2-syllable word if trained the way Porcupine models are.
Splitting into single words and then concatenation processing just requires heavier models and post-processing complexity.

I am going to run this, which will be just 'visual' vs 'unknown', then run with 'visual', 'OneSylable' and 'TwoSylable' as wanted words, to see whether, even though the words are distinct, the blur of everything in 'unknown' creates a nondescript and essentially useless datum for post-processing, which is just a scoring algorithm, aka page-rank.

Then also, if you are going to split phonetically, how many classifications and how much data do you need?
Common Voice, LibriVox, VoxCeleb and the Google command set all have the wavs in their datasets, but extracting them into words that are more applicable, to give even coverage per classification quota whilst maintaining fast and small models, is all a leap in the dark to me.
Also, in this example the MFCC format has me totally bemused, as the bin count for the MFCC is 40?! What's that then!? :slight_smile:

Also there is the trick to reduce background noise of raising the log-mel amplitudes to a suitable power (around 2 or 3) before taking the DCT.
I am not really sure where to do that in this code, but many advocate its use, as basically it's a bit like an audio expander: the predominant features are accentuated and what are likely artifacts tend to drop out of the MFCC image.

Rhasspy could definitely have its own KWS, but to stop its models being a black box we need to share and have some form of queryable dataset, taken from the huge collections that are freely available, but organised by word construction and sound rather than literal meaning.

Also, you can overtrain models; I am not sure what really happens, but accuracy seems to have this epoch target and going past it things seem to degrade.
I hit 100%/100% training and validation accuracy after a couple of thousand steps and now seem to be making things worse, but I will just let it run purely so it's like for like, and maybe read up on that another time.
Keras seems to have a function to cut training short at the best accuracy step; with the Google example you use TensorBoard, monitor it and set that yourself.
It seems that once you hit a level, training from that point has no gain and can even add error to the model.
Probably about 8000 pointless steps done on this very simple model.
I think you can load up any checkpoint and start again, or process from that step with --start_checkpoint.
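
For reference, the Keras version I was thinking of looks roughly like this; 'model', 'train_ds' and 'val_ds' are placeholders for whatever you are training, nothing from the speech_commands example:

    import tensorflow as tf

    def fit_with_early_stop(model, train_ds, val_ds):
        # Stop when validation accuracy stops improving and roll back
        # to the best weights seen, instead of training a fixed step count
        early_stop = tf.keras.callbacks.EarlyStopping(
            monitor="val_accuracy",
            patience=5,
            restore_best_weights=True,
        )
        return model.fit(train_ds, validation_data=val_ds,
                         epochs=100, callbacks=[early_stop])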

visual (score = 1.00000)
unknown (score = 0.00000)
silence (score = 0.00000)

With a wav from the training set it couldn't be more certain.

Same with my voice, as this is with various volume levels:

visual (score = 0.92848)
silence (score = 0.07152)
unknown (score = 0.00000)

But with audio processing and normalisation, the test on 5 recordings is 100%.

@fastjack

It's a bit of a pain in the proverbial, as it grabs labels from the bottom folder name, and copying the files is a problem: even though the names look like a UID or hash or something, there are many duplicates.
Probably tomorrow, but I think it's demonstrated how accurate a CNN can be if you categorise by MFCC sound image rather than literal words.
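
A few lines of Python would avoid the duplicate-name clash when flattening word folders into a single label folder (just a sketch; the paths and example folders are illustrative):

    import shutil
    from pathlib import Path

    def collate_label(word_folders, label_dir):
        # Flatten several word folders into one label folder, prefixing each
        # wav with its source word so duplicate hashes don't clash
        label_dir = Path(label_dir)
        label_dir.mkdir(parents=True, exist_ok=True)
        for folder in map(Path, word_folders):
            for wav in folder.glob("*.wav"):
                shutil.copy(wav, label_dir / f"{folder.name}_{wav.name}")

    # e.g. collate_label(["speech_dataset/marvin", "speech_dataset/seven"], "speech_dataset/TwoSylable")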

The model is here, plus some capture clips made after the model was created.
I am using 'visual' as it was there, but the clips show how the capture window must be the same as the model window, and that KWS is extremely easy and accurate, but the dataset wavs must be preprocessed by the project's audio_processor, otherwise you are just adding inaccuracy.
You have to have accurate capture windows in your dataset, otherwise the accuracy for similar words will suffer, or, like the majors, you select and fix it with a unique, non-similar KW choice.

There are always going to be words that are extremely similar (visible/visual), but also regions are not just countries, and there is generally an assumption of the 'capital' Received Pronunciation of a nation.
Models are really easy to create, the datasets are really accessible, and yet we have a Voice.AI project that doesn't process its own datasets.

The models and labels are in speech_commands_train.zip.

The model can be run; just edit the folder/file names.

    python tensorflow/examples/speech_commands/label_wav.py \
    --graph=/tmp/my_frozen_graph.pb \
    --labels=/tmp/speech_commands_train/conv_labels.txt \
    --wav=/tmp/speech_dataset/left/a5d485dc_nohash_0.wav

From the TensorFlow repo.

It's not about the models for a singular KWS, as that is no advantage for open source, nor is sharing models a primary function.
Sharing a dataset for a project is hugely important, as if that is done the models will come.
Strangely that is not done, or even talked about.
It's not really open source when your datasets are closed, as in terms of models the dataset is the code.

PS: there are some brilliant resources at https://openslr.org/83/

That one is important for me, being English, as it is split regionally and by gender.
Being a Northern Oink! http://www.openslr.org/resources/83/northern_english_male.zip is likely to produce much better results for me; the words are in sentences, but there are many of them.
But it's not just about being English, as regional dialect can have a huge effect on accuracy in all languages.

Are there any of you Python gurus who could run ASR, mark the audio with cut points on recognition, and split wav sentences into words?
It doesn't matter so much about words that cannot be split, as generally the datasets are extensive, and presumably numerous word sets could be gathered.

It would be great to have a database where you just run a query and get a dataset or datasets based on region, word, syllable, phonetics, pronunciation, language…
Getting the space and setting it up could be a huge task, and it would still benefit from a word extraction tool, as many datasets are sentences.

If you are only armed with Audacity, organising a dataset for a model is hugely labour-intensive, but some tools to normalise and set wav window parameters, rate and encoding are probably not much to code. Does anyone know if anything already exists?

PS: there is another model there which was just out of interest in how a keyword relates to what it is not, so it had the additional labels of onesylable & twosylable with corresponding word collections.
Testing with my audio_capture from above, the simple CNN, as long as the audio is normalised, manages to differentiate similar words such as visible and mythical extremely well.
Also, when normalised, it can tell they are 2-syllable words; not sure if those labels will be used, but they do seem to make the keyword detection more accurate.

I’m using Sox for audio data augmentation like normalization, adding noise, tempo changes, etc.

It is quite easy to use and pretty powerful.
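
For example, something along these lines; the gain, tempo and noise levels are arbitrary, just a sketch of calling Sox from Python:

    import subprocess

    def augment(src):
        # Normalise the clip to -3 dB
        subprocess.run(["sox", src, "norm.wav", "norm", "-3"], check=True)
        # Tempo changes without altering pitch
        subprocess.run(["sox", src, "fast.wav", "tempo", "1.1"], check=True)
        subprocess.run(["sox", src, "slow.wav", "tempo", "0.9"], check=True)
        # One second of low-level white noise at 16 kHz mono, mixed over the clip
        subprocess.run(["sox", "-n", "-r", "16000", "-c", "1", "noise.wav",
                        "synth", "1.0", "whitenoise", "vol", "0.02"], check=True)
        subprocess.run(["sox", "-m", src, "noise.wav", "noisy.wav"], check=True)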

Yeah, it's OK for taking out silence and stuff, and a great toolkit.

It's the ASR to create split points in the audio so that something like Sox could do the cutting.
Most datasets are spoken sentences where we have the text but not the split cue points.

If you take the Google command set there are 3000 clips for each word, and I don't fancy doing that manually via Audacity.
The datasets also allow you to queue up sentences that you know contain words, but even with Sox, without those split cue timings it's a no-go.
Unless you have got ideas how?

Deepspeech I love you!

deepspeech --json --model deepspeech-0.7.0-models.pbmm --scorer deepspeech-0.7.0-models.scorer --audio audio/2830-3980-0043.wav
Loading model from file deepspeech-0.7.0-models.pbmm
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.1-0-g2e9c281
Loaded model in 0.00822s.
Loading scorer from files deepspeech-0.7.0-models.scorer
Loaded scorer in 0.000137s.
Running inference.
{
  "transcripts": [
    {
      "confidence": -24.236307801812146,
      "words": [
        {
          "word": "experience",
          "start_time ": 0.66,
          "duration": 0.36
        },
        {
          "word": "proves",
          "start_time ": 1.06,
          "duration": 0.32
        },
        {
          "word": "this",
          "start_time ": 1.42,
          "duration": 0.16
        }
      ]
    },
    {
      "confidence": -34.029636880447235,
      "words": [
        {
          "word": "experience",
          "start_time ": 0.66,
          "duration": 0.36
        },
        {
          "word": "proves",
          "start_time ": 1.06,
          "duration": 0.32
        },
        {
          "word": "his",
          "start_time ": 1.42,
          "duration": 0.16
        }
      ]
    },
    {
      "confidence": -35.76723024350955,
      "words": [
        {
          "word": "experienced",
          "start_time ": 0.66,
          "duration": 0.4
        },
        {
          "word": "proves",
          "start_time ": 1.1,
          "duration": 0.3
        },
        {
          "word": "this",
          "start_time ": 1.44,
          "duration": 0.18
        }
      ]
    }
  ]
}
Inference took 0.879s for 1.975s audio file.
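
A rough sketch of how those timings could drive Sox to chop a sentence into word wavs (a little padding each side is my own guess; note the 'start_time ' key with the trailing space, exactly as printed above):

    import json, subprocess

    def cut_words(metadata_json, wav, pad=0.1):
        # Use the DeepSpeech --json word timings to slice a sentence wav
        # into one file per word with Sox
        words = json.loads(metadata_json)["transcripts"][0]["words"]
        for i, w in enumerate(words):
            start = max(0.0, w["start_time "] - pad)
            length = w["duration"] + 2 * pad
            subprocess.run(["sox", wav, f"{i:03d}_{w['word']}.wav",
                            "trim", str(start), str(length)], check=True)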

I still need to delete or omit the words DeepSpeech gets wrong, as I have been wondering if Professor Floyd had a chip on his shoulder.
"Professor Floyd pointed out that mass lesions are a possible cause for epileptic seizures"
was recognised as:
{
  "transcripts": [
    {
      "confidence": -100.5955242491676,
      "words": [
        {
          "word": "professor",
          "start_time ": 1.3,
          "duration": 0.38
        },
        {
          "word": "floyd",
          "start_time ": 1.72,
          "duration": 0.4
        },
        {
          "word": "pointed",
          "start_time ": 2.14,
          "duration": 0.36
        },
        {
          "word": "out",
          "start_time ": 2.52,
          "duration": 0.14
        },
        {
          "word": "that",
          "start_time ": 2.68,
          "duration": 0.16
        },
        {
          "word": "malaysians",
          "start_time ": 2.92,
          "duration": 0.92
        },
        {
          "word": "are",
          "start_time ": 3.9,
          "duration": 0.08
        },
        {
          "word": "a",
          "start_time ": 4.0,
          "duration": 0.1
        },
        {
          "word": "possible",
          "start_time ": 4.14,
          "duration": 0.38
        },
        {
          "word": "cause",
          "start_time ": 4.66,
          "duration": 0.36
        },
        {
          "word": "for",
          "start_time ": 5.06,
          "duration": 0.08
        },
        {
          "word": "epileptic",
          "start_time ": 5.18,
          "duration": 0.58
        },
        {
          "word": "seizures",
          "start_time ": 5.92,
          "duration": 0.4
        }
      ]
    }
  ]
}

It's dangerous stuff, this recognition lark! :slight_smile:

I’ve stumbled upon an article detailing how Snips handled the “personal” wake word detection with only a few samples:

This was a pretty interesting read…

@synesthesiam Maybe something similar could be implemented as a Rhasspy service? It should allow both custom wake word detection and speaker identification…
