New wakeword system

I hope you don’t, as an average streaming keyword spotting accuracy of 85.2% is extremely poor when state of the art can garner +98%.
The only thing that paper proves is that general catch-all models for KW give poor results, and I don’t see why you would embark on something that already publishes a best of 85.2%.
A KW essentially has no language as it’s purely an audio sample you are trying to detect in a time frame, and if your best accuracy is 85.2% with noise it’s downhill from there.
The above paper is not multilingual; it just uses multilingual datasets to provide transfer-learning weights covering speaker intonation, whether that is squeaky helium voices or any other intonation.

Transfer learning is about reusing the huge training runs of large corpora to tailor for variation, and it is generally inferior to specifically trained models.
KW models and datasets are relatively small in comparison to proper linguistic models for ASR or TTS, and it just seems strange to use methods that publish, off the bat, what is nowadays for KWS an absolutely huge 13% accuracy deficit, whilst state of the art battles for fractional improvements.

Transfer learning will help with the spectral differences of different hardware and environments, as each audio hardware setup has its own signature that on analysis can give much variation in MFCC/spectrogram features.
Again it will always be inferior to dictating the hardware, where the spectral signature of capture is always the same, as in commercial smart speakers.
I have always felt the bring-your-own-microphone approach of Rhasspy was pure naivety, as dictating and providing for specified hardware is so much easier than building a system to deal with the unknown.

The paper provided is aiming for “our longer-term efforts will target deployment on low-cost, memory-constrained, power-efficient microcontrollers to enable always-on KWS support”, but both Rhasspy & Mycroft are not microcontroller based; standalone or master/satellite, there is a much more capable SoC that can actually do model training.
A Pi4 might take 2 days to train a KWS model, but if you think how much time a smart AI spends idle then specific tailored models updated every week are no big problem, and the storage available can easily contain the datasets.
But that isn’t even needed, as a training checkpoint can be saved so you can incrementally add specific captures to the dataset and run additional epochs of a much smaller run.
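As a loose illustration of that incremental idea (a generic Keras sketch, not the kws_streaming training loop; the checkpoint path and feature shape are made up):

import numpy as np
import tensorflow as tf

# Reload a previously saved model and continue training for a few extra
# epochs on newly captured clips, then save the checkpoint again.
model = tf.keras.models.load_model("kws_checkpoint")  # hypothetical saved model

# Placeholder batch standing in for locally captured, already-featurised audio
# (e.g. 49 frames x 40 mel bins) with labels kw/unknown/noise.
new_x = np.random.rand(64, 49, 40).astype("float32")
new_y = np.random.randint(0, 3, size=(64,))

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy")
model.fit(new_x, new_y, epochs=5, batch_size=16)
model.save("kws_checkpoint")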

If you look at the speech patterns of smart-AI interaction, it is relatively easy to classify positive and negative results of use, and that audio has already been captured.
State-of-the-art KWS will likely use state-of-the-art models and employ a much simpler programmatic system for self-training, extracting its own dataset and maybe even jettisoning the original, or keeping only part of it.

There is a paper about how some of the best KWS models are affected by noise: because generalisation is the adoption of noise (variation), they become less effective at recognising the KW.


I did find this IEEE Xplore paper (IEEE Xplore full-text PDF), and it does something very similar but supposedly at 98.5% accuracy rather than the relatively terrible 85.2%.

I think this is a GitHub version: https://github.com/phanxuanphucnd/wav2kws


@synesthesiam I have started some groundwork for my Project Ears app and am toying with ideas, KWS and datasets, and going back to transfer learning.
From what I have seen, such as with On-Device Training, the less you transfer the more accurate it is. Looking at datasets, some labels have huge numbers of items; now we have Mlcommons, but even that only has a few entries for many labels.
A good KW is as phonetically complex and unique as it can be in its time segment, so the ‘Hey’ & ‘OK’ prefixes plus something such as Hey Google or OK Computer make quite good 1-second KWs.
I wonder if you could create a range of base models that form a phonetic model zoo, and for words with lesser counts create a smaller model to be used in transfer learning, so we are transfer-learning a shorter distance away from the main model and therefore retaining better accuracy?
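Something along these lines is what I mean, a rough Keras sketch of transferring a phonetic base model a short distance to a new KW (the base model name, feature shape and the assumption that the penultimate layer outputs a flat feature vector are all placeholders):

import numpy as np
import tensorflow as tf

# Load a data-rich 'sounds like' base model from a hypothetical zoo,
# freeze everything except a new classifier head, and retrain the head
# on the much smaller target-word dataset.
base = tf.keras.models.load_model("base_hey_google.h5")   # hypothetical zoo model

for layer in base.layers[:-1]:
    layer.trainable = False                                # keep the learned features

# assumes the penultimate layer outputs a flat feature vector
head = tf.keras.layers.Dense(3, activation="softmax", name="new_head")(
    base.layers[-2].output)
model = tf.keras.Model(base.input, head)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# placeholder target-word batch (kw / unknown / noise)
x = np.random.rand(128, 49, 40).astype("float32")
y = np.random.randint(0, 3, size=(128,))
model.fit(x, y, epochs=10, batch_size=16)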

What would constitute “base models” here? Would these be models trained on wake words that we have a good amount of public data available for, or do you mean phonetically diverse words from the Multilingual Spoken Words Corpus?

We have always needed a model zoo, but yeah, a sort of phonetic base; if you look at Mlcommons, hopefully they will continue to add more unique words and other languages, as a lot of words have turned up of late.
The transfer learning I have seen always seems to show that the further you try to transfer, the greater the losses, but it’s quite likely any keyword can be transferred from a ‘sounds like’, and each time you do you just add to the model zoo.

If someone wants Hey Maria and Hey Career, then due to the word quantity available it would likely make a good transfer model, and both Tensorflow and Pytorch have various frameworks for that.

To be honest I have never understood why we haven’t had a model or dataset zoo with the emphasis coming from you; it has always been curious to me why you have never said “hey guys, if you have time send me ‘Heys’ & ‘Rhasspys’”.
It’s never been models that are the problem, it’s datasets, but now transfer learning is becoming a thing; as long as it retains minimal accuracy loss, it’s likely a phonetic model zoo could be a base for transfer learning for quick turnaround on models.

The Mlcommons thing has just got me interested again, as now I have a multi-voice dataset to go at and am planning back through much of what I forgot. For me it was only ever about a single high-performance model, as I couldn’t care less about accessorizing with bespoke KWs; it just has to be accurate and noise tolerant.
I would pick a base model with the highest accuracy, but thinking about it, yeah, you could use transfer learning and also supplement with on-device training if you wish, but you need the checkpoint and training signatures, so it’s time to start training some models, purely based on dataset availability, that you can transfer-learn from.

@synesthesiam https://drive.google.com/file/d/1BRl2QkYlSkgeypCvFpLgmOnJLJUcSjNb/view?usp=sharing

That is what I settled on; there is a small low-epoch model and a small database file in there, as I have been pulling my hair out over somehow corrupting files.
I did the above just as a sanity check, because after you have built a 200k-sample dataset over 10 times and are still scratching your head about what is corrupt, you need one.
I will give it a go tomorrow, but what I was trying to get was an unknown word list of roughly 1000, picked to give an even phonetic distribution, using a little script for the selection.
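The selection script itself isn’t attached, but the idea is roughly this (a toy sketch with a made-up lexicon; the real thing would read from the phonetic word table mentioned further down):

import random
from collections import Counter

# Hypothetical toy lexicon: word -> phones. In practice this would be loaded
# from the phonetic word table in the SQLite database.
lexicon = {
    "age": ["EY1", "JH"], "making": ["M", "EY1", "K", "IH0", "NG"],
    "morning": ["M", "AO1", "R", "N", "IH0", "NG"], "say": ["S", "EY1"],
    "window": ["W", "IH1", "N", "D", "OW0"], "ratio": ["R", "EY1", "SH", "IY0", "OW0"],
}

target = 4            # would be ~1000 for a real unknown list
phone_counts = Counter()
unknown_words = []

# Greedy pick: always take the candidate whose phones are currently least
# represented, so the unknown label ends up with a roughly even phone spread.
candidates = list(lexicon)
random.shuffle(candidates)
while candidates and len(unknown_words) < target:
    word = min(candidates, key=lambda w: sum(phone_counts[p] for p in lexicon[w]))
    unknown_words.append(word)
    phone_counts.update(lexicon[word])
    candidates.remove(word)

print(unknown_words)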

Every time I was ready to click train, Tensorflow would tell me there is a problem with the wav loader, but not which wav, and it takes ages to run tests on a dataset that size; I have tried and failed to find what or where.
But yeah, eventually, after adding a softmax to the model, I went with a running avg:

import tensorflow as tf
import sounddevice as sd
import numpy as np
import threading


def moving_average(a, n=3):
    # n-point moving average; here n equals the window length, so it is
    # simply the mean of the whole window of per-frame softmax scores.
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n


def sd_callback(rec, frames, time, status):
    global input_details1
    global output_details1
    global inputs1
    global window
    global window_size
    global window_count
    
    # Notify if errors
    if status:
        print('Error:', status)
    
    rec = np.reshape(rec, (1, 320))
    
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    # Append the kw softmax score to a circular 50-frame (1 s) buffer and
    # trigger on the running average of the window rather than a single frame.
    window[window_count] = output_data[0][0]
    window_count += 1
    if window_count > window_size - 1:
      window_count = 0
    kw_hit = moving_average(window, 50)
    if kw_hit[0] > 0.5:
      print("Hey Marvin")
      window.fill(0)
        
# Parameters
rec_duration = 0.020
sample_rate = 16000
num_channels = 1
sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')
window = np.zeros(50)
window_size = 50
window_count = 0

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="stream_state_external.tflite")
interpreter1.allocate_tensors()
# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()
inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    
# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    threading.Event().wait()

Oh, as it was a sanity check the dataset is just norm -0.1, so it probably needs a good signal level, as I didn’t bother to augment volumes.

https://drive.google.com/file/d/1vpkmpPQYvSoVdRv7SqMpXpZHctCa2HPe/view?usp=sharing

As per usual I know, or at least think I know, exactly what I want to do and have wandered into distraction avenue.
Mlcommons is great but boy there is a lot of dross in there, and attempts to clean it have had me wandering away from the purpose.
So I will post what I have that may be of use, as dealing with a dataset of that size is very intensive; here is the word and file list in a SQLite database, with a matching phonetic word table.
Also as txt files.

I have always felt throwing random words into #unknown# is exactly that, random, and likely to cause random results.
Also, a model has no concept of what a word is, and the model I chose, with 20 ms frames (x50) in a 1-second window, uses the whole window, so it is pointless copying in small words where there is more silence padding than KW-length speech, which just skews the probability.

My idea is to create a collection of KW-length samples roughly matching the KW on phonetics and phone count.
A model is just a graph, and what you are doing is creating known spectral variations of spoken words for the !kw #unknown# label, as opposed to the voice-free !kw label of #noise#.

So, on analysis with a database, ‘hey’ on a simple selection gives me:

age	'eI	2
aid	'eI	2
ate	'eI	2
eight	'eI	2
bay	'eI	2
day	'eI	2
gay	'eI	2
hey	'eI	2
lay	'eI	2
may	'eI	2
pay	'eI	2
ray	'eI	2
say	'eI	2
they	'eI	2
way	'eI	2

Marvin

machine	m	n
making	m	N
martin	m	n
meaning	m	N
meeting	m	N
million	m	n
mining	m	N
missing	m	N
mission	m	n
modern	m	n
morning	m	N
motion	m	n
moving	m	N
machines	m	n
martians	m	n
merchant	m	n
millions	m	n
moment	m	n
moments	m	n
marvin	m	n

and so on …
To create a KW matrix to concatenate into a phonetic ‘hey marvin’ form, where the database selection is on the phone-occurrence column and the max number of phones, plus some interpretation of sounds-like phones.

There is far too much junk in Mlcommons, as it’s absolutely packed with dross, but that is no great matter as the labels often have huge quantities to pick from, and when creating a full-length KW you quickly gain large quantities (n²) due to the word-matrix ‘hey marvin’ concatenation, as sketched below.
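The concatenation itself is nothing fancy, roughly a pairing of every ‘hey’-like clip with every ‘marvin’-like clip via sox (directories here are hypothetical):

import itertools
import subprocess
from pathlib import Path

hey_like = sorted(Path("soundslike/hey").glob("*.wav"))       # hypothetical dirs
marvin_like = sorted(Path("soundslike/marvin").glob("*.wav"))
out_dir = Path("soundslike/hey_marvin")
out_dir.mkdir(parents=True, exist_ok=True)

# Every pairing of the two word lists, which is where the n-squared growth
# comes from: n 'hey'-like clips x m 'marvin'-like clips -> n*m samples.
for i, (a, b) in enumerate(itertools.product(hey_like, marvin_like)):
    out = out_dir / f"{i:06d}_{a.stem}_{b.stem}.wav"
    subprocess.run(["sox", str(a), str(b), str(out)], check=True)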

I really need to clean up the dataset, and it is that which has led me down distraction avenue, as what I currently have goes back to random because the actual wav files don’t say what it says on the tin.
Maybe feed it through some sort of ASR, as it doesn’t need to be perfect, but so much is so bad I will never know if it’s the model, the dataset or just the shoddy wav files themselves.

A streaming VAD based on a much shorter window is probably much easier, as it’s just single-phone selection, and maybe I will take a break and do that instead.
@synesthesiam have you got any ideas on how to further clean up the Mlcommons dataset?
As from ‘age moments’ to ‘gay martians’ it’s almost very close, just in need of pruning.
What you are doing is surrounding the KW with another classification and forcing the model to focus specifically on those differences, and when I get a dataset of some worth I will give it a go, as I’m still not sure if that alone will suffice or whether it’s additional to the usual random #unknown#.
If you have a label that is close but not the KW, it forces the model to train hard, whilst an ‘unknown’ of huge variation far from the KW makes early accuracy very high and garners less training.

@rolyan_trauts I would be interested in your progress on creating a new KWS system. I’m currently not so satisfied with Porcupine but I couldn’t find a better system yet…

Regarding your approach you mentioned here, I think you might be interested in a somewhat similar approach I tried some time ago. Instead of collecting phonetically similar words, I just predicted the (greedy) text with a small STT model and compared it with the textual keyword by calculating character differences (CER). I found this was working really well for some phrases like “I don’t know”, but didn’t work for most other words. So your phoneme based approach might achieve better results …
You can find the implementation here: Files · scrimo_ww · DANBER / Jaco-Satellite · GitLab

You might also be interested in this paper: https://arxiv.org/pdf/1811.10736.pdf
I think it’s quite an interesting approach because you can get personalized keywords with only a few self-recorded examples, which should also solve detection problems when the user has a foreign accent. I already did a short test (Files · qryex_ww · DANBER / Jaco-Satellite · GitLab, basically the last commit in this branch), but I had some problems with the audio windows not matching well to the moment the keyword is spoken. Maybe adding a VAD model could help here. I didn’t have enough time to continue it further.

Regarding the paper you linked above, the problem I see there is that they use a quite large model. But like the authors, I would say that pretraining the KWS model on larger STT datasets will benefit recognition accuracy.


I will look, but I am targeting supplementary on-device training, and it’s hard to convert a CTC error rate to what is usually quoted for KW. Getting the actual user and hardware in use into the dataset is what I am aiming at, and that’s quite easy with Tensorflow, but with the above I am quite unsure how to implement it.

I am doing prep work for ‘Project Ears’, which is just a simple interoperable KWS system that will work with any ASR; the server side either sits on the ASR host or is standalone, and ‘on-device training’ is a bit of a misnomer, as the server is the device and trained models are shipped OTA to simple ‘Ears’, which are just KWS clients.
The models are not big; only the original master model comes from a large dataset, and it is loaded and shifted by the weights of locally trained models.
I only know how to do that with standard TFLite models, but I also aim to support both Pi & microcontroller, as I have a hunch I can squeeze this down to an ESP32-S3. Recently, though, I have been thinking the Pi Zero 2 is such a good price that the couple of $ saved is not worth the horsepower loss, but I am keeping ESP open, as those ESP32-S3 will likely eventually land around the $5 mark.

I am using a Google framework, so the models are already done for me: google-research/kws_streaming at master · google-research/google-research · GitHub

I just think some basic assumptions about creating datasets are misguided due to the way the vector graph and softmax work.

(image: the softmax formula)

The maths looks confusing, but the correlation between labels is quite simple, especially with a low-label-count KWS: when you create an ‘all-in’ ‘unknown’ it is a seesaw, because as you move away from the KW by adding !KW you automatically inflate the softmax score of anything the other labels don’t fit (a quick worked example follows below).
I think we have always had the tools and models to create good KW, but the methods of use, for me, after much trial and thought, are slightly suspect.
Google already provide Tensorflow and even publish a ready-to-go Python framework to create models; I guess it’s understandable they stop there, but you shouldn’t copy what is a benchmark-dataset methodology when building a working, accurate KWS dataset.
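A quick worked example of that seesaw, with plain numpy and made-up scores:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# 3-label model (kw, unknown, noise): the input fits none of the labels well,
# but kw is the least-bad fit, so softmax still fires high.
print(softmax([1.0, -2.0, -2.0]))        # kw ends up around 0.91

# Add a 'sounds like' label that fits the same input almost as well as kw
# and the kw score collapses to something more honest.
print(softmax([1.0, 0.8, -2.0, -2.0]))   # kw drops to roughly 0.52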

What I am saying is that any and all vector models are the same and are built with simply too few labels. Not only do the other labels act as a sink for !kw; due to the seesaw of softmax, when their score rises it has a reciprocal effect on the KW, so by clustering more labels you create more relevance in the softmax, which otherwise often swings to max because the other labels fit the current input so poorly rather than because the KW actually matches, as KWS models are often shy of the K (number of classes) in the above needed to classify accurately… Which is a fairly easy fix, as I am aiming to add a singular ‘similar’ label to the existing kw, unknown & noise, one with much less variance than unknown but wider than the KW, and see how some recipes work. But boy, a 1050ti ain’t great, and hacking thousands of sound files with Sox is a nightmare though…

Accessorizing KWS with personalized KWs is nice to have but secondary to accuracy, and I know accuracy can be much better by adding more labels to stop the near-binary switch of ‘if it isn’t this label it must be that, as an output must be made’.
Once a single model is done it can be shared, and the best accuracy comes from creating models for a specific language, tailored to fit, or from using the usual standard English model that gains more personal accuracy through ‘on-device training’.
Training a CTC over all the phones, rather than the single audio sample of a vector model, is far beyond what me and my GTX1050ti are capable of, but with some frustration it is possible with the simpler vector models.

Also, even though it is large for certain words, the quality of Mlcommons is not great, so I sort of need to build models to filter my dataset and create a base dataset to work with, but at least for the first time there is quantity.
I have been wondering about taking a different tack and building 2x models to make dataset manipulation easier, with Hey & Marvin as two separate models; the smaller single-phone ‘Hey’ model would also be a reasonably accurate VAD, so we’d be shedding load and getting a model-based VAD at the cost of some training flexibility. Again probably distraction avenue, but I have that one to try as well.

The hardest part is automating silence stripping of wav files, and after many different tries I have just gone for brute repetition at increasing strength on the ones still longer than the required window.

for f in *.wav; do
  # First, gentle pass: trim leading then trailing silence (via reverse)
  cp "$f" /tmp/sil1.wav
  sox /tmp/sil1.wav -p  silence 1 0.1 0.0025% reverse | sox -p "$f"  silence 1 0.01  0.005% reverse

  # Retry with progressively stronger thresholds while the clip is still
  # longer than 13107/32768 ~= 0.4 s
  for t in "0.005 0.01" "0.01 0.02" "0.02 0.04" "0.06 0.1" "0.15 0.25" "0.5 0.75" "1.0 1.5" "3.0 5.0"; do
    read -r start_th end_th <<< "$t"
    cp "$f" /tmp/sil1.wav
    dur=$(soxi -D /tmp/sil1.wav )
    dur=$(echo "scale=0; $dur * 32768 /1" | bc)
    if [ $dur -gt 13107 ]; then
      sox /tmp/sil1.wav -p  silence 1 0.1 ${start_th}% reverse | sox -p "$f"  silence 1 0.1  ${end_th}% reverse
    fi
  done

  # Discard clips that are still too long, or have been trimmed too short (< 8192/32768 = 0.25 s)
  cp "$f" /tmp/sil1.wav
  dur=$(soxi -D /tmp/sil1.wav )
  dur=$(echo "scale=0; $dur * 32768 /1" | bc)
  if [ $dur -gt 13107 ]; then
    rm "$f"
  fi
  if [ $dur -lt 8192 ]; then
    rm "$f"
  fi
done

It’s the only way I can find without killing so many files, but I am slowly getting scripts and methods together to automate much of this.
With nothing left to do for a model apart from editing the param script for the Google-research KWS benchmark framework:

$CMD_TRAIN \
--data_url '' \
--data_dir /opt/tensorflow/data \
--train_dir $MODELS_PATH/crnn_state/ \
--split_data 0 \
--wanted_words heymarvin,unk,noise \
--resample 0.0 \
--volume_resample 0.0 \
--background_volume 0.30 \
--clip_duration_ms 1000 \
--mel_upper_edge_hertz 7600 \
--how_many_training_steps 10000,10000,10000,10000 \
--learning_rate 0.001,0.0005,0.0001,0.00002 \
--window_size_ms 40.0 \
--window_stride_ms 20.0 \
--mel_num_bins 40 \
--dct_num_features 20 \
--alsologtostderr \
--train 1 \
--lr_schedule 'exp' \
--use_spec_augment 1 \
--time_masks_number 2 \
--time_mask_max_size 10 \
--frequency_masks_number 2 \
--frequency_mask_max_size 5 \
--return_softmax 1 \
crnn \
--cnn_filters '16,16' \
--cnn_kernel_size '(3,3),(5,3)' \
--cnn_act "'relu','relu'" \
--cnn_dilation_rate '(1,1),(1,1)' \
--cnn_strides '(1,1),(1,1)' \
--gru_units 256 \
--return_sequences 0 \
--dropout1 0.1 \
--units1 '128,256' \
--act1 "'linear','relu'" \
--stateful 1

Then edit in the on-device training at the end of the training run.

PS Mlcommons is also a big problem, as wow, up to nearly 70% is wrong or unusable; some words are captured well but the majority are not.

I wonder what they used to extract the words. I believe the source dataset was Common Voice.

Yeah, and they tested against the Google Command Set using words that are also in the Common Voice single-word segment, which the MSWC classes contain nearly 80% of, and got accuracies as low as 60% with the added alignment captures; but I believe the model tests are a process of reverse snobbery, as the base of Common Voice is such high quality that a GSC KWS would reject it :slight_smile:

crispy-succotash did a better job and I abandoned that as a bad job.

I am in the usual bemused state :slight_smile: as I found this with ‘hey’: 78% of Mlcommons is the Common Voice single-word segment, as the file names match for the classes it had.
But checking ‘hey’, apart from padding and format…

Which is an even worse indictment of their aligner, as it only managed to get 80% of what were already single-word classified samples.
But they are mixed in, and those classes are approx 78% single-word segment with only 22% aligned additions, which they also happened to use as classes for the tests…

Since crispy I have always thought there are likely better and additional ways to strip out words, but still it’s a brute-force-and-reject job, with a later batch KWS doing post filtering, as ASR datasets are huge and eventually you can fill an even distribution of words.
https://github.com/petewarden/extract_loudest_section — the GSC guy was probably heading in the right direction, but it failed to compile for me as I always get a seg fault. Again banging on about phonetics and syllables, but it’s likely they have a common volume distribution that could create a pattern envelope for a variable extract_loudest_section and a much better reject algorithm.
But also, even though it can be reclaimed, they have stripped the dataset of crucial metadata.

[EDIT]
Nope, they ran a filter, so 60% is just about right from listening to an approximate good/bad split, and there is still the big problem that a language dataset like Common Voice was hugely skewed towards non-native speakers and on gender, but lacked sufficient metadata, as it was opt-in rather than enforced and many didn’t opt in.


OK, continued with the additional label of ‘sounds like’:
https://drive.google.com/file/d/12NW_vNgYFLs8OAnWOjngMUfel_VsI_38/view?usp=sharing
The model is just ‘hey’ and not a good test, as even Google gets this wrong with very close phonetic syllable words, and it seems about as good.
It’s part of the 2-model structure, as it was just easier than the hassle of joining words to create a single-KW model; the first softmax is used to create a product with the 2nd, and Marvin will be next.
Actually I will probably do this again with fewer heysound.txt words and drop to 3 syllables or less only, as I also included 4-syllable words in this because I thought I might otherwise get too close to the ‘Hey’ KW, but obviously not judging by the results, and this is just the process of getting a feel for early recipes.

TFL-stream.py displays and calculates a running avg on all labels so you can get a feel for what is happening rather than because it needs to; softmax in simplistic terms is a label’s score divided by the sum of all labels, so any softmax value is the result of all of them.
That is the problem with softmax, and especially 3-label models, as the score often isn’t representative of a good fit to the received KW, purely that its fit to ‘unknown’ and ‘noise’ is very low, so softmax returns a high value even though the KW fit could be low as well, just much higher than the other labels.
Which tends to result in a panic of many random single samples and becomes a completely random process of unknowns.
Hence an additional label of ‘soundslike’; the balance is to get near to the KW whilst still having difference, as this creates a far more real context for the KW softmax score, forces the model to concentrate on the much smaller difference from the soundslike label, and produces a much more linear training curve.
A 3-label model is a paradox, as it forces you to provide for the unknown, which you can’t really quantify or select samples for, and hence when a received input doesn’t fit unknown or noise it could quite likely return a false positive.
Not because the received word is a good fit, just that the argmax is very low on the other labels, so softmax zooms up to a high score.
With a ‘sounds like’ label that cannot happen, and it forces a better fit on the KW irrespective of the other labels’ argmax values.
This then allows us to concentrate on known values and gives the ability to have a defined dataset size with multiple samples.
KW is augmented up to what it says in splits.txt but came from an original 2k. Soundslike produced a matching word list; 100 samples of each of those words were picked and then augmented up to the KW level. Unknown was supposed to be 20 with no augmentation, but Mlcommons obviously shares words taken from different sentences, and that pushed it up to 30, as I have a hunch the same files are getting overwritten, which at the time was easier to live with than applying some naming scheme to stop it. Unknown is now not really unknown but a wide selection of words not containing KW or soundslike words, drawn with an even distribution from the same syllable count (‘hey’ being 1), which is why it’s capable of VAD, as they include all the phones, and the ones it doesn’t have are in kw & soundslike.
Noise is a collection of files split to KW length.
Then all files are prefixed with random numbers and volume-augmented randomly to x0.35-1 of their norm0 initial state, as the GoogleKWS framework mixes background noise into 80% of samples at 0.3 levels, so just above that.
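The volume augmentation is just a random sox gain per file, something like this (directory and naming are illustrative only):

import random
import subprocess
from pathlib import Path

# Scale each clip randomly to 0.35-1.0x of its norm0 level and prefix the
# output with a random number, sitting just above the 0.3 background mix level.
for wav in Path("dataset/heymarvin").glob("*.wav"):   # hypothetical directory
    gain = random.uniform(0.35, 1.0)
    out = wav.with_name(f"{random.randint(0, 999999):06d}_{wav.name}")
    subprocess.run(["sox", str(wav), str(out), "vol", f"{gain:.2f}"], check=True)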

I still have a problem with Mlcommons, as some classes are 20-30% bad, but hopefully they might continue to evolve; if you give the above tar a go you will see that even with that level of unwanted dross the model does well.

I might keep with a 2-model setup, as you will see the prefix model can very likely double as a VAD, rather than going through the hassle of joining up word samples for a single-model KW, though much of that choice was due to being distracted by how much bad data is in Mlcommons, which I should learn to just ignore.
The ‘sounds like’ addition will increase the accuracy of any of the leading models, and as usual I went for my fave, the CRNN, as it’s lite, relatively fast to train and still holds a very good accuracy score.

It’s all going to be part of Project Ears, my ASR-agnostic interoperable client/server KWS system, which will, when I get round to it, gain additional accuracy from on-device training on samples of use as part of normal operation.
I am only retraining for a closer match between KW and sounds-like, to get more range in the KW hit probability, as I have still to test whether that can be used as a metric in a single-zone distributed KWS array.

I really suggest you create a SQLite database, as it makes selection and working with large datasets so much easier.
Also, I never got round to adding an f32 0.0 tensor to clear the model input at the same time as clearing the running-avg numpy arrays on a KW hit, so it is still post-KW-hit sensitive, though actually not that much; doing that will also improve things.
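The reset would be something like this, against the TFL-stream.py code above, zeroing the external state tensors at the same time as the running-avg window:

import numpy as np

def reset_stream_state(inputs, input_details, window):
    # Zero the streaming state tensors (index 1...) that get fed back between
    # invocations, so the next inference starts from a clean state.
    for s in range(1, len(input_details)):
        inputs[s] = np.zeros(input_details[s]['shape'], dtype=np.float32)
    # And clear the running-average window so we are not post-hit sensitive.
    window.fill(0)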

So that’s it: just add another label so the model has a known comparison whose closeness forces the level of work, as trying to reference against unquantifiables such as noise and unknown is as daft as infinity and beyond.
We have always had the models to make accurate KW; it’s just that the datasets and layout were limited, creating many erroneous argmax/softmax results.

I will park that there, as Mlcommons is likely now a white elephant that will hinder KW dataset collection until the community around it becomes more honest about its worth. After downloading the 32 GB (En) you will realize 50% was silence, and what remains is a very skewed dataset in terms of word distribution, dominated by small 3/4-letter words that contain a large error quantity.
A large proportion is non-native speakers, which is great for non-native KWS, but there is poor age and gender distribution, plus the problem of going back to the original Common Voice dataset to get at the poorly distributed metadata it once contained.
I was never a fan of the Common Voice focus on bulk whilst seeming to have little criteria to enforce metadata and quality assurance, and that seems to play out in how often it’s quoted as a source dataset.
I get that it has quantity, but then you throw in the high error rate of the shorter words the dataset predominantly consists of; yeah, there are a few words that align much better than others, but wow, some are terrible.

It is one of those academic papers that is highly verbose, full of facts and figures, but just omits to say that generally the dataset is a turd and lacks any decent metadata.
The closest it gets is:

We note that the relatively low accuracy (60.4%) of the 5-target model trained on GSC
and tested on MSWC is due primarily to misclassifying target keywords as ’unknown’. This behavior
is likely due to the fact that the GSC model is trained only on high quality, manually recorded target
samples, and therefore is more likely to classify many samples in a wider distribution as ’unknown’.
Models trained on smaller curated KWS datasets may therefore be brittle in practice.

GSC is not good; it’s a benchmark dataset on which I often spend much frustrating time filtering the dross.
Everything is relative, but in comparison to, say, the datasets Dan Povey and crew offer on https://openslr.org/, it’s a pretty reasonable assertion to say Common Voice is inferior.

It does say:-

Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages

So maybe there is hope and this was just a 1st revision, but with Common Voice as the source I’m not sure how they can improve things.

‘Unknown’ and limited classes are probably not a good idea for KWS; likely this usage has been copy-and-pasted from the many multi-class examples out there, where the quantity of classes means softmax is less of a binary switch, as each one has a context in the overall result.
A pretty easy fix is the singular addition of a ‘sounds-like’ class that contains a distribution without drifting too far from the KW of interest; ‘unknown’ is still that catch-all dump, but the sounds-like creates a less binary class structure that, being phonetic and syllable based, at least has some known qualities.
That is still, given the word distribution and quantity of samples available, very much down to luck with the initial KW you have chosen; maybe use multiple ‘sounds-like’ classes on different parts of the phonetic syllable structure, but your choice of classes is purely creating a recipe.

Splitting a complex unique KW into 2 models wasn’t my initial idea, but augmenting large datasets starts to get a very complex process; it actually works quite well and does make things much simpler.

Maybe as a native En speaker I am being over-critical, but ignoring native speech and doing manual analysis, actually listening to and pruning a class, I can find little correlation with the WER quoted in the above paper, where some classes I would consider up to 70% problematic for KWS usage.

Disheartened really, as with the names and institutions quoted on the paper you would have thought maybe better… ?

https://drive.google.com/file/d/1i-jC-29d8jbhlkRNdv7XuhoGuT3zqGOS/view?usp=sharing

There is a Marvin in there to try with kws.py, which is probably something like what I will settle on.
The database is in there, even if I haven’t got it exactly right, and it shows how the classes were picked.
I have been trying a few recipes of multiple classes, and basically it seems keeping it simple is the best option: with short words such as ‘hey’ or ‘marvin’, maybe 2 ‘sounds like’ classes for the start and end syllables.
‘Marv’ being the start (M AA1 R) and then IH0 N / NG ending samples, making sure you don’t get the same words in each different class, including ‘Unknown’.
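For illustration, the class picking boils down to queries along these lines (the schema here, words(word, phones, phone_count), is an assumption and may not match the database in the tar):

import sqlite3

con = sqlite3.connect("mswc_words.db")   # hypothetical filename

# Words sharing the 'Marv' onset (M AA1 R ...) for one 'sounds like' class.
starts = [r[0] for r in con.execute(
    "SELECT word FROM words WHERE phones LIKE 'M AA1 R%' AND phone_count <= 3")]

# Words with the IH0 N / NG ending for the other class, excluding the onset
# matches so the same word never lands in two classes.
ends = [r[0] for r in con.execute(
    "SELECT word FROM words WHERE (phones LIKE '%IH0 N' OR phones LIKE '%NG') "
    "AND phones NOT LIKE 'M AA1 R%'")]

print(len(starts), len(ends))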

The only reliable way I have found to clean Mlcommons is to run inference with the model, remove the flagged files from the dataset and run again; ‘unk’, being such a wide class, is more problematic than the others, but tfl-non-stream.py is set up to read the dataset and log to a file, and you can also delete (a rough sketch of that pass is below).
SQLite is a great little database, but for design I have been swapping between LibreOffice Base, which I hate, and DB Browser for SQLite, which is great but lacks a design window.
Is there a good database front end for Linux? I also tried Kexi, which crashes quite often; DB Browser with a design window rather than plain SQL would be amazing, but sadly no, even though it’s good; swapping between LibreOffice Base and DB Browser is possible but a pain at times.
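The cleaning pass I mean is roughly this: run the non-streaming model over every file in a class and log whatever it rejects for deletion (a sketch, assuming the non-streaming model takes raw 1 s / 16 kHz audio and the label order used above; paths are hypothetical):

import numpy as np
import soundfile as sf
import tflite_runtime.interpreter as tflite
from pathlib import Path

labels = ["heymarvin", "unk", "noise"]                 # assumed label order
interpreter = tflite.Interpreter(model_path="non_stream.tflite")  # hypothetical
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

with open("reject.log", "w") as log:
    for wav in Path("dataset/heymarvin").glob("*.wav"):
        clip, sr = sf.read(wav, dtype="float32")
        audio = np.zeros(16000, dtype=np.float32)      # pad/trim to 1 s @ 16 kHz
        audio[:min(16000, len(clip))] = clip[:16000]
        interpreter.set_tensor(inp['index'], audio[np.newaxis, :])
        interpreter.invoke()
        scores = interpreter.get_tensor(out['index'])[0]
        if labels[int(np.argmax(scores))] != "heymarvin":
            log.write(f"{wav}\t{scores.tolist()}\n")   # candidate for removal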


You can bang both ‘Hey’ & ‘Marvin’ in together and have some sort of locking mechanism so one runs after the other, with a reset to start again.
A singular KW is a much better option but a far bigger undertaking in creating the dataset of combined words, which, if Mlcommons was a bit cleaner, I probably would have done, but I am not going to bother.
I did the 2x KW models just to get a feel for a basic recipe, and both KWs supplied have errors, both from my dataset choices and from those contained in Mlcommons, plus not very long training runs.
I also could have quantised the models and used a single MFCC feed to reduce load and memory size; again it was merely out of interest in what works best on the phonetic side of the dataset.
Same with the code: that’s just me being lazy, and I should get rid of all those globals and create a better structure.
Same with the augmentation to get class sizes equal to the 50k samples of ‘Unknown’: I used Echo, Reverb, Pitch & Tempo, but the Echo is probably a bad method, as it’s a much less likely effect than what real environments contain, and it was probably a bit too strong.
Also the parameters of the window sizes and probability threshold likely need some tweaking, as I never did test those very much.
I will give it a break for a while so I forget what I was doing and get a fresh perspective (it will not take long) and go for a more polished version.

# Streaming two-model KWS: a 'Hey' model (interpreter1) and a 'Marvin' model
# (interpreter2) run on the same 20 ms blocks; a hit on 'Hey' arms a timer and
# its score is multiplied with a subsequent 'Marvin' hit.
import tensorflow as tf
import sounddevice as sd
import numpy as np
import threading
import soundfile as sf
import time

def moving_average(a, n=3) :
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

def sd_callback(rec, frames, ftime, status):
    global input_details1
    global output_details1
    global inputs1
    global kw1_window
    global kw1_window_size
    global kw1_window_count
    global kw1_max
    global kw1_last
    global kw1_start_time
    global cap1_window_size
    global cap1  
    global input_details2
    global output_details2
    global inputs2
    global kw2_window
    global kw2_window_size
    global kw2_window_count
    global kw2_max
    global cap2_window_size
    global cap2
       
    # Notify if errors
    if status:
        print('Error:', status)
    
    rec = np.reshape(rec, (1, 320))
      
    cap1 = np.roll(cap1, -320)
    cap1[cap1_window_size - 320:cap1_window_size] = rec
    
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
    interpreter1.invoke()
    output_data1 = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])

    kw1 = output_data1[0][0]
    kw1_window[kw1_window_count] = kw1
    kw1_window_count += 1
    if kw1_window_count > kw1_window_size -1:
      kw1_window_count = 0
    kw1_hit = moving_average(kw1_window, kw1_window_size)
    if kw1_hit[0] > kw1_max:
      kw1_max = kw1_hit[0]
    if kw1_hit[0] > 0.65:
      if kw1_max > kw1_hit[0]:
        print("Hey", kw1_max)
        kw1_start_time = time.time()
        kw1_last = kw1_max
        kw1_max = 0
        kw1_window.fill(0)
        sf.write('kw2-hey.wav', cap2, sample_rate)
        #sf.write('kw1.wav', cap1, sample_rate)
    if time.time() - kw1_start_time > 1.4:
      kw1_last = 0
      
    cap2 = np.roll(cap2, -320)
    cap2[cap2_window_size - 320:cap2_window_size] = rec
      
    # Make prediction from model
    interpreter2.set_tensor(input_details2[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details2)):
      interpreter2.set_tensor(input_details2[s]['index'], inputs2[s])
    interpreter2.invoke()
    output_data2 = interpreter2.get_tensor(output_details2[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details2)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs2[s] = interpreter2.get_tensor(output_details2[s]['index'])

    kw2 = output_data2[0][0]        
    kw2_window[kw2_window_count] = kw2
    kw2_window_count += 1
    if kw2_window_count > kw2_window_size -1:
      kw2_window_count = 0
    kw2_hit = moving_average(kw2_window, kw2_window_size)
    if kw2_hit[0] > kw2_max:
      kw2_max = kw2_hit[0]
    if kw2_hit[0] > 0.65:
      if kw2_max > kw2_hit[0]:
        print("Marvin", kw2_max, kw1_last * kw2_max)
        kw2_max = 0
        kw2_window.fill(0)
        sf.write('kw2.wav', cap2, sample_rate)
                             
# Parameters
rec_duration = 0.020
sample_rate = 16000
num_channels = 1
sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')

kw1_window_size = 15
kw1_window = np.zeros(kw1_window_size)
kw1_window_count = 0
kw1_max = 0
kw1_last = 0
kw1_start_time = time.time()
cap1_window_size = 8000
cap1 = np.zeros(cap1_window_size)

kw2_window_size = 35
kw2_window = np.zeros(kw2_window_size)
kw2_window_count = 0
kw2_max = 0
cap2_window_size = 16000
cap2 = np.zeros(cap2_window_size)


# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="hey_state_external.tflite")
interpreter1.allocate_tensors()
interpreter2 = tf.lite.Interpreter(model_path="marvin_state_external.tflite")
interpreter2.allocate_tensors()
# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()
input_details2 = interpreter2.get_input_details()
output_details2 = interpreter2.get_output_details()
inputs1 = []
inputs2 = []
for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
for s in range(len(input_details2)):
  inputs2.append(np.zeros(input_details2[s]['shape'], dtype=np.float32))

print("Loaded")
    
# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    threading.Event().wait()

The models are just renamed versions of those previously posted above; I just added kw1_last and a timer to clear it as a simple method to sync the two.
https://drive.google.com/file/d/1GpLsf3N75XtbslfR8J4rMdtOGLH4GUdH/view?usp=sharing

Got round to doing a single kw ‘hey marvin’

https://drive.google.com/file/d/1EFT4T0sxyVo9EXWMh-V0BL4QWXAFfVlE/view?usp=sharing

This one splits the ‘sounds like’ into 4 labels and is a huge improvement on the previous one, as the 1st try of dumping all the phonetics into a single label had too much variance and ended up very much like just another ‘unknown’ label.

It’s a good example as it’s the same dataset, but the HH, EY1, MAA1, IH0 N/NG phonetics are split as ‘HH-MAA1’, ‘HH-IH0 N/NG’, ‘EY1-MAA1’, ‘EY1-IH0 N/NG’, so each phonetic label is close to ‘hey marvin’ without the variance a single label gives.

It was the same with the 2 KW models: they were more accurate with 2x ‘soundslike’ labels of focused phonetics than with a single all-in-one.
It creates a more complex dataset but seems to make no difference to load or memory usage.

I will tell you what’s weird: full TF, as in kw1.py, uses far more memory but is actually much lighter on load.
Maybe because TF=2.7.1 & TFLite=2.7.0 and the load issue has been fixed, so I need to compile the wheel rather than grab what is available on pip.

You will see load is about 40% with 200 MB used with full TF, and it actually jumps to 60% with TFLite but with only 90 MB used, which is a surprise?

Also, after not playing with KWS for a long time, it has been good to start with a fresh perspective, as MS wipes the slate clean.
The 2nd figure in a KW hit is the max volume of the KW, as playing with various hats/USB devices without AGC the volume is usually appalling, and I did some tests where I didn’t augment the volume and had all samples @ 0 dB.
Training goes perfectly as you are feeding it loud samples, but in use recognition will be littered with false positives.
Also, clipping creates harmonics all over the place and will flood your input with spectra, so the option to put volume to the max is a really bad idea, as it leaves no headroom to stop clipping.

You basically have to use AGC, be it hardware or software; Speex AGC works well, but its default of 8000 is far too much gain and probably needs to be nearer 2000, or even 1000.
Also, for me the attack/decay of Speex doesn’t seem to do as well as my hardware AGC, but the code is a very easy edit, and I do intend to give that a tweak and release it in https://github.com/StuartIanNaylor/Project-Ears when I start releasing into that repo.

Same with Speex AEC: even though I am running without AEC, I have a Google Nest Audio 0.5 m away running @ 50% playing music right now and have no problems with recognition.
I did have background noise mixed in @ 0.4 but dropped this to 0.3 so that the volume augmentation covers a wider range, as the augmentation does seem to have a huge effect on false positives, but I am still guessing at an optimum value.
Increasing the background-noise mix level seemed to do little for the overall ability to work in third-party noise.
I am using a https://www.scan.co.uk/products/enermax-ap001e-dreambass-usb-soundcard-plus-earphones-genie-with-integrated-80-hz-plus6-db-bass-boos with a uni-directional electret, and not what I consider (without audio DSP) awful MEMS mics; the AGC and chipset on that £5.99 USB sound card seem really excellent, and they make the Respeaker hats look pretty pointless, the 2mic being just god-awful confusing to set up with a myriad of settings that you are not exactly sure what they do or whether they work.

But if any of you are getting really bad recognition and false positives, have a look at the figures your KWS is actually getting and not what alsa or pulseaudio purport, as the reality can be very different.
Don’t set your input volume above 70%, as you will have little headroom and will likely be clipping incoming audio.
It is also likely that your volume at that level will be poor (< 0.1), and as I say, use AGC; it is absolutely essential, and even then it is very common for input volumes with AGC still to be poor. I have a gaming headset where I have to turn off hardware AGC, set approx 60-70% input volume and use Speex AGC.

Set up an /etc/asound.conf or ~/.asoundrc:

#pcm default to allow auto software plughw conversion
pcm.!default {
  type asym
  playback.pcm "play"
  capture.pcm "cap"
}

ctl.!default {
  type hw card 1
}
ctl.equal {
  type equal;
}
pcm.plugequal {
  type equal;
  slave.pcm "plughw:1,0";
}
pcm.equal {
  type plug;
  slave.pcm plugequal;
}

#pcm is plughw so auto software conversion can take place
#pcm hw: is direct and faster but likely will not support the sampling rate
pcm.play {
  type plug
  slave {
    pcm "plughw:1,0"
  }
}

#pcm is plughw so auto software conversion can take place
#pcm hw: is direct and faster but likely will not support the sampling rate
pcm.cap {
  type plug
  slave {
    pcm "plugequal"
    }
}

pcm.agc {
 type speex
 slave.pcm "cap"
 agc on
 agc_level 2000
 denoise off
}


#sudo apt-get install asound2-plugins
#will use lower load but poorer linear resampling otherwise
defaults.pcm.rate_converter "speexrate"

My hardware AGC is great, so my default input is "cap", but for Speex AGC just change the default capture.pcm to "agc", and don’t use denoise as it is just artifact city.

I will add modified code for Speex AGC (attack/decay) & Speex AEC (frame/window) to https://github.com/StuartIanNaylor/Project-Ears, and if I can get over my noob frustrations with C then I have a simple beamformer (delay-sum) to add that gives a fractional improvement, though it might become a ToDo. I would really like to get TF & the beamformer wrapped in C, as Python sucks so badly for audio DSP, so if there is a C guru who can lend a hand please do, as it will save me some hair loss.
The client/server audio can likely stay Python, as long as you are not doing something relatively daft like broadcasting raw audio over encrypted, ill-fitting protocols such as MQTT or websockets; with an audio codec such as Opus or AMR-WB it is pretty lightweight, with the bottleneck being the network and not the device.

Audio in is any RTP, but there will likely be a default install of https://github.com/badaix/snapcast

I may also extract the CRNN from the Google code; I don’t know whether to keep it as is, as it is a great framework, and just add the ‘on-device training’, or to write simpler CRNN-only model code. I guess we will see how much of a headache injecting the ‘on-device training’ functions is.
Because I have been testing a new phonetic method for model creation I am a bit burnt out running training and stuff, but I will get that done and start creating a model zoo with a choice of some KWs in that sort of ‘Hey Xxxxxxx’ form.
I haven’t got round to testing a cascading model, as I presume it is of little use with models from the same training run, but I will eventually.
Also the models above are 1st-offs, and really more care, longer training runs and a few tweaks are needed; the process of creating a single model is no problem at all, but when you have done as many as I have over the last few weeks you quickly get sick of it.

Also, as with the C, if there is anyone who wants to help by providing cleaner pythonic code than my hacks, please do, as I have a whole rake of audio things where I have a pretty clear idea of what to do and why.
Also, with the phonetic recipes, the word choice has been purely a guesstimate of what looks approximately right, and maybe fewer, closer phonetics might be a better choice; if anyone else fancies giving it a go please do, while I have a rest from training burnout.

I haven’t even got round to simple post-training quantization or training-aware quantization, so the current load is without any tflite optimization at all.

PS for tflite

import tflite_runtime.interpreter as tflite
interpreter1 = tflite.Interpreter(model_path="stream_state_external.tflite")

Last bit: you can drop the softmax and running avg, and drop a lot of load, and do what some seem to do and just count consecutive argmax hits up to a set level; but often, especially with noise, you might get part of the KW that favours another label, and then you are into how many false hits you ignore in a consecutive argmax run, and it starts getting nearly as complex.
The running-avg method does add load, but I cannot think of a better method that calculates the overall KW envelope while staying referential to the other label values the way softmax & running avg do; it also captures each KW, which, if you don’t intend to do ‘on-device training’, is also unnecessary load.
Currently they are just running all the time, where during broadcast it will likely just count ‘noise’ hits to swap from broadcast mode back to listening.
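For reference, the consecutive-argmax alternative is about this much code (kw assumed at index 0, with a small tolerance for stray frames; the thresholds are guesses):

import numpy as np

KW_INDEX = 0        # assumed position of the kw label in the model output
NEEDED_HITS = 20    # consecutive 20 ms frames required for a detection
TOLERANCE = 3       # stray frames allowed to favour another label

hits = 0
misses = 0

def on_frame(output_data):
    # Call once per 20 ms inference result instead of the running average.
    global hits, misses
    if int(np.argmax(output_data[0])) == KW_INDEX:
        hits += 1
        misses = 0
        if hits >= NEEDED_HITS:
            hits = 0
            return True          # keyword detected
    else:
        misses += 1
        if misses > TOLERANCE:   # too many off-label frames, reset the run
            hits = misses = 0
    return False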

Also, I had even forgotten, but looking at my asound.conf reminded me: I am running the alsa equalizer plugin, as I meant to add specific low & high pass filters but just used the EQ as a simple way to get a voice band-pass.

If you set it up as a control device it is just alsamixer -D equal; that might be why I am getting fairly good recognition with a Google Nest Audio @ 50% playing music, as much of the bass, which contains much of the energy of music, is filtered out.
It is another thing I guesstimated when testing, nothing empirical; in fact I haven’t even checked whether it makes a difference to input.
