Google KWS: TFLite flex delegates & custom layers

It's volume then, as I actually reduced the pitch range, since some samples sounded ‘Mickey Mouse’ and likely outside any norm.

Interesting, as I didn't lower it much and can go further; I just need to play with the params. Hopefully it wasn't noise, as another goal is to push noise augmentation to the max and make the model as noise resilient as possible. With KW spotting that should be much easier than ASR, and with KW at least you can ‘duck’ playback volumes, if not urban noise.

If I am reading right, on an i5-6600K CPU a 32 ms frame executes in 2.08 ms, which might mean it's possible, as that is roughly x15 real time (32 / 2.08 ≈ 15.4) and still leaves some scope. Whoops, still talking AEC.

PS Did you try a 64-bit version of RaspiOS, maybe?

PS There have been a number of updates to https://github.com/google-research/google-research/tree/master/kws_streaming

I really should switch, but for some reason it still fails with an error on custom datasets.
Just comment out part of line 260 so it looks like this:
for dir_name in dirs: #+ [du.BACKGROUND_NOISE_DIR_NAME]
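
For context, the unpatched line presumably appends the background-noise dir to the loop (an assumption read off the comment), and the patch just drops it:

    # original (fails on a custom dataset):
    # for dir_name in dirs + [du.BACKGROUND_NOISE_DIR_NAME]:
    # patched:
    for dir_name in dirs:  # + [du.BACKGROUND_NOISE_DIR_NAME]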

There is an error if you use TF 2.5, but 2.4.1 works fine.

Also, you should be able to record your own dataset, copy and paste it over mine, and train; we would then have a 2-person custom dataset that is likely to lose some individual accuracy but also be a bit more universal, and it should continue that way with each additional voice actor.

I tried to use your dataset builder to record my kw & !kw. Now the problem is that mix.py always gives “too many samples are oversize” (or undersize). Is it okay to just change the 1.25/0.75 ratio in the code?

Don't :slight_smile: I have been having a huge brainfart, but the NS/AEC was a good diversion and I went back to the original g-kws.

In Google-kws the default for silence & notkw is 10%, and it finally clicked where I had taken a detour: all my ‘improvements’ were slowly making false positives worse.

I will upload what I have just been doing, as I have gone back to scratch; I should not try to improve things and will leave that to someone with more talent than I.

Mix-b.py is a restart, as I had to make a second take and build it up to see where I was going wrong.
Basically you want to ‘overfit’ kw as much as possible and ‘underfit’ silence & notkw so they have a higher hit probability.

I will push it now.

For g-kws I just cut some elements out of tfl-stream to make the non-kw reset quicker:

    # model output order follows --wanted_words: 0 silence, 1 notkw, 2 kw
    if np.argmax(output_data[0]) == 2:
      if kw_count > 3:
        print(output_data[0][0], output_data[0][1], output_data[0][2], kw_count, kw_sum)
        if output_data[0][2] > kw_max:
          kw_max = output_data[0][2]
      kw_count += 1
      kw_sum = kw_sum + output_data[0][2]
      kw_avg = kw_sum / kw_count
      # kw_sum / kw_avg reduces to kw_count; the 45 divisor scales it to 0-1
      if (kw_sum / kw_avg) / 45 > 1:
        kw_probability = 1.0
      else:
        kw_probability = (kw_sum / kw_avg) / 45
      if kw_probability > 0.50:
        kw_hit = True
    else:  # argmax is silence or notkw: log any hit and reset the counters
      if kw_hit:
        print("Kw threshold hit", kw_max, kw_avg, kw_count, kw_probability)
        file_object.write("Kw threshold hit " + str(kw_max) + ' ' + str(kw_avg) + ' ' + str(kw_count) + ' ' + str(kw_probability) + '\n')
      kw_count = 0
      kw_sum = 0
      kw_hit = False
      kw_max = 0
      kw_probability = 0

You just have to do a couple of ‘close’ tests to set the probability divisor ‘45’ in the above.
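
In other words, something like this calibration (a sketch; the 45 is just whatever your own clean tests peak at):

    # run a few good close-mic keyword tests and note the peak value of
    # kw_sum / kw_avg (algebraically this is just kw_count, the number of
    # consecutive kw frames), then pick the next round number above it
    divisor = 45  # e.g. if clean tests peak around 42
    kw_probability = min((kw_sum / kw_avg) / divisor, 1.0)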

The step count only needs to be 800, and you can under-fit further by running 400 on each step (e.g. --how_many_training_steps 400,400,400,400), which also has the bonus of a quicker train.

I have split out a lot of the pysox commands, as for some reason, even though you should be able to do it all in a single build, I seemed to get weird results.

parser = argparse.ArgumentParser()
parser.add_argument('-b', '--background_dir', type=str, default='_background_noise_', help='background noise directory')
parser.add_argument('-r', '--rec_dir', type=str, default='rec', help='recorded samples directory')
parser.add_argument('-R', '--background_ratio', type=float, default=0.25, help='background ratio to foreground')
parser.add_argument('-d', '--background_duration', type=float, default=2.5, help='background split duration')
parser.add_argument('-p', '--pitch', type=float, default=4.0, help='pitch semitones range')
parser.add_argument('-t', '--tempo', type=float, default=0.8, help='tempo percentage range')
parser.add_argument('-D', '--destination', type=str, default='dataset', help='destination directory')
parser.add_argument('-a', '--foreground_attenuation', type=float, default=0.4, help='foreground random attenuation range')
parser.add_argument('-A', '--background_attenuation', type=float, default=0.4, help='background random attenuation range')
parser.add_argument('-B', '--background_percent', type=float, default=0.8, help='Background noise percentage')
parser.add_argument('-T', '--testing_percent', type=float, default=0.1, help='dataset testing percent')
parser.add_argument('-v', '--validation_percent', type=float, default=0.1, help='dataset validation percentage')
parser.add_argument('-S', '--silence_percent', type=float, default=0.1, help='dataset silence percentage')
parser.add_argument('-n', '--notkw_percent', type=float, default=0.1, help='dataset notkw percentage')
parser.add_argument('-s', '--file_min_silence_duration', type=float, default=0.1, help='Min length of silence')
parser.add_argument('-H', '--silence_headroom', type=float, default=1.0, help='silence threshold headroom ')
parser.add_argument('-m', '--min_samples', type=int, default=100, help='minimum resultant samples')
# NB argparse's type=bool treats any non-empty string as True, so parse explicitly
parser.add_argument('-N', '--norm_silence', type=lambda s: s.lower() in ('true', '1', 'yes'), default=True, help='normalise silence files')
parser.add_argument('-o', '--overfit-ratio', type=float, default=0.75, help='reduces pitch & tempo variation')
args = parser.parse_args()

The --overfit-ratio scales down the pitch & tempo settings for kw relative to !kw, so the data is more varied in !kw; as !kw is also only 10% of the dataset it is ‘more underfitted’ and hence accepts more.
Conversely, kw has far more items in the dataset with less variance and so is ‘more overfitted’, to stop false positives.
The 0.75 default is probably quite high, so lower it to tighten the variance of kw.
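
As a sketch of the idea (not the actual mix-b.py internals, just the shape of it using the args above and pysox):

    import random
    import sox  # pysox

    # kw gets its pitch/tempo variance narrowed by --overfit-ratio while
    # notkw keeps the full range, so kw is 'tighter' and notkw more varied
    def augment(in_path, out_path, is_kw, args):
        scale = args.overfit_ratio if is_kw else 1.0
        tfm = sox.Transformer()
        tfm.pitch(random.uniform(-args.pitch, args.pitch) * scale)
        tfm.tempo(1.0 + random.uniform(args.tempo - 1.0, 1.0 - args.tempo) * scale)
        tfm.build(in_path, out_path)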

I have been using a cardioid and get really good results with noise, but omnidirectionals just bleed everything in equally, so they are much less tolerant to noise dB.

Without beamforming I find an el cheapo cardioid mic & USB soundcard (GeneralPlus chipset) much better than any omnidirectional hat.

If it is the GeneralPlus chipset, the AGC on the above is pretty great.

I knocked out the attenuation ranges and do very little with the silence category, but I need to add them back without breaking things further like I did before.

Mix-b.py works! It has been running for quite a while (~1.5 hrs) and is still going. It would be great if it supported parallel processing, so I could put it on a multi-core server for faster mixing.
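
A minimal sketch of how the per-file sox builds could be parallelised (augment_file here is a hypothetical wrapper around the pysox transforms):

    from multiprocessing import Pool
    import glob
    import os

    import sox  # pysox

    def augment_file(path):
        # hypothetical: apply the per-sample transforms and write to dataset/
        tfm = sox.Transformer()
        tfm.pitch(2.0)  # example transform only
        tfm.build(path, os.path.join('dataset', os.path.basename(path)))

    if __name__ == '__main__':
        with Pool(os.cpu_count()) as pool:
            pool.map(augment_file, glob.glob('rec/*.wav'))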

I am also curious if g-kws can support training with multiple custom kw?

Yeah it can; it's just my dataset-builder that builds it that way, but in reality there are 3 classifications (silence, notkw and kw) which could equally have been multiple custom keywords, e.g. --wanted_words silence,notkw,computer,lights with whatever keywords you record (names here purely for illustration).

My idea was to use Raven, as it has a really simple alg, to capture kw in the way reader.py does, so the initial sample set grows on each addition.

https://github.com/google-research/google-research/blob/master/kws_streaming/experiments/kws_experiments_35_labels.md shows an example; basically you can do the same with any model by adding classifications, and the example does show 2 extremely accurate lightweight models from the framework.

With mix-b.py it is likely that further resilience to false positives could come from adding notkw samples from the Google command set, but I have tried not to include any 3rd-party single-language dataset.
I have tried to keep to my own voice for kw & notkw; for silence I have posted a selection of background_noise files that are universal. Adding your actual background noise can only improve this.
With the cardioid mic I get an element of natural beamforming and resilience to noise is much higher, which is why I am desperate to find a steerable beamformer and DOA. Expectations there can be quite low, as even with the relatively low attenuation of a cardioid mic, when you combine it with a noise-resilient KWS the overall result seems to compound.

I picked the CRNN because only the GRU is delegated out, yet it is a natural streaming model with only slightly more ops than a CNN and seems to provide decent accuracy levels.
In the framework, though, any model from the basic CNN to att_mh_rnn could be used, and to be honest we are only talking a couple of % in overall accuracy.

A CRNN is basically the GRU of the Precise model but with a preceding CNN section that greatly reduces the GRU parameter processing. With the flex delegates only the GRU layer is delegated, so TFLite can be used for the majority of processing; then add quantisation and Aarch64 and the accuracy and load just completely blow Precise out of the water, whilst on accuracy it is just a little more accurate than a GRU alone.
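
As a rough sketch of that shape in Keras (sizes borrowed from the training flags further down; this is the idea only, not the kws_streaming implementation, which also builds the streaming variants):

    import tensorflow as tf

    # the CNN front-end shrinks the feature map so the GRU (the op that gets
    # flex-delegated in TFLite) has far less parameter processing per step
    def crnn_sketch(num_classes=3):
        inputs = tf.keras.Input(shape=(49, 20, 1))  # (time, mfcc features, 1)
        x = tf.keras.layers.Conv2D(16, (3, 3), activation='relu')(inputs)
        x = tf.keras.layers.Conv2D(16, (5, 3), activation='relu')(x)
        # keep time, flatten the feature/channel dims for the recurrent layer
        x = tf.keras.layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
        x = tf.keras.layers.GRU(256)(x)
        x = tf.keras.layers.Dense(128, activation='linear')(x)
        x = tf.keras.layers.Dense(256, activation='relu')(x)
        outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
        return tf.keras.Model(inputs, outputs)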

What is Raven? Any reference links?

It's really stupid, as it's been implemented only capturing KW, so you end up with an incomplete dataset.
But Raven could be a quick start-up KWS used to capture KW & command sentences for a more accurate KWS, as Raven is not very accurate but enough to get you going; it also doesn't scale well to multiple voices.

It's an old alg that looks for a line of fit through some frequency bands; it was employed by some early KWS examples.

It's only worth it if the ‘reader.py’ training is considered too much.

Trying to train the prepared dataset with kws_streaming, I encounter this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 
Node 'training/Adam/gradients/gradients/gru/cell/while_grad/gru/cell/while_grad': 
Connecting to invalid output 51 of source node gru/cell/while which has 51 outputs. 
Try using tf.compat.v1.experimental.output_all_intermediates(True).

Any idea?

Yeah, you're using TF 2.5.

The same happened for me on an early adoption of 2.5 over 2.4.1.

Dunno, some TF error with sym links?!

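Presumably the posted fix was along these lines, unrolling the GRU so the while-loop gradient path is avoided (a sketch; the exact spot in kws_streaming may differ):

    import tensorflow as tf

    # unroll=True sidesteps the 'invalid output 51 of gru/cell/while' error
    gru = tf.keras.layers.GRU(256, return_sequences=True, unroll=True)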

Thanks. “unroll=True” did the trick. Now training.

A question: is training for 4000 steps necessary? At 1500 steps it already seems to converge to ~99% accuracy.


The 6x en*.txt files are short phonetic sentences based on English (en); if you manage to find any similar alternative-language sentences then please post them.

parser = argparse.ArgumentParser()
parser.add_argument('-b', '--background_dir', type=str, default='_background_noise_', help='background noise directory')
parser.add_argument('-r', '--rec_dir', type=str, default='rec', help='recorded samples directory')
parser.add_argument('-R', '--background_ratio', type=float, default=0.20, help='background ratio to foreground')
parser.add_argument('-d', '--background_duration', type=float, default=2.5, help='background split duration')
parser.add_argument('-p', '--pitch', type=float, default=4.0, help='pitch semitones range')
parser.add_argument('-t', '--tempo', type=float, default=0.8, help='tempo percentage range')
parser.add_argument('-D', '--destination', type=str, default='dataset', help='destination directory')
parser.add_argument('-a', '--foreground_attenuation', type=float, default=0.4, help='foreground random attenuation range')
parser.add_argument('-A', '--background_attenuation', type=float, default=0.4, help='background random attenuation range')
parser.add_argument('-B', '--background_percent', type=float, default=0.8, help='Background noise percentage')
parser.add_argument('-T', '--testing_percent', type=float, default=0.1, help='dataset testing percent')
parser.add_argument('-v', '--validation_percent', type=float, default=0.1, help='dataset validation percentage')
parser.add_argument('-S', '--silence_percent', type=float, default=0.1, help='dataset silence percentage')
parser.add_argument('-n', '--notkw_percent', type=float, default=0.1, help='dataset notkw percentage')
parser.add_argument('-s', '--file_min_silence_duration', type=float, default=0.1, help='Min length of silence')
parser.add_argument('-H', '--silence_headroom', type=float, default=1.0, help='silence threshold headroom ')
parser.add_argument('-m', '--min_samples', type=int, default=100, help='minimum resultant samples')
# NB argparse's type=bool treats any non-empty string as True, so parse explicitly
parser.add_argument('-N', '--norm_silence', type=lambda s: s.lower() in ('true', '1', 'yes'), default=True, help='normalise silence files')
parser.add_argument('-o', '--overfit-ratio', type=float, default=0.20, help='reduces pitch & tempo variation')
args = parser.parse_args()

I probably should do more testing, but the above seemed a reasonable starting point.

Attenuation isn't employed in mix-b.py, as it seemed to cause all sorts of problems; though likely KW was just less ‘overfitted’, and the notkw & silence percentages were also far too high and not ‘underfitted’ enough.
It probably can be added now that I have realised where I was going wrong in my initial attempts, as long as it is employed without creating too much KW variance.

In tfl-stream.py I have the sensitivity at 0.5, as I am watching for the near hits that a more normal setting of 0.65/0.7 would not produce.

Finished training and tried it on my Mac first. It seems to work very well, even though the mic is different from the one used for my dataset.
But when I try to run it on the Pi with tflite_runtime, it fails as there is no flex delegate support. Seems like I need to install full TensorFlow first.

Yeah, install the full TF again from PINTO. https://github.com/PINTO0309/Tensorflow-bin/blob/main/tensorflow-2.5.0-cp37-none-linux_armv7l_numpy1195_download.sh
https://github.com/PINTO0309/Tensorflow-bin/blob/main/tensorflow-2.5.0-cp37-none-linux_aarch64_numpy1195_download.sh

Just call the lite method from TF; the perf is the same, and as usual aarch64 is 2-3x faster than armv7:

import tensorflow.lite as tflite
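
Then build the interpreter from that; the API is the same as tflite_runtime (the model filename here is just an example of what the conversion step writes out):

    # with full TF the flex delegate is built in, so the GRU op resolves
    interpreter = tflite.Interpreter(model_path='stream_state_external.tflite')
    interpreter.allocate_tensors()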

I guess you could use tflite_runtime and let it delegate out, but I presume that just loads up both in a way, as tflite is a method of TF; maybe the delegate just loads the delegate method. Dunno, I have always just used full TF.
grpcio can be an absolute killer install on a Pi3; have a look at my KWS install guide, and install zram & increase the swap size for the compile from hell.

It's always better, by quite a bit, to record on the device you will use, with the sound equipment you will use.
Just clone and use record.py, then sftp the ‘rec’ files over to run the model on the Mac.
In the 4 training stages you should maybe only need 800,800,800,800; even 400 per stage will probably do.

$CMD_TRAIN \
--data_url '' \
--data_dir ~/Dataset-builder/dataset/ \
--train_dir $MODELS_PATH/crnn_state/ \
--wanted_words silence,notkw,kw \
--mel_upper_edge_hertz 7600 \
--how_many_training_steps 800,800,800,800 \
--learning_rate 0.001,0.0005,0.0001,0.00002 \
--window_size_ms 40.0 \
--window_stride_ms 20.0 \
--mel_num_bins 40 \
--dct_num_features 20 \
--alsologtostderr \
--resample 0.0 \
--split_data 0 \
--train 1 \
--lr_schedule 'exp' \
--use_spec_augment 1 \
--time_masks_number 2 \
--time_mask_max_size 10 \
--frequency_masks_number 2 \
--frequency_mask_max_size 5 \
--feature_type 'mfcc_op' \
--fft_magnitude_squared 1 \
crnn \
--cnn_filters '16,16' \
--cnn_kernel_size '(3,3),(5,3)' \
--cnn_act "'relu','relu'" \
--cnn_dilation_rate '(1,1),(1,1)' \
--cnn_strides '(1,1),(1,1)' \
--gru_units 256 \
--return_sequences 0 \
--dropout1 0.1 \
--units1 '128,256' \
--act1 "'linear','relu'" \
--stateful 1

The CRNN is just the GRU of Precise but with the CNN filtering parameters, so when delegating out, the GRU part has much less work to do than Precise.
I'm surprised they have still kept a full non-TFLite model for Precise, as there are a couple to choose from; the DSCNN is supposedly quite good and is fully TFLite with no delegation, at slightly more load, though it isn't a true streaming model. As said, it is full TFLite.

You can use the same dataset and play with the models.
I never worked out if the SVDF one is or isn't full TFLite, but it is another true streaming model; the Google Research guys do get them all to stream.

The training loop is pretty old school, as the Google Research team chose to keep it the same as the initial KWS example of TensorFlow and https://github.com/ARM-software/ML-KWS-for-MCU for direct comparison.

They keep adding new bits; recently they added some of the new dynamic post-training quantisation methods with a CNN example.
I am not sure how easy it would be to change the training loop to something a little less ‘brute force’, but whilst they are still adding and updating I want, out of interest, to make sure it stays capable of those changes; I haven't looked at what might be needed.

Tried it on the Pi and it worked like a charm. It can detect the keyword even when my TV plays loud shows. I noticed in tfl-stream.py you have changed the kw hit logic (originally I remember it was by kw_sum?). Now, if I want to adjust the sensitivity, which number should I modify?

I initially had a count to only reset on an argmax from a non-KW result, which I removed as it was just not needed and seemed to make no difference.
I need to check, as occasional dips may be lowering the score, and adding it back may not increase false positives.

I sort of lost track of where I was going, as I had started and stopped, and me being me, I forgot where I was heading.

Basically, in a clean environment, test with 3-4 or more good near KW tests and use the max result as the divisor.
Dividing kw_sum by kw_avg just seems to normalise the result and provide a better scale (algebraically kw_sum / kw_avg reduces to kw_count, the number of consecutive kw frames).
That is used and tested for a max divisor, which gives you the usual 0-1 probability score.

I always order the initial model silence, notkw, kw, but the model folder will show the indexes.

    # model output order follows --wanted_words: 0 silence, 1 notkw, 2 kw
    if np.argmax(output_data[0]) == 2:
      if kw_count > 3:
        print(output_data[0][0], output_data[0][1], output_data[0][2], kw_count, kw_sum)
        not_kw_count = 0
        if output_data[0][2] > kw_max:
          kw_max = output_data[0][2]
      kw_count += 1
      kw_sum = kw_sum + output_data[0][2]
      kw_avg = kw_sum / kw_count
      # kw_sum / kw_avg reduces to kw_count; the 55 divisor scales it to 0-1
      if (kw_sum / kw_avg) / 55 > 1:
        kw_probability = 1.0
      else:
        kw_probability = (kw_sum / kw_avg) / 55
      if kw_probability > 0.5:
        kw_hit = True
    else:  # non-kw frame: only reset after more than 3 consecutive non-kw frames
      if not_kw_count > 3:
        if kw_hit:
          print("Kw threshold hit", kw_max, kw_avg, kw_probability)
        kw_count = 0
        kw_sum = 0
        kw_hit = False
        kw_max = 0
        kw_probability = 0
        not_kw_count = -1  # becomes 0 after the increment below
      not_kw_count += 1

The 55 is just the max divisor, purely there to create a 0-1 probability float, as that seems more the norm.
Sensitivity is kw_probability > 0.5, so really set the divisor to the next ceiling int of the max you get from some good clean KW tests.
Then set the kw_probability threshold to approx 0.65-0.75 (again purely a norm), but really whatever you deem fits and seems to work; the divisor is just there to cope with the range of quality that many user models may create, rather than a single known black-box model.
There are just 2 settings, purely to convert the summed envelope into a more standard probability score. You could use just kw_sum / kw_avg, or kw_sum, but kw_sum / kw_avg gives a much more normalised range than kw_sum alone, and the divisor again is purely there to convert it to a float probability.

The choice is yours, as basically the KWS is a raw model and the code has just got messy after a lot of misguided hacks and changes.
I would say kw_sum / kw_avg with conversion to a float probability, and then whatever you deem fit; the code as it stands is a bit messy at the moment, but not much is really needed.

If you can get a unidirectional (cardioid) mic or a beamformer, then the attenuation of noise will also increase resilience; making the clean voice more predominant gives results much higher than the attenuation value alone would suggest.

I will let you decide, as your code is far more pythonic and streamlined than mine; really all that is needed is a websocket that transmits the KW hit and the following command sentence.

I was also going to do a little routine to store the last_kw audio to a /tmp file that a websocket command would retrieve: on some form of logic where it has all run through ASR to intent and a pause signifies completion and a correct inference session, fire off a get_last_kw which is retrieved from /tmp.
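
A minimal sketch of that store routine, assuming 16 kHz mono int16 frames (names hypothetical):

    import wave

    def store_last_kw(frames, path='/tmp/last_kw.wav', rate=16000):
        # stash the audio of the last detected KW for a later get_last_kw
        with wave.open(path, 'wb') as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)  # int16
            wf.setframerate(rate)
            wf.writeframes(b''.join(frames))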

The KW-hit value is for a zoned distributed array, so that the best stream can be selected for ASR and the command-sentence transmits of lesser KW-hit KWS in other zones can be kicked.

But really, use any method you deem fit; there isn't all that much to a NN KWS model, they are quite simple. I even ignored my own mantra that less is more and created loads of additions, rather than simply underfitting the non-kw classifications and overfitting kw to help stop false positives.
As more natural usage data is added and the augmented samples are replaced, it should grow in accuracy on each retrain.
Again, part of the websocket could be to deliver an OTA tflite model, as so much of the code is peripheral to the actual inference.

I did think about maybe using the other classification results to add weight to the kw score, but never got round to devising a plan (i.e. a KW hit where the other classifications are strongly negative carries more weight than the same KW value where the other classifications show more entropy).
It is probably simple to implement as the average of the other classifications, taking the span from that to the current KW value rather than kw alone; I just never checked what results that provides.

If you have a KW that contains smaller common word phones, adding them to notkw as individual words should stop any problems if they arise; alternatively a minimum kw_count could be implemented (see the sketch after the snippet below), but it is very subjective, as the difference between a fast kw and a slowly spoken ‘subword’ could be minimal.

      # sketch: subtract the averaged non-kw scores so a confident kw frame
      # (low silence/notkw) adds more weight than an uncertain one
      kw_count += 1
      kw_weight = (output_data[0][0] + output_data[0][1]) / 2
      kw_sum = kw_sum + output_data[0][2] - kw_weight
      kw_avg = kw_sum / kw_count

?
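
For the minimum kw_count idea above, a gate like this might be all that is needed (min_kw_count is hypothetical and would be tuned per keyword length and frame stride):

    # reject very short runs that are likely sub-word hits
    if kw_probability > 0.5 and kw_count >= min_kw_count:
        kw_hit = True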



Just out of interest