The model is a bit more problematic in that I think a GRU like Precise's doesn't run on TensorFlow Lite.
But basically, running full TensorFlow, you can use my example tfl-stream, though you will have to create an MFCC frontend feed, as g-kws embeds the MFCC in the model so you can just forward chunked audio.
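If you do end up feeding a model that expects MFCC input rather than raw audio, here is a rough pure-numpy sketch of such a frontend, using the same feature parameters as the training script further down (40 ms window, 20 ms stride, 40 mel bins up to 7600 Hz, 20 DCT features, 16 kHz audio). The function name is mine and the values won't match TF's own MFCC op bit-for-bit, so treat it as a starting point only:

```python
# Rough MFCC frontend sketch for models that don't embed feature
# extraction (unlike g-kws, which takes raw chunked audio). Pure numpy,
# so values will differ slightly from TF's MFCC op.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frontend(audio, sr=16000, win_ms=40.0, hop_ms=20.0,
                  n_mels=40, n_dct=20, fmin=20.0, fmax=7600.0):
    win = int(sr * win_ms / 1000)          # 640 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)          # 320 samples at 16 kHz
    n_fft = 1024                           # next power of two >= win
    # frame, window, power spectrum
    n_frames = 1 + (len(audio) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # triangular mel filterbank
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(np.maximum(power @ fbank.T, 1e-10))
    # DCT-II to decorrelate, keeping the first n_dct coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n + 0.5)[None, :] * np.arange(n_dct)[:, None])
    return logmel @ dct.T
```

One second of 16 kHz audio gives 49 frames of 20 coefficients with these settings.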
That LinTO HMG is great to get a feel for how samples can affect the model, as it's extremely interesting and enlightening to see which ones fail.
It's through LinTO HMG that I discovered how many bad samples are in the Google command set, and then it clicked that the Google command set is a benchmark dataset and not necessarily a good dataset to train a model on unless you trim out the bad ones.
The LinTO project has an MFCC routine and there is info in their repos; here is mine with just a chunked audio stream.
I could probably knock you up an MFCC version to test that GRU, but after playing with an easy GUI I would suggest going a bit more hardcore with the g-kws models, as the TensorFlow Lite models run with much less load; the difference is pretty huge.
mkdir g-kws
cd g-kws
git clone https://github.com/google-research/google-research.git
mv google-research/kws_streaming .
You can delete the rest of the google-research dir if you wish; I don't know why they put everything in one repo.
#!/bin/bash
# Train KWT on Speech commands v2 with 12 labels
KWS_PATH=$PWD
DATA_PATH=$KWS_PATH/data2
MODELS_PATH=$KWS_PATH/models_data_v2_12_labels
CMD_TRAIN="python -m kws_streaming.train.model_train_eval"
$CMD_TRAIN \
--data_url '' \
--data_dir $DATA_PATH/ \
--train_dir $MODELS_PATH/crnn_state/ \
--mel_upper_edge_hertz 7600 \
--how_many_training_steps 2000,2000,2000,2000 \
--learning_rate 0.001,0.0005,0.0001,0.00002 \
--window_size_ms 40.0 \
--window_stride_ms 20.0 \
--mel_num_bins 40 \
--dct_num_features 20 \
--resample 0.15 \
--alsologtostderr \
--train 1 \
--lr_schedule 'exp' \
--use_spec_augment 1 \
--time_masks_number 2 \
--time_mask_max_size 10 \
--frequency_masks_number 2 \
--frequency_mask_max_size 5 \
crnn \
--cnn_filters '16,16' \
--cnn_kernel_size '(3,3),(5,3)' \
--cnn_act "'relu','relu'" \
--cnn_dilation_rate '(1,1),(1,1)' \
--cnn_strides '(1,1),(1,1)' \
--gru_units 256 \
--return_sequences 0 \
--dropout1 0.5 \
--units1 '128,256' \
--act1 "'linear','relu'" \
--stateful 1
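Once trained, inference is basically just feeding the stateful model fixed-size hops of audio and smoothing the per-hop keyword posterior before thresholding, which is roughly what my tfl-stream.py does. A minimal sketch of that loop, with a dummy `score_chunk` callable standing in for the actual TFLite interpreter invocation (not shown here):

```python
# Streaming inference loop sketch: feed fixed-size audio hops to a
# stateful model and trigger on a smoothed keyword posterior. The
# score_chunk callable is a dummy stand-in for the real TFLite
# interpreter call in tfl-stream.py.
from collections import deque

def detect_keyword(audio_chunks, score_chunk, threshold=0.8, smooth_n=10):
    """Return True once the moving-average keyword score crosses threshold."""
    recent = deque(maxlen=smooth_n)
    for chunk in audio_chunks:
        recent.append(score_chunk(chunk))  # keyword posterior in 0..1
        if len(recent) == smooth_n and sum(recent) / smooth_n > threshold:
            return True
    return False
```

Smoothing over a window of hops is what stops single noisy frames from firing the keyword.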
You just need to create a dataset and do something like I did in tfl-stream.py: g-kws/tfl-stream.py at main · StuartIanNaylor/g-kws · GitHub
You can use the Google command set or create your own with the dataset-builder: GitHub - StuartIanNaylor/Dataset-builder: KWS dataset builder for Google-streaming-kws or another
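For a custom set, the layout is the same one the Google command set uses and kws_streaming reads: one folder per label full of 1-second 16 kHz 16-bit mono WAVs, plus a `_background_noise_` folder. A quick sketch that stubs that structure out with silent placeholder files (label names here are just examples):

```python
# Stub out a dataset in the speech-commands layout that kws_streaming
# reads: data_dir/<label>/*.wav (1 s, 16 kHz, 16-bit mono) plus a
# _background_noise_ folder. Label names are placeholders.
import os
import wave

def write_silent_wav(path, seconds=1, sr=16000):
    with wave.open(path, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit PCM
        w.setframerate(sr)
        w.writeframes(b"\x00\x00" * sr * seconds)

def make_dataset(root="data2", labels=("marvin", "visual")):
    for label in list(labels) + ["_background_noise_"]:
        os.makedirs(os.path.join(root, label), exist_ok=True)
    for label in labels:
        write_silent_wav(os.path.join(root, label, "sample_0.wav"))

make_dataset()
```

Obviously you replace the silent placeholders with real recordings; the dataset-builder repo automates collecting and trimming those.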
Have a read of google-research/kws_streaming at master · google-research/google-research · GitHub as it's extremely well documented and contains just about every current state-of-the-art model for KWS.
The CRNN-state is probably a good start, and google-research/base_parser.py at master · google-research/google-research · GitHub contains all the parameters that, after a bit of head-scratching, should get you going.
PS if you have 2 mics: I have been playing with the PulseAudio beamformer, which I had ruled out because you could not steer it.
I have had a rethink and can now steer it, so as well as a cutting-edge model you can probably also add beamforming if interested.
google-research/kws_experiments_paper_12_labels.md at master · google-research/google-research · GitHub gives a pretty good guide, and the command set for testing can be downloaded:
# download and set up path to data set V2 and set it up
wget https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz
mkdir data2
mv ./speech_commands_v0.02.tar.gz ./data2
cd ./data2
tar -xf ./speech_commands_v0.02.tar.gz
cd ../
or
# download and set up path to data set V1 and set it up
wget http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
mkdir data1
mv ./speech_commands_v0.01.tar.gz ./data1
cd ./data1
tar -xf ./speech_commands_v0.01.tar.gz
cd ../
But if you have a look at my dataset-builder repo, I posted as many datasets & noise files as I could find, so even if the builder itself is of no use it might be a good source of datasets, even if there are only a few.
Sanebow's code is far more polished than anything I have provided.