Wake word creation -- Snowman anyone?

The model is a bit more problematic in that I think a GRU like Precise uses doesn’t run on TensorFlow Lite.
But to run it on basic TensorFlow you can use my example tfl-stream as a template, though you will have to create an MFCC front-end feed; g-kws embeds the MFCC in the model, so there you can just forward chunked audio.
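For a GRU that expects features rather than raw audio, the front-end feed is roughly a per-chunk MFCC like the sketch below; this is only an illustration assuming 16 kHz mono float audio and librosa, not Precise's actual settings:

import numpy as np
import librosa

SAMPLE_RATE = 16000              # assumed 16 kHz mono input
N_MFCC = 20                      # illustrative feature count
WIN = int(0.040 * SAMPLE_RATE)   # 40 ms analysis window
HOP = int(0.020 * SAMPLE_RATE)   # 20 ms stride

def mfcc_frames(chunk: np.ndarray) -> np.ndarray:
    """Turn a float32 audio chunk into (frames, N_MFCC) MFCC features."""
    feats = librosa.feature.mfcc(y=chunk, sr=SAMPLE_RATE, n_mfcc=N_MFCC,
                                 n_fft=WIN, hop_length=HOP)
    return feats.T  # one row per 20 ms step, ready to feed the GRU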

That linto hmg is great to get a feel of how samples can affect the model, as it’s extremely interesting and enlightening to see which fail.
It’s through linto hmg that I discovered how many bad samples are in the Google command set, and then it clicked that the Google command set is a benchmark dataset and not necessarily a good dataset to create a model on unless you trim out the bad ones.

The linto project has an MFCC routine and there is info in their repos, and here is mine with just a chunked audio stream.

I could prob knock you up an MFCC version to test that GRU, but after playing with an easy GUI I would suggest going a bit more hardcore with the g-kws models, as the TensorFlow Lite models run with much less load; the difference is pretty huge.

mkdir g-kws
cd g-kws
git clone https://github.com/google-research/google-research.git
mv google-research/kws_streaming .

You can delete the rest of the google-research dir if you wish, as I dunno why they put it all in one repo.

#!/bin/bash

# Train KWT on Speech commands v2 with 12 labels

KWS_PATH=$PWD
DATA_PATH=$KWS_PATH/data2
MODELS_PATH=$KWS_PATH/models_data_v2_12_labels
CMD_TRAIN="python -m kws_streaming.train.model_train_eval"


$CMD_TRAIN \
--data_url '' \
--data_dir $DATA_PATH/ \
--train_dir $MODELS_PATH/crnn_state/ \
--mel_upper_edge_hertz 7600 \
--how_many_training_steps 2000,2000,2000,2000 \
--learning_rate 0.001,0.0005,0.0001,0.00002 \
--window_size_ms 40.0 \
--window_stride_ms 20.0 \
--mel_num_bins 40 \
--dct_num_features 20 \
--resample 0.15 \
--alsologtostderr \
--train 1 \
--lr_schedule 'exp' \
--use_spec_augment 1 \
--time_masks_number 2 \
--time_mask_max_size 10 \
--frequency_masks_number 2 \
--frequency_mask_max_size 5 \
crnn \
--cnn_filters '16,16' \
--cnn_kernel_size '(3,3),(5,3)' \
--cnn_act "'relu','relu'" \
--cnn_dilation_rate '(1,1),(1,1)' \
--cnn_strides '(1,1),(1,1)' \
--gru_units 256 \
--return_sequences 0 \
--dropout1 0.5 \
--units1 '128,256' \
--act1 "'linear','relu'" \
--stateful 1

You just need to create a dataset and do something like I did in tfl-stream.py (g-kws/tfl-stream.py at main · StuartIanNaylor/g-kws · GitHub).

You can use the Google command set or create your own with the dataset-builder (GitHub - StuartIanNaylor/Dataset-builder: KWS dataset builder for Google-streaming-kws or another).
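Whichever route you take, the dataset just needs to end up in the same shape as the Google command set: one folder per label of 1 sec wavs plus a _background_noise_ folder of longer noise clips. A quick sanity-check sketch (the data dir name only matches the script above, adjust to your own):

import os

DATA_DIR = "data2"  # same data dir as the training script above

# One sub-directory per label with the wav samples inside;
# _background_noise_ holds the longer noise clips used for mixing.
for label in sorted(os.listdir(DATA_DIR)):
    path = os.path.join(DATA_DIR, label)
    if os.path.isdir(path):
        wavs = [f for f in os.listdir(path) if f.endswith(".wav")]
        print(f"{label}: {len(wavs)} wavs")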

Have a read of google-research/kws_streaming at master · google-research/google-research · GitHub, as it’s extremely well documented and contains just about every current state-of-the-art model for KWS.
The CRNN-state is prob a good start, and google-research/base_parser.py at master · google-research/google-research · GitHub contains all the parameters which, after a bit of head-scratching, should get you going.

PS if you have 2 mics: I have been playing with the pulseaudio beamformer, which I had ruled out as you could not steer it.
I have had a rethink and can now steer it, so as well as a cutting-edge model you can prob also add beamforming if interested.

google-research/kws_experiments_paper_12_labels.md at master · google-research/google-research · GitHub gives a pretty good guide, and the command set for testing can be downloaded:

# download and set up path to data set V2 and set it up
wget https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz
mkdir data2
mv ./speech_commands_v0.02.tar.gz ./data2
cd ./data2
tar -xf ./speech_commands_v0.02.tar.gz
cd ../

or

# download and set up path to data set V1 and set it up
wget http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
mkdir data1
mv ./speech_commands_v0.01.tar.gz ./data1
cd ./data1
tar -xf ./speech_commands_v0.01.tar.gz
cd ../

But if you have a look at my dataset-builder repo, I posted as many datasets & noise files as I could find, so even if it’s of no use to you it might be a good source of datasets, even if few.

sanebow’s code is far more polished than anything I have provided

2 Likes

Argh! Too advanced for me, I will have to study a lot. :slight_smile:

1 Like

Suck & see is often far more educational.
Just grab the g-kws framework

mkdir g-kws
cd g-kws
git clone https://github.com/google-research/google-research.git
mv google-research/kws_streaming .

Go with command set 2

# download and set up path to data set V2 and set it up
wget https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz
mkdir data2
mv ./speech_commands_v0.02.tar.gz ./data2
cd ./data2
tar -xf ./speech_commands_v0.02.tar.gz
cd ../

Save and run this script:

#!/bin/bash

# Train KWT on Speech commands v2 with 12 labels

KWS_PATH=$PWD
DATA_PATH=$KWS_PATH/data2
MODELS_PATH=$KWS_PATH/models_data_v2_12_labels
CMD_TRAIN="python -m kws_streaming.train.model_train_eval"


$CMD_TRAIN \
--data_url '' \
--data_dir $DATA_PATH/ \
--train_dir $MODELS_PATH/crnn_state/ \
--mel_upper_edge_hertz 7600 \
--how_many_training_steps 2000,2000,2000,2000 \
--learning_rate 0.001,0.0005,0.0001,0.00002 \
--window_size_ms 40.0 \
--window_stride_ms 20.0 \
--mel_num_bins 40 \
--dct_num_features 20 \
--resample 0.15 \
--alsologtostderr \
--train 1 \
--lr_schedule 'exp' \
--use_spec_augment 1 \
--time_masks_number 2 \
--time_mask_max_size 10 \
--frequency_masks_number 2 \
--frequency_mask_max_size 5 \
crnn \
--cnn_filters '16,16' \
--cnn_kernel_size '(3,3),(5,3)' \
--cnn_act "'relu','relu'" \
--cnn_dilation_rate '(1,1),(1,1)' \
--cnn_strides '(1,1),(1,1)' \
--gru_units 256 \
--return_sequences 0 \
--dropout1 0.5 \
--units1 '128,256' \
--act1 "'linear','relu'" \
--stateful 1
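Once training and conversion finish, the streaming tflite model should be sitting under the train dir. A quick check, with the sub-folder names taken from the inference script further down (they may differ between kws_streaming versions, so adjust to whatever your run actually produced):

import os

# MODELS_PATH/crnn_state from the training script, plus the folder names
# the inference script below expects; tweak if your layout differs.
model = os.path.join("models_data_v2_12_labels", "crnn_state",
                     "quantize_opt_for_size_tflite_stream_state_external",
                     "stream_state_external.tflite")
print("found" if os.path.isfile(model) else "missing:", model)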

Just make sure the path of the tflite model is right and run, or just download it.

import tensorflow as tf
import sounddevice as sd
import numpy as np
import threading


def sd_callback(rec, frames, time, status):
    global input_details1
    global output_details1
    global inputs1
    global kw_count
    global kw_sum
    global kw_hit
    global kw_avg
    global kw_probability
    global not_kw
    global silence_count
    global silence_hit
    
    # Notify if errors
    if status:
        print('Error:', status)
    
    rec = np.reshape(rec, (1, 320))
    
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
     
    if np.argmax(output_data[0]) == 2:
      print(output_data[0][0], output_data[0][1], output_data[0][2], kw_avg, kw_probability, kw_count)
      kw_count += 1
      kw_sum = kw_sum + output_data[0][2]
      kw_avg = kw_sum / kw_count
      # 7.5 is the same magic number as word_threshold further down; dividing
      # the running average by it gives the rough 0-1 'probability' used below
      kw_probability = kw_avg / 7.5
      silence_count = 0
      if silence_hit == True:
        print('Silence hit')
        silence_hit = False
      if kw_probability > 0.5 and kw_count >= 15:
        kw_hit = True
    elif np.argmax(output_data[0]) == 1:
      not_kw = True
      silence_count = 0
      if silence_hit == True:
        print('Silence hit')
        silence_hit = False
    elif np.argmax(output_data[0]) == 0:
      not_kw = True
      silence_count += 1
      if silence_count >= 100:
        silence_hit = True
      
    if not_kw == True:
      if kw_hit == True:
        print("Kw threshold hit", kw_avg, kw_probability, kw_count)
      kw_count = 0
      kw_sum = 0
      kw_hit = False
      kw_max = 0
      kw_probability = 0
      not_kw = False




# Parameters
word_threshold = 7.5
word_duration = 10
rec_duration = 0.020
sample_rate = 16000
num_channels = 1
kw_avg = 0
kw_count = 0
kw_sum = 0
kw_probability = 0
kw_hit = False
not_kw = False
silence_count = 0
silence_hit = False

sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')


# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="/home/pi/g-kws/models2/crnn_state/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")
interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    
kw_count = 0
not_kw_count = 0
kw_sum = 0
kw_hit = False



    


# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    threading.Event().wait()

To be honest their training script and layers are a complete confusion to me, but we don’t need to know what google-research do.
The model just fires back a tensor where each classification is an element in an array; I use argmax() to get the biggest and that’s it really, apart from building your own dataset.
I have been wondering if there is a better way to get a single probability score from a 1 sec envelope of 20ms steps, but the above will do.
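One alternative worth trying is a plain moving average of the keyword output over the ~50 frames that make up the 1 sec envelope, instead of the running sum and reset logic above. A minimal sketch; the window length and threshold here are only illustrative, not tuned values:

from collections import deque
import numpy as np

WINDOW = 50        # ~1 sec of 20 ms steps, illustrative
THRESHOLD = 0.5    # illustrative trigger level

kw_scores = deque(maxlen=WINDOW)

def update(kw_score: float) -> bool:
    """Push the latest keyword output and return True when the
    windowed average crosses the threshold."""
    kw_scores.append(kw_score)
    return len(kw_scores) == WINDOW and float(np.mean(kw_scores)) > THRESHOLD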

1 Like

So to put it in a Rhasspy context,
you could use your python script as the External Command by simply exiting the script on a valid response (printing the wakewordId to stdout on exit, of course).

Is that correct?
It still sends the wav over HTTP, but if the script is on the same machine as Rhasspy the overhead is quite small, AFAIK.

Yeah, when it comes to an all-in-one Rhasspy is pretty good, so apols for confusing you with my opinion about external networked KWS.

I am not sure why Rhasspy focuses on a ‘singular unit’, as to be honest the commercial guys have us beat hands down for much less than we can build for.
But if you were looking at creating a single brain with multiple ears of distributed KWS, IMO chunking raw wavs over MQTT so they are broadcast to all subscribers is just plain bad, and that always brings up a contentious argument.
A few don’t like my opinion that mic arrays without DSP are useless, and if I call it Herpes audio it says enough of what I think without further tedium.

If you are going to use the host and pipe in the audio and pipe out default on the wakewordId it is super easy.
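A minimal sketch of that wiring, reusing the threading.Event pattern from the detection script above; “okay_pi” is only an example wakewordId and the exact handshake is whatever Rhasspy’s external command expects, so treat this as an untested assumption:

import sys
import threading

detected = threading.Event()  # set by the audio callback on a keyword hit

def on_keyword_hit():
    """Call this where the script above prints 'Kw threshold hit'."""
    detected.set()

def wait_and_report(wakeword_id="okay_pi"):
    """Use in place of threading.Event().wait(): block until a hit,
    print the wakewordId to stdout and exit so the host sees the response."""
    detected.wait()
    print(wakeword_id, flush=True)
    sys.exit(0)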

But that is the real point. I find the cost to my privacy way too high. So any private system is cheaper than the alternatives.
It all depends on how far down the particular rabbit hole you wish to travel.
The current invocation of rhasspy works. It is easy to configure and run and can be as complex as you like. The external command opens up some doors, and if you wanted satellites you could simply implement a server on the main host that each satellite connected to, independent of rhasspy. The wake word server could be on the same machine as rhasspy so the MQTT stuff is always local. Rhasspy starts the wake word server which then deals with the wake state however it likes.
Like say polling a pin for a push button event.
Polling over another bus say i2c to a movement sensor.
Or even wifi to another distant trigger client like a phone.
The fact that rhasspy sends its wav to the server is only one possible input.
You can implement whatever you like and rhasspy stays simple and usable within its own little world.
From what I can see rhasspy responds specifically to each wakewordId, so the possibilities and flexibility are still there.
I am only new to this and have not got all the facts yet, but it seems to me that rhasspy offers the best of both worlds.
Even though she might have herpes as you say. :smiley:
Having herpes myself I can attest to the fact that one can still function normally with the disease. :upside_down_face: you just have to do things differently.

I have had some problems, as for some reason Docker doesn’t match my Pi 4 with the correct image.
I understand this is still beta, but for now I will try another path.

However it is clear to me that rhasspy has a bright future ahead.

Thanks everyone there is some very useful information in this thread alone.
Thanks Daenara I will try Raven just to get me started.
Every time I explore the docs and re-read this thread I learn so much more.
Wow this is so cool.

2 Likes

For KWS it is ok; now I am doing some tests. It is still not clear to me how to interface with rhasspy, does your latest script do this? I am a programmer, but at a microcontroller level; with high-level languages I am poorly prepared.

Stop right there! Did you say ‘microcontroller level’ !!! :slight_smile:

Oh yeah, and as I will handhold as much as it takes: don’t do a CRNN but a CNN for microcontrollers (google-research/kws_experiments_paper_12_labels.md at master · google-research/google-research · GitHub).

As if you’re armed with tensorflow/tensorflow/lite/experimental/microfrontend at master · tensorflow/tensorflow · GitHub, then google-kws will train for that front-end, or at least there are parameters and code to do so; I've just never tried.

And maybe we could be the 1st to squirt a model down, say, an ESP32-WROVER with an I2S mic module?

If you look at bemused client (https://github.com/rhasspy/bemused-client/blob/master/bemused_client/main.py),
it’s taking the audio from stdin and the code is in the link above.

Why bemused client takes a streaming kws framework and creates a non-stream model is bemusing to me and hence you might realise why it got that name.
Prob sanebow’s is a better option; just steal the stdin methods.
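For reference, taking the audio from stdin instead of sounddevice is only a few lines. A minimal sketch assuming the host pipes raw 16-bit 16 kHz mono PCM, chunked into the same 20 ms blocks the streaming model above expects:

import sys
import numpy as np

CHUNK_SAMPLES = 320              # 20 ms at 16 kHz
CHUNK_BYTES = CHUNK_SAMPLES * 2  # 16-bit samples

def stdin_chunks():
    """Yield float32 (1, 320) blocks read from raw PCM on stdin."""
    while True:
        data = sys.stdin.buffer.read(CHUNK_BYTES)
        if len(data) < CHUNK_BYTES:
            return  # stream closed
        pcm = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
        yield pcm.reshape(1, CHUNK_SAMPLES)

# each yielded block can be fed to the interpreter exactly like `rec` in the
# sounddevice callback above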

But the same datasets can be trained by the same google-kws-streaming framework with a choice of models, then converted and quantised for tflite.
But you can also take the next step and employ the microcontroller code of the Audio “frontend” TensorFlow operations for feature generation, train for that front end, and convert the tflite with XNN to a binary for the ESP32.

I keep thinking this so should be done, as you can get an ESP32-WROVER (it does have the comfort of the 8MB PSRAM) for < $5 and an I2S mic module for < $3.

So yeah, test away; Raspberry and embedded Linux (Pi) is prob the easiest way to get a feel for things.

Bemused client is just another KWS that only has host capability, but any KWS code can use that simple stdin method.

1 Like

I would be very happy! Unfortunately, I have very little free time. : /