Suggestions for Dutch wake word detection for newbie

Hi community,

After learning about Rhasspy from “Everything Smart Home” and recent news events, I successfully set up a Rhasspy base and satellite connected to my Home Assistant installation.

Since my family and I are Dutch, we would like to use a Dutch wake word. So I started out using Raven to train a custom wake word. This did not turn out so well: it often wakes when it should not, and often does not wake when we say the wake word.

After browsing the community forum for a while, I get the idea that Raven may not be the best option.
So, I’m looking for some up-to-date recommendations on how to wake up my Rhasspy in Dutch.

Thanks!

Ronald

Raven is really only good for one thing, and that is quickly setting up something to capture wake word samples so that you can train a model; it's not really great as a KWS.
There isn't a framework for training your own models unless you use a commercial service such as the one that is part of Rhasspy.

Training your own model, and providing a framework for it, is a bit of a hole, as there is a lot to what makes a good KW and to model creation.
Starting with the KW itself: there are very good reasons Google uses 'Hey Google' and Amazon uses 'Alexa'; they are unique, and that is one of the reasons they work.
I have never understood why we don't have a Rhasspy dataset or a 'Hey Rhasspy', as Rhasspy is its name and 'Hey' has sort of become a global KW prefix.

Yeah, you could train your own, but apart from Precise (which for me is oddly named) I don't think there is a KWS that could use it; and if you do choose a KW, you want a unique one that doesn't rhyme with, or share the phonetic syllable sequence of, any other common phrase or name.

Picovoice is OK, but its benchmarks are against pretty much dead legacy KWS systems, and it can be prone to false triggers from noise or media that you may be playing.

Hi @rolyan_trauts,

Thanks for the reply. I like the idea of “Hey Rhasspy” :).
If I understand correctly, using Raven is not a good idea.
I would be best off using PicoVoice, but cannot train it. Unfortunately that means using one of the available options. I do not think my family will be open to saying “porcupine” any time soon.
I can train Precise, but I did not really understand the “there is no KWS that could use it” part.

Did I get that right?

No, you can use it, but it's just not that good as a KWS; it's designed to get you running super quickly, and it can also collect KW samples as you use it.
You can train Precise by collecting KW samples with Raven, or use Picovoice, as they do a custom free KW service (not that I have used it).

I had forgotten about Precise; actually I think it's not that great, hence why I say it is oddly named.

Following up on your suggestions, I’ve been trying out some options.
For some reason, neither Porcupine nor Precise seems able to detect any of the wake words I select.
In my profile folder (/profiles/nl/…) I do not see any of the required files; should these download automatically?
Is there any way to debug the wake word detection?

I didn’t really make any suggestions (disclaimer!) :slight_smile:

I don't know about the above, as I haven't used them for a while; my catchword is 'bemused', which I am at how overly complex the serial processing chain of Rhasspy is.
I don't use Precise or Picovoice and haven't even run up an instance of Rhasspy for a long time.

Don will probably give you some guidance, @donburch. I am currently creating some KW models that synesthesiam might use on the ESP32, just as ready test models.
I don't know the current Rhasspy state of play with either.

Ok, thanks for the support.

Google Colab now lets you pay as you go, so I could probably show how to train a model; I guess you could just create your own.

https://rhasspy.readthedocs.io/en/latest/wake-word/#command

Getting a good KW is quite a bit of work; really it's the dataset creation, as creating models is just running code and waiting.
I will probably post a few models later today or tomorrow of 'hey marvin' in various model types, which I have been doing as examples to port to an ESP32-S3.
Run them and try them out; most are tiny quantised low-load models, but check out the load and how well they do. I will add some simple scripts to run them so you can judge for yourself, as even though I can knock up a dataset in a day or two, for some it could be a bit daunting.


Sorry, I'm well out of my depth on this one :frowning: I tried the "Recommended" options and they seemed to work well enough, and I have stuck with them. I went on to set up my Home Assistant automations, and haven't come back to fiddle with different audio options… and I probably couldn't tell the difference even if I did.

I tried Pico, I even updated to the latest Pico, but I hate the "online" SDK portion of it, so I wrote my own with TensorFlow and built a working training and detection model. I had problems with Raven: either too many false positives, or it only triggered after I said the keyword twice, or it didn't work for other members of my family. Now I have a royalty-free solution that works for everyone in the family.


I am not really sure what the "recommended" options are.

The key bit in the docs seems to be:

The following table summarizes the key characteristics of each wake word system:

| System | Performance | Training to Customize | Online Sign Up |
|---|---|---|---|
| raven | moderate | yes, offline | no |
| porcupine | excellent | yes, offline | no |
| snowboy | good | yes, offline | no |
| precise | moderate | yes, offline | no |
| pocketsphinx | poor | no | no |

The table really only ranks by load, with pocketsphinx being the worst offender: it's an old ASR that recently had an update to 5.0, but I think it's still more a working piece of history than anything of real use. As an ASR it's very lightweight, and the idea is that you create a phonetic keyword on the fly.
It's an interesting fossil, one of the first open-source voice systems, and it sort of works as you would expect.
Snowboy is sort of similar: one of the first KWS models based on a DNN, and it was shipped as open source once its commercial value became obsolete.
Raven actually uses a much older method and, like PocketSphinx, is one of the early attempts at capturing a keyword by approximation envelopes; its only worth now is as a KW dataset capture device, as in use it can be pretty painful.
That leaves Mycroft's Precise, which finally got a TensorFlow Lite version, though in Rhasspy we could well still be using the full TF version. It is still a GRU layer, and those are not light or easily quantised to 8-bit.
Even stranger for a classification model (which is really what all KWS models are), it has just two classifications, KW and everything else, where the guidance is just to put anything not KW in 'unknown' and that will do; anyone with a smattering of ML classification will be raising an eyebrow at that.

Much of the fiddly audio is caused by Rhasspy taking control and dictating formats, and by its use of Docker; outside of Rhasspy and Docker, anyone can quite easily run a Python script or application, or use the CLI with arecord or parec depending on preference. So it's a pretty safe bet that most of those fiddly problems are caused by Rhasspy itself.
Rhasspy also has this borderline-insane raw WAV MQTT broadcast that has all nodes decoding packet headers to see if a message is a broadcast they subscribe to, at raw-audio throughput so mismatched that doing the groceries in a rocket ship would not look out of place.
Hardware-wise, the ReSpeaker mic hats are very fiddly, with poorly supported and documented, overly complex ALSA settings that some of us have tamed; the idea that an optimised voice system can be a bring-your-own mixture, with consideration to others, is deeply flawed.
The PS3 Eye, which I have always been deeply critical of, is helped by some upstream filters that now exist and don't require the clock-drift sync that AEC demands when audio in/out must be on the same card.
Also, I had a eureka moment for the DIYers: if you desolder the 4th mic and apply your DAC to it, you have a 3-channel mic with a 4th hardware loopback channel.

Every bit of hardware and process imposes a fingerprint that, with fixed hardware, you can train and optimise for; commercial offerings do this and Rhasspy doesn't.
None of that matters all that much though, because in a field moving at unprecedented speed, as we enter a new age of the likes of ChatGPT, ML code writing code, and things like Whisper, dictating methods and trying to be a CISC-like framework is not what is needed; it doesn't work well in such a dynamic, rapidly changing field. It's not surprising you end up with a list of obsolete components.
The voice workflow is for the most part a distinct, serial chain, and all we need is a series of routers and queues, not much different to GStreamer in use, where a RISC-like structure of simple parts can be built up into something complex if you wish, but the complex is never dictated for the simple.

The same simple /etc-configured queuing and routing app could be used at every stage of the voice workflow, using simple low-latency 1-to-1 TCP connections, because it's a serial process anyway.
Likely wenet/runtime/core/websocket at main · wenet-e2e/wenet · GitHub has an excellent base; I have no qualms about crediting Apache 2.0 licensed code, as searching out permissive MIT code to refactor and rebrand as your own just ain't my style.
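Just to make the idea concrete, something like this is all I mean by a 1-to-1 TCP pipe between stages; a rough sketch, not wenet's or Rhasspy's code, and the host, port and raw-float32 framing are purely illustrative:

import socket
import threading
import numpy as np
import sounddevice as sd

CHUNK = 320                          # 20 ms at 16 kHz
HOST, PORT = "127.0.0.1", 10400      # hypothetical downstream node

def mic_sender():
    # one node pushes raw 16 kHz mono PCM chunks downstream
    with socket.create_connection((HOST, PORT)) as sock:
        def cb(indata, frames, t, status):
            sock.sendall(indata.tobytes())   # a real node would queue rather than block in the callback
        with sd.InputStream(channels=1, samplerate=16000, dtype='float32',
                            blocksize=CHUNK, callback=cb):
            threading.Event().wait()

def receiver(handle_chunk):
    # the next node reads fixed-size chunks and hands them to whatever stage it hosts (KWS, ASR, ...)
    with socket.create_server((HOST, PORT)) as srv:
        conn, _ = srv.accept()
        buf = b''
        while True:
            data = conn.recv(4096)
            if not data:
                break
            buf += data
            while len(buf) >= CHUNK * 4:     # 4 bytes per float32 sample
                handle_chunk(np.frombuffer(buf[:CHUNK * 4], dtype=np.float32))
                buf = buf[CHUNK * 4:]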
But it's also strange, because all we need is a model runner; that is all a KWS is, and there should be no dictating which model is of use, just purely a model path and a classification index, such as this:

import tensorflow as tf
import sounddevice as sd
import numpy as np
import threading

def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())
      
def sd_callback(rec, frames, time, status):
    global gain, max_rec, kw_hit, kw_hit_rbuff, vad_hit_rbuff, rec_rbuff, rec_samples
    # Notify if errors
    if status:
        print('Error:', status)
    
    rec = np.reshape(rec, (1, rec_samples))
    rec = np.multiply(rec, gain)
    
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    lvl = np.max(np.abs(rec))
    if lvl > max_rec:
      max_rec = lvl
      
    rec_rbuff = np.roll(rec_rbuff, rec_samples)
    rec_rbuff[0:rec_samples] = rec
    
    out_softmax = softmax_stable(output_data[0])
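    # the ring buffers below average the per-frame softmax over recent frames
    # (1 s of 20 ms frames for the KW class) so a single frame cannot trigger or clear a hit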
    
    kw_hit_rbuff = np.roll(kw_hit_rbuff, 1)
    kw_hit_rbuff[0] = out_softmax[0]
    kw_prob = np.mean(kw_hit_rbuff)
    
    if out_softmax[0] > 0.95:
      vad_hit_rbuff = np.multiply(vad_hit_rbuff, 0)
        
    vad_hit_rbuff = np.roll(vad_hit_rbuff, 1)
    vad_hit_rbuff[0] = out_softmax[1]
    vad_prob = np.mean(vad_hit_rbuff)
       
    if vad_prob > 0.95:
      print("Vad:", vad_prob, kw_prob, lvl)
      kw_hit = False
      kw_hit_rbuff = np.multiply(kw_hit_rbuff, 0)
      max_rec = 0.0
      kw_count = 0
      kw_prob = 0     

    if kw_prob > 0.95:
      if kw_hit == False:
        print("Marvin:", kw_prob, lvl)
        print(kw_hit_rbuff)
        kw_hit = True
             
# Parameters
kw_duration = 1.0
rec_duration = 0.020
vad_duration = 0.20
sample_rate = 16000
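# derived sizes from the defaults above: 320 samples per 20 ms callback,
# a 50-slot KW probability ring, a 5-slot VAD ring and a 16000-sample (1 s) audio buffer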

rec_samples = int((sample_rate * kw_duration) * rec_duration)
vad_hit_samples = int((sample_rate * kw_duration) / ((sample_rate * kw_duration) * vad_duration))
kw_hit_samples = int((sample_rate * kw_duration) / ((sample_rate * kw_duration) * rec_duration))
kw_samples = int(sample_rate * kw_duration)
num_channels = 1
gain = 4
max_rec = 0.0
kw_hit = False

kw_hit_rbuff = np.zeros(kw_hit_samples, dtype=np.float32)
vad_hit_rbuff = np.zeros(vad_hit_samples, dtype=np.float32)
rec_rbuff = np.zeros(kw_samples, dtype=np.float32)

sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors, really should be static copies to use as KW resets
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    

# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    threading.Event().wait()

Really it shouldn't be Python, as for DSP handling and iteration Python absolutely sucks performance-wise; but whatever you use, something like the above is all that is needed: purely a model runner, not some weird and wonderful branded KWS.

https://drive.google.com/file/d/1-2EL-61EUzPu_1Y_eUcOhYVdvAMeKATu/view?usp=share_link

stream_state_external.tflite in the crnn_heymarvin/quantize_opt_for_size_tflite_stream_state_external folder is only 482.7 kB in size; just change 'interpreter1 = tf.lite.Interpreter(model_path=' to the model location and that is it (see the example below).
labels.txt is in index order. This is the first iteration of some model examples I am doing; with a few tweaks and some curation of the dataset it may gain an extra percent or so of accuracy, but the size and params are relatively static.
CRNN was an evolution of the GRU model (Precise), as it was realised that a hybrid CNN/GRU halved the params and was more accurate; Precise itself is using an obsolete model.
Equally, the CRNN could be considered obsolete next to extremely small bc-resnet models or highly accurate transformer models, but there is a plethora to choose from in an ever-growing list, and a model runner, not a branded KWS, is all that is needed.
Models are easy to create; it's the dataset collection and creation that isn't much fun.
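For example (a sketch: the folder name comes from the download above, so adjust the path to wherever you unpack it, and the labels.txt read assumes one label per line in index order):

import tensorflow as tf

# point the runner at the downloaded CRNN model instead of my local path
interpreter1 = tf.lite.Interpreter(
    model_path="crnn_heymarvin/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")
interpreter1.allocate_tensors()

# labels.txt gives the class name for each softmax index
with open("crnn_heymarvin/labels.txt") as f:
    labels = [line.strip() for line in f if line.strip()]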

There is very little that is fiddly that isn’t made so.
Much of the load is Python itself iterating chunked audio, but even so, in the terms above it is tiny; the proof of the pudding would be a head-to-head, feeding both something like LibriSpeech.
But you can test the above as an English 'Hey Marvin', so just give it a try; it should just work on the current ALSA default.

I like this rant.

Would you mind sharing the Keras layers you landed on? (Just curious how far off I am.) Update: I got what I wanted from your download, thanks.

When I wrote my first Python iteration of this it was pegging the CPU at 100% on all cores. I have since optimized it with asyncio to get decent performance (something I can live with).

I also feel the Docker container for a Rhasspy satellite can be slimmed down to a bare-metal keyword detector / play-WAV client (stripping the remaining crap out). I am currently working on a minimal TensorFlow Lite Docker image; I am hoping to achieve greatness by optimizing a few items to reduce my dependencies on additional libs.

I also put in a number of optimizations, such as only processing the Keras model after Viterbi decoding of the discriminative state predictions at a workable threshold.

I then perform a number of normalizations on the audio before building the spectrogram.


In each folder there is a 'model_summary.txt':

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_audio (InputLayer)       [(1, 320)]           0           []                               
                                                                                                  
 speech_features (SpeechFeature  (1, 1, 20)          0           ['input_audio[0][0]']            
 s)                                                                                               
                                                                                                  
 tf_op_layer_ExpandDims (Tensor  (1, 1, 20, 1)       0           ['speech_features[0][0]']        
 FlowOpLayer)                                                                                     
                                                                                                  
 stream (Stream)                (1, 1, 18, 16)       160         ['tf_op_layer_ExpandDims[0][0]'] 
                                                                                                  
 stream_1 (Stream)              (1, 1, 16, 16)       3856        ['stream[0][0]']                 
                                                                                                  
 reshape (Reshape)              (1, 1, 256)          0           ['stream_1[0][0]']               
                                                                                                  
 tf_op_layer_streaming/gru_1/Sq  [(1, 256)]          0           ['reshape[0][0]']                
 ueeze (TensorFlowOpLayer)                                                                        
                                                                                                  
 gru_1input_state (InputLayer)  [(1, 256)]           0           []                               
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 768)]          0           ['tf_op_layer_streaming/gru_1/Squ
 ll/MatMul (TensorFlowOpLayer)                                   eeze[0][0]']                     
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 768)]          0           ['gru_1input_state[0][0]']       
 ll/MatMul_1 (TensorFlowOpLayer                                                                   
 )                                                                                                
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 768)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/BiasAdd (TensorFlowOpLayer)                                  l/MatMul[0][0]']                 
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 768)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/BiasAdd_1 (TensorFlowOpLaye                                  l/MatMul_1[0][0]']               
 r)                                                                                               
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256),          0           ['tf_op_layer_streaming/gru_1/cel
 ll/split (TensorFlowOpLayer)    (1, 256),                       l/BiasAdd[0][0]']                
                                 (1, 256)]                                                        
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256),          0           ['tf_op_layer_streaming/gru_1/cel
 ll/split_1 (TensorFlowOpLayer)   (1, 256),                      l/BiasAdd_1[0][0]']              
                                 (1, 256)]                                                        
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/add_1 (TensorFlowOpLayer)                                    l/split[0][1]',                  
                                                                  'tf_op_layer_streaming/gru_1/cel
                                                                 l/split_1[0][1]']                
                                                                                                  
 gru_1 (GRU)                    (1, 1, 256)          394752      ['reshape[0][0]']                
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/Sigmoid_1 (TensorFlowOpLaye                                  l/add_1[0][0]']                  
 r)                                                                                               
                                                                                                  
 stream_2 (Stream)              (1, 256)             0           ['gru_1[0][0]']                  
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/add (TensorFlowOpLayer)                                      l/split[0][0]',                  
                                                                  'tf_op_layer_streaming/gru_1/cel
                                                                 l/split_1[0][0]']                
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/mul (TensorFlowOpLayer)                                      l/Sigmoid_1[0][0]',              
                                                                  'tf_op_layer_streaming/gru_1/cel
                                                                 l/split_1[0][2]']                
                                                                                                  
 dropout (Dropout)              (1, 256)             0           ['stream_2[0][0]']               
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/Sigmoid (TensorFlowOpLayer)                                  l/add[0][0]']                    
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/add_2 (TensorFlowOpLayer)                                    l/split[0][2]',                  
                                                                  'tf_op_layer_streaming/gru_1/cel
                                                                 l/mul[0][0]']                    
                                                                                                  
 dense (Dense)                  (1, 128)             32896       ['dropout[0][0]']                
                                                                                                  
 data_frame_1input_state (Input  [(1, 640)]          0           []                               
 Layer)                                                                                           
                                                                                                  
 lambda_8 (Lambda)              (1, 320)             0           ['input_audio[0][0]']            
                                                                                                  
 stream/ExternalState (InputLay  [(1, 3, 20, 1)]     0           []                               
 er)                                                                                              
                                                                                                  
 stream_1/ExternalState (InputL  [(1, 5, 18, 16)]    0           []                               
 ayer)                                                                                            
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/sub (TensorFlowOpLayer)                                      l/Sigmoid[0][0]']                
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/Tanh (TensorFlowOpLayer)                                     l/add_2[0][0]']                  
                                                                                                  
 stream_2/ExternalState (InputL  [(1, 1, 256)]       0           []                               
 ayer)                                                                                            
                                                                                                  
 dense_1 (Dense)                (1, 256)             33024       ['dense[0][0]']                  
                                                                                                  
 tf_op_layer_streaming/speech_f  [(1, 320)]          0           ['data_frame_1input_state[0][0]']
 eatures/data_frame_1/strided_s                                                                   
 lice (TensorFlowOpLayer)                                                                         
                                                                                                  
 lambda_7 (Lambda)              (1, 320)             0           ['lambda_8[0][0]']               
                                                                                                  
 tf_op_layer_streaming/stream/s  [(1, 2, 20, 1)]     0           ['stream/ExternalState[0][0]']   
 trided_slice (TensorFlowOpLaye                                                                   
 r)                                                                                               
                                                                                                  
 tf_op_layer_streaming/stream_1  [(1, 4, 18, 16)]    0           ['stream_1/ExternalState[0][0]'] 
 /strided_slice (TensorFlowOpLa                                                                   
 yer)                                                                                             
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/mul_1 (TensorFlowOpLayer)                                    l/Sigmoid[0][0]',                
                                                                  'gru_1input_state[0][0]']       
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/mul_2 (TensorFlowOpLayer)                                    l/sub[0][0]',                    
                                                                  'tf_op_layer_streaming/gru_1/cel
                                                                 l/Tanh[0][0]']                   
                                                                                                  
 tf_op_layer_streaming/stream_2  [(1, 0, 256)]       0           ['stream_2/ExternalState[0][0]'] 
 /strided_slice (TensorFlowOpLa                                                                   
 yer)                                                                                             
                                                                                                  
 dense_2 (Dense)                (1, 7)               1799        ['dense_1[0][0]']                
                                                                                                  
 tf_op_layer_streaming/speech_f  [(1, 640)]          0           ['tf_op_layer_streaming/speech_fe
 eatures/data_frame_1/concat (T                                  atures/data_frame_1/strided_slice
 ensorFlowOpLayer)                                               [0][0]',                         
                                                                  'lambda_7[0][0]']               
                                                                                                  
 tf_op_layer_streaming/stream/c  [(1, 3, 20, 1)]     0           ['tf_op_layer_streaming/stream/st
 oncat (TensorFlowOpLayer)                                       rided_slice[0][0]',              
                                                                  'tf_op_layer_ExpandDims[0][0]'] 
                                                                                                  
 tf_op_layer_streaming/stream_1  [(1, 5, 18, 16)]    0           ['tf_op_layer_streaming/stream_1/
 /concat (TensorFlowOpLayer)                                     strided_slice[0][0]',            
                                                                  'stream[0][0]']                 
                                                                                                  
 tf_op_layer_streaming/gru_1/ce  [(1, 256)]          0           ['tf_op_layer_streaming/gru_1/cel
 ll/add_3 (TensorFlowOpLayer)                                    l/mul_1[0][0]',                  
                                                                  'tf_op_layer_streaming/gru_1/cel
                                                                 l/mul_2[0][0]']                  
                                                                                                  
 tf_op_layer_streaming/stream_2  [(1, 1, 256)]       0           ['tf_op_layer_streaming/stream_2/
 /concat (TensorFlowOpLayer)                                     strided_slice[0][0]',            
                                                                  'gru_1[0][0]']                  
                                                                                                  
==================================================================================================
Total params: 466,487
Trainable params: 466,487
Non-trainable params: 0

A rough guide to load is the number of params: 466,487 is much more than, say, the 30k of a BC-ResNet-2, but, as in the relatively old Arm ML KWS report, still much lighter than a pure GRU.

I use an already-built framework: google-research/kws_streaming at master · google-research/google-research · GitHub (Apache 2.0, open source).



Why create your own when the biggest names in tech are doing it for us and releasing opensource?

The above is an ultra-sexy, accurate KWS, but it's a transformer, so it's huge to train…

I ran GoogleKWS with these flags, but much of the result is my custom dataset, as by far the biggest effect on accuracy is the dataset, whilst modern classification models all grasp for fractions of a percent, currently around the 97% level.

Namespace(data_url='', data_dir='/home/stuart/GoogleKWS/data2/', lr_schedule='exp', optimizer='adam', background_volume=0.1, l2_weight_decay=0.0, background_frequency=0.0, split_data=0, silence_percentage=10.0, unknown_percentage=10.0, time_shift_ms=100.0, sp_time_shift_ms=0.0, testing_percentage=10, validation_percentage=10, how_many_training_steps='20000,20000,20000,20000', eval_step_interval=400, learning_rate='0.001,0.0005,0.0001,0.00002', batch_size=100, wanted_words='heymarvin,noise,unk,h1m1,h2m1,h1m2,h2m2', train_dir='/home/stuart/GoogleKWS/models2/crnn/', save_step_interval=100, start_checkpoint='', verbosity=0, optimizer_epsilon=1e-08, resample=0.0, sp_resample=0.0, volume_resample=0.0, train=1, sample_rate=16000, clip_duration_ms=1000, window_size_ms=40.0, window_stride_ms=20.0, preprocess='raw', feature_type='mfcc_op', preemph=0.0, window_type='hann', mel_lower_edge_hertz=20.0, mel_upper_edge_hertz=7600.0, micro_enable_pcan=1, micro_features_scale=0.0390625, micro_min_signal_remaining=0.05, micro_out_scale=1, log_epsilon=1e-12, dct_num_features=20, use_tf_fft=0, mel_non_zero_only=1, fft_magnitude_squared=True, mel_num_bins=40, use_spec_augment=1, time_masks_number=2, time_mask_max_size=10, frequency_masks_number=2, frequency_mask_max_size=5, use_spec_cutout=0, spec_cutout_masks_number=3, spec_cutout_time_mask_size=10, spec_cutout_frequency_mask_size=5, return_softmax=0, novograd_beta_1=0.95, novograd_beta_2=0.5, novograd_weight_decay=0.001, novograd_grad_averaging=0, pick_deterministically=0, causal_data_frame_padding=0, wav=1, quantize=0, use_quantize_nbit=0, nbit_activation_bits=8, nbit_weight_bits=8, data_stride=1, restore_checkpoint=0, model_name='crnn', cnn_filters='16,16', cnn_kernel_size='(3,3),(5,3)', cnn_act="'relu','relu'", cnn_dilation_rate='(1,1),(1,1)', cnn_strides='(1,1),(1,1)', gru_units='256', return_sequences='0', stateful=0, dropout1=0.1, units1='128,256', act1="'linear','relu'", label_count=7, desired_samples=16000, window_size_samples=640, window_stride_samples=320, spectrogram_length=49, data_frame_padding=None, summaries_dir='/home/stuart/GoogleKWS/models2/crnn/logs/', training=True)

These resources are great, and not ones that I found when I started building my model; it will take me a bit to digest them.

I built my own because I wanted to understand how it works and be able to tweak it. I tried so many broken examples on Github that I just decided to dig in and do it myself.

I am doing stuff at the moment, so ask away; my MS means that in six months much will be forgotten, so ask while it's still fresh.

The dataset is what matters, and so that you have an exact reference, it's important to also have the same one:
https://drive.google.com/file/d/1dreV5fBIwzdcJnXEueYwc4NeWCyufdS-/view?usp=share_link

I think, from training, I am actually a tad short, by maybe another 50-70% of data, as training accuracy is quite a bit above validation. Not too bad, but it is showing slight signs of over-fitting.
When training you have to take the metrics with a grain of salt, as we don't know how narrow the classification is.
An analogy would be to sit next to a swimming pool and, with 100% accuracy, hit the pool with a tennis ball; but is that as accurate as hitting a dartboard bullseye at the same distance?
In a standard binary classification, the choice of data can give very little cross-entropy and two huge pools, so 100% training accuracy is hard not to get, whilst in use as a KWS it will still be steaming :poop:

There is nothing clever about a classification model: it classifies based on the classes you give it.
Hence why I use far more classifications, wanted_words='heymarvin,noise,unk,h1m1,h2m1,h1m2,h2m2': index[0] heymarvin is the KW; index[1] noise (or call it voice_silence, as it's the classification of no voice); index[2] unk is unknown voice, containing similar-syllable words that are phonetically different to the KW; and h1m2 (Hey1 Marvin2) and the other combinations are phonetically similar Hey and Marvin pairings that don't contain 'Hey Marvin', there to deliberately create cross-entropy with the KW and make that swimming pool small, whilst being unknown subsets not contained in unk.
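To make the index mapping concrete, this is roughly how the seven-class output turns into a decision; a sketch reusing softmax_stable and output_data from the runner above, with the label order simply assumed to match wanted_words / labels.txt:

import numpy as np

# label order assumed to follow wanted_words / labels.txt from the post
labels = ['heymarvin', 'noise', 'unk', 'h1m1', 'h2m1', 'h1m2', 'h2m2']

probs = softmax_stable(output_data[0])   # seven class probabilities for this frame
top = int(np.argmax(probs))
if labels[top] == 'heymarvin' and probs[top] > 0.95:
    print('KW hit:', probs[top])
# everything else (noise, unk and the near-miss hey/marvin combinations)
# exists purely to soak up probability and is simply ignored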

Most problems are due to volume level, and there are two issues: mics are often not very sensitive and AGC is not enabled automatically on some of them, and also, when we speak, it's often not realised how much we stress the opening phone compared to the rest of a sentence, by a large margin in amplitude. The 'H' of 'Hey' can be 4x or more the amplitude of the rest of the word.
A print(lvl) in the above code should show this, and you can watch and add debug as you go.
Also, on a KW hit, if you use soundfile:

sf.write('kw.wav', rec_rbuff, 16000) lets you have a look at (and listen to) the captured buffer as a WAV file afterwards.
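For example, something like this inside sd_callback; a sketch only, since lvl, out_softmax, gain and rec_rbuff already exist in the script above and you would add import soundfile as sf at the top:

    # watch the input level and the raw class probabilities as you speak
    print(f"lvl={lvl:.3f}  kw={out_softmax[0]:.2f}  vad={out_softmax[1]:.2f}")
    if lvl > 0.95:
        print("input is close to clipping - try a lower gain")
    # on a KW hit, dump the last second of audio so you can listen back to what the model saw
    sf.write('kw_debug.wav', rec_rbuff[:16000], 16000)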

Python is great for RAD, research or a hobby, but hopefully someone will create TFLite and ONNX runners in Rust or C/C++, as this DSP is often repeated and the focus platform is embedded; the code here is purely a mix of hobby and research.

@shellcode
You could test against a WAV dataset with an adaptation of this code, as it doesn't have to be fed by the mic:

import tensorflow as tf
import soundfile as sf
import numpy as np
import glob
import os
from playsound import playsound
import sys,tty,termios

def getkey():
    old_settings = termios.tcgetattr(sys.stdin)
    tty.setcbreak(sys.stdin.fileno())
    try:
        while True:
            b = os.read(sys.stdin.fileno(), 3).decode()
            if len(b) == 3:
                k = ord(b[2])
            else:
                k = ord(b)
            key_mapping = {
                127: 'backspace',
                10: 'return',
                32: 'space',
                9: 'tab',
                27: 'esc',
                65: 'up',
                66: 'down',
                67: 'right',
                68: 'left'
            }
            return key_mapping.get(k, chr(k))
    finally:
        termios.tcsetattr(sys.stdin, termios.TCSADRAIN, old_settings)

def kw_detect(rec, sample_rate ,duration):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    return output_data[0][1]

        
# Parameters
duration = 0.020
sample_rate = 16000
num_channels = 1
kw_path = "../ProjectEars/dataset/trim-combine/h1"

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn-quant/tflite_non_stream/stream_state_external.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    
start = 0
count = 0  
kw_files = glob.glob(os.path.join(kw_path, '*.wav'))
if len(kw_files) == 0:
  print('No files found')
  
for kw_wav in kw_files:
  key_quit = False
  frame = 0
  found_start = False
  found_end = False
  data, samplerate = sf.read(kw_wav, dtype='float32')
  print(kw_wav, data.shape)
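  # step through the file in 320-sample (20 ms) hops, feeding the streaming model frame by frame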
  while frame < 100:
    start = 320 * frame
    rec = data[start:start + 320]
    #print(rec.shape, start)
    if len(rec) < 320:
      break
    kw_prob = kw_detect(rec, sample_rate ,duration)
    if kw_prob > 0.01: #and found_start == False:
      print(kw_wav, kw_prob, frame)
      playsound(kw_wav)
      try:
        while True:
          k = getkey()
          if k == 'd':
            #os.remove(kw_wav)
            key_quit = True
            break
          else:
            print(k)
            key_quit = True
            break
      except (KeyboardInterrupt, SystemExit):
        os.system('stty sane')
      print('stopping.')
    if key_quit == True:
      break
    frame += 1
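  # feed 100 frames of silence to flush the model's internal streaming state before the next file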
  for s in range(100):
    rec = np.zeros(320, dtype=np.float32)
    kw_prob = kw_detect(rec, sample_rate ,duration)

  count += 1
print(count)

And I noticed I was recording backwards :slight_smile: so try this instead:

import tensorflow as tf
import sounddevice as sd
import soundfile as sf
import numpy as np
import threading
import uuid

def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())
      
def sd_callback(rec, frames, time, status):
    global gain, max_rec, kw_hit, kw_hit_rbuff, vad_hit_rbuff, rec_rbuff, rec_samples, rec_out, kw_samples
    # Notify if errors
    if status:
        print('Error:', status)
    
    rec = np.reshape(rec, (1, rec_samples))
    rec = np.multiply(rec, gain)
    
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    lvl = np.max(np.abs(rec))
    if lvl > max_rec:
      max_rec = lvl
      
    rec_rbuff = np.roll(rec_rbuff, -rec_samples)
    rec_rbuff[len(rec_rbuff) - rec_samples:len(rec_rbuff)] = rec
    
    out_softmax = softmax_stable(output_data[0])
         
    kw_hit_rbuff = np.roll(kw_hit_rbuff, 1)
    kw_hit_rbuff[0] = out_softmax[0]
    kw_prob = np.mean(kw_hit_rbuff)
    
    if out_softmax[0] > 0.95:
      vad_hit_rbuff = np.multiply(vad_hit_rbuff, 0)
        
    vad_hit_rbuff = np.roll(vad_hit_rbuff, 1)
    vad_hit_rbuff[0] = out_softmax[1]
    vad_prob = np.mean(vad_hit_rbuff)
       
    if vad_prob > 0.95:
      print("Vad:", vad_prob, kw_prob, lvl)
      kw_hit = False
      kw_hit_rbuff = np.multiply(kw_hit_rbuff, 0)
      max_rec = 0.0
      kw_count = 0
      kw_prob = 0
      rec_out = False 

    if kw_prob > 0.90:
      if kw_hit == False:
        print("Marvin:", kw_prob, lvl)
        print(kw_hit_rbuff)
        kw_hit = True
        if rec_out == False:
          sf.write(uuid.uuid4().hex + '.wav', rec_rbuff[0:16000], 16000)
          rec_out = True
        
             
# Parameters
kw_duration = 1.0
kw_latency = 0.6
rec_duration = 0.020
vad_duration = 0.20
sample_rate = 16000

rec_samples = int((sample_rate * kw_duration) * rec_duration)
vad_hit_samples = int((sample_rate * kw_duration) / ((sample_rate * kw_duration) * vad_duration))
kw_hit_samples = int((sample_rate * kw_duration) / ((sample_rate * kw_duration) * rec_duration))
kw_samples = int(sample_rate * kw_duration)
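# kw_latency adds 0.6 s of headroom to the audio ring buffer so the whole KW is still inside it when the hit fires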
latency_samples = int(kw_duration * ( kw_latency / rec_duration)) * rec_samples

num_channels = 1
gain = 5
max_rec = 0.0
kw_hit = False
rec_out = False
kw_hit_rbuff = np.zeros(kw_hit_samples, dtype=np.float32)
vad_hit_rbuff = np.zeros(vad_hit_samples, dtype=np.float32)
rec_rbuff = np.zeros(kw_samples + latency_samples, dtype=np.float32)

sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors, really should be static copies to use as KW resets
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    

# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(rec_samples),
                    callback=sd_callback):
    threading.Event().wait()

It sort of works back to front, as it would act as a cancel signal if there is no hit, since the first-detection latency is extremely low; but until you have analysed it all, you never know.
It does give you direct WAVs, so you can actually see and hear what you are getting.

Again, much of it is the dataset; I forgot to mention that the dataset is also mixed with, I think, -10.6 dB noise (my memory, lol), but because of the dataset, not the model, it should be more resilient to small levels of noise.
Really it needs a head-to-head on false positives/negatives against LibriSpeech clean, and against LibriSpeech modified with -10.6 dB noise (dirty).
Still a first-attempt dataset, though, so it will get some optimisations.

For google-kws I just create some source files and source them on the CLI.

So if using a venv, source venv/bin/activate,
then the setup.text to get the paths:

KWS_PATH=$PWD
DATA_PATH=$KWS_PATH/data2
MODELS_PATH=$KWS_PATH/models2
CMD_TRAIN="python -m kws_streaming.train.model_train_eval"

Then the model type, which was run with the dataset linked above to produce the models provided:

$CMD_TRAIN \
--data_url '' \
--data_dir $DATA_PATH/ \
--train_dir $MODELS_PATH/crnn/ \
--split_data 0 \
--wanted_words 'heymarvin,noise,unk,h1m1,h2m1,h1m2,h2m2' \
--mel_upper_edge_hertz 7600 \
--how_many_training_steps 20000,20000,20000,20000 \
--learning_rate 0.001,0.0005,0.0001,0.00002 \
--window_size_ms 40.0 \
--window_stride_ms 20.0 \
--mel_num_bins 40 \
--dct_num_features 20 \
--resample 0.0 \
--background_frequency 0.0 \
--alsologtostderr \
--train 1 \
--lr_schedule 'exp' \
--use_spec_augment 1 \
--time_masks_number 2 \
--time_mask_max_size 10 \
--frequency_masks_number 2 \
--frequency_mask_max_size 5 \
--feature_type 'mfcc_op' \
--fft_magnitude_squared 1 \
crnn \
--cnn_filters '16,16' \
--cnn_kernel_size '(3,3),(5,3)' \
--cnn_act "'relu','relu'" \
--cnn_dilation_rate '(1,1),(1,1)' \
--cnn_strides '(1,1),(1,1)' \
--gru_units 256 \
--return_sequences 0 \
--dropout1 0.1 \
--units1 '128,256' \
--act1 "'linear','relu'" \
--stateful 0

PS: if you are running any training you are going to have to go old school, as GoogleKWS is aimed at benchmarking accuracy and has no early-stopping method included.
Whilst it's training, open another CLI with the same venv and run tensorboard dev upload --logdir logs (with the right logs dir) and you will be able to watch the training vs validation graph on TensorBoard.dev.
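If you would rather not upload, pointing a plain local TensorBoard at the same logs dir shows the same curves, e.g. tensorboard --logdir /home/stuart/GoogleKWS/models2/crnn/logs/ (the summaries_dir from the flags above).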

I am still getting my memory up to speed on the right params and dataset, but the basics of underfitting (training accuracy is low) and overfitting (training accuracy is quite a bit higher than validation) are quite simple concepts.
The first, not enough training and underfitting a model, where both training and validation accuracy are low, is fairly obvious.
Overfitting is less so, but it's just when you have trained too much for your dataset quantity, and rather than being a guiding fit your dataset has become an exact fit; hence your own validation accuracy will likely be much lower and not correlate with training accuracy.
The GoogleKWS framework keeps the best-validation weights only, even if training has gone far past the point of overfitting.
All the weights are saved, though, so you can copy over whatever you consider the best weights.

I probably won't use GoogleKWS, upon review of the code; I have a working model that does what I need with 98% accuracy.

Cool; whatever is fit for purpose for you. I am still trying to get my memory up to scratch, as all my ring buffers were going the wrong direction on inspection, which happens with me if I stop doing something for a while.

I have finally cracked the noise problem for KWS on Arm, but I don't know what it will run on yet, as I'm not sure of the load.

Give DeepFilterNet a go; if it will run, its results are RTX Voice-level noise removal for Linux.
/etc/asound.conf:

pcm.ladspa {
    type ladspa
    slave.pcm "agcin";
    path "/usr/lib/ladspa";
    plugins [{
        label deep_filter_mono
        input {
            control [ "Attenuation Limit (dB)" 28 ]
        }
    }]
}

pcm.deepfilter {
    type=plug
    slave = {
    pcm "ladspa"
    channels 1
    rate 48000}
}

pcm.agcin {
 type speex
 slave.pcm "plughw:1,0"
 agc 1
 agc_level 8000
 denoise no
 dereverb no
}

pcm.agcout {
 type speex
 slave.pcm "deepfilter"
 agc 1
 agc_level 8000
 denoise no
 dereverb no
}
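If the plugins are actually present, a quick sanity check of the whole chain (assuming your mic really is card 1, as in plughw:1,0 above) is arecord -D agcout -f S16_LE -r 16000 -c 1 -d 5 chain_test.wav and then listening back with aplay chain_test.wav.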

Also, I have the speex AGC loaded as it's more tuned for voice; but to my amazement, even though it's decades old, speex is still shipped as an rc1 whilst alsa-plugins looks for the release version, so the plugins don't get installed.

That explains it, but aplay --version will tell you what userspace version you are using.

I think I will stick with GoogleKWS; the code can be quite daunting at first, but the ability to create a dataset and then train a whole range of models from it makes it indispensable for me, even if it has a bit of a steep learning curve.

I have recreated a bigger dataset and got some of the params current this time, and things are looking exceptionally good, even when I remember to use the crnn-state model and not crnn.

Thanks to DeepFilterNet I am sat here with third-party music blasting and it is happily recognising the KW :slight_smile: which I am delighted about.

GoogleKWS was created as a research project to test many different models/options/configurations, and it is overly complicated. My setup is simple and straightforward. Once my prototype matures (as time permits) I'll post it to GitHub.

It has a sample-generation mode and works fairly well on my Raspberry Pi; I optimized it to work with TensorFlow GPU (so I can train on a laptop with a faster processor and then move to an endpoint).

That is why I use it, as creating the dataset is the all-important and time-consuming part.
Once the dataset is created, I can then create many different models/options/configs from the same framework and dataset.

Whatever works for you, but it's great to have someone creating models, as they are the heart of any KWS; in fact, they are the KWS.
I have almost got things right as I start to remember, but I still haven't decided on the kw_hit code yet; when it's done we shall have to have a head-to-head on false positives/negatives, as I am hacking something together to test.

I got round to running DeepFilterNet on a Pi Zero 2 W, and yeah, it's just too much for a single core, and the Tract ML framework doesn't support threading; but a bit like a transformer it has an encoder and a decoder that could probably be split across two threads, as it is just absolutely awesome.