Suggestions for Dutch wake word detection for newbie

I am doing stuff at the moment, so ask away; with my MS, in six months much of this will be forgotten, so ask while it is still fresh.

The dataset is what matters, and so that you have an exact reference it is important to also have the same one:
https://drive.google.com/file/d/1dreV5fBIwzdcJnXEueYwc4NeWCyufdS-/view?usp=share_link

I think from training I am actually a tad short, by maybe another 50-70% of data, as training accuracy is quite a bit above validation. Not too bad, but it is showing slight signs of over-fitting.
When training you have to take the metrics with a grain of salt as we don’t know how narrow the classification is.
An analogy would be to sit next to a swimming pool and with 100% accuracy hit the pool with a tennis ball, but is that as accurate as hitting a dartboard bullseye at the same distance.
In a standard binary classification, the choice of data can give very little cross entropy: with two huge, easily separated pools, 100% training accuracy is hard not to get, whilst in use as a KWS it will still be steaming :poop:

There is nothing clever about a classification model as it classifies based on the classes you give it.
Hence why I use far more classifications: `wanted_words='heymarvin,noise,unk,h1m1,h2m1,h1m2,h2m2'`. Index[0] heymarvin is the KW; Index[1] noise (or call it voice_silence) is a classification of no voice; Index[2] unk is unknown voice, containing similar-syllable words that are phonetically different to the KW; the h1m1/h2m1/h1m2/h2m2 classes (Hey1Marvin2 and so on) are phonetically similar Hey & Marvin combinations that do not contain "Hey Marvin", there to deliberately create cross entropy with the KW and make that swimming pool small, whilst being unknown subsets not contained in unk.
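As a minimal sketch of how those classes come back at inference (assuming the label order above, so index 0 is the KW, and reusing the softmax_stable helper from the scripts below):

import numpy as np

labels = ['heymarvin', 'noise', 'unk', 'h1m1', 'h2m1', 'h1m2', 'h2m2']

def softmax_stable(x):
    return np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum()

def interpret(output_data, threshold=0.95):
    # output_data is the raw model output row; only index 0 is ever treated
    # as a hit, the other classes just soak up the near-misses that would
    # otherwise be forced into the KW class
    probs = softmax_stable(output_data[0])
    top = int(np.argmax(probs))
    if top == 0 and probs[0] > threshold:
        return 'hit', probs[0]
    return labels[top], probs[top]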

Most problems are due to volume level, and there are two issues: mics are often not very sensitive and AGC is not enabled automatically on some, and when we do speak it is often not realised how much we stress the opening phone relative to the rest of a sentence, often by a large factor in amplitude. The 'H' of Hey can be 4x or more the amplitude of the rest of the word.
A print(lvl) in the above code should show this, and you can watch and add debug as you go.
Also, on a KW hit, if you use soundfile,

sf.write('kw.wav', rec_rbuff, 16000)

you can listen to the capture as a wav file afterwards.
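As a rough sketch of the kind of level check meant here (reusing the gain, lvl and rec_rbuff names from the mic script further down; a sketch, not a drop-in function):

import numpy as np
import soundfile as sf

gain = 5  # crude software gain for a quiet mic with no AGC

def level_debug(rec, rec_rbuff, kw_hit):
    rec = np.multiply(rec, gain)
    lvl = np.max(np.abs(rec))   # peak of this 20 ms chunk
    print(lvl)                  # the opening 'H' of "Hey" spikes well above the rest of the word
    if kw_hit:
        # dump the rolling capture buffer so you can hear what the model actually saw
        sf.write('kw.wav', rec_rbuff, 16000)
    return rec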

Python is great for RAD, research or a hobby, but hopefully someone will create TFLite & ONNX runners in either Rust or C/C++, as this DSP is often repeated and the focus platform is embedded; the code here is purely a mix of hobby & research.

@shellcode
You could test against a wav dataset with an adaptation of this code, as it doesn't have to be fed by mic:

import tensorflow as tf
import soundfile as sf
import numpy as np
import glob
import os
from playsound import playsound
import sys,tty,termios

def getkey():
    old_settings = termios.tcgetattr(sys.stdin)
    tty.setcbreak(sys.stdin.fileno())
    try:
        while True:
            b = os.read(sys.stdin.fileno(), 3).decode()
            if len(b) == 3:
                k = ord(b[2])
            else:
                k = ord(b)
            key_mapping = {
                127: 'backspace',
                10: 'return',
                32: 'space',
                9: 'tab',
                27: 'esc',
                65: 'up',
                66: 'down',
                67: 'right',
                68: 'left'
            }
            return key_mapping.get(k, chr(k))
    finally:
        termios.tcsetattr(sys.stdin, termios.TCSADRAIN, old_settings)

def kw_detect(rec, sample_rate ,duration):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    return output_data[0][1]

        
# Parameters
duration = 0.020
sample_rate = 16000
num_channels = 1
kw_path = "../ProjectEars/dataset/trim-combine/h1"

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn-quant/tflite_non_stream/stream_state_external.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    
start = 0
count = 0  
kw_files = glob.glob(os.path.join(kw_path, '*.wav'))
if len(kw_files) == 0:
  print('No files found')
  
for kw_wav in kw_files:
  key_quit = False
  frame = 0
  found_start = False
  found_end = False
  data, samplerate = sf.read(kw_wav, dtype='float32')
  print(kw_wav, data.shape)
  while frame < 100:  # step through up to 2 s of the file in 20 ms (320-sample) chunks
    start = 320 * frame
    rec = data[start:start + 320]
    #print(rec.shape, start)
    if len(rec) < 320:
      break
    kw_prob = kw_detect(rec, sample_rate ,duration)
    if kw_prob > 0.01: #and found_start == False:
      print(kw_wav, kw_prob, frame)
      playsound(kw_wav)
      try:
        while True:
          k = getkey()
          if k == 'd':
            #os.remove(kw_wav)
            key_quit = True
            break
          else:
            print(k)
            key_quit = True
            break
      except (KeyboardInterrupt, SystemExit):
        os.system('stty sane')
      print('stopping.')
    if key_quit == True:
      break
    frame += 1
  for s in range(100):  # flush the streaming state with 2 s of silence between files
    rec = np.zeros(320, dtype=np.float32)
    kw_prob = kw_detect(rec, sample_rate ,duration)

  count += 1
print(count)

And I noticed I was recording backwards :slight_smile: so try this instead:

import tensorflow as tf
import sounddevice as sd
import soundfile as sf
import numpy as np
import threading
import uuid

def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())
      
def sd_callback(rec, frames, time, status):
    global gain, max_rec, kw_hit, kw_hit_rbuff, vad_hit_rbuff, rec_rbuff, rec_samples, rec_out, kw_samples
    # Notify if errors
    if status:
        print('Error:', status)
    
    rec = np.reshape(rec, (1, rec_samples))
    rec = np.multiply(rec, gain)
    
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    lvl = np.max(np.abs(rec))
    if lvl > max_rec:
      max_rec = lvl
      
    rec_rbuff = np.roll(rec_rbuff, -rec_samples)
    rec_rbuff[len(rec_rbuff) - rec_samples:len(rec_rbuff)] = rec
    
    out_softmax = softmax_stable(output_data[0])
         
    kw_hit_rbuff = np.roll(kw_hit_rbuff, 1)
    kw_hit_rbuff[0] = out_softmax[0]
    kw_prob = np.mean(kw_hit_rbuff)
    
    if out_softmax[0] > 0.95:
      vad_hit_rbuff = np.multiply(vad_hit_rbuff, 0)
        
    vad_hit_rbuff = np.roll(vad_hit_rbuff, 1)
    vad_hit_rbuff[0] = out_softmax[1]
    vad_prob = np.mean(vad_hit_rbuff)
       
    if vad_prob > 0.95:
      print("Vad:", vad_prob, kw_prob, lvl)
      kw_hit = False
      kw_hit_rbuff = np.multiply(kw_hit_rbuff, 0)
      max_rec = 0.0
      kw_count = 0
      kw_prob = 0
      rec_out = False 

    if kw_prob > 0.90:
      if kw_hit == False:
        print("Marvin:", kw_prob, lvl)
        print(kw_hit_rbuff)
        kw_hit = True
        if rec_out == False:
          sf.write(uuid.uuid4().hex + '.wav', rec_rbuff[0:16000], 16000)
          rec_out = True
        
             
# Parameters
kw_duration = 1.0
kw_latency = 0.6
rec_duration = 0.020
vad_duration = 0.20
sample_rate = 16000

rec_samples = int((sample_rate * kw_duration) * rec_duration)  # 320 samples = one 20 ms chunk at 16 kHz
vad_hit_samples = int((sample_rate * kw_duration) / ((sample_rate * kw_duration) * vad_duration))  # 5-slot ring buffer (0.2 s steps over 1 s)
kw_hit_samples = int((sample_rate * kw_duration) / ((sample_rate * kw_duration) * rec_duration))  # 50-slot ring buffer (20 ms steps over 1 s)
kw_samples = int(sample_rate * kw_duration)  # 16000 samples = 1 s KW window
latency_samples = int(kw_duration * ( kw_latency / rec_duration)) * rec_samples  # extra 0.6 s of capture in the ring buffer

num_channels = 1
gain = 5
max_rec = 0.0
kw_hit = False
rec_out = False
kw_hit_rbuff = np.zeros(kw_hit_samples, dtype=np.float32)
vad_hit_rbuff = np.zeros(vad_hit_samples, dtype=np.float32)
rec_rbuff = np.zeros(kw_samples + latency_samples, dtype=np.float32)

sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors, really should be static copies to use as KW resets
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    

# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(rec_samples),
                    callback=sd_callback):
    threading.Event().wait()

It is sort of working back to front, as it would act as a cancel signal when not hit, since the first detection latency is extremely low, but until you have analysed it all you never know.
It does give direct wavs, so you can actually see and hear what you are getting.

Again much of it is the dataset. I forgot to mention the dataset is also mixed with, I think, -10.6 dB noise (my memory, lol), so because of the dataset, not the model, it should be more resilient to small levels of noise.
Really it needs a head-to-head on false positives/negatives on LibriSpeech clean, and on LibriSpeech modified with -10.6 dB noise (dirty).
Still a first-attempt dataset though, so it will get some optimisations.
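I can't remember the exact recipe I used, but as a rough sketch of the idea of mixing noise in at a fixed level relative to the speech (a hypothetical helper, not the actual dataset tool):

import numpy as np
import soundfile as sf

def mix_noise(kw_wav, noise_wav, out_wav, noise_db=-10.6):
    # mix background noise into a KW clip at roughly noise_db relative to the speech RMS
    kw, sr = sf.read(kw_wav, dtype='float32')
    noise, _ = sf.read(noise_wav, dtype='float32')
    noise = np.resize(noise, kw.shape)                 # crude length match
    def rms(x):
        return np.sqrt(np.mean(np.square(x)) + 1e-9)
    target = rms(kw) * (10.0 ** (noise_db / 20.0))     # e.g. -10.6 dB below the speech
    mixed = kw + noise * (target / rms(noise))
    peak = np.max(np.abs(mixed))
    if peak > 1.0:                                     # avoid clipping on write
        mixed = mixed / peak
    sf.write(out_wav, mixed, sr)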

For google-kws I just create some source files and `source <filename>` on the CLI.

So if using a venv, `source venv/bin/activate`,
then source the setup file to get the paths:

KWS_PATH=$PWD
DATA_PATH=$KWS_PATH/data2
MODELS_PATH=$KWS_PATH/models2
CMD_TRAIN="python -m kws_streaming.train.model_train_eval"

Then the model type, which for the above dataset link and the models provided was:

$CMD_TRAIN \
--data_url '' \
--data_dir $DATA_PATH/ \
--train_dir $MODELS_PATH/crnn/ \
--split_data 0 \
--wanted_words 'heymarvin,noise,unk,h1m1,h2m1,h1m2,h2m2' \
--mel_upper_edge_hertz 7600 \
--how_many_training_steps 20000,20000,20000,20000 \
--learning_rate 0.001,0.0005,0.0001,0.00002 \
--window_size_ms 40.0 \
--window_stride_ms 20.0 \
--mel_num_bins 40 \
--dct_num_features 20 \
--resample 0.0 \
--background_frequency 0.0 \
--alsologtostderr \
--train 1 \
--lr_schedule 'exp' \
--use_spec_augment 1 \
--time_masks_number 2 \
--time_mask_max_size 10 \
--frequency_masks_number 2 \
--frequency_mask_max_size 5 \
--feature_type 'mfcc_op' \
--fft_magnitude_squared 1 \
crnn \
--cnn_filters '16,16' \
--cnn_kernel_size '(3,3),(5,3)' \
--cnn_act "'relu','relu'" \
--cnn_dilation_rate '(1,1),(1,1)' \
--cnn_strides '(1,1),(1,1)' \
--gru_units 256 \
--return_sequences 0 \
--dropout1 0.1 \
--units1 '128,256' \
--act1 "'linear','relu'" \
--stateful 0

PS if you are running any training, you are going to have to go old school, as GoogleKWS is benchmarking accuracy and has no early-stopping method included.
Whilst it is training, open another CLI with the same venv and run `tensorboard dev upload --logdir logs` with the right logs dir, and you will be able to watch the training vs validation graph on TensorBoard.dev.

I am still getting my memory up to speed on the right params and dataset, but the basics of underfitting (training accuracy is low) and overfitting (training accuracy is quite a bit higher than validation) are quite simple concepts.
The first, not enough training and underfitting a model, where both training and validation accuracy are low, is fairly obvious.
Overfitting is less so, but it is just when you have trained too much for your dataset quantity: rather than being a guiding fit, your dataset has become an exact fit, hence why your validation accuracy will likely be much lower and not correlate with training accuracy.
The GoogleKWS framework keeps the best-validation weights only, even if training has gone far past the point of overfitting.
All the weights are saved, so you can copy over whatever you consider the best weights.
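A throwaway way to put that rule of thumb into code (the 0.9 and 0.05 cut-offs are just my own illustrative numbers, nothing from the framework):

def fit_diagnosis(train_acc, val_acc):
    # both low -> the model hasn't learned the data yet
    if train_acc < 0.9 and val_acc < 0.9:
        return 'underfitting: train longer, or grow the model/dataset'
    # training well above validation -> the dataset has become an exact fit
    if train_acc - val_acc > 0.05:
        return 'overfitting: more/better data, more augmentation, or stop earlier'
    return 'reasonable fit'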

I probably won’t use googleKWS upon review of the code; I have a working model that does what I need with 98% accuracy.

Cool, whatever is fit for purpose for you. I am still trying to get up to scratch with my memory, as all my ring buffers were going the wrong direction on inspection, which happens to me if I stop doing something for a while.

I have finally cracked the noise problem for KWS on Arm, but I don't know what it will run on yet, as I am not sure of the load.

Give DeepFilterNet a go; if it will run, its results are at RTX-noise-removal level on Linux.
/etc/asound.conf:

pcm.ladspa {
    type ladspa
    slave.pcm "agcin";
    path "/usr/lib/ladspa";
    plugins [{
        label deep_filter_mono
        input {
            control [ "Attenuation Limit (dB)" 28 ]
        }
    }]
}

pcm.deepfilter {
    type plug
    slave {
        pcm "ladspa"
        channels 1
        rate 48000
    }
}

pcm.agcin {
 type speex
 slave.pcm "plughw:1,0"
 agc 1
 agc_level 8000
 denoise no
 dereverb no
}

pcm.agcout {
 type speex
 slave.pcm "deepfilter"
 agc 1
 agc_level 8000
 denoise no
 dereverb no
}

Also I have the speex AGC loaded, as it is more tuned for voice. To my amazement, even though it is decades old, speex is still shipped as rc1 whilst alsa-plugins looks for the release version, so the speex plugins don't get installed.

That explains it, but `aplay --version` will tell you what userspace version you are using.

I think I will stick with GoogleKWS; the code can be quite daunting at first, but the ability to create a dataset and then train a whole range of models from it makes it indispensable for me, even if it has a bit of a steep learning curve.

I have recreated a bigger dataset and got some of the params current this time, and things are looking exceptionally good, even when I remember to use the crnn-state model, not crnn.

I am sat here, thanks to DeepFilterNet, with third-party music blasting and it is happily recognising the KW :slight_smile: which I am delighted about.

KWS was created as a research project to test many different models/options/configurations, and it is overly complicated. My setup is simple and straightforward. Once my prototype matures (as time permits) I'll post it to github.

It has a sample generation mode and works fairly well on my Raspberry Pi; I optimized it to work with TensorFlow GPU (so I can train on a laptop with a faster processor and then move to an endpoint).

That is why I use it, as creating the dataset is all-important and time-consuming.
Once the dataset is created I can then train many different models/options/configs from the same framework and dataset.

Whatever works for you, but it is great to have someone creating models, as they are the heart of any KWS; in fact they are the KWS.
I have almost got things right as I start to remember, but I still haven't decided on the kw_hit code yet. When it is done we shall have to have a head-to-head test on false positives/negatives, as I am hacking something together to test with.

I got round to running DeepFilterNet on a Pi Zero 2 W and yeah, it is just too much for a single core, and the Tract ML framework doesn't support threading. But, a bit like a transformer, it has an encoder and a decoder that likely could be split into two threads, as it is just absolutely awesome.

@shellcode How are you getting on with your KW?
I think I have Porcupine beat GitHub - Picovoice/wake-word-benchmark: wake word engine benchmark framework
with a 2.57% miss rate at 1 false alarm in 10 hours, like their test in the above.
Also I think I can improve on that, but training takes a day, so slow going.
Someone should actually run the test rather than take their word for it, and also maybe test Precise or the others we have.
I have been augmenting my dataset with echo and reverb and don't think it makes any recognition improvement, which is why echo and reverb are such a burning issue for recognition.
Echo and reverb just mash the signal; it becomes something else in a spectrogram, and I think trying to be clever by adding it just doesn't work. So I will drop that and likely get quite a bit more accuracy, where, take your pick, you can lower false alarms or shift it the other way and lower the miss rate.

https://drive.google.com/file/d/1BzmK37V4CV_aBkV4vcM-d7cgzbqU8K_W/view?usp=share_link
There are 1400 'heymarvin' samples with noise mixed in there, and in the tflite index[0] is the KW.
The 100 hours of LibriSpeech (train-clean-100) is available for false positives.

For the KW hit code I again stopped being clever; just `if kw_prob > 0.9997:` gave the above.

So I ingested the google kws into my project; it was a little painful as they are using tensorflow v1… :frowning:

But I wanted to play around with the attention and streaming layers they had to see if that makes a big difference. I don't have a fully trained model, as the google kws crashes during training around loop 600. Predictions seem a tad slower than my model; I don't have it quantized or converted to tflite yet, but that shouldn't be required. I tried the ds_tc_resnet.

id: 0 label: _silence_ (52.4%) time: 717.518ms count: 16
id: 0 label: _silence_ (60.8%) time: 715.485ms count: 17
id: 0 label: _silence_ (61.2%) time: 717.410ms count: 18
id: 0 label: _silence_ (60.6%) time: 718.500ms count: 19
id: 0 label: _silence_ (70.6%) time: 716.438ms count: 20
id: 0 label: _silence_ (51.3%) time: 717.889ms count: 21
id: 0 label: _silence_ (89.8%) time: 714.925ms count: 22
id: 2 label: yes (44.0%) time: 708.930ms count: 23
id: 0 label: _silence_ (61.5%) time: 717.330ms count: 24
id: 0 label: _silence_ (83.2%) time: 713.857ms count: 25
id: 2 label: yes (42.5%) time: 716.303ms count: 26
id: 0 label: _silence_ (80.2%) time: 717.322ms count: 27
id: 0 label: _silence_ (36.6%) time: 711.317ms count: 28
id: 0 label: _silence_ (51.4%) time: 715.419ms count: 29

I only trained 2 words above on the kws model with 10 epochs; so I don’t expect it to perform very well.

Here is how my model performs:

id: 2 label: noise (100.0%) time: 230.710ms count: 0
id: 2 label: noise (100.0%) time: 46.513ms count: 1
id: 2 label: noise (100.0%) time: 32.334ms count: 2
id: 2 label: noise (100.0%) time: 39.197ms count: 3
id: 2 label: noise (99.9%) time: 31.632ms count: 4
id: 2 label: noise (100.0%) time: 29.898ms count: 5
id: 2 label: noise (100.0%) time: 31.591ms count: 6
id: 2 label: noise (100.0%) time: 39.528ms count: 7
id: 2 label: noise (99.7%) time: 31.933ms count: 8
id: 2 label: noise (100.0%) time: 29.871ms count: 9
id: 2 label: noise (100.0%) time: 29.429ms count: 10
id: 1 label: hey (99.3%) time: 37.315ms count: 11
id: 2 label: noise (100.0%) time: 30.291ms count: 12
id: 2 label: noise (100.0%) time: 31.868ms count: 13
id: 2 label: noise (100.0%) time: 31.167ms count: 14
id: 1 label: hey (79.8%) time: 32.522ms count: 15
id: 1 label: hey (72.2%) time: 31.797ms count: 16
id: 2 label: noise (100.0%) time: 32.040ms count: 17
id: 2 label: noise (100.0%) time: 30.590ms count: 18
id: 2 label: noise (100.0%) time: 28.992ms count: 19
id: 2 label: noise (100.0%) time: 31.423ms count: 20
id: 2 label: noise (100.0%) time: 29.940ms count: 21

What perf do you see on your google KWS model?

Wow the tflite version is super fast:

id: 0 label: _silence_ (91.5%) time: 7.822ms count: 4
id: 0 label: _silence_ (80.6%) time: 7.499ms count: 5
id: 0 label: _silence_ (81.0%) time: 7.354ms count: 6
id: 0 label: _silence_ (74.3%) time: 7.522ms count: 7
id: 0 label: _silence_ (89.6%) time: 7.792ms count: 8
id: 0 label: _silence_ (88.3%) time: 8.068ms count: 9
id: 0 label: _silence_ (86.1%) time: 7.745ms count: 10
id: 0 label: _silence_ (59.7%) time: 7.436ms count: 11
id: 0 label: _silence_ (90.8%) time: 7.729ms count: 12
id: 0 label: _silence_ (51.5%) time: 7.428ms count: 13
id: 0 label: _silence_ (33.7%) time: 7.164ms count: 14
id: 0 label: _silence_ (69.3%) time: 7.138ms count: 15
id: 2 label: yes (41.7%) time: 7.133ms count: 16
id: 0 label: _silence_ (51.7%) time: 7.352ms count: 17
id: 0 label: _silence_ (54.6%) time: 8.099ms count: 18

It is very dependent on the model, and the number of params gives a rough guide to speed, even if some layers are more performant than others.
So far I only have a CRNN done, as I have stuck with the streaming models.

It is a 'hey marvin' model: https://drive.google.com/file/d/1bGf_b8imzPZJNYDUWR94mWuV0deSMVzM/view?usp=share_link
Index[0] is the 'heymarvin' KW.
Not really sure about timing, but on my PC it is about 0.014763929 seconds per 1 s of KW, though that doesn't mean all that much, as it is more a measurement of my PC.
There is a benchmark tool (Performance measurement | TensorFlow Lite), I have just never used it.
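If you want a rough per-second-of-audio number on your own box without the benchmark tool, a quick timing loop along the lines of the scripts in this thread works (the path and the 20 ms / 16 kHz chunking are the same assumptions as elsewhere here):

import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="stream_state_external.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()
out = interpreter.get_output_details()
states = [np.zeros(d['shape'], dtype=np.float32) for d in inp]

chunk = np.zeros((1, 320), dtype=np.float32)   # one 20 ms chunk of silence
start = time.time()
for _ in range(50):                            # 50 x 20 ms = 1 s of audio
    interpreter.set_tensor(inp[0]['index'], chunk)
    for s in range(1, len(inp)):
        interpreter.set_tensor(inp[s]['index'], states[s])
    interpreter.invoke()
    for s in range(1, len(inp)):
        states[s] = interpreter.get_tensor(out[s]['index'])
print('seconds per 1 s of audio:', time.time() - start)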

Usually I am aiming for a Pi Zero 2 W, as I think they make great satellites, when you can actually purchase them.
It is running at 25% on a single core, and with Python and everything htop says 165 MB.

The main thing is that it runs and is low latency, but really I am going off what Google have already benchmarked:

crnn_state
parameters: 467K
float accuracy: 97.1; model size: 1800KB; latency 7.1ms
quant accuracy: 96.9; model size: 593KB; latency 2.6ms
stream float accuracy: 96.3; model size: 1700KB; latency 0.2ms
stream quant accuracy: 95.8; model size: 472KB; latency 0.1ms

The only test I can think of is running 100 hours of LibriSpeech through it; as soon as you start adding anything random such as noise, or supplying your own KWs, it all starts getting very subjective.
I have a bit of a hack for it below, using https://www.openslr.org/resources/12/train-clean-100.tar.gz

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf

def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   

def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 0.020
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn_state/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite", num_threads=2)

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []
reset_state = True
kw_hit_rbuff = np.zeros(13, dtype=np.float32)
for txtfile in glob.glob('/media/stuart/New Volume/Users/Stuart/Downloads/Noise/LibriSpeech/**/*.txt', recursive=True):
    dirtxt = os.path.dirname(txtfile)
    with open(txtfile) as f:
      lines = f.readlines()
      for line in lines:
       frame = 0
       kw_count = 0
       kw_hit = False
       content = line.split(" ", 1)
       flacfile = dirtxt + '/' + content[0] + '.flac'
       data, samplerate = sf.read(flacfile, dtype='float32')
       total_duration = total_duration + (len(data) / samplerate)
       while frame < 100:
         start = 320 * frame
         rec = data[start:start + 320]
         if len(rec) < 320:
           break
         kw_prob = kw_detect(rec, sample_rate ,duration, reset_state)
         if kw_prob > 0.9999:
           kw_hit = True
           reset_state = True
           kw_hit_rbuff = np.zeros(13, dtype=np.float32)
           print(flacfile, kw_prob, frame)
         else:
           reset_state = False 
         frame += 1
       if kw_hit == True:
         kw_hit_qty += 1
         hit_txt.append(flacfile)
       print(kw_hit_qty, total_duration / 3600)
       
print(kw_hit_qty, total_duration / 3600)

The KW side is far more subjective, but these are the 1400 I had in the test set:

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf
import time


def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   


def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    # Make prediction from model
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 0.020
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn_state/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite", num_threads=2)

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []

start_time = time.time()
for kwfile in glob.glob(os.path.join('../GoogleKWS/data2/testing/heymarvin', '*.wav')):
  reset_state = True
  frame = 0
  kw_count = 0
  kw_hit = False
  data, samplerate = sf.read(kwfile, dtype='float32')
  total_duration = total_duration + (len(data) / samplerate)
  while frame < 100:
    start = 320 * frame
    rec = data[start:start + 320]
    if len(rec) < 320:
      break
    kw_prob = kw_detect(rec, sample_rate ,duration, reset_state)
    if kw_prob > 0.9999:
        kw_hit = True
        reset_state = True

    else:
      reset_state = False
      kw_count = 0      
    frame += 1
  if kw_hit == False:
    kw_hit_qty += 1
    hit_txt.append(kwfile)
  #print(kw_hit_qty, total_duration / 3600)
       
print(kw_hit_qty, total_duration / 3600)
print(time.time() - start_time)

With `if kw_prob > 0.999:` it gives 12 false alarms (false positives) over the 100 hours and a 1.21% miss rate (false negatives) on the 1400 KW.
With `if kw_prob > 0.9999:` it gives 1 false alarm over the 100 hours for a 3.78% miss rate on the 1400 KW.
The 1400 KW files are here: https://drive.google.com/file/d/1dreV5fBIwzdcJnXEueYwc4NeWCyufdS-/view?usp=share_link

Those are the benchmarks to beat, and you need to quote false positives and false negatives together, as it is swings and roundabouts: fewer false negatives will give more false positives. With respect to Picovoice, I seem able to roughly match them on false positives while being 10x better on false negatives, or match them on false negatives with 3x fewer false positives.
Guess just test it and see what you think.
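Before the live-mic script below, a quick sketch of how those counts convert into comparable rates (using the 100 h LibriSpeech run and the 1400-file KW set from above; a sketch, nothing more):

def summarize(false_alarms, hours, miss_pct, threshold):
    # false alarms on KW-free speech paired with the miss rate on the KW set
    per_hour = false_alarms / hours
    every = (hours / false_alarms) if false_alarms else float('inf')
    print(f'threshold {threshold}: {miss_pct}% miss rate, '
          f'{per_hour:.2f} false alarms/hour (1 every {every:.0f} h)')

summarize(12, 100, 1.21, 0.999)    # roughly 1 false alarm every 8 h
summarize(1, 100, 3.78, 0.9999)    # roughly 1 false alarm every 100 h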

import tensorflow as tf
import sounddevice as sd
import numpy as np
import threading


def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())
      
def sd_callback(rec, frames, time, status):
    global max_rec, rec_samples
    # Notify if errors
    if status:
        print('Error:', status)
    
    rec = np.reshape(rec, (1, rec_samples))
    
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    lvl = np.max(np.abs(rec))
    if lvl > max_rec:
      max_rec = lvl
         
    out_softmax = softmax_stable(output_data[0])   
    if out_softmax[0] > 0.999:
      print("Marvin:", out_softmax[0], max_rec)
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
      max_rec = 0.0
                  
# Parameters
kw_duration = 1.0
rec_duration = 0.020
sample_rate = 16000
num_channels = 1
max_rec = 0.0
rec_samples = int((sample_rate * kw_duration) * rec_duration)

sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')
sd.default.device = 'donglein'

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="stream_state_external.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors, really should be static copies to use as KW resets
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    
# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=rec_samples,
                    callback=sd_callback):
    threading.Event().wait()

All the scripts are super hacky, just for tests, but I will probably create a C/C++ runner eventually, as Python just sucks for any DSP, even though this is very light using sounddevice, since the MFCC is embedded in the model so there is no external MFCC step.

The problem with non-streaming KWS is that audio is a stream. Video is a stream of images but you can use a single image, whereas a single sample means nothing in audio.
So you have to run a non-streaming KWS by 'inching' along the incoming stream, and you can quickly pick up a lot of load, or not get the granularity to have the KW in the right position and so not detect it.
The streaming model I am using does it in 20 ms chunks; doing that with a non-streaming KWS would mean running it 50 times a second. Likely less, as your samples can shift, but that often reduces accuracy. Generally my samples are fairly centred, but it doesn't matter due to the 20 ms step, whilst the lower you go the more chance you miss the optimal recognition position.
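A minimal sketch of the two invocation patterns, assuming 16 kHz audio, a 1 s model window and a stand-in classify() call:

import numpy as np

SR = 16000
stream = np.zeros(SR * 10, dtype=np.float32)   # placeholder: 10 s of incoming audio

def classify(window):
    return 0.0                                  # stand-in for a model invocation

# Streaming model: every 20 ms chunk is fed in, state carries between calls
chunk = 320                                     # 20 ms at 16 kHz
for i in range(0, len(stream) - chunk + 1, chunk):
    prob = classify(stream[i:i + chunk])        # 50 invocations per second

# Non-streaming model: 'inch' a full 1 s window along the stream
win, hop = SR, 3200                             # 1 s window, 200 ms hop = 5x per second
for i in range(0, len(stream) - win + 1, hop):
    prob = classify(stream[i:i + win])          # 5 invocations per second, with coarser
                                                # alignment of the KW inside the window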

PS if I had upped kw_prob to just > 0.999915 it would have been zero false alarms for the 100 hours :slight_smile: at a 4.21% miss rate, which is what I am working towards: if I use on-device training on in-use captured KWs, then likely I can retain close to that false-alarm level whilst reducing the miss rate greatly.
That is what is in the pipeline anyway.

lol, it only took me 24 hours to fully train the kws model on my other computer; Happy to share if you want to try a different model.

Tensorflow + M1 macbook + Metal. No bueno. Works ok for Development but it just crashes during training.

Ah yeah, dunno how solid TensorFlow is on M1; they are awesome ML machines, but probably still a little fresh for support, as Metal and all that is new.
I have an eye on a 16 GB Mini if second-hand prices drop when the M2 comes out :slight_smile:

Yeah, share it, as I am still undecided, but what GoogleKWS gives me is this great datum to test against, and maybe I will pick a model and try to break out the code at a later stage.

PS have you tried Whisper on your M1? GitHub - ggerganov/whisper.cpp: Port of OpenAI's Whisper model in C/C++ is really good, and boy, the small & medium models are awesome.
I didn't realise, but Apple created extra ML Arm instructions for their CPUs, and boy do they shine.

I played around with the other whisper transcribe command; but it’s too slow for my tastes.

I will check out the link you shared above.

With the full model on the M1, if you ever get the time, I would be interested if you benchmark it with:

whisper --best_of None --beam_size None

@shellcode

  '--restore_checkpoint',
  type=int,
  default=0,
  help='If 1 it will restore a checkpoint and resume the training '
  'by initializing model weights and optimizer with checkpoint values. '
  'It will use learning rate and number of training iterations from '
  '--learning_rate and --how_many_training_steps accordingly. '
  'This option is useful in cases when training was interrupted. '
  'With it you should adjust learning_rate and how_many_training_steps.'

If it crashes again, you have to look at the train dir of the model, or get the last train step from the CLI, and just modify the learning rate and steps, as it starts again with whatever figures are there; subtracting the completed steps and removing the used learning-rate stages before you restart is relatively painless.

PS I tried another way, with a non-streaming model but inching 5x per second, using a bc_resnet, as it has super low parameters so should be light.
Dunno, as I am not sure about the framework here: the end training accuracy and validation seem far off what is quoted for the framework.
There is a PyTorch version on GitHub I have noticed and might give it a try.
https://drive.google.com/file/d/1k4UTgJ-3L2ItczgpcNspjycEtu4Q5S6X/view?usp=share_link

Don't think this is reflective of non-streaming generally, and the model isn't quantised so it is heavier than it could be, but the hack tests for non-streaming are here.
Libri voice test:

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf

def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   

def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 1.0
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/bc_resnet_1/tflite_non_stream/non_stream.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []
reset_state = True
kw_hit_rbuff = np.zeros(4, dtype=np.float32)
reset_rec =np.zeros(16000, dtype=np.float32)
for txtfile in glob.glob('/media/stuart/New Volume/Users/Stuart/Downloads/Noise/LibriSpeech/**/*.txt', recursive=True):
    dirtxt = os.path.dirname(txtfile)
    with open(txtfile) as f:
      lines = f.readlines()
      for line in lines:
       frame = 0
       kw_count = 0
       kw_hit = False
       content = line.split(" ", 1)
       flacfile = dirtxt + '/' + content[0] + '.flac'
       data, samplerate = sf.read(flacfile, dtype='float32')
       max_amp = np.max(np.abs(data))
       data = np.multiply(data, 0.8 / max_amp)
       total_duration = total_duration + (len(data) / samplerate)
       max_amp = np.max(np.abs(data))
       while frame < int(len(data) / 3200):
         start = 3200 * frame
         if start < 16000:
           rec = data[0:start + 3200]
           rec = np.append(rec, np.zeros([1, 16000 - (3200 * (frame + 1))], dtype=np.float32))
         else:
           rec = data[start - 12800:start + 3200]
         kw_prob = kw_detect(rec, sample_rate ,duration, False)
         if kw_prob > 0.99:
           kw_hit = True
           print(flacfile, kw_prob, frame)
           kw_prob = kw_detect(reset_rec, sample_rate ,duration, True)
         frame += 1
       if kw_hit == True:
         kw_hit_qty += 1
         hit_txt.append(flacfile)
       print(kw_hit_qty, total_duration / 3600)
       
print(kw_hit_qty, total_duration / 3600)

KW test:

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf
import time


def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   


def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    # Make prediction from model
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 1.0
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/bc_resnet_1/tflite_non_stream/non_stream.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []
start_time = time.time()
kw_miss = []
reset_rec =np.zeros(16000, dtype=np.float32)
for kwfile in glob.glob(os.path.join('../GoogleKWS/data2/testing/heymarvin', '*.wav')):
  reset_state = False
  frame = 0
  kw_hit = False
  #print(kwfile)
  data, samplerate = sf.read(kwfile, dtype='float32')
  total_duration = total_duration + (len(data) / samplerate)
  while frame < 5:
    start = 3200 * frame
    rec = data[0:start + 3200]
    rec = np.append(rec, np.zeros([1, 16000 - (3200 * (frame + 1))], dtype=np.float32))
    kw_prob = kw_detect(rec, sample_rate ,duration, reset_state)
    #print(kw_prob, frame)
    if kw_prob > 0.99:
      kw_hit = True
      print('hey marvin', kw_prob)
      #reset_state = True 
    frame += 1
  if kw_hit == True:
    kw_hit_qty += 1
    hit_txt.append(kwfile)
  else:
    kw_miss.append([kw_prob])
  kw_prob = kw_detect(reset_rec, sample_rate ,duration, True)
  #print(kw_hit_qty, total_duration / 3600)
for kw in kw_miss:
  print(kw)       
print(1400 - kw_hit_qty, total_duration / 3600)
print(time.time() - start_time)

This model is as smooth as butter and works great while watching a loud movie in the same room. With long gating the words and everything. This is definitely a winner.

My framework makes it trivial to add new words into the mix; i.e. it picks up both words trained separately, "hey" and "alice", said together. Then I said one → nine while the movie was playing as well.

Plus the performance is amazing.

You still haven't shared a model to test?
Do the 100-hour benchmark so we have those figures, as that is about the only one that is not subjective; the LibriSpeech clean dataset is set in stone.

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf

def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   

def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 0.020
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn_stateb3/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite", num_threads=2)

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []
reset_state = True
kw_hit_rbuff = np.zeros(10, dtype=np.float32)
reset_rec =np.zeros(320, dtype=np.float32)
for txtfile in glob.glob('/media/stuart/New Volume/Users/Stuart/Downloads/Noise/LibriSpeech/**/*.txt', recursive=True):
    dirtxt = os.path.dirname(txtfile)
    with open(txtfile) as f:
      lines = f.readlines()
      for line in lines:
       frame = 0
       kw_count = 0
       kw_hit = False
       content = line.split(" ", 1)
       flacfile = dirtxt + '/' + content[0] + '.flac'
       data, samplerate = sf.read(flacfile, dtype='float32')
       max_amp = np.max(np.abs(data))
       data = np.multiply(data, 0.8 / max_amp)
       total_duration = total_duration + (len(data) / samplerate)
       max_amp = np.max(np.abs(data))
       while frame < int(len(data) / 320):
         start = 320 * frame
         rec = data[start:start + 320]
         if len(rec) < 320:
           break
         kw_prob = kw_detect(rec, sample_rate ,duration, False)
         kw_hit_rbuff = np.roll(kw_hit_rbuff, -1)
         kw_hit_rbuff[len(kw_hit_rbuff) - 1] = kw_prob
         kw_hit_prob = np.mean(kw_hit_rbuff)
         if kw_prob > 0.99995:
           kw_hit = True
           kw_prob = kw_detect(reset_rec, sample_rate ,duration, True)
           kw_hit_rbuff = np.zeros(10, dtype=np.float32)
           print(flacfile, kw_prob, frame)
         frame += 1
       if kw_hit == True:
         kw_hit_qty += 1
         hit_txt.append(flacfile)
       print(kw_hit_qty, total_duration / 3600)
       
print(kw_hit_qty, total_duration / 3600)

I also noticed I had copied and pasted a wrong line, so it was not running for the full 100 hours :slight_smile:
I know with mine I still have a lot of work getting an optimal dataset; it is not really the model or framework, the majority is dataset.
You seem to be using GSC, the Google Speech Commands dataset, which is a benchmark dataset and deliberately bad, to give a benchmark. (I dunno, maybe it really was just a first bad attempt that got further use :slight_smile: )
Even with ML Commons, open source is still at a huge disadvantage to big data, as what we have contains many errors and we have near zero metadata for criteria or even dispersion spread.
I try to create my own through analysis, but boy, for one person with a single 6th-gen Xeon workstation it is a lot of long-winded and boring work.
I end up creating multi-stage models to filter the dataset, which is also changing as I update the dataset creation tools, and I have not even got to a model framework yet.

I will probably settle on https://zenodo.org/record/2529934#.Y6JLqNLP2RQ FSDnoisy18k.
So it is just a slight change to the benchmark, run at 0.8 vol and 0.2 vol just for tests, as there is no AGC on the input of my test KWS scripts, but it is good for analysis of model perf. It comes to 41.1 hours, so doubling up is almost 100 hours also.

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf

def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   

def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 0.020
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn_stateb3/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []
reset_state = True
kw_hit_rbuff = np.zeros(10, dtype=np.float32)
reset_rec =np.zeros(320, dtype=np.float32)
for flacfile in glob.glob('/media/stuart/New Volume/Users/Stuart/Downloads/Noise/FSDnoisy18k.audio_train/*.wav'):
       frame = 0
       kw_count = 0
       kw_hit = False
       data, samplerate = sf.read(flacfile, dtype='float32')
       max_amp = np.max(np.abs(data))
       data = np.multiply(data, 0.8 / max_amp)
       total_duration = total_duration + (len(data) / samplerate)
       while frame < int(len(data) / 320):
         start = 320 * frame
         rec = data[start:start + 320]
         if len(rec) < 320:
           break
         kw_prob = kw_detect(rec, sample_rate ,duration, False)
         kw_hit_rbuff = np.roll(kw_hit_rbuff, -1)
         kw_hit_rbuff[len(kw_hit_rbuff) - 1] = kw_prob
         kw_hit_prob = np.mean(kw_hit_rbuff)
         if kw_prob > 0.99995:
           kw_hit = True
           kw_prob = kw_detect(reset_rec, sample_rate ,duration, True)
           kw_hit_rbuff = np.zeros(10, dtype=np.float32)
           print(flacfile, kw_prob, frame)
         frame += 1
       if kw_hit == True:
         kw_hit_qty += 1
         hit_txt.append(flacfile)
       print(kw_hit_qty, total_duration / 3600)
       
print(kw_hit_qty, total_duration / 3600)

What would be really good is for someone to actually run some tests with the above LibriSpeech and FSDnoisy benchmarks on Picovoice & Mycroft and create a third-party benchmark table.
I don't think many will fail on FSDnoisy; if they do, they are doing something very wrong really :slight_smile:
But the false-positive reject rate also needs to be supplied at the same sensitivity. I am still thinking about false positives, as it is very subjective; you could just supply an already filtered dataset, for example, but likely no set-in-stone dataset will exist.

OK, I have a complete working prototype that replaces the Rhasspy server on the satellite. It handles the wake word and communicates with a Rhasspy backend / Home Assistant MQTT backend. It has an improved wake word detection system and removes 90% of the bulk that is not required on the client. It allows you to train new wake word models, has a YAML-backed configuration file for customizations, and is working great in my setup.

Once I have polished up the code I will post to github with a simple trained model for demoing and testing.

I may start a home assistant project for collecting and training a community model; this works better than picovoice and the other solutions currently available.

Here is the model and code as promised.

Hi guys, finally back, and I have not done much really, as the situation deflated me slightly.

I have just been browsing some topics and getting lost in hugging face models :slight_smile:

Whilst browsing I found anton-l/wav2vec2-base-ft-keyword-spotting · Hugging Face. I haven't a clue about it yet, but Arm do a demo for ArmNN with wav2vec on a Pi, and it can also be super light.

When you are looking at something it can often provide lateral thought, and I was just thinking that rather than doing KWS at the satellite, maybe it should just be a simple VAD model that activates an audio stream to a central KWS.

You could quickly gain tolerance and weighting towards user voices with on-device training; you could even create a base model from captured audio and deliberately overfit to the user(s) voice.
Then the KW probability from wav2vec could make the stream decision, and the audio could also be pre-filtered with something like DeepFilterNet.

I will get back to you both about KWS when I regain some momentum, but I got lost checking the most downloaded models on Hugging Face for certain application types.