Suggestions for Dutch wake word detection for newbie

rolyan_trauts · December 14, 2022, 12:29am

@shellcode How are you getting on with your KW?
I think I have Porcupine beat on their own benchmark (GitHub - Picovoice/wake-word-benchmark: wake word engine benchmark framework),
with a 2.57% miss rate at 1 false alarm in 10 hours, like their test in the above.
I also think I can improve on that, but training takes a day, so it is slow going.
Someone should actually run the test rather than take their word for it, and maybe also test Precise or the others we have.
I have been augmenting my dataset with echo and reverb and I don’t think it makes any recognition improvement, so I don’t see why it is such a burning issue for recognition.
Echo and reverb just mash the signal so it becomes something else in a spectrogram, and I think trying to be clever by adding it just doesn’t work. So I will drop that and likely get quite a bit more accuracy, where you can take your pick: lower the false alarms, or shift it the other way and lower the miss rate.
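
To make the "mash" concrete, here is a minimal sketch of one way reverb can be added to a clip. The function name `add_synthetic_reverb`, the decaying-noise impulse response and the `rt60` value are assumptions for illustration only, not the augmentation actually used on the dataset above.

```python
import numpy as np


def add_synthetic_reverb(signal, sample_rate=16000, rt60=0.4, seed=0):
    """Convolve a mono signal with a synthetic, exponentially decaying impulse response."""
    rng = np.random.default_rng(seed)
    n_taps = int(sample_rate * rt60)
    t = np.arange(n_taps) / sample_rate
    # Noise tail whose amplitude envelope falls by 60 dB (factor 1000) over rt60 seconds.
    impulse = rng.standard_normal(n_taps) * np.exp(-np.log(1000.0) * t / rt60)
    impulse *= 0.1    # keep the tail quieter than the direct sound (arbitrary choice)
    impulse[0] = 1.0  # direct path
    wet = np.convolve(signal, impulse)[: len(signal)]
    # Rescale to the original peak so only the smearing changes, not the level.
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(signal)) / peak)
    return wet.astype(np.float32)


# A single click turns into a long decaying tail: the energy is smeared over time.
dry = np.zeros(16000, dtype=np.float32)
dry[100] = 1.0
wet = add_synthetic_reverb(dry)
print(np.count_nonzero(dry), np.count_nonzero(wet))
```

Every frame after the click now carries energy, which is the spectrogram smearing described above.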

https://drive.google.com/file/d/1BzmK37V4CV_aBkV4vcM-d7cgzbqU8K_W/view?usp=share_link
There are 1400 ‘heymarvin’ samples with noise mixed in there, and the tflite index[0] is the kw.
The 100 hours of LibriSpeech (test_clean portion) is available for false positives.

For the KW hit code I again stopped being clever and just used if kw_prob > 0.9997: which gave the above.

shellcode · December 18, 2022, 3:08am

So I ingested the Google KWS into my project; it was a little painful as they are using TensorFlow v1…

But I wanted to play around with the attention and streaming layers they had, to see if that makes a big difference. I don’t have a fully trained model, as the Google KWS crashes during training at around loop 600. Predictions seem a tad slower than my model; I don’t have it quantized or converted to tflite yet, but that shouldn’t be required. I tried the ds_tc_resnet.

id: 0 label: _silence_ (52.4%) time: 717.518ms count: 16
id: 0 label: _silence_ (60.8%) time: 715.485ms count: 17
id: 0 label: _silence_ (61.2%) time: 717.410ms count: 18
id: 0 label: _silence_ (60.6%) time: 718.500ms count: 19
id: 0 label: _silence_ (70.6%) time: 716.438ms count: 20
id: 0 label: _silence_ (51.3%) time: 717.889ms count: 21
id: 0 label: _silence_ (89.8%) time: 714.925ms count: 22
id: 2 label: yes (44.0%) time: 708.930ms count: 23
id: 0 label: _silence_ (61.5%) time: 717.330ms count: 24
id: 0 label: _silence_ (83.2%) time: 713.857ms count: 25
id: 2 label: yes (42.5%) time: 716.303ms count: 26
id: 0 label: _silence_ (80.2%) time: 717.322ms count: 27
id: 0 label: _silence_ (36.6%) time: 711.317ms count: 28
id: 0 label: _silence_ (51.4%) time: 715.419ms count: 29

I only trained 2 words above on the kws model with 10 epochs; so I don’t expect it to perform very well.

Here is how my model performs:

id: 2 label: noise (100.0%) time: 230.710ms count: 0
id: 2 label: noise (100.0%) time: 46.513ms count: 1
id: 2 label: noise (100.0%) time: 32.334ms count: 2
id: 2 label: noise (100.0%) time: 39.197ms count: 3
id: 2 label: noise (99.9%) time: 31.632ms count: 4
id: 2 label: noise (100.0%) time: 29.898ms count: 5
id: 2 label: noise (100.0%) time: 31.591ms count: 6
id: 2 label: noise (100.0%) time: 39.528ms count: 7
id: 2 label: noise (99.7%) time: 31.933ms count: 8
id: 2 label: noise (100.0%) time: 29.871ms count: 9
id: 2 label: noise (100.0%) time: 29.429ms count: 10
id: 1 label: hey (99.3%) time: 37.315ms count: 11
id: 2 label: noise (100.0%) time: 30.291ms count: 12
id: 2 label: noise (100.0%) time: 31.868ms count: 13
id: 2 label: noise (100.0%) time: 31.167ms count: 14
id: 1 label: hey (79.8%) time: 32.522ms count: 15
id: 1 label: hey (72.2%) time: 31.797ms count: 16
id: 2 label: noise (100.0%) time: 32.040ms count: 17
id: 2 label: noise (100.0%) time: 30.590ms count: 18
id: 2 label: noise (100.0%) time: 28.992ms count: 19
id: 2 label: noise (100.0%) time: 31.423ms count: 20
id: 2 label: noise (100.0%) time: 29.940ms count: 21

What perf do you see on your google KWS model?

shellcode · December 18, 2022, 4:24am

Wow the tflite version is super fast:

id: 0 label: _silence_ (91.5%) time: 7.822ms count: 4
id: 0 label: _silence_ (80.6%) time: 7.499ms count: 5
id: 0 label: _silence_ (81.0%) time: 7.354ms count: 6
id: 0 label: _silence_ (74.3%) time: 7.522ms count: 7
id: 0 label: _silence_ (89.6%) time: 7.792ms count: 8
id: 0 label: _silence_ (88.3%) time: 8.068ms count: 9
id: 0 label: _silence_ (86.1%) time: 7.745ms count: 10
id: 0 label: _silence_ (59.7%) time: 7.436ms count: 11
id: 0 label: _silence_ (90.8%) time: 7.729ms count: 12
id: 0 label: _silence_ (51.5%) time: 7.428ms count: 13
id: 0 label: _silence_ (33.7%) time: 7.164ms count: 14
id: 0 label: _silence_ (69.3%) time: 7.138ms count: 15
id: 2 label: yes (41.7%) time: 7.133ms count: 16
id: 0 label: _silence_ (51.7%) time: 7.352ms count: 17
id: 0 label: _silence_ (54.6%) time: 8.099ms count: 18

rolyan_trauts · December 18, 2022, 5:11am

It’s very dependent on the model, and the number of params gives a rough guide to speed, even if some layers are more performant than others.
So far I only have a CRNN done, as I have stuck with the streaming models.

It’s a ‘hey marvin’: https://drive.google.com/file/d/1bGf_b8imzPZJNYDUWR94mWuV0deSMVzM/view?usp=share_link
Index[0] is the ‘heymarvin’ kw.
Not really sure about time, but on my PC it’s about 0.014763929 seconds per 1 sec KW. That doesn’t mean all that much, as it’s more a measurement of my PC.
There is a benchmark tool (Performance measurement | TensorFlow Lite); I have just never used it.
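
A rough way to get a "seconds of compute per second of audio" figure like the one above is to time the chunk loop directly. This is only a sketch: `real_time_factor` and `dummy_infer` are hypothetical names, and `dummy_infer` is a stand-in that should be swapped for the real model call.

```python
import time

import numpy as np


def real_time_factor(process_chunk, audio, chunk_samples=320, sample_rate=16000):
    """Return seconds of processing spent per second of audio fed through process_chunk."""
    n_chunks = len(audio) // chunk_samples
    start = time.perf_counter()
    for i in range(n_chunks):
        process_chunk(audio[i * chunk_samples:(i + 1) * chunk_samples])
    elapsed = time.perf_counter() - start
    audio_seconds = n_chunks * chunk_samples / sample_rate
    return elapsed / audio_seconds


def dummy_infer(chunk):
    # Stand-in for the model call (hypothetical); replace with the real inference function.
    return float(np.mean(np.abs(chunk)))


audio = np.zeros(16000 * 5, dtype=np.float32)  # 5 s of placeholder audio
rtf = real_time_factor(dummy_infer, audio)
print(f"{rtf:.6f} s of compute per 1 s of audio")
```

As noted above, the number mostly describes the machine it was measured on, so it is only useful for comparing models on the same hardware.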

Usually I am aiming for a Pi02W, as I think they make great satellites when you can purchase them.
It’s running @ 25% on a single core, and with Python and everything htop says 165mb.

The main thing is that it runs and is low latency, but really I am going off what Google have already benched.

crnn_state
parameters: 467K
float accuracy: 97.1; model size: 1800KB; latency 7.1ms
quant accuracy: 96.9; model size: 593KB; latency 2.6ms
stream float accuracy: 96.3; model size: 1700KB; latency 0.2ms
stream quant accuracy: 95.8; model size: 472KB; latency 0.1ms

The only test I can think of is running 100 hours of LibriSpeech through it, because as soon as you start adding anything random such as noise, or supply your own KWs, it all starts getting very subjective.
I have a bit of a hack here, using https://www.openslr.org/resources/12/train-clean-100.tar.gz

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf

def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   

def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 0.020
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn_state/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite", num_threads=2)

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []
reset_state = True
kw_hit_rbuff = np.zeros(13, dtype=np.float32)
for txtfile in glob.glob('/media/stuart/New Volume/Users/Stuart/Downloads/Noise/LibriSpeech/**/*.txt', recursive=True):
    dirtxt = os.path.dirname(txtfile)
    with open(txtfile) as f:
      lines = f.readlines()
      for line in lines:
       frame = 0
       kw_count = 0
       kw_hit = False
       content = line.split(" ", 1)
       flacfile = dirtxt + '/' + content[0] + '.flac'
       data, samplerate = sf.read(flacfile, dtype='float32')
       total_duration = total_duration + (len(data) / samplerate)
       # NOTE: frame < 100 caps each file at 100 x 20 ms = 2 s, so only the start of each file is scanned
       while frame < 100:
         start = 320 * frame
         rec = data[start:start + 320]
         if len(rec) < 320:
           break
         kw_prob = kw_detect(rec, sample_rate ,duration, reset_state)
         if kw_prob > 0.9999:
           kw_hit = True
           reset_state = True
           kw_hit_rbuff = np.zeros(13, dtype=np.float32)
           print(flacfile, kw_prob, frame)
         else:
           reset_state = False 
         frame += 1
       if kw_hit == True:
         kw_hit_qty += 1
         hit_txt.append(flacfile)
       print(kw_hit_qty, total_duration / 3600)
       
print(kw_hit_qty, total_duration / 3600)

KW is far more subjective, but these are the 1400 I had in the test set:

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf
import time


def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   


def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    # Make prediction from model
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 0.020
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn_state/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite", num_threads=2)

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []

start_time = time.time()
for kwfile in glob.glob(os.path.join('../GoogleKWS/data2/testing/heymarvin', '*.wav')):
  reset_state = True
  frame = 0
  kw_count = 0
  kw_hit = False
  data, samplerate = sf.read(kwfile, dtype='float32')
  total_duration = total_duration + (len(data) / samplerate)
  while frame < 100:
    start = 320 * frame
    rec = data[start:start + 320]
    if len(rec) < 320:
      break
    kw_prob = kw_detect(rec, sample_rate ,duration, reset_state)
    if kw_prob > 0.9999:
        kw_hit = True
        reset_state = True

    else:
      reset_state = False
      kw_count = 0      
    frame += 1
  if kw_hit == False:
    kw_hit_qty += 1
    hit_txt.append(kwfile)
  #print(kw_hit_qty, total_duration / 3600)
       
print(kw_hit_qty, total_duration / 3600)
print(time.time() - start_time)

With if kw_prob > 0.999: it gives 12 false positives (100 hours) for 1.21% false negatives (1400 kw).
Whereas with if kw_prob > 0.9999: it gives 1 false positive (100 hours) for 3.78% false negatives (1400 kw).
The 1400 kw files are here https://drive.google.com/file/d/1dreV5fBIwzdcJnXEueYwc4NeWCyufdS-/view?usp=share_link

Those are the benchmarks to beat. False positives and false negatives need to be quoted together, as it’s swings and roundabouts: fewer false positives will give more false negatives. With respect to Picovoice, I seem to be able to approximately match them on false negatives while being 10x better on false positives, or match them on false positives with 3x fewer false negatives.
I guess just test and see what you think.
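
The trade-off can be tabulated by sweeping the threshold over saved per-file peak probabilities rather than rerunning the model each time. A minimal sketch, assuming you have stored the highest `kw_prob` seen in each no-keyword file and in each keyword file; `sweep_thresholds` and the numbers in the example are made up for illustration.

```python
import numpy as np


def sweep_thresholds(negative_peaks, keyword_peaks, hours, thresholds):
    """For each threshold return (threshold, false alarms per hour, miss rate in %)."""
    negative_peaks = np.asarray(negative_peaks, dtype=np.float64)
    keyword_peaks = np.asarray(keyword_peaks, dtype=np.float64)
    rows = []
    for th in thresholds:
        # A no-keyword file whose peak exceeds the threshold is a false alarm.
        false_alarms = int(np.sum(negative_peaks > th))
        # A keyword file whose peak never exceeds the threshold is a miss.
        misses = int(np.sum(keyword_peaks <= th))
        rows.append((th, false_alarms / hours, 100.0 * misses / len(keyword_peaks)))
    return rows


# Made-up peaks: 4 no-keyword files covering 2 hours, and 4 keyword files.
rows = sweep_thresholds(
    negative_peaks=[0.2, 0.9995, 0.5, 0.99992],
    keyword_peaks=[0.99999, 0.9993, 0.99995, 0.99999],
    hours=2.0,
    thresholds=[0.999, 0.9999],
)
for th, fa_per_hour, miss_pct in rows:
    print(f"threshold {th}: {fa_per_hour:.2f} false alarms/hour, {miss_pct:.1f}% missed")
```

Like the test scripts, this counts at most one hit per file, so a file that would trigger several times still counts once.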

import tensorflow as tf
import sounddevice as sd
import numpy as np
import threading


def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())
      
def sd_callback(rec, frames, time, status):
    global max_rec, rec_samples
    # Notify if errors
    if status:
        print('Error:', status)
    
    rec = np.reshape(rec, (1, rec_samples))
    
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    lvl = np.max(np.abs(rec))
    if lvl > max_rec:
      max_rec = lvl
         
    out_softmax = softmax_stable(output_data[0])   
    if out_softmax[0] > 0.999:
      print("Marvin:", out_softmax[0], max_rec)
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
      max_rec = 0.0
                  
# Parameters
kw_duration = 1.0
rec_duration = 0.020
sample_rate = 16000
num_channels = 1
max_rec = 0.0
rec_samples = int((sample_rate * kw_duration) * rec_duration)

sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')
sd.default.device = 'donglein'

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="stream_state_external.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors, really should be static copies to use as KW resets
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    
# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=rec_samples,
                    callback=sd_callback):
    threading.Event().wait()

All the scripts are super hacky, just for tests, but I will probably create a C/C++ runner eventually, as Python just sucks for any DSP. It is very light here using sounddevice, though, as the MFCC is embedded in the model so there is no external MFCC.

The problem with non-streaming KWS is that audio is a stream. Video is a stream of images and you can use a single image, but a single sample means nothing in audio.
So you have to run a non-streaming KWS as you ‘inch’ along the incoming stream, and you can quickly pick up a lot of load, or not have the granularity to get the KW in the right position and so not detect it.
The streaming model I am using does it in 20ms chunks, so matching that would mean running a non-streaming KWS x50 a second. Likely less, as your samples can shift, but that can often reduce accuracy. Generally my samples are fairly centered, but that doesn’t matter with the 20ms step, whilst the lower the rate you go the more chance you may miss the optimal recognition position.
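
The "inching" can be sketched as a sliding window over the stream. This is an illustration with a hypothetical helper `sliding_windows`, assuming a 1 s window at 16 kHz; each window yielded would cost one full non-streaming inference.

```python
import numpy as np


def sliding_windows(audio, window_samples=16000, hop_samples=3200):
    """Yield (start_index, window) pairs, advancing by hop_samples each step."""
    for start in range(0, len(audio) - window_samples + 1, hop_samples):
        yield start, audio[start:start + window_samples]


sample_rate = 16000
audio = np.zeros(sample_rate * 3, dtype=np.float32)  # 3 s of silence as a placeholder

for hop in (320, 3200):  # 20 ms step versus 200 ms step
    n_windows = sum(1 for _ in sliding_windows(audio, sample_rate, hop))
    print(f"hop {hop / sample_rate * 1000:.0f} ms: {sample_rate // hop} inferences per second, "
          f"{n_windows} windows over 3 s")
```

A smaller hop places the keyword closer to the position the model was trained on but multiplies the load; a larger hop is cheaper but can leave the keyword badly positioned in every window.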

PS if I had upped the kw_prob to just > 0.999915 it would be zero false positives for 100 hours for 4.21% false negatives, which is what I am working on. If I use on-device training of in-use captured KW, then I can likely retain close to that level of false positives whilst reducing false negatives greatly.
That’s what is in the pipeline anyway.

shellcode · December 19, 2022, 1:38am

lol, it only took me 24 hours to fully train the kws model on my other computer; Happy to share if you want to try a different model.

Tensorflow + M1 macbook + Metal. No bueno. Works ok for Development but it just crashes during training.

rolyan_trauts · December 19, 2022, 1:55am

Ah yeah, dunno how solid TensorFlow is on M1. They are awesome ML machines, but support is probably still a little fresh as Metal and all is new.
I have an eye on a 16gb Mini if second-hand prices drop when the M2 comes out.

Yeah, share it. I am still undecided, but what the GoogleKWS gives me is a great datum to test against, and maybe I will pick a model and try to break out the code at a later stage.

PS have you tried Whisper on your M1? GitHub - ggerganov/whisper.cpp: Port of OpenAI's Whisper model in C/C++ is really good, and boy the small & medium models are awesome.
I didn’t realise, but Apple created extra ML Arm instructions for their CPUs and boy do they shine.

shellcode · December 19, 2022, 2:03am

I played around with the other whisper transcribe command; but it’s too slow for my tastes.

I will check out the link you shared above.

rolyan_trauts · December 19, 2022, 2:10am

With the full Whisper on the M1, if you ever get the time, I would be interested in what you get if you bench with:
whisper --best_of None --beam_size None

@shellcode

  '--restore_checkpoint',
  type=int,
  default=0,
  help='If 1 it will restore a checkpoint and resume the training '
  'by initializing model weights and optimizer with checkpoint values. '
  'It will use learning rate and number of training iterations from '
  '--learning_rate and --how_many_training_steps accordinlgy. '
  'This option is useful in cases when training was interrupted. '
  'With it you should adjust learning_rate and how_many_training_steps.'

If you crash again, look at the train dir of the model or get the last train step from the CLI, then modify the learning rate and steps before you restart, as it starts again with whatever figures are there. Subtracting the completed steps and removing any finished steps / learning rate stages is relatively painless.
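
The subtraction can be scripted. This is a sketch under the assumption that both flags are comma-separated lists with one entry per training stage; `remaining_schedule` is a hypothetical helper and the schedule values in the example are made up.

```python
def remaining_schedule(steps_csv, rates_csv, completed_steps):
    """Return the (steps, learning rates) still to run after completed_steps."""
    steps = [int(s) for s in steps_csv.split(",")]
    rates = rates_csv.split(",")
    if len(steps) != len(rates):
        raise ValueError("steps and learning rates must have the same number of stages")
    left_steps, left_rates = [], []
    for stage_steps, rate in zip(steps, rates):
        if completed_steps >= stage_steps:
            completed_steps -= stage_steps  # whole stage already done, drop it
            continue
        left_steps.append(str(stage_steps - completed_steps))
        left_rates.append(rate)
        completed_steps = 0
    return ",".join(left_steps), ",".join(left_rates)


# Example: a three-stage schedule that crashed after 26000 steps in total.
steps, rates = remaining_schedule("20000,20000,20000", "0.001,0.0005,0.0001", 26000)
print(f"--how_many_training_steps {steps} --learning_rate {rates} --restore_checkpoint 1")
```

Check the last saved checkpoint rather than the last logged step, since training resumes from whatever was actually written to the train dir.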

rolyan_trauts · December 20, 2022, 2:57am

PS tried another way with a non-stream model, inching x5 per second, using a bcresnet as it has super low parameters so should be light.
Dunno, as I am not sure about the framework: end train accuracy and validation seem far off what is quoted for the framework.
There is a PyTorch version on GitHub I have noticed, and I might give it a try.
https://drive.google.com/file/d/1k4UTgJ-3L2ItczgpcNspjycEtu4Q5S6X/view?usp=share_link

I don’t think this is reflective of non-stream, and the model isn’t quantised so it is heavier than it could be, but the hack tests for non-stream are here.
LibriSpeech test

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf

def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   

def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 1.0
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/bc_resnet_1/tflite_non_stream/non_stream.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []
reset_state = True
kw_hit_rbuff = np.zeros(4, dtype=np.float32)
reset_rec =np.zeros(16000, dtype=np.float32)
for txtfile in glob.glob('/media/stuart/New Volume/Users/Stuart/Downloads/Noise/LibriSpeech/**/*.txt', recursive=True):
    dirtxt = os.path.dirname(txtfile)
    with open(txtfile) as f:
      lines = f.readlines()
      for line in lines:
       frame = 0
       kw_count = 0
       kw_hit = False
       content = line.split(" ", 1)
       flacfile = dirtxt + '/' + content[0] + '.flac'
       data, samplerate = sf.read(flacfile, dtype='float32')
       max_amp = np.max(np.abs(data))
       data = np.multiply(data, 0.8 / max_amp)
       total_duration = total_duration + (len(data) / samplerate)
       max_amp = np.max(np.abs(data))
       while frame < int(len(data) / 3200):
         start = 3200 * frame
         if start < 16000:
           rec = data[0:start + 3200]
           rec = np.append(rec, np.zeros([1, 16000 - (3200 * (frame + 1))], dtype=np.float32))
         else:
           rec = data[start - 12800:start + 3200]
         kw_prob = kw_detect(rec, sample_rate ,duration, False)
         if kw_prob > 0.99:
           kw_hit = True
           print(flacfile, kw_prob, frame)
           kw_prob = kw_detect(reset_rec, sample_rate ,duration, True)
         frame += 1
       if kw_hit == True:
         kw_hit_qty += 1
         hit_txt.append(flacfile)
       print(kw_hit_qty, total_duration / 3600)
       
print(kw_hit_qty, total_duration / 3600)

kw test

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf
import time


def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   


def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    # Make prediction from model
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 1.0
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/bc_resnet_1/tflite_non_stream/non_stream.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []
start_time = time.time()
kw_miss = []
reset_rec =np.zeros(16000, dtype=np.float32)
for kwfile in glob.glob(os.path.join('../GoogleKWS/data2/testing/heymarvin', '*.wav')):
  reset_state = False
  frame = 0
  kw_hit = False
  #print(kwfile)
  data, samplerate = sf.read(kwfile, dtype='float32')
  total_duration = total_duration + (len(data) / samplerate)
  while frame < 5:
    start = 3200 * frame
    rec = data[0:start + 3200]
    rec = np.append(rec, np.zeros([1, 16000 - (3200 * (frame + 1))], dtype=np.float32))
    kw_prob = kw_detect(rec, sample_rate ,duration, reset_state)
    #print(kw_prob, frame)
    if kw_prob > 0.99:
      kw_hit = True
      print('hey marvin', kw_prob)
      #reset_state = True 
    frame += 1
  if kw_hit == True:
    kw_hit_qty += 1
    hit_txt.append(kwfile)
  else:
    kw_miss.append([kw_prob])
  kw_prob = kw_detect(reset_rec, sample_rate ,duration, True)
  #print(kw_hit_qty, total_duration / 3600)
for kw in kw_miss:
  print(kw)       
print(1400 - kw_hit_qty, total_duration / 3600)
print(time.time() - start_time)

shellcode · December 20, 2022, 5:53pm

This model is as smooth as butter and works great while watching a loud movie in the same room, even when elongating the words and everything. This is definitely a winner.

My framework makes it trivial to add new words into the mix; i.e. it picks up both words, “hey” and “alice”, trained separately but said together. Then I said one → nine, while the movie was playing as well.

Plus the performance is amazing.

rolyan_trauts · December 20, 2022, 9:55pm

You still haven’t shared a model to test?
Do the 100 hour benchmark so we have those figures, as that is about the only one that is not subjective: the LibriSpeech clean dataset is set in stone.

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf

def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   

def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 0.020
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn_stateb3/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite", num_threads=2)

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []
reset_state = True
kw_hit_rbuff = np.zeros(10, dtype=np.float32)
reset_rec =np.zeros(320, dtype=np.float32)
for txtfile in glob.glob('/media/stuart/New Volume/Users/Stuart/Downloads/Noise/LibriSpeech/**/*.txt', recursive=True):
    dirtxt = os.path.dirname(txtfile)
    with open(txtfile) as f:
      lines = f.readlines()
      for line in lines:
       frame = 0
       kw_count = 0
       kw_hit = False
       content = line.split(" ", 1)
       flacfile = dirtxt + '/' + content[0] + '.flac'
       data, samplerate = sf.read(flacfile, dtype='float32')
       max_amp = np.max(np.abs(data))
       data = np.multiply(data, 0.8 / max_amp)
       total_duration = total_duration + (len(data) / samplerate)
       max_amp = np.max(np.abs(data))
       while frame < int(len(data) / 320):
         start = 320 * frame
         rec = data[start:start + 320]
         if len(rec) < 320:
           break
         kw_prob = kw_detect(rec, sample_rate ,duration, False)
         kw_hit_rbuff = np.roll(kw_hit_rbuff, -1)
         kw_hit_rbuff[len(kw_hit_rbuff) - 1] = kw_prob
         kw_hit_prob = np.mean(kw_hit_rbuff)
         if kw_prob > 0.99995:
           kw_hit = True
           # Log the triggering probability before kw_prob is overwritten by the reset call
           print(flacfile, kw_prob, frame)
           kw_prob = kw_detect(reset_rec, sample_rate ,duration, True)
           kw_hit_rbuff = np.zeros(10, dtype=np.float32)
         frame += 1
       if kw_hit == True:
         kw_hit_qty += 1
         hit_txt.append(flacfile)
       print(kw_hit_qty, total_duration / 3600)
       
print(kw_hit_qty, total_duration / 3600)
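
The script above fills kw_hit_rbuff and computes kw_hit_prob, but the trigger itself is the raw kw_prob. For comparison, this is a sketch of what a trigger on the moving average could look like; the class name `SmoothedTrigger`, the window length and the threshold are assumptions, and the probabilities in the example are made up.

```python
import numpy as np


class SmoothedTrigger:
    """Fire when the mean of the last n_frames probabilities exceeds threshold."""

    def __init__(self, n_frames=10, threshold=0.9):
        self.buffer = np.zeros(n_frames, dtype=np.float32)
        self.threshold = threshold

    def update(self, kw_prob):
        # Shift the ring buffer left by one and store the newest probability at the end.
        self.buffer = np.roll(self.buffer, -1)
        self.buffer[-1] = kw_prob
        if float(np.mean(self.buffer)) > self.threshold:
            self.buffer[:] = 0.0  # clear so one utterance fires only once
            return True
        return False


trigger = SmoothedTrigger(n_frames=5, threshold=0.9)
# One isolated spike, then a sustained run of high probabilities.
probs = [0.1, 0.99, 0.1, 0.1, 0.1, 0.95, 0.96, 0.97, 0.98, 0.99, 0.1]
hits = [i for i, p in enumerate(probs) if trigger.update(p)]
print(hits)
```

Averaging ignores single-frame spikes at the cost of needing several consecutive high frames, so the threshold has to be tuned again and is not comparable to the per-frame values quoted in this thread.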

I also noticed I had copied and pasted a wrong line, so it was not running for the full 100 hours.
I know with mine I still have a lot of work to do getting an optimal dataset, as it’s not really the model or framework; the majority is the dataset.
You seem to be using GSC (‘Google Speech Commands’), which is a benchmark dataset and deliberately bad to give a benchmark. (I dunno if it was really just a bad first attempt that got further use.)
Even with ML Commons, open source is still at a huge disadvantage to big data, as what we have contains many errors and we have near-zero metadata for criteria or even the spread of the data.
I try to create my own through analysis, but boy, for one person with a single 6th gen Xeon workstation it’s a lot of long-winded and boring work.
I end up creating multi-stage models to filter the dataset, which is also changing as I update the dataset creation tools, and I have not even got to a model framework yet.

Will probably settle on FSDnoisy18k: https://zenodo.org/record/2529934#.Y6JLqNLP2RQ
So it’s just a slight change to the benchmark, run @ 0.8 vol & 0.2 vol just for tests as there is no AGC on the input of my test kws scripts, but it is good for analysis of model perf. I make it 41.1 hours, so doubling up is almost 100 hours also.
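
Running at a fixed volume amounts to peak normalisation. A minimal sketch with a guard for silent clips; `normalise_peak` is a hypothetical helper name and the sample values are made up.

```python
import numpy as np


def normalise_peak(data, target_peak=0.8):
    """Scale audio so its largest absolute sample equals target_peak."""
    peak = float(np.max(np.abs(data))) if len(data) else 0.0
    if peak == 0.0:
        return data  # silent or empty clip: nothing to scale, and avoids dividing by zero
    return data * np.float32(target_peak / peak)


clip = np.array([0.0, 0.25, -0.5, 0.1], dtype=np.float32)
for level in (0.8, 0.2):
    scaled = normalise_peak(clip, level)
    print(level, float(np.max(np.abs(scaled))))
```

Running every file once at each level doubles the amount of audio scanned and shows how sensitive the model is to input gain when there is no AGC in front of it.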

import tensorflow as tf
import numpy as np
import glob
import os
import soundfile as sf

def softmax_stable(x):
    return(np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum())   

def kw_detect(rec, sample_rate ,duration, reset_state):


    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    #rec = np.multiply(rec, 8)
    if reset_state:
      for s in range(len(input_details1)):
        inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]

# Parameters
duration = 0.020
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn_stateb3/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
kw_hit_qty = 0
total_duration = 0.0
hit_txt = []
reset_state = True
kw_hit_rbuff = np.zeros(10, dtype=np.float32)
reset_rec =np.zeros(320, dtype=np.float32)
for flacfile in glob.glob('/media/stuart/New Volume/Users/Stuart/Downloads/Noise/FSDnoisy18k.audio_train/*.wav'):
       frame = 0
       kw_count = 0
       kw_hit = False
       data, samplerate = sf.read(flacfile, dtype='float32')
       max_amp = np.max(np.abs(data))
       if max_amp > 0:
         # Peak-normalise to 0.8; all-zero files are left alone to avoid dividing by zero
         data = np.multiply(data, 0.8 / max_amp)
       total_duration = total_duration + (len(data) / samplerate)
       while frame < int(len(data) / 320):
         start = 320 * frame
         rec = data[start:start + 320]
         if len(rec) < 320:
           break
         kw_prob = kw_detect(rec, sample_rate ,duration, False)
         kw_hit_rbuff = np.roll(kw_hit_rbuff, -1)
         kw_hit_rbuff[len(kw_hit_rbuff) - 1] = kw_prob
         kw_hit_prob = np.mean(kw_hit_rbuff)  # rolling mean, kept for reference; the hit test below uses the raw kw_prob
         if kw_prob > 0.99995:
           kw_hit = True
           kw_prob = kw_detect(reset_rec, sample_rate ,duration, True)
           kw_hit_rbuff = np.zeros(10, dtype=np.float32)
           print(flacfile, kw_prob, frame)
         frame += 1
       if kw_hit:
         kw_hit_qty += 1
         hit_txt.append(flacfile)
       print(kw_hit_qty, total_duration / 3600)
       
print(kw_hit_qty, total_duration / 3600)
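
For anyone repeating the test at both levels mentioned above (0.8 and 0.2), here is a minimal sketch of the scaling and framing steps on their own, without the model. The helper names `scale_to_peak` and `split_frames` are made up for this sketch and are not part of the script above; the 320-sample frame length is taken from the script (20 ms at 16 kHz).

```python
import numpy as np

def scale_to_peak(data, target_peak):
    """Scale audio so its largest absolute sample equals target_peak."""
    max_amp = np.max(np.abs(data))
    if max_amp == 0:
        return data  # silent clip: nothing to scale
    return data * (target_peak / max_amp)

def split_frames(data, frame_len=320):
    """Yield consecutive non-overlapping frames, dropping a short tail."""
    for start in range(0, len(data) - frame_len + 1, frame_len):
        yield data[start:start + frame_len]

if __name__ == "__main__":
    # Synthetic tone standing in for a real clip (assumption for the demo)
    clip = (0.5 * np.sin(np.linspace(0, 20 * np.pi, 1000))).astype(np.float32)
    for level in (0.8, 0.2):
        scaled = scale_to_peak(clip, level)
        n_frames = sum(1 for _ in split_frames(scaled))
        print(level, float(np.max(np.abs(scaled))), n_frames)
```

Each frame from `split_frames` would then be passed to `kw_detect` exactly as in the loop above, once per level.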

What would be really good is for someone to actually run some tests with the above LibriSpeech and FSDnoisy18k sets on Picovoice and Mycroft and create a third-party benchmark table.
I don't think many will fail on FSDnoisy18k; if they do, they are doing something very wrong really.
But the false positive reject rate at the same sensitivity also needs to be supplied. I'm still thinking about false positives as it's very subjective: you can just supply an already filtered dataset, for example, and likely no set-in-stone dataset will exist.
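
A rough sketch of the two numbers such a table would need per engine, both measured at one fixed sensitivity: false alarms per hour on the negative audio, and miss rate on the keyword clips. The function names and the counts in the demo are made up for illustration and are not results from any engine.

```python
def false_alarms_per_hour(false_hits, audio_hours):
    """False activations divided by hours of keyword-free audio."""
    if audio_hours <= 0:
        raise ValueError("audio_hours must be positive")
    return false_hits / audio_hours

def miss_rate(detected, total_keywords):
    """Fraction of keyword clips that did not trigger a detection."""
    if total_keywords <= 0:
        raise ValueError("total_keywords must be positive")
    return 1.0 - detected / total_keywords

def benchmark_row(name, false_hits, audio_hours, detected, total_keywords):
    """Format one plain-text table row for an engine."""
    fa = false_alarms_per_hour(false_hits, audio_hours)
    mr = miss_rate(detected, total_keywords)
    return f"{name:<12} {fa:8.3f} FA/h {mr * 100:7.2f}% miss"

if __name__ == "__main__":
    # Hypothetical counts, for illustration only
    print(benchmark_row("engine_a", 10, 100.0, 975, 1000))
```

Reporting both numbers from the same threshold matters because either one alone can be made to look good by shifting the sensitivity.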

shellcode · December 30, 2022, 2:00am

OK, I have a complete working prototype that replaces the Rhasspy server on the satellite. It handles the wake word and communicates with a Rhasspy backend or a Home Assistant MQTT backend. It has an improved wake word detection system and removes 90% of the bulk that is not required on the client. It allows you to train new wake word models, is working great in my setup, and uses a YAML-backed configuration file for customizations.

Once I have polished up the code I will post it to GitHub with a simple trained model for demoing and testing.

I may start a Home Assistant project for collecting and training a community model; this works better than Picovoice and the other solutions currently available.

shellcode · January 1, 2023, 8:28am

Here is the model and code as promised.

rolyan_trauts · January 5, 2023, 4:16pm

Hi guys, finally back and not done much really, as the situation deflated me slightly.

I have just been browsing some topics and getting lost in Hugging Face models.

Whilst browsing I came across anton-l/wav2vec2-base-ft-keyword-spotting on Hugging Face. I haven't a clue about it really, but Arm do a demo for ArmNN with wav2vec on a Pi and it can also be super lite.

Often when you are looking at something it can prompt some lateral thought, and I was just thinking that rather than KWS at the 'satellite', maybe it should just be a simple VAD model that activates an audio stream to a central KWS.

You could quickly gain tolerance and weight towards user voices with on-device training; you could even create a base model from captured audio and deliberately overfit to the user(s) voice.
Then the KW probability from wav2vec could make the stream decision, and the audio could also be prefiltered with something like DeepFilterNet.
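
To make the satellite idea concrete, here is a minimal sketch of a gate that decides which frames get streamed to a central KWS. It assumes a plain RMS energy threshold as a stand-in for a real VAD model, plus a short hangover so the stream does not cut off mid-word. The names `frame_rms` and `vad_gate` and the threshold and hangover values are all made up for illustration.

```python
import numpy as np

def frame_rms(frame):
    """Root-mean-square level of one audio frame."""
    samples = np.asarray(frame, dtype=np.float64)
    return float(np.sqrt(np.mean(samples ** 2)))

def vad_gate(frames, threshold=0.02, hangover=10):
    """Yield (frame, streaming) pairs.

    streaming is True while the frame is above the energy threshold,
    and for `hangover` frames after the last frame that was.
    """
    remaining = 0
    for frame in frames:
        if frame_rms(frame) >= threshold:
            remaining = hangover  # voice-like energy: (re)open the gate
            yield frame, True
        elif remaining > 0:
            remaining -= 1  # keep streaming briefly after energy drops
            yield frame, True
        else:
            yield frame, False

if __name__ == "__main__":
    silent = np.zeros(320, dtype=np.float32)
    loud = np.full(320, 0.5, dtype=np.float32)
    for _, streaming in vad_gate([silent] * 3 + [loud] * 2 + [silent] * 5, hangover=3):
        print(streaming)
```

On a real satellite the frames flagged `True` would be sent over the network, and the central model would make the actual keyword decision.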

I will get back to you both about KWS when I regain some momentum, but I got lost checking the most downloaded models from Hugging Face for certain application types.