Raven in production

I tried out the code from simple_audio_tensorflow today, and it worked well! I’m planning to try it out against the Picovoice benchmark.

Have you always trained from scratch, or have you ever tried fine-tuning a pre-trained model?

@synesthesiam Always from scratch, as to be honest, without the checkpoints and such I'm not sure how to fine-tune a pretrained model.

Apologies about the code; it was purely out of interest that MFCC is now part of the TensorFlow framework. I haven't perf-tested it, and on a Pi it could be slower than Librosa, as TensorFlow can use both SIMD and GPU and the optimisation effort has likely gone there.
LinTO, who I think also have doubts about Sonopy, have created https://github.com/linto-ai/sfeatpy and I need to check that, as audio feature extraction is a heavy process, which is why a pipeline built on a single MFCC stage is of interest.
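
For reference, TensorFlow's built-in MFCC path looks roughly like this (a minimal sketch; the frame and mel parameters are my assumptions, not anything from the repo):

import tensorflow as tf

# Minimal sketch: MFCCs via tf.signal, roughly what librosa.feature.mfcc
# does, but able to run on GPU and inside a TF graph.
def tf_mfcc(waveform, sample_rate=16000, n_mfcc=13):
    # 25 ms window, 10 ms hop at 16 kHz
    stft = tf.signal.stft(waveform, frame_length=400, frame_step=160)
    spectrogram = tf.abs(stft)
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=40,
        num_spectrogram_bins=stft.shape[-1],
        sample_rate=sample_rate,
        lower_edge_hertz=20.0,
        upper_edge_hertz=4000.0)
    mel = tf.tensordot(spectrogram, mel_matrix, 1)
    log_mel = tf.math.log(mel + 1e-6)
    return tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :n_mfcc]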

I probably wouldn't use a CNN for KWS, as there are better TensorFlow Lite models, but I would use a low-latency CNN for VAD. You could check a CNN against the above, though it's probably better to compare against Raven, as in terms of WER and noise the Raven method is likely the worst of all.

I was also checking the framework and training speed, as a CRNN or DS-CNN is likely to be comparable in training; I just never got round to hacking a model out, and the CNN code was readily available.
Currently, because there is a tool for creating a GRU, it's probably easier just to test the LinTO HMG model, as it already has a mic interface where you could do some accurate real-world tests with any easily created model.
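
For reference, the kind of GRU model HMG builds is roughly this shape (a minimal sketch; the layer sizes and input shape are my guesses, not HMG's actual architecture):

import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal sketch of a GRU keyword model over MFCC frames.
# 49 frames x 13 MFCCs per 1 s window is an assumption.
def build_gru_kws(n_frames=49, n_mfcc=13, n_labels=2):
    return models.Sequential([
        layers.Input(shape=(n_frames, n_mfcc)),
        layers.GRU(64),
        layers.Dense(n_labels)  # logits: KW vs !KW
    ])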

My hacks on https://www.tensorflow.org/tutorials/audio/simple_audio with a CNN were a thought towards seeing how it copes with VAD for each frame that feeds a KWS.
It's not KW but Voice as the hit label and !Voice as the negative label; I just haven't got round to looking at the NN VAD examples I supplied or to creating a dataset, as I presume I need to get pysox and do some silence trimming and concatenation.
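
Something like this is what I had in mind (a minimal sketch along the lines of the simple_audio tutorial; the input shape and layer sizes are assumptions):

import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal sketch of a tiny CNN VAD: two labels (Voice / !Voice)
# over a spectrogram or MFCC "image" per window.
def build_cnn_vad(input_shape=(49, 13, 1)):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation='relu'),
        layers.GlobalAveragePooling2D(),
        layers.Dense(2)  # logits: Voice vs !Voice
    ])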

With LinTO also adding https://github.com/linto-ai/sfeatpy, I have that to try as well.

I would be interested if you did something 'Raven-like' with a CNN and compared it to Raven, though it's likely better to do the head-scratching to create one of the latter models from https://github.com/google-research/google-research/tree/master/kws_streaming for KWS.

Snowboy, which is in the above, is a DNN, and https://github.com/ARM-software/ML-KWS-for-MCU should give you an approximate comparison of a DNN to a CNN.
A DNN is also in the graphs of https://github.com/google-research/google-research/tree/master/kws_streaming

Raven's false alarm rate per hour is 2.06, or 20.6 per 10 hours as above, which is a state-of-the-art worst, but it's a KW gatherer that works quickly. And no, I haven't tried it against a CNN, but I expect it, or any NN above, even with relatively low sample counts of 'own voice', will exceed it greatly.
No one ever seems to have created an NN Raven-style, though likely it could be done; I have always used universal datasets for both KW & !KW.

You could probably set up a web routine ('big words on a web page') that gives guidance and records the KW, plus some words from phonetic pangrams, at 3 mic positions (0.3m, 1m, 3m), then pysox those via pitch shift and padding into a much larger dataset of KW & !KW, with noise also added to the KW, then train.

Using https://github.com/linto-ai/linto-desktoptools-hmg and a GRU for that, with its visual feedback, is likely to be less painful and easier to use for purposes of testing.

Where I got the CNN code, https://www.tensorflow.org/tutorials/audio/simple_audio, is a quick universal model on 1000 samples per label, so I am not sure what the minimum for a custom 'own voice' dataset is, but after a bit of pysox manipulation & noise addition they do quickly build up (a couple of hundred).
The main thing is it's a custom KW & !KW dataset, not a custom KW label with a universal !KW label.
Also, if both VAD & KWS use MFCC, we can supply both from the same feed and avoid separate processing, as in the sketch below.
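
Reusing the tf_mfcc sketch above (the model handles here are placeholders, not anything in the repo):

# Minimal sketch: compute MFCCs once per window and feed both models.
# vad_model and kws_model are placeholder callables returning logits.
def process_frame(audio_frame, vad_model, kws_model):
    feats = tf_mfcc(audio_frame)[None, ..., None]   # computed once
    voice_logits = vad_model(feats)
    if voice_logits[0, 1] > voice_logits[0, 0]:     # Voice beats !Voice
        return kws_model(feats)                     # KWS reuses same feats
    return None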


Talking about Raven settings, does setting minimum matches to 2 instead of 1 increase CPU load? And should I disable averaging then?
Should it work better if I lower the sensitivity, I guess?
Having three custom wake words is rather CPU intensive already.

Apologies @KiboOst, but just a last mention to @synesthesiam

Just pushed some more rough scripts to https://github.com/StuartIanNaylor/simple_audio_tensorflow

dcnn.py is the NN KWS model I was thinking about; as well as converting to TFLite, it also runs on TF4MC, as in the repo you will see dcnn.tflite
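
The conversion itself is just the standard TFLite converter (a sketch of that step, not the actual repo code; the model path is an assumption):

import tensorflow as tf

# Minimal sketch: convert a saved Keras model to a .tflite flatbuffer.
model = tf.keras.models.load_model('dcnn')  # path is an assumption
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('dcnn.tflite', 'wb') as f:
    f.write(tflite_model)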

Also, another horrid hacky script: with pysox this script will take 1 input file and create 20 variations of it, mixing pitch, tempo & padding.

audio_vary.py

import argparse

import numpy as np
import sox

parser = argparse.ArgumentParser()
parser.add_argument("--input", help="input file name")
args = parser.parse_args()

# Three shuffled offset ranges (-1.0 to 1.0 in 0.1 steps) for
# pitch, tempo and padding position.
np1 = np.arange(start=-1.0, stop=1.1, step=0.10)
np2 = np.arange(start=-1.0, stop=1.1, step=0.10)
np3 = np.arange(start=-1.0, stop=1.1, step=0.10)
np.random.shuffle(np1)
np.random.shuffle(np2)
np.random.shuffle(np3)

# Strip leading and trailing silence from the input first.
tfm = sox.Transformer()
tfm.silence(1, 0.1, 0.01)
tfm.silence(-1, 0.1, 0.01)
tfm.build_file(args.input, 'silence-strip.wav')
stat = sox.file_info.stat('silence-strip.wav')
duration = stat['Length (seconds)']

for x in range(20):
    tfm1 = sox.Transformer()
    pitch_offset = round(np1[x], 1)
    tempo_offset = round(np2[x], 1)
    pad_offset = round(np3[x], 1)

    tfm1.norm(-3)             # normalise to -3 dB
    tfm1.pitch(pitch_offset)  # shift pitch by up to +/-1 semitone

    # Pad back out to 1 second (assumes the stripped clip is < 1 s),
    # with pad_offset deciding how the padding splits start/end.
    pad = 1 - duration
    if tempo_offset < 0:
        tempo = 1 - (abs(tempo_offset) / 10)
    else:
        tempo = 1 + (tempo_offset / 10)

    if pad_offset < 0:
        startpad = abs(pad - (pad * abs(pad_offset)) / 2)
        endpad = pad - startpad
    else:
        startpad = abs(pad * pad_offset) / 2
        endpad = pad - startpad

    tfm1.tempo(tempo, 's')    # +/-10% tempo change, speech mode
    tfm1.pad(startpad, endpad)
    tfm1.trim(0, 1)           # clamp to exactly 1 second
    tfm1.build_file('silence-strip.wav', 'pp' + str(x) + '-' + args.input)

So even if you have a small number of samples you can still quickly build quite a decent dataset, as 10 recordings become 200 variations.
If 5-10 recordings were made at each of 0.3m, 1m & 3m, you would then have 300-600 KW label items.
It's just a quick hack, but it's there as a demo: if 10 recordings were made at 3 mic distances and run through training for TFLite, is that not more valid than Raven? Users can be added with no perf hit, as you just add to the dataset and retrain, and through use you can also capture data and autotrain.
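
For example, running the script over a directory of recordings (a sketch; it assumes you run it in the directory holding the wavs, since the output names just get a 'ppN-' prefix):

import glob
import subprocess

# Minimal sketch: each recording yields 20 pitch/tempo/pad variations,
# so 10 inputs become 200 files.
for wav in glob.glob('*.wav'):
    subprocess.run(['python', 'audio_vary.py', '--input', wav], check=True)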

I didn't do noise addition, but the KW samples are normalised at -3dB, so the whole dataset can be duplicated and split with mixes of noise at 33% -8dB, 33% -13dB, 33% -18dB.
The result with noise is a dataset of 600-1200 KW samples, but the steps can also be decreased and the range increased to give more.
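
Noise addition can be done with pysox as well, via a Combiner (a minimal sketch; 'noise.wav' is a placeholder file and the inputs are assumed to share sample rate and channel count):

import random
import sox

# Minimal sketch: mix a noise file under a -3 dB-normalised KW sample
# at one of three levels, chosen per file.
NOISE_LEVELS_DB = [-8, -13, -18]

def add_noise(kw_file, out_file, noise_file='noise.wav'):
    level_db = random.choice(NOISE_LEVELS_DB)
    cbn = sox.Combiner()
    # input_volumes scales each input; convert dB to a linear gain
    cbn.build([kw_file, noise_file], out_file, 'mix',
              input_volumes=[1.0, 10 ** (level_db / 20)])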
