Google has created a repo of state-of-the-art neural networks for KWS (keyword spotting).
Really there are only three naturally streaming KWS models (GRU, SVDF & CRNN), but I suggest a look at the CRNN.
It's all Google stuff, but to make things easier it has been extracted from the google-research repo, along with a little guide on how to install on a Pi3/4 (Arm64).
The training and a simple KWS test (tfl-stream.py) are included, using a simple custom dataset rather than the universal Google command set.
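For reference, as far as I can tell the kws_streaming training uses the same folder-per-label layout as the Google speech commands set, so a custom dataset is just directories of wav clips (keyword, !kw, _background_noise_ and so on). A quick sketch for checking one, with a made-up path:

from pathlib import Path

# Hypothetical path - point this at your own data_dir with one folder per label.
data_dir = Path("/home/pi/dataset")
for label_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    wavs = list(label_dir.glob("*.wav"))
    print(f"{label_dir.name}: {len(wavs)} clips")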
I have no idea if it will detect you on your mic, as it has been trained on me with my mic, but the example is there.
It depends on how close your voice and accent are to mine, but you can look at the output it gives.
It creates a silence label so it does not need VAD, and it runs on a single core of a Pi3, which is exciting for me as finally I can run 2x instances on a Pi3 with 2x directional mics and use the best audio stream as a form of static hardware beamforming.
Raven would be a great dataset collector for this NN, but unfortunately it only collects the keyword, and really with a custom KW you should also provide custom !kw samples. I guess you could use the universal Google command set items for !kw, but as said your own would be much better, especially for silence detection, as we can do something better and detect when you are not speaking rather than just silence.
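As a rough sketch of why no VAD is needed: the model just gives a score per label, and one of those labels is the silence/!kw class, so the prediction is an argmax over the scores. The label names below are made up, and the order has to match whatever the model was trained with (in the speech-commands-style pipeline _silence_ and _unknown_ come first):

import numpy as np

# Hypothetical label list - must match the order used at training time.
labels = ['_silence_', '_unknown_', 'raspberry']

# output_data is the (1, num_labels) score tensor from the streaming model.
output_data = np.array([[0.05, 0.10, 0.85]], dtype=np.float32)

idx = int(np.argmax(output_data))
if labels[idx] in ('_silence_', '_unknown_'):
    print('not the keyword - nothing to do, no VAD required')
else:
    print('hit:', labels[idx], float(output_data[0][idx]))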
Pi3b (not the +) with 1x KWS running
1x Pi3b with 2x KWS running
The current input is using sounddevice, whose callback delivers an array shaped (frames, channels), but the model wants the axes swapped to (1, 320).
import numpy as np
import sounddevice as sd
import tensorflow as tf

# Assumed audio settings: 16 kHz mono with 20 ms blocks (320 samples) to match
# the model's (1, 320) streaming input.
sample_rate = 16000
rec_duration = 0.020
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="/home/pi/google-kws/models2/crnn_state/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

last_argmax = 0
out_max = 0
hit_tensor = []

# Input 0 is the audio block; inputs 1..n hold the external streaming state,
# initialised to zeros and carried over between inference calls.
inputs = []
for s in range(len(input_details)):
    inputs.append(np.zeros(input_details[s]['shape'], dtype=np.float32))

def sd_callback(rec, frames, time, status):
    global last_argmax
    global out_max
    global hit_tensor
    global inputs

    # Notify if errors
    if status:
        print('Error:', status)

    # sounddevice delivers (frames, channels); the model wants (1, 320).
    rec = np.reshape(rec, (1, 320))

    # Make prediction from model
    interpreter.set_tensor(input_details[0]['index'], rec)
    # Set input states (index 1...)
    for s in range(1, len(input_details)):
        interpreter.set_tensor(input_details[s]['index'], inputs[s])

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])

    # Get output states and set them back as input states,
    # which will be fed in the next inference cycle.
    for s in range(1, len(input_details)):
        # The function `get_tensor()` returns a copy of the tensor data.
        # Use `tensor()` in order to get a pointer to the tensor.
        inputs[s] = interpreter.get_tensor(output_details[s]['index'])

    # Track the running maximum score for the current label and print a hit
    # summary whenever the argmax label changes.
    out_tflite_argmax = np.argmax(output_data)
    if last_argmax == out_tflite_argmax:
        if output_data[0][out_tflite_argmax] > out_max:
            out_max = output_data[0][out_tflite_argmax]
            hit_tensor = output_data
    else:
        print(last_argmax, out_max, hit_tensor)
        out_max = 0

    last_argmax = out_tflite_argmax

# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    while True:
        pass
The above is just a pure hack by me, and I presume a more optimised, Pythonic script can be made.
But really I shouldn't need to do rec = np.reshape(rec, (1, 320)), but I guess the overhead is low in comparison to inference.
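For what it's worth, on the contiguous buffer sounddevice hands over, the reshape should just return a view rather than copy the 320 samples, so the cost is negligible next to invoke(). A quick sketch to confirm that, assuming the same block shape:

import numpy as np

rec = np.zeros((320, 1), dtype=np.float32)  # shape the sounddevice callback delivers
flat = np.reshape(rec, (1, 320))            # same data viewed as (1, 320)

# If no copy was made, the two arrays share memory.
print(np.shares_memory(rec, flat))          # True for a contiguous buffer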