Google KWS TFL flexdelegates custom layers

Google has created a repo of state-of-the-art NNs for KWS.

Really there are only three natural streaming KWS models (GRU, SVDF & CRNN), but I suggest a look at the CRNN.

It's all Google's work, but to make things easier it has been extracted from the google-research repo, with a little guide on how to install it on a Pi3/4 (Arm64).

The training and a simple KWS test (tfl-stream.py) are included, using a simple custom dataset rather than the universal Google command set.

I have no idea if it will detect you on your mic, as it has been trained on me with my mic, but the example is there.
It depends on how close your voice and accent are to mine, but you can look at the output it gives.
It creates a silence label so it does not need VAD, and it runs on a single core of a Pi3, which is exciting for me: finally I can run 2x instances on a Pi3 with 2x directional mics and use the best audio stream for static hardware beamforming.

Raven would be a great dataset collector for this NN, but unfortunately it only collects the keyword, and with a custom KW you really should also provide custom !kw. I guess you could use the universal Google command set items for !kw, but as said your own would be much better, especially for silence detection, where we can do something better and detect when you are not speaking rather than just silence.

Pi3b (!+) KWS running


1xPi3b 2x KWS running

The current input uses sounddevice, which delivers an array shaped (frames, channels), but strangely the model wants the axes swapped to (1, 320).

import sounddevice as sd
import numpy as np
import tensorflow as tf

# Parameters
rec_duration = 0.020
sample_rate = 16000
num_channels = 1

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="/home/pi/google-kws/models2/crnn_state/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

last_argmax = 0
out_max = 0
hit_tensor = []
inputs = []
for s in range(len(input_details)):
  inputs.append(np.zeros(input_details[s]['shape'], dtype=np.float32))
    
def sd_callback(rec, frames, time, status):

    global last_argmax
    global out_max
    global hit_tensor
    global inputs

    # Notify if errors
    if status:
        print('Error:', status)

    # sounddevice delivers (frames, channels); the model expects (1, 320)
    rec = np.reshape(rec, (1, 320))
    
    # Make prediction from model
    interpreter.set_tensor(input_details[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details)):
      interpreter.set_tensor(input_details[s]['index'], inputs[s])
  
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs[s] = interpreter.get_tensor(output_details[s]['index'])
      
    out_tflite_argmax = np.argmax(output_data)
    if last_argmax == out_tflite_argmax:
      if output_data[0][out_tflite_argmax] > out_max:
        out_max = output_data[0][out_tflite_argmax]
        hit_tensor = output_data
    else:
      print(last_argmax, out_max, hit_tensor)
      out_max = 0
    
    last_argmax = out_tflite_argmax
    


# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    while True:
        pass

The above is just a pure hack by me, and I presume a more optimised, pythonic script can be made.
Really I shouldn't need to do rec = np.reshape(rec, (1, 320)), but I guess the overhead is low in comparison to inference.
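
For what it's worth, a throwaway micro-benchmark of mine suggests the reshape is just a view, not a copy, so it should be noise next to inference (names and numbers are mine):

import numpy as np
import timeit

# np.reshape on a contiguous (320, 1) float32 block returns a view
rec = np.zeros((320, 1), dtype=np.float32)
t = timeit.timeit(lambda: np.reshape(rec, (1, 320)), number=100000)
print(f"{t / 100000 * 1e6:.2f} us per reshape")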


The current model is low latency, with a 320-sample chunk size vs Precise's 2048, so it is doing 6.4x more inferences per second than Precise.
I will have a go at reordering the top layers for a 1920 chunk size so it is comparable to what Precise does.
I am having a rest for a while, as to be honest tensorflow twists my brain. The upper streaming layers seem to be static in their parameters, but I will have a look at providing 1920, 960 & 320 chunk sizes, which should greatly affect latency and load.
320 is as-is, with the lowest latency.
I will also swap the training around so that the array is (channels, frames), so it does not need to be reshaped.
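
A quick sanity check of that arithmetic at 16kHz, as a throwaway snippet of mine:

# inferences per second = sample rate / chunk size
sample_rate = 16000
for chunk in (320, 960, 1920, 2048):
    print(chunk, sample_rate / chunk, "inferences/sec")
# 320 -> 50.0/s, 960 -> ~16.7/s, 1920 -> ~8.3/s, 2048 (Precise) -> ~7.8/s
# and 2048 / 320 = 6.4, hence the 6.4x figure.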

@koan If you are having a play, have a look at the outputs of the crnn-state model: with the vectors of _silence, kw, !kw on a 20ms timebase we should be able to have a super-accurate, very low latency KWS.

I have been wondering if the tensor should be fed into a fuzzy-logic argument, as I have been puzzling over the best and most optimised way to process the tensor envelope returned.
If you have any ideas, post them, as my math is about as good as my health.
I guess it's just the sum of the difference between the KW vector and the !KW vector (kw - !kw), but I'm thinking the timescale of inference is also good data, and maybe something more elegant is possible.
The sum of the difference is probably enough, and as usual I am overthinking, as the threshold with that is just a simple static variable.
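
A minimal sketch of that sum-of-difference idea, assuming the label order (_silence, kw, !kw) from the training above; the threshold and names are mine, purely illustrative:

import numpy as np

KW_THRESHOLD = 0.5  # simple static threshold, tune to taste

def kw_score(output_data):
    # output_data is the (1, num_labels) tensor from the interpreter,
    # assumed ordered (_silence, kw, !kw)
    _silence, kw, not_kw = output_data[0][:3]
    return kw - not_kw

def is_hit(output_data):
    return kw_score(output_data) > KW_THRESHOLD

That keeps the decision to a single static variable, as above; the inference timescale could be layered on later as a debounce.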

If you want to test one of the non-stream models you can try this.

import sounddevice as sd
import numpy as np
import tensorflow as tf


# Parameters
rec_duration = 0.5
sample_rate = 16000
num_channels = 1

sd.default.never_drop_input = False
sd.default.latency = ('high', 'high')
sd.default.dtype = ('float32', 'float32')
sd.default.device = 'cap1'

# Sliding window
window = np.zeros((int(rec_duration * sample_rate) * 2), np.float32)

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="/home/pi/google-kws/tensorflow-lite/cnn/quantize_opt_for_size_tflite_non_stream/non_stream.tflite")
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

last_argmax = 0
out_max = 0
hit_tensor = []
inputs = []
for s in range(len(input_details)):
  inputs.append(np.zeros(input_details[s]['shape'], dtype=np.float32))
    
def sd_callback(rec, frames, time, status):

    global last_argmax
    global out_max
    global hit_tensor
    global inputs

    
    # Notify if errors
    if status:
        print('Error:', status)
    
    rec = np.squeeze(rec)
    # Save recording onto sliding window
    window[:len(window)//2] = window[len(window)//2:]
    window[len(window)//2:] = rec[:]
    chunk = np.reshape(window, (1, 16000)) 
    
    # Make prediction from model
    interpreter.set_tensor(input_details[0]['index'], chunk)
    # set input states (index 1...) - a no-op here, as the non-stream model has a single input
    for s in range(1, len(input_details)):
      interpreter.set_tensor(input_details[s]['index'], inputs[s])
  
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs[s] = interpreter.get_tensor(output_details[s]['index'])
      
    out_tflite_argmax = np.argmax(output_data)
    out_max = output_data[0][out_tflite_argmax]
    hit_tensor = output_data[0]
    print(out_tflite_argmax, out_max, hit_tensor)
    
# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    while True:
        pass

I was wondering how much of the load was inference and how much was the MFCC calc in the google example.
Apologies once more about the code; if something is stupid, as it probably is, then shout.
I will just do a single MFCC calc and reuse it for inference on a 20ms loop.

You can create the model with:

parser.add_argument(
    '--preprocess',
    type=str,
    default='raw',
    help='Supports raw, mfcc, micro as input features for neural net '
         'raw - model is built end to end '
         'mfcc - model divided into mfcc feature extractor and neural net. '
         'micro - model divided into micro feature extractor and neural net. '
         'if mfcc/micro is selected user has to manage speech feature extractor '
         'and feed extracted features into neural net on device.')

So we will reuse the same model without the MFCC calc overhead.

import tensorflow.compat.v1 as tf
import sounddevice as sd
import numpy as np
import sfeatpy
import time

rd_signal = np.random.random(320)

# Parameters
rec_duration = 0.020
num_channels = 1
sd.default.device = 'cap1'

sample_rate = 16000
window_length = 320
window_stride = 160
fft_size = 1024
min_freq = 120
max_freq = 7800
num_filter = 40
num_coef = 20
windowFun = 1
preEmp = None
keep_first_value = False

res = sfeatpy.mfcc(rd_signal,           # audio signal
                   sample_rate,         # sample_rate -- Audio sampling rate (default 16000)  
                   window_length,       # window_length -- window size in sample (default 1024)  
                   window_stride,       # window_stride -- window stride in sample (default 512)  
                   fft_size,            # fft_size -- fft number of points (default 1024) 
                   min_freq,            # min_freq -- minimum frequency in hertz (default 20) 
                   max_freq,            # max_freq -- maximum frequency in hertz (default 7000) 
                   num_filter,          # num_filter -- number of MEL bins (default 40) 
                   num_coef,            # num_coef -- number of output coeficients (default 20) 
                   windowFun,           # windowFun -- window function: 0- None | 1- hamming (default 0) 
                   preEmp,              # preEmp -- preEmphasis factor ignored on None (default 0.97) 
                   keep_first_value     # keep_first_value -- if False discard first MFCC value (default False)
                   )
print(res.shape)

res = np.reshape(res, (1, 1, 20))

print(res.shape)

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="/home/pi/google-kws/tensorflow-lite/crnn/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

last_argmax = 0
out_max = 0
hit_tensor = []
inputs = []
for s in range(len(input_details)):
  inputs.append(np.zeros(input_details[s]['shape'], dtype=np.float32))

starttime = time.time()
while True:
    # Make prediction from model
    interpreter.set_tensor(input_details[0]['index'], res.astype(np.float32))
    # set input states (index 1...)
    for s in range(1, len(input_details)):
      interpreter.set_tensor(input_details[s]['index'], inputs[s])
  
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs[s] = interpreter.get_tensor(output_details[s]['index'])
      
    out_tflite_argmax = np.argmax(output_data)
    out_max = output_data[0][out_tflite_argmax]
    hit_tensor = output_data
    print(last_argmax, out_max, hit_tensor)
        
    last_argmax = out_tflite_argmax
    
    time.sleep(0.02 - ((time.time() - starttime) % 0.02))

Is that right?

! Librosa test :slight_smile:

Always a bit of a pain, as librosa needs numba, numba needs llvmlite, llvmlite needs LLVM, and Raspbian's version is too old.
Head to https://apt.llvm.org/
sudo apt-get install python3-sklearn python3-sklearn-lib

wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
sudo ./llvm.sh 9
export LLVM_CONFIG=/usr/bin/llvm-config-9
pip install librosa

fingers crossed :slight_smile:
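
For reference, a minimal sketch of the kind of librosa MFCC call I was testing; the parameters roughly mirror the MFCC settings above, so treat them as assumptions:

import numpy as np
import librosa

window = np.random.random(16000).astype(np.float32)  # 1s of audio at 16kHz

# 13 coefficients from 40 mel bins, 20ms window / 10ms hop, 60-7600Hz band
mfccs = librosa.feature.mfcc(y=window, sr=16000, n_mfcc=13,
                             n_fft=1024, win_length=320, hop_length=160,
                             n_mels=40, fmin=60, fmax=7600)
print(mfccs.shape)  # (13, frames)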

It doesn't really cope, as it is losing frames, but it ain't bad; my rough tensorflow script seemed to lose fewer frames but caused more load. Maybe a different audio framework than sounddevice is needed?

import tensorflow as tf
import sounddevice as sd


rec_duration = 0.020
sample_rate = 16000
num_channels = 1
sd.default.never_drop_input= False
sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')
sd.default.device = 'cap1'


def get_spectrogram(waveform):
  sample_rate = 16000.0
  waveform = tf.squeeze(waveform)
  # Padding for files with less than 16000 samples
  zero_padding = tf.zeros([320] - tf.shape(waveform), dtype=tf.float32)

  # Concatenate audio with padding so that all audio clips will be of the 
  # same length
  waveform = tf.cast(waveform, tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)
  spectrogram = tf.signal.stft(equal_length, frame_length=320, frame_step=80)

  spectrogram = tf.abs(spectrogram)
  
  # Warp the linear scale spectrograms into the mel-scale.
  num_spectrogram_bins = spectrogram.shape[-1]
  lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 80
  linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz,  upper_edge_hertz)
  mel_spectrogram = tf.tensordot(spectrogram, linear_to_mel_weight_matrix, 1)
  mel_spectrogram.set_shape(spectrogram.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:]))

  # Compute a stabilized log to get log-magnitude mel-scale spectrograms.
  log_mel_spectrogram = tf.math.log(mel_spectrogram + 1e-6)

  # Compute MFCCs from log_mel_spectrograms and take the first 13.
  mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrogram)[..., :13]

  return mfccs


def sd_callback(rec, frames, time, status):
    
    # Notify if errors
    if status:
        print('Error:', status)
        
    mfcc = get_spectrogram(rec)

# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    while True:
        pass

So yeah, most of the load is actually the MFCC calc.

@synesthesiam

There is something really screwy going on, maybe 64-bit Arm and this Pi3?

import tensorflow as tf
import sounddevice as sd
import numpy as np

rec_duration = 0.25
sample_rate = 16000.0
num_channels = 1

def get_mfcc(waveform):

  waveform = tf.squeeze(waveform, axis=1)
  spectrogram = tf.signal.stft(waveform, frame_length=2048, frame_step=512, window_fn=tf.signal.hann_window)

  spectrogram = tf.abs(spectrogram)
  
  # Warp the linear scale spectrograms into the mel-scale.
  num_spectrogram_bins = spectrogram.shape[-1]
  lower_edge_hertz, upper_edge_hertz, num_mel_bins = 60.0, 7600.0, 40
  linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz,  upper_edge_hertz)
  mel_spectrogram = tf.tensordot(spectrogram, linear_to_mel_weight_matrix, 1)
  mel_spectrogram.set_shape(spectrogram.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:]))

  # Compute a stabilized log to get log-magnitude mel-scale spectrograms.
  log_mel_spectrogram = tf.math.log(mel_spectrogram + 1e-6)

  # Compute MFCCs from log_mel_spectrograms and take the first 13.
  mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrogram)[..., :13]

  return mfccs



def sd_callback(rec, frames, time, status):
    
    # Notify if errors
    if status:
        print('Error:', status)
    if not hasattr(sd_callback, "counter"):
         sd_callback.counter = 0
         sd_callback.buffer = [np.empty([13,8]), np.empty([13,8]), np.empty([13,8]), np.empty([13,8])]
    
    #sd_callback.buffer[sd_callback.counter] = get_mfcc(rec)[:]
    sd.wait()
    #print(sd_callback.buffer[sd_callback.counter])
    sd_callback.counter += 1
    if sd_callback.counter == 4:
        sd_callback.counter = 0
        mfccs = np.concatenate((sd_callback.buffer[0], sd_callback.buffer[1],sd_callback.buffer[2],sd_callback.buffer[3]), axis=1)
        #print(mfccs, mfccs.shape)

# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    while True:
        pass

I have commented out the MFCC call and still have 100% load.

It’s your main loop (while True) causing the “load”. You can either do a sleep or this:

import threading

# Replace loop
# while True:
#    pass
# with this
threading.Event().wait()

Librosa is still pretty stink, but I was thinking it's possible to do it that way, or even on a lesser 0.5 sec scale.

No inference there, but actually that should run pretty smoothly.

I can now run some tests on a simple CNN with XNNPACK and stuff.

Sort of weird, as external MFCC & inference looks like it will run faster than their model with raw built in.

I had to check a couple of times that this was not the Pi4, but the audio_ops seem to provide the best perf, and the tensorflow stuff is far faster than librosa.

import tensorflow.compat.v1 as tf
from tensorflow.python.ops import gen_audio_ops as audio_ops
import sounddevice as sd
import numpy as np
import threading

rec_duration = 0.25
sample_rate = 16000
num_channels = 1

def get_mfcc(waveform):
        # Run the spectrogram and MFCC ops to get a 2D audio: Short-time FFTs
        # background_clamp dims: [time, channels]
        sample_rate = 16000

        spectrogram = audio_ops.audio_spectrogram(
            waveform,
            window_size=1366,
            stride=342)
        # spectrogram: [channels/batch, frames, fft_feature]

        # extract mfcc features from spectrogram by audio_ops.mfcc:
        # 1 Input is spectrogram frames.
        # 2 Weighted spectrogram into bands using a triangular mel filterbank
        # 3 Logarithmic scaling
        # 4 Discrete cosine transform (DCT), return lowest dct_coefficient_count
        mfccs = audio_ops.mfcc(
            spectrogram=spectrogram,
            sample_rate=sample_rate,
            upper_frequency_limit=7600,
            lower_frequency_limit=60,
            filterbank_channel_count=40,
            dct_coefficient_count=13)
        # mfcc: [channels/batch, frames, dct_coefficient_count]
        # remove channel dim
        mfccs = tf.squeeze(mfccs, axis=0)
        return mfccs



def sd_callback(rec, frames, time, status):
    
    # Notify if errors
    if status:
        print('Error:', status)
    if not hasattr(sd_callback, "counter"):
         sd_callback.counter = 0
         sd_callback.buffer = [np.empty([8,13]), np.empty([8,13]), np.empty([8,13]), np.empty([8,13])]  # (frames, coeffs) per 0.25s chunk
    
    sd_callback.buffer[sd_callback.counter] = get_mfcc(rec)[:]
    sd.wait()
    
    print(sd_callback.buffer[sd_callback.counter],sd_callback.buffer[sd_callback.counter].shape )
    sd_callback.counter += 1
    if sd_callback.counter == 4:
        sd_callback.counter = 0
        mfccs = np.concatenate((sd_callback.buffer[0], sd_callback.buffer[1], sd_callback.buffer[2], sd_callback.buffer[3]), axis=0)  # stack along time -> (32, 13)
        print(mfccs, mfccs.shape)

# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    threading.Event().wait()

I hacked that from google-kws and the ops it builds into the model, which obviously you can call directly.
It actually produces a slightly different MFCC than tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrograms, name=None).
I have been having a terrible time with external MFCC, as any difference is huge at inference.
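
A quick sketch of how I would compare the two front ends on the same signal (my own check, not from the repo); the size of the difference is what matters:

import numpy as np
import tensorflow as tf
from tensorflow.python.ops import gen_audio_ops as audio_ops

wav = np.random.random((16000, 1)).astype(np.float32)

# google-kws style: audio_ops pipeline
spec = audio_ops.audio_spectrogram(wav, window_size=320, stride=160)
mfcc_ops = audio_ops.mfcc(spectrogram=spec, sample_rate=16000,
                          upper_frequency_limit=7600, lower_frequency_limit=60,
                          filterbank_channel_count=40, dct_coefficient_count=13)

# tf.signal style: stft -> mel -> log -> dct
stft = tf.signal.stft(tf.squeeze(wav, axis=1), frame_length=320, frame_step=160)
mag = tf.abs(stft)
mel_w = tf.signal.linear_to_mel_weight_matrix(40, mag.shape[-1], 16000, 60.0, 7600.0)
log_mel = tf.math.log(tf.tensordot(mag, mel_w, 1) + 1e-6)
mfcc_sig = tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :13]

# these will not match exactly: different windowing, scaling and log floors
print(np.max(np.abs(mfcc_ops[0].numpy() - mfcc_sig.numpy())))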

But wow, that is low load. I guess I have to test the models again with threading.

OK, so apologies about the last couple of days, as I just have not been with it, but I have some really excellent load parameters to pass on.

1st is a Pi3B(!+) running a tensorflow-lite model streaming 16kHz with 20ms samples.

And here it is running the non-stream model with 0.5s samples at 16kHz.

Now the mighty PiZero, and yes, it can stream 16kHz @ 20ms.

I forgot to do asound.conf but it seemed to work (seriously, I can do no right at the moment!).

pcm.!default {
  type asym
  playback.pcm "play"
  capture.pcm "cap"
}

pcm.play {
  type plug
  slave {
    pcm "plughw:2,0"
  }
}

pcm.cap {
  type plug
  slave {
    pcm "plughw:2,0"
  }
}

defaults.pcm.rate_converter "speexrate"

speexrate is the lowest quality, highest performance sample rate converter on Linux, and when my memory returns I will try to use it :slight_smile:

The above is my default setting. I keep forgetting to say that plughw rather than hw does have overhead, but most cards are 44.1kHz and many don't do 16kHz, and you will get problems if you specify a hardware device instead of a software one.
sudo apt-get install libasound2-plugins, as it is not installed as standard.
I have no idea why I had to set the default sd.device to 'cap' on the Zero? Plughw:1, doh! again, but that might even make things worse?
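
For example, a hedged sketch of a dedicated 16kHz capture PCM that resamples through the speex converter (the card number and its 48kHz native rate are assumptions, adjust to yours):

pcm.cap16k {
  type rate
  slave {
    pcm "plughw:1,0"
    rate 48000
  }
  converter "speexrate"
}

Opening cap16k at 16kHz should then go through speexrate rather than the default converter.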

Strange, as it seems to be the same, or even more.

Just some mfcc tests on the Pi0

import tensorflow.compat.v1 as tf
#import tensorflow as tf
from tensorflow.python.ops import gen_audio_ops as audio_ops
import sounddevice as sd
import threading

#tf.compat.v1.disable_eager_execution()
rec_duration = 0.020
sample_rate = 16000
num_channels = 1

sd.default.never_drop_input= False
sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')

def get_mfcc(waveform):
        # Run the spectrogram and MFCC ops to get a 2D audio: Short-time FFTs
        # background_clamp dims: [time, channels]
        

        spectrogram = audio_ops.audio_spectrogram(
            waveform,
            window_size=320,
            stride=160)
        # spectrogram: [channels/batch, frames, fft_feature]

        # extract mfcc features from spectrogram by audio_ops.mfcc:
        # 1 Input is spectrogram frames.
        # 2 Weighted spectrogram into bands using a triangular mel filterbank
        # 3 Logarithmic scaling
        # 4 Discrete cosine transform (DCT), return lowest dct_coefficient_count
        mfccs = audio_ops.mfcc(
            spectrogram=spectrogram,
            sample_rate=sample_rate,
            upper_frequency_limit=7600,
            lower_frequency_limit=60,
            filterbank_channel_count=40,
            dct_coefficient_count=13)
        # mfcc: [channels/batch, frames, dct_coefficient_count]
        # remove channel dim
        mfccs = tf.squeeze(mfccs, axis=0)
        return mfccs

def sd_callback(rec, frames, time, status):

    # Notify if errors
    if status:
        print('Error:', status)
    mfcc = get_mfcc(rec)[:]
    sd.wait()
    #print(mfcc)
    
# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    threading.Event().wait()

import tensorflow.compat.v1 as tf
#import tensorflow as tf
from tensorflow.python.ops import gen_audio_ops as audio_ops
import sounddevice as sd
import threading
import numpy as np

#tf.compat.v1.disable_eager_execution()
rec_duration = 0.25
sample_rate = 16000
num_channels = 1

sd.default.never_drop_input= False
sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')

def get_mfcc(waveform):
        # Run the spectrogram and MFCC ops to get a 2D audio: Short-time FFTs
        # background_clamp dims: [time, channels]
        

        spectrogram = audio_ops.audio_spectrogram(
            waveform,
            window_size=1366,
            stride=342)
        # spectrogram: [channels/batch, frames, fft_feature]

        # extract mfcc features from spectrogram by audio_ops.mfcc:
        # 1 Input is spectrogram frames.
        # 2 Weighted spectrogram into bands using a triangular mel filterbank
        # 3 Logarithmic scaling
        # 4 Discrete cosine transform (DCT), return lowest dct_coefficient_count
        mfccs = audio_ops.mfcc(
            spectrogram=spectrogram,
            sample_rate=sample_rate,
            upper_frequency_limit=7600,
            lower_frequency_limit=60,
            filterbank_channel_count=40,
            dct_coefficient_count=13)
        # mfcc: [channels/batch, frames, dct_coefficient_count]
        # remove channel dim
        mfccs = tf.squeeze(mfccs, axis=0)
        return mfccs

def sd_callback(rec, frames, time, status):

    # Notify if errors
    if status:
        print('Error:', status)
    if not hasattr(sd_callback, "counter"):
         sd_callback.counter = 0
         sd_callback.buffer = [np.empty([8,13]), np.empty([8,13]), np.empty([8,13]), np.empty([8,13])]
    
    mfcc = get_mfcc(rec)[:]
    print(mfcc.shape, mfcc)
    sd_callback.buffer[sd_callback.counter] = mfcc
    sd.wait()

    #print(sd_callback.buffer[sd_callback.counter])
    sd_callback.counter += 1
    if sd_callback.counter == 4:
        sd_callback.counter = 0
        mfccs = np.concatenate((sd_callback.buffer[0], sd_callback.buffer[1], sd_callback.buffer[2], sd_callback.buffer[3]), axis=0)  # stack along time -> (32, 13)
        #print(mfccs, mfccs.shape)
    
# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    threading.Event().wait()

There is a way with the Pi Zero to not use eager execution and to use OpenBLAS… which is faster.
I guess it depends on the wheel you install and how it has been set up.
Also, the optimisation flags were set to default, so it balances latency with size; you might get a bit more if optimised purely for latency.
If you uncomment #tf.compat.v1.disable_eager_execution() then you get complaints about

" a NumPy call, which is not supported".format(self.name))
NotImplementedError: Cannot convert a symbolic Tensor (strided_slice_3:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported

Which is probably due to the MFCC python.ops call and not a tf.signal MFCC.
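
If I understand it right, with eager disabled you have to build the graph once and run it through a session to get numpy out, rather than slicing the symbolic tensor directly. A minimal sketch of that (my assumption, untested on the Zero):

import numpy as np
import tensorflow.compat.v1 as tf
from tensorflow.python.ops import gen_audio_ops as audio_ops

tf.disable_eager_execution()

# Build the MFCC graph once, with a placeholder for a 20ms (320 sample) block.
waveform_ph = tf.placeholder(tf.float32, shape=[320, 1])
spectrogram = audio_ops.audio_spectrogram(waveform_ph, window_size=320, stride=160)
mfccs = audio_ops.mfcc(spectrogram=spectrogram, sample_rate=16000,
                       upper_frequency_limit=7600, lower_frequency_limit=60,
                       filterbank_channel_count=40, dct_coefficient_count=13)

sess = tf.Session()
rec = np.random.random((320, 1)).astype(np.float32)
out = sess.run(mfccs, feed_dict={waveform_ph: rec})  # plain numpy, no symbolic slicing
print(out.shape)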

A last bit of checking: the Pi3 above was running Aarch64, whilst this is the same Pi3, exact same model and script, running 200% slower on armv7.

So I did hear 2-3x, but here it does look like at least 2x, going by the approximate load.

So again, running the same non-stream script and model, here it is near 3x, so yeah, it seems true that tensorflow lite on aarch64 is 2-3x faster than on armv7.

A final one, just a test on a Pi3A+ (my pick for a satellite); it looks about right, as the + has a marginal gain on the 3B I tested earlier.
Not going to bother with the non-stream, as before the load was nothing :slight_smile:
