Raven in production

I should be reporting the probability threshold (sensitivity) for each individual wake word. There may be a bug here; I will need to investigate more.

Thanks!


Iya!

I have a Pi4 as master and a Pi3B+ as satellite.
Both run Raspbian Buster, up to date, with Rhasspy 2.5.9 in Docker.
Both use Raven with 3 custom wake words, average template, min matches 1, VAD sensitivity 1 and UDP audio.

The Pi4 pops up in less than a second after saying the wake word; the Pi3 takes around 4 to 5 seconds to turn on the LEDs, play the sound and go into the listening state.

Is this normal? After dropping the Pi0, should I also drop the Pi3 for Rhasspy? :face_with_monocle:

EDIT: after some time, Raven wakes as fast as on the Pi4. Strange; maybe some loading takes more time on the Pi3.

Instead of dropping the Pi3, I need to rewrite Raven to be faster. I’ve asked around, but no one has been interested in porting it to C++ (or something like Julia).

I might get to it myself one of these days; I’m currently down the rabbit hole with speeding up Larynx’s text to speech so it’s actually usable :stuck_out_tongue:


@synesthesiam

I don’t think python-speech-features is that performant for MFCC. It’s a vague memory from Sonopy perf testing, which also compared against librosa, but I think python-speech-features was the worst.
I never did check requirements.txt for which version of python-speech-features it was, but it was the slowest.

If you created an MFCC pipeline that was first used by VAD and then passed to Raven, you would get the benefits of a neural-net VAD with less overhead than the current setup, as a low-latency CNN will likely be less load than WebRtcVad.

There is duplication in the audio processing: Raven’s MFCC and the very similar filterbanks of WebRtcVad could be computed once in an MFCC pipeline that both use.
You would get less load and a better VAD that can also be custom trained.

For MFCC, both TensorFlow and PyTorch now have built-in math libs. I haven’t done any specific perf tests, but I’m pretty sure they are much lighter.
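Something like this rough sketch of a shared tf.signal front end is what I mean; the frame sizes and mel parameters are just example values, and vad_model / kws_model are hypothetical placeholders:

import tensorflow as tf

def mfcc_frames(pcm, sample_rate=16000):
    # pcm: float32 tensor of mono samples in [-1, 1]
    stfts = tf.signal.stft(pcm, frame_length=1024, frame_step=512)
    spectrograms = tf.abs(stfts)

    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=40, num_spectrogram_bins=stfts.shape[-1],
        sample_rate=sample_rate, lower_edge_hertz=20.0, upper_edge_hertz=7600.0)
    log_mel = tf.math.log(tf.tensordot(spectrograms, mel_matrix, 1) + 1e-6)

    # keep the first 13 coefficients, computed once per audio chunk
    return tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :13]

# mfcc = mfcc_frames(audio_chunk)
# if vad_model(mfcc) says voice, hand the same mfcc to kws_model(mfcc)
# so the feature extraction is never done twice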

As for pytorch.audio, hopefully this is wrong, but last time I tested it, it used either Intel MKL or Nvidia math libs, and on ARM without a Jetson you’re stuffed.
Maybe I was compiling it wrong, but for me it kept bailing out on a Pi complaining about missing intel_mkl libs.


@synesthesiam

There are some things that still puzzle me, as Precise as a KWS is a confusion: it uses the full TensorFlow for inference rather than converting and quantising for tensorflow-lite.
In essence it’s using the full training framework for KW inference, whilst the accuracy loss is minuscule compared to the load reduction of using tensorflow-lite.

I am also completely bemused why you would port Raven to C++ or Julia, as the brightest minds of the biggest corporations and academic institutes of the world shelved the methods of Raven several years ago.
Raven is not a clever Snips method; it’s just something they adopted, as I have seen similar methods that predate the Snips era.
It’s great to be able to quickly record your voice to create an instant profile, but in terms of accuracy and load it’s a cul-de-sac, which is why that style of method has generally been abandoned.
It’s also confusing, as the full TensorFlow framework does run on a Pi, so you can train on a Pi, and a small dataset with a limited training run still beats the WER (Word Error Rate) of Raven.

You can take, say, 10 recordings of the KW and, with pitch shift, stretch, padding movement and noise, quickly create several hundred KW samples.
The time to train a TensorFlow GRU NN to beat Raven is not long at all, and from that point it’s not a cul-de-sac, as the user has the choice of more accurate, involved training.
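For a rough idea, a minimal sketch of the kind of small GRU classifier meant here, assuming 1-second clips turned into 13-coefficient MFCC frames; the layer sizes are illustrative, not Precise’s exact architecture:

import tensorflow as tf

# binary KW vs !KW classifier over MFCC frames (frames x coefficients)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 13)),           # roughly 1 s of audio as 30 MFCC frames
    tf.keras.layers.GRU(32),                          # small recurrent layer, cheap on a Pi
    tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of the keyword
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(mfcc_features, labels, epochs=..., validation_split=0.1)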

Also, training is a stepwise process that can be paused and restarted at any time, and it could be triggered as a constant idle task that uses usage data for continual improvement.

It’s absolutely batshit crazy to port Raven when the best minds have already ported the state of the art to C++ in frameworks such as TensorFlow and PyTorch.
Apols Batman, but why you would port Raven is as much of a bemusement as Precise running on the full TensorFlow whilst running on embedded Linux.

@rolyan_trauts When there is a solution that:

  • works better than Raven
  • runs on small devices
  • is as simple to use as Raven

This would make me very happy. Sadly, right now we do not have this solution, and so we work with the things we do have.
Raven has a performance problem that could probably be improved by porting a big part of it to a compiled language, so that would be an improvement over the current status quo. I personally do not think it is crazy to improve the current situation just because there might be a better solution at some point in time.


An equally easy-to-use alternative, if you’re using Node-RED, is node-red-contrib-personal-wake-word, which wraps the great node-personal-wakeword by @fastjack, which itself is based on the Snips personal wake word principles. (Disclaimer: I’m the maintainer of the Node-RED integration.)


Yes, Raven is mostly a Python port of node-personal-wakeword.


I know, just wanted to throw it in the mix :+1:t2:

That’s my point: there are solutions that work better than Raven, namely tensorflow-lite or PyTorch Mobile. They run on small devices and would work exactly the same, without the cul-de-sac of bad WER.
One thing for sure is that Raven will not run on small devices, as it would have to be ported to each specific platform, whilst the major frameworks already exist on nearly all platforms and all you have to do is create your model.

We do have this solution, we all have this solution, as the likes of Google and Facebook provide it with their frameworks; it’s just not being used.

It’s crazy, as the methods Raven uses are antiquated and there has been a whole boom of AI technology that is better and more accurate with less load.
So yeah, spend time porting something worse and ignoring what is better whilst believing that is rational, if you wish.

Mycroft Precise would be far leaner than Raven; all it needs is the model converting to tensorflow-lite and quantising down.
It will not run on TF4Micro, as RNNs such as LSTM and GRU are not in that model subset.
Other models will.

PS When it comes to small devices the ESP32-S3 software libs are looking really impressive.

@rolyan_trauts, I agree that a neural net solution is going to ultimately be the best, especially with the performance improvements TensorFlow and PyTorch have gotten on ARM (mostly arm64).

Have you specifically found a TF/PyTorch KWS project that actually trains on a dozen or so examples on a Pi 3/4, and has good performance (both CPU cost and low false alarms)? Everything I see ends up needing one or more of:

  • A large dataset (usually Google Speech Commands)
  • A good amount of “not wake word” examples too (Precise)
  • An hour or more of training for good accuracy on a desktop CPU (Precise)

Maybe Precise has gotten better recently, but training its GRU from scratch needed quite a few examples. I actually Skyped with Mycroft’s CEO and the creator of Precise last year, and they said this was the number one complaint they had with it. Of course, they’re going for “universal” vs. “personal” wake words with it mostly.

I know you were looking into Linto at one point; is this what you were able to train on a Pi quickly with few examples?


Also, if you don’t believe I’m all in on the neural networks, I’m currently installing water blocks on my new and used GPUs :stuck_out_tongue:


Yes I find their libs really impressive as well. Sadly last time I checked it was not possible to create wakewords yourself. (Things might have changed since I looked at it around a year ago)

Yeah, that is a big problem, as esp-sr, esp-who and esp-skainet are not open source, which is also batshit crazy, so we will have to wait and see.

The vector instructions on the ESP32-S3 look like they will make it capable of maybe even running inference simultaneously on both cores, and the LX7 looks like it hits and exceeds the sweet spot for edge KWS.

@synesthesiam

I think there is something hooky with their dataset or something, as with a good unique KW and the loss bias weighted towards false negatives you should easily be able to create an accurate KWS.
The problem is people believe the Google Speech Commands dataset makes a good training set. It doesn’t, as it’s specifically a testing set, to test how well a KWS can work; it contains approx 10% bad recordings, trims and padding for purposes of comparison.

That is not the point, as that dataset and Mycroft try to encompass all users, genders and accents in a single model; with Raven you don’t have to do that, and you can create a much smaller dataset from the specific samples provided.

I have stayed with TF, as PyTorch seems to be making some odd vendor lib choices that are Intel and Nvidia biased; at least pytorch.audio seems to be that way.

There is a whole repo from google of just about every KWS model you can think of.
Streaming Aware neural network models

For tensorflow-lite they all run, as TFL is 100% like-for-like; it’s tensorflow4microcontrollers that is a subset, and so far, due to its lack of recurrent layers, a DS-CNN is my only tested and working model there.
A Pi can run either full tensorflow or tensorflow-lite, and recurrent LSTM and GRU layers are supported.
But a basic CNN from the current tensorflow tutorial can outperform Raven’s WER if it’s presented with a decent dataset, which might be specific recordings or the Google Command Set with inference run on itself and bad samples pruned out.
There has also been some really bad methodology of unnormalised noise and KW mixed without any knowledge of the SNR levels it produces, and I think much of that bad dataset and training methodology has caused many of their results.
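For reference, a sketch of that tutorial-style CNN, reconstructed to match the model summary in the run further down (30x13x1 MFCC input, 10 labels); the dropout rates and optimizer settings are assumptions, and depending on the TF version the Normalization layer may live under layers.experimental.preprocessing:

import tensorflow as tf
from tensorflow.keras import layers, models

norm_layer = layers.Normalization()
# norm_layer.adapt(train_mfcc_ds)  # learn mean/variance from the training features

model = models.Sequential([
    layers.Input(shape=(30, 13, 1)),          # MFCC frames x coefficients x 1 channel
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10),                         # one logit per label, 'hey-marvin' among them
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])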

Yeah, I really like Linto’s HMG because it has a GUI with visual results and tests for false negatives/positives that actually gives you feedback on where you may be going wrong.

90% of the training of a model is eking out a 1-2% increase over what the first 10% might achieve, but on a desktop Linto HMG will produce a Google Command Set based KWS in 5-10 minutes depending on your hardware.
Keras optimises as it goes along, but mainly it keeps retraining and merely picks the lotto winner of the best results.
The Mycroft MFCC lib Sonopy is actually broken, and always has been, so it’s no wonder.
Run it yourself and play with the parameters: you can get it to beat Librosa by 400% in performance, and it’s not wonder code, it’s because it’s broken, and you had a wasted Skype call.

Precise is the most oddly named KWS, but for training you should balance your KW and !KW in number. Your KW should be fairly unique with a good number of phones; 3 is good, as ‘Hey Mycroft’, ‘Hey Google’ and ‘Alexa’ all are.
Your !KW should try to contain all the phones of your language, but if you are going to start with personal, as you do with Raven, you should be able to proceed with an extremely small dataset, and the length of training is a choice, as some might not be so bothered about those last fractions of accuracy.

The WER with Raven is really high and its ability to cope with noise is low, and you could quickly train a NN with a small dataset that would be comparable to Raven.
Read out your KW and a couple of phonetic pangrams for your language, and extract the words for the !KW.

Again it’s down to methodology, but if you are doing personal that means both KW and !KW, and with some pitch and padding shifting you can quickly make a dataset to match your KW.

(venv) pi@raspberrypi:~/googlekws/simple_audio_tensorflow $ python3 simple_audio_mfcc_frame_length1024_frame_step512.py
Commands: ['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
 'duration.py']
Number of total examples: 10114
Number of examples per label: 1000
Example file tensor: tf.Tensor(b'data/mini_speech_commands/left/4beff0c5_nohash_1.wav', shape=(), dtype=string)
Training set size 6400
Validation set size 800
Test set size 800
Run time 0.13212920299997677
Run time 0.6761154660000557
2021-02-14 10:06:37.656765: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-02-14 10:06:37.659695: W tensorflow/core/platform/profile_utils/cpu_utils.cc:116] Failed to find bogomips or clock in /proc/cpuinfo; cannot determine CPU frequency
Input shape: (30, 13, 1)
Run time 5.733527788999936
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
normalization (Normalization (None, 30, 13, 1)         3
_________________________________________________________________
conv2d (Conv2D)              (None, 28, 11, 32)        320
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 26, 9, 64)         18496
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 4, 64)         0
_________________________________________________________________
dropout (Dropout)            (None, 13, 4, 64)         0
_________________________________________________________________
flatten (Flatten)            (None, 3328)              0
_________________________________________________________________
dense (Dense)                (None, 128)               426112
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290
=================================================================
Total params: 446,221
Trainable params: 446,218
Non-trainable params: 3
_________________________________________________________________
Epoch 1/1000
100/100 [==============================] - 63s 601ms/step - loss: 1.8292 - accuracy: 0.3305 - val_loss: 1.0994 - val_accuracy: 0.6363
Epoch 2/1000
100/100 [==============================] - 30s 299ms/step - loss: 1.1535 - accuracy: 0.5955 - val_loss: 0.8443 - val_accuracy: 0.7175
Epoch 3/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.9190 - accuracy: 0.6740 - val_loss: 0.6876 - val_accuracy: 0.7775
Epoch 4/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.7892 - accuracy: 0.7273 - val_loss: 0.6035 - val_accuracy: 0.7987
Epoch 5/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.6533 - accuracy: 0.7606 - val_loss: 0.5486 - val_accuracy: 0.8100
Epoch 6/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.6117 - accuracy: 0.7831 - val_loss: 0.4823 - val_accuracy: 0.8500
Epoch 7/1000
100/100 [==============================] - 30s 300ms/step - loss: 0.5309 - accuracy: 0.8207 - val_loss: 0.4395 - val_accuracy: 0.8612
Epoch 8/1000
100/100 [==============================] - 30s 300ms/step - loss: 0.4771 - accuracy: 0.8333 - val_loss: 0.4316 - val_accuracy: 0.8612
Epoch 9/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.4371 - accuracy: 0.8485 - val_loss: 0.3950 - val_accuracy: 0.8763
Epoch 10/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3972 - accuracy: 0.8630 - val_loss: 0.3770 - val_accuracy: 0.8850
Epoch 11/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3604 - accuracy: 0.8745 - val_loss: 0.3590 - val_accuracy: 0.8938
Epoch 12/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3476 - accuracy: 0.8784 - val_loss: 0.3630 - val_accuracy: 0.8850
Epoch 13/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3174 - accuracy: 0.8832 - val_loss: 0.3481 - val_accuracy: 0.8888
Epoch 14/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3106 - accuracy: 0.8928 - val_loss: 0.3483 - val_accuracy: 0.9050
Epoch 15/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2803 - accuracy: 0.9049 - val_loss: 0.3573 - val_accuracy: 0.8875
Epoch 16/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2600 - accuracy: 0.9064 - val_loss: 0.3422 - val_accuracy: 0.9025
Epoch 17/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2419 - accuracy: 0.9138 - val_loss: 0.3672 - val_accuracy: 0.8900
Epoch 18/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2296 - accuracy: 0.9213 - val_loss: 0.3688 - val_accuracy: 0.8900
Epoch 19/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2125 - accuracy: 0.9234 - val_loss: 0.3620 - val_accuracy: 0.8975
Epoch 20/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1991 - accuracy: 0.9227 - val_loss: 0.3705 - val_accuracy: 0.8963
Epoch 21/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1880 - accuracy: 0.9331 - val_loss: 0.3890 - val_accuracy: 0.9000
Epoch 22/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1780 - accuracy: 0.9355 - val_loss: 0.3813 - val_accuracy: 0.9013
Epoch 23/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1744 - accuracy: 0.9380 - val_loss: 0.3512 - val_accuracy: 0.9087
Epoch 24/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1588 - accuracy: 0.9452 - val_loss: 0.3666 - val_accuracy: 0.8938
Epoch 25/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1539 - accuracy: 0.9453 - val_loss: 0.3481 - val_accuracy: 0.9025
Epoch 26/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1572 - accuracy: 0.9423 - val_loss: 0.3882 - val_accuracy: 0.9050
Epoch 00026: early stopping
Run time 856.116808603
Test set accuracy: 90%
Predictions for "no"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
 'duration.py'] tf.Tensor(
[9.5505527e-07 7.6688975e-02 4.6016984e-03 8.7443608e-01 6.1507470e-09
 4.1621499e-02 1.2028771e-03 7.7026243e-08 1.4477677e-03 6.7970650e-11], shape=(10,), dtype=float32)
Predictions for "hey-marvin"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
 'duration.py'] tf.Tensor(
[1.4215793e-23 3.6277120e-28 7.7683000e-33 4.2175242e-28 1.0000000e+00
 1.9384911e-25 4.4126632e-22 6.7758306e-20 2.7613047e-26 1.4793614e-35], shape=(10,), dtype=float32)
Predictions for "left"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
 'duration.py'] tf.Tensor(
[6.2980646e-08 4.8209145e-08 7.2652750e-07 7.3871712e-07 3.9954323e-10
 9.1625368e-10 9.9980372e-01 7.3275419e-06 1.8732932e-04 1.6446810e-13], shape=(10,), dtype=float32)
Predictions for "go"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
 'duration.py'] tf.Tensor(
[2.7044329e-11 7.6081965e-06 3.2807577e-05 4.8282389e-03 4.0133222e-12
 9.9512523e-01 5.3841272e-06 2.5338602e-09 6.5998506e-07 1.4413574e-15], shape=(10,), dtype=float32)
Run time 869.929035417

simple_audio_mfcc_frame_length1024_frame_step512.py was just a rough test I did out of interest in the new math libs of tensorflow.

But the above is a CNN running on a Pi4, and yeah, with 1000 of each label it takes approx 14 minutes to train.
It uses the Keras framework with early stopping on accuracy and a patience of 10, meaning if there is no improvement over a run of 10 epochs it will keep the last best result.
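Roughly, that is the standard Keras early-stopping callback; which metric it monitored isn’t certain from the log, so the monitor argument here is an assumption, and train_ds / val_ds are placeholders:

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',             # or 'val_accuracy'; patience of 10 either way
    patience=10,                    # stop if no improvement over 10 epochs
    restore_best_weights=True)      # keep the best weights seen so far

history = model.fit(train_ds, validation_data=val_ds,
                    epochs=1000, callbacks=[early_stop])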

It’s a one-off training need and yes, it’s 14 minutes of your life, but it’s a lot less painful than the cost of water-cooled GPUs, and it’s obviously also much faster on a desktop, even without a GPU.

Once that model is trained it can be converted to tensorflow-lite and quantised down, so it’s extremely efficient and fast on a Pi.
That model can be shipped and reused, so it only needs a single training for multiple devices, and people could share models if they so wished.

It could use usage data and constantly add to a dataset and retrain as a background idle task.

It’s both KW and VAD, as both could be neural nets, which has many advantages from load to accuracy; it supports supplied universal models or specific custom ones, with a custom NN VAD not only being able to distinguish voice, but ‘your’ voice.

After training with tensorflow, convert the model to tensorflow-lite and use TFL for inference.
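A minimal sketch of that step, assuming the trained Keras model from above; the dynamic-range optimisation shown is just one of the post-training quantisation options:

import numpy as np
import tensorflow as tf

# convert the trained Keras model with post-training quantisation
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
open('kws.tflite', 'wb').write(converter.convert())

# inference on the Pi with the lightweight interpreter
# (tflite_runtime.interpreter.Interpreter works the same way)
interpreter = tf.lite.Interpreter(model_path='kws.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

mfcc = np.zeros(inp['shape'], dtype=np.float32)   # one MFCC window from the feature pipeline
interpreter.set_tensor(inp['index'], mfcc)
interpreter.invoke()
scores = interpreter.get_tensor(out['index'])     # per-label scores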

Official releases are a bit slow and here is another community repo.


@synesthesiam

https://github.com/linto-ai/linto-desktoptools-hmg/tree/master got updated: they dropped Sonopy and used Librosa.
There is also talk they may adopt multiple NNs (CRNN) with it, but currently it’s a GRU like Precise, and it’s extremely interesting to see the difference that makes compared to Precise.
Also to be added are some noise-addition routines; that is currently a manual process.

I think you could install it on a Pi with a desktop, but it’s aimed at x86; I have had it working on Ubuntu 20.04, and it’s really interesting to get the visual feedback from the Google Command Set.
If you train and test and weed out bad label items, it’s also interesting to be able to click, play and listen to why the model thinks an item is bad.
You get a real feel for what can be detrimental to accuracy, and a few pruning runs will greatly increase overall accuracy.

In the Google Command Set Ver 2.0 there is only the label ‘visualise’ that is 3 phones long, but maybe my ‘Hey Marvin’ might be a good choice.
Just like the Google Command Set it needs some pruning, so don’t assume all is correct; I think somehow some are longer than 1 sec, but that’s easy to correct.

https://drive.google.com/open?id=1LFa2M_AZxoXH-PA3kTiFjamEWHBHIdaA

https://drive.google.com/open?id=1-kWxcVYr1K9ube4MBKavGFO1CFSDAWVG

Also, the raspberry samples have been added to Record 'Raspberry' & 'RaspberryPi' for a distributed dataset; it would be great if others would add a few more raspberries to the collection :slight_smile:


I tried out the code from simple_audio_tensorflow today, and it worked well! I’m planning to try it out against the Picovoice benchmark.

Have you always trained from scratch, or have you ever tried fine-tuning a pre-trained model?

@synesthesiam Always from scratch, as to be honest, without the checkpoints and stuff I’m not sure how to fine-tune a pretrained model.

Apols about the code; it was purely out of interest that MFCC is now part of the tensorflow framework, and I haven’t perf tested it. On a Pi it could be slower than Librosa, as it can use both SIMD and GPU and that is likely where the optimisation went.
Linto, who I think also have doubts about Sonopy, have created https://github.com/linto-ai/sfeatpy and I need to check that, as feature extraction of audio is a heavy process, which is why a pipeline with a single MFCC process is of interest.

I prob wouldn’t use a CNN for KWS, as there are better tensorflow-lite models, but I would use a low-latency CNN for VAD. You could check a CNN against the above, but it’s prob better to compare it to Raven, as in terms of WER and noise the Raven method is likely worst of all.

I was also checking the framework and training speed, as a CRNN or DS-CNN is likely to be comparable in training; I just never got round to hacking a model out, and the CNN code was available.
Currently, because there is a tool for creating a GRU, it’s prob easier just to test the Linto HMG model, as it already has a mic interface where you could do some accurate real-world tests with any easily created model.

My hacks on https://www.tensorflow.org/tutorials/audio/simple_audio with a CNN were a thought towards seeing how it copes with VAD for each frame that feeds a KWS.
It’s not KW but Voice as the hit label and !Voice as the !label; I just haven’t got round to looking at the NN VAD examples I supplied or creating a dataset, as I presume I need to get pysox and do some silence trimming and concatenation.

With Linto also adding https://github.com/linto-ai/sfeatpy I have that to try as well.

I would be interested if you did something ‘Raven-like’ with a CNN and compared it to Raven though, but it’s likely better to do the head scratching to create one of the later models from https://github.com/google-research/google-research/tree/master/kws_streaming for KWS.

Snowboy, which is in the above, is a DNN, and https://github.com/ARM-software/ML-KWS-for-MCU should give you an approximation of how that compares to a CNN.
A DNN is also in the graphs of https://github.com/google-research/google-research/tree/master/kws_streaming

Raven’s false alarm rate per hour is 2.06, or 20.6 per 10 hours; as above, that is a state-of-the-art worst, but it’s a KW gatherer that works quickly. No, I haven’t tried with a CNN, but I expect it, or any NN above, even with relatively low sample counts of ‘own voice’, will exceed it greatly.
No one ever seems to have created a NN Raven-style, and likely it could be done; I have always done universal models for both KW and !KW.

You could prob set up a web routine (‘big words on a web page’) that gives guidance and records the KW and some words of phonetic pangrams at 3 mic positions (0.3m, 1m, 3m), then pysox those via pitch shift and padding into a much larger dataset, but one of KW and !KW, with noise also added to the KW, then train.

Using https://github.com/linto-ai/linto-desktoptools-hmg and a GRU for that, with its visual feedback, is likely to be less painful and easier to use for purposes of testing.

The place I got the CNN code, https://www.tensorflow.org/tutorials/audio/simple_audio, is a quick universal model on 1000 samples per label, so I am not sure what the minimum for a custom ‘own voice’ model is, but after a bit of pysox manipulation and noise addition the samples do quickly build up (a couple of hundred).
The main thing is it’s a custom KW and !KW dataset, not a custom KW label with a universal !KW label.
Also, if both VAD and KWS use MFCC, we can supply both from the same feed and negate separate processing.


Talking about Raven settings, will setting minimum matches to 2 instead of 1 increase CPU load? And should I disable averaging then?
Should it work better with a lower sensitivity, I guess?
Having three custom wake words is already rather CPU intensive.

Apols @KiboOst but just a last mention to @synesthesiam

Just pushed some more rough scripts to https://github.com/StuartIanNaylor/simple_audio_tensorflow

dcnn.py is the NN KWS model I was thinking about; as well as converting to TFL, it also runs on TF4MC, as you will see from dcnn.tflite in the repo.

Also, another horrid hacky script, but with pysox this script will take 1 input file and create 20 variations of it, mixing pitch, tempo and padding.

audio_vary.py

import sox
import numpy as np
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--input", help="input file name")
args = parser.parse_args()

# 21 offsets from -1.0 to 1.0 in 0.1 steps, shuffled independently
# for pitch, tempo and padding so the variations don't correlate
np1 = np.arange(start=-1.0, stop=1.1, step=0.10)
np2 = np.arange(start=-1.0, stop=1.1, step=0.10)
np3 = np.arange(start=-1.0, stop=1.1, step=0.10)
np.random.shuffle(np1)
np.random.shuffle(np2)
np.random.shuffle(np3)

# strip leading and trailing silence once, then measure what is left
tfm = sox.Transformer()
tfm.silence(1, 0.1, 0.01)
tfm.silence(-1, 0.1, 0.01)
tfm.build_file(args.input, 'silence-strip.wav')
stat = sox.file_info.stat('silence-strip.wav')
duration = stat['Length (seconds)']

x = 0
while x < 21:
  tfm1 = sox.Transformer()
  pitch_offset = round(np1[x], 1)   # semitones, -1.0 .. 1.0
  tempo_offset = round(np2[x], 1)   # mapped to a 0.9 .. 1.1 tempo factor below
  pad_offset = round(np3[x], 1)     # how the padding is split before/after the KW

  tfm1.norm(-3)                     # normalise to -3 dB (noise can be mixed against this later)
  tfm1.pitch(pitch_offset)

  # assumes the silence-stripped KW is shorter than 1 s; pad it back out to 1 s
  pad = 1 - duration
  if tempo_offset < 0:
    tempo = 1 - (abs(tempo_offset) / 10)
  else:
    tempo = 1 + (tempo_offset / 10)

  if pad_offset < 0:
    startpad = abs(pad - (pad * abs(pad_offset)) / 2)
    endpad = pad - startpad
  else:
    startpad = abs(pad * pad_offset) / 2
    endpad = pad - startpad

  tfm1.tempo(tempo, 's')            # 's' = speech-optimised tempo stretching
  tfm1.pad(startpad, endpad)
  tfm1.trim(0, 1)                   # clamp every output to exactly 1 s
  tfm1.build_file('silence-strip.wav', 'pp' + str(x) + '-' + args.input)
  x = x + 1

So even if you have a small number of samples, you can still quickly build quite a decent dataset, as 10 recordings become 200 variations.
Prob if 5-10 recordings were made @ 0.3m, 1m and 3m you would then have 300-600 KW label items.
It’s just a quick hack, but it’s there as a demo: if 10 recordings were made at 3 mic distances and run through training for TFL, is that not more valid than Raven? Users can be added with no perf hit, as you just add to the dataset and retrain, and through use you can also capture data and autotrain.

I didn’t do noise addition, but the KW are normalised @ -3dB, so the whole dataset can be duplicated and split with mixes of noise @ 33% -8dB, 33% -13dB, 33% -18dB.
The result with noise is a dataset of 600-1200 KW, but the steps can also be decreased and the range increased to give more.
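A minimal sketch of that noise step with pysox; the file names here are hypothetical, and the noise clip is simply normalised to the target level before mixing:

import sox

def mix_noise(kw_file, noise_file, out_file, noise_db):
    # normalise the noise clip to the target level and trim it to the 1 s KW length
    tfm = sox.Transformer()
    tfm.norm(noise_db)          # e.g. -8, -13 or -18 dB against the -3 dB keyword
    tfm.trim(0, 1)
    tfm.build_file(noise_file, 'noise-tmp.wav')

    # mix the keyword clip and the levelled noise into one training sample
    cbn = sox.Combiner()
    cbn.build([kw_file, 'noise-tmp.wav'], out_file, 'mix')

# duplicate the dataset in thirds, one third at each noise level
for level in (-8, -13, -18):
    mix_noise('pp0-hey-marvin.wav', 'noise.wav',
              'noise' + str(level) + '-pp0-hey-marvin.wav', level)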
