(venv) pi@raspberrypi:~/googlekws/simple_audio_tensorflow $ python3 simple_audio_mfcc_frame_length1024_frame_step512.py
Commands: ['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
'duration.py']
Number of total examples: 10114
Number of examples per label: 1000
Example file tensor: tf.Tensor(b'data/mini_speech_commands/left/4beff0c5_nohash_1.wav', shape=(), dtype=string)
Training set size 6400
Validation set size 800
Test set size 800
Run time 0.13212920299997677
Run time 0.6761154660000557
2021-02-14 10:06:37.656765: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-02-14 10:06:37.659695: W tensorflow/core/platform/profile_utils/cpu_utils.cc:116] Failed to find bogomips or clock in /proc/cpuinfo; cannot determine CPU frequency
Input shape: (30, 13, 1)
Run time 5.733527788999936
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
normalization (Normalization (None, 30, 13, 1) 3
_________________________________________________________________
conv2d (Conv2D) (None, 28, 11, 32) 320
_________________________________________________________________
conv2d_1 (Conv2D) (None, 26, 9, 64) 18496
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 4, 64) 0
_________________________________________________________________
dropout (Dropout) (None, 13, 4, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 3328) 0
_________________________________________________________________
dense (Dense) (None, 128) 426112
_________________________________________________________________
dropout_1 (Dropout) (None, 128) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 1290
=================================================================
Total params: 446,221
Trainable params: 446,218
Non-trainable params: 3
_________________________________________________________________
Epoch 1/1000
100/100 [==============================] - 63s 601ms/step - loss: 1.8292 - accuracy: 0.3305 - val_loss: 1.0994 - val_accuracy: 0.6363
Epoch 2/1000
100/100 [==============================] - 30s 299ms/step - loss: 1.1535 - accuracy: 0.5955 - val_loss: 0.8443 - val_accuracy: 0.7175
Epoch 3/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.9190 - accuracy: 0.6740 - val_loss: 0.6876 - val_accuracy: 0.7775
Epoch 4/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.7892 - accuracy: 0.7273 - val_loss: 0.6035 - val_accuracy: 0.7987
Epoch 5/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.6533 - accuracy: 0.7606 - val_loss: 0.5486 - val_accuracy: 0.8100
Epoch 6/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.6117 - accuracy: 0.7831 - val_loss: 0.4823 - val_accuracy: 0.8500
Epoch 7/1000
100/100 [==============================] - 30s 300ms/step - loss: 0.5309 - accuracy: 0.8207 - val_loss: 0.4395 - val_accuracy: 0.8612
Epoch 8/1000
100/100 [==============================] - 30s 300ms/step - loss: 0.4771 - accuracy: 0.8333 - val_loss: 0.4316 - val_accuracy: 0.8612
Epoch 9/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.4371 - accuracy: 0.8485 - val_loss: 0.3950 - val_accuracy: 0.8763
Epoch 10/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3972 - accuracy: 0.8630 - val_loss: 0.3770 - val_accuracy: 0.8850
Epoch 11/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3604 - accuracy: 0.8745 - val_loss: 0.3590 - val_accuracy: 0.8938
Epoch 12/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3476 - accuracy: 0.8784 - val_loss: 0.3630 - val_accuracy: 0.8850
Epoch 13/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3174 - accuracy: 0.8832 - val_loss: 0.3481 - val_accuracy: 0.8888
Epoch 14/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3106 - accuracy: 0.8928 - val_loss: 0.3483 - val_accuracy: 0.9050
Epoch 15/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2803 - accuracy: 0.9049 - val_loss: 0.3573 - val_accuracy: 0.8875
Epoch 16/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2600 - accuracy: 0.9064 - val_loss: 0.3422 - val_accuracy: 0.9025
Epoch 17/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2419 - accuracy: 0.9138 - val_loss: 0.3672 - val_accuracy: 0.8900
Epoch 18/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2296 - accuracy: 0.9213 - val_loss: 0.3688 - val_accuracy: 0.8900
Epoch 19/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2125 - accuracy: 0.9234 - val_loss: 0.3620 - val_accuracy: 0.8975
Epoch 20/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1991 - accuracy: 0.9227 - val_loss: 0.3705 - val_accuracy: 0.8963
Epoch 21/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1880 - accuracy: 0.9331 - val_loss: 0.3890 - val_accuracy: 0.9000
Epoch 22/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1780 - accuracy: 0.9355 - val_loss: 0.3813 - val_accuracy: 0.9013
Epoch 23/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1744 - accuracy: 0.9380 - val_loss: 0.3512 - val_accuracy: 0.9087
Epoch 24/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1588 - accuracy: 0.9452 - val_loss: 0.3666 - val_accuracy: 0.8938
Epoch 25/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1539 - accuracy: 0.9453 - val_loss: 0.3481 - val_accuracy: 0.9025
Epoch 26/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1572 - accuracy: 0.9423 - val_loss: 0.3882 - val_accuracy: 0.9050
Epoch 00026: early stopping
Run time 856.116808603
Test set accuracy: 90%
Predictions for "no"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
'duration.py'] tf.Tensor(
[9.5505527e-07 7.6688975e-02 4.6016984e-03 8.7443608e-01 6.1507470e-09
4.1621499e-02 1.2028771e-03 7.7026243e-08 1.4477677e-03 6.7970650e-11], shape=(10,), dtype=float32)
Predictions for "hey-marvin"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
'duration.py'] tf.Tensor(
[1.4215793e-23 3.6277120e-28 7.7683000e-33 4.2175242e-28 1.0000000e+00
1.9384911e-25 4.4126632e-22 6.7758306e-20 2.7613047e-26 1.4793614e-35], shape=(10,), dtype=float32)
Predictions for "left"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
'duration.py'] tf.Tensor(
[6.2980646e-08 4.8209145e-08 7.2652750e-07 7.3871712e-07 3.9954323e-10
9.1625368e-10 9.9980372e-01 7.3275419e-06 1.8732932e-04 1.6446810e-13], shape=(10,), dtype=float32)
Predictions for "go"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
'duration.py'] tf.Tensor(
[2.7044329e-11 7.6081965e-06 3.2807577e-05 4.8282389e-03 4.0133222e-12
9.9512523e-01 5.3841272e-06 2.5338602e-09 6.5998506e-07 1.4413574e-15], shape=(10,), dtype=float32)
Run time 869.929035417
simple_audio_mfcc_frame_length1024_frame_step512.py was just a rough test I did out of interest in the new math libs in TensorFlow.
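As the filename suggests, the front end computes MFCCs with frame_length=1024 and frame_step=512 via tf.signal. A minimal sketch, assuming 16 kHz one-second clips and my own mel filter-bank settings (not from the original script), which reproduces the (30, 13, 1) input shape printed above:

```python
import tensorflow as tf

def get_mfcc(waveform, sample_rate=16000):
    # STFT with the frame settings from the script name:
    # 1 + (16000 - 1024) // 512 = 30 frames for a 1-second clip.
    stft = tf.signal.stft(waveform, frame_length=1024, frame_step=512)
    spectrogram = tf.abs(stft)

    # Map the linear spectrogram onto a mel scale. The filter-bank
    # parameters below are assumptions, not the original script's values.
    num_spectrogram_bins = stft.shape[-1]  # 1024 // 2 + 1 = 513
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=40,
        num_spectrogram_bins=num_spectrogram_bins,
        sample_rate=sample_rate,
        lower_edge_hertz=20.0,
        upper_edge_hertz=4000.0)
    mel = tf.tensordot(spectrogram, mel_matrix, 1)
    log_mel = tf.math.log(mel + 1e-6)

    # Keep the first 13 coefficients, matching the (30, 13, 1) input shape.
    mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :13]
    return mfccs[..., tf.newaxis]
```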
But the above is a CNN running on a Pi 4, and with 1,000 examples of each label it takes approx 14 minutes to train.
It uses the Keras framework with early stopping at a patience of 10, meaning if the monitored metric does not improve over a run of 10 epochs, training stops and the last best is chosen; see the sketch below.
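A minimal sketch of a model and training loop consistent with the printed summary and the early-stopping behaviour (modelled on TensorFlow's simple_audio tutorial; the dropout rates, optimizer, and the train_ds/val_ds dataset names are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Layer stack matching the summary above: Normalization -> Conv2D(32)
# -> Conv2D(64) -> MaxPooling2D -> Dropout -> Flatten -> Dense(128)
# -> Dropout -> Dense(10).
norm_layer = layers.experimental.preprocessing.Normalization()
# norm_layer.adapt(...) must be called on the training MFCCs first.

model = models.Sequential([
    layers.Input(shape=(30, 13, 1)),
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),  # dropout rates are assumptions
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(),  # assumption
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

# patience=10 on the default monitor (val_loss) is consistent with the
# log above: best val_loss at epoch 16, early stop at epoch 26.
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=1000,
    callbacks=[tf.keras.callbacks.EarlyStopping(verbose=1, patience=10)],
)
```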
It's a one-off training need, and yes that is 14 minutes of your life, but it's a lot less painful than the cost of water-cooled GPUs, and obviously it is also much faster on a desktop even without a GPU.
Once that model is trained it can be converted to TensorFlow Lite and quantised down so it's extremely efficient and fast on a Pi.
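A minimal sketch of that step, assuming dynamic-range post-training quantization and a hypothetical kws_model.tflite output name:

```python
import tensorflow as tf

# Convert the trained Keras model and apply post-training
# dynamic-range quantization to shrink the model and speed up inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('kws_model.tflite', 'wb') as f:
    f.write(tflite_model)
```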
That model can be shipped and reused, so it only needs a single training for multiple devices, and people could share models if they so wished.
It could also collect usage data, constantly add to a dataset, and retrain as a background idle task.
Its both KW & VAD as both could be neural nets which many advantages from load to accuracy that support universal models supplied or specific custom ones with NN custom VAD not only being able to distinguish voice but ‘your’ voice.
After training with TensorFlow, convert the model to TensorFlow Lite and use the TFLite runtime for inference.
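A minimal inference sketch with the tflite_runtime package (the kws_model.tflite filename and the preprocessed-MFCC input are assumptions carried over from the sketches above):

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the quantised model once at startup.
interpreter = tflite.Interpreter(model_path='kws_model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def predict(mfcc):
    # mfcc: float32 array shaped (1, 30, 13, 1), preprocessed
    # the same way as the training data.
    interpreter.set_tensor(input_details[0]['index'], mfcc.astype(np.float32))
    interpreter.invoke()
    return interpreter.get_tensor(output_details[0]['index'])[0]
```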
Official TFLite runtime releases are a bit slow, and here is another community repo.