Yeah, I could not work the Vietnam thing out, so not being Vietnamese I wasn't so bothered.
Not bothered by that, but I was really curious about the performance figures.
When you're throwing around loads of tensors, the wider the data bus, the more you can handle simultaneously.
Armv7 -> Aarch64 really is a 2-3x perf improvement with TF, as it has all been optimised for 64-bit; being predominantly math libs, it is simply faster that way.
When I have been playing with that Google-KWS it gives accuracy results for TF vs TFL, and the speed increase of a TFL quantised model is also really big.
I0329 19:49:03.096672 139843685021504 test.py:495] tf test accuracy, stream model state external = 100.00% 200 out of 609
I0329 19:49:40.055655 139843685021504 test.py:495] tf test accuracy, stream model state external = 100.00% 400 out of 609
I0329 19:50:17.611867 139843685021504 test.py:495] tf test accuracy, stream model state external = 100.00% 600 out of 609
I0329 19:50:19.069626 139843685021504 test.py:500] TF Final test accuracy of stream model state external = 100.00% (N=609)
INFO: TfLiteFlexDelegate delegate: 2 nodes delegated out of 34 nodes with 1 partitions.
I0329 19:52:51.021229 139843685021504 test.py:619] tflite test accuracy, stream model state external = 100.000000 200 out of 609
I0329 19:52:55.191242 139843685021504 test.py:619] tflite test accuracy, stream model state external = 100.000000 400 out of 609
I0329 19:52:59.372943 139843685021504 test.py:619] tflite test accuracy, stream model state external = 100.000000 600 out of 609
I0329 19:52:59.534713 139843685021504 test.py:624] tflite Final test accuracy, stream model state external = 100.00% (N=609)
It's not all running on TFL, as 2 nodes delegate out to TF, but the speed increases are pretty huge (going by the timestamps above, roughly 37s vs 4s per 200 clips).
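For reference, a minimal sketch of the sort of conversion that produces a model like that, assuming a Keras SavedModel at a hypothetical path; enabling SELECT_TF_OPS is what lets the couple of unsupported nodes fall back to the flex delegate, and Optimize.DEFAULT is what gives the quantised speed-up:

```python
import tensorflow as tf

# Hypothetical path to an exported SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Post-training quantisation: this is where the big TFL speed-up comes from.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Allow ops with no TFLite builtin to fall back to TF via the flex delegate,
# which is why the log above shows "2 nodes delegated out of 34".
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```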
ONNX Runtime Mobile can execute all standard ONNX models, but what that exactly means I don't know; just scratching the surface with TensorFlow & TensorFlow Lite and its flex delegates, all I have gathered is how confusing it is to delegate out, and how constraining the basic op set of the 'lite' runtimes can often be.
I presume the benefits of 64-bit would be the same, and running ONNX Runtime Mobile would probably give similar results to the above.
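I haven't tried it properly, but just as a sketch, running an ONNX model with onnxruntime from Python looks something like this (model path and shapes are hypothetical):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical model file; onnxruntime picks a default CPU execution provider.
sess = ort.InferenceSession("model.onnx")

inp = sess.get_inputs()[0]
# Replace any symbolic/dynamic dimensions with 1 for a dummy input.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
dummy = np.zeros(shape, dtype=np.float32)

# None = return all outputs.
outputs = sess.run(None, {inp.name: dummy})
print([o.shape for o in outputs])
```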
If you have the processing time of a Pi4 producing an approx 10-sec sentence it would be interesting to compare, as model vs model / framework vs framework is so confusing that I don't think there is really any single metric you can use.
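If you wanted rough numbers, a plain wall-clock loop around the interpreter is about the only like-for-like measurement I can think of; a sketch, assuming a hypothetical model.tflite:

```python
import time
import numpy as np
import tensorflow as tf

# Hypothetical converted model; num_threads=4 to match the Pi4's cores.
interpreter = tf.lite.Interpreter(model_path="model.tflite", num_threads=4)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

runs = 50
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
    _ = interpreter.get_tensor(out["index"])
elapsed = time.perf_counter() - start
print(f"mean inference: {elapsed / runs * 1000:.1f} ms over {runs} runs")
```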
TensorFlow tends to have faster optimised versions as it is a static-graph lib, versus dynamic-graph libs like PyTorch, so it is far less flexible; that is why PyTorch garners so much research, as there is no need to write out and compile a graph due to its dynamic nature.
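Roughly what I mean by static vs dynamic, as a toy example: TF traces a function once into a graph that then gets optimised and reused, whereas PyTorch just executes the Python on every call:

```python
import tensorflow as tf

@tf.function  # traced once into a static graph, then the compiled graph is reused
def scaled_sum(x):
    return tf.reduce_sum(x) * 2.0

x = tf.ones([1024])
print(scaled_sum(x).numpy())  # first call traces; later calls run the graph directly
```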
I think ONNX training can be either, as it can be used with both TF and PyTorch, and whether static optimisation is implemented comes down to how the training and models have been implemented.
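E.g. exporting from PyTorch traces the dynamic graph into a static ONNX one, which is where the optimisation opportunity comes back in; a sketch with a hypothetical toy model (tf2onnx does the equivalent for TF):

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for a real network.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
dummy = torch.zeros(1, 16)

# Tracing with the dummy input freezes the dynamic graph into a static ONNX graph.
torch.onnx.export(model, dummy, "toy.onnx",
                  input_names=["input"], output_names=["logits"])
```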
I guess with the Pi4 it doesn't matter so much, but a whole rake of different frameworks can eat quite a bit of memory, as opposed to several uses of one.
I would have a read of https://github.com/TensorSpeech/TensorFlowTTS/issues/522 as that also suffers from a 'crackling' sound, but it seems that if you overlap the chunks slightly and feed a queue it can be done (see the sketch below).
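Something like the following is what I mean by overlapping slightly; a sketch assuming each synthesised chunk shares `overlap` samples with the previous one, where a linear crossfade over the shared region hides the seam that causes the crackle:

```python
import numpy as np

def crossfade_join(chunks, overlap):
    """Stitch consecutive float32 audio chunks that overlap by `overlap` samples."""
    fade_in = np.linspace(0.0, 1.0, overlap, dtype=np.float32)
    fade_out = 1.0 - fade_in
    out = chunks[0].astype(np.float32)
    for nxt in chunks[1:]:
        nxt = nxt.astype(np.float32)
        # Blend the shared region, then append the rest of the next chunk.
        out[-overlap:] = out[-overlap:] * fade_out + nxt[:overlap] * fade_in
        out = np.concatenate([out, nxt[overlap:]])
    return out

# e.g. audio = crossfade_join(synth_queue, overlap=256)  # names/values are assumptions
```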
I thought you were 'too busy for kw'?