Also, DeepSpeech KWS is single-threaded and on a Pi3 runs at less than 0.5x real-time.
However, as an authoritative server KWS it might be very useful, but yeah, between availability, accuracy and process load there isn't much KWS choice out there.
There are KWS options available, and the DeepSpeech one will eventually be released, which is actually interesting as it allows some flexibility on accuracy. Being lightweight is also a big concern, and in many satellite situations a server can act as an authoritative KWS and override lower-tier errors.
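As a rough sketch of that two-tier idea (the names and the score() interface here are invented placeholders, not any real library):

def accept_wake(segment, local_score, server_kws, threshold=0.9):
    # The satellite has already fired its lightweight detector (local_score);
    # the server re-scores the same audio segment with a heavier model and
    # can veto a lower-tier false trigger before anything downstream runs.
    server_score = server_kws.score(segment)  # hypothetical interface
    return local_score >= 0.5 and server_score >= threshold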
Anyway, I am back on what I have posted about before.
https://blog.aspiresys.pl/technology/building-jarvis-nlp-hot-word-detection/
Much of that is already there, but I have been playing with the simpler approach.
Model creation is actually quite easy, as in the tensorflow example it is done on one line of an input parser.
I don't know if it's intentional, but the tensorflow example is extremely stinky in its choice and creation of spectrograms; what it omits is a single line and a lib call to librosa.
As the Aspiresys.pl blog shows:
import librosa
import numpy as np

def extract_audio_features(self):
    # Load the clip at the class's configured sample rate, resampling with kaiser_best
    audio, sample_rate = librosa.load(self.file_path, sr=self.sample_rate, res_type='kaiser_best')
    # 40 MFCCs per frame, averaged over time into one fixed-length feature vector
    self.mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    self.mfccScaled = np.mean(self.mfcc.T, axis=0)
There is only a little bit more to it: spectrograms don't do well with noise, but at the creation of an MFCC you can raise the log-mel amplitudes to a suitable power (around 2 or 3) before taking the DCT, which reduces the influence of low-energy noise components. I was thinking this would have to happen as an audio expander filter, but actually it can be done right there.
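Roughly, that tweak would look like this (a sketch only; the function name, the log1p offset and the parameter defaults are my own choices, not from the blog or the tensorflow example):

import librosa
import numpy as np
from scipy.fftpack import dct

def robust_mfcc(path, sr=16000, n_mels=40, n_mfcc=40, power=3):
    audio, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    # log1p keeps the log-mel amplitudes non-negative, so raising them to a
    # power (around 2 or 3) stays monotonic while squashing the low-energy
    # noise floor before the DCT turns them into cepstral coefficients
    boosted = np.log1p(mel) ** power
    return dct(boosted, axis=0, type=2, norm='ortho')[:n_mfcc]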
So code-wise we already have the source; it just needs patching together, and then it's back to that single line of model creation:
default='yes,no,up,down,left,right,on,off,stop,go',
Models are often far too literate: basically you only need the keyword, and the further classification is to split out classes by phonetic start and syllable count, which have a huge input into an MFCC spectrogram, as you can deduce as much from a sample's match to those other classifications as from its match to the keyword itself.
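For illustration, that just means swapping the default on that parser line for your own keyword plus some deliberately confusable classes (the class names here are invented examples, and the argparse snippet is paraphrased from the tensorflow speech_commands example):

parser.add_argument(
    '--wanted_words',
    type=str,
    # one real keyword plus made-up near-miss classes that share a
    # phonetic start or syllable count with it
    default='raven,rave_on,maven,raisin',
    help='Words to learn; the extra classes sharpen the keyword match.')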
We keep saying it's about the quality of the models and how we lack them, but in terms of datasets there is not really a shortage any more, and datasets are all that models contain.
How you capture audio sets a window, and that is pretty consistent; with the audio tools we have available it is possible to recreate the same window around dataset items for use.
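For example, the speech_commands trainer expects one-second 16 kHz clips, so dataset items can be padded or trimmed to the same window (a sketch; fit_window is my own helper name):

import librosa
import numpy as np

def fit_window(path, sr=16000, seconds=1.0):
    # Recreate the capture window: load, then centre-trim or centre-pad
    # every dataset item to the fixed length the trainer expects
    audio, _ = librosa.load(path, sr=sr)
    target = int(sr * seconds)
    if len(audio) >= target:
        start = (len(audio) - target) // 2
        return audio[start:start + target]
    pad = target - len(audio)
    return np.pad(audio, (pad // 2, pad - pad // 2))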
It's all here, and it's not all programming, as there is quite a bit of 'filing' to be done; but given the current situation and lack of options, all we need is something that is fit for purpose.
I have only got a GTX 780, but a model only takes a few hours to train; in fact I have another card on the way, as I am planning on knocking out a few models and sharing results.
It does take hours, and there will probably be quite a few models, but it's not that huge an undertaking if shared by a few people.
We just need to set some presentation standards for the labels used, accuracy and classification sensitivity. I have a hunch we might get the start of a model that is fit for purpose, and also a great demonstration of much of what is involved, as this is just KWS and thankfully I don't have to think about ASR.