As the title suggests, is it possible to make a vocal model speaker dependent? Any suggestions to start from? I am new to this environment. Thank you.
I’m curious too.
In fact, I already did it by mistake. Three months ago I trained my first model with mycroft-precise on just my voice, and the model was very speaker dependent. It mainly worked only with my voice.
Friends who visit my house were unable to trigger it.
It still works much better with my voice than with other voices, because about 70% of my dataset is my own voice.
I think that is a bonus, as who cares about people other than the designated voice actors of the dataset?
It can help to augment the recordings to give slight variation, as distance, direction of arrival (DOA) and even mood can create different intonation and tonal qualities. Recording on the device you will actually use can also be critical, unless you create that variation as well.
The trick is not applying too much so that you don’t increase false positives.
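As a rough illustration of that kind of light augmentation, here is a minimal sketch that mixes a noise clip into a keyword recording at a chosen signal-to-noise ratio. The function name mix_at_snr and its parameters are my own for illustration, not taken from any of the tools mentioned in this thread, and it assumes both clips are mono float arrays at the same sample rate.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None):
    """Return speech plus noise, with the noise scaled to the requested SNR in dB.

    Hypothetical helper: assumes mono float audio at a shared sample rate.
    """
    if rng is None:
        rng = np.random.default_rng()
    speech = np.asarray(speech, dtype=np.float32)
    noise = np.asarray(noise, dtype=np.float32)

    # Tile short noise clips, then take a random window as long as the speech.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]

    # Small epsilon avoids division by zero on silent clips.
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12

    # Scale so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise

    # Normalise only if the mix would clip in the [-1, 1] range.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```

Keeping snr_db fairly high (quiet noise) is one way to apply "not too much", and drawing it from a small range per sample gives the slight variation without drowning the keyword.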
Because a voice AI spends much of its time idle but is often powered 24/7, I believe we should be able to use the voice captured during use to provide additions that replace some of the augmented dataset, so the model gets increasingly accurate for the users included in the dataset, and to retrain periodically whilst idle.
It's the same for everything from ASR to TTS, but KWS, being a single keyword, is much easier to train and needs much smaller datasets and less training time.
As well as ‘own voice’ accuracy, gender, age, region and even environment can be part of a dataset to help increase accuracy.
The universal models such as EN or FR are not there to be more accurate; they just help developers by only requiring a single model per language.
KWS should be quite easy to provide with highly accurate custom models. But unlike ASR and TTS, which have a wealth of sentence-based datasets, word-based data suitable for keywords is much rarer. A dataset can be a combination of recording yourself, increasing that data by augmentation, and adding some word datasets, of which, embarrassingly, I only know a very limited few EN ones.
I have also experimented with mycroft-precise, using only my voice data and samples of noise in my home, and I must say that it works very well. Precise is easy to use, but I don’t know how to make a complete recognition setup based only on my voice. I think if it uses TensorFlow the procedure is similar, no? The only problem with Mycroft is that the file is released as .tflite, but Rhasspy requires .pb. How did you solve that?
It would be interesting to add a (voluntary) function to Rhasspy to send recognition samples and ambient noise samples, to increase the available database. Could it be done?
It would be simple for KW, as it's just a ring buffer of KW length plus maybe a tad more, which is as simple as an np.roll by -chunk_size and then tacking the new chunk onto the end of the buffer.
If you get a KW hit, you have the last second or more of audio, and the buffer length can be set to grab the KW quite cleanly.
Post KW you have the command sentence that is sent to ASR. With some simple logic to check that a clear intent was received without a similar repeat, you can call something like get_last_kw and store both as usage data.
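A minimal sketch of that ring buffer, assuming 16 kHz 16-bit mono audio arriving in chunks. The class name, the 1.5 second default and get_last_kw are illustrative only; this is not an existing Rhasspy or Precise API.

```python
import numpy as np

class KeywordRingBuffer:
    """Keeps the most recent audio so a KW hit can be saved afterwards."""

    def __init__(self, sample_rate=16000, seconds=1.5):
        # Keyword length plus a tad more, pre-filled with silence.
        self.buffer = np.zeros(int(sample_rate * seconds), dtype=np.int16)

    def push(self, chunk):
        chunk = np.asarray(chunk, dtype=np.int16)
        n = len(chunk)
        if n == 0:
            return
        if n >= len(self.buffer):
            # Chunk is longer than the buffer: keep only its tail.
            self.buffer[:] = chunk[-len(self.buffer):]
            return
        # Shift old audio left by one chunk and tack the new chunk on the end.
        self.buffer = np.roll(self.buffer, -n)
        self.buffer[-n:] = chunk

    def get_last_kw(self):
        # Copy so the caller's audio is not overwritten by later chunks.
        return self.buffer.copy()
```

In use you would call push() for every chunk coming from the microphone, and only call get_last_kw() after the ASR step has returned a clear intent, then store that audio together with the command audio as usage data.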
Has Mycroft updated and changed the model of Precise? tflite doesn’t support its main layer, the GRU, so it would be a strange tflite model, as nearly all of it would be delegated out to full TensorFlow.
Also both Rhasspy & Mycroft really should take the plunge to a 64bit release as the main components for KWS, ASR & TTS do actually benefit from a 2-3x perf boost.
Combine that with a model such as a CRNN, where a leading CNN boils down many of the parameters before a delegated GRU, or with a complete tflite model such as a DS-CNN, and on 64-bit things get extremely rapid: approx 17% single-core load on a Pi 3 for a 20 ms streaming KWS (50 inferences per second).
TFLite is still rapid on 32-bit, but probably some of the NEON SIMD (single instruction, multiple data) advantages are lost, given the vector-laden nature of a NN framework, as 64-bit can simply hold more 16-bit vectors in one go.
There is a precise-convert command that converts your model to a .pb file.
In my case it converts to .tflite, strange!
More likely it converts to tflite, but that is still likely a bit pointless, as nearly all of the model will delegate out to full TensorFlow and get very little tflite optimization.
I have been using the far more modern google-streaming-kws framework, which has a large array of current state-of-the-art models as well as some older ones such as Precise's GRU or even just straight CNNs.
The way Mycroft presents Precise as theirs is total baloney; they are purely training a TensorFlow model, an old one at that, on an old version of TensorFlow the last time I looked.
It's neither lightweight nor fast, and their MFCC module sonopy produces some strange results, so it could actually be fundamentally flawed.
But anyway, the choice is yours. For a while I have been working with augmented custom datasets, and I had a list of noise datasets that I have updated with the KW datasets I know.
The Common Voice single-word dataset does have the word "hey", which gives great combination options to create better and more unique KWs.
The list is in the readme, and mix-b.py and reader.py give examples of how to make augmented datasets.
I have setup instructions and a bit of a demo of g-kws here
https://github.com/ARM-software/keyword-transformer is interesting, as g-kws had a pretty archaic training loop to keep it similar to Arm's original https://github.com/ARM-software/ML-KWS-for-MCU
But the new model has an updated training loop that I will probably check out some time.
The notion of branded KWS is just crass, as there really is no such thing: most modern KWS are purely model adaptations sitting on frameworks far in advance of anything a supposed KWS vendor produces.
The most important elements of a KWS are the destination platform and the dataset, and a framework like google-kws is needed so that a single dataset can provide a range of models for a range of working architectures.
TensorFlow and TensorFlow Lite offer solutions from microcontrollers to x86, and g-kws sits on top of that as a specific collection for KWS.
The datasets are extremely restrictive compared with ASR and TTS, which have a huge wealth of spoken-sentence datasets; apart from the list in the dataset-builder, that is about it.
As I was searching for the url for the fluent ai dataset I noticed Auto-KWS 2021 Challenge: Task, Datasets, and Baselines
It will be really interesting to see what solutions are proposed, as that is the next major evolution: custom models that self-train automatically.
If you navigate to the autospeech-2021
In the last decade, machine learning (ML) and deep learning (DL) has achieved remarkable success in speech-related tasks, e.g., speaker verification (SV), automatic speech recognition(ASR) and keyword spotting (KWS). However, in practice, it is very difficult to get proper performance without expertise of machine learning and speech processing. Automated Machine Learning (AutoML) is proposed to explore automatic pipeline to train effective models given a specific task requirement without any human intervention. Moreover, some methods belonging to AutoML, such as Automated Deep Learning (AutoDL) and meta-learning have been used in KWS and SV tasks respectively. A series of AutoML competitions, e.g., automated natural language processing (AutoNLP) and Automated computer vision (AutoCV), have been organized by 4Paradigm, Inc. and ChaLearn (sponsored by Google). These competitions have drawn a lot of attention from both academic researchers and industrial practitioners.
Keyword spotting, usually as the entrance of smart device terminals, such as mobile phone, smart speakers, or other intelligent terminals, has received a lot of attention in both academia and industry. Meanwhile, out of consideration of fun and security, the personalized wake-up mode has more application scenarios and requirements. Conventionally, the solution pipeline is combined of KWS and text dependent speaker verification (TDSV) system, and in which case, two systems are optimized separately. On the other hand, there are always few data belonging to the target speaker, so both of KWS and speaker verification(SV) in that case can be considered as low resource tasks.
In this challenge, we propose the automated machine learning for Personalized Keyword Spotting (Auto-KWS) which aims at proposing automated solutions for personalized keyword spotting tasks. Basically, there are several specific questions that can be further explored by participants, including but not limited to:
- How to automatically handle multilingual, multi accent or various keywords?
- How to make better use of additional tagged corpus automatically?
- How to integrate keyword spotting task and speaker verification task?
- How to jointly optimize personalized keyword spotting with speaker verification?
- How to design multi-task learning for personalized keyword spotting with speaker verification?
- How to automatically design effective neural network structures?
- How to reasonably use meta-learning, few-shot learning, or other AutoML technologies in this task?
Additionally, participants should also consider:
- How to automatically and efficiently select appropriate machine learning model and hyper-parameters?
- How to make the solution more generic, i.e., how to make it applicable for unseen tasks?
- How to keep the computational and memory cost acceptable?
We have already organized two successful automated speech classification challenges, AutoSpeech1 in ACML2019 and AutoSpeech2020 in INTERSPEECH2020, which are the first two challenges that combine AutoML and speech tasks. This time, our challenge Auto-KWS will focus on personalized keyword spotting tasks for the first time, and the released database will also serve as a benchmark for research in this field and boost the exchange of ideas and discussion in this area.
- Feb 26th: Feedback Phase starts
- Mar 26th: Feedback Phase ends, Private Phase starts
- Mar 26th: Interspeech paper submission deadline
- Jun 2nd: Interspeech paper notification
- Aug 31st: Interspeech starts
The dataset links from the dataset-builder I will list here, as they might be useful for anyone thinking of a custom personalised KWS; a mixture of all of them can reduce work and quickly make a dataset. You don’t have to go full custom, as just adding some ‘own voice’ will increase accuracy for you.
Single Word Target Segment from https://commonvoice.mozilla.org/en/datasets
Personalized KWS isn’t that hard. Also think of it as a starting point, as it’s quite easy to automate the training routine so that collected usage is added to a rolling dataset and the model gets more accurate through use.
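One possible way to keep such a rolling dataset, sketched with made-up names (RollingDataset and add_usage_sample are not from any existing tool): real usage samples push out the oldest augmented samples first, so the dataset stays the same size while drifting towards the actual users' voices.

```python
from collections import deque

class RollingDataset:
    """Fixed-size keyword dataset where usage samples replace augmented ones."""

    def __init__(self, augmented_samples, max_size):
        self.max_size = max_size
        self.augmented = deque(augmented_samples)  # e.g. paths to augmented wavs
        self.usage = deque()                       # samples captured during use

    def __len__(self):
        return len(self.augmented) + len(self.usage)

    def add_usage_sample(self, sample):
        if len(self) >= self.max_size:
            if self.augmented:
                # Prefer dropping an augmented sample over a real one.
                self.augmented.popleft()
            else:
                # Only real samples left: drop the oldest of them.
                self.usage.popleft()
        self.usage.append(sample)

    def samples(self):
        return list(self.augmented) + list(self.usage)
```

A periodic job that runs whilst the device is idle could then retrain on samples(). How often to retrain, and how to vet usage samples before they go in, is left open here.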
Dunno why the fluentai link shows a redirect, but if you click it, it takes you there.
Incredibly useful thanks a lot!