One of the things that I see @synesthesiam is working on is a dataset generator and trainer for Precise. Perhaps a bit off topic here, but what I see is that it uses TTS to generate all kinds of synthetic permutations of the keyword utterance. Would it not be valuable to create a tool that does something similar but takes a human voice recording as input and then just manipulates the recording to generate permutations? That way you’re not relying (only) on a synthetic source.
Right now it seems to need a source file, but perhaps the features (timbre, rhythm, and pitch) could be extracted, and the mapping from the source file to hundreds of permutations of those features could then be automated…
Just thinking out loud.
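To make the idea concrete, here is a rough sketch of one crude way to generate permutations from a single recording: scaling the WAV frame rate shifts pitch and tempo together. This is only an illustration of the concept, not how Precise's trainer works; the file names and rate factors are made up, and a real pipeline would use proper audio tools for independent pitch/tempo control.

```python
import math
import struct
import wave

def make_variant(src_path: str, dst_path: str, rate_factor: float) -> None:
    """Copy src_path, scaling the playback rate (pitch and tempo together)."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        # Changing the frame rate alone is the crudest possible augmentation.
        dst.setframerate(int(params.framerate * rate_factor))
        dst.writeframes(frames)

# Stand-in for a real keyword recording: 0.5 s of a 440 Hz tone at 16 kHz.
with wave.open("keyword.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    samples = (int(12000 * math.sin(2 * math.pi * 440 * n / 16000))
               for n in range(8000))
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# Generate a small grid of rate permutations from the single source clip.
for i, factor in enumerate([0.9, 0.95, 1.05, 1.1]):
    make_variant("keyword.wav", f"keyword_var{i}.wav", factor)
```

A real augmentation tool would of course vary pitch and tempo independently and add noise and room effects, but the shape of the pipeline (one human source clip in, many permuted clips out) would be the same.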
I am not a fan of Precise apart from it being open source, as it's not that precise, and as Pi Zero fans will find out to their dismay, it's not that lite either.
I am sort of the same, as there are phonetic datasets and audio tools to synthesize word datasets, and like @fastjack says, better-quality TTS is also a great source for datasets, even if I'm not sure the spectrogram of a human voice is an exact fit for that of TTS.
But the assumption is that, within a country or region, we all speak the same, and that is just not true.
As people we have vocabulary styles and preferences of use that vary drastically and can be quite unique.
It's an impossible task to create a model that fits all, and one that really isn't needed, as custom use data added to the dataset greatly improves accuracy. We are getting to a stage where, at defined intervals, we can update our model through continuous training on actual use rather than relying on ill-fitting generalised models.
The best results by far are by including your voice.
It's why I posted about Cloud TPU Alpha Access: yes, that is alpha access, but occasional custom-tailored model training is likely to be quite cheap.
I totally agree with @rolyan_trauts on the fact that no matter how many permutations in terms of tempo, rhythm and pitch you can create from a single source, regional accents are not taken into account with that method. However, my suggestion pertains to creating a model for your personal use, based on a recording or recordings of your own voice, and then to enrich the dataset with generated permutations of your voice as well as synthetic utterances (using TTS) to create a larger training set.
Using local or cloud processing to train the model is up to whoever is doing the training.
Back to the initial idea of this topic - has anybody tried LinTo?
Mycroft is working hard on making Precise leaner, and we are also using a quite outdated version here in Rhasspy. However, new versions of Precise are supposed to come as some kind of snap, flatpak, or AppImage (whether that makes them any lighter is arguable).
So what’s up with LinTo? - I would be very interested in some reports.
LinTo is looking quite good, but I guess no one is mentioning it much because we are here on the Rhasspy forum; from a user perspective it's probably a choice between Rhasspy and LinTo, and of whether cloud services of various forms are OK.
That is actually problematic for Rhasspy given the closed models it currently offers, and the LinTo Hotword and Hotword Trainer would be a great addition.
They also realise you need open datasets. https://wakemeup.linto.ai/
I had forgotten about LinTo because some of its server-side operation put me off, but its KWS would be an excellent addition to Rhasspy, which currently has a big need there.
I am not at all sure that using TTS is a good way to train such a model: a human voice dataset creates the TTS, and then its synthetic output is used to train a model that is supposed to detect human voices. It might seem self-evident where that could be going wrong.
I don’t understand why either, as it’s not for a lack of datasets.
The accuracy of a model is all about the accuracy of the dataset: your own voice is the most accurate, then regional accents and dialects, then gender, then your overall language.
I have been manipulating purely the male northern English dataset.
I used DeepSpeech to get the start and end points of words in sentences, and I am going to use SoX to strip and pad them into 1-second clips and normalise.
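For anyone curious, the strip-and-pad step can be sketched in plain Python with the stdlib `wave` module (SoX does the same job with its `trim`, `pad`, and `norm` effects). The paths, start offset, and sample rate below are hypothetical; the word boundaries would come from the DeepSpeech alignment.

```python
import wave

RATE = 16000  # assumed sample rate, mono 16-bit PCM

def clip_to_one_second(src_path: str, dst_path: str, start_s: float) -> None:
    """Cut one second starting at start_s; zero-pad if the source runs out."""
    with wave.open(src_path, "rb") as src:
        width = src.getsampwidth()
        src.setpos(min(int(start_s * RATE), src.getnframes()))
        data = src.readframes(RATE)                    # up to 1 s of frames
    data += b"\x00" * (RATE * width - len(data))       # pad the tail with silence
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(width)
        dst.setframerate(RATE)
        dst.writeframes(data)

# Stand-in input: 0.7 s of silence (in practice, a sentence recording
# whose word start/end times were found by DeepSpeech).
with wave.open("sentence.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(RATE)
    w.writeframes(b"\x00" * (2 * int(0.7 * RATE)))

clip_to_one_second("sentence.wav", "word0.wav", start_s=0.1)
```

Every output clip ends up exactly one second long regardless of the word's duration, which is the fixed input size most KWS models expect.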
Why they are going the way they are is a total mystery, and why they are training with TTS is an even bigger one, given what is needed for accuracy.
But it will be the most accurate TTS KWS available!? The accuracy for TTS hotwords of various pitch will be extremely high.
It is probably good to give you a starter set for a custom keyword but that is really all.