Suggestion: Look into the LinTo project

Today I stumbled across the LinTo project, a totally open source virtual assistant project out of France.

They have developed some very interesting tools in the open.
Two very interesting ones are a hotword/wakeword spotting tool, and a hotword model generator.

Perhaps worth taking a look at if this has not been done already of course. (I searched for LinTo on the forum, but did not find any mention of it).

Indeed, it seems interesting. Thanks for the link :slight_smile:

The LinTO model looks pretty much like the one used by Mycroft Precise with a few changes.
I don’t know if either of these models provide good accuracy.

Maybe @ulno can provide more insight on Precise accuracy and perfs.

The problem/hard part is creating and preparing the dataset…

Yeah, it’s sort of strange we haven’t had a request to even say “Hey Rhasspy” a hundred times.
Maybe because no one is sure how to pronounce it :slight_smile:
I would and could.


One of the things that I see @synesthesiam is working on is a dataset generator and trainer for Precise. Perhaps a bit off topic here, but what I see is that it uses TTS to generate all kinds of synthetic permutations of the keyword utterance. Would it not be valuable to create a tool that does something similar but takes a human voice recording as input and then just manipulates the recording to generate permutations? That way you’re not relying (only) on a synthetic source.

Check this out for inspiration.

Right now it seems to need a source file, but perhaps the features (timbre, rhythm and pitch) can be generated and then the mapping of the source file to the hundreds of permutations of the original features can be automated…
Just thinking out loud. :slight_smile:
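To make the idea concrete, here is a minimal pure-Python sketch of the kind of permutations meant above (gain changes, small time shifts, added noise). This is my own illustrative helper, not part of any existing tool; a real augmentation pipeline would also change pitch and tempo, e.g. with librosa’s `pitch_shift` and `time_stretch`.

```python
import random

def augment(samples, rate=16000):
    """Yield simple permutations of a 1-second keyword clip
    given as a list of float samples."""
    # Vary loudness
    for gain in (0.5, 0.8, 1.2):
        yield [s * gain for s in samples]
    # Shift the keyword slightly earlier or later in the window
    for shift_ms in (-100, 100):
        n = int(abs(shift_ms) / 1000 * rate)
        if shift_ms < 0:
            yield samples[n:] + [0.0] * n
        else:
            yield [0.0] * n + samples[:-n]
    # Overlay a little background noise (seeded for reproducibility)
    rng = random.Random(0)
    yield [s + rng.gauss(0, 0.005) for s in samples]
```

Feeding each real recording through something like this multiplies a handful of human samples into dozens of training examples.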


Data augmentation FTW.

Using external TTS can increase the number of variations in the dataset.

I think having both real human recordings and TTS generated ones will help improve the model accuracy…

I’d love using Google’s and Amazon’s own tools (free tier) to emancipate ourselves from them :wink:

I am not a fan of Precise, apart from the fact that it’s open source: it’s not that precise, and as Pi Zero fans will find out to their dismay, it’s not that light either.

I am sort of the same, as there are phonetic datasets and audio tools to synthesize word datasets, and, like @fastjack says, better-quality TTS is also a great source for datasets, even if I am not exactly sure the spectrogram of human vs TTS speech is an exact fit.

But the assumption is that, within a country or region, we all speak the same, and that is just not true.
As people we have vocabulary styles and preferences of use that vary drastically and can be quite unique.

It’s an impossible task to create a model that fits all, and one that really isn’t needed, as custom use data added to the dataset greatly improves accuracy. We are getting to a stage where, at defined intervals, we can update our model through a process of continuous training on use, and not rely on ill-fitting generalised models.

The best results by far are by including your voice.

It’s why I posted Cloud TPU Alpha Access: yes, that is alpha access, but occasional custom-tailored model training is likely to be quite cheap.

I totally agree with @rolyan_trauts on the fact that no matter how many permutations in terms of tempo, rhythm and pitch you can create from a single source, regional accents are not taken into account with that method. However, my suggestion pertains to creating a model for your personal use, based on a recording or recordings of your own voice, and then to enrich the dataset with generated permutations of your voice as well as synthetic utterances (using TTS) to create a larger training set.

Using local or cloud processing to train the model is up to whoever is doing the training.

Back to the initial idea of this topic - has anybody tried LinTo?
Mycroft is working hard on making Precise leaner; we are also using a quite outdated version here in Rhasspy. However, new versions of Precise are supposed to even come as some kind of snap, flatpak, or AppImage (whether that makes them lighter is arguable).

So what’s up with LinTo? - I would be very interested in some reports.

LinTo is looking quite good, but I guess no one is mentioning it much because we are here on the Rhasspy forum; from a user perspective it’s probably a choice between Rhasspy and LinTo, and of whether cloud services of various forms are OK.

Out of LinTo comes the generic library https://github.com/linto-ai/pyrtstools, which is quite interesting.

Disclaimer: This is an early version designed to provide a voice command detection pipeline for LinTO. However the elements are designed to be generic and can be used for other purposes.

Features

pyrtstools features different blocks:

  • Audio acquisition
  • Voice activity detection
  • Feature extraction
  • Keyword spotting

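The four blocks above could be chained into a simple streaming pipeline. A hypothetical sketch of how such blocks might compose (the `Pipeline` class and the toy stand-in blocks are mine for illustration, not pyrtstools’ actual API):

```python
from typing import Callable, List, Optional

class Pipeline:
    """Chain processing blocks: each takes a buffer and returns a
    transformed buffer, or None to drop it (e.g. VAD found no speech)."""
    def __init__(self):
        self.blocks: List[Callable] = []

    def add(self, block: Callable) -> "Pipeline":
        self.blocks.append(block)
        return self

    def process(self, buffer) -> Optional[object]:
        for block in self.blocks:
            buffer = block(buffer)
            if buffer is None:
                return None
        return buffer

# Toy stand-ins for the real blocks listed above
vad = lambda audio: audio if max(audio, default=0) > 0.1 else None
features = lambda audio: [sum(audio) / len(audio)]   # toy "feature"
kws = lambda feats: feats[0] > 0.2                   # toy "detection"

pipeline = Pipeline().add(vad).add(features).add(kws)
```

The appeal of the design is that each stage is swappable: a different VAD or feature extractor drops in without touching the rest.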
It uses https://github.com/mycroftai/sonopy to create MFCCs.

I prefer Librosa as it gives much more control over MFCC creation.
Librosa is still a simple API but it has some really useful parameters that often are missing in others.

https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html

The wakeword section of Linto is pretty good.

https://doc.linto.ai/#/client/embedded_hotword

This is actually an area that is problematic for Rhasspy, given the current closed models it offers, and the LinTo Hotword and Hotword Trainer would be a great addition.
They also realise you need open datasets.
https://wakemeup.linto.ai/

I had forgotten about LinTo because some of the server-side operation put me off, but the KWS would make an excellent addition to Rhasspy, which currently has a big need there.

I am not at all sure that using TTS is a good way to train a model: a human dataset was used to create the TTS, and that synthetic output is then used to train a model meant to detect human voices. It might seem self-evident where that could go wrong.

I don’t understand why either, as it’s not for a lack of datasets.

Accuracy of a model is all about accuracy of the dataset: your own voice is the most accurate, then your region and dialect, then gender, then your overall language.

I have been manipulating just the male northern English subset.
I used DeepSpeech to get the start and end points of words in sentences, and I am going to use SoX to strip and pad the words into 1-second clips and normalise them.

https://openslr.org/83/
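The strip/pad/normalise step could look roughly like this. A pure-Python sketch of my own (with SoX the equivalent would be the `trim`, `pad`, and `norm` effects; the helper name and sample positions are illustrative):

```python
def make_one_second_clip(samples, start, end, rate=16000):
    """Cut a word out of a longer 16-bit recording (start/end in
    samples, e.g. from DeepSpeech word timings), zero-pad or truncate
    it to exactly one second, and peak-normalise to about -3 dBFS."""
    word = list(samples[start:end])
    n = rate                               # one second of samples
    word = (word + [0] * n)[:n]            # pad with silence, then cut
    peak = max(abs(s) for s in word) or 1
    scale = (10 ** (-3 / 20)) * 32767 / peak   # -3 dBFS for 16-bit PCM
    return [int(s * scale) for s in word]
```

Run over every word boundary DeepSpeech reports, this turns a sentence corpus into uniform 1-second keyword-style clips.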

Why they are going the way they are going is a total mystery, and why they are training with TTS is an even bigger one, given what is needed for accuracy.
But it will be the most accurate TTS KWS available!? The accuracy for TTS hotwords of various pitches will be extremely high.
It is probably good for giving you a starter set for a custom keyword, but that is really all.

Precise is pretty clunky and not all that accurate, as I think it’s phoneme-based, but I am interested in the LinTo TensorFlow Lite version.
Also, they might have created an image that fixes some of what I hate about the seeed-voicecard drivers; they have an image with the drivers installed.
https://gamma.linto.ai/downloads/Raspberry/v1.1/linto-maker-buster-v1.1-seeedvoicecard.img

I really like that card, but the drivers kill it for me; I am going to check that image, as for KWS it is really interesting.