Mycroft Precise Model Problem (computer-en.pb)

Good evening,

I tried getting Precise to work today and had quite a hard time with it. I started out with the model computer-en.pb, which did not work at all for me. Next I tried the hey-mycroft-2.pb that comes with the Rhasspy docker image, with default settings, and it worked. It did not recognize me every time, but I put that down to my bad English pronunciation and the fact that I play around in a German profile.

Since that proved Precise can work, I tried the other models that came with the docker image. I could not get any of them to recognize me, most likely pronunciation again. Then I went back to the computer model. In theory, the English and German pronunciations of "computer" should be similar, if not identical, so I thought it should work even with my bad accent. I could not get it working with the default settings, so I played around with the sensitivity.

With it at 0.9 instead of 0.5 I could get it to respond sometimes, but after some testing it turned out that it responds to practically anything except the word "computer". It was triggered when I asked a whole question, "Computer, wie wird das Wetter" ("Computer, what will the weather be like"), but not when I just said "computer". More testing resulted in me asking just "wie wird das Wetter" without the wake word, and lo and behold, that triggered it. While I was testing, my phone got a notification, which makes three beeping sounds; that triggered the wake word too.

Now I am at a point where I just don't get what is wrong. Is the model that bad, or am I doing something wrong?

In addition, I have not found documentation anywhere for the two parameters I can play around with. My guess is that sensitivity is between 0 and 1, and the higher it is, the more reactive it gets, but what about trigger level?
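For reference, a minimal sketch of how those two parameters are exposed when you drive a model from Python with the precise-runner package (based on its README; the engine and model paths are placeholders). Precise's docs describe sensitivity as a 0-1 bias on the network's output threshold, and trigger_level as the number of successive above-threshold chunks needed before an activation fires:

```python
# Minimal sketch, assuming the precise_runner package (pip install precise-runner).
# Paths are placeholders; adjust them for your setup.
from precise_runner import PreciseEngine, PreciseRunner

engine = PreciseEngine('precise-engine/precise-engine',  # engine binary
                       'computer-en.pb')                 # wake word model

runner = PreciseRunner(
    engine,
    sensitivity=0.5,   # 0..1, higher = fires more easily (I tried 0.9)
    trigger_level=3,   # successive above-threshold chunks needed to trigger
    on_prediction=lambda prob: None,            # raw score per audio chunk
    on_activation=lambda: print('wake word!'),  # fires on detection
)
runner.start()
```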

I also noticed that when I set pocketsphinx to the wake word "computer" and play around with it, it plays a sound to indicate it is listening, and I can ask my question afterwards. Even with the (somewhat) working "hey mycroft" I had to ask the question right away, because if I waited for the beep I did not have enough time to finish speaking.

I do know that others are using Precise, or at least experimenting with it, so I hope you might be able to help me out with your experiences. I know I can train my own wake word, and about @ulno's guide for it, but I wanted to test it before I spend hours training my own.

Daenara

I can give you one of my wake words for testing if you're interested; it's "hey pips". I also have one for my freehome. I didn't have great success with the computer one either.

I also speak German, and I am using a ReSpeaker 2-mic HAT with Rhasspy. I also used it for recording the wake word samples that I used with Precise.

I managed to train a wake word model with Precise for "computer" that works relatively well for me.
Because the model is only trained on the wake word spoken by me, it does not work well for other people.

We could join forces and upload all our samples for "computer" to a Google Drive folder, and I'd be willing to have a go at a model for the community. Also, I would prefer "hey computer" :see_no_evil:

@Daenara I took another look at my Rhasspy setup now that I have time, and the Precise wake word listener also works in the Rhasspy 2.5 docker install.
If you want to train your own wake word model with Mycroft Precise, you should record the wake word at least 50 times if you want good results; a minimal recording loop is sketched below.
But it is definitely worth it, as I now have a wake word that works great. It also helps if you record the wake word samples with the same mic that you plan to use with it.
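Something like this, assuming the sounddevice and soundfile Python packages and the 16 kHz, 16-bit mono WAV format Precise expects (paths and timings are placeholders):

```python
# Sketch: record 50 wake word samples with the mic you will actually deploy.
import os
import sounddevice as sd
import soundfile as sf

RATE = 16000     # Precise works on 16 kHz mono audio
SECONDS = 3      # length of each recording
N_SAMPLES = 50   # at least 50 recordings for good results

os.makedirs('wake-word', exist_ok=True)
for i in range(N_SAMPLES):
    input(f'Press Enter and say the wake word ({i + 1}/{N_SAMPLES})...')
    audio = sd.rec(int(SECONDS * RATE), samplerate=RATE,
                   channels=1, dtype='int16')
    sd.wait()  # block until the recording finishes
    sf.write(f'wake-word/computer.{i:02d}.wav', audio, RATE, subtype='PCM_16')
```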

I am willing to share my wake word samples for the German wake word "computer" if you want to create a more universal wake word model.

I found the biggest influence was having lots of different random noise to do incremental training against. It's also important to duplicate your wake word training data with added background noise, to train a noise-resistant model.
Precise includes a tool for that.
I also played with adding white noise to some of the data, and with making pitch-shifted duplicates. While that does improve things, the duplicate set with added background noise is more important.
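To make the duplication idea concrete, here is a sketch of the mixing step itself (Precise ships its own tool for this, so treat this purely as an illustration; it assumes numpy and soundfile and 16 kHz mono WAV files):

```python
import numpy as np
import soundfile as sf

def mix_with_noise(wake_path, noise_path, out_path, noise_gain=0.3):
    """Overlay a random slice of background noise onto a wake word clip."""
    wake, rate = sf.read(wake_path, dtype='float32')
    noise, _ = sf.read(noise_path, dtype='float32')
    if len(noise) < len(wake):                      # loop short noise files
        noise = np.tile(noise, len(wake) // len(noise) + 1)
    start = np.random.randint(0, len(noise) - len(wake) + 1)
    mixed = wake + noise_gain * noise[start:start + len(wake)]
    sf.write(out_path, np.clip(mixed, -1.0, 1.0), rate)  # avoid clipping

mix_with_noise('wake-word/computer.00.wav',
               'noise/background.wav',
               'wake-word/computer.00.noisy.wav')
```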
I found all those YouTube videos like "one hour of random household noises" or bar sounds to be a great source of random audio to train against. Just download the audio with youtube-dl and convert it to the right format with sox.
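For example, something like this (a sketch assuming youtube-dl and sox are on your PATH; the URL is a placeholder):

```python
import subprocess

url = 'https://www.youtube.com/watch?v=XXXXXXXXXXX'  # placeholder video ID

# Extract the audio track as WAV
subprocess.run(['youtube-dl', '-x', '--audio-format', 'wav',
                '-o', 'noise/raw.%(ext)s', url], check=True)

# Resample and downmix to the 16 kHz, 16-bit mono format Precise expects
subprocess.run(['sox', 'noise/raw.wav', '-r', '16000', '-c', '1', '-b', '16',
                'noise/background.wav'], check=True)
```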

As I said, if we create a public folder with data for the wake word, I'd be more than happy to train it.

Johannes

@JGKK Do you speak German or English?
The Mozilla Common Voice dataset also seems like a great source of data to decrease false activations from normal talking (https://voice.mozilla.org/de/datasets). Because there is a transcript for every recorded sentence, it should not be too difficult to remove all the sentences containing the specific wake word.
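Something along these lines should do it (a sketch assuming the Common Voice validated.tsv layout with 'path' and 'sentence' columns; check the column names against the release you download):

```python
import csv

WAKE_WORD = 'computer'

with open('cv-corpus/de/validated.tsv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter='\t')
    # Keep only clips whose transcript does NOT contain the wake word,
    # so they are safe to use as not-wake-word training data.
    negatives = [row['path'] for row in reader
                 if WAKE_WORD not in row['sentence'].lower()]

print(len(negatives), 'clips usable as negative samples')
```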

Both, but my mother tongue is German. Yes, the Mozilla Common Voice dataset would be a great source too.

That is the best approach, as when you have the datasets themselves you can add to them, improve them, or choose which samples to include.

I really wanted to create a simple Wordpress or Joomla table app so that it had more user control, but a Google spreadsheet linking to datasets allows a distributed community to create a huge distributed dataset.

Invite people to start using Google Drive, OneDrive and others to upload a dataset and share the URL in a central community repo, using something as simple as the above.

As many have said above, 'own voice' samples can greatly increase accuracy, because the model isn't recognising speech, purely MFCC spectral images of (likely) one-second word windows.
I quite like LinTO HMG (Hotword Model Generator) because in the MFCC setup it shows you the average MFCC image when you create the parameters; you're sort of blind if you don't get that kind of feedback.
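To illustrate what that average MFCC feedback amounts to, a sketch assuming librosa and numpy (the frame parameters are generic defaults, not HMG's exact settings):

```python
import glob
import librosa
import numpy as np

mfccs = []
for path in glob.glob('wake-word/*.wav'):
    audio, rate = librosa.load(path, sr=16000, duration=1.0)
    # Pad short clips to exactly one second so every MFCC image aligns
    audio = np.pad(audio, (0, max(0, 16000 - len(audio))))
    mfccs.append(librosa.feature.mfcc(y=audio, sr=rate, n_mfcc=13))

average_image = np.mean(mfccs, axis=0)  # what a tool like HMG displays
print('average MFCC image shape:', average_image.shape)
```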


I am not saying it has to be LinTO HMG, but we really need a tool like it: from the initial folder selection for hotwords and not-hotwords, through model creation and evaluation, to hotword voice testing, it has every crucial tool for hotword generation in a single package.

If someone says yes and takes control of a spreadsheet, and we create a thread where people post URLs or report dead links, then they can add dataset entries and share a viewable version publicly.

If you do, then I will get to work uploading some regional, gender-based KWS word datasets from https://openslr.org/83/
There are not all that many lines in there, but when split into words it should be considerable.
It's your choice which you download and what you combine to create your dataset, and your training weighting can be steered by the predominance of samples of the language, region, gender, and age that you choose.
I will get on to doing the same with Mozilla Common Voice; the metadata is pretty sparse, but the sample quantity is absolutely huge, so it should yield a lot of UK English gender-specific and mixed word samples.
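On the metadata side, selection could look like this sketch (column names follow the older Common Voice TSV layout with 'gender' and 'accent' fields; newer releases may differ, so treat them as assumptions):

```python
import csv

def select_clips(tsv_path, gender=None, accent=None):
    """Yield clip paths matching the requested speaker profile."""
    with open(tsv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            if gender and row.get('gender') != gender:
                continue
            if accent and accent not in (row.get('accent') or ''):
                continue
            yield row['path']

uk_female = list(select_clips('cv-corpus/en/validated.tsv',
                              gender='female', accent='england'))
print(len(uk_female), 'matching clips')
```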

Don't use a single drive folder; have a central database of some form and link many.

Don't think of hotwords as spoken words but purely as MFCC representations.

I noticed with LinTO HMG and the Google speech commands set that the few false positive/negative detections were often from non-native English speakers.
The words were recognisable to human ears, but the pronunciation, and hence the MFCC spectra, would have been very different.

It's really essential to create a large, metadata-rich central dataset that lets users easily pick their language, region, gender, and age specifics, as it's stupid and unnecessary to create huge complex models to encapsulate everything.
Generally, trying to encapsulate everything just adds more variance and creates sprawling models that have more overhead and are prone to being less accurate, not more.
We should not be shipping non-open-source black-box models, but the datasets, aka the 'code', to create them.

The current datasets are heavily biased toward native speakers, with a smattering of token non-native speakers, and that is absolutely useless: without some balance, the model is useless for non-native speakers, while for native speakers the addition just contributes useless samples that increase variance and reduce accuracy.

You can create concise, small, accurate models quite easily, and adding a few of your 'own voice' samples will only improve that. You don't need to be inclusive of the whole world with the voice AI that is in your room, but the datasets that we share publicly and globally do need to be.

If you are going to contribute to Common Voice or other projects, please also add accurate metadata, as we seem to be creating huge models that centralise around natively spoken common languages and exclude regional, gender-specific and non-native use, in a similar way to how lesser-spoken languages are treated.
