Collecting positive wakeword samples on non-Raven systems

Evening,

I am trying to find a wakeword system that works for me, and so far all of them have disappointed. Right now I am trying to train a mycroft-precise model using the few positive samples Raven collected for me before the false positives got annoying and I turned it off.

If I manage to train mycroft-precise well enough to work better than Raven, how do I get the false positives I would need to refine it further? Is there a way to just collect every sample a wakeword engine reacts to? Raven's collection system is great, but it does not help once the initial model is trained.

Hello,
Have a look at this post of mine over on the Node-RED forum (especially the second part after the edit break). It explains the strategy I use to train my Precise models and which commands to run in what order, and it also includes some useful bash scripts to collect samples, split up random noise recordings and mix that noise into wake-word recordings.

Some key points are:

  • you will need at least 10 samples from each person in the household (more is better; record them in a quiet room with no background noise)
  • duplicate this sample set with added background noise to train a noise-resistant model (see the following points about random audio)
  • collect random audio from your household with common occurrences like household noises, watching TV and conversations (everything except the wakeword)
  • additionally, download audio from YouTube videos such as one hour of coffee shop noise, one hour of bar noise or train station noise
  • split all that audio into chunks no longer than one minute; this is the audio that gets mixed into the noisy wake-words and used to train against (there is a sox command in the linked post that does this easily, and a sketch of it after this list)
  • you will need at least 5 hours of combined random non-wake-word audio to train a satisfactory model
  • if something goes wrong or you add new data, start training again from scratch
  • play with the threshold and sensitivity parameters in training
  • if you find that certain noises/scenes trigger a lot of false positives, record them and add them to your noise library
  • be patient and don’t get frustrated
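
For a rough idea of the sox commands involved (only a sketch; the actual scripts are in the linked post, the file names here are placeholders and I assume 16 kHz mono WAV files):

# split a long random-noise recording into one-minute chunks
# (produces noise_chunk001.wav, noise_chunk002.wav, ...)
sox long_recording.wav noise_chunk.wav trim 0 60 : newfile : restart

# mix a noise chunk (at reduced volume) into a wake-word sample and trim
# the result back to the length of the original sample
sox -m wakeword01.wav -v 0.3 noise_chunk001.wav noisy_wakeword01.wav \
    trim 0 "$(soxi -D wakeword01.wav)"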

Johannes


While this is a good writeup on how to train effectively, it is not what I am looking for. I can train models just fine; what I am looking for is a way to capture whatever the system decides is a positive wakeword once it is in production.

If I say something and remember what I said, I can record it later on and put it into the not-wake-words to train with, but if I have a conversation with ppl and something they say triggers my wakeword, or if I am in a voice chat and something triggers it, those I can't record later on. Raven has this nifty feature where it just stores everything that triggers it into a folder, so I can later sort through it and use everything collected to train a custom model. I am looking for a way to “emulate” this behavior with other systems.

It is likely that there is no way to do it (yet) but I am not that well versed in the audio side of things, so I thought I might just ask. There also might be a simple way to do it that I just don’t see.

You could run Precise on a second device; I think it has this functionality built in as one of the tools if you do a source install. You could probably also build something like this with voice2json and Node-RED if you run your model there.

If you're interested, I just built a quick flow in Node-RED as a proof of concept which does exactly that. It uses voice2json to listen for the wakeword, and every time it is triggered it saves the last two seconds of audio before the detection. You could probably run it in parallel to Rhasspy. You could also use the MQTT message for wake word detection from Rhasspy and then wouldn't have to run voice2json in parallel. The only prerequisite would be to have your mic configured as a dmix device or to use PulseAudio.
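
If you go the Rhasspy MQTT route, the idea is simply to subscribe to the hotword-detected topic and grab some audio whenever a message arrives. A very rough shell sketch (not my Node-RED flow; it assumes mosquitto-clients is installed, the broker runs on localhost and the mic can be opened alongside Rhasspy, and it only records the seconds after the trigger rather than the audio before it):

mkdir -p ~/wakeword-captures
mosquitto_sub -h localhost -t 'hermes/hotword/+/detected' | while read -r _; do
    # record three seconds right after each detection for later sorting
    arecord -D default -f S16_LE -r 16000 -c 1 -d 3 \
        ~/wakeword-captures/"$(date +%Y%m%d-%H%M%S)".wav
done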

I am not sure why this isn't done, as the reaction after an ASR result, whether the KW has to be repeated or the sentence was accepted, is extremely indicative of whether it really was a KW.
Why that data isn't stored and batched for further training and model updates, so that a system can get better through use, is a mystery.
The same goes for why we lack easy tools for training and why we don't have a KWS more modern than Precise, which is precise only by name; it is purely a matter of picking a newer TensorFlow/Keras or PyTorch framework and providing a KWS ourselves.
It's the tools, though; it is a mystery why they haven't been created, and why we have this strange collection of mediocre KWS options rather than one good one with supporting tools.
Porcupine, now that Rhasspy has been updated and it is working again, is probably the best choice if you value an accurate KW over not-so-accurate custom ones.

I am totally bemused why we don't have an accurate custom one with supporting tools, but hey.

I have a ‘hey marvin’ dataset if it is of any use.
https://drive.google.com/open?id=1LFa2M_AZxoXH-PA3kTiFjamEWHBHIdaA

There is a pull request for Precise with a newer version of TensorFlow; I have tried it, but training does not work well at all and half the scripts are broken. I tried for a day to use it, and the best I could do was 50% false negatives for my wakeword, and training was slow. The current version of Precise at least trains well: with the same training data I had only one false negative, in 1/100 of the time, and that one was excusable since it was pretty bad quality-wise.

I don't see a big problem with training a model myself; so far I have not had any good experience with any pre-trained model. The only thing that I could get to work was pocketsphinx, and that basically reacted to anything I said (and any noise). The next best results I had were with Raven, but that still reacted to way too much, so now I am training Precise. I got it up to roughly the same level as Raven, and I only used a pretty small set of wakeword samples (less than 50) and next to no not-wake-words recorded by the same mic Rhasspy uses. I have only played around with it properly for 3 hours so far, though I spent nearly a week getting a system up and running to even train properly. Once I have more non-wakeword data in there, I am pretty sure I can get it to run properly for me.
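
For reference, the workflow I am following is roughly this (commands as I understand them from the mycroft-precise README for a source install; the folder and model names are just mine):

# expected layout: mydata/wake-word/*.wav and mydata/not-wake-word/*.wav,
# plus the same two folders again under mydata/test/
precise-train my-wakeword.net mydata/ -e 60          # train for 60 epochs
precise-train-incremental my-wakeword.net mydata/    # retrain against random/false-positive audio
precise-test my-wakeword.net mydata/                 # accuracy on the test set
precise-listen my-wakeword.net                       # live test against the microphone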

The problem here is that few ppl have the knowledge, time and motivation to write something like that. And if something working exists, it happens often enough that it gets bought up by some company wanting to play with voice assistants. Precise is not bad, it is just outdated, and there is no working version for Python 3.8 because it uses an outdated TensorFlow library. If they updated that, it could produce pretty good models.

Another part of the problem is where ppl run their voice assistants. From what I read here, quite a few ppl use anything from a Raspberry Pi Zero to a Raspberry Pi 4, with a few running it in a VM on their NAS. The ppl that have a small server, or anything with more power than a Pi 4, are rare exceptions, so most systems are geared to run on lower-end hardware. If it has to run on a potato, there is next to no way it will be perfectly accurate. And if there is a system that is accurate but does not run on the most-used hardware, what good is it then?

I am pretty bad with Node-RED; could you upload your flow and tell me what I have to change to use the Rhasspy MQTT message instead of voice2json? Once I manage to get a decently running model, I can play around with that.

That will be a bit of a problem, but I might manage. I know I don't use PulseAudio, my Pi 4 being a minimal install. I am not sure if I can manage to switch to dmix without breaking my sound, especially since I don't know if that will even work with my Rhasspy running in a Docker container.

If you post your asound.conf I can take a look at it to see what needs to be changed to enable dmix.
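
In general the change is just to put a dsnoop in front of the capture device so that more than one program can read from the mic at the same time. A rough sketch of what that part of the config can look like (the pcm names and the card name here are only examples, yours will differ):

pcm.mic_snoop {
    type dsnoop
    ipc_key 666666
    slave {
        pcm "hw:yourcard"    # replace with your capture card
        channels 4
    }
}

pcm.mic_shared {
    type plug
    slave.pcm "mic_snoop"
}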

I am thankful for that, as PulseAudio mostly complicates everything in my opinion. I am not a big fan of PulseAudio in combination with projects like Rhasspy.

I will do that tonight

There is nothing to Precise but a TensorFlow model with an MFCC feed; that is all it is, and that is all any modern KWS is: purely a streaming neural network.
Since I started talking about MFCC over six months ago, practically every framework now includes MFCC math libs, so the other part from Mycroft, the broken MFCC lib Sonopy, is no longer needed.
There is practically zero to write, as it is 99% the use of a framework such as Keras or PyTorch.

Cutting-edge AI: TensorFlow Lite runs on microcontrollers with better accuracy than the KWS we currently employ.
From Arm to Google, they have posted many publications on accuracy, on what it can be used and how to use it.

The above is over 3 years old and this is nothing new.

Google even has codelabs: https://codelabs.developers.google.com/codelabs/sparkfun-tensorflow#0
https://www.tensorflow.org/lite/
Google has vast resources on cutting-edge AI for the edge, i.e. microcontrollers.

It's like the methods of Raven, which hark back to early attempts at voice recognition that were in no way exclusive to Snips; yes, it still works, but with the massive steps that have been made in AI it is archaic and totally defunct as a method.

There is practically nothing to do to provide a KWS now, as all that is needed is to adopt a framework and a model so that you have a code base you can update and are not constrained by old, outdated versions of current software, as in Precise.

There are $2 chips such as the ESP32 running KWS with more accuracy than some of the KWS options Rhasspy has, and if anyone is telling you anything else it is basically a lie. In fact, $2 chips are outperforming the newly adopted Snowboy, which was a commercial product that has been dropped and made available open source because the technologies behind it are now redundant in terms of accuracy.
At least Raven is a quick-to-train sample catcher you can use to train a new model, but why effort is being given to Snowboy is such a curve ball that I had better stop giving my opinion here; when it comes to KWS and what we use with Rhasspy it is such a WTF that I don't like the only angles to the rationale I can deduce.

I will argue my stance on microphones, but the KWS direction I have kept quiet about and generally will, as the rationales are so odd that I think it is best to keep quiet. What you said above is actually totally wrong, but for some reason that info seems to be coming from this community, and it is puzzling.

Raven was mainly built to gather samples to train a model with while still having a decently working custom wakeword. It is way better than pocketsphinx ever was, and that was the only wakeword I could ever get running before Raven.

There is a fork of Precise that uses TensorFlow 2.2.0, but it hasn't been merged yet and, to be frank, it is quite broken. Training models is slower than with the official Precise release, the train-incremental script is completely broken, and the export function was rewritten to export to a TensorFlow Lite model, so the trained model is now a .pb instead of a .net. I have found no way to actually use it with Rhasspy: just copying it over did not throw any errors, but I could not get a wakeword detected. I can't tell whether that is because the training is not working properly, so the model simply does not recognize me speaking, or something else.

I agree with you that none of the wakeword systems we have now are cutting edge and that it could be better, but I personally do not know how to write something like that; I had problems just trying to figure out the differences between the Precise scripts of those two versions.

For right now we just have to work with what we have, and this thread is not geared towards any system at all. I am just looking for ways to collect data to train models on while using any non-Raven KWS, since they don't come with the handy inbuilt collection that Raven does.

I think the main reason that effort is put into Snowboy is that it is open source, Rhasspy can already use it, and there is a need for an easy-to-train KWS for ppl that aren't well versed in Linux and just want to have a private voice assistant. Not everyone in this community even knows how to code or understands the more technical parts (I gave up on reading your posts on microphones because I didn't understand most of it, and I know I will continue to use my ReSpeaker 4-mic as long as I can keep it working, so I will try to decipher those posts if I actually need the information). With Snowboy becoming open source, it was an easy way to add this missing element without having to reinvent the wheel, so to speak.

@synesthesiam is no expert in AI technology as far as I know, and he is working on Rhasspy mostly by himself. There are quite a few things that need work, and I don't think it would be in anyone's interest if he poured all his time into getting a perfect cutting-edge KWS system for Rhasspy while neglecting everything else. If you know of a working open source framework that can be integrated into Rhasspy, has a decent way to train, is kept up to date by its maintainers (since I don't think there is someone in this community who could keep something like that up to date, otherwise we would have something better already) and, most importantly, is stable enough that an update to TensorFlow or whatever it uses behind the scenes will not break new models without someone manually changing the Rhasspy integration, then I am sure it could be integrated. This community has no one with both the knowledge and the time to actually work on such a project, and we can't just offload everything onto synesthesiam; he has enough on his plate as it is.

The reason it is coming from this community is that most of the community, me included, are no experts in any of this stuff. We also work on this as a hobby, and if I tried to look up every unfamiliar term and technology I would never get anything done. So most ppl either get most of their information from this community or from skimming and half understanding the underlying concepts. We assume ppl posting here about something know what they are talking about without doing extensive research to prove it, and if something is posted often enough it sticks and becomes the norm. So we need experts, or ppl with more time and interest to actually do the research, to correct that opinion, but after it has been repeated all over the place it is hard to get it out of ppl's minds. If I read about something in 10 different threads, mostly written by 10 different ppl, then I assume that they have to at least know some of what they are talking about.

PS: If someone that can move things in this forum comes by, this discussion about KWS systems might be better off in its own thread.


It's called Keras or PyTorch; there is no such thing as a cutting-edge KWS, it is merely the cutting-edge machine learning platform it is based on.
As in garbage in, garbage out: a KWS is the initial voice capture and very much dictates whether the stream is garbage or not.
I did post last May about a KWS that is small, lean and just enough to be fit for purpose.
Of the various frameworks, the problem is the assumption that it should or needs to fit into Rhasspy, when a KWS just broadcasts from the KW until silence and is little more than a stream.

Precise uses Keras and TensorFlow; the only problem is that it is out of date.

As for fitting into Rhasspy: well, Rhasspy is a project that is supposed to be easy to use with minimal knowledge about all the components, so everything has to fit into Rhasspy at least in the sense that you can select it in the web GUI, configure the needed parameters and then use it.

For anything that is not as “beginner friendly” as Rhasspy (even Rhasspy has a pretty steep learning curve, with all the thresholds and sensitivities to be adjusted), you have the custom script you can use as a KWS, or the option of using voice2json and building everything yourself.

You obviously are pretty much an expert on this, and you have no problem training a model without proper guidance and setting up everything on your own, so you can play around all you like and get decent results. Most ppl using Rhasspy can't do that; I would struggle to use the console script option Rhasspy provides even to wrap a precise-listen command. I would most likely succeed at some point, but only after sinking a few days into the project and being frustrated to no end by it.

So I hope you can understand why it “needs” to fit into Rhasspy for it to be usable by more ppl.

I basically use the config provided by ReSpeaker with their mostly working drivers.

Default config for my mic:
# The IPC key of dmix or dsnoop plugin must be unique
# If 555555 or 666666 is used by other processes, use another one

# use samplerate to resample as speexdsp resample is bad
defaults.pcm.rate_converter "samplerate"

pcm.!default {
    type asym
    playback.pcm "playback"
    capture.pcm "ac108"
}

pcm.playback {
    type plug
    slave.pcm "hw:ALSA"
}

# pcm.dmixed {
#     type dmix
#     slave.pcm "hw:0,0"
#     ipc_key 555555 
# }

pcm.ac108 {
    type plug
    slave.pcm "hw:seeed4micvoicec"
}

# pcm.multiapps {
#     type dsnoop
#     ac108-slavepcm "hw:1,0"
#     ipc_key 666666
# }

My main problem is that I don't understand how ALSA works; I just copy and paste stuff from “guides” that try to explain how it works with just a bunch of examples.

The problem with Rhasspy is that you are thinking like WordPerfect did with Version 5, when an upstart called Microsoft split things into a suite of interoperable software that also partitioned complexity into ease of use.
A KWS is an HMI (Human Machine Interface) and should be free of system constraints and of the obsolescence of the system it is used with.
I am no expert and don't train models, because I believe the all-in-one manner of infrastructure is wrong, so I don't waste my time.
A KWS should just be pointed at the IP of the system it is an HMI for, and that is it; there is no more need than that.

If that is soon enough, I will install a 4-mic on a spare Pi tomorrow, set up a working dsnoop asound.conf and share it with you. Are you using the default device in Rhasspy or do you choose the 4-mic directly?

It took me a long time to even half understand it. @rolyan_trauts is definitely the one most knowledgeable about alsa here I would say.

This is what I have set my Rhasspy to: [screenshot of the audio recording device setting]

I had some trouble with the default device in Docker; selecting the card directly fixed that.

I am also not sure if changing the asound.conf of my host system will automatically work for Rhasspy inside Docker. Looking inside the container there is no asound.conf, but that is something I am seeing more and more often in current distributions, and I have not yet found any explanation of how it works without the config file.
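
From what I can tell, the container would at least need to see the host config and the sound devices, so something like mounting the asound.conf in when starting the container might work (an untested guess on my part, with the run options trimmed down to the parts that matter here):

docker run -d \
    --device /dev/snd \
    -v /etc/asound.conf:/etc/asound.conf:ro \
    -p 12101:12101 \
    rhasspy/rhasspy --profile en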

As for the time frame, I have been without a working wakeword basically since I started playing around with Rhasspy back on 2.4.19, so I am in no hurry. The only reason I am trying to get a wakeword working now is that I am tired of things not working when I test via the web GUI because it behaves differently, and of spending hours on debugging until I figure out that it works fine with a wakeword.

ALSA is a pain, and every time I don't use it for a couple of months my MS wipes much of what I know.

The 4-mic is a weird card, as summing the channels is likely detrimental, so in terms of KWS just use it as a single mic with one channel.
It's the ReSpeaker dross that confuses things, as it adds much that is not needed.
I have a 4-mic & a linear in my bin of spare parts and have forgotten whether they even have AGC, but because they lack audio out (unlike the USB one), AEC will not work (the ALSA one; Pulse AEC is pretty ineffective on a Pi).

I only use one channel of the mic since I have this weird issue (again) where at least one random channel produces loud noise at random. It sometimes fixes itself with a restart, but mostly it doesn't, so I turned all but the first channel down in alsamixer. I know it is not the best mic to use, especially with the driver issues it has, but it was recommended for Snips and I got one. I don't try to do anything fancy with that Pi: I don't expect to be understood over music that is more than just a bit of background noise, and I like the LED ring, so it has to work.

But I have to disagree that it is the ReSpeaker stuff that makes it hard. I spent hours staring at non-ReSpeaker configs in VMs and on my laptop without understanding much more than I do with the ReSpeaker stuff.