I did a lot of testing with different hardware and wake words. All results are too unsatisfactory to use Rhasspy in "production", although it seems some of you are able to.
What's most irritating is the difference between the old Snips hotword and Rhasspy on the same hardware. When using Snips on a Raspberry Pi Zero and a ReSpeaker 2-Mic HAT, I can call the wake word from ~3 m away. When using Rhasspy with exactly the same hardware and audio configuration (Snips disabled, Rhasspy enabled), I have to sit right in front of the mic and speak in the direction of the microphones.
It doesn't matter whether Rhasspy is installed via Docker, a virtual environment or the Debian package.
For testing purposes I use the Docker image on my Windows PC with my Logitech webcam as microphone. There I can call the wake word from the other side of my apartment and everything works fine. The same webcam connected to the Raspberry is exactly as bad as the ReSpeaker HAT.
I tried a Raspberry Pi Zero, 2B and 4 with the ReSpeaker 2-Mic HAT, a ReSpeaker 6-Mic HAT, a ReSpeaker USB Mic Array, a Logitech C930e webcam and a PS Eye, using Raven and Snowboy as wake words. Nothing works as expected.
I have always been critical, and slightly bemused as to why this has not been addressed, as Raven and Snowboy are not exactly the best KWS options.
Porcupine and Precise are likely better, and Porcupine is probably the best if you cannot train Precise.
I never did try Snips, but you don't seem to be alone in your findings; without the code or previous Snips experience I am blind.
If you pick a platform (I am not a fan of trying to use a Zero, due to load), your mic and a KWS, we could try to see what audio Rhasspy is actually receiving by simply recording some samples.
I have always presumed Snips normalised the audio input, but how some say it worked with noise is a mystery to me.
Also, maybe give a thumbs up here, as better KW support like Porcupine will only help in the current situation.
I have a hunch that if you run noise reduction (say RNNoise) on your dataset, you can then use RNNoise at runtime to help as well. Much of this has been about bad models and implementations, poor mic config and a lack of processing, but for some reason there is a complete lack of community focus on the problem and it is generally ignored?!? I dunno why.
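Just to make the hunch concrete, here is a minimal sketch of what I mean by processing a dataset, assuming you have built the rnnoise_demo example from the xiph/rnnoise repo and have sox on the PATH (the folder names are placeholders):

```python
"""Rough sketch: denoise a folder of wake word samples with RNNoise.

Assumes the rnnoise_demo binary (built from the xiph/rnnoise examples)
and sox are on the PATH. rnnoise_demo wants raw 16-bit mono PCM at
48 kHz, so each wav is resampled up, denoised, then resampled back to
16 kHz. Nothing here is Rhasspy-specific; the paths are placeholders.
"""
import subprocess
from pathlib import Path

IN_DIR = Path("dataset/raw")        # your recorded 16 kHz samples
OUT_DIR = Path("dataset/denoised")  # where processed copies go
OUT_DIR.mkdir(parents=True, exist_ok=True)

for wav in sorted(IN_DIR.glob("*.wav")):
    tmp_raw = OUT_DIR / (wav.stem + ".48k.raw")
    den_raw = OUT_DIR / (wav.stem + ".den.raw")
    out_wav = OUT_DIR / wav.name

    # wav -> raw 48 kHz mono s16le (rnnoise_demo's expected input)
    subprocess.run(["sox", str(wav), "-r", "48000", "-c", "1",
                    "-b", "16", "-e", "signed-integer",
                    "-t", "raw", str(tmp_raw)], check=True)

    # run the RNNoise demo binary over the raw audio
    subprocess.run(["rnnoise_demo", str(tmp_raw), str(den_raw)], check=True)

    # raw 48 kHz back to a 16 kHz wav ready for training
    subprocess.run(["sox", "-t", "raw", "-r", "48000", "-c", "1",
                    "-b", "16", "-e", "signed-integer", str(den_raw),
                    "-r", "16000", str(out_wav)], check=True)

    tmp_raw.unlink()
    den_raw.unlink()
    print("denoised", wav.name)
```

The same RNNoise step would then sit in front of the KWS at runtime, so the model sees the same kind of audio it was trained on.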
That will do. Can you go onto the container, run arecord -Dmymic -r16000 -fS16_LE rec_test.wav, and post the result to a drive somewhere?
Also post the output of amixer -c1 contents.
"c1" is the card index of 'mymic'.
You will have to do an aplay -l (or arecord -l for capture devices) to list the cards, and you may have to stop the Rhasspy service to get access to the card.
With help from audio experts like @rolyan_trauts and others, I’m hoping that we can get to this level of performance for wake words (and for speech recognition with music playing).
I don't know the reason for the difference; I'm not an audio expert, so I'm probably being naive by just recording audio at 16 kHz, 16-bit mono in fixed-size buffers and passing that right into the wake word system. Maybe other systems record differently or do some kind of pre-processing?
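In rough terms, something like this (a simplified sketch, not the actual Rhasspy code; WakeDetector stands in for whichever engine is configured, and PyAudio is just one way to get the buffers):

```python
"""Simplified sketch of the audio path: 16 kHz, 16-bit mono, fixed-size
chunks handed straight to a wake word engine. Not the actual Rhasspy
code; WakeDetector is a placeholder for Raven, Snowboy, Porcupine, etc.
Requires PyAudio."""
import pyaudio

RATE = 16000
CHUNK_SAMPLES = 480  # 30 ms of audio at 16 kHz


class WakeDetector:
    """Placeholder engine: takes raw 16-bit PCM, returns True on a detection."""

    def process(self, pcm_chunk: bytes) -> bool:
        return False  # a real engine would run its model here


detector = WakeDetector()
audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=CHUNK_SAMPLES)

try:
    while True:
        chunk = stream.read(CHUNK_SAMPLES, exception_on_overflow=False)
        # No normalisation, AGC or noise suppression happens here:
        # the raw buffer goes straight into the detector.
        if detector.process(chunk):
            print("wake word detected")
finally:
    stream.stop_stream()
    stream.close()
    audio.terminate()
```

If Snips was doing any gain normalisation or noise suppression before its detector, that would be the step missing from a loop like this.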
From my basic understanding, the variables I see are:
Don't you out me as a 'pert', as in the modern world we all know experts are bad!
It could well be Docker syndrome: what is happening in the container isn't what people think is happening on the host.
I have a pet gripe about the ability to cope with noise: yeah, maybe if we compiled models on processed datasets and fed them processed audio, we might get much better results than with clean audio datasets and trying to feed them clean audio.
My other gripe is that we haven't created easy tools to create models, apart from some fiddly scripts.
Then there's the myth that a mic array alone is somehow better, when actually it can create a whole load of audio filtering depending on how the mics are summed and how far apart they are.
Arrays with DSP algs work fine, as it's the algs that correct and beamform, but without them a single mic is often likely better.
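To put a rough number on the filtering point: naively summing a two-mic pair is delay-and-sum with zero correction, and anything off-axis gets comb-filtered. A quick back-of-envelope (the 6 cm spacing is an assumption, not a measured HAT value):

```python
"""Back-of-envelope comb filter from naively summing a two-mic pair.

For a plane wave arriving at angle theta from broadside, the inter-mic
delay is tau = d * sin(theta) / c. Summing the two signals gives
|H(f)| = 2 * |cos(pi * f * tau)|, so nulls land at f = (2k + 1) / (2 * tau).
The 6 cm spacing is an assumed value, not a measured 2-Mic HAT figure.
"""
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature
MIC_SPACING = 0.06      # metres (assumed)

for angle_deg in (30, 60, 90):
    tau = MIC_SPACING * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    nulls = [(2 * k + 1) / (2 * tau) for k in range(3)]
    print(f"{angle_deg:>2} deg off broadside: nulls at "
          + ", ".join(f"{f / 1000:.1f} kHz" for f in nulls))
```

With the talker off to the side the first null lands around 3 kHz, right in the middle of the speech band, which is exactly why a naive sum can end up worse than one decent single mic.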
Anyone can be an “expert” now with a little “research” on YouTube, right?
The snowboy wake word creator I posted yesterday is a step towards that. My plan is to repurpose the front-end for a GPU-based training system. Now that I’ve successfully got nvidia-docker to work, it just might be possible to have a single image that can record examples and re-train using CUDA.
Here are the samples: https://cloud.drhirn.com/index.php/s/wAbGeH5TpDcQc3D
Recorded with arecord --device "hw:2,0" -r16000 -fS16_LE sarah1.wav -c 2 while sitting exactly the same way as if I were talking to Rhasspy.
The output of amixer contents is also there.
Btw, I'm actually testing with the Debian package, not Docker.
And no, I'm sorry, no Snips code here. But I could provide an SD-card image.
With KWS I don't actually think CUDA is a necessity; yeah, it's a lot faster, but Keras CPU models don't take hours to compile.
You are right about YouTube; thank god, now I can spot the Illuminati at 20 paces.
Maybe the ALSA state directory /var/lib/alsa/ should be shared from the host as part of docker run?
If you could take an image then please do, as I'm really interested to take a look.
Try setting up ALC (automatic level control) on your soundcard, as judging from Audacity your levels are really low.
That might fix your woes, but see what @JGKK says about volumes & ALC.
Audacity is a really good tool: if you select your wav (without the start click) and go Effects → Amplify, it tells you, before you apply it, what gain is needed to get to 0 dB. Currently that's 18 dB, which is pretty huge!
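If you want to sanity-check that without Audacity, here is a rough stand-in for the Amplify readout, assuming a 16-bit PCM wav (the file name is just a placeholder):

```python
"""Quick stand-in for Audacity's Amplify readout: peak level in dBFS and
the gain needed to bring the loudest sample up to 0 dBFS.
Assumes 16-bit PCM; the file name is a placeholder."""
import math
import struct
import wave

WAV_PATH = "sarah1.wav"  # placeholder

with wave.open(WAV_PATH, "rb") as wav:
    assert wav.getsampwidth() == 2, "expects 16-bit PCM"
    frames = wav.readframes(wav.getnframes())

samples = struct.unpack("<" + "h" * (len(frames) // 2), frames)
peak = max(1, max(abs(s) for s in samples))  # interleaved channels are fine here
peak_dbfs = 20 * math.log10(peak / 32768.0)  # 0 dBFS = full scale

print(f"peak level   : {peak_dbfs:6.1f} dBFS")
print(f"gain to 0 dB : {-peak_dbfs:6.1f} dB")
```

If that reports something like 18 dB of headroom, the capture gain on the card is way too low and the wake word engine has very little signal to work with.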
It's OK, yesterday I spent ages trying to get into the car with my front door key.
It only occurred to me why the neighbor was so angry when I remembered I don't have a car.