OK, I played a bit more and dropped the compand command completely. I also played with the alsamixer settings and found that I get much better results by lowering the capture gain to around 30 and the ALC target to 20.
Hi,
I’ve been following the discussion. It’s a bit too technical for me, but really interesting.
Does lowering the capture gain mean people will have to speak louder?
Also, in alsamixer, on the capture view, why are there some speaker settings?
I’m trying to get commands recognized with low/normal background music. Currently Rhasspy is totally unable to understand anything, even with very low background sound. Listening never ends. So I will experiment with some capture settings to see if we can improve this even a bit.
Do you have the asound.conf set up with dsnoop for the mic, and is Rhasspy set to record directly from the mic or from the abstracted capture PCM defined in the asound.conf?
Because then (dsnoop lets several processes capture from the same device) you could run a separate record command on the command line, recording the audio either to the SD card or to an attached USB stick (I would recommend the latter to minimize SD card wear).
This way you could actually hear what Rhasspy hears when you speak.
I use a very different set up as I use voice2json and sox silence detection to record commands. So my commands are limited to 5s length even if there is no silence.
But actually hearing what your settings sound like is important to make decisions about settings.
Edit: also very important, there seems to be quite a spread in sound quality among those 2-mic Pi HATs, especially when dealing with the third-party clones.
Kibo, first of all, have you got a clone 2-mic HAT or an official one? If you have an official one, you can put that to bed.
I have a clone, but I have a hunch they are all the same apart from two slight strains: the ReSpeaker and its clones, and the Waveshare and its clones, which are slightly different.
I have four official Seeed Pi HATs.
The last one has black mics, while the other three have silver mics.
The problem is that with Snips and the same hardware I can trigger the wake word and commands with music playing. There is absolutely no way to get a command recognised with Rhasspy, even with low music or background noise. This is the last thing that prevents me from replacing Snips with Rhasspy.
I think you might still have that bug, Kibo. Has an update with a fix been released yet?
Right, this has not been fixed, and there is no new Docker image anyway.
But apart from ending the recognition, which would still be better than listening for hours, maybe some mic settings could help get commands recognized.
Or a better mic, dunno, but as the music doesn’t come from the same device I doubt it would help.
There is a gulf in price, as the USB ReSpeaker is much more resilient to noise and is packed with technology.
https://www.seeedstudio.com/ReSpeaker-Mic-Array-v2-0.html
I personally think they are overrated for the price and I’m not a fan, but there are those who like them, and when it comes to third-party noise they are better than ‘just a mic’, which is what you have.
What you should do is record some test WAVs. Just use arecord and get a good, full, normalised WAV that you can look at in Audacity.
https://www.audacityteam.org/
Free open source and pretty great.
Basically you want the biggest wave possible without it clipping.
There are 4 recordings there, and the top one is just too loud, but I wasn’t bothered as the wifi speaker was very close and a touch too loud.
The recordings below it are too low.
So what you need to do is get close, speak at what you would call loud, and set up your gain and ALC so that this gives you a full wave with no clipping (maybe let it get away with a tad).
But also do some arecord recordings and post them here so we can have a look.
arecord -D mydevice -r16000 -fS16_LE -c1 test.wav
arecord -l
to get capture device indexes
arecord -L
to get capture device names
Use WinSCP or similar to pull the file to Windows, Ubuntu or whatever you use as a desktop and have a look at it in Audacity.
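If you want numbers to go with what you see in Audacity, something like the following could report the peak level and count clipped samples in a recording. This is only a minimal sketch: it assumes a 16-bit PCM WAV like the one the arecord command above produces, and the function name wav_peak_report and the clip_level threshold are made up for illustration.

```python
import array
import math
import sys
import wave


def wav_peak_report(path, clip_level=32767):
    """Report peak level and clipped-sample count for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples, e.g. from arecord -f S16_LE")
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Largest absolute sample value (0 for an empty file)
    peak = max((abs(s) for s in samples), default=0)
    # Samples at or beyond the threshold are treated as clipped (an assumption)
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    # dBFS: 0 means full scale, more negative means quieter
    peak_dbfs = 20 * math.log10(peak / 32768) if peak else float("-inf")
    return {"peak": peak, "peak_dbfs": peak_dbfs, "clipped": clipped}
```

Calling wav_peak_report("test.wav") on a loud test phrase should give a peak close to 0 dBFS with the clipped count at or near zero if the gain and ALC are in the right area. A peak far below that suggests the recording is too low.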
There is another one I have never tried, https://antimatter.ai/acusis-s, which might be a choice as I’m not a fan of the ReSpeaker. Personally I don’t think there is any variance in quality, it’s just that they are all a bit poor.
My Anker PowerConf uses the same chip as the 2 above and that doesn’t really impress me that much, but it is much better than ‘just another microphone’.
That’s a steep price for the mic you linked.
What, the Anker? That is just the RRP from the manufacturer’s site. Mine was £70, approximately the same as the ReSpeaker.
The https://antimatter.ai/acusis-s is the only linear array I know of using the same chip as the other 2.
But that is the huge gulf between ‘just another mic’ and some technology for AEC and noise suppression. And yeah, some think they are worth it and some don’t, as Google & Amazon have similar algorithms in silicon at obviously much lower prices.
There is another one
Which really is a stereo usb card with software they supply to run on a Pi.
I have purchased the software, as I’ve been meaning to check if I can get it to work, and had forgotten about that.
Here is another set of test recordings for you to look at and import into audacity @rolyan_trauts :
These are 4 recordings. Each one is me reading the first paragraph of The Hobbit.
Near is less than 1 m from the 2-mic Pi HAT and far is 3.5 meters away.
There is one with and one without background music for each.
Background music is normal tv level 4.5 meters from the mic.
This is the ReSpeaker 2-mic HAT with the stock installation alsamixer settings of capture gain 31 and ALC target 20. No recording effects are applied and the room is about 35 square meters.
I really can’t tell if such a mic would help, as I have no experience in this field. Also, I understand echo cancellation comes from the DSP having the music as an input so it can invert it and cancel it, which will never be the case for me (hifi amp).
Example right now:
I’m sitting beside my Triangle Antal speakers playing music at a normal level, 5 meters from Snips (RPi 3 + ReSpeaker 2-mic).
I call Snips, beep, ‘shutdown the music’, boom… no more music.
Currently I can’t get anywhere near such a result with Rhasspy on the exact same hardware. So before paying for such an advanced mic (which I can afford), with no mic experience or audio knowledge, I’m not sure the problem comes from the mic when I see what Snips can achieve with this hardware. It is totally usable day to day, and I say that from living with it.
I will definitely test gain/ALC settings and record in such an environment. As Snips was selling a dev kit with this Pi HAT, maybe they tweaked their algorithms a lot for such hardware. Dunno. @fastjack also mentioned the Kaldi versus VAD detection.
Here are my mics.
You can see the last one behind with the black mic (all other parts exact same).
I use a custom precise model I trained myself on 100+ samples from both my girlfriend and me with and without background noise. It’s trained against about 20 hours of pieces of random noise. This gave me a pretty noise resistant responsive keyword model.
When it comes to recording the command after a wake word was spotted, as I use voice2json I have set up the command recording differently from what Rhasspy does with VAD.
I use sox with a combination of silence detection and a max record length of 5 seconds, as I found that no command takes me longer than that. So it records either until silence or for max 5 seconds, and then sends the recorded audio to the voice2json Kaldi STT component whether or not silence was detected. (I also use sox vad on the audio after recording and before STT to trim non-speech from the beginning and the end.)
I found this approach quite robust.
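The approach described above could be sketched roughly as follows. This is not the actual sox setup, just an illustration of the logic: it assumes the audio arrives as short frames of 16-bit samples, uses a plain RMS energy threshold as a stand-in for sox’s silence and vad effects, and all names and threshold values (capture_command, trim_quiet, silence_rms and so on) are hypothetical.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def capture_command(frames, frame_ms=30, max_seconds=5.0,
                    silence_rms=500.0, silence_ms=600):
    """Keep frames until trailing silence follows speech, or until max_seconds.

    Returns (kept_frames, reason), where reason is "silence", "max_length"
    or "stream_ended". The audio is returned in every case, never an error.
    """
    max_frames = int(max_seconds * 1000 / frame_ms)
    needed_quiet = max(1, silence_ms // frame_ms)
    kept, quiet_run, heard_speech = [], 0, False
    for frame in frames:
        kept.append(frame)
        if rms(frame) >= silence_rms:
            heard_speech = True
            quiet_run = 0
        else:
            quiet_run += 1
        if heard_speech and quiet_run >= needed_quiet:
            return kept, "silence"
        if len(kept) >= max_frames:
            return kept, "max_length"
    return kept, "stream_ended"


def trim_quiet(frames, silence_rms=500.0):
    """Drop quiet frames from the beginning and end before sending to STT."""
    loud = [i for i, f in enumerate(frames) if rms(f) >= silence_rms]
    return frames[loud[0]:loud[-1] + 1] if loud else []
```

With a plain energy threshold like this, steady background music can keep every frame above the threshold, so silence is never detected. In that case the 5 second cap is what ends the recording, and the audio still goes on to STT.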
I have talked to @synesthesiam before about doing something similar for voice2json and maybe Rhasspy.
For example, adding an option to send the audio even if a timeout was reached instead of erroring. This would effectively allow doing what I do with sox: configure a short timeout and, if no end of speech was detected when the timeout was reached, try the best STT-wise with what was recorded.
Agreed about the timeout. I also suggested having a max duration setting, as yes, I rarely speak a command longer than 5 or 7 seconds. And yes, it should try to recognise this instead of erroring.
Talking about Snips, it does stop listening when I stop talking, with music at nearly the same level.
Maybe your solution should be implemented into rhasspy for everyone.
Yes, that’s the Kaldi endpoint strategy. It stops not when you stop talking but when the STT thinks it has understood a sensible command from the language model, as the STT happens streaming while you speak. So it doesn’t need silence to stop.
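To make the idea concrete, here is a toy sketch of stopping on a recognised command instead of on silence. It is not Kaldi’s actual endpointing logic: it assumes the recogniser hands back a growing partial transcript while you speak, and the function name first_complete_command and the list of known commands are hypothetical.

```python
def first_complete_command(partials, known_commands):
    """Return (command, index) for the first partial transcript that is a
    complete known command, or (None, None) if none ever matches."""
    # Normalise case and whitespace so "Shut  down" matches "shut down"
    known = {" ".join(c.lower().split()) for c in known_commands}
    for index, text in enumerate(partials):
        candidate = " ".join(text.lower().split())
        if candidate in known:
            # Stop listening here: no trailing silence is needed
            return candidate, index
    return None, None


# Partial transcripts as they might arrive while someone is still speaking
partials = ["shut", "shut down", "shut down the", "shut down the music"]
commands = ["shut down the music", "turn on the light"]
print(first_complete_command(partials, commands))
```

One obvious weakness of such a naive version is a command that is also the start of a longer one, since it would stop too early.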
Voice2json actually does something like this already but I don’t use it for robustness reasons.
Record with arecord, positioning yourself how you wish to control Rhasspy, with your music or whatever playing, and post it on here.
Do what I do and just paste it to google drive.
But as far as I am concerned with current Rhasspy setup if you are sat by a speaker playing music and you are trying to control Rhasspy you haven’t a chance because of a total lack of any audio processing.
I never used Snips, but it sounds like they must have had some form of audio processing to enable that.
Kibo, did you play the music through Snips or just have the hifi on? If it is playing from Rhasspy we can do something about that.
I never play audio from Snips or Rhasspy. It is always on a separate hifi amp.
We very rarely watch tv but very often have background music. I guess that’s why this is important for us (family) and maybe less for others.
Yes, I will do such a recording tomorrow.
Wow ! Now I understand why snips works so well compared to rhasspy
It’s much more than that. They had a whole team working on this. They optimized every part of the pipeline, probably doing audio pre-processing as @rolyan_trauts proposed, having custom language and acoustic models, and so on. Sonos didn’t pay for nothing. They were a talented bunch.