Better hotword detection during music playback

Hi @all,

Now that my radio streaming is working and I’ve managed to mute mopidy on hotword detection and unmute it once playFinished is published, I’ve got another problem.

As soon as the radio is running I need to speak directly into the mic, even though the speakers are about a meter away. I don’t understand why that is. Is there any way to improve this?

I’m using a Respeaker USB Mic Array.

Any hints would be appreciated since I’m a total noob on mic stuff.

EDIT: I’ve played around with the sensitivity values of the jarvis snowboy model and it’s getting better, but it’s still annoying.

You need to play the music/radio stream through the Respeaker 3.5mm jack audio output so that the embedded AEC algo kicks in during playback and removes the playback from the audio capture.

Without AEC, it won’t work, even if you increase the hotword sensitivity.

Check out these recent topics about AEC for more detail:


Hope this helps.

Also check that you are capturing from the dedicated ASR input channel of the Respeaker (with noise suppression, AEC and beam forming) and not from the raw mic input channels.

The 1 channel firmware only provides this special channel as input.

The 6 channel firmware provides this special channel plus 4 raw mic channels. The last channel is the playback loopback and is not used by Rhasspy.

For simplicity’s sake, either flash the 1 channel firmware or configure your ALSA devices to capture from this specific channel.

Hope this helps.
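
For anyone staying on the 6 channel firmware, here is a rough Python sketch of what "capture only the processed channel" means at the sample level. It assumes 16-bit interleaved PCM with the processed signal on channel 0, as described above. The function names are made up for illustration, and in practice an ALSA PCM defined in asound.conf would do this job for you.

```python
import array
import wave


def extract_channel(interleaved: bytes, num_channels: int, channel: int = 0) -> bytes:
    """Return a single channel from interleaved 16-bit PCM data."""
    samples = array.array("h")          # "h" = signed 16-bit samples
    samples.frombytes(interleaved)
    # Interleaved layout is ch0, ch1, ..., chN-1, ch0, ch1, ...
    # so every num_channels-th sample belongs to the same channel.
    return samples[channel::num_channels].tobytes()


def extract_wav_channel(src_path: str, dst_path: str, channel: int = 0) -> None:
    """Copy one channel of a multi-channel 16-bit WAV into a mono WAV."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit samples")
        num_channels = src.getnchannels()
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(extract_channel(frames, num_channels, channel))
```

Running extract_wav_channel on a 6 channel test recording and listening to the result is a quick way to hear what the processed channel sounds like on its own.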

I haven’t got a USB 4 mic, but AEC is built into that mic. I also think it has a loopback channel that you can use as the ‘echo’ channel to be removed.

The software EC is for those without hardware EC, but your mic should have that.
I think on the standard settings it’s just channel 0 that has the processing; the other channels are just raw mic.
So unless you pull channel 0 only, you could be mixing the echo back in.

You really need to have a look at http://wiki.seeedstudio.com/ReSpeaker-USB-Mic-Array/
I don’t have a USB 4 mic so I can’t say for sure, but I think @fastjack has one.
There are far too many settings for someone who has never seen that particular mic.
There really is a lot going on with that mic.

pi@raspberrypi:~/usb_4_mic_array $ python tuning.py -p
name            type    max min r/w info
-------------------------------
AECFREEZEONOFF      int 1   0   rw  Adaptive Echo Canceler updates inhibit.
                                                            0 = Adaptation enabled
                                                            1 = Freeze adaptation, filter only
AECNORM             float   16  0.25    rw  Limit on norm of AEC filter coefficients
AECPATHCHANGE       int 1   0   ro  AEC Path Change Detection.
                                                            0 = false (no path change detected)
                                                            1 = true (path change detected)
AECSILENCELEVEL     float   1   1e-09   rw  Threshold for signal detection in AEC [-inf .. 0] dBov (Default: -80dBov = 10log10(1x10-8))
AECSILENCEMODE      int 1   0   ro  AEC far-end silence detection status. 
                                                            0 = false (signal detected) 
                                                            1 = true (silence detected)
AGCDESIREDLEVEL     float   0.99    1e-08   rw  Target power level of the output signal. 
                                                            [−inf .. 0] dBov (default: −23dBov = 10log10(0.005))
AGCGAIN             float   1000    1   rw  Current AGC gain factor. 
                                                            [0 .. 60] dB (default: 0.0dB = 20log10(1.0))
AGCMAXGAIN          float   1000    1   rw  Maximum AGC gain factor. 
                                                            [0 .. 60] dB (default 30dB = 20log10(31.6))
AGCONOFF            int 1   0   rw  Automatic Gain Control. 
                                                            0 = OFF 
                                                            1 = ON
AGCTIME             float   1   0.1 rw  Ramps-up / down time-constant in seconds.
CNIONOFF            int 1   0   rw  Comfort Noise Insertion.
                                                            0 = OFF
                                                            1 = ON
DOAANGLE            int 359 0   ro  DOA angle. Current value. Orientation depends on build configuration.
ECHOONOFF           int 1   0   rw  Echo suppression.
                                                            0 = OFF
                                                            1 = ON
FREEZEONOFF         int 1   0   rw  Adaptive beamformer updates.
                                                            0 = Adaptation enabled
                                                            1 = Freeze adaptation, filter only
FSBPATHCHANGE       int 1   0   ro  FSB Path Change Detection.
                                                            0 = false (no path change detected)
                                                            1 = true (path change detected)
FSBUPDATED          int 1   0   ro  FSB Update Decision.
                                                            0 = false (FSB was not updated)
                                                            1 = true (FSB was updated)
GAMMAVAD_SR         float   1000    0   rw  Set the threshold for voice activity detection.
                                                            [−inf .. 60] dB (default: 3.5dB 20log10(1.5))
GAMMA_E             float   3   0   rw  Over-subtraction factor of echo (direct and early components). min .. max attenuation
GAMMA_ENL           float   5   0   rw  Over-subtraction factor of non-linear echo. min .. max attenuation
GAMMA_ETAIL         float   3   0   rw  Over-subtraction factor of echo (tail components). min .. max attenuation
GAMMA_NN            float   3   0   rw  Over-subtraction factor of non- stationary noise. min .. max attenuation
GAMMA_NN_SR         float   3   0   rw  Over-subtraction factor of non-stationary noise for ASR. 
                                                            [0.0 .. 3.0] (default: 1.1)
GAMMA_NS            float   3   0   rw  Over-subtraction factor of stationary noise. min .. max attenuation
GAMMA_NS_SR         float   3   0   rw  Over-subtraction factor of stationary noise for ASR. 
                                                            [0.0 .. 3.0] (default: 1.0)
HPFONOFF            int 3   0   rw  High-pass Filter on microphone signals.
                                                            0 = OFF
                                                            1 = ON - 70 Hz cut-off
                                                            2 = ON - 125 Hz cut-off
                                                            3 = ON - 180 Hz cut-off
MIN_NN              float   1   0   rw  Gain-floor for non-stationary noise suppression.
                                                            [−inf .. 0] dB (default: −10dB = 20log10(0.3))
MIN_NN_SR           float   1   0   rw  Gain-floor for non-stationary noise suppression for ASR.
                                                            [−inf .. 0] dB (default: −10dB = 20log10(0.3))
MIN_NS              float   1   0   rw  Gain-floor for stationary noise suppression.
                                                            [−inf .. 0] dB (default: −16dB = 20log10(0.15))
MIN_NS_SR           float   1   0   rw  Gain-floor for stationary noise suppression for ASR.
                                                            [−inf .. 0] dB (default: −16dB = 20log10(0.15))
NLAEC_MODE          int 2   0   rw  Non-Linear AEC training mode.
                                                            0 = OFF
                                                            1 = ON - phase 1
                                                            2 = ON - phase 2
NLATTENONOFF        int 1   0   rw  Non-Linear echo attenuation.
                                                            0 = OFF
                                                            1 = ON
NONSTATNOISEONOFF   int 1   0   rw  Non-stationary noise suppression.
                                                            0 = OFF
                                                            1 = ON
NONSTATNOISEONOFF_SR    int 1   0   rw  Non-stationary noise suppression for ASR.
                                                            0 = OFF
                                                            1 = ON
RT60                float   0.9 0.25    ro  Current RT60 estimate in seconds
RT60ONOFF           int 1   0   rw  RT60 Estimation for AES. 0 = OFF 1 = ON
SPEECHDETECTED      int 1   0   ro  Speech detection status.
                                                            0 = false (no speech detected)
                                                            1 = true (speech detected)
STATNOISEONOFF      int 1   0   rw  Stationary noise suppression.
                                                            0 = OFF
                                                            1 = ON
STATNOISEONOFF_SR   int 1   0   rw  Stationary noise suppression for ASR.
                                                            0 = OFF
                                                            1 = ON
TRANSIENTONOFF      int 1   0   rw  Transient echo suppression.
                                                            0 = OFF
                                                            1 = ON
VOICEACTIVITY       int 1   0   ro  VAD voice activity status.
                                                            0 = false (no voice activity)
                                                            1 = true (voice activity)
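
The info column mixes linear values with dB figures, which can be confusing when setting a parameter. As a small sketch (the helper names are my own, not part of tuning.py), these are the two conversions the table uses: 20log10 for gain factors and 10log10 for power levels in dBov.

```python
import math


def gain_to_db(gain: float) -> float:
    """Amplitude gain factor to dB, as used for AGCGAIN, AGCMAXGAIN, MIN_NS..."""
    return 20 * math.log10(gain)


def db_to_gain(db: float) -> float:
    """dB to amplitude gain factor (inverse of gain_to_db)."""
    return 10 ** (db / 20)


def power_to_db(power: float) -> float:
    """Power level to dB, as used for AECSILENCELEVEL and AGCDESIREDLEVEL."""
    return 10 * math.log10(power)


def db_to_power(db: float) -> float:
    """dB to power level (inverse of power_to_db)."""
    return 10 ** (db / 10)


if __name__ == "__main__":
    # Values taken from the defaults quoted in the table above.
    print(round(gain_to_db(31.6), 1))     # AGCMAXGAIN default
    print(round(power_to_db(0.005), 1))   # AGCDESIREDLEVEL default
    print(round(power_to_db(1e-8), 1))    # AECSILENCELEVEL default
```

So a value you pass to tuning.py is the linear number, and the dB figure in the description is what it corresponds to.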

http://wiki.seeedstudio.com/ReSpeaker-USB-Mic-Array/

That wiki page is probably going to be essential for you ^

I’m quite happy that it gives some example wavs, though to be honest the hardware AEC is not much better, if at all, than software. But if you have hardware you shouldn’t need software.

This always confuses me: things like the pulseaudio webrtc-aec plugin have VAD, but they don’t tell you where you can access that status?!
If you have AEC running then the echo should be attenuated and VAD should only pick up on spoken voice that isn’t echo, as voices in the media can be a real problem.

If you can work out how to turn on EC, use only channel 0 for recording, and access the VAD status, you then need to write something to mute the media on VAD, and then you’re barging in like a good one :slight_smile:
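
To make the "mute media on VAD" idea a bit more concrete, here is a minimal sketch of such a loop. Everything in it is an assumption for illustration: is_voice is any callable that returns the mic's VAD status, set_mute is whatever function you already use to mute and unmute your player, and the timing values are arbitrary.

```python
import time
from typing import Callable, Optional


def mute_on_voice(
    is_voice: Callable[[], int],
    set_mute: Callable[[bool], None],
    poll_interval: float = 0.1,
    release_after: float = 1.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Mute playback while voice is detected, unmute after a quiet period."""
    muted = False
    quiet_time = 0.0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if is_voice():
            quiet_time = 0.0            # voice seen, restart the quiet timer
            if not muted:
                set_mute(True)
                muted = True
        elif muted:
            quiet_time += poll_interval
            if quiet_time >= release_after:
                set_mute(False)         # quiet for long enough, resume playback
                muted = False
        sleep(poll_interval)
```

The release_after delay is there so that short pauses between words do not unmute the music mid-sentence. max_polls and sleep only exist so the loop can be tried without hardware.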

Quite frankly, I did not have to change any of those settings at all. The defaults are pretty good already.

I’ve been able to get the wake word to trigger (Snips) from more than 5 meters away during music playback.

Note that the mic orientation, speaker position and casing structure play a huge part in getting good results.

@fastjack I think something must be wrong with the setup, as yeah, they should just work, but don’t you have to create an asound.conf and just pull channel 0?

I did find the VAD example on the wiki:

from tuning import Tuning
import usb.core
import usb.util
import time

# Find the ReSpeaker USB Mic Array by its USB vendor and product ID
dev = usb.core.find(idVendor=0x2886, idProduct=0x0018)
if dev:
    Mic_tuning = Tuning(dev)
    # Print the VAD status once per second until Ctrl+C is pressed
    while True:
        try:
            print(Mic_tuning.is_voice())
            time.sleep(1)
        except KeyboardInterrupt:
            break

With the 1 channel firmware (which I think is the factory default, though I’m not really sure) you only get one input channel (the processed one with NS, beam forming and AEC). So all good.

If the firmware installed on the Respeaker is the 6 channel one, you have to set up asound.conf to create a PCM that only forwards channel 0 (the processed input signal).

Or flash the 1 channel firmware…

Yeah 1 channel is 1 channel :slight_smile: Doh, but hey, I don’t own one.

With 1_channel_firmware.bin you should be ready to go; with 6_channels_firmware.bin, pull from channel 0 only.

I guess it’s

git clone https://github.com/respeaker/usb_4_mic_array.git
cd usb_4_mic_array
python tuning.py -p

And see what is turned on

Didn’t know that you guys are so fast in answering. Thanks a lot.

Awesome community around here.

That’s the output:

pi@raspberrypi:~/Documents/respeakerStuff/usb_4_mic_array $ python tuning.py -p
name			type	max	min	r/w	info
-------------------------------
AECFREEZEONOFF  	int	1	0	rw	Adaptive Echo Canceler updates inhibit.
                                                            0 = Adaptation enabled
                                                            1 = Freeze adaptation, filter only
AECNORM         	float	16	0.25	rw	Limit on norm of AEC filter coefficients
AECPATHCHANGE   	int	1	0	ro	AEC Path Change Detection.
                                                            0 = false (no path change detected)
                                                            1 = true (path change detected)
AECSILENCELEVEL 	float	1	1e-09	rw	Threshold for signal detection in AEC [-inf .. 0] dBov (Default: -80dBov = 10log10(1x10-8))
AECSILENCEMODE  	int	1	0	ro	AEC far-end silence detection status.
                                                            0 = false (signal detected)
                                                            1 = true (silence detected)
AGCDESIREDLEVEL 	float	0.99	1e-08	rw	Target power level of the output signal.
                                                            [−inf .. 0] dBov (default: −23dBov = 10log10(0.005))
AGCGAIN         	float	1000	1	rw	Current AGC gain factor.
                                                            [0 .. 60] dB (default: 0.0dB = 20log10(1.0))
AGCMAXGAIN      	float	1000	1	rw	Maximum AGC gain factor.
                                                            [0 .. 60] dB (default 30dB = 20log10(31.6))
AGCONOFF        	int	1	0	rw	Automatic Gain Control.
                                                            0 = OFF
                                                            1 = ON
AGCTIME         	float	1	0.1	rw	Ramps-up / down time-constant in seconds.
CNIONOFF        	int	1	0	rw	Comfort Noise Insertion.
                                                            0 = OFF
                                                            1 = ON
DOAANGLE        	int	359	0	ro	DOA angle. Current value. Orientation depends on build configuration.
ECHOONOFF       	int	1	0	rw	Echo suppression.
                                                            0 = OFF
                                                            1 = ON
FREEZEONOFF     	int	1	0	rw	Adaptive beamformer updates.
                                                            0 = Adaptation enabled
                                                            1 = Freeze adaptation, filter only
FSBPATHCHANGE   	int	1	0	ro	FSB Path Change Detection.
                                                            0 = false (no path change detected)
                                                            1 = true (path change detected)
FSBUPDATED      	int	1	0	ro	FSB Update Decision.
                                                            0 = false (FSB was not updated)
                                                            1 = true (FSB was updated)
GAMMAVAD_SR     	float	1000	0	rw	Set the threshold for voice activity detection.
                                                            [−inf .. 60] dB (default: 3.5dB 20log10(1.5))
GAMMA_E         	float	3	0	rw	Over-subtraction factor of echo (direct and early components). min .. max attenuation
GAMMA_ENL       	float	5	0	rw	Over-subtraction factor of non-linear echo. min .. max attenuation
GAMMA_ETAIL     	float	3	0	rw	Over-subtraction factor of echo (tail components). min .. max attenuation
GAMMA_NN        	float	3	0	rw	Over-subtraction factor of non- stationary noise. min .. max attenuation
GAMMA_NN_SR     	float	3	0	rw	Over-subtraction factor of non-stationary noise for ASR.
                                                            [0.0 .. 3.0] (default: 1.1)
GAMMA_NS        	float	3	0	rw	Over-subtraction factor of stationary noise. min .. max attenuation
GAMMA_NS_SR     	float	3	0	rw	Over-subtraction factor of stationary noise for ASR.
                                                            [0.0 .. 3.0] (default: 1.0)
HPFONOFF        	int	3	0	rw	High-pass Filter on microphone signals.
                                                            0 = OFF
                                                            1 = ON - 70 Hz cut-off
                                                            2 = ON - 125 Hz cut-off
                                                            3 = ON - 180 Hz cut-off
MIN_NN          	float	1	0	rw	Gain-floor for non-stationary noise suppression.
                                                            [−inf .. 0] dB (default: −10dB = 20log10(0.3))
MIN_NN_SR       	float	1	0	rw	Gain-floor for non-stationary noise suppression for ASR.
                                                            [−inf .. 0] dB (default: −10dB = 20log10(0.3))
MIN_NS          	float	1	0	rw	Gain-floor for stationary noise suppression.
                                                            [−inf .. 0] dB (default: −16dB = 20log10(0.15))
MIN_NS_SR       	float	1	0	rw	Gain-floor for stationary noise suppression for ASR.
                                                            [−inf .. 0] dB (default: −16dB = 20log10(0.15))
NLAEC_MODE      	int	2	0	rw	Non-Linear AEC training mode.
                                                            0 = OFF
                                                            1 = ON - phase 1
                                                            2 = ON - phase 2
NLATTENONOFF    	int	1	0	rw	Non-Linear echo attenuation.
                                                            0 = OFF
                                                            1 = ON
NONSTATNOISEONOFF	int	1	0	rw	Non-stationary noise suppression.
                                                            0 = OFF
                                                            1 = ON
NONSTATNOISEONOFF_SR	int	1	0	rw	Non-stationary noise suppression for ASR.
                                                            0 = OFF
                                                            1 = ON
RT60            	float	0.9	0.25	ro	Current RT60 estimate in seconds
RT60ONOFF       	int	1	0	rw	RT60 Estimation for AES. 0 = OFF 1 = ON
SPEECHDETECTED  	int	1	0	ro	Speech detection status.
                                                            0 = false (no speech detected)
                                                            1 = true (speech detected)
STATNOISEONOFF  	int	1	0	rw	Stationary noise suppression.
                                                            0 = OFF
                                                            1 = ON
STATNOISEONOFF_SR	int	1	0	rw	Stationary noise suppression for ASR.
                                                            0 = OFF
                                                            1 = ON
TRANSIENTONOFF  	int	1	0	rw	Transient echo suppression.
                                                            0 = OFF
                                                            1 = ON
VOICEACTIVITY   	int	1	0	ro	VAD voice activity status.
                                                            0 = false (no voice activity)
                                                            1 = true (voice activity)

What do I need to do now?

Well, that is really weird, as the output doesn’t show the current settings.

Presumably it’s sudo python tuning.py ECHOONOFF 1 to turn echo suppression on.
I also don’t know how you tell whether you have the 1 channel or 6 channel firmware running.
But I guess if you plug it into USB and record 6 channels in Audacity, then if there is audio in any channel other than the first, you have the 6 channel firmware.
Then you have to flash the mic with the firmware you want:
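
Since the parameter list only shows names, ranges and descriptions, here is a small sketch for reading the values that are currently set. It assumes the Tuning class from the repo's tuning.py exposes read(name), which is the same class the wiki's VAD example uses; the helper name and the choice of parameters are just for illustration.

```python
from typing import Any, Dict, Iterable


def read_current_values(tuning: Any, names: Iterable[str]) -> Dict[str, Any]:
    """Read the current value of each named parameter from a Tuning object."""
    return {name: tuning.read(name) for name in names}


# Hardware usage (needs pyusb and tuning.py from the usb_4_mic_array repo):
#   import usb.core
#   from tuning import Tuning
#   dev = usb.core.find(idVendor=0x2886, idProduct=0x0018)
#   if dev:
#       names = ["ECHOONOFF", "AGCONOFF", "HPFONOFF"]
#       print(read_current_values(Tuning(dev), names))
```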

pip install pyusb
python dfu.py --download new_firmware.bin 

From https://github.com/respeaker/usb_4_mic_array
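
If you would rather not eyeball the tracks in Audacity, the "is there audio in the other channels" check mentioned above can be roughed out in Python. This is only a sketch: it assumes 16-bit interleaved samples from a 6 channel recording, and the threshold of 100 is an arbitrary guess at what counts as more than silence.

```python
import array
from typing import List


def channel_peaks(interleaved: bytes, num_channels: int) -> List[int]:
    """Return the peak absolute sample value for each channel."""
    samples = array.array("h")          # signed 16-bit samples
    samples.frombytes(interleaved)
    return [
        max((abs(s) for s in samples[ch::num_channels]), default=0)
        for ch in range(num_channels)
    ]


def active_channels(interleaved: bytes, num_channels: int,
                    threshold: int = 100) -> List[int]:
    """Return the indexes of channels whose peak exceeds the threshold."""
    peaks = channel_peaks(interleaved, num_channels)
    return [ch for ch, peak in enumerate(peaks) if peak > threshold]
```

If only channel 0 comes back as active, you are most likely on the 1 channel firmware; if several channels are active, it is the 6 channel one.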

Also record a test wav while you are playing something. Grab a sample file first:
wget https://file-examples.com/wp-content/uploads/2017/11/file_example_WAV_10MG.wav
Start recording: arecord -Dplughw:1 -f S16_LE -r16000 rec.wav
While that is running, play the sample: aplay -Dplughw:1 file_example_WAV_10MG.wav

Then play back what you got: aplay -Dplughw:1 rec.wav

That’s presuming it’s card 1; use aplay -l to list the cards or aplay -L to list the PCMs.

You can also clone the audio channel so it outputs on 2 devices simultaneously.

That’s what I did with my mmbox and snips. The output audio is cloned to the Respeaker 3.5mm output (even if nothing is plugged in) and to the HDMI output.

So I can use the Respeaker AEC.

I wrote a French blog article about that: https://www.coxprod.org/domotique/aec-et-reduction-de-bruit-avec-le-respeaker-mic-array-v2-0/

Ced

I was not able to play audio on both the 3.5mm and Respeaker audio devices due to clock drift.

The Raspberry Pi HDMI audio output must handle the clock drift better than the 3.5mm output then…

I did not think of using HDMI to output playback :slight_smile:

This is very interesting!

How does it behave over time?
Any over/under runs during a long playback?

I’m not as expert as you :slight_smile:

I did some tests and the results were correct.

Afterwards, I saved all snips audio requests as wav files for a few days with AEC and another few days without.

I noticed a better comprehension with AEC but less low volume.

If you give me some tests to run to get scientific values, I’ll be happy to do them for you :slight_smile:

Ced

I had the same clock drift problem with SpeexDSP, but I used a very bad source for the mic: the clock drift between a USB mic and the Pi 3.5mm output completely killed AEC, even webrtc AEC, which is supposed to have drift compensation.

After successful AEC with Speex and a Respeaker 2 mic, where clock drift is not an issue, I was extremely interested in trying webrtc, even if it meant installing pulseaudio.
Pulseaudio and the Respeakers don’t seem to like each other, and if I have to be honest, I like the hardware of the 2 mic but think the drivers stink.

The Pi 3.5mm is a bit of a stinker but I am thinking if you get 2x I2S microphones such as.

You may well be able to use them in conjunction with the Pi 3.5mm or even a DAC on the other I2S, as I think they all share the Pi clock.
It’s a shame I2S mics don’t seem to work on the Pi 4?! But they do on the 3/Zero, or so Adafruit say.

I will come back to you on those results :slight_smile:

Also, webrtc audio processing is on PyPI and someone has built a webrtcvad off it, but it would be amazing if someone could do a similar job to the voiceen/ec project and create a webrtc ALSA FIFO EC with all the bells and whistles of the audio processing lib.
webrtc is supposed to have a superior AEC algorithm to the Speex one.