Importance of the microphone array (wake word -> ASR)

Hi,

While reading an article, I saw a user comment saying:

The array mics are a critical piece of cloning a Home/Echo. They work to remove noise and make speech recognition possible. Without the array mic, voice commands will only work from a couple of feet away. You can make the noise removal work with only two mics, but more is better.

The way this works is by finding the hot word in the mic input. The algorithm needs to locate the hot word in both mic input streams. Once it locates both copies of the hot word, it computes a phase delay between them. That phase delay is used to remove all other audio that does not have the same phase delay. For example, something making noise (a TV) at another location in the room will have a different phase delay and will get removed from the input data. With more mics you can make this process better. With one mic you can't do it at all.

This seems better than plain echo cancellation, as it appears able to remove sounds that originate from sources other than the unit itself.

Is this correct? Is this possible, and has this actually been done by someone?

It seems this will require some communication between the wake word detection module and the ASR module, and possibly quite a tight integration of these parts with the mic array and the data from the individual microphones.
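To make the idea concrete, here is a rough sketch of how I imagine the first step could look in Python (my own assumption, nothing that exists in Rhasspy today): estimate the delay between two mic channels over the wake word window with GCC-PHAT, a common cross-correlation method. The function name, window length and mic spacing below are all placeholders.

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None):
    """Estimate the time delay (seconds) of `sig` relative to `ref` with GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    # Whitened cross-spectrum: keep only the phase information
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-centre the cross-correlation around zero lag and pick the peak
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

# Hypothetical usage: `left` and `right` are the second or so of audio around
# the detected wake word from each mic of a 2-mic array. For mics ~5 cm apart
# the true delay can never exceed ~0.15 ms, hence the max_tau bound.
# tau = gcc_phat(left, right, fs=16000, max_tau=0.05 / 343.0)
```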

What do you think? How could this be implemented in Rhasspy?

There are some closed-source beamforming algs, but the nearest we have are ODAS & PyRoomAcoustics.

There was always this lib, but as far as I know no one has ever got it up and running.

Without the DSP algs an array is pointless; in fact summed arrays just create 1st-order high-pass filters when side on.
The standard operating system of a Pi doesn't have the clock speed and constant sync that even some pretty low-end DSP microcontrollers running an RTOS have.

Bespoke silicon such as Xmos https://www.xmos.ai/file/xvf3500-dsp-databrief/ provides pretty good results, but there is no comparison between an application processor such as a Pi and a dedicated DSP processor such as the Xmos that turns up in a few beamforming microphones currently on sale.

Beamforming alone only works up to a threshold of noise and then begins to fail; state of the art is to combine beamforming with blind source separation and also AI noise reduction such as Nvidia RTX Voice.

Array mics are not the critical piece of hardware; it's the DSP algs and the DSP hardware to run them. There have been a range of array mics for the Pi for several years and no open-source software to drive them, so that might give you the answer.

I keep meaning to have a go at compiling that BTK2.0, but its methods are quite old and still fail beyond a noise threshold.
Probably the easiest and cheapest way is to have distributed mics, so that on at least one of them voice = near and noise = far.

There is a lot of work on AI denoise, such as

Haven't checked it out, as it's PyTorch, and PyTorch audio seems vendor-locked to either Intel or Nvidia math libs.

This can supposedly run on a Pi.

Roylan, thanks for providing such great resources. Yes, the algorithms are what matter; the array is the simpler component in the system.

I guess the point I was trying to convey is that wake word detection has to be array-aware to get us to “state of the art” performance. Currently, Porcupine just works with a single mic. It has no capability to listen to several streams, detect the wake word, and then sync the streams in a specific direction.

To get to the next level, wake word detection needs to also output the phase differences, or the direction the trigger is coming from. This can then be used to generate better and cleaner input for the ASR.
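As a rough illustration of what I mean (my own sketch, not anything existing: it assumes a 2-mic array and a delay `tau` estimated over the wake word window, e.g. with GCC-PHAT as in my first post), simply aligning the channels towards the wake word direction before summing already gives the ASR a somewhat cleaner signal:

```python
import numpy as np

def delay_and_sum(ch0, ch1, tau, fs=16000):
    """Steer a 2-mic array towards the wake word direction.

    `tau` is the estimated delay of ch0 relative to ch1 in seconds. Aligning
    the channels before summing reinforces the talker, while sources arriving
    with a different delay (TV, hi-fi, ...) are only partially reinforced.
    """
    shift = int(round(tau * fs))
    if shift > 0:        # ch0 lags: advance it by dropping its first samples
        ch0 = ch0[shift:]
    elif shift < 0:      # ch1 lags: advance it instead
        ch1 = ch1[-shift:]
    n = min(len(ch0), len(ch1))
    return 0.5 * (ch0[:n] + ch1[:n])

# Hypothetical flow: the wake word module reports `tau` together with the hit,
# and everything recorded afterwards is fed to ASR as delay_and_sum(mic0, mic1, tau).
```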

Yeah, but I have a hunch that Porcupine is a clean model, maybe expecting clean input that needs preprocessing as you say.
I have been playing with the Google-KWS state-of-the-art NN models, and actually you can train a KW to be resilient to noise if you can train the noise in (rough sketch below).
I am sort of getting sick of watching my paltry 1050 Ti plod through a training run, but I have had some really good results with a single mic.
Also, reverting from state-of-the-art array mics back to old unidirectional electrets, when using a single mic, lets you harness the natural 'beamforming' they have.
Most people running a single mic probably have the wrong type: an omnidirectional that will pick up sound (noise) equally from all directions.
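To show what I mean by training the noise in, here is a toy sketch of mixing background noise into a keyword clip at a chosen SNR before training. The function name, SNR values and file handling are just placeholders, not part of any existing toolkit:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise clip into a keyword sample at the requested SNR in dB."""
    # Loop/trim the noise so it covers the whole keyword clip
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    mixed = speech + gain * noise
    # Normalise only if the mix would clip 16-bit audio on write-back
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# Placeholder usage: augment every keyword clip at a few SNRs (say 20, 10 and 5 dB)
# with household noise (TV, music, kitchen) so the model learns to ignore it.
```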

Raven doesn't cope with noise, as it just captures a few samples of your voice and goes, but it should be great as a quick start for capturing a dataset.
Someone somewhere really needs to spruce up the capture logic, as the ASR 'command' voice is not part of the capture, which leaves you with maybe adding the universal Google command set as the 'not keyword' class.

Raven captures plus the returned ASR audio make state-of-the-art custom datasets, because the dataset is exactly how you use voice commands.
With that type of dataset you could make an extremely resilient KWS with something you can train, such as Precise.
As I say, I have been playing with

As the streaming CRNN tensorflow-lite model, it blows away Precise on accuracy & load, and I have been getting to grips with model creation and parameters.

I did expect to see some Arm boards with embedded DSP by now, and I'm sort of wondering if Covid has derailed much, as the DSP is not expensive, or whether the big guys in the game have control over the market.
Rockchip & Allwinner did threaten some SoCs, but I haven't seen anything available in a form we can use.

Raven is a special case for quick KW & dataset capture, if anyone does make it more robust, but a major part of the problem is that the KWS we have and the models they employ are far from state-of-the-art or self-learning, as they come with a static model.

State-of-the-art is returning to a single mic, so manufacturers can make next-gen phones without the need for multiple mics and sound access holes.
State-of-the-art room pickup is multiple beamforming/BSS distributed mics, as far-field pickup in an excess noise field is just pointless, trying to find that meagre ripple in the air.
It also comes down to what the noise is, which in a domestic situation is mainly media, and the solution to that is merely volume ducking and AEC.

The above KWS tensorflow-lite model can do 20 ms streaming, provide an extremely accurate envelope, and run in less than 20% single-core load on a Pi3A+.
So yeah, you can run 2x KWS on a Pi3 and more, just not with the KWS we have now. Porcupine is an OK KWS, but it gives zero metrics out other than that a KW has been hit, so it's relatively useless for this.
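If anyone wants to check the load figures, this is roughly all the runtime side amounts to: a bare-bones loop feeding 20 ms blocks into a streaming tflite model. The model file name, tensor shapes, threshold and the use of sounddevice are my assumptions; it presumes a kws_streaming export that keeps its streaming state internally.

```python
import numpy as np
import sounddevice as sd
import tflite_runtime.interpreter as tflite  # or tf.lite.Interpreter on a desktop

FS = 16000
BLOCK = 320  # 20 ms at 16 kHz

interpreter = tflite.Interpreter(model_path="stream_state_internal.tflite")  # assumed file name
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def on_audio(indata, frames, time_info, status):
    # Each 20 ms block goes straight into the streaming model; the model keeps
    # its own state, so every call returns an updated set of label probabilities.
    block = indata[:, 0].astype(np.float32).reshape(inp["shape"])
    interpreter.set_tensor(inp["index"], block)
    interpreter.invoke()
    probs = interpreter.get_tensor(out["index"])[0]
    if probs[-1] > 0.9:  # placeholder threshold and keyword label index
        print("KW hit", probs[-1])

with sd.InputStream(samplerate=FS, blocksize=BLOCK, channels=1,
                    dtype="float32", callback=on_audio):
    sd.sleep(60_000)  # listen for a minute
```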

State-of-the-art with KWS is https://github.com/google-research/google-research/tree/master/kws_streaming; my repo is just an adaptation of that to try and make it a little easier to use.

State-of-the-art is likely to be a single brain fed from multiple distributed mics/arrays, with the best signal picked and likely further preprocessing, but that's probably a strange focus when the KWS we have are so far away from state-of-the-art, and even the ASR, as state-of-the-art ASR can understand speech through noise.

One of the latest speech toolkits has tools for beamforming, but it's Intel/Nvidia based, so likely Jetson or above and not for the likes of something with 'Rhass' in the name, as for voice AI the Raspberry Pi is really the wrong platform.

PS: with the above KWS, as I just tested it, I can have Skill4U wittering on an Anker PowerConf at 75 dB measured at the mic and still trigger my KW accurately.
When he continues, the KWS still classifies that as silence, as it isn't me speaking.

It's a single Boya unidirectional shotgun desk mic plugged into a $2 USB sound card.

https://drive.google.com/file/d/1BtccaTw4R50DSBaXGZ0n2hSAAjJfW3dL/view?usp=sharing

Feed the above into the Google-KWS and, with the right training and dataset, it will recognise the KW 'Raspberry'.

With a unidirectional mic the best results are when the noise source is 'behind' it and you are in front, which can often be a common placement relative to the likes of a TV or hi-fi.
The above is probably the worst scenario, as the PowerConf speaker is just to my side and as much 'in front' as I am, so I had to raise my voice slightly, but it still recognised the KW each time.