Mics and audio processing algorithms

I created this thread as things can get a bit long-winded, but there is a huge problem with Raspberry Pi stock: probably the best solution for a KWS/satellite (Pi02W) likely will not be seen until 2024, and due to the Pi0W's bump to $15 it's not going to be at that price any more.

We do some strange things with what we have, as the 2-mic HAT is planar and set up without beamforming, as a conference-style mic, whilst really it should be vertical: the forward-facing ports and PCB would give some rear rejection and increase the beamforming effect.
I have done a realtime beamformer that will run on a Pi3: https://github.com/StuartIanNaylor/ProjectEars/tree/main/ds
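For a rough idea of what delay-sum does (this is an illustrative sketch, not the ProjectEars code; the frame length, sample rate and integer-sample steering are all simplifying assumptions), you estimate the inter-mic delay per frame and then shift-and-average the channels:

```python
# Minimal delay-sum sketch for a 2-mic array (illustrative, not the
# ProjectEars implementation). Estimates the inter-mic delay per frame
# with GCC-PHAT, then aligns and averages the channels.
import numpy as np

def gcc_phat_delay(a: np.ndarray, b: np.ndarray, max_delay: int) -> int:
    """Return the shift in samples to apply to b to align it with a."""
    n = 2 * len(a)
    r = np.fft.rfft(a, n) * np.conj(np.fft.rfft(b, n))
    r /= np.abs(r) + 1e-12            # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(r, n)
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(np.abs(cc))) - max_delay

def delay_sum(frame: np.ndarray, max_delay: int = 8) -> np.ndarray:
    """frame: (n_samples, 2) float array. Returns the beamformed mono frame."""
    d = gcc_phat_delay(frame[:, 0], frame[:, 1], max_delay)
    aligned = np.roll(frame[:, 1], d)      # integer-sample steering; real code
    return 0.5 * (frame[:, 0] + aligned)   # would use fractional delays

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sig = rng.standard_normal(1024)
    stereo = np.stack([sig, np.roll(sig, -3)], axis=1)  # simulate 3-sample TDOA
    out = delay_sum(stereo)
    print("estimated delay:", gcc_phat_delay(stereo[:, 0], stereo[:, 1], 8))
```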

I am pretty sure I can extend that at no extra load to work on a 4-mic board, but the geometry of the Respeaker boards is just totally wrong for delay-sum; worse than that, the mic order seems to be random due to TDM mode, so I just haven't bothered.
The single I2S port on Pis is a bit of an Achilles heel when it comes to sound, and Respeaker make things they can sell irrespective of whether they do it well or function as you might expect.

You can get a pixel ring anywhere, and you may have several ears as distributed room microphones but want a single indicator to stop things looking like a laser disco; as a mic board, though, really only the 2-mic is of any worth with the algorithms we currently have.

We also don't seem to have any method for choosing the best stream from distributed room microphones, and due to Pi stock problems we don't really have a cost-effective way to do this, which is why I am gravitating to the ESP32-S3: it is available, and at a price level where the simple physics of distributed mics, where one is always near the voice and far from the noise, can rival and beat the latest and greatest silicon tech.
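A best-stream picker does not need to be clever; a sketch of the idea, with a made-up snr_proxy() score standing in for whatever each ear reports (a KWS confidence would work just as well), could look like this:

```python
# Hypothetical best-stream picker for distributed mics: at KW time, score a
# short window from each satellite and forward only the winner's stream.
# snr_proxy() is a crude placeholder score (window energy over a tracked
# noise floor); a real system might use KWS confidence instead.
import numpy as np

def snr_proxy(window: np.ndarray, noise_floor: float) -> float:
    """Crude stream score: RMS of the KW window over a tracked noise floor, in dB."""
    rms = np.sqrt(np.mean(window ** 2) + 1e-12)
    return 20 * np.log10(rms / (noise_floor + 1e-12))

def pick_best_stream(windows: dict[str, np.ndarray],
                     noise_floors: dict[str, float]) -> str:
    """windows: satellite id -> audio captured around the keyword hit."""
    scores = {sid: snr_proxy(w, noise_floors[sid]) for sid, w in windows.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    near = 0.5 * rng.standard_normal(16000)   # loud capture, mic near the voice
    far = 0.05 * rng.standard_normal(16000)   # quiet capture, distant mic
    best = pick_best_stream({"kitchen": near, "hall": far},
                            {"kitchen": 0.01, "hall": 0.01})
    print("forwarding stream from:", best)    # -> kitchen
```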

Also, the AEC we have is not cancellation but attenuation, and the plastic cases often used with all-in-one speakers create a soundbox where the echo is absolutely huge relative to the voice; even best-in-class AEC would struggle with that. It's so much easier just to separate things into wireless audio on a base unit with separate distributed ears.
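To make the cancellation-versus-attenuation distinction concrete: a real echo canceller adaptively models the speaker-to-mic path and subtracts the predicted echo, rather than just ducking the mic level. A toy NLMS sketch (filter length, step size and the simulated room path are all assumed values):

```python
# Toy NLMS echo canceller: model the speaker->mic path with an adaptive FIR
# and subtract the predicted echo from the mic signal. True cancellation
# needs this echo-path model; attenuation just turns the mic down.
import numpy as np

def nlms_aec(mic: np.ndarray, ref: np.ndarray,
             taps: int = 128, mu: float = 0.5) -> np.ndarray:
    w = np.zeros(taps)                     # adaptive estimate of the echo path
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = ref[n - taps + 1:n + 1][::-1]  # most recent reference samples
        e = mic[n] - w @ x                 # mic minus predicted echo
        w += mu * e * x / (x @ x + 1e-8)   # normalised LMS update
        out[n] = e
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    ref = rng.standard_normal(48000)                  # far-end (speaker) audio
    path = rng.standard_normal(64) * np.exp(-np.arange(64) / 10)
    echo = np.convolve(ref, path)[:48000]             # simulated room echo
    mic = echo + 0.01 * rng.standard_normal(48000)    # mic = echo + faint voice
    clean = nlms_aec(mic, ref)
    print("echo power before: %.4f, after: %.4f"
          % (np.mean(mic[4000:] ** 2), np.mean(clean[4000:] ** 2)))
```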

I did a dataset builder, which is merely a CLI kiosk to capture spoken words for a KWS: the KW and words from phonetic pangrams are displayed on screen and automatically recorded.
It makes an absolutely amazing KWS, but only for the voice that does the recording: https://github.com/StuartIanNaylor/Dataset-builder
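The kiosk itself is simple; a minimal sketch of the idea (not the Dataset-builder code itself, and assuming the sounddevice and soundfile packages with a default input device) would be:

```python
# Minimal sketch of the CLI-kiosk idea: display each prompt word, record a
# short clip, and file it under the word's label so a streaming-KWS trainer
# can consume the folder layout. Sample rate, clip length and the prompt
# list are assumed values.
import os
import sounddevice as sd
import soundfile as sf

RATE = 16000          # assumed sample rate
CLIP_SECS = 1.5       # assumed clip length per prompt
PROMPTS = ["heypi", "quick", "onyx", "goblins"]   # KW + pangram words

def kiosk(prompts, takes=3, outdir="dataset"):
    for word in prompts:
        os.makedirs(os.path.join(outdir, word), exist_ok=True)
        for take in range(takes):
            input(f"Press Enter and say: '{word}' ({take + 1}/{takes})")
            audio = sd.rec(int(RATE * CLIP_SECS), samplerate=RATE,
                           channels=1, dtype="float32")
            sd.wait()                      # block until the clip is captured
            path = os.path.join(outdir, word, f"{word}_{take}.wav")
            sf.write(path, audio, RATE)
            print("saved", path)

if __name__ == "__main__":
    kiosk(PROMPTS)
```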

But the same could be done for VAD, where likely the effect will be the same: sensitive to the enrolled voice and less so when another voice speaks, so a dynamic central KWS could be used for KW authentication to kickstart subsequent command processing. You can capture audio of use and do on-device training, as smaller-weight models or n-grams can bias bigger models to give the results you want. This is just an example of how varied the choice of infrastructure could be.
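As a sketch of the KW-authentication part: enrol an average voice embedding from the captured audio, and only kickstart command processing when a new utterance is close enough. The embed() function and the threshold below are placeholders; a real system would use a trained speaker or KWS embedding model:

```python
# Hypothetical KW-authentication gate: compare an utterance embedding with
# the enrolled speaker's average embedding and only accept the following
# command audio when the cosine similarity clears a (tuned) threshold.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def embed(audio: np.ndarray) -> np.ndarray:
    """Placeholder embedding: log-magnitude spectrum summary. A real system
    would use a trained speaker/KWS embedding model here."""
    spec = np.abs(np.fft.rfft(audio, 512))
    return np.log(spec + 1e-6)

def authenticate_kw(audio: np.ndarray, enrolled: np.ndarray,
                    threshold: float = 0.85) -> bool:
    return cosine(embed(audio), enrolled) >= threshold

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    enrol_clips = [rng.standard_normal(4000) for _ in range(5)]
    enrolled = np.mean([embed(c) for c in enrol_clips], axis=0)
    probe = rng.standard_normal(4000)
    print("accept command audio:", authenticate_kw(probe, enrolled))
```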

Really, though, the options are the 2-mic HAT, the Plugable USB Audio Adapter, and an ESP32-S3 product if one can be made (you can use an ESP32, but even a basic CNN is a tight squeeze; the S3's vector instructions can give up to a x10 ML boost).

That is because you can create a directionally locked beamformer for the current command sentence, which is quite an important difference from many conference mics and likely the way a voice-assistant mic array should work.
So with 2 mics and no rear rejection, the boards and mics should really not be planar.
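A direction-locked beam is then just a matter of freezing the steering delay when the KW fires and holding it for the rest of the command, something like this sketch (again assuming the simple GCC-PHAT estimate and integer-sample steering from the earlier example):

```python
# Sketch of a direction-locked beam: estimate the steering delay once, on the
# keyword frame, then hold it for every frame of the following command,
# instead of re-steering per frame the way a conference mic does.
import numpy as np

def gcc_phat_delay(a, b, max_delay):
    """Same GCC-PHAT helper as the delay-sum sketch above."""
    n = 2 * len(a)
    r = np.fft.rfft(a, n) * np.conj(np.fft.rfft(b, n))
    r /= np.abs(r) + 1e-12
    cc = np.fft.irfft(r, n)
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(np.abs(cc))) - max_delay

class LockedBeam:
    def __init__(self, max_delay=8):
        self.max_delay = max_delay
        self.locked_delay = None            # frozen once the KW fires

    def lock_on_keyword(self, kw_frame):
        """Estimate and freeze the steering delay from the keyword frame."""
        self.locked_delay = gcc_phat_delay(
            kw_frame[:, 0], kw_frame[:, 1], self.max_delay)

    def process(self, frame):
        """Beamform a command frame with the frozen delay, no re-steering."""
        if self.locked_delay is None:
            return frame.mean(axis=1)       # pass-through until a KW locks us
        aligned = np.roll(frame[:, 1], self.locked_delay)
        return 0.5 * (frame[:, 0] + aligned)

    def release(self):
        """Unlock at end of command so the next KW can re-steer the beam."""
        self.locked_delay = None
```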

Then we have these plastic cases, which are akin to a singer placing the mic inside the body of the guitar they are playing, and that is why 'barge-in' is often so ineffective.
Much of what people are sourcing and suggesting, from hardware to protocol, IMO doesn't make much sense; we likely could get better results if we did things in different ways and just shied away from areas with known problems, of which there are actually many.