Question: would anyone be interested in an open source DSP mic array?

Well, the Box Lite doesn’t have the AEC feedback and only uses a dual-channel ADC, an ES7243E chip that’s around $0.40 (note it’s an ADC, not a DSP).

What mics would you want? MEMS or the capsule mics used in toys and such? MEMS are useful as they are pick and place, so assembly is cheaper.

As for software, that’s not something I’m getting into at the moment. I’m hoping to get that, and feedback, from those currently developing the systems.

An ARM CPU with an entire OS on an SD card seems overkill for a wireless mic, especially when the end user will never perform a safe shutdown.

I would expect such a basic device to be sub £10 for a wireless mic module. I would still make sure GPIO was exposed, as there is a lot of demand for that from the users I have talked to. And for the cost of PCB space for an unpopulated header, why not.
Lots of people want tonnes of sensors, so this is a good compromise for near free: if the MCU has enough capacity, support can be compiled in at a later date, and it’s up to the end user to design whatever add-ons they may desire (at least PMOD is a nice enough standard to follow).

As for the use of an FPC for the mics, that will add a lot of unnecessary cost to production. If we want a device to be cheap, we need a low part count, pick and place compatible parts and a single-sided PCBA.

If memory is the main limit, then a Tesla K80 can be grabbed with 24GB GDDR5 for under £70
I.e. Nvidia Tesla K80 24GB GPU GDDR5 PCI-E GPU Accelerator 12 Month warranty | eBay

MEMS or electret (as the capsule mics are called), it doesn’t matter.

Do it but if all your work proves fruitless then at least others did say.

That we can already purchase in ready-made hardware likely for less than you have estimated.
Personally I prefer the Radxa Zero 3 as it does beat a Pi 4 for some ML.
It’s a bit more at OKdo (ROCK ZERO 3W 1GB with Wi-Fi/BLE without GPIO - OKdo) but it has the oomph to cope with Python-like implementations.
Then use a stereo USB sound card with the MAX9814, as from use I really like the analogue AGC it has https://www.aliexpress.com/item/32864107454.html and a cable.
Have as many as you wish be it Pi or MiniPC.

Like the example I sent that could be not much more than the T7 S3 – LILYGO® as one set of I2S pins is in use.

If you want to copy the onboard designs then do, as said people did say otherwise.

Though if a project is effectively a stripped down ESP32 S3 BOX Lite, that would already have support wouldn’t it?

Nice, so it’s possible to get a WiFi dual mic setup, only needing an external PSU/phone charger, for under £10? I would be interested in the breakdown of parts for that setup.

The ES7243E in the ESP32 Box Lite has AGC; I’m guessing you didn’t like it?

Assuming that the STT has been trained against an ESP32-S3 Box 3, wouldn’t copying its mic array setup be the best option to be as compatible as possible, without any retraining?

No as what they did is hack out the DSP and Espressif KWS.
As far as I am aware it got converted into a single mic 24/7 websockets broadcast.
I wouldn’t do the Lite but the full with AEC as the cost diff is minimal really.
It would just have a 3rd input to the AEC that an ESP-Squeezelite could connect to…

That is presuming you can get a small batch order done and shipped for under £10 that works…

I always wondered why Espressif dropped the AEC to a 2 channel. If I remember rightly it’s controlled by the S3 and is an ADC with several gain settings?
I have forgotten now, so have no idea, but the quad with AEC and a gain-selectable preamp would be cool. Going with the Espressif design is prob easiest. For me, not having a satellite with a mic sat on a speaker means AEC is less of an issue and the ‘Lite’ could be used.

The ESP32-S3-Box had a x2 mic board separate to the main PCB. The only strange thing is the mic spacing, which is less than I would choose, but copy it if using their blob.

I sent you a esp32-s3-box the orig and the Lite, I would experiment with what you have before you proceed.

Welp, yeah that’s a show stopper for that ADC then. The quad ADC is the better choice then

There should be at least enough to get a couple of stereo 3.5mm jacks, and pass through the signal to allow for AEC on the spare 2 analogue channels (or maybe merge into 1).

Time will tell, but it’s likely £10 a device, and something that I may be able to persuade some stockists to sell, if it’s fit for purpose & enough interest

Agreed. So far, out of the DSPs and ADCs I’ve used, the best one was the ZL38063 on an ESP32 Wrover, hence my choice to use that chip initially.
It’s truly hard to test all possible choices and options without spending a load of cash on development. I was hoping to be able to outsource some of the choices to the community’s knowledge :)

Agreed it is a bit on the close side, even looking at their own design guidelines
https://docs.espressif.com/projects/esp-sr/en/latest/esp32/audio_front_end/Espressif_Microphone_Design_Guidelines.html

There is a really good application note by Invensense all about basic beamforming

Really it’s your input (sampling) frequency that sets how much resolution you get per sample.

Speed of sound = 343,000 mm/s, divided by 48 kHz (a common sampling freq),
gives 7.14583 mm per sample, so really you want the spacing to be a multiple of that.
I have used 71.4583 mm even if it is pushing into aliasing territory.

I have forgotten the input freq Espressif use; it could be 16 kHz, which gives 21.4375 mm per sample.
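
The arithmetic above is easy to check in a few lines of Python. This is only a sketch: the 343,000 mm/s figure is the one used above, and the function name `mm_per_sample` is made up for illustration.

```python
SPEED_OF_SOUND_MM_PER_S = 343_000  # figure used above (roughly room temperature)

def mm_per_sample(sample_rate_hz):
    """Distance sound travels during one sample period, in mm."""
    return SPEED_OF_SOUND_MM_PER_S / sample_rate_hz

for rate in (48_000, 16_000):
    step = mm_per_sample(rate)
    # Show the single-sample step and a 10-sample spacing for comparison.
    print(f"{rate} Hz: {step:.5f} mm per sample, 10 samples = {10 * step:.4f} mm")
```

Whatever the real sampling rate turns out to be, a mic spacing that is a whole number of these steps means an endfire arrival lands on a whole-sample delay.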

That looks like a nice piece of reading material. It’s currently above my understanding, so I will read over it a few times until it starts to sink in. Spacing the mics at a multiple of that formula should be something I can aim towards. (I will check the sampling frequency later when I’m home, and use the formula provided to set the distance multiplier.) I’m guessing this still stands true for triangular arrays.

No I guess a triangular array is a combination of both broadside and endfire.
Delay Sum is very simple basic beamforming and just putting multiple mics on a board without the science & DSP means nothing really apart from a PCB with multiple mics.
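
As a rough illustration of how basic delay-sum is, here is a minimal pure-Python sketch with whole-sample delays only. The test tone, the 3-sample lag and the function names are assumptions for the demo, not anything taken from the Espressif or ReSpeaker code.

```python
import math

def delay_and_sum(channels, delays):
    """Average the channels after advancing each one by its integer sample delay.

    channels: list of equal-length lists of samples, one per mic.
    delays:   samples by which each channel lags the reference mic.
    """
    n = len(channels[0])
    out = []
    for i in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            j = i + d
            if 0 <= j < n:  # samples from outside the buffer count as zero
                acc += ch[j]
        out.append(acc / len(channels))
    return out

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

# Hypothetical test signal: a 1 kHz tone sampled at 16 kHz that reaches
# mic 2 three samples after mic 1 (about 64 mm of extra path).
rate, freq, lag = 16_000, 1_000, 3
mic1 = [math.sin(2 * math.pi * freq * t / rate) for t in range(256)]
mic2 = [0.0] * lag + mic1[:-lag]

steered = delay_and_sum([mic1, mic2], [0, lag])  # aligned on the source
unsteered = delay_and_sum([mic1, mic2], [0, 0])  # no alignment

print(f"steered RMS   {rms(steered):.3f}")
print(f"unsteered RMS {rms(unsteered):.3f}")
```

The steered sum keeps the tone at full level while the unaligned sum partly cancels, which is all delay-sum does; the attenuation of everything off-axis is modest.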

This has been the problem with the ReSpeaker range of Pi hats, from the quad to the 6 mic, which are practically useless due to lack of DSP and tech expertise, with no software for them other than as a multi-channel mic.

Sort of similar to Matrix Voice where they all lacked the DSP and likely the computational level to do so.
There are various algs such as GCC-PHAT that can approximate the time difference between x2 signals or multiple inputs.
Each microphone added increases sensitivity but also increases computation to 2^, and simple delay-sum only provides modest levels of attenuation, so likely more powerful and better DSP should be used.
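
For anyone who has not met GCC-PHAT, a minimal sketch of the idea follows. It uses a naive O(N^2) DFT from the standard library so it runs anywhere; a real implementation would use an FFT. The noise signal, the 5-sample lag and the helper names are assumptions for the demo.

```python
import cmath
import random

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform; fine for a short demo."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def gcc_phat_delay(ref, sig, max_lag):
    """Estimate how many samples `sig` lags `ref` (negative if it leads)."""
    n = len(ref)
    cross = [s * r.conjugate() for r, s in zip(dft(ref), dft(sig))]
    # PHAT weighting: keep only the phase of each frequency bin.
    whitened = [c / (abs(c) + 1e-12) for c in cross]
    cc = [v.real for v in dft(whitened, inverse=True)]
    # Negative lags wrap round to the end of the circular correlation.
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cc[lag % n])

# Hypothetical broadband source: 48 noise samples, zero padded to 64 so the
# 5-sample shift does not wrap around the buffer.
rng = random.Random(0)
true_lag = 5
mic1 = [rng.uniform(-1, 1) for _ in range(48)] + [0.0] * 16
mic2 = [0.0] * true_lag + mic1[:-true_lag]

print("estimated lag:", gcc_phat_delay(mic1, mic2, max_lag=10))
```

The estimated lag is what you would feed into a delay-sum stage, and this clean case ignores the noise and reverb that make the real problem hard.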

Hmm, well that’s a shame. So I take it that the main issue is that there is little to no community expertise to make those algorithms run on the ESP32 itself, or, if we go to a DSP, any open source algorithms that are capable of doing the job.

I was hoping that the respeaker had community support for 360 beamforming using the 6 mics, having them as 3 stereo channels only somewhat sucks.

I am guessing there isn’t as audio DSP is a very different arena than basic control.
You can just judge on what has been provided and direction things took and what is currently missing.

Even then, for what you are asking, 360 beamforming is really for conference speakerphones that sit centrally on a table with many speakers around them.
Smart speakers by nature of being a speaker are generally directional.

Google Home had 2 mics and went to 3, with an alg that might be mentioned here https://arxiv.org/pdf/2401.08864v1.pdf

Have a look at some teardowns, but there is much more than sprinkling a few mics and leds.

It’s quite a different ball game when looking at the specifications of what is being used in there. An entire Android-based device, with 512 MB of RAM and a dual-core Arm CPU. It’s amazing that an ESP32 can even compete, though also rather telling when Amazon, with its race to the bottom on manufacturing costs, isn’t using an Espressif chip.

Yeah, likely quite a powerful CPU, guessing maybe A75 cores, and likely because, as I have noticed with TensorFlow, multiple threads provide diminishing returns (2 threads optimal). Also I think the A75 had the Armv8.2 mat/mul instructions that make the A76 Pi 5 quite an ML monster (I still prefer the RK3588: similar price, near half the wattage) https://www.aliexpress.com/item/1005004941882246.html

An ESP32-S3 doesn’t need to compete if you’re only providing KWS and local DSP.
That is why the satellite model is such a crass attempt at cloning consumer e-waste.
You take the ‘Ears’ out of the likes of Alexa and have a single shared brain applying further function and DSP, whilst it’s impossible to compete with a maker’s build satellite and stupid to even try when it’s not needed.
The engineering in the latest Nest and Echo units is tremendous, and it’s no wonder Mycroft was such a flop.
Currently 24/7/365 broadcast from always-on mics is not a good idea for privacy or WiFi bandwidth as you start to add rooms.
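
To put a rough number on the bandwidth side, here is a back-of-envelope sketch. It assumes raw 16 kHz, 16-bit mono PCM per room with no compression or protocol overhead, which is an assumption for illustration and not a measured figure from any of the devices discussed.

```python
def stream_kbit_per_s(sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Raw PCM bitrate of one always-on mic stream, ignoring protocol overhead."""
    return sample_rate_hz * bits_per_sample * channels / 1000

per_mic = stream_kbit_per_s()
for rooms in (1, 4, 8):
    total = per_mic * rooms
    # kbit/s -> bytes/s -> bytes per day -> GB per day
    gb_per_day = total * 1000 / 8 * 86_400 / 1e9
    print(f"{rooms} room(s): {total:.0f} kbit/s, about {gb_per_day:.1f} GB per day")
```

The per-room rate is small, but it is constant and it all lands on one server, so it scales with every room you add whether anyone is speaking or not.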

You can make things easier for yourself with opensource and not stick your microphones right on top of your speakers and take a lateral look at what modern home automation should be without cloning current commercial gadgets.

You have an ESP32-S3-Box & ESP32-S3-Box-Lite and you just need to hack them so you have the x2 BSS I2S streams; for anyone with a bit more C/C++ knowledge that should be possible.
They are declared, and with C/C++ being pointers and its own memory constructs it should be so, but I am a complete noob with C/C++ who is surprised how lacking we are in this area of embedded DSP…

Really it’s dependent on someone actually guiding development and starting to collate datasets of in-use KW and command sentences with a community opt-in.
You can not bring your own to the party, as what is needed is a highly developed system of either a single hardware/dsp chain or a couple of choices that would use different models.

You can split devices into specific functions and share the central cost, and not follow Amazon’s costly mistake.
You can use less performant DSP and filters if you train an ASR to accept it.
You can even have a 2-stage KWS that stops the 24/7/365 broadcast, where the 2nd stage filters the 1st stage’s false positives.
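
The 2-stage idea can be sketched as a simple gate. Everything here is hypothetical: the clip names, the scores and the thresholds are toy values standing in for a small on-device model and a larger server-side one.

```python
# Toy scores standing in for real models: clip name -> (stage 1, stage 2).
SCORES = {
    "hey box":    (0.9, 0.95),  # real keyword
    "hey fox":    (0.7, 0.30),  # stage 1 false positive
    "tv chatter": (0.2, 0.10),  # never leaves the device
}

def stage1(clip, threshold=0.5):
    """Small always-on detector on the mic device (permissive threshold)."""
    return SCORES[clip][0] >= threshold

def stage2(clip, threshold=0.8):
    """Larger model on the shared server, run only on forwarded clips."""
    return SCORES[clip][1] >= threshold

sent = [clip for clip in SCORES if stage1(clip)]    # audio that leaves the device
accepted = [clip for clip in sent if stage2(clip)]  # wake events acted on

print("sent to server:", sent)
print("accepted:", accepted)
```

Only clips that pass the cheap first stage ever cross the network, and the second stage throws away the first stage’s mistakes before anything acts on them.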