Is there no more pi microphone array hat?

James_Pfeffer · March 17, 2024, 8:10pm

I use a Jabra 410 USB speakerphone. They are about $30 on eBay. It is plug and play, with nothing special to set up if using Docker. You will need to reboot after you plug it in the first time. You may also have to modify a Linux txt file that is noted in another post on this site. Its shape is worthwhile, as you can 3D print a cylindrical enclosure to hold a Pi Zero 2 and set the Jabra on top. I see a lot of comments about cost. I do not want companies listening in on my conversations. That’s why Rhasspy is a great alternative. I like cheap too, but what is your privacy worth? There may be cheaper speaker options available, but the Jabra works and I can spend more time developing my custom skill programs than trying to figure out some cryptic Linux setting for a device that I have no detailed documentation on. Off the soapbox…

James_Pfeffer · March 17, 2024, 9:43pm

I failed to mention that my setup uses arecord and aplay for audio recording and playback.
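For anyone wanting to script the same tools, here is a minimal sketch of driving arecord and aplay from Python. The 16 kHz mono S16_LE format, the file name and the helper names are assumptions for illustration, not the actual settings of the setup described above.

```python
import subprocess

def arecord_cmd(path, seconds=5, rate=16000, channels=1, device=None):
    """Build an arecord command line that writes a WAV file."""
    cmd = ["arecord", "-q", "-f", "S16_LE", "-r", str(rate),
           "-c", str(channels), "-d", str(seconds), "-t", "wav"]
    if device:  # e.g. "plughw:1,0"; list capture devices with `arecord -l`
        cmd += ["-D", device]
    return cmd + [path]

def aplay_cmd(path, device=None):
    """Build an aplay command line that plays a WAV file."""
    cmd = ["aplay", "-q"]
    if device:
        cmd += ["-D", device]
    return cmd + [path]

def record_then_play(path="test.wav", seconds=5, device=None):
    """Record a short clip and play it straight back (needs ALSA tools)."""
    subprocess.run(arecord_cmd(path, seconds, device=device), check=True)
    subprocess.run(aplay_cmd(path, device=device), check=True)
```

Calling `record_then_play()` on the Pi is a quick way to confirm the speakerphone is the device ALSA is actually using before wiring it into anything else.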

rolyan_trauts · March 20, 2024, 6:48am

I am an audio snob and, as with my Anker PowerConf, the speaker on a speakerphone doesn’t really cut it for a smart speaker.
I am one of those who listens to a Nest Mini or Echo Dot and can only think ‘Urgh!’
The standard Nest & Echo are a minimum for me as ‘Play some music’ is one of the rare occasions I do use a smart speaker.

We do have 2 excellent wireless audio apps: Squeezelite, as it will squeeze into an ESP32, or my own fave, the almost limitlessly configurable Snapcast.
Mounting a small amp board onto the back of a bookshelf speaker can quickly surpass Nest and Echo devices.

By using Squeezelite/Snapcast you get great zonal audio and a platform to which it is pretty easy to add a microphone.
I hacked together a 2 channel delay-sum beamformer (GitHub - StuartIanNaylor/2ch_delay_sum: 2 channel delay sum beamformer) that will extend far-field, but there is no signal to focus the beamformer on.
So it acts like a conference speakerphone and will beam to whatever is the predominant noise and be constantly shifting focus.
Methods of targeting a voice or focusing a beam are totally absent from the opensource we have, even though there is code available and solutions exist.

Speakerphones do have AEC but we also have opensource for that even if not implemented.
With a bit of lateral thought you can get huge improvements and not need AEC, simply by not mounting your mic in the same enclosure as your speaker.

A wired microphone can be small and very discreet, even a dual mic, and you might have more than just one in a room to give much better coverage.
Initially I expected ‘Ears’ to be 2x mics on an ESP32-S3 in a small flat panel approx 75mm in width that clips onto a wall or stands on a desk.
It still needs to be powered and may contain a pixel bar/ring and even have audio out.
That is where I don’t see much prob with wired mics either as even wireless network mics still need a PSU, so really little difference or advantage.

I think what an opensource smart speaker is needs some lateral thought, as with things like Squeezelite or Snapcast opensource can give Sonos-like zonal audio with a huge array of choice.
That zonal audio, where HomeAssistant can excel and beat commercial systems on functionality, quality and choice, is a huge selling point that is likely under-implemented and undersold.

For some reason, whilst trying to escape Big Data ‘Smart Speakers’, opensource has been blinkered and tries to copy individual consumer products verbatim.
I don’t even think we need a ‘Smart Speaker’, just a zonal opensource microphone system that takes apart the commercial notion of a smart speaker into its working components of wireless audio, microphone and pixel indicators to give choice.
If someone wants to build that in a box, they can, but a modern room could be a very different scenario with a single large pixel indicator, wireless room audio and several dispersed microphones.

I guess it’s where you want to go, but HA in terms of smart controls and dashboards represents some of the most cutting-edge smart home tech, whilst USB speakerphones hanging out of mini computers are not much above the Raspberry Pi Google AIY Voice Kit…

There is also other commercial equipment that can fuse functions to become more cost-effective as a device; my Anker C300 dual mic webcam is great for audio and its far-field isn’t bad.
It can provide Frigate video, be a wireless room mic array, and be used in conjunction with a wireless audio system.

I also tried some of the bi-directional audio pan-tilt cams you can get but found on the ones I got the mic audio is pretty awful.

You can still use the Respeaker 2mic, which eventually always catches up to the latest release. Again though, some lateral thought: USB devices are likely far more compatible and convenient than Pi hats with their limited application.

There still are Pi microphone hats, but really only the 2mic is of any use, whereas there is a whole range of USB devices that work on many platforms, which prob could do with an HA mic for those who don’t want to build one.

The hardware and software all exist; it’s just not implemented as a system and so ends up excluded.

JanWolf · March 20, 2024, 10:48am

Hi Stuart, sorry for that, but could you be more concise, make your answer more understandable, and only answer the question? If you want to share your work and your vision with everybody, that’s fine, but I suggest you open a specific thread. We feel that you have expertise on some subjects, but it should not be spread around all the subjects.
I think also that it’s useless to make the mail from Paulus public. It does not help the project. It’s something between you and Paulus and Michael.
Best regards.

rolyan_trauts · March 20, 2024, 10:58am

Nope as that was it and is it.
Doubt I will be contributing much more and would prefer not to be tagged in future conv.

https://www.kickstarter.com/projects/smartaudio/ankerwork-s600-all-in-one-speakerphone-0

synesthesiam · March 25, 2024, 12:34am

No worries, Don
I share the same thoughts about Stuart (not tagging him as he asked not to be). In fact, as I delve deeper into building voice satellite hardware for Nabu Casa, I keep coming across posts from Stuart either here or across Github. It’s becoming clear that I wasn’t ready to hear many of his ideas; I’m still trying to catch up.

That’s not to say I 100% agree with all of the negative things that were said, and I believe it’s that negativity that pushed me and others away. Different people (and organizations) have different priorities and different strengths. Despite being in the voice space for a few years now, I only have a rudimentary understanding of audio engineering. But we’ve made a lot of progress around Rhasspy and other projects despite not having audio expertise, though I do think we could’ve made it easier on ourselves with the right knowledge (but when is that not true?)

Regarding the topic of this thread, we are discussing satellite hardware over here. While that discussion is focused on an ESP-based satellite, I believe very similar hardware could make a great Pi HAT or USB mic. Specifically, the XMOS XU316 with two microphones, audio out (echo cancelled), and some LEDs.

If anyone knows enough circuit design to build a prototype of such a thing, I’d love to chat with you

rolyan_trauts · March 25, 2024, 7:48pm

@synesthesiam

You have to be wary of any hardware or closed software blobs that you cannot run data through faster than realtime, as with the size of the datasets that means some processing time.
You need to be able to preprocess the datasets of use all the way up the chain with the algs and hardware of use, which is easy enough with opensource software that you just run on a fast computer.

The SpeechBrain SepFormer is an example: for it not to work worse than normal, they had to fine-tune Whisper with a SepFormer dataset.

You were involved when Mycroft finally created an XMOS Pi hat that was 2 mic. Surely anyone with a Mycroft Mark II has one, and I am surprised if you don’t, being a dev at the time.

synesthesiam · March 25, 2024, 8:15pm

I do have a few Mark II’s with the SJ201 daughter board (containing an XMOS XVF3510). It works pretty well for audio, but there are a few things I would want to change. The non-right-angle header and 12V power plug alone will probably put a lot of people off

rolyan_trauts · March 26, 2024, 10:39am

Again confused as why would you pick a microcontroller for a maker product that has near zero maker community?
Without research I presume XMOS do provide licensed software that is likely the same as what is baked into their XVF3510.
Also from memory, and this is how long I have been saying it, using the KWS to lock the beamformer for the command sentence was still missing.
It’s still the same, as it’s buying in knowledge because the community lacks the DSP/ML skills to create the essential initial audio processing for a room voice microphone.

The dumbest thing on the SJ201 daughter board was to hard-solder the microphones, making placement and isolation a near impossible task.
On the ESP32-S3-Box, by contrast, the microphones are on a small PCB that connects by an FPC (like Pi cam ribbons).
12V is likely a good input voltage, as it is a good source for an audio amplifier that is less toy-like, and a DC to 5V step-down is a very common component circuit.

For testing those problems do not matter and maybe some empirical data could be provided.
The AEC on those is pretty good as it is non-linear, but that doesn’t cover the noise from common media such as TV, music, radio…

The above is the problem, as you have that currently with the esp32-s3-box, and the results you are getting are due to the lack of algs and DSP in the community and of anybody who is capable of steering it.

This is where my head explodes in absolute confusion, as you do have a microcontroller that is capable and that does have a community, but unfortunately the skills of that community are limited.
There is no problem running a KWS on the esp32-s3, but the chosen KWS uses a closed source blob provided by Google and uses layers that are not supported, and it is likely the same for any microcontroller.
The esp32-s3 can very well run a KWS, and I have said many times that it could likely be a CNN, DS-CNN, BC-ResNet or maybe a CRNN, all documented in detail at google-research/kws_streaming/README.md at master · google-research/google-research · GitHub, with a training API that also includes tf4micro.

Unfortunately the gimmick of custom KW was sold as a key feature, when that is something even big data cannot afford, as the KW they dictate is down to the datasets they hold.
What dscripka did is exceptionally good for quickly getting a KW into operation, but sadly no guidelines were given and no option was in place to collect correct KW, with an option to forward and send them as opensource data.
openWakeWord (GitHub - dscripka/openWakeWord: An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity) is brilliant for that, but is a rather fat KWS for many microcontrollers and is actually less accurate than many with dedicated KW datasets, as mentioned above.

Swapping to another microcontroller because you lack the tech and steering skills to create a solution is still going to be the same. In fact even worse as the community for support or dev is even more sparse than the current esp32-s3.

The esp32-s3 is an Espressif technology demonstrator where they give a framework, software blobs and working hardware, backed with a GitHub of PDF circuit diagrams, bill of materials and even the PCB Gerber files.
Espressif does know enough to design a circuit, and has, and all that info has been available to you for some time…

rolyan_trauts · March 26, 2024, 11:17am

Likely abandoning optimised C-programmed microcontrollers and throwing the Victorian engineering of application SoCs such as the Pi and others at it will allow the use of hobbyist Python, with much completed permissive opensource ready to be rebranded and claimed as your own. Also the algs used can be run on faster computers for dataset preprocessing.

The Respeaker 2 mic is still available, and likely an enclosure for a stereo mic, ready to plug in, would be beneficial to the community.
Speed of sound in dry air at 20 °C = 343 m/s
343,000 mm/s / 48,000 Hz (48 kHz, a common max sample rate) = 7.1458333 mm per sample
So 2 mics centred 71.458mm apart (10 samples of travel) is likely a good choice with current ADCs before aliasing starts to be a problem.
Speex does have AEC and Pis have pixel rings galore… also openWakeWord fits and runs.
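The spacing arithmetic above can be checked in a few lines. This is only a sketch of the same sum; the function names are made up for illustration, and the figures are the 343 m/s and 48 kHz quoted in the post.

```python
SPEED_OF_SOUND_MM_S = 343_000   # dry air at 20 °C, as quoted above
SAMPLE_RATE_HZ = 48_000         # 48 kHz sample rate

def mm_per_sample(rate_hz=SAMPLE_RATE_HZ, c_mm_s=SPEED_OF_SOUND_MM_S):
    """Distance sound travels during one sample period, in mm."""
    return c_mm_s / rate_hz

def max_delay_samples(spacing_mm, rate_hz=SAMPLE_RATE_HZ):
    """Largest inter-mic delay, for sound arriving along the mic axis."""
    return spacing_mm / mm_per_sample(rate_hz)

print(round(mm_per_sample(), 4))            # 7.1458
print(round(max_delay_samples(71.458), 2))  # 10.0
```

Changing `rate_hz` shows how the resolution changes with sample rate: at 16 kHz one sample corresponds to about 21.4 mm of travel.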

Also a tip: the OKdo ROCK ZERO 3W 1GB with Wi-Fi/BLE without GPIO is a Cortex-A55 that for ML can often outperform the A72 of the Pi 4, for £18.60.
Maybe a stereo mic or USB stereo mic, as USB 2.0 cables can be fairly long and multiples can be used, unlike a hat. Or plug an enclosed mic into an available 2 mic soundcard (Plugable or Axagon).

Plugable uses http://www.xiryus.com/wp-content/uploads/2021/06/SSS1629A5-SPEC.pdf which, I am not sure, but could be exactly the same silicon, just a relabelling of the C-Media CM6533 https://www.cmedia.com.tw/support/download_center?type=477_datasheet

Anything from a Pi to a mini PC can use USB, with no driver needed.

Andrea Electronics (Array Microphone - USB-SA Array Microphone) have been doing one at fairly silly prices for some time, as all it is, is a stereo mic and a stereo mic USB sound card. The above 2 chipsets are from the Plugable and Axagon stereo mic USB sound cards.

Alextrical · March 26, 2024, 1:03pm

I would also be happy to make a PCB compatible with the Pi, either via the GPIO header or USB. If I play my cards right, it may be possible to make a device that works with an RPi or has a second PCB added to allow an ESP32-S3 to be connected instead.

Alextrical · March 26, 2024, 1:44pm

In your opinion, what would be an ideal setup for a satellite speaker, be it RPi, ESP32 or a conference speaker on a 150ft long USB cable? It would be great to develop the next step for the open source assistant community.

I think most consumers want a single point satellite, with only a power lead to it. We have seen the Google Home and Amazon Echos make an impact on the market, so I’m assuming people want something with similar capabilities, without breaking the bank on hardware costs (£30 to £100 is a good price range).

rolyan_trauts · March 26, 2024, 2:20pm

‘Satellite speaker’ doesn’t really compute with me, as what they are is just consumer versions where they have bunged wireless audio, a beamforming mic and a pixel ring into a plastic package.

Wireless audio, beamforming mic and pixel ring are all individual elements, and there is no need for the term satellite, as it’s purely an enclosure where by choice you might put them together.
A room might have a single pixel indicator that is much larger than a consumer unit. It may have dedicated wireless audio, as opensource already has 2 excellent wireless zonal audio systems in Squeezelite and Snapcast.
A room for coverage may have several mics to provide total room coverage.

A satellite only exists because Google & Amazon and the like have created them and they like selling multiples of them.
This satellite thing is nonsense to me. The only thing we struggle for in opensource is a zonal wireless microphone system for the input, as for the output we already have zonal wireless audio systems (Squeezelite & Snapcast, and they are pretty damn great).

I had a hunch we could have a websockets server connecting all wireless KWS microphones and selecting the highest KW hit sensitivity as the stream to use for that zone.
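A minimal sketch of that selection step, leaving the websockets transport out entirely. The `KwHit` record, its field names and the 0..1 score range are assumptions for illustration; the only idea taken from the post is "highest KW hit wins the zone".

```python
from dataclasses import dataclass

@dataclass
class KwHit:
    mic_id: str    # hypothetical identifier for one KWS microphone
    zone: str      # room or zone the microphone belongs to
    score: float   # KWS confidence reported with the hit (assumed 0..1)

def pick_streams(hits):
    """Return {zone: mic_id} for the highest-scoring hit in each zone."""
    best = {}
    for hit in hits:
        current = best.get(hit.zone)
        if current is None or hit.score > current.score:
            best[hit.zone] = hit
    return {zone: hit.mic_id for zone, hit in best.items()}
```

For example, if `lounge-wall` reports 0.91 and `lounge-desk` reports 0.74 for the same utterance, only `lounge-wall` would be asked to stream the command sentence for the lounge zone.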

Consumers have no alternative to single point satellites, because that is what fits the business model of the likes of Google & Amazon.
Opensource should be like HomeAssistant, where it doesn’t define and confine what users should have. It creates choice, and there is no such thing as a satellite; it’s just that in that scenario you have chosen to sit your mics on top of a very small speaker in a plastic box, and if that floats your boat then go with it.

Others may take some audiophile-grade speakers of the latest and greatest design and wire them up to a wireless audio system that is much more than just a ‘satellite’.
A room might have a bespoke central indicator, or maybe its just a screen that auto-changes channel on use.

Corporates want users to have consumer electronic devices because that is what creates revenue for them, and they have been dumping at a loss to create a moat for their specific services.
Consumers may want single point satellites, but users of the latest and greatest in home control, audio and visual may want something like HA that is open and doesn’t define and confine to a single satellite.

We already have zonal wireless audio and pixel rings galore, and I think what we are missing are ‘zonal KWS’ devices, where it’s your choice how many are in a room and whether they are in a single box oddly called a satellite.

rolyan_trauts · March 26, 2024, 2:37pm

Talking electrical with you, we just need a decent analogue AGC like on the https://www.aliexpress.com/item/1005006233720383.html
Maybe 2x of them in a ready-made enclosure; the line-out signal can run fairly long lengths, and it is no different to a satellite needing a power cable.

I suggest those as they are cheap and easily interface to $10 stereo ADC USB soundcards such as the Plugable USB Audio Adapter (Plugable Technologies).

Likely slightly better than the Respeaker 2mic, but that is still very valid and all-in-one for $10, especially if it didn’t have a right-angle connector and the mics actually pointed at you rather than the ceiling, but you can always wall mount it rather than put it on a table.

The bits we are missing are the algs as Google has had voiceprint voice filter tech for quite some time.
The latest and greatest in binaural source separation and voiceprint speech enhancement has seen a plethora of papers submitted recently.

I do have my delay-sum hack code (GitHub - StuartIanNaylor/2ch_delay_sum: 2 channel delay sum beamformer) and was always hoping a proper coder might help NEON-optimise the FFT routines.
There are also various filters on github that with a smattering of Rust or C/C++ could likely work on the Pi3 and above sort of level.

It’s the software we are missing, as it’s not been implemented, and the blinkered focus on a satellite with mics on top of a speaker is actually the worst scenario, unless very well engineered like the consumer products we see, and that is far beyond simply 3D printing an enclosure as some have done previously.

Beamforming, AEC, filters and stereo ADCs, from the 2 mic Respeaker to various soundcards, have always been available, as has the KWS.
We don’t have any good datasets and that is where the Big Guys have killed opensource.
We should have been collecting them a long time back and for some reason that obvious necessity has been ignored.

There is no ideal setup and we are short of quality building bricks (datasets) to build upon what we have.

Alextrical · March 26, 2024, 4:19pm

True, but a single central device works well for consumers, as it takes up little space and combines multiple services. Sticking a mic on top of a speaker isn’t logically a good idea, but it’s the most compact way that consumers are happy with.

That could work, though it would require the consumer either to have a rat’s nest of wires coming from a central point, or to have multiple wireless nodes, each streaming a microphone signal back, and then require the end user to map the layout of the mics to the closest speaker for that zone. All of that adds complexity for the end user, who most likely wants something plug and play.

Indeed, there is no limit on how much someone can spend, but there is a reason the Echos and Google Home have been adopted so broadly; I think it is because they are a good enough unit for most users for playing music.

Of course, everything the corporations do is done for the most profit/recurring revenue, with the minimal work. The Amazon Echo was intended to sell Amazon products and music to the masses; that’s why it’s sold at a loss and heavily locked down so we can’t stream local music. Though as they can see that’s not as profitable as expected, they are dropping services and quality. Most users here have likely come from those ecosystems, due to the reduction in quality of service, and I’m guessing most of them would like a device that can deliver a similar service to what the commercial smart speaker could do in its heyday.

Is there a particular good-looking project that incorporates both audio pickup and 5W audio playback in a single package, that works well for the community projects?

True, but there is only so much the community can do. Most of us are doing this for the fun of it in the evenings, whereas the big corps are doing it with a mountain of cash backing them for potential future revenue streams. I’m not sure if it’s so much ignored, or if the community didn’t have the skills available, or the interest in the non-low-hanging fruit.

As for the hardware, I’m trying to get my head around what you are suggesting. Would you have, for example, in a living room a single microphone in each corner of the room and a central speaker for rendering the feedback? I’m guessing most consumers wouldn’t be happy doing this. Have a look at conference room audio setups: I’ve seen very few of them with remote microphones. Most of them have a combined speaker and microphone in a single unit; even on a large desk (8m) they seem to decide that it’s not worth the mess of wires.

rolyan_trauts · March 26, 2024, 4:57pm

Consumers are happy with the highly developed and engineered housings that the likes of Google and Amazon can push out with economies of scale.
The point is, when a maker sticks a mic on a speaker in a plastic box, irrespective of AEC, that is all it is, and nope, so far it has made no maker that happy. RIP Mycroft.

No, we are talking mainly wireless for coverage, and this is what the esp32-s3-box is currently doing, streaming 24/7 room audio.
You can have multiple wired microphones on a wireless node if that helps with coverage; again, it’s choice.
We don’t actually have a system that does far-field well, and you don’t have to if you just add more mics to gain coverage.
Some rooms are of a size that a single microphone will never do.
The device I am suggesting is a ‘KWS mic’ that only broadcasts on KW hit and only the device with the highest sensitivity hit is streamed until end of command sentence. How many you have total or in a room is choice.

Exactly, it’s about choice, as I have a $10 amp on the back of a bookshelf speaker I already had (I have 2 actually, as true stereo sounds so much better), or I could put it in a box and call it a satellite.
The point is, as a user I am not defined and confined to only a toy-like box for my audio. I don’t have the engineering skills that Google & Amazon have, or the economies of scale. So from cost-effective 2nd user speakers to the latest and greatest I can have choice, but would be unlikely to be able to manufacture the likes of a Nest or Echo speaker.

There are quite a few such as GitHub - sonocotta/esp32-audio-dock: Audio docks for ESP32 mini (ESP32, ESP32C3, ESP32S2 and ESP8266 mini modules from Wemos)
I am not a fan, and again am against bundling everything as a single option.
There are many Squeezelite boards, but I prefer a line-out to give choice over the amplifier and speaker I use. Also, unlike consumer satellites, I can upgrade the amp or speaker without needing to change the satellite.
Also I have a preference for Snapcast, as it has much finer sync timing, is multi-channel and could also be your TV’s surround sound. Also it can be any device, with a choice of any soundcard or DAC for audio out.
Again, it’s purely about choice: not defining and confining to a specific role, and letting users decide on what and how they will use it.

I am not suggesting anything; I am saying it’s choice and users can have whatever setup they wish.
It’s you who is defining a single satellite, confining to that definition, and repeating this strange conception that it can only be so.

synesthesiam · March 26, 2024, 9:24pm

I think you might be confusing openWakeWord (which runs on the Pi) with microWakeWord which runs on the ESP32-S3. In fact, microWakeWord is based on the inception model from the Google KWS streaming repo

I like this idea, and it could actually be done today with a regular ESP32 (not the S3). Do you think beamforming is less important if you have a handful of these KWS mics in a single room?

I saw your code for the delay sum beamformer, and am interested to talk more about it. I wonder if it could run on the ESP32 using the esp-dsp library for FFT. Looks like there’s some potential for a NEON-optimized version using kissfft.

I think this would be awesome. I could imagine even a single device where you plug it into power and it acts as an ESP satellite, but when you plug it into a Pi via USB it acts as a USB mic/speaker combo.

rolyan_trauts · March 27, 2024, 11:15am

No, I am not; microWakeWord is new, the repo is only a month old, and I had switched off way before that.
I don’t know Inception or how it’s being run (streaming or not), but it likely still suffers from having very limited datasets (allow users to submit, and promote certain KW).
I think you will find they are using keras-applications/keras_applications/inception_v3.py at master · keras-team/keras-applications · GitHub rather than the GitHub - google-research/google-research: Google Research
A micro KWS could have happened a long time ago if it wasn’t for the lack of datasets. Synthetic datasets work but are a lesser substitute for actual real datasets. If we actually started collating KW with an opt-in, we would have so much choice of what KWS we could run.

I think both is likely the best, but yeah, we can likely bypass technology through the simple hard physics of location and multiple locations.
Using the same KWS and KW hit argmax to pick the best stream.
(All ‘ears’ would connect with a small buffer delay and you just drop connection to the ones not needed).
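One way to read the "small buffer delay" is as a short arbitration window: wait briefly after the first KW hit so slower ‘ears’ can report, then keep the best one and drop the rest. The sketch below assumes that reading; the class name, the 0.25 s window and the caller-supplied timestamps are all hypothetical.

```python
class ZoneArbiter:
    """Collect KW hits for a short window, then keep only the best mic."""

    def __init__(self, window_s=0.25):
        self.window_s = window_s   # assumed buffer delay, in seconds
        self.first_hit_at = None
        self.hits = {}             # mic_id -> best score seen this window

    def add_hit(self, mic_id, score, now):
        if self.first_hit_at is None:
            self.first_hit_at = now
        self.hits[mic_id] = max(score, self.hits.get(mic_id, 0.0))

    def decide(self, now):
        """Return (winner, losers) once the window has closed, else None."""
        if self.first_hit_at is None or now - self.first_hit_at < self.window_s:
            return None
        winner = max(self.hits, key=self.hits.get)
        losers = sorted(m for m in self.hits if m != winner)
        self.first_hit_at, self.hits = None, {}   # ready for the next KW
        return winner, losers
```

The server would keep streaming from `winner` until the end of the command sentence and close the connections listed in `losers`.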

The delay-sum just hacked PortAudio and a circular buffer onto Robin’s code, and there are many FFT libs that are NEON-optimised; kissfft or pocketfft could likely speed things up and lighten the load, even if it is not that heavy already.
It just uses GCC-PHAT to get the time delay between the mics and then sums the 1st mic with that sample offset.
It’s a very simple beamformer, but because it’s a simple sum it will likely have no signature or artefacts that could confuse an ASR such as Whisper.
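For readers who have not met GCC-PHAT, here is a rough NumPy sketch of the two steps described above: estimate the inter-mic delay, then shift one channel by that many samples and sum. It is not the 2ch_delay_sum code (that runs on streaming audio with a circular buffer); it works on whole arrays, and the function names are invented for illustration.

```python
import numpy as np

def gcc_phat_delay(sig, ref, max_shift):
    """Estimate how many samples `sig` lags `ref` using GCC-PHAT."""
    n = len(sig) + len(ref)            # zero-pad so the correlation is linear
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12     # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so index 0 is lag -max_shift and the last index is +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(cc)) - max_shift

def delay_sum(mic_a, mic_b, max_shift):
    """Align mic_b to mic_a by the estimated delay and average the two."""
    d = gcc_phat_delay(mic_b, mic_a, max_shift)
    # np.roll wraps at the block edges; a streaming version would take the
    # shifted samples from a circular buffer instead.
    aligned_b = np.roll(mic_b, -d)
    return 0.5 * (mic_a + aligned_b), d
```

`max_shift` only needs to cover the physical mic spacing, so it is a handful of samples for mics a few centimetres apart.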

Maybe, but without algs to run, more mics means absolutely nothing.
We do have the Respeaker 2mic & at least 2 stereo mic USB soundcards.
We just don’t have any source separation algs, just the simple beamformer I did.
Without them it’s just yet another multi-mic recording device with no real advantage.

Use openWakeWord and promote 3 KWs; microWakeWord uses Okay Nabu, Hey Jarvis & Alexa.
Alexa, because it’s another’s KW, is likely a bad idea; HomeAssistant, Hey HA … maybe.
Add the code to capture the KW and package them and maybe even go back to the days when I was shouting for a Mic word prompter to record datasets.
If done raw on 2 mic or stereo usb even better.
Provide metadata, as we don’t need names or addresses, just native speakers and the region/country they would associate with.
Common Voice is near useless, as it has near no metadata and has more non-native English speakers than native ones, and there is a huge spectral difference.
Age, Gender are also great we just don’t need identity data.
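As a sketch of what such a metadata record could look like, here is one possible shape written as a Python dataclass and serialised to one JSON line per clip. Every field name is a made-up assumption; the only points taken from the post are the kinds of data wanted (native speaker, region/country, age, gender, raw 2-channel capture) and that no identity data is stored.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class KwSample:
    keyword: str          # e.g. "okay nabu"
    wav_file: str         # relative path to the raw capture
    channels: int         # 2 for a raw stereo / 2-mic recording
    native_speaker: bool
    region: str           # region/country the speaker associates with
    age_band: str         # coarse band rather than an exact age
    gender: str
    # Deliberately no name, address or account fields.

def to_json_line(sample):
    """Serialise one sample as a single JSON line for a dataset manifest."""
    return json.dumps(asdict(sample), sort_keys=True)
```

A manifest of such lines next to the WAV files would be enough to filter a dataset by region or by native speakers when training a KWS.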

As for creating PCBs, again, without the algs/software to run more advanced source separation and beamforming it’s prob pointless.

Espressif have already done the circuits of the Esp32-S3-box version.
What would be cool is to hack off all the unnecessary parts of the technology demonstrator, as it does have beamforming algs.
Use the 4 channel ADC they use and on the same PCB have another ESP32 running squeezelite where the DAC lineout connects to the AEC ref of the ESP32-S3 ADC 3rd channel.
The rest, the screen and all the other bumf, get rid of, so it’s purely a far-field mic with dual micro-KWS and attached zonal wireless audio.

Alextrical · March 27, 2024, 12:14pm

I entirely agree with that; the dev kit is just that, for development and POC. It’s not ideal for the majority of users as a smart speaker.

We could use the ESP32-Korvo v1.1 as a base design, dedicating the ES8311 to one ESP32 and the ES7210 to the ESP32-S3. I’m guessing the dual ESP32 approach would allow for better handling of services, but would that reduce some of the user experience at initial setup, as I’m guessing both ESP32s would have to be configured independently (unless we could get them to sync WiFi setup to each other via UART)?

Screen definitely isn’t needed
Battery charge circuit isn’t either
Neo pixel ring would be useful
Hardware Mute would be wanted by a fair few
5-20W mono amp would be nice to make a contained smart speaker

Have a look at the project I’ve been working on over on GitHub.
It is currently based around the schematics for the ESP32-LyraTD-MSC and the updates of the ESP32-S3-BOX-3, using the ZL38063 DSP at its core. I had been avoiding the ES7210 & ES8311 combo, as I believed that was putting extra workload on the ESP32-S3 that could have been handled externally on the DSP (AEC, NS, BF, AGC functions), which has support in the ESP-ADF. Though the closed source nature and inaccessibility of the tuner software to the public is less than ideal.

I suppose the question I should be asking is (as I’m considering changing the project trajectory), do you think that the ESP32-S3 with a ES7210 ADC has a mature enough software process to not need a dedicated DSP that can handle the AEC, NS, BF, AGC functions?

rolyan_trauts · March 27, 2024, 1:13pm

I don’t know for sure. You are in the UK, are you not?
Msg me your address and I will send you the orig and lite versions of the esp32-s3-box; I haven’t got the 3rd version.

I am hoping the BSS (Blind Source Separation) software is enough. I have a hunch that what Espressif is doing is using BSS to split sources and then running KWS on each split stream to select the working stream.
That is the catch with BSS, as it will split audio sources very well, but which channel they end up in is near random.
Likely if you jettison all that is unnecessary then there is room for 2x fatter KWS to select the command sentence stream for websockets.

The esp32-s3 with its vector instructions is almost 10x faster than the standard esp32, so the beamforming code should also port over, whilst swapping PortAudio for the Espressif I2S streams.

Also, likely we don’t need a DAC on the esp32-s3, and the Squeezelite can be a separate PCB that is just connected from its DAC output to the esp32-s3 ADC 3rd input.

Really all that is needed is the 2 mics, ADC & S3 on a tiny PCB, with the 2 mics on the same FPC ribbon. I have forgotten what dimensions and sample rate the audio comes in at on the box design, but the details are all there.

Likely it needs some sort of fusion service to commission a small batch run, but what you can do is hack a box to do the same with bespoke firmware.

If anyone can hack the Box and get the stereo audio stream offboard as samples, it’s likely that from a listen you could tell how effective it could be.
The AEC is likely very sensitive to any latency, with a small tail due to its memory constraints, and I cannot remember if that is hardcoded in the blob or can be extended.