Is there no longer a Pi microphone array HAT?

Again, I'm confused as to why you would pick a microcontroller for a maker product when it has a near-zero maker community.
Without researching it, I presume XMOS do provide licensed software, likely the same as what is baked into their XVF3510.
Also, from memory (and this is how long I have been saying it), using the KWS to lock the beamform onto the command sentence was still missing.
It's still the same story: it's buying in knowledge, because the community lacks the DSP/ML skills to create the essential initial audio processing for a room voice microphone.

The dumbest thing on the SJ201 daughterboard was hard-soldering the microphones, making placement and isolation a near-impossible task.
On the ESP32-S3-Box, by contrast, the microphones are on a small PCB that connects by an FPC (like Pi Cam ribbons).
12 V is likely a good input voltage, as it is a good source for an audio amplifier that is less toy-like, and a DC 12 V-to-5 V step-down is a very common circuit.

For testing, those problems do not matter, and maybe some empirical data could be provided.
The AEC on those is pretty good as it is non-linear, but that doesn't cover the noise from common media such as TV, music, radio…

The above is the problem: you have that currently with the ESP32-S3-Box, and you can see it in the results you are getting, due to the lack of algorithms and DSP skills in the community and of anybody capable of steering it.

This is where my head explodes in absolute confusion, as you do have a capable microcontroller that does have a community, but unfortunately the skills of that community are limited.
There is no problem running a KWS on the ESP32-S3, but the chosen KWS uses a closed-source blob provided by Google and uses layers that are not supported, and it is likely the same for any microcontroller.
The ESP32-S3 can very well run a KWS, and I have said many times that a CNN, DS-CNN, BC-ResNet and maybe a CRNN would likely all work. They are documented in detail at google-research/kws_streaming/README.md at master · google-research/google-research · GitHub, with a training API that also targets TensorFlow Lite for Microcontrollers.

Unfortunately the gimmick of custom keywords was sold as a key feature, which is something even big data cannot afford; the reason they dictate the KW is the datasets they hold.
What dscripka did is exceptionally good for quickly getting a KW in operation, but sadly no guidelines were given, and no option was in place to collect correct KW samples and forward them as open-source data.
GitHub - dscripka/openWakeWord: An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity. is brilliant for that, but it is a rather fat KWS for many microcontrollers and actually less accurate than many with the dedicated KW datasets mentioned above.

Swapping to another microcontroller because you lack the tech and steering skills to create a solution is still going to be the same. In fact it is even worse, as the community for support or dev is even sparser than for the current ESP32-S3.

The ESP32-S3-Box is an Espressif technology demonstrator where they give you a framework, software blobs and working hardware, backed with a GitHub repo of PDF circuit diagrams, a bill of materials and even the PCB Gerber files.
Espressif certainly knows how to design a circuit, and all that info has been available to you for some time…

Likely abandoning optimised C-programmed microcontrollers and throwing the "Victorian engineering" of application SoCs such as the Pi and others at the problem will allow hobbyist Python to be used, with much completed permissive open source ready to be rebranded and claimed as one's own. The algorithms used can also be run on faster computers for dataset preprocessing.

The ReSpeaker 2-Mic is still available, and an enclosure for a stereo mic ready to plug in would likely be beneficial to the community.
Speed of sound in dry air at 20 °C = 343 m/s.
343,000 mm/s ÷ 48,000 (48 kHz is a common max sample rate) = 7.1458 mm per sample.
So 2 mics spaced 71.458 mm apart (10 samples) is likely a good choice with current ADCs before aliasing starts to be a problem.
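The arithmetic above as a couple of lines of Python (the 71.458 mm figure is simply ten sample periods of sound travel):

```python
SPEED_OF_SOUND = 343_000.0   # mm/s, dry air at 20 degrees C
SAMPLE_RATE = 48_000         # Hz, a common max ADC sample rate

# Distance sound travels in one sample period, then a 10-sample spacing.
mm_per_sample = SPEED_OF_SOUND / SAMPLE_RATE   # about 7.1458 mm
spacing_mm = mm_per_sample * 10                # about 71.46 mm
```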
Speex does have AEC, Pis have pixel rings galore, and openWakeWord also fits and runs.

Also a tip: the OKdo ROCK ZERO 3W 1GB with Wi-Fi/BLE without GPIO - OKdo is a Cortex-A55 that for ML can often outperform the A72 of the Pi 4, for £18.60.
Maybe a stereo mic or USB stereo mic, as USB 2.0 cables can be fairly long and multiples can be used, unlike a HAT. Or plug an enclosed mic into an available 2-mic soundcard (Plugable or Axagon).

Plugable uses http://www.xiryus.com/wp-content/uploads/2021/06/SSS1629A5-SPEC.pdf which, I am not sure, but could be exactly the same silicon relabelled as the C-Media CM6533 https://www.cmedia.com.tw/support/download_center?type=477_datasheet

Anything from a Pi to a mini PC can use USB, with no driver needed.

Andrea Electronics have been selling the Array Microphone - USB-SA at fairly silly prices for some time, as all it is, is a stereo mic plus a stereo-mic USB sound card. The above two chipsets are from the Plugable stereo mic and Axagon USB sound cards.

I would also be happy to make a PCB compatible with the Pi, either via the GPIO header or USB. If I play my cards right, it may be possible to make a device that works with a RPi, or has a second PCB added to allow the ESP32-S3 to be connected instead.


In your opinion, what would be an ideal setup for a satellite speaker, be it a RPi, an ESP32, or a conference speaker on a 150 ft USB cable? It would be great to develop the next step for the open-source assistant community.

I think most consumers want a single-point satellite, with only a power lead to it, and we have seen the Google Home and Amazon Echos make an impact on the market. I'm assuming people want something similar to those capabilities, without breaking the bank on hardware costs (£30 to £100 is a good price range).

"Satellite speaker" doesn't really compute with me, as what they are is just consumer devices where wireless audio, a beamforming mic and a pixel ring have been bunged into a plastic package.

Wireless audio, a beamforming mic and a pixel ring are all individual elements, and there is no need for the term "satellite"; it's purely an enclosure where, by choice, you might put them together.
A room might have a single pixel indicator that is much larger than a consumer unit's. It may have dedicated wireless audio, as open source already has two excellent zonal wireless audio systems in Squeezelite and Snapcast.
A room may have several mics to provide total room coverage.

A satellite only exists because the likes of Google & Amazon have created them, and they like selling multiples of them.
This satellite thing is nonsense to me. The only thing we are short of in open source is a zonal wireless microphone system for the input; for the output we already have zonal wireless audio systems (Squeezelite & Snapcast, and they are pretty damn great).

I had a hunch we could have a WebSocket server connecting all the wireless KWS microphones and selecting the one with the highest KW hit confidence as the stream to use for that zone.

Consumers have no alternative to single-point satellites, because that is what fits the business model of the likes of Google & Amazon.
Open source should be like Home Assistant, where it doesn't define and confine what users should have. It creates choice, and there is no such thing as a satellite; it's just that in that scenario you have chosen to sit your mics on top of a very small speaker in a plastic box, and if that floats your boat then go with it.

Others may take some audiophile-grade speakers of the latest and greatest design and wire them up to a wireless audio system that is much more than just a "satellite".
A room might have a bespoke central indicator, or maybe it's just a screen that auto-changes channel on use.

Corporates want users to have consumer electronic devices because that is what creates them revenue, and they have been dumping devices at a loss to create a moat for their specific services.
Consumers may want single-point satellites, but users of the latest and greatest in home control, audio and visual may want something like HA that is open and doesn't define and confine them to a single satellite.

We already have zonal wireless audio and pixel rings galore; what I think we are missing are "zonal KWS" devices, where it's your choice how many are in a room and whether they are in a single box oddly called a satellite.

Talking electronics with you: we just need a decent analogue AGC like the one on https://www.aliexpress.com/item/1005006233720383.html
Maybe 2x of them in a ready-made enclosure; the line-out signal can run fairly long lengths, which is no different to a satellite needing a power cable.

I favour those as they are cheap and interface easily to $10 stereo-ADC USB soundcards such as the Plugable USB Audio Adapter – Plugable Technologies

Likely slightly better than the ReSpeaker 2-Mic, but that is still very valid and all-in-one for $10, especially if it didn't have a right-angle connector and the mics actually pointed at you rather than the ceiling; but you can always wall-mount it rather than put it on a table.

The bits we are missing are the algorithms, as Google has had voiceprint "VoiceFilter" tech for quite some time.
The latest and greatest in binaural source separation and voiceprint speech enhancement has seen a plethora of papers submitted recently.

I do have my delay-sum hack code GitHub - StuartIanNaylor/2ch_delay_sum: 2 channel delay sum beamformer and I was always hoping a proper coder might help NEON-optimise the FFT routines.
There are also various filters on GitHub that, with a smattering of Rust or C/C++, could likely work at the Pi 3-and-above sort of level.

It's the software we are missing, as it's not been implemented, and the blinkered focus on a satellite with mics on top of a speaker is actually the worst scenario unless it's very well engineered, like the consumer products we see; that is far beyond simply 3D-printing an enclosure, as some have done previously.

Beamforming, AEC, filters and stereo ADCs, from the 2-mic ReSpeaker to various soundcards, have always been available, as has the KWS.
We don't have any good datasets, and that is where the big guys have open source beaten.
We should have been collecting them a long time back, and for some reason that obvious necessity has been ignored.

There is no ideal setup, and we are short of the quality building bricks (datasets) needed to build upon what we have.

True, but a single central device works well for consumers as they take up little space and combine multiple services. Sticking a mic on top of a speaker isn't a logically good idea, but it's the most compact form that consumers are happy with.

That could work, though it would require the consumer either to have a rat's nest of wires coming from a central point, or to have multiple wireless nodes, each streaming a microphone signal back, and then have the end user map the layout of the mics to the closest speaker for that zone. All of that adds complexity for the end user, who most likely wants something plug-and-play.

Indeed, there is no limit on how much someone can spend, but there is a reason the Echos and Google Homes have been adopted so broadly; I think it is because they are a good-enough unit for most users for playing music.

Of course, everything the corporations do is done for maximum profit / recurring revenue with minimal work. The Amazon Echo was intended to sell Amazon products and music to the masses; that's why it's sold at a loss and heavily locked down so we can't stream local music. Though, as they can see that's not as profitable as expected, they are dropping services and quality. Most users here have likely come from those ecosystems due to the reduction in quality of service, and I'm guessing most of them would like a device that can deliver a similar service to what the commercial smart speaker could do in its heyday.

Is there a particularly good-looking project that incorporates both audio pickup and 5 W audio playback in a single package that works well for community projects?

True, but there is only so much the community can do; most of us are doing this for fun in the evenings, whereas the big corps are doing it with a mountain of cash backing them for potential future revenue streams. I'm not sure if it's so much ignored, or whether the community didn't have the skills available, or the interest in anything beyond the low-hanging fruit.

As for the hardware, I'm trying to get my head around what you are suggesting. Would you have, for example in a living room, a single microphone in each corner of the room and a central speaker for rendering the feedback? I'm guessing most consumers wouldn't be happy doing this. Have a look at conference-room audio setups: I've seen very few of them with remote microphones; most have a combined speaker and microphone in a single unit. Even on a large desk (8 m) they seem to decide that it's not worth the mess of wires.

Consumers are happy with the highly developed and engineered housings that the likes of Google and Amazon can push out with economies of scale.
The point is, when a maker sticks a mic on a speaker in a plastic box, irrespective of AEC, that is all it is; and nope, so far it has made no maker that happy. RIP Mycroft.

No, we are talking mainly wireless for coverage, and this is what the ESP32-S3-Box is currently doing: streaming room audio 24/7.
You can have multiple wired microphones on one wireless node if that helps with coverage; again, it's choice.
We don't actually have a system that does far-field well, and you don't have to if you just add more mics to gain coverage.
Some rooms are of a size that a single microphone will never do.
The device I am suggesting is a "KWS mic" that only broadcasts on a KW hit, and only the device with the highest-confidence hit is streamed until the end of the command sentence. How many you have in total or in a room is your choice.

Exactly, it's about choice: I have a $10 amp on the back of a bookshelf speaker I already had (I have two actually, as true stereo sounds so much better), or I could put it in a box and call it a satellite.
The point is, as a user I am not defined and confined to only a toy-like box for my audio. I don't have the engineering skills that Google & Amazon have, or the economies of scale. So from cost-effective second-hand speakers to the latest and greatest, I can have choice, but I would be unlikely to be able to manufacture the likes of a Nest or Echo speaker.

There are quite a few such as GitHub - sonocotta/esp32-audio-dock: Audio docks for ESP32 mini (ESP32, ESP32C3, ESP32S2 and ESP8266 mini modules from Wemos)
I am not a fan, though, and am again against bundling it all as a single option.
There are many Squeezelite boards, but I prefer a line-out to give choice over the amplifier and speaker I use. Also, unlike consumer satellites, I can upgrade the amp or speaker without needing to change the satellite.
I also have a preference for Snapcast, as it has much finer sync timing, is multi-channel, and could also be your TV's surround sound. It can run on any device, with a choice of any soundcard or DAC for audio out.
Again, it's purely about choice: not defining and confining to a specific role, and letting users decide what they will use and how.

I am not suggesting anything; I am saying it's choice, and users can have whatever setup they wish.
It's you who is defining a single satellite, confining to that definition, and repeating this strange concept that it can only be so.

I think you might be confusing openWakeWord (which runs on the Pi) with microWakeWord which runs on the ESP32-S3. In fact, microWakeWord is based on the inception model from the Google KWS streaming repo :slightly_smiling_face:

I like this idea, and it could actually be done today with a regular ESP32 (not the S3). Do you think beamforming is less important if you have a handful of these KWS mics in a single room?

I saw your code for the delay sum beamformer, and am interested to talk more about it. I wonder if it could run on the ESP32 using the esp-dsp library for FFT. Looks like there’s some potential for a NEON-optimized version using kissfft.

I think this would be awesome. I could imagine even a single device where you plug it into power and it acts as an ESP satellite, but when you plug it into a Pi via USB it acts as a USB mic/speaker combo.

No, I am not; microWakeWord is new, the repo is only a month old, and I had switched off way before that.
I don't know Inception or how it's being run (streaming or not); it still likely suffers from very limited datasets (allow users to submit samples and promote certain KWs).
I think you will find they are using keras-applications/keras_applications/inception_v3.py at master · keras-team/keras-applications · GitHub rather than the GitHub - google-research/google-research: Google Research
A micro KWS could have happened a long time ago if it wasn't for the lack of datasets. Synthetic datasets work, but are a lesser substitute for actual real datasets. If we actually started collating KW samples with an opt-in, we would have so much choice in what KWS we could run.

I think both is likely best, but yeah, we can likely bypass technology with the simple hard physics of location and multiple locations.
Use the same KWS everywhere and an argmax over the KW hit scores to pick the best stream.
(All "ears" would connect with a small buffer delay, and you just drop the connections to the ones not needed.)
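A minimal sketch of that arbitration, assuming each KWS mic reports a hit score to the server (class and field names here are made up for illustration, not from any existing project):

```python
from dataclasses import dataclass

@dataclass
class KwHit:
    device_id: str      # e.g. "kitchen-mic-1" (hypothetical name)
    score: float        # KWS confidence for the keyword hit

def select_stream(hits):
    """Argmax over KW hit scores: return the device whose KWS scored
    the keyword highest, or None when nothing fired. The server would
    keep that stream for the command sentence and drop the rest."""
    if not hits:
        return None
    return max(hits, key=lambda h: h.score).device_id
```

The WebSocket plumbing and the small buffer delay are omitted; the point is only that zone selection reduces to a one-line argmax once every ear runs the same KWS.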

The delay-sum just hacked PortAudio and a circular buffer onto Robin's code, and there are many NEON-optimised FFT libs; KissFFT or pocketfft could likely speed things up and lighten the load, even if it is not that heavy already.
It just uses GCC-PHAT to get the time delay between the mics and then sums the two channels with that sample offset applied.
It's a very simple beamformer, but because it's a simple sum it will likely have no signature or artefacts that could confuse an ASR such as Whisper.
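As a rough illustration of that GCC-PHAT delay-sum idea (this is my own NumPy sketch, not the actual repo code), the whole thing is a whitened cross-correlation followed by a shift and an average:

```python
import numpy as np

def gcc_phat(sig, ref):
    """Estimate the delay (in samples) of `sig` relative to `ref` using
    GCC-PHAT: whiten the cross-spectrum so only phase remains, then find
    the peak of the resulting cross-correlation."""
    n = len(sig) + len(ref)                 # zero-pad for linear correlation
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-15                  # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(np.abs(cc)) - max_shift)

def delay_sum(ch0, ch1):
    """Two-channel delay-and-sum: align ch1 to ch0 by the estimated
    delay, then average the two channels."""
    d = gcc_phat(ch1, ch0)
    return 0.5 * (ch0 + np.roll(ch1, -d))
```

Because the output is just an average of time-aligned channels, there is no spectral shaping, which is why it should leave no artefacts for an ASR to trip over.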

Maybe, but without the algorithms to run them, more mics mean absolutely nothing.
We do have the ReSpeaker 2-Mic & at least two stereo-mic USB soundcards.
We just don't have any source separation algorithms, only the simple beamformer I did.
Without those it's just yet another multi-mic recording device with no real advantage.

Use openWakeWord and promote three KWs; microWakeWord uses Okay Nabu, Hey Jarvis & Alexa.
Alexa, because it's another's KW, is likely a bad idea; Home Assistant, Hey HA… maybe.
Add the code to capture the KW samples and package them, and maybe even go back to the days when I was shouting for a mic word prompter to record datasets.
If done raw on a 2-mic or stereo USB device, even better.
Provide metadata: we don't need names or addresses, just whether they are native speakers and the region/country they would associate with.
Common Voice is near useless, as it has almost no metadata and more non-native English speakers than native, and there is a huge spectral difference.
Age and gender are also great; we just don't need identity data.

As for creating PCBs, again, without the algorithms/software to run more advanced source separation and beamforming, it's probably pointless.

Espressif have already done the circuits for the ESP32-S3-Box versions.
What would be cool is to hack off all the unnecessary parts of the technology demonstrator, as it does have beamforming algorithms.
Use the 4-channel ADC they use, and on the same PCB have another ESP32 running Squeezelite, where the DAC line-out connects to the AEC reference on the ESP32-S3 ADC's 3rd channel.
Get rid of the rest, the screen and all the other bumf, so it's purely a far-field mic with dual micro KWS and attached zonal wireless audio.

I entirely agree with that; the dev kit is just that, for development and PoC. It's not ideal for the majority of users as a smart speaker.

We could use the ESP32-Korvo v1.1 as a base design, dedicating the ES8311 to one ESP32 and the ES7210 to the ESP32-S3. I'm guessing the dual-ESP32 approach would allow for better handling of services, but would that reduce some of the user experience at initial setup? I'm guessing both ESP32s would have to be configured independently (unless we could get them to sync Wi-Fi setup to each other via UART?).

Screen definitely isn’t needed
Battery charge circuit isn’t either
Neo pixel ring would be useful
Hardware Mute would be wanted by a fair few
5-20W mono amp would be nice to make a contained smart speaker

Have a look at the project I've been working on over on GitHub.
It is currently based around the schematics for the ESP32-LyraTD-MSC and the updates of the ESP32-S3-BOX-3, using the ZL38063 DSP at its core. I had been avoiding the ES7210 & ES8311 combo, as I believed that was putting extra workload on the ESP32-S3 that could have been handled externally on the DSP (AEC, NS, BF, AGC functions), which has support in the ESP-ADF. Though the closed-source nature and inaccessibility of the tuner software to the public is less than ideal.

I suppose the question I should be asking is (as I'm considering changing the project trajectory): do you think the ESP32-S3 with an ES7210 ADC has a mature enough software stack not to need a dedicated DSP handling the AEC, NS, BF and AGC functions?

I don't know for sure. You are in the UK, are you not?
Msg me your address and I will send you the orig and lite versions of the ESP32-S3-Box; I haven't got the 3rd version.

I am hoping the BSS (blind source separation) software is enough. I have a hunch that what Espressif is doing is using BSS to split the sources and then running a KWS on each split stream to select the working stream.
That is the catch with BSS: it will split audio sources very well, but which channel they end up in is near random.
Likely, if you jettison all that unnecessary stuff, there is room for 2x fatter KWS to select the command-sentence stream for WebSockets.

The ESP32-S3, with its vector instructions, is almost 10x faster than the standard ESP32, so the beamforming code should also port over, swapping PortAudio for the Espressif I2S streams.

Also, we likely don't need a DAC on the ESP32-S3; the Squeezelite can be a separate PCB that is just connected from its DAC output to the ESP32-S3 ADC's 3rd input.

Really, all that is needed is the 2 mics, ADC & S3 on a tiny PCB, with the 2 mics on the same FPC ribbon. I have actually forgotten what dimensions and sample rate the audio comes in at on the Box design, but the details are all there.

Likely it needs some sort of fabrication service such as Seeed Fusion to commission a small batch run, but what you can do is hack a Box to do the same with bespoke firmware.

If anyone can hack the Box and get the stereo audio stream off-board as samples, then likely from a listen you could tell how effective it could be.
The AEC is likely very sensitive to any latency, with a small tail due to its memory constraints, and I cannot remember if that is hardcoded in the blob or can be extended.

Indeed I am UK based; I'm in Truro, Cornwall. That's a very kind offer, and it would be nice to give those devices a look over.

As you mentioned getting audio out of the devices, I would be happy to modify them to provide a 3.5 mm audio jack output. The DAC on both devices will only output mono, but it should still be good enough for testing.
It's interesting to see that the BOX Lite has no AEC feedback from the DAC.

I have to admit the software for this project is still currently above my understanding, though I'm hoping I will be able to learn in my free time, while also trying to make something that is of use not only to my family but hopefully to the community too.

I'm coming around to the thought of using an ESP32 for audio playback and an ESP32-S3 for audio capture. It does feel like it adds complexity for the end user at initial setup, but if it allows us to increase the processing power and improve the user experience post-setup, so be it. I'm sure if the proof-of-concept device works, we can look at reducing costs (possibly by using an S3 for both roles, and following some RPi manufacturing concepts).

Sounds good to me. Effectively, one PCB would house an ESP32 with an ES8311, a line-level (interconnect) output and possibly an audio amplifier, along with support components.
The second PCB would be an ESP32-S3 with an ES7210, a line-level input for AEC, and a hardware mute control (using a data buffer); I would probably aim to populate the 3 mics that the ADC is capable of handling.
Would there be a need for the mics to be on an FPC PCB? I know most smart speakers have the MEMS units mounted pointing at the ceiling, but we could mount them on a sub-PCB to get them at right angles, pointing outwards around a circle (sorry, my ignorance of audio design is likely starting to show).

They are in the post.
I suggest you stick to 2 mics, as with blind source separation you split into as many streams as you have mics.
In domestic scenarios it's often just the command voice and one third-party noise source.
Supposedly it does do 3 mics, but that means running 3x KWS, if I am right about the operation.
https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/audio_front_end/README.html

You really need to use the Espressif IDF and not the Arduino framework, but I think it has extensions for VS Code and Eclipse that should make a decent IDE.

Likely you will have to hack the AFE where it's fed to WakeNet:
https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/wake_word_engine/README.html

https://docs.espressif.com/projects/esp-dsp/en/latest/esp32/esp-dsp-benchmarks.html

Likely it will, as the optimisation for the S3 is already done, so it's just a matter of porting the FFT code.
FFTs in 16-bit fixed point, which the audio is, are much faster on an S3.
The S3 has 2x bi-directional I2S ports, so you could use a quad ADC with 3 mics; a broadside array will have more side attenuation, and you still have the reference input.
The only real load is the GCC-PHAT, where it tries to correlate the time position of the samples in the 2 channels.
If you do that once, then the rest of the calcs should be much simpler trigonometry; maybe you could create a 4-mic beamformer running GCC-PHAT on only 2 mics, as you know the geometry of the 4 mics and the direction of the sound.
A 4-mic square is essentially 2x broadside arrays that could make a final endfire array with only a single GCC-PHAT calc on 2 channels.
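To make the "simpler trigonometry" concrete: once GCC-PHAT has produced a delay for one mic pair, the direction of arrival follows from an arcsin given the mic spacing. A sketch, with made-up parameter names and illustrative values (not code from the Box):

```python
import math

def doa_angle(delay_samples, mic_spacing_m, fs=48000, c=343.0):
    """Angle of arrival in degrees from broadside for a 2-mic pair,
    given the inter-mic delay in samples at sample rate `fs`."""
    x = delay_samples / fs * c / mic_spacing_m   # fraction of the max delay
    x = max(-1.0, min(1.0, x))                   # clamp floating-point error
    return math.degrees(math.asin(x))
```

With the 10-sample spacing discussed earlier, zero delay means the source is dead ahead (broadside), and the full 10-sample delay means it is at 90 degrees (endfire), so the other mic pairs can be steered from one measurement plus geometry.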