Best ESP32 based hardware for satellite

davosian · July 20, 2021, 7:06am

I am looking for a set of satellite microphones to be used in each room of my home. Voice output will be handled through Sonos speakers in the same room.

Based on your experience, what ESP32 based hardware can you recommend for satellites? The best wake word detection and voice recording is more important to me than the size of the device. Is this something a tiny esp32 based device can deliver? Or should I aim for USB based microphones connected to something like a pi?

@romkabouter I came across your repository on GitHub, Romkabouter/ESP32-Rhasspy-Satellite, which implements a standalone ESP32 MQTT audio streamer. It is designed to work as a satellite for Rhasspy (https://rhasspy.readthedocs.io/en/latest/) and supports multiple devices.
Can you recommend a particular piece of hardware from the ones you support?

Looking forward to your recommendations.

rolyan_trauts · July 20, 2021, 11:28am

If you can hold on, use a Pi Zero for now. Or, with an ESP32 priced at about $5, maybe just buy one and use it in the meantime.

The new ESP32-S3 is practically built for networked voice and vision AI, and Espressif are even doing libs for AEC (acoustic echo cancellation) and BSS (blind source separation).
There probably isn't a best ESP32 hardware for a satellite, as always-on streaming via MQTT is an extremely bad idea for quite a few reasons.

The ESP32-WROVER dev kits have a little extra RAM (external PSRAM) that makes them a bit more flexible, and interfacing I2S mics is quite easy.
You might want to wait for the ESP32-S3, but there is a whole range of libs by Espressif, such as the audio front end in esp-sr (esp-sr/audio_front_end on the esp32s3_dev branch of espressif/esp-sr on GitHub), that are not implemented in the above but might be at some stage.

romkabouter · July 20, 2021, 12:12pm

I only have the Matrix Voice and the M5 Atom.
The Matrix Voice has some nice features like the LED ring but does not have a nice case. Audio output is not that great and needs work (which I am currently doing).
The M5 Atom seems to have a low gain, but the form factor is very small and it works quite well in my experience (and setup).

If form factor does not really matter, you have more options with a device like a Pi.
Another reason to use a Pi is that the ESP32 streamer does not have local hotword detection (yet), so it is streaming audio over your WiFi network. This may or may not be an issue for you.
Regardless of yet another rant by rolyan about the MQTT streaming, this is currently the way Rhasspy works. I am waiting for Porcupine to release a library for the ESP32, but I do not know if this will happen.
If it does, I want to implement local hotword detection so constant streaming might not be needed anymore.

rolyan_trauts · July 20, 2021, 3:57pm

Both Espressif and TensorFlow provide NN models for KWS on the ESP32 and have done so for some time.

MQTT is a broadcast protocol, supposedly for lightweight messaging. Even one heavy binary stream isn't a great idea, and as you add rooms it gets even heavier. It is no rant to say that this is obviously a bad idea.
Also, not having a local KWS means it is streaming all the time, and all nodes are subject to this broadcast traffic.
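
As a rough illustration of how the load grows with the number of rooms, here is a back-of-the-envelope calculation. It assumes 16 kHz, 16-bit, mono PCM per satellite, which is an assumption and not a figure from this thread, and it ignores MQTT and WAV header overhead, so real traffic would be somewhat higher.

```python
# Assumed audio format (not stated in the thread): 16 kHz, 16-bit, mono PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
CHANNELS = 1


def stream_kbit_per_s(satellites: int) -> float:
    """Raw audio payload rate for N always-on satellites, in kbit/s."""
    bytes_per_s = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * satellites
    return bytes_per_s * 8 / 1000


for n in (1, 4, 8):
    print(f"{n} satellite(s): {stream_kbit_per_s(n):.0f} kbit/s of constant audio")
```

Under these assumptions each always-on satellite adds a constant 256 kbit/s of payload, whether or not anyone is speaking.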

I could do an ESP32 KWS just like atomic did, but get better results through a better choice of dataset and training methods (see kws_streaming in google-research/google-research on GitHub).

But as I said, I am waiting for the ESP32-S3, as it can use a better model type and it looks like it can also do beamforming.
The S3 is just the next-gen Xtensa LX7 MCU and will eventually be priced about the same as the ESP32. Espressif are also providing beamforming and AEC, and with the new vector instructions and that bit more oomph it will probably be the perfect satellite platform.

davosian · July 20, 2021, 8:41pm

Hm, taking advantage of the ESP32-S3 certainly sounds tempting. My plan is to have a first installment ready by the end of this year, but if I understand you correctly, this might still be a little bit early for an ESP32-S3 based solution. I saw that the chip was announced back in January, but it is not yet available and neither are the libraries you mentioned, right?

Sounds like using a Pi might be a solid interim solution for now.

Thanks for the valuable insight, @rolyan_trauts

davosian · July 20, 2021, 9:00pm

I have two Matrix Voice (without ESP32 though) and a Respeaker 4 mic array which I will probably use to start out until we have a more polished ESP32 based solution ready. As you already mentioned, the Matrix is quite large and I have not tried it with Rhasspy yet, but as far as quality goes, it should do the trick.

The Respeaker 4 works well when I am not too far away, but otherwise (5 meters or more) getting it to trigger the hotword (porcupine) takes a few attempts most of the time. I am actually curious to see how it compares to the Matrix Voice with its higher number of microphones. I also tried a Jabra 510 which works very well when close, but not at all when I am further away (which is probably what it has been designed for). I have to admit however, that I did not do any fine tuning of the microphone sensitivity yet on either device.

This situation is actually what triggered my initial question: I am thinking of putting more than one satellite mic in a room so that I am close to one most of the time. Especially the small ESP32 based combos could simply be plugged into an outlet.

Alternatively, I might put Zigbee based buttons around the house and trigger the hotword through pressing the button. This would also allow me to lower the sound on any speakers or TV within the same room (this of course I could do already without using a button by subscribing to the hotword triggered events).
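
A minimal sketch of the event-subscription idea, showing only the message handling and leaving the MQTT client out. It assumes the Hermes topic layout `hermes/hotword/<wakewordId>/detected` with a JSON payload carrying a `siteId` field; the site IDs and speaker names in the mapping are hypothetical.

```python
import json

# Hypothetical mapping from satellite site IDs to the speaker in the same room.
SPEAKERS_BY_SITE = {
    "livingroom": "Sonos Living Room",
    "kitchen": "Sonos Kitchen",
}


def speaker_to_duck(topic: str, payload: bytes):
    """Return the speaker to turn down for a hotword event, or None."""
    parts = topic.split("/")
    # Assumed topic layout: hermes/hotword/<wakewordId>/detected
    if len(parts) != 4 or parts[:2] != ["hermes", "hotword"] or parts[3] != "detected":
        return None
    site_id = json.loads(payload).get("siteId")
    return SPEAKERS_BY_SITE.get(site_id)
```

The function would be called from whatever MQTT client callback receives the message; lowering the volume itself is left to the home automation side.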

From what I gather so far, however, I will give it a few more months and see where we are at then when it comes to solid satellite options.

Either way, keep up this great work @romkabouter - I will follow along!

romkabouter · July 21, 2021, 7:03am

Yes, when attached to the Pi it is a good Mic in my experience.
Microphones are an important issue for voice assistants.

If you want more than one (small) mic hidden away in a room, the M5 Atom is very small, a bit like a cube with 1.5cm sides. The main issue with a lot of mics is the network load I guess, at least until there is good local hotword detection.

My original idea was to create audio switches: a small device at a room entrance, triggered by a local hotword/command (like "lights on"). So when you walk into a room you can call "lights on" and the lights in that room turn on.
Without local hotword detection, that is still an idea and not implemented.

That is a nice idea!

rolyan_trauts · July 21, 2021, 1:58pm

Far field is really all about AGC (automatic gain control), and about the gain not being so high that the SNR (signal-to-noise ratio) suffers and much of what you capture is noise.
You get reverberation and the doppler effect on frequency in larger rooms and spaces, but mostly it is that sound levels diminish quickly over distance.

So check you have some form of AGC running and if not supported try software AGC.
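
For illustration, a minimal software AGC sketch on 16-bit PCM frames. The class name, target level, gain cap and smoothing rates are all assumptions picked for the example, not values from any particular library. The gain cap is there so that quiet, far-field input is not amplified until it is mostly noise.

```python
import math


def rms(frame):
    """Root-mean-square level of a list of 16-bit PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


class SimpleAGC:
    """Frame-based AGC: fast to reduce gain, slow to raise it, with a cap."""

    def __init__(self, target_rms=3000.0, max_gain=8.0, attack=0.5, release=0.05):
        self.target_rms = target_rms  # level we try to reach (assumed value)
        self.max_gain = max_gain      # cap so noise is not boosted without limit
        self.attack = attack          # smoothing rate when gain must drop
        self.release = release        # smoothing rate when gain may rise
        self.gain = 1.0

    def process(self, frame):
        level = rms(frame)
        if level > 0:
            desired = min(self.target_rms / level, self.max_gain)
            rate = self.attack if desired < self.gain else self.release
            self.gain += rate * (desired - self.gain)
        # Apply the gain and clip to the 16-bit range.
        return [max(-32768, min(32767, int(s * self.gain))) for s in frame]
```

Feeding successive frames through `process` keeps the gain state between calls, so the level adapts gradually rather than jumping per frame.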

Dunno how many times I have to say this, but microphones are not an important issue for voice assistants; it's the DSP processing of array microphones that is important.
I can stick a $0.5 electret on a $3 USB soundcard and have a good microphone, in fact one that will likely work better than both of the above.

I have repeatedly stated that a KWS server is all that is needed, using websockets, as they are lightweight and supported on the ESP32 but also allow string and binary bi-directional communication.
You're limited to using the same hardware within a room, but all mics should first send a KW probability hit and then start to broadcast.
The KWS server will just process the best signal and kick the others.
A distributed array just needs a coordinating server, and that can output to stdout or even MQTT if you were so strangely inclined.
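
The arbitration step described above could look roughly like the sketch below. It covers only the decision logic; the websocket transport is left out, and every name here (`KwHit`, `pick_winners`, `decisions`, the 0.3 s window) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KwHit:
    room: str
    satellite_id: str
    probability: float  # keyword probability reported by the satellite
    timestamp: float    # arrival time in seconds


def pick_winners(hits, window_s=0.3):
    """Per room, keep the highest-probability hit within window_s of the first hit."""
    winners = {}
    first_seen = {}
    for hit in sorted(hits, key=lambda h: h.timestamp):
        start = first_seen.setdefault(hit.room, hit.timestamp)
        if hit.timestamp - start > window_s:
            continue  # too late to belong to the same utterance
        best = winners.get(hit.room)
        if best is None or hit.probability > best.probability:
            winners[hit.room] = hit
    return winners


def decisions(hits, window_s=0.3):
    """Tell each satellite whether to keep streaming or drop out."""
    winners = pick_winners(hits, window_s)
    return {
        h.satellite_id: "stream" if winners.get(h.room) is h else "drop"
        for h in hits
    }


hits = [
    KwHit("kitchen", "k1", 0.71, 10.00),
    KwHit("kitchen", "k2", 0.93, 10.05),
    KwHit("lounge", "l1", 0.80, 10.02),
]
print(decisions(hits))
```

In this example the two kitchen satellites hear the same keyword and only `k2`, with the higher probability, is told to stream, while the lounge satellite is handled independently.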

This topic has been raised multiple times over quite some time, and I am bemused as to why it has not been implemented. Rather than single long-range, all-encompassing mics, you can do it simply by positioning distributed room mics.
Also, what is probably more troublesome for voice AI is noise from anything from a washing machine to a TV. With distributed mics, by sheer numbers you increase the chance that at least one is voice=near / noise=far.

But as for Matrix, it's old news and an old post says it all.

rolyan_trauts · July 28, 2021, 1:53pm

PS as an update I think the esp32-s3 will be on sale November time.
Initial prices are about double the price of the ESP32 WROVER chip, which from dim memory is about the same as when the first ESP32 was released.

Radxa just released an interesting quad-core A53 at 1.8GHz called the Radxa Zero. It is amazingly Pi Zero sized but uses an Amlogic S905Y2 on 12nm, hence the higher clock speed than the Pi 3.
It is still really early days, as devs are still putting together working images.
Interesting though, as it does have 8-channel I2S/PDM.

davosian · July 28, 2021, 2:53pm

Thanks for this great insight @rolyan_trauts. Looks like a bright future is ahead of us in this regard (except for the supply shortage).

On the software side, getting local hotword detection working will probably be key to reach the next milestone for that kind of setup.

In the meantime, I will put a few raspi/mic setups around my flat and collect some experience. Also, will try to optimize what I got by tweaking the settings.

Moving to a new place in December. Hopefully by then I will know which route to take. On the other hand, it would be ok to start out with what I got and improve the setup next year as long as I manage to get to a “usable” state for our smart home use case.

davosian · July 28, 2021, 2:53pm

Yep, also looking forward to this one…

rolyan_trauts · July 28, 2021, 5:16pm

if any interest

rolyan_trauts · August 3, 2021, 10:14pm

Update on the esp32-s3 and libs

davosian · August 4, 2021, 9:21am

This is great stuff! As mentioned in the article, Espressif is even working on (or at least using) a two-mic array for this SoC.

rolyan_trauts · August 4, 2021, 10:38pm

Yeah, also because the ESP32-S3 is the newer Xtensa LX7 that has vector instructions to accelerate AI, it should be able to run more complex models such as DS-CNN, which is a good combination of 100% TF embedded compatibility and accuracy.

The ESP32 has been able to run a CNN for some time, with little left over for any other processing, though it could still be a simple satellite KWS.
The extra that the dual-core LX7 gives, together with the pre-written audio front-end algorithms, should make it a no-brainer for satellite KWS, as Espressif have focused strongly on that application.
Any I2S mic can connect to an ESP32. I forget how many channels it has; I think it's x2 (so x4).
Still probably not until November, and at first prices will be at a premium until the many clone boards arrive and the market starts to sell in bulk.

They did a demo video; dunno how it will be on receipt.

davosian · August 6, 2021, 1:16pm

Christmas is coming closer

rolyan_trauts · September 21, 2021, 3:39am

34 Expected 28/10/2021

https://www.mouser.co.uk/ProductDetail/Espressif-Systems/ESP32-S3-DevKitC-1-N8R2?qs=Wj%2FVkw3K%2BMCYPoeNuhXFsw==

I expect it will be about the same, as the official dev kits are £10, which I think was about the same as the original ESP32, which ended up with clones for around $5.

davosian · September 21, 2021, 6:06am

this is soon - looking forward to it!

red · November 1, 2021, 9:16pm

This is a really nice idea, I like it!
Finally, it would be amazing if you could maintain the hotwords for all satellites centrally (e.g. in Rhasspy or so).

romkabouter · November 1, 2021, 9:55pm

I am experimenting with this Build a Keyword Spotting Model with Your Own Voice in 30K RAM

Already implemented it with a default yes/no because I was lazy.
What I need is a simple tool to upload and train data, which above mentioned seems to provide.
You still need to provide a lot of data for the keywords, but basically “lights on” and “lights off” should be a good start.

It does not seem to perform well in my satellite code, but stripped down to the core it works on my M5 Atom Echo.

Maintaining in Rhasspy is a good idea, but I have not given that much thought yet.