Best ESP32 based hardware for satellite

rolyan_trauts · August 3, 2021, 10:14pm

Update on the esp32-s3 and libs

davosian · August 4, 2021, 9:21am

This is great stuff! As mentioned in the article, Espressif is even working on (or at least using) a two-mic array for this SoC.

rolyan_trauts · August 4, 2021, 10:38pm

Yeah, and because the ESP32-S3 uses the newer Xtensa LX7, which has vector instructions to accelerate AI, it should be able to run more complex models such as DS-CNN, which is a good combination of accuracy and 100% TensorFlow embedded compatibility.

The ESP32 has been able to run a CNN for some time, but with little left over for any other processing, even if that could be enough for a simple satellite KWS.
The extra headroom of the dual-core LX7, together with the pre-written audio front-end algorithms, should make it a no-brainer for satellite KWS, as Espressif has focused strongly on that application.
Any I2S mic can connect to an ESP32. I forget how many I2S channels it has, but I think it's x2 (so x4 mics).
Still, probably not until November, and at first prices will be at a premium until the clone boards arrive and the market starts to sell in bulk.

They did a demo video, but I don't know how it will be on receipt.

davosian · August 6, 2021, 1:16pm

Christmas is coming closer

rolyan_trauts · September 21, 2021, 3:39am

Mouser lists 34 units expected on 28/10/2021:

https://www.mouser.co.uk/ProductDetail/Espressif-Systems/ESP32-S3-DevKitC-1-N8R2?qs=Wj%2FVkw3K%2BMCYPoeNuhXFsw==

I expect the price will be about the same: the official dev kits are £10, which I think was about the same as the original ESP32, which ended up with clones for around $5.

davosian · September 21, 2021, 6:06am

This is soon. Looking forward to it!

red · November 1, 2021, 9:16pm

This is a really nice idea, I like it!
Finally, it would be amazing if you could maintain the hotwords for all satellites centrally (e.g. in Rhasspy or so).

romkabouter · November 1, 2021, 9:55pm

I am experimenting with this: "Build a Keyword Spotting Model with Your Own Voice in 30K RAM".

I already implemented it with a default yes/no model because I was lazy.
What I need is a simple tool to upload and train data, which the article mentioned above seems to provide.
You still need to provide a lot of data for the keywords, but basically “lights on” and “lights off” should be a good start.

It does not seem to perform well in my satellite code, but stripped down to the core it works on my M5 Atom Echo.

Maintaining in Rhasspy is a good idea, but I have not given that much thought yet.

rolyan_trauts · November 3, 2021, 6:22am

There is also something similar for the Pico.

But like everything we have, without audio processing libs for beamforming or blind source separation, they return pretty poor results in the presence of noise.

I am still not sure what the state is with the ESP32-S3, which has far more powerful vector math than simple microcontrollers and is specifically aimed at a complete voice package.
There are some kits out at £20. They seem to be going through revisions and are mainly engineering samples, but I think everything is being affected by the silicon shortages.

Also, the big thing now is federated learning, which solves huge problems in privacy: a general-purpose global model can be trained and have its weights affected by a locally trained small model, so that user data can provide user-specific accuracy without the privacy problems.

It's all part of Android 12, and especially the new Pixel 6 phones, where through use and correction a local model begins to learn the user's regional accent and expressions. That is all part of the new Tensor chip, which they have provided with quite a strong embedded TPU (vector math).
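
As a rough illustration of the idea only, here is a minimal sketch of federated averaging, one common way to combine locally trained weights into a global model. It is not a description of what the Pixel phones actually do; the function name, the weight vectors and the sample counts are all made up for the example.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client weight vectors, weighted by local sample count.

    Only the weights leave each device; the raw user audio never does.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical example: two devices, each holding a tiny 2-parameter model.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [1, 3]  # the second device trained on three times as much data
global_weights = federated_average(clients, sizes)
print(global_weights)
```

The device with more local data pulls the global weights further towards its own model, which is the "weights affected by a locally trained model" part.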

romkabouter · November 3, 2021, 1:54pm

Yeah, I was wondering about that as well but I am just experimenting a bit

rolyan_trauts · November 11, 2021, 2:23pm

Just noticed a new audio project for Xtensa-based devices:

https://thesofproject.github.io/latest/index.html

romkabouter · November 25, 2021, 8:13am

rolyan_trauts · November 26, 2021, 5:59am

The box itself is very cheap at $50, and I think it's using the Alexa repo for ESP32, but I will have to do some more reading.
The DevKitC is available for $15: https://www.mouser.co.uk/ProductDetail/Espressif/ESP32-S3-DevKitC-1-N8R8?qs=sGAEpiMZZMv0NwlthflBi%2BzwRcElzYQ0Q0bvCS%2BJ0vw%3D

What is of interest is how well

https://www.espressif.com/en/solutions/audio-solutions/esp-afe

works, as AEC, BSS & NS finally arrive on a platform that makes extremely cost-effective satellites. But unless Hermes audio and, IMO, the manner of satellite implementation have changed, we have the satellites but Rhasspy is still in need of major change.

Still, to have mics, case & display included, $50 is hard to beat.
I ordered from AliExpress; it cost £45.46 delivered to the UK.

romkabouter · November 26, 2021, 7:07am

I ordered one as well, to see if I can get local wakeword running and have something left for other processes.

At least there is support for the ESP32 satellite. With local wakeword, audio streaming over MQTT is still not a big issue in my opinion.
Local wakeword for a Pi is already easy. So I still agree that MQTT is not the best for audio streaming alone, but it is still a good choice as glue between all the different functionality. And streaming audio actually works well too, so I still disagree that Rhasspy needs a major change on this.
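
For context, a rough sketch of what streaming audio over MQTT looks like on the wire, assuming the Hermes convention of publishing short WAV-wrapped chunks to `hermes/audioServer/<siteId>/audioFrame`. The site id `satellite1`, the 256-sample chunk size and the 16 kHz 16-bit mono format are assumptions for illustration, and the actual publish call is left out so the sketch needs no MQTT client.

```python
import io
import wave


def pcm_to_wav_frames(pcm, site_id="satellite1", sample_rate=16000, chunk_samples=256):
    """Yield (topic, payload) pairs, one small WAV file per chunk of raw PCM."""
    topic = f"hermes/audioServer/{site_id}/audioFrame"
    chunk_bytes = chunk_samples * 2  # 16-bit mono: 2 bytes per sample
    for start in range(0, len(pcm), chunk_bytes):
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)
            wav.setframerate(sample_rate)
            wav.writeframes(pcm[start:start + chunk_bytes])
        yield topic, buf.getvalue()


# 1024 samples of silence stand in for microphone input.
frames = list(pcm_to_wav_frames(bytes(2048)))
print(len(frames), frames[0][0], len(frames[0][1]))
```

Each payload would then be handed to whatever MQTT client the satellite uses. Note that every chunk carries its own WAV header on top of the uncompressed samples.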

JanWolf · November 26, 2021, 10:08am

Ordered one also! Hope this will fill the need.

rolyan_trauts · November 26, 2021, 12:20pm

Yeah, it should run any TensorFlow model @romkabouter, and I keep meaning to have a look at the newer federated learning that is in their Pixel phones.
In the Pixel phones it combines a standard model with a locally trained model, but how and what exactly I don't know, as I haven't been active with TensorFlow lately.
I got one just to see how Espressif have it set up and to delve into the code, as it's much easier to hack away than to build up from scratch.
I don't know if the mics are PDM or I2S, and I am also interested in what they have used.

Also, I haven't paid much attention to Rhasspy lately, but I am still calling for audio input/output zones so that multiple satellites and rooms can be provided for by a single brain.

Uncompressed raw audio over any protocol in the 2020s is a horrendous solution in its own right, but to encapsulate that in lightweight control broadcast traffic is only glue in the sense that s hit apparently sticks.

romkabouter · November 26, 2021, 3:49pm

When there is local KWS on the satellite, there is only a short while of actually sending uncompressed audio.
I assume that does not outweigh the major change needed for a different solution atm.
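
To put rough numbers on that, a back-of-the-envelope sketch. The 16 kHz 16-bit mono format and the figure of 20 commands per hour at 5 seconds each are assumptions picked for illustration, not measurements.

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    """Raw PCM data rate in bytes per second, ignoring any framing overhead."""
    return sample_rate_hz * bits_per_sample // 8 * channels


rate = pcm_bytes_per_second(16000, 16, 1)

# Always-on satellite: streams for the full hour.
always_on_per_hour = rate * 3600

# KWS-gated satellite: streams only after a wake word (hypothetical usage).
commands_per_hour = 20
seconds_per_command = 5
gated_per_hour = rate * commands_per_hour * seconds_per_command

print(rate, always_on_per_hour, gated_per_hour)
```

On these assumed numbers, an always-on satellite moves about 115 MB per hour, against about 3.2 MB per hour when streaming is gated by local KWS.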

So while you have a good point in theory about raw audio, the actual problem with it is not very big in my opinion. Maybe Mike is going to make that major change you so much desire in the future. I do not know, and do not really care either.

romkabouter · November 26, 2021, 3:51pm

Yes, I am curious about that too. Probably no custom keywords, but interesting nonetheless.

rolyan_trauts · November 27, 2021, 6:47am

Again, Hermes audio and the satellite implementation of Rhasspy did not have local KWS on a satellite, and like the Matrix and ESP products you seem to support (this is your firmware we are talking about here), they are purely always-on broadcast satellites.

Even if a local KWS satellite was used, you seem to forget about the all-important ASR that contains the intent, rather than purely a keyword to start broadcasting.
I am aware you do not know and don’t really care, even though you often comment on the matter.

Zones is something else fundamental that Rhasspy misses. Call it zones, rooms or whatever, but it's absolutely essential to be able to easily group and associate input/output, of which audio is one.

It's sort of sad to see what was a very busy and active community turn into the trickle that it is today, which cannot be attributed to some of the community not caring, but hey, at least they tried.

romkabouter · November 27, 2021, 7:55am

Maybe you should put more effort into actually contributing to Rhasspy to resolve the issues you are so passionate about, instead of always typing the same long texts regarding the matter.

That would be more productive. You post a lot of negativity but have made zero contribution to change.

And Rhasspy actually supports groups by the way.