Best ESP32 based hardware for satellite

rolyan_trauts · November 3, 2021, 6:22am

There is also something similar for the Pico.

But like everything we have, without audio processing libs for beamforming or blind source separation they return pretty poor results in the presence of noise.

Still not sure what the state is with the ESP32-S3, which has far more powerful vector math than simple microcontrollers and is specifically aimed at a complete voice package.
There are some kits out at around £20; they seem to be going through revisions and are mainly engineering samples, but I think everything is being affected by the silicon shortages.

Also, the big thing now is federated learning, which solves huge problems in privacy: a general-purpose global model can be trained and have its weights affected by a small locally trained model, so that user data can garner user-specific accuracy without privacy problems.

It's all part of Android 12 and especially the new Pixel 6 phones, where through use and correction a local model begins to learn the user's regional accent and expressions. That is all part of the new Tensor chip, which they have provided with quite a strong embedded TPU (vector math).
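
As a rough illustration of the idea above, here is a minimal sketch of federated averaging (FedAvg) on a toy linear model. This is an assumption about the general technique, not a description of what Google ships on the Pixel; the function names `local_update` and `federated_average` are hypothetical. Only weights leave each client, never the raw training data.

```python
from typing import List, Sequence


def local_update(global_weights: Sequence[float],
                 features: Sequence[Sequence[float]],
                 targets: Sequence[float],
                 lr: float = 0.1,
                 epochs: int = 5) -> List[float]:
    """Fine-tune a copy of the global linear model on one client's private data."""
    w = list(global_weights)
    for _ in range(epochs):
        for x, y in zip(features, targets):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            # Plain SGD step on squared error; the data never leaves this function.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Combine client models into a new global model, weighted by sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * size for w, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]
```

A server round would call `local_update` on each device and then `federated_average` on the returned weights; a client with more samples pulls the global model further towards its own.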

romkabouter · November 3, 2021, 1:54pm

Yeah, I was wondering about that as well but I am just experimenting a bit

rolyan_trauts · November 11, 2021, 2:23pm

Just noticed a new audio project for Xtensa-based platforms:

https://thesofproject.github.io/latest/index.html

romkabouter · November 25, 2021, 8:13am

rolyan_trauts · November 26, 2021, 5:59am

The box itself is very cheap at $50, and I think it's using the Alexa repo for ESP32, but I will have to do some more reading.
The DevKitC is available for $15: https://www.mouser.co.uk/ProductDetail/Espressif/ESP32-S3-DevKitC-1-N8R8?qs=sGAEpiMZZMv0NwlthflBi%2BzwRcElzYQ0Q0bvCS%2BJ0vw%3D

What is of interest is how well

https://www.espressif.com/en/solutions/audio-solutions/esp-afe

works, as AEC, BSS & NS finally arrive on a platform that makes extremely cost-effective satellites. But unless Hermes audio and, IMO, the manner of satellite implementation have changed, we have the satellites but Rhasspy is still in need of major change.

Still, to have mics, case & display included, $50 is hard to beat.
I ordered from AliExpress; it cost £45.46 delivered to the UK.

romkabouter · November 26, 2021, 7:07am

I ordered one as well, to see if I can get local wakeword running and have something left for other processes.

At least support for the esp32 satellite. With local wakeword, audio streaming over mqtt is still not a big issue in my opinion.
Local wakeword for a Pi is already easy, so I still agree that MQTT is not the best for audio streaming alone, but it is still a good choice as glue between all the different functionality. And streaming audio actually works well too, so I still disagree that Rhasspy needs a major change on this.

JanWolf · November 26, 2021, 10:08am

Ordered one also! Hope this will fill the need.

rolyan_trauts · November 26, 2021, 12:20pm

Yeah, it should run any TensorFlow model @romkabouter, and I keep meaning to have a look at the newer federated learning that is in their Pixel phones.
In the Pixel phones it combines a standard model with a locally trained model, but how and what exactly I dunno, as I haven’t been active with TensorFlow lately.
I got one just to see how Espressif have it set up and to delve into the code, as it’s much easier to hack away than to build up from scratch.
I dunno if the mics are PDM or I2S, and I am also interested in what they have used.

Also, I haven’t paid much attention to Rhasspy lately, but I am still calling for audio input/output zones so multiple satellites and rooms can be provided for by a single brain.

Uncompressed raw audio over any protocol in the 2020s is a horrendous solution in its own right, but to encapsulate that in lightweight control broadcast traffic is only glue as s hit apparently sticks.

romkabouter · November 26, 2021, 3:49pm

When there is local KWS on the satellite, there is only a short while of actually sending uncompressed audio.
I assume that does not outweigh the major change needed for a different solution atm.
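
To put rough numbers on "a short while" of uncompressed audio, here is a back-of-the-envelope sketch. It assumes 16 kHz, 16-bit, mono PCM, which is an assumption about a typical satellite setup and not a figure stated in this thread; the helper names are hypothetical.

```python
def raw_audio_bytes(seconds: float,
                    sample_rate_hz: int = 16000,
                    sample_width_bytes: int = 2,
                    channels: int = 1) -> int:
    """Payload size of uncompressed PCM for a clip of the given length."""
    return int(seconds * sample_rate_hz * sample_width_bytes * channels)


def raw_audio_kbps(sample_rate_hz: int = 16000,
                   sample_width_bytes: int = 2,
                   channels: int = 1) -> float:
    """Sustained bitrate of the raw stream in kilobits per second."""
    return sample_rate_hz * sample_width_bytes * 8 * channels / 1000
```

Under those assumptions, `raw_audio_kbps()` gives 256 kbit/s while streaming, and a 4-second command after the wake word is `raw_audio_bytes(4)` = 128,000 bytes, versus roughly 115 MB per hour for an always-on stream.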

So while you have a good point in theory about raw audio, the actual problem with it is not so very big in my opinion. Maybe Mike is going to make that major change you so much desire in the future, I do not know and do not really care either

romkabouter · November 26, 2021, 3:51pm

Yes, I am curious about that too. Probably no custom keywords, but interesting nonetheless.

rolyan_trauts · November 27, 2021, 6:47am

Again, Hermes audio and the satellite implementation of Rhasspy did not have local KWS on a satellite, and like the Matrix and ESP products you seem to support (this is your firmware we are talking about here) they are purely always-on broadcast satellites.

Even if a local KWS satellite was used, you seem to forget about the all-important ASR audio, which contains the intent, rather than purely a keyword to start broadcasting.
I am aware you do not know and don’t really care, even though you often comment on the matter.

Zones are something else fundamental that Rhasspy misses. Call it zones, rooms or whatever, but it’s absolutely essential to be able to easily group and associate input/output, of which audio is one.

It’s sort of sad to see what was a very busy and active community turn into the trickle that it is today, which cannot be attributed to some of the community not caring, but hey, at least they tried.

romkabouter · November 27, 2021, 7:55am

Maybe you should put more effort into actually contributing to Rhasspy to resolve the issues you are so passionate about, instead of always typing the same long texts regarding the matter.

That would be more productive; you post a lot of negativity but have zero contribution to change.

And Rhasspy actually supports groups by the way.

rolyan_trauts · November 27, 2021, 7:55pm

I did at one stage, but things are so far out of whack, and certain members are far too ready to turn to ad hominem or state they don’t care rather than discuss.
Rhasspy may have an overly complex programmatic method of providing groups, but like skills, there really should be a plethora of uses in existence, and the absence speaks volumes about how the current implementation meets needs and requirements.
Every time I hear Hermes Audio mentioned it’s a red flag to not waste my time, so I no longer do.
Occasionally, as in the 1st reply of this thread, I do post on what have been some long-term voids in an adequate voice/smart AI.

romkabouter · November 27, 2021, 9:54pm

I cannot recall that I saw any pull request from you. Ever. But if you did, sorry about that and keep up the good work.

Also, keep up the good work for Rhasspy with regards to implementing the missing voids so we can move away from raw audio streaming. Hopefully some pull requests can be seen from you

I do not disagree with you on raw audio streaming, but I am not a Rhasspy developer and unlike you I do not have those big red flags on it. I would otherwise most probably have done something trying to change it. Instead my firmware just supports Rhasspy as it is, like it did a long time ago for Snips (where the whole mqtt audio shizzle is coming from in the first place)
If Rhasspy changes, the firmware will change along with it.

rolyan_trauts · November 28, 2021, 7:30am

If I am going to be honest, @romkabouter, Rhasspy sort of doesn’t fit my needs anymore, and I am also not a developer. Even though I have done quite a bit of work with TensorFlow, my main interest has been getting the audio engineering working, and how on a Raspberry Pi platform we are near totally absent of the critical audio processing stages.

The ESP32-S3 could be a game changer, but how well it works has still to be seen, as it does have AEC (Acoustic Echo Cancellation), BSS (Blind Source Separation) & NS (Noise Suppression).
BSS is an alternative to beamforming, and it will be interesting to see the results, as the best solutions use both, and the combination of beamforming & BSS is probably how at least the Google devices work.
Amazon might use multiple-mic beamforming alone with 6-microphone arrays, whilst Google used 2 but has, I think, a 3-microphone linear array in its Nest Audio products.
More microphones increase the resolution and accuracy of beamforming, but each additional microphone exponentially increases process load.
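
For readers unfamiliar with beamforming, a minimal sketch of the simplest variant, delay-and-sum, is shown below. It is a generic textbook illustration, not what Amazon, Google or Espressif actually run, and it assumes the per-channel delays (in whole samples) are already known; `delay_and_sum` is a hypothetical helper name.

```python
from typing import List, Sequence


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Align each mic channel by its known lag (in samples) and average them.

    Speech arriving from the steered direction adds up coherently, while
    uncorrelated noise on each mic is averaged down.
    """
    # Only keep the span for which every shifted channel still has samples.
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(usable)
    ]
```

For example, `delay_and_sum([mic_a, mic_b], [0, 1])` steers towards a source whose sound reaches `mic_b` one sample later than `mic_a`. Real systems also have to estimate those delays, and adaptive beamformers cost considerably more per added microphone than this fixed version.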

So how 2 mics with BSS alone will pan out is currently of interest; the algorithms are incredibly math intensive, but they have advanced.
SpeechBrain and Professor F. Grondin (SpeechBrain: Speech Processing) provide some great examples that hark back to his excellent ODAS app ([2103.03954] ODAS: Open embeddeD Audition System).

The other advantage of the ESP32-S3 is that it’s a low-cost microprocessor with vector math. The dev-kit board alone has arrived at $15, which is not far off where the original ESP32 started, so if it is a success then economies of scale should see similar prices, maybe as low as sub-$5 ESP32 boards.

Due to the low cost, this should allow the possibility of distributed wide-array microphones, where multiple 2-mic arrays can use KW hit probability to select the best signal source for a voice sentence.
This means you garner a hybrid of a conference wide-array microphone system with low-cost distributed array microphones, to rival and better commercial alternatives.
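
The selection step described above could be sketched roughly as follows. This is an illustrative assumption, not existing Rhasspy or Espressif behaviour: each satellite is assumed to report its wake word probability with a timestamp, and the names `WakeHit`, `select_best_satellite`, `window_s` and `threshold` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class WakeHit:
    satellite_id: str
    probability: float   # KW model confidence, 0.0 to 1.0
    timestamp: float     # seconds, on a clock shared by all satellites


def select_best_satellite(hits: Iterable[WakeHit],
                          window_s: float = 0.5,
                          threshold: float = 0.5) -> Optional[str]:
    """Pick the satellite that heard the wake word most confidently.

    Only hits within `window_s` of the earliest detection are treated as
    the same utterance; later ones are assumed to be a separate event.
    """
    candidates = [h for h in hits if h.probability >= threshold]
    if not candidates:
        return None
    first = min(h.timestamp for h in candidates)
    same_utterance = [h for h in candidates if h.timestamp - first <= window_s]
    return max(same_utterance, key=lambda h: h.probability).satellite_id
```

The server would then accept the command audio only from the returned satellite and tell the others to stay quiet for that sentence.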

rolyan_trauts · November 28, 2021, 8:01am

You have been providing firmware for microphones that lack any audio pre-processing and that have all the show stoppers that noise, distance and echo pose for recognition.

What you have been providing is essentially pointless, as it provides no better results than a single el cheapo mic on a sound card.
You have disliked my posts informing others that, apart from a remote microphone, they should not expect anything more, or anything like Google or Amazon performance, as certain critical processes of voice AI are missing from your firmware.

MQTT is a lightweight control protocol that should be encrypted. Because it is broadcast traffic, embedding audio in its payload means nodes that are not audio specific have to decrypt heavy audio traffic just to reject it as not needed.
This is fine on the audio sender/receiver but completely kills the use of lightweight control nodes in that MQTT network.
Audio does not need broadcast traffic: it has a sender, a destination and metadata. Websockets are far more appropriate, as it is very simple to tell binary and string packets (audio & metadata) apart, which provides a very efficient way to transmit and separate the two, and the traffic only occurs between the sender and receiver nodes rather than across a whole MQTT network.
Also, because only the command sentence is sent after KW authorisation, it’s very likely no encryption is needed, or at least encryption becomes a choice, making for an even lighter audio transmission system.
Many room/zone systems can then work over fairly low-bandwidth WiFi/ethernet, with transmission separate from MQTT. It’s pretty essential for any control protocol to be lightweight & encrypted, and filling it with unnecessary audio binaries creates huge load, whilst there is no need for MQTT on a microphone/audio satellite at all, much like how most commercial RTP protocols work. That way MQTT can remain a lightweight, secure control protocol.
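
The binary/string split mentioned above can be sketched as a small receiver. This is a minimal illustration under assumptions: it relies on the common WebSocket library convention that binary frames arrive as `bytes` and text frames as `str`, it assumes metadata is sent as JSON text, and the class name `UtteranceReceiver` and the `site_id` field are hypothetical rather than part of any existing Rhasspy API.

```python
import json
from typing import Any, Dict, Union


class UtteranceReceiver:
    """Collects one utterance from a point-to-point WebSocket connection."""

    def __init__(self) -> None:
        self.audio = bytearray()
        self.metadata: Dict[str, Any] = {}

    def handle_frame(self, frame: Union[bytes, bytearray, str]) -> None:
        if isinstance(frame, (bytes, bytearray)):
            # Binary frame: raw (or compressed) audio, appended untouched.
            self.audio.extend(frame)
        else:
            # Text frame: small JSON metadata such as site/zone or sample rate.
            self.metadata.update(json.loads(frame))
```

No topic matching or payload inspection is needed to separate the two, and only the two endpoints of the connection ever see the audio.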

romkabouter · November 28, 2021, 8:44am

That is incorrect. I have actually stated myself on various occasions that the firmware is nothing more than a remote microphone.
Indeed it does not give better results than a single cheap mike, but that is also not what I have ever advertised in any way.
The firmware offers a way to have a mike away from the server hardware, without the server needing any sound input.
If that is pointless in your opinion, then that is what it is, your opinion.
I also created it for myself in the first place, because my server is hidden in a closet somewhere and I wanted a remote mike. But I saw interest from others for exactly that reason so opened it up on Github.
If you or someone else can provide good AEC and NS for my firmware, I would very much appreciate that because I have about zero knowledge about that audio processing stuff.

For MQTT and audio I agree that it is not ideal, but you really should blame Snips (and now Sonos).
Rhasspy did nothing more than jump in the hole when Snips was sold. Also, MQTT is not a control protocol; it is a message protocol for a large variety of use cases.
And basically, if you have Rhasspy set up with Raspberry Pis or a NUC, you would never have to stream audio over the network. Everything can be local, and MQTT audio is not needed for that at all.
The firmware I wrote does sadly not support local KWS atm, so the only way to achieve a remote mike system was to connect an audio stream. I already had that working for Snips via MQTT so it was quickly done for Rhasspy.
I found other features more important than to find another way for audio broadcast. And hopefully I can find some time in the future to do local KWS.
That will result in a very short time of audio broadcasting, so that is why I think that this whole MQTT audio issue is blown very much out of proportion by you.
I had a branch for websockets audio as well, but stopped working on it due to time issues.

If you really need AEC and NS, which I can very much understand, then do not use the ESP32 firmware. But that is true for every mike you blindly connect to Rhasspy, because a simple USB mike does no AEC or NS either.

So can we please leave this pointless discussion behind us and spend our energy in a positive way?

rolyan_trauts · November 28, 2021, 9:25am

This is exactly what I am telling you, as AEC & NS have been part of the ESP32 software stack for 4 years.

Like MQTT, it’s the manner in which you are using it, as with many AEC implementations the audio in & out need to be critically in sync for it to work.
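
To show why that sync matters, here is a toy sketch using a normalised LMS (NLMS) adaptive filter. NLMS is a generic textbook echo canceller chosen for illustration; it is an assumption, not Espressif's AEC algorithm, and the function names are hypothetical. The filter can only cancel echo that falls within its tap window relative to the playback reference.

```python
from typing import List, Sequence


def delayed_echo(ref: Sequence[float], delay: int, gain: float = 0.6) -> List[float]:
    """Simulate the mic picking up the speaker output `delay` samples late."""
    return [gain * ref[n - delay] if n - delay >= 0 else 0.0 for n in range(len(ref))]


def nlms_echo_cancel(mic: Sequence[float], ref: Sequence[float],
                     taps: int = 4, mu: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Subtract an adaptively estimated echo of `ref` from `mic`."""
    w = [0.0] * taps
    residual = []
    for n in range(len(mic)):
        # Newest-first window of playback samples, zero-padded at the start.
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_estimate
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        residual.append(e)
    return residual
```

With a 4-tap filter, an echo delayed by 2 samples is cancelled almost completely once the filter adapts, whereas the same echo delayed by 10 samples (capture and playback out of sync beyond the filter window) passes through essentially untouched.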

That is with the older, lesser-powered ESP32, on which it supposedly wasn’t that great, like all the AEC & NS software we have for any platform unless using dedicated and expensive hardware.
The ESP32-S3 has an update of these libs and also more power, so that dedicated and expensive hardware might no longer be our only option.

Multiple remote mics become viable via local KWS, which has also been able to run on the ESP32 for several years but gets a major boost from the vector math and improved performance of the S3.

So on all other platforms we are stuck in the exact same position of expensive hardware, whilst with the introduction of the S3 we also gain BSS and a vendor who is actively supporting low-cost solutions for this very specific area with its ESP32-S3.

The pointless discussion is the continuation of the cul-de-sac that is Snips and their methods, as Snips no longer exists, is not available, and the remnants we have left are extremely poor in performance.
There is no way I can be positive about input that is going to purely ostrich away from audio processing, the essential start of a voice AI system, because of zero knowledge, and that continually drags the community down that cul-de-sac.

We are nearing low-cost microphone hardware that is actually of use, and it’s time to be positive and discuss the best ways forward to implement that.

romkabouter · November 28, 2021, 12:00pm

I know that, but I do not have the knowledge to implement it at the moment. If you have a good repo I can use as an example, I will check that out. I hope that with the S3 and the code examples I can figure those out.

rolyan_trauts · November 28, 2021, 1:30pm

It has been published by Espressif for several years (https://github.com/espressif/esp-sr/blob/master/docs/audio_front_end/README.md) and I have posted it on here before.

It mainly includes AEC (Acoustic Echo Cancellation), BSS (Blind Source Separation), NS (Noise Suppression). ESP-SR encapsulates the above algorithms into simple APIs. Without understanding details of these algorithms, developers only need to arrange and combine the algorithms that need to be used for specific application scenarios, and input the voice data to be processed according to the API format requirements and can get the results.

Which is part of https://github.com/espressif/esp-sr but the whole repo for the esp-box has been published.

Interestingly, the mics are analogue and may well be unidirectional, with an integrated ADC: http://www.everest-semi.com/pdf/ES7210%20PB.pdf

WakeNet strangely has opted for an in-house engine and charges for custom wake words, whilst TensorFlow embedded has been posted on this forum several times in allegedly long-winded posts.

https://github.com/espressif/esp-sr/blob/master/docs/wake_word_engine/ESP_Wake_Words_Customization.md

#Espressif Speech Wake-up Solution Customization Process
---

#### 1.1 Speech Wake Word Customization Process
Espressif provides users with the offline wake word customization service, which allows users to use both publicly available wake words (such as "Hi Lexin", "Alexa", and "Espressif") and customized wake words.

 1. If you want to use publicly available wake words for commercial use
	- Please check the wake words provided in [esp-sr](https://github.com/espressif/esp-sr);
	- We will continue to provide more and more wake words that are free for commercial use.

 2. If you want to use custom wake words, we can also provide the offline wake word customization service.
	- If you provide a training corpus 
		- It must consist of at least 20,000 qualified corpus entries (see the section below for detailed requirements);
		- It will take two to three weeks for Espressif to train and optimize the corpus after the hardware design meets our requirement;
		- It will be delivered in a static library of wake word;
		- Espressif will charge training fees based on the scale of your production.
		
	- Otherwise
		- Espressif will collect and provide all the training corpus;
		- Espressif will deliver a static library file of successfully trained wake word to you, but won't share the corpus;


It’s all been posted before, but it was a tight squeeze even on the 8 MB PSRAM ESP32.

The LX7 with the vector instruction set (NEON-like SIMD) should be able to greatly accelerate ML models, so that more accurate, complex models can be used, like much of the work I posted with https://github.com/google-research/google-research/tree/master/kws_streaming

I presume they are using unidirectional microphones on the ADC, as all I2S/PDM mics seem to be omnidirectional, but there must be a rationale for the ADC route rather than a lower-cost I2S/PDM route.