New 2-mic HAT on Kickstarter

donburch · January 17, 2022, 11:13am

I found another reSpeaker 2-mic HAT clone on Kickstarter with 20 days to go. The developer has replied:

Yes, audio codecs are available in the market. We built this HAT with the WM8960 IC. We tried to make the cost lower so that more people can have this product.

which leaves me feeling very disappointed. Personally, £12 plus £15 postage from the US to Australia for yet another clone using the poorly supported Seeed driver just does not excite me.

rolyan_trauts · January 17, 2022, 1:03pm

Yeah, it is much the same, Don, and irrespective of how cheap they are, they still don’t have any audio processing algs to make far-field and noise tolerance work for keyword & speech recognition.

The best-priced unit at the moment is the ESP32-S3-Box. It is not a HAT, but everything is included. You will have to change to your country for the price, but in the UK it’s £34.

Display, 2x mics, speaker + amp and in a case stand.

It’s very new and slowly getting more support.

ESP32-S3 Box - 16MB Flash, 8MB PSRAM

It does have AEC and blind source separation algs and works quite well as is, but there will likely be a few customised versions for various solutions.
I have mine and keep meaning to have a go, but yeah, it works quite well.

donburch · January 18, 2022, 10:34pm

It took many months, but I finally realised that you were not objecting to the RasPi multi-mic HATs themselves, but to the fact that they lack DSP! So I assume DSP processing could be added to the reSpeaker driver or as a module for Rhasspy? I note that the Seeed website talks about using Picovoice, Porcupine and Mycroft Precise. RasPi Zero is already slow for a satellite, and adding DSP functions has to make it slower.

I now believe that Voice Assistant means multiple satellite devices, and it doesn’t matter if they are Raspberry Pis, ESP32s or Google Minis - as long as they operate locally, and are cheap enough to put in all the main rooms.

These ESP32-S3-BOX units certainly look great and appear to have all the pieces for Voice Assistants in one box. The price you pointed to is comparable with purchasing a RasPi Zero, reSpeaker and all the pieces. But (like the reSpeaker products) it is only a “development kit” and needs software to turn its “AI Ability” into a working product. I am eagerly awaiting your feedback / opinion on these ESP32-S3-BOX devices - particularly on when the available firmware meets the “potential” of the hardware.

Are you using yours totally stand-alone, with a Rhasspy base station, or with some other system to do the STT, intent recognition and TTS? I remember you have previously raised concerns about using MQTT to transmit audio to a base station.
I understand that ESP has a different CPU so it would be a big job to port the existing Rhasspy code to run on ESP.

rolyan_trauts · January 18, 2022, 11:53pm

There is AEC (Acoustic Echo Cancellation) that runs on the Pi, but for several years now no one has managed open-source beamforming or BSS (Blind Source Separation) on the Pi.
The ESP32 is a low-cost microcontroller. As everything advances, microcontrollers are flexing the power embedded SoCs once had, and embedded SoCs like the RK3588 are sort of entry-level desktop level.
There is something about the audio algs that suits an RTOS (real-time OS) on a microcontroller: the hardware can guarantee that the reference audio and the signal audio stay in sync, which is simpler than on a non-realtime application OS such as Linux or Windows, where processes can queue and that adds more complexity.
The only algs I have seen are commercial anyway, whether it can be done or not, and the maths in them is complex, so that is prob why we haven’t had a community contribution.

Really what you want is AEC -> Beamforming -> BSS

So here is a rough explanation: AEC focuses the waveforms and also cancels the sounds being played, then beamforming merely points the mic in the right direction, but BSS can use the far signal from the AEC and the near signal from the beamformer to reduce the noise.
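
As a rough illustration of the first two stages, here is a minimal pure-Python sketch. It swaps in textbook techniques, NLMS adaptive filtering for the echo canceller and integer-sample delay-and-sum for the beamformer, and is not what Espressif, Google or Amazon actually ship. The function names, tap count and step size are all assumptions for illustration, and the BSS stage is left out entirely.

```python
def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the speaker echo from the mic signal.

    mic and ref must be sample-aligned lists of floats: ref is what was
    sent to the speaker, mic is what the microphone picked up.
    """
    weights = [0.0] * taps   # current estimate of the speaker-to-mic echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for mic_sample, ref_sample in zip(mic, ref):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate          # mic with the echo removed
        power = sum(x * x for x in history) + eps   # normalisation term
        weights = [w + mu * error * x / power for w, x in zip(weights, history)]
        residual.append(error)
    return residual


def delay_and_sum(channels, delays):
    """Advance each channel by its integer sample delay, then average.

    delays[k] is how many samples later the wanted sound arrives at mic k.
    """
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(usable)
    ]
```

Both functions only work if the reference and mic streams are sample-synchronised, which is exactly the part that is easy for hardware on an RTOS and awkward on a non-realtime OS.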

It’s likely both Google & Amazon use such a scheme, though prob they are more clever. Generally BSS is slightly better than beamforming, and you can have one or the other, but with the two combined, with beamforming providing a near & far signal for BSS, it gets even better.

On a Pi, as I said, we have not managed to get open-source realtime multi-mic beamforming & BSS, and all the systems we have are intolerant of high noise and extremely poor in comparison to those that do, such as Google & Amazon.
The ESP32-S3-BOX has some firmware already done with KWS & ASR and display and stuff and is standalone.
I am expecting Rhasspy has hit a cul-de-sac with recent announcements, but I never did like the way it handles satellites or the whole Hermes control structure. But yeah, I think the box would be much better with a base station, though currently they do work standalone.
My personal take is that all that is needed in a satellite is a KWS & a network-synced audio client such as AirPlay, Squeezelite or Snapcast, as then each satellite should be very low load and cheap.
Given the way voice actions happen in a home, the guts can all be centrally shared, be of high quality, and likely run on something more powerful than a Pi.
The Pis just don’t make good satellites or central bases: a satellite with something like Rhasspy on it needs considerable power to get good results, and as a centrally shared base station a Pi is probably not powerful enough.

If you take Mycroft’s all-in-one, which is like Rhasspy, it’s a Pi4 with a custom 2-mic AEC audio board and screen for $299.

The ESP32-S3 is a microcontroller and so fundamentally different. Espressif have written their own optimised C code for the audio front end, KWS & ASR. You couldn’t really port Rhasspy; you would prob have to start from scratch in C, and likely never manage to get it to fit in the available resources, even though optimised bespoke C code is far smaller than the libs and all the guff that comes with a full OS such as Linux.

It’s more than just being a different CPU: they are different types of devices. But yeah, they prob would make excellent satellites to a ‘base station’. For me, though, a display and all that is still overkill for what is needed. I was hoping the ESP32-S3 alone for $12 and some I2S mics would be all that is needed, but they use a 3rd channel on an ADC as a loopback to sync the ref signal, so we are stuck with the ESP32-S3-Box or their audio dev boards, which is similar to the alg problem with the Pi.

Maybe Mycroft might start selling their audio board, but likely the prices will not be that attractive for many.
With a Raspberry Pi you can make a pretty lacklustre AI for reasonable money, and for $299 with a Pi and hardware you can make something that still isn’t comparable to the recognition ability and noise handling of relatively cheap big-data voice AI. It really is not state-of-the-art, as testing and comparison will likely tell.

The ESP32-S3-Box is just interesting as a way to hit the ground running with a voice AI project, and as a target to beat.
The ESP32-S3 chips are new, but with economies of scale they will likely come down to a couple of $, as the ESP32 did.
The BSS alg Espressif did still works without the AEC, as I think their AEC is more line cancellation than true AEC.
The idea is to have really low-cost ‘Ears’ that are purely distributed network KWS feeding a central ‘base station’, which uses the best KW probability from a distributed array of Ears or a single Ear.
The base station will store the last working KW, add it to a dataset and actively retrain on usage, so the KWS learns its users, gains accuracy and ships new KWS firmware OTA.
The Espressif models for KWS are closed source, even if free, but I think I would likely use a more state-of-the-art streaming KWS, like the open-source one Google Research demonstrates, which allows users to create custom KWs with better accuracy.
The base station would use HDMI-CEC, as often there is already a far better display and speaker system in the room.
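
The “best KW probability” selection mentioned above could be sketched roughly like this. It is only an illustration of the idea, not existing code from any project: the `Detection` record, the half-second window and the 0.6 threshold are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One keyword hit reported by an Ear (hypothetical message shape)."""
    ear_id: str
    probability: float   # KWS confidence, 0.0 to 1.0
    timestamp: float     # seconds, on a clock shared by all Ears


def pick_best_ear(detections, window_s=0.5, threshold=0.6):
    """Return the most confident detection of one utterance, or None.

    Detections within window_s of the earliest report are treated as the
    same utterance heard by several Ears; later ones are ignored here.
    """
    if not detections:
        return None
    first = min(d.timestamp for d in detections)
    candidates = [
        d for d in detections
        if d.timestamp - first <= window_s and d.probability >= threshold
    ]
    return max(candidates, key=lambda d: d.probability, default=None)
```

The base station would then open the audio stream only from the winning Ear, and could keep the winning clip for the retraining dataset.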

State-of-the-art ASR is likely what is incorporated in the new Pixel 6 phones, as it is offline and runs on a tensor core of 4 TOPS at just 2 watts. Development in this arena is so fantastically rapid that it’s hard to keep up, but sadly much of what we have in open source is being left behind or consists of relatively lesser copies.

donburch · January 22, 2022, 12:36am

Agree 100%.

Rolyan, are you using the current version of Rhasspy in Base + Satellite configuration? It sounds as though you are thinking of an old version which only ran stand-alone.

Huh ? If your Base station is a powerful server to give the processing power and audio quality you expect, then surely you don’t want a bulky noisy server sitting next to your living room TV. Home Assistant OS certainly seems to be taking the approach that devices should connect through the LAN, which allows the server to sit in a comms cabinet or back room.

My limited experience of 3 Satellites is that RasPi Zero is too slow to recommend, but RasPi 3A+ and 3B work OK with reSpeaker HATs and Rhasspy (microphone, speaker, LAN and Porcupine wakeword modules only). I doubt that using RasPi 4 for a Satellite would be worth the extra cost. There are other boards with DSP - but at a much higher price. ESP32-S3 with DSP sounds much better than any of the current options - I can’t wait!

For Base station, my RasPi 4 (4GB) runs Home Assistant OS with MQTT and Rhasspy add-on (using STT, intent recognition and TTS modules only). I only have a 2-bedroom apartment with few devices connected to HA, and performance is OK. More important is knowing that I can easily upgrade to a NUC or server.

From your years of audio industry experience, you probably expect a much higher audio quality than my “OK”. Let’s face it - no local solution will ever be able to apply enough processing power to compete with the results from Amazon/Google/etc. But that is not the objective of Rhasspy.

I see current Rhasspy as a modular framework with multiple options for each stage of voice processing. And Raspberry Pi is just a convenient low cost way for people to start with Home Assistant and Rhasspy. I expect that if/when ESP32-S3 is interfaced to Rhasspy (using MQTT or another protocol) it will quickly become the most recommended Satellite hardware.

I have absolutely no experience with ESP hardware, so a lot of the second half of your post went over my head. ESP32-S3-BOX has a nice little demo, but there’s a huge gap between turning one LED on and off, and running a whole house. Of course it is a development kit, meaning it is up to others to develop the software. Waiting to see how ESP32-S3 progresses…

And so we get back to the same basic issue as DSP on RasPi & reSpeaker. Software is the key.

Sounds great, but (a) is this available now as open source, and (b) does it run locally - or does it lock us into dependency on Google’s massive cloud processing?

At the same time, this is an opportunity to re-consider what might be a better control structure suitable for ESP32 satellites, and how it can be implemented.

rolyan_trauts · January 22, 2022, 5:10am

I am not using any, as I think the satellite option is a bad method, the KWS we have are not much better, and on top of that the audio processing algs are missing.

That is just it: from NUCs and Ryzens to Arm there is a huge array of small-format, very quiet, very powerful PCs. HDMI-CEC would be a major consideration for me, but no noisy bulky server is needed.

Yes, yes and no.
Even Google has gone offline with its ASR for its new Pixel phones, and there are low-cost SBCs with NPUs and a whole rake of newer toolkits.

Likely websockets, as they give a really easy way to tell string and binary data apart in a message flow, so text control and binary mic audio are easily separated. It is also a 1-1 connection and not broadcast traffic.
I don’t really like the term satellites, as they should be nothing more than distributed network KWS; nothing else is needed.
They should be system-agnostic and a common HMI, like a keyboard, mouse, webcam or whatever, and prob follow a commercial scheme that becomes an open protocol, as AudioRTP sort of has.
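
To show why websockets make that split easy: the protocol marks every message as either text or binary, and Python websocket libraries typically hand those over as `str` and `bytes` respectively. A minimal sketch of the receiving side, with no network code, might look like this. The `EarSession` class and the JSON control format are assumptions for illustration, not an existing protocol.

```python
import json


class EarSession:
    """Collects what one Ear sends over its 1-1 connection (hypothetical)."""

    def __init__(self):
        self.audio = bytearray()   # raw mic audio from binary messages
        self.events = []           # decoded control messages from text messages

    def handle(self, message):
        """Route one incoming websocket message by its type."""
        if isinstance(message, (bytes, bytearray)):
            # Binary message: mic audio, appended as-is.
            self.audio.extend(message)
        elif isinstance(message, str):
            # Text message: control data, assumed here to be JSON.
            self.events.append(json.loads(message))
        else:
            raise TypeError(f"unexpected message type: {type(message).__name__}")
```

In a real server, `handle` would be called for each message received on the connection, so control text and mic audio never need a custom framing scheme to tell them apart.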

Things are changing so fast and there are so many options that there is much more to re-consider than the control structure for an ESP32-S3 KWS, but for me I have gone back to a Google Nest Audio as it just works so much better for now.