The best bet is probably the ESP32-S3, but we are going to see a load of very capable, low-cost microcontrollers that are perfect for wireless KWS roles.
I have still made no progress, as they have released another product called the ESP32-S3-BOX-Lite, and I really don’t care about screens, output or speakers; it’s purely the input I need, to be able to run a decent KWS and a speech enhancement pipeline.
The original ESP32-S3-BOX had an analogue loopback from the DAC to sync the AEC reference on a third ADC channel, so the idea of just putting 2x I2S mics on a standard dev kit became a show-stopper.
The new Lite box has a two-channel ADC, so hopefully a software update is on the way, and supposedly you can also buy the board alone.
I am going to keep calling them ‘KWS ears’ to stress how simple the needs are. The reason I keep focusing on cost is not just the cost of a single unit for a room; it’s that a room could contain multiple units that together form a distributed microphone array.
The softmax probability from a single KWS is a good enough metric: whichever ear in the array reports the highest value is the mic used for the current ASR sentence. The ones that didn’t hear the KW are unlikely to hear the ASR sentence either, and it’s that simple, as each ‘ear’ is completely ignorant of the others’ existence.
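As a minimal sketch of that selection rule (the ear names and the 0.8 threshold are purely illustrative, not any fixed protocol):

```python
# Minimal sketch of the "best ear wins" rule. Assumes each ear that
# detected the keyword reports its KWS softmax probability; ears that
# never crossed their own KWS threshold simply never report at all.

def pick_ear(reports, threshold=0.8):
    """reports: dict of ear-id -> KW softmax probability for this trigger."""
    if not reports:
        return None
    ear, score = max(reports.items(), key=lambda kv: kv[1])
    return ear if score >= threshold else None

# Three ears in earshot of the same keyword; the unit nearest the
# speaker scores highest and supplies the audio for the ASR sentence.
print(pick_ear({"kitchen-1": 0.97, "kitchen-2": 0.82, "hall-1": 0.64}))
# -> kitchen-1
```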
Hopefully the ESP32-S3 boards will follow the same economies of scale as previous ones and maybe get as cheap as $5. You could then have two or three in each room if you wanted, as each additional mic can be placed to provide further isolation from noise sources and bring far sources nearer.
I am not on an Espressif sales pitch, but it’s the only source of free speech enhancement (AEC & BSS) I know of. I don’t like how the base libs are blobs, but hey, if another microcontroller comes along, then fine; Espressif does have a history of making extremely cost-effective wireless microcontrollers.
I don’t want the ‘KWS ears’ to be a Rhasspy, Mycroft, Sepia, Project Alice or one of a plethora of projects all doing the same thing. I just want to set up a basic ‘KWS ear’ system that is simple, interoperable with all of them, and doesn’t pander to any other project’s protocols.
I couldn’t care less about branding or ownership; it’s just a very simple websocket client/server queue that is file-based on zones and acts as a bridge to the input of any ASR.
That is the only dictate: the zone file structure on the input matches the output where the ASR text is dropped.
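To make that concrete, here is a hedged sketch of the shape I mean, assuming a zones/&lt;zone&gt;/in directory layout, Python’s websockets library, and a framing where the first text frame names the zone; none of these choices are settled, they just illustrate how thin the bridge can be:

```python
# Sketch of the zone-based bridge: a websocket server that drops whatever
# an ear sends into zones/<zone>/in/, with the ASR expected to drop its
# text into zones/<zone>/out/. The framing (first text frame names the
# zone, binary frames are raw audio) is an assumption for illustration.

import asyncio, pathlib, time
import websockets  # pip install websockets

BASE = pathlib.Path("zones")

async def handle(ws):
    zone = await ws.recv()               # e.g. "kitchen"
    in_dir = BASE / zone / "in"
    in_dir.mkdir(parents=True, exist_ok=True)
    async for msg in ws:
        if isinstance(msg, bytes):       # raw audio chunk from the ear
            (in_dir / f"{time.time_ns()}.raw").write_bytes(msg)

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()           # serve forever

asyncio.run(main())
```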
Audio is coupled by a Linux asound loopback, not some weird and wonderful protocol; there is nothing other than the websocket on the server side of the KWS bridge. It probably doesn’t even need the file system, as the current sink of a loopback is likely more than enough info.
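For anyone unfamiliar, the loopback side needs nothing exotic: with the snd-aloop kernel module loaded, whatever is played into device 0 of the Loopback card can be captured from device 1. A minimal asound.conf sketch (the pcm name asr_in is mine, not any convention):

```
# Requires the snd-aloop kernel module (modprobe snd-aloop).
# The bridge plays decoded ear audio into "asr_in"; the ASR then
# captures the same stream from hw:Loopback,1,0 like any normal mic.
pcm.asr_in {
    type plug
    slave.pcm "hw:Loopback,0,0"
}
```

The ASR then just records from hw:Loopback,1,0 as if it were an ordinary capture device.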
It’s why I have been waiting for the product, as the system will be built up from the audio source path, with the aim of always boiling down to the lowest common denominator of simplicity and interoperability, without bloat.
It’s also why I have stayed on this forum, as there are many actors here, @synesthesiam to say the least, but I find it infuriating that many projects apply their own methods purely to have ‘their’ own methods, even though the Sepia initiative seems to be trying to address this.
This is Linux, this is open source, and all the pipeline stages of VoiceAI are distinct; we should be able to partition them and give choice at every stage, as Linux and open source do.
I am somewhat critical of the Mycroft MkII, but what they have is an excellent skill server, and I wish they would concentrate on that, as I would love to tack it onto the end of my ASR of choice, and so on.
But going back to ‘KWS ears’: I think open source can do more, be better and be more cost-effective, but it’s sheer stupidity to try to copy commercial offerings verbatim, as you are likely to fail, and there could be much better ways of doing things for less. One of them is integration, reuse and interoperability.
I haven’t ruled out the Pi either, as the Zero 2 and Pi 3A+ are both great products, but until someone provides effective AudioDSP utils I have a bottleneck for even the base function of an ‘ear’. It is still a platform that easily installs network-synced multiroom audio such as AirPlay, Snapcast and, I think, Squeezelite (is it synced?). That will take considerable work to port to microcontrollers, where efforts with the original ESP32 were slightly too constrained.
The 2-mic and 4-mic HATs for the Pi are extremely cost-effective, and all that is needed is the kind of efficient code the rest of our Linux audio system runs on, which isn’t Python. It’s sort of sad, as the hardware is capable.