Again I'm confused: why would you pick a microcontroller that has a near-zero maker community for a maker product?
Without having researched it, I presume XMOS does provide licensed software, likely the same that is baked into their XVF3510.
Also, from memory (which shows how long I have been saying this), using the KWS to lock the beamformer for the duration of the command sentence was still missing.
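To show what I mean by locking the beamform, here is a minimal sketch in Python. It is not anybody's actual firmware: `BeamLock`, the frame counts and the inputs (a per-frame direction-of-arrival estimate, a KWS hit flag and a voice-activity flag) are all hypothetical names and values I made up for illustration. The idea is only that once the KW fires, the beam stays on the direction the KW came from until the command ends, so a louder source cannot steal the beam mid-sentence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BeamLock:
    """Hold the beam direction fixed from wake word until the command ends."""
    hold_frames: int = 150      # assumed upper limit on command length, in frames
    silence_frames: int = 30    # assumed run of silent frames that ends a command
    locked_angle: Optional[float] = None
    _frames_left: int = 0
    _silent_run: int = 0

    def update(self, doa_angle: float, kw_detected: bool, voice_active: bool) -> float:
        """Return the angle the beamformer should steer to for this frame."""
        if self.locked_angle is None:
            if kw_detected:
                # Wake word heard: freeze the beam on the current direction.
                self.locked_angle = doa_angle
                self._frames_left = self.hold_frames
                self._silent_run = 0
            # Not locked (or just locked on this same angle): follow the estimate.
            return doa_angle

        # Locked: ignore new direction estimates until the command is over.
        self._frames_left -= 1
        self._silent_run = 0 if voice_active else self._silent_run + 1
        if self._frames_left <= 0 or self._silent_run >= self.silence_frames:
            self.locked_angle = None   # command finished, release the beam
            return doa_angle
        return self.locked_angle
```

With this, a TV at 120 degrees that gets louder after the KW was spoken from 40 degrees is simply ignored until the timeout or the trailing silence releases the lock.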
It's still the same thing: buying in knowledge, because the community lacks the DSP/ML skills to create the essential initial audio processing for a room voice microphone.
The dumbest thing on the SJ201 daughter board was to hard-solder the microphones, which makes placement and isolation a near impossible task.
Like the ESP32-S3-Box, the microphones are on a small PCB that connects by an FPC (like Pi Cam ribbons).
12 V is likely a good input voltage, as it is a good supply for an audio amplifier that is less toy-like, and a DC-to-5 V step-down is a very common circuit.
For testing, those problems do not matter, and maybe some empirical data could be provided.
The AEC on those is pretty good as it is non-linear, but that doesn't cover noise from common media such as TV, music, radio …
The above is the problem, as you have that currently with the ESP32-S3-Box, and the results you are getting are due to the lack of algorithms and DSP in the community and of anybody who is capable of steering it.
This is where my head explodes in absolute confusion, as you do have a microcontroller that is capable and does have a community, but unfortunately the skills of that community are limited.
There is no problem running a KWS on the ESP32-S3, but the chosen KWS uses a closed-source blob provided by Google and uses layers that are not supported, and it is likely the same for any microcontroller.
The ESP32-S3 can very well run a KWS, and I have said many times that likely candidates are a CNN, DS-CNN, BC-ResNet and maybe a CRNN, all documented in detail at google-research/kws_streaming/README.md at master · google-research/google-research · GitHub, with a training API that also includes tf4micro.
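As a rough illustration of why something like a DS-CNN suits a microcontroller, here is the standard parameter-count arithmetic for a normal convolution versus a depthwise-separable one. The layer sizes (3x3 kernel, 64 channels in and out) are just an example I picked, not taken from kws_streaming, and the function names are mine.

```python
def conv2d_params(k: int, c_in: int, c_out: int) -> int:
    """Weights plus biases for a standard k x k convolution."""
    return k * k * c_in * c_out + c_out


def ds_conv2d_params(k: int, c_in: int, c_out: int) -> int:
    """Weights plus biases for a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in + c_in      # one k x k filter per input channel
    pointwise = c_in * c_out + c_out     # 1x1 conv that mixes the channels
    return depthwise + pointwise


standard = conv2d_params(3, 64, 64)       # 36928
separable = ds_conv2d_params(3, 64, 64)   # 4800
print(f"standard: {standard}, separable: {separable}, "
      f"ratio: {standard / separable:.1f}x")
```

For this one layer the separable version needs roughly 7.7x fewer parameters, which is the kind of saving that matters when the model has to fit in microcontroller memory.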
Unfortunately the gimmick of custom KW was sold as a key feature, which is something even big data cannot afford, as their choice of KW is dictated by the datasets they hold.
What dscripka did is exceptionally good for quickly getting a KW in operation, but sadly no guidelines were given, and there was no option in place to collect correct KW samples and forward them as open-source data.
GitHub - dscripka/openWakeWord: An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity. is brilliant for that, but it is a rather fat KWS for many microcontrollers and actually less accurate than many with dedicated KW datasets, as mentioned above.
Swapping to another microcontroller because you lack the tech and steering skills to create a solution is still going to leave you in the same place. In fact it is even worse, as the community for support or dev is even more sparse than for the current ESP32-S3.
The ESP32-S3 is an Espressif technology demonstrator where they give you a framework, software blobs and working hardware, backed by a GitHub repo with PDF circuit diagrams, the bill of materials and even the PCB Gerber files.
Espressif does know enough to design a circuit, and has done so, and all that info has been available to you for some time…