The ESP32 could well function as a duplex device; the RAM needed for ring buffers and resources in general are pretty tight, but quite a lot is possible.
Also there are a few different types of ESP32; the WROVER version is often just a tad more expensive, but it has more RAM (PSRAM) and is probably the more likely choice.
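As a rough worked example of those RAM figures (the buffer length and audio format below are my assumptions, just to show the scale):

```python
# 16 kHz, 16-bit mono audio -- a common KWS capture format
sample_rate = 16000        # samples per second
bytes_per_sample = 2       # 16-bit PCM
seconds_buffered = 2       # e.g. a couple of seconds of ring buffer each way

one_way = sample_rate * bytes_per_sample * seconds_buffered   # 64,000 bytes
duplex = 2 * one_way                                          # 128,000 bytes
print(f"{duplex / 1024:.0f} KiB")
# ~125 KiB: tight once the rest of the app sits in a WROOM's internal SRAM,
# trivial next to the WROVER's extra 4-8 MB of PSRAM
```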
There are some dev boards loaded with mics and codecs, but really they are just a WROVER mounted with ancillaries at approx Pi Zero price or above, and probably a bit pointless as an I2S mic and amp are really dirt cheap now.
I really like what Atomic did in the above vid and have been banging on for a while about a TensorFlow, Keras or PyTorch KWS, as for some reason the Precise implementation seems to be very heavy.
Be wary of the Google command set and the old adage of “garbage in, garbage out”, as this seems very true of models.
I was using the LinTO HMG with the “visualise” keyword from the Google Command set v2.0.
The GUI is just a handy tool, as it shows false positives/negatives with an easy button to play them back, and this led me to realise how bad many of the samples in the Google Command set are.
Just really simple stuff: badly cut words, very bad recordings or pronunciation, which I had presumed would have already been trimmed from the dataset.
Not so: approx 10% is bad, and if you take the time to run through and delete the bad samples your overall accuracy will skyrocket.
Adding a few of your own recordings, pitch shifting slightly, trimming and normalising with a touch of variation, and mixing in background noise will create a quantity of samples that also greatly increases accuracy (quick sketch below).
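A minimal sketch of that sort of augmentation, assuming librosa and soundfile are installed; the kw/ and kw_aug/ folders and background.wav noise file are just hypothetical names:

```python
import glob, os, random
import librosa
import numpy as np
import soundfile as sf

SR = 16000  # sample rate used by most KWS datasets

def augment(path, out_path, noise):
    y, _ = librosa.load(path, sr=SR)
    # small random pitch shift (in semitones)
    y = librosa.effects.pitch_shift(y, sr=SR, n_steps=random.uniform(-1.0, 1.0))
    # trim leading/trailing silence
    y, _ = librosa.effects.trim(y, top_db=30)
    # normalise to a peak target with a touch of variation
    y = y / (np.max(np.abs(y)) + 1e-9) * random.uniform(0.7, 0.95)
    # mix in a random slice of background noise at a low level
    start = random.randint(0, max(0, len(noise) - len(y)))
    seg = noise[start:start + len(y)]
    if len(seg) < len(y):
        seg = np.pad(seg, (0, len(y) - len(seg)))
    y = y + seg * random.uniform(0.02, 0.1)
    sf.write(out_path, y, SR)

noise, _ = librosa.load("background.wav", sr=SR)   # hypothetical noise file
os.makedirs("kw_aug", exist_ok=True)
for i, f in enumerate(glob.glob("kw/*.wav")):      # hypothetical keyword folder
    augment(f, f"kw_aug/{i}.wav", noise)
```

Run over a cleaned keyword folder a couple of times with different seeds and you quickly multiply the number of usable samples.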
Using a distinct 3-phoneme word like “visualise” helps a lot, but it isn’t a snappy name like “Marvin”.
The ESP32 does have an AMR-WB encoder (or plain WAV), plus quite a choice of decoders.
I really like what Atomic has done, because if you could include the keyword hit score in the audio stream, you could broadcast from keyword hit to silence and use the hit-score metadata to pick the best mic signal from an array of mics, which is far better than just an RMS target.
The $5 ESP32 boards are exceptionally cheap, don’t need an SD card to program, and there are also models with a U.FL antenna connector, which can greatly help with signal level compared to the on-board antenna types.
There is no far field if you can cheaply place a distributed array, which is an extremely beneficial option whilst we still lack open-source beamforming on Linux.
Picking the nearest mic based on the strongest keyword hit doesn’t need any fancy algorithms (a sketch follows).
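A minimal sketch of that selection, assuming each node reports a keyword hit score alongside its stream; the node/score structure here is hypothetical, not from any existing protocol:

```python
from dataclasses import dataclass

@dataclass
class KeywordHit:
    node_id: str      # which mic/ESP32 reported the hit
    score: float      # keyword confidence from the KWS model
    rms: float        # signal level, kept only as a tie-breaker

def pick_best_mic(hits, min_score=0.6):
    """Return the node whose keyword hit score is highest.

    Confidence in the keyword itself beats raw level, so RMS only
    breaks near-ties between nodes with similar scores.
    """
    candidates = [h for h in hits if h.score >= min_score]
    if not candidates:
        return None
    return max(candidates, key=lambda h: (round(h.score, 2), h.rms)).node_id

# e.g. three nodes heard the same keyword; the kitchen mic wins on score
hits = [KeywordHit("kitchen", 0.91, 0.12),
        KeywordHit("hall", 0.74, 0.30),
        KeywordHit("lounge", 0.62, 0.05)]
print(pick_best_mic(hits))   # -> "kitchen"
```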
The Raspberry Pi is the same: for audio in you can just wire up 2x I2S mics very simply to the GPIO, but the ESP32 WROVER is actually much cheaper than a Pi Zero.
I did notice this design by the InvenSense engineers: https://invensense.tdk.com/wp-content/uploads/2015/02/Low-Noise-Directional-Studio-Microphone-Reference-Design1.pdf
Ignore the array and think of it as two mics forming an I2S pair; I presume we could do what they did digitally by just subtracting the value of one mic from the delayed value in the other mic’s buffer, but I still have to try it (rough sketch after the links).
https://invensense.tdk.com/wp-content/uploads/2015/02/Microphone-Array-Beamforming.pdf
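A rough sketch of that delay-and-subtract idea, purely as an assumption of how it might be tried; the delay length and spacing numbers are my guesses for illustration, not values from the InvenSense notes:

```python
import numpy as np

def delay_and_subtract(front, rear, delay_samples=4):
    """Differential (end-fire) pair: subtract a delayed copy of the rear
    mic from the front mic, which nulls sound arriving from behind.

    front, rear: 1-D sample arrays from the two I2S mics.
    delay_samples: mic spacing / speed of sound, in samples -- e.g.
    ~21 mm spacing at 16 kHz is roughly 1 sample; 4 is only a
    placeholder to make the toy example obvious.
    """
    delayed_rear = np.concatenate([np.zeros(delay_samples), rear[:-delay_samples]])
    return front - delayed_rear

# toy usage: a tone arriving from behind -- the rear mic hears it first,
# the front mic hears the same tone delay_samples later, so it cancels
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440 * t)
rear = tone
front = np.concatenate([np.zeros(4), tone[:-4]])
out = delay_and_subtract(front, rear, delay_samples=4)
print(np.max(np.abs(out)))   # ~0: rear-arriving sound is nulled
```

Whether that holds up with real mic mismatch and room reflections is exactly the “still have to try” part.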