The ESP32 could well function as a duplex device; RAM for ring buffers and resources generally are pretty limited, but quite a lot is possible.
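As a rough back-of-envelope sketch (my own numbers, not measured on hardware), the ring buffers themselves only need tens of KB even for full duplex, which is why it fits:

# Rough sketch: RAM needed for full-duplex audio ring buffers.
# Assumed figures (mine, not from the post): 16 kHz, 16-bit mono,
# half a second of buffering in each direction.
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2
BUFFER_SECONDS = 0.5

one_direction = int(SAMPLE_RATE * BYTES_PER_SAMPLE * BUFFER_SECONDS)
print(f"{one_direction / 1024:.0f} KiB per direction, {2 * one_direction / 1024:.0f} KiB duplex")
# -> 16 KiB per direction, 32 KiB duplex, which leaves plenty of headroom on a Wrover.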
Also, there are a few different types of ESP32, but the Wrover version is often just a tad more expensive, has more RAM and is probably the more likely choice.
There are some dev boards loaded with mics and codecs, but really they are just a Wrover mounted with ancillaries at approx Pi Zero price or above, and probably a bit pointless as an I2S mic and amp are really dirt cheap now.
I really like what Atomic did in the above vid and have been banging on for a while about a TensorFlow, Keras or PyTorch KWS, as for some reason the Precise implementation seems to be very heavy.
Be wary of the Google command set and the old adage of “Garbage in / Garbage out” as this seems very true of models.
I was using the Linto HMG with the “visualise” keyword from the Google Command set v2.0.
The GUI is just a handy tool as it shows false positives/negatives with an easy button to play them, and this led me to realise how bad many of the samples in the Google Command set are.
Just really simple stuff: badly cut words, very bad recordings or pronunciation, which I had presumed would have already been trimmed from the dataset.
Not so: approx 10% is bad, and if you take the time to run through and delete the bad samples your overall accuracy will skyrocket.
Adding a few of your own recordings, pitch shifting slightly, trimming and normalising with a touch of variation, and adding background noise will create a quantity of samples that will also greatly increase accuracy.
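For anyone wanting to script that, here is a minimal sketch of the augmentation described above using librosa/soundfile (the library choice, folder names and mix levels are mine, not from any particular recipe):

# Sketch: trim, slight pitch shift, normalise and add background noise
# to your own keyword recordings. Paths and levels are illustrative only.
import glob, random
import librosa
import numpy as np
import soundfile as sf

SR = 16000
noises = [librosa.load(p, sr=SR)[0] for p in glob.glob("background_noise/*.wav")]

for path in glob.glob("my_recordings/*.wav"):
    y, _ = librosa.load(path, sr=SR)
    y, _ = librosa.effects.trim(y, top_db=30)                                 # cut leading/trailing silence
    y = librosa.effects.pitch_shift(y, sr=SR, n_steps=random.uniform(-1, 1))  # slight pitch shift
    y = y / (np.max(np.abs(y)) + 1e-9) * 0.9                                  # normalise with a little headroom
    noise = np.resize(random.choice(noises), len(y))                          # loop/crop noise to match length
    y = y + 0.05 * noise                                                      # add quiet background noise
    sf.write(path.replace(".wav", "_aug.wav"), y, SR)

Run it over a handful of your own recordings and you get a decent quantity of varied samples without much effort.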
Using a distinct three-phone/syllable word like ‘visualise’ helps a lot, but it doesn’t have a snappy name like Marvin.
The ESP32 does have an AMR-WB encoder (or plain WAV), but quite a choice of decoders.
I really like what Atomic has done, as if you could include the keyword hit score in the audio stream you could broadcast from keyword hit to silence and use the hit-score metadata to pick the best mic signal from an array of mics, which is far better than just an RMS target.
The $5 ESP32s are exceptionally cheap, don’t need an SD card to program, and there are also models with a U.FL antenna connector that can greatly improve on the signal level of the on-board antenna types.
There is no need for far field if you can cheaply place a distributed array, which is an extremely beneficial option whilst we still lack Linux open-source beamforming.
Picking the nearest mic on the strongest keyword hit doesn’t need any fancy algs.
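As a sketch of what I mean (the message plumbing is invented for illustration, not an existing Rhasspy API): each satellite reports its hit score, and whatever is coordinating just takes the max within a short window:

# Pick the mic with the strongest keyword hit.
# Site IDs, scores and the half-second window are illustrative assumptions.
import time

WINDOW = 0.5   # seconds to wait for other mics to report the same keyword
hits = {}      # site_id -> (score, timestamp)

def on_keyword_hit(site_id, score):
    hits[site_id] = (score, time.monotonic())

def pick_best_site():
    now = time.monotonic()
    recent = {s: sc for s, (sc, t) in hits.items() if now - t < WINDOW}
    return max(recent, key=recent.get) if recent else None

# then only accept the audio stream from pick_best_site() and ignore the rest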
The Raspberry Pi is the same: for audio in you can just wire up 2x I2S mics very simply to the GPIO, but the ESP32 Wrover is actually much cheaper than a Pi Zero.
Just one update.
I did have it compile under a Windows VS Code + PlatformIO installation with the latest code atomic14 made available.
One blocking point was also the mic interference I was getting because I ran the wires below the board; it was picking up the WiFi radio interference. After I changed that, the wake word detection rate increased a lot.
Also wired the output board (MAX98357) to the speaker. It works great.
Not to derail the thread: I did think of that option, but after watching the video I shared, I loved the way it was laid out and explained, and just for that it made my day!
I’ll stick to this path till the end. The solution you pointed to can be another project or branch, as there are many branches in the Rhasspy project.
Although I can see that having, say, 5 satellites streaming all the time to Rhasspy could have an impact on functionality and make it not work.
The wake word detection on the ESP32 is the way I want to go. (Done!)
Passing the audio stream to Rhasspy is the next step.
Keeping traffic low and under control makes a tidy network.
(I know what the reader is thinking: yes, we can implement VLANs…)
You don’t need to stream all the time as you can use VAD to stream from KW hit till silence.
This also gives the option of using MQTT or another port to send the KW hit level to accompany the stream.
That way you can have distributed mics that do not broadcast all the time, and you can encode to AMR-WB to reduce bandwidth.
I wouldn’t want constantly streaming mics, but I have no problem with a sentence from KW hit till VAD silence or timeout.
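Something like this is what I have in mind, sketched in Python for clarity rather than ESP32 C (webrtcvad + paho-mqtt, with the topic names, frame source and hit score all placeholders rather than a real protocol):

# Stream from KW hit until VAD silence, publishing the hit level over MQTT.
# paho-mqtt 1.x style client; frames must be 10/20/30 ms of 16-bit mono PCM.
import webrtcvad
import paho.mqtt.client as mqtt

SR = 16000
FRAME_MS = 30
FRAME_BYTES = SR * 2 * FRAME_MS // 1000   # 960 bytes per 30 ms frame
SILENCE_FRAMES = 20                       # ~600 ms of silence ends the utterance

vad = webrtcvad.Vad(2)
client = mqtt.Client()
client.connect("192.168.1.108", 1883)

def stream_utterance(frames, site_id, hit_score):
    """frames: iterator yielding FRAME_BYTES-sized PCM chunks captured after a KW hit."""
    client.publish(f"kws/{site_id}/hit", str(hit_score))   # hit level for best-mic selection
    silent = 0
    for frame in frames:
        client.publish(f"kws/{site_id}/audio", frame)      # or AMR-WB encode first to save bandwidth
        silent = 0 if vad.is_speech(frame, SR) else silent + 1
        if silent >= SILENCE_FRAMES:
            break                                          # stop broadcasting at silence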
I have a hunch you might see another instalment from Atomic for a KWS on the ESP32.
PS: if anyone is interested in setting up a GitHub repo and discussing some ideas, let me know.
I think Atomic, since he kick-started this, should be given the chance to opt in, maybe lead.
Personally I think there are some good reasons to focus on the AI Thinker A1S, as it’s a Wrover with an AC101 codec built in.
The all-in-one dev kit is available for £14 but the actual A1S module is less than $5.
The dev board does work with the ESP32 ADF and it can be found here.
You just have to download the toolchain, set the ADF path to this directory and the IDF path to the one contained within it.
If you combine it with a single good-quality unidirectional mic you can provide extremely cheap, high-quality distributed mics.
The ADF contains an AMR-WB encoder, and I have been playing with the ALC but it doesn’t seem to work; still, many heads make short work.
I am just using the dev board for now but the idea of a simple cheap custom board for the A1S module is enticing.
Hi, I made a simple INMP441 device based on ESP32-Rhasspy-Satellite (https://github.com/Romkabouter/ESP32-Rhasspy-Satellite/issues/52) for my ESP32-CAM. I can correctly hear the voice (“alexa”) that I have collected with a script querying MQTT, but the wake word is not recognised by Porcupine on Rhasspy (2.5.9). Any ideas?
This is my settings.ini of ESP32-Rhasspy-Satellite:
[General]
hostname=192.168.1.108
deployhost=192.168.1.108
siteId=satellite
;supported: M5ATOMECHO=0, MATRIXVOICE=1, AUDIOKIT=2, INMP441=3
device_type=INMP441
[Wifi]
ssid=xxx
password=xxx
;uncomment next 4 lines and fill with your own values if you want to use a static ip address
ip=192.168.1.23
gateway=192.168.1.1
subnet=255.255.255.0
dns1=192.168.1.1
;optional: second dns server
;dns2=192.168.xxx.xxx
[OTA]
;supported: upload, ota, matrix
;-upload: device should be attached to computer via usb
;-ota: will use espota
;-matrix: matrix voice should be attached to a raspberry pi with matrix software.
; deployhost should be set to ip of the pi
method=upload
password=OTApassword
port=3232
[MQTT]
hostname=192.168.1.108
port=1883
username=
password=
I have this error:
[DEBUG:2021-04-03 22:59:21,748] rhasspywake_porcupine_hermes: Enabled
[ERROR:2021-04-03 22:59:21,729] rhasspyserver_hermes:
Traceback (most recent call last):
File "/usr/lib/rhasspy/.venv/lib/python3.7/site-packages/quart/app.py", line 1821, in full_dispatch_request
result = await self.dispatch_request(request_context)
File "/usr/lib/rhasspy/.venv/lib/python3.7/site-packages/quart/app.py", line 1869, in dispatch_request
return await handler(**request_.view_args)
File "/usr/lib/rhasspy/rhasspy-server-hermes/rhasspyserver_hermes/__main__.py", line 923, in api_listen_for_command
async for response in core.publish_wait(handle_intent(), [], message_types):
File "/usr/lib/rhasspy/rhasspy-server-hermes/rhasspyserver_hermes/__init__.py", line 985, in publish_wait
result_awaitable, timeout=timeout_seconds
File "/usr/lib/python3.7/asyncio/tasks.py", line 423, in wait_for
raise futures.TimeoutError()
concurrent.futures._base.TimeoutError
[DEBUG:2021-04-03 22:59:21,752] rhasspyasr_kaldi_hermes: <- AsrStopListening(site_id='default', session_id='default-default-aa04462f-f0cf-40b0-97e7-8a199ee4884c')
[DEBUG:2021-04-03 22:59:21,755] rhasspyasr_kaldi_hermes: -> AsrRecordingFinished(site_id='default', session_id='default-default-aa04462f-f0cf-40b0-97e7-8a199ee4884c')
I believe the microphone on the Rhasspy instance should be disabled (not set to Hermes). The way it is right now, it expects to get mic data for the site name of the Rhasspy instance rather than reading from “satellite”. Once you disable the mic in the Rhasspy instance it will read from the satellite siteId. Also note that I have observed that a Rhasspy instance can only do wake word detection for one audio source.
The device_type should be 3, not INMP441, but apparently that works too.
The mic should be set to Hermes, but your server siteId should be something other than “satellite”, is it?
If you have one device, you can set the server siteId to satellite and remove all satellite IDs from the input boxes.
Glad you got it sorted. Hopefully you will share/publish your INMP441 work.
That is not accurate. The default satellite configuration is meant for a main and a satellite where both are raspberry pis, each running its own wake detection engine locally. I believe it works fine that way.
When the satellite is an ESP32 which doesn’t do its own wake detection you will need a rhasspy raspi instance to do detection for the ESP32. Each rhasspy instance can only do one wake detection process at a time so you can’t have a main do both its own audio processing and one for the satellite. This is why you need to disable the processing of audio on the main and set it to process for the satellite by adding the satellite name. Perhaps having both use the same site id will also work, I have not tried that.
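If I remember the 2.5.x profile layout right (please check against the Rhasspy docs rather than trusting this), the main instance ends up with something roughly like the snippet below in profile.json, with its own mic disabled and the satellite siteId added to the services it should handle:

"microphone": {
  "system": "dummy"
},
"wake": {
  "system": "porcupine",
  "satellite_site_ids": "satellite"
},
"speech_to_text": {
  "satellite_site_ids": "satellite"
}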
The reason why one rhasspy instance can only do one wake detection at a time seems to be historical. It is not a bug. I believe synesthesiam mentioned that he would want to refactor that someday. It probably needs an architectural change.
You can on the ESP32, but the models you can run are limited; from what I have seen, though, much of the poor performance is due to poor training choices.
Many seem to use the Google command set, which really is a benchmark dataset and deliberately contains wide universal variation, with approx 10% bad samples.
It needs to be that way, as otherwise all state-of-the-art KWS would be registering 100% and it would be of little use as an accuracy benchmark.
I have learnt some tricks from the Google-KWS repo https://github.com/google-research/google-research/tree/master/kws_streaming, which adds a third classification alongside KW & !KW that they call ‘silence’ and which is just background noise. That really helps accuracy, as many low-volume signals that could otherwise be a lottery between KW & !KW have another classification to fall into, and that simple addition increases accuracy.
I thought it was some sort of VAD/SAD implementation but it is a sort of catch-all for anything not KW or !KW.
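To make that concrete, here is a toy sketch of what a three-class head looks like in Keras (my own minimal layout, not the google-research model):

# Toy 3-class KWS head: keyword / not-keyword / silence.
# Input shape and layer sizes are placeholders, not the Google-KWS models.
import tensorflow as tf

NUM_FRAMES, NUM_MFCC = 49, 40          # e.g. 40 MFCCs over a 1 s window
CLASSES = ["kw", "not_kw", "silence"]  # 'silence' = background noise only

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FRAMES, NUM_MFCC)),
    tf.keras.layers.Conv1D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(len(CLASSES), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Low-volume clips that would otherwise be a coin toss between kw and not_kw
# now have a 'silence' bucket to fall into, which is where the accuracy gain comes from.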
If you are making a custom model then use the voices that will actually use it, recorded on the device of use: the KW recordings are obviously the KW class, and some choice sentences from that ‘voice of use’, split into words, become the !KW class.
When you choose a KW it needs to be as unique as possible, ideally a three-phone/syllable word.
They also use the background_noise folder to produce the ‘silence’ category, but also mix it into the KW and !KW samples.
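A quick sketch of how the ‘silence’ clips can be generated from that folder (folder layout follows the speech_commands convention; counts and clip length are arbitrary):

# Cut random 1 s chunks out of the background-noise wavs to build the 'silence' class.
# Assumes the noise files are 16 kHz mono and longer than 1 s, as in speech_commands.
import glob, os, random
import soundfile as sf

SR, CLIP = 16000, 16000          # 1 second clips at 16 kHz
noise_files = glob.glob("_background_noise_/*.wav")
os.makedirs("silence", exist_ok=True)

for i in range(500):             # however many 'silence' examples you want
    y, sr = sf.read(random.choice(noise_files))
    start = random.randint(0, len(y) - CLIP)
    sf.write(f"silence/silence_{i:04d}.wav", y[start:start + CLIP], SR)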
Atomic did a great implementation of the code, and broadcast on KW till silence (remote VAD) should be very possible, but his dataset choice and training methods were not so good, from choice of KW to dataset.
Which is great as you could use his code but work on a better dataset and choice of KW that is ‘your voice’ on ‘your device’ in ‘your room’.
The further you stray from actual use the less accuracy you get, obviously, but from playing with the Google KWS I can get 100% validation with models that are extremely tolerant of high levels of noise and vastly surpass any KW system in Rhasspy, and much of that is purely down to training.
TensorFlow ML is under heavy development, but the MFCC front end and models are part of the Google streaming KWS. I do have a couple of ESP32s but just never got round to trying them; on a Pi3A+ the models are extremely light, running at less than 20% on a single core.
Probably, though, the new ESP32-S3, which has SIMD vector acceleration specifically for NN models, is likely to be a killer platform for satellite voice AI.
English phonetic pangrams
Pangrams which use all the phonemes, or phones, of English (rather than alphabetic characters):
“With tenure, Suzie’d have all the more leisure for yachting, but her publications are no good.” (for certain US accents and phonological analyses)
“Shaw, those twelve beige hooks are joined if I patch a young, gooey mouth.” (perfect for certain accents with the cot-caught merger)
“Are those shy Eurasian footwear, cowboy chaps, or jolly earthmoving headgear?” (perfect for certain Received Pronunciation accents)
“The beige hue on the waters of the loch impressed all, including the French queen, before she heard that symphony again, just as young Arthur wanted.” (a phonetic, not merely phonemic, pangram. It contains both nasals [m] and [ɱ] (as in ‘symphony’), the fricatives [x] (as in ‘loch’) and [ç] (as in ‘hue’), and the ‘dark L’ [ɫ] (as in ‘all’) – in other words, it contains different allophones.)
I have been trying to twist Atomic’s arm to give it another go, but failing so far.