DIY Alexa on ESP32 with INMP441

PS: is anyone interested in setting up a GitHub repo and discussing some ideas?

I think Atomic, since he kick-started this, should be given the chance to opt in, and maybe lead.

Personally I think there are some good reasons to focus on the AI Thinker A1S, as it's a Wrover with an AC101 codec built in.

The all-in-one dev kit is available for £14, but the actual A1S module is less than $5.
The dev board does work with the ESP32 ADF and it can be found here.

You just have to download the toolchain, set the ADF path to this directory and the IDF path to the one contained inside it.
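Roughly, that setup looks like this (a sketch only; the paths assume the ADF was unpacked to ~/esp-adf, and the xtensa toolchain still needs to be on your PATH):

# illustrative paths: ADF checkout with the IDF it bundles
export ADF_PATH=~/esp-adf
export IDF_PATH=$ADF_PATH/esp-idf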

If you combine it with a single good-quality unidirectional mic, you can build extremely cheap, high-quality distributed mics.

The ADF includes an AMR-WB encoder, and I have been playing with the ALC, but it doesn't seem to work; still, many heads make short work.

I am just using the dev board for now but the idea of a simple cheap custom board for the A1S module is enticing.

Hi, I made a simple INMP441 device with [ESP32-Rhasspy-Satellite] https://github.com/Romkabouter/ESP32-Rhasspy-Satellite/issues/52 for my esp32cam. I can hear the voice ("alexa") correctly in the audio I collected with a script querying MQTT, but the wake word is not recognised by Porcupine on Rhasspy (2.5.9). Any ideas?

Did you properly declare the satellite id in your rhasspy instance?
Maybe post your rhasspy config for review.

I confirm I declared the satellite siteIds.
This is my config:

{
"dialogue": {
    "satellite_site_ids": "satellite",
    "system": "rhasspy"
},
"handle": {
    "satellite_site_ids": "satellite",
    "system": "hass"
},
"home_assistant": {
    "access_token": "xxx",
    "handle_type": "event",
    "url": "http://192.168.1.108:8123"
},
"intent": {
    "satellite_site_ids": "satellite",
    "system": "fsticuffs"
},
"microphone": {
    "command": {
        "record_arguments": "udpsrc port=12333 ! rawaudioparse use-sink-caps=false format=pcm pcm-format=s16le sample-rate=16000 num-channels=1 ! queue ! audioconvert ! audioresample ! filesink location=/dev/stdout",
        "record_program": "gst-launch-1.0"
    },
    "system": "hermes"
},
"mqtt": {
    "enabled": "true",
    "host": "192.168.1.108",
    "port": "1883"
},
"speech_to_text": {
    "satellite_site_ids": "satellite",
    "system": "kaldi"
},
"wake": {
    "debug": "true",
    "porcupine": {
        "keyword_path": "alexa_linux.ppn",
        "sensitivity": "0.5"
    },
    "raven": {
        "keywords": {
            "alexa": {
                "enabled": true
            }
        }
    },
    "satellite_site_ids": "satelitte",
    "system": "porcupine"
}

}

This is my settings.ini of ESP32-Rhasspy-Satellite:

[General]
hostname=192.168.1.108
deployhost=192.168.1.108
siteId=satellite
;supported: M5ATOMECHO=0, MATRIXVOICE=1, AUDIOKIT=2, INMP441=3
device_type=INMP441

[Wifi]
ssid=xxx
password=xxx

;uncomment next 4 lines and fill with your own values if you want to use a static ip address
ip=192.168.1.23
gateway=192.168.1.1
subnet=255.255.255.0
dns1=192.168.1.1
;optional: second dns server
;dns2=192.168.xxx.xxx

[OTA]
;supported: upload, ota, matrix
;-upload: device should be attached to computer via usb
;-ota: will use espota
;-matrix: matrix voice should be attached to a raspberry pi with matrix software.
;         deployhost should be set to ip of the pi
method=upload
password=OTApassword
port=3232

[MQTT]
hostname=192.168.1.108
port=1883
username=
password=

I have this error:

[DEBUG:2021-04-03 22:59:21,748] rhasspywake_porcupine_hermes: Enabled
[ERROR:2021-04-03 22:59:21,729] rhasspyserver_hermes:
Traceback (most recent call last):
  File "/usr/lib/rhasspy/.venv/lib/python3.7/site-packages/quart/app.py", line 1821, in full_dispatch_request
    result = await self.dispatch_request(request_context)
  File "/usr/lib/rhasspy/.venv/lib/python3.7/site-packages/quart/app.py", line 1869, in dispatch_request
    return await handler(**request_.view_args)
  File "/usr/lib/rhasspy/rhasspy-server-hermes/rhasspyserver_hermes/__main__.py", line 923, in api_listen_for_command
    async for response in core.publish_wait(handle_intent(), [], message_types):
  File "/usr/lib/rhasspy/rhasspy-server-hermes/rhasspyserver_hermes/__init__.py", line 985, in publish_wait
    result_awaitable, timeout=timeout_seconds
  File "/usr/lib/python3.7/asyncio/tasks.py", line 423, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError
[DEBUG:2021-04-03 22:59:21,752] rhasspyasr_kaldi_hermes: <- AsrStopListening(site_id='default', session_id='default-default-aa04462f-f0cf-40b0-97e7-8a199ee4884c')
[DEBUG:2021-04-03 22:59:21,755] rhasspyasr_kaldi_hermes: -> AsrRecordingFinished(site_id='default', session_id='default-default-aa04462f-f0cf-40b0-97e7-8a199ee4884c')

I believe the microphone on the Rhasspy instance should be disabled (not set to hermes). The way it is right now, it expects to get mic data for the site name of the Rhasspy instance rather than reading from “satellite”. Once you disable the mic in the Rhasspy instance, it will read from the satellite siteId. Also note that I observed that a Rhasspy instance can only do wake word detection for one audio source.
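Roughly, the base instance profile would then end up with something like this (a sketch only, in the same profile format as the config above; I believe the UI's “Disabled” option stores the mic system as “dummy”, but check your profile.json after switching it):

"microphone": {
    "system": "dummy"
},
"wake": {
    "porcupine": {
        "keyword_path": "alexa_linux.ppn"
    },
    "satellite_site_ids": "satellite",
    "system": "porcupine"
}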

The device_type in settings.ini should be 3, not INMP441. But apparently that works too :smiley:

The mic should be set to hermes, but your server's siteId should be something other than “satellite”, is it?
If you have one device, you can set the server's siteId to satellite and remove all satellite ids from the input boxes.

The yavilevich method is the way, thank you!

The default/satellite configuration is not working.

I have the latest version of Rhasspy (2.5.10) on Docker on a QNAP NAS.


Glad you got it sorted. Hopefully you will share/publish your INMP441 work.

That is not accurate. The default satellite configuration is meant for a main and a satellite where both are raspberry pis, each running its own wake detection engine locally. I believe it works fine that way.

When the satellite is an ESP32 which doesn’t do its own wake detection you will need a rhasspy raspi instance to do detection for the ESP32. Each rhasspy instance can only do one wake detection process at a time so you can’t have a main do both its own audio processing and one for the satellite. This is why you need to disable the processing of audio on the main and set it to process for the satellite by adding the satellite name. Perhaps having both use the same site id will also work, I have not tried that.

The reason why one rhasspy instance can only do one wake detection at a time seems to be historical. It is not a bug. I believe synesthesiam mentioned that he would want to refactor that someday. It probably needs an architectural change.

Thank you yavilevich !

Is there a possibility that the ESP32 can do local wake detection?

It should be possible, but I'm not sure about the quality of detection that can be achieved.

Atomic14 (the topic of this original post) got close. Check his solution.

I am not familiar with other implementations.

That seems a very interesting project.
Is there a lot of latency?
Could you make a small video like Atomic14's?

You can on the ESP32, but the models you can run are limited, and from what I have seen much of the problem is due to poor choice of training data.
Many seem to use the Google command set, which really is a benchmark dataset and deliberately contains wide universal variation, with approximately 10% bad samples.
It needs to be that way, as otherwise all state-of-the-art KWS would be registering 100% and its use as an accuracy benchmark would be of little use.

I have learnt some tricks from the Google-KWS repo https://github.com/google-research/google-research/tree/master/kws_streaming, which adds a third classification alongside KW and !KW that they call ‘silence’. It is just background noise, and it really helps accuracy: many low-volume signals that would otherwise be a lottery between KW and !KW have another class to fall into, and that simple addition increases accuracy.
I thought it was some sort of VAD/SAD implementation, but it is really a catch-all for anything that is neither KW nor !KW.

If you are making a custom model, then use the voices that will actually use it and record on the device of use: the KW samples are obviously the KW, and some choice sentences split into words of that ‘voice of use’ become !KW.
When you choose a KW it needs to be as unique as possible, hopefully a word of three phones/syllables.
They also use the background_noise folder to produce the ‘silence’ category, but mix it into KW and !KW as well.
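As a rough sketch of that recipe in Python (illustrative only: the folder names and mix levels are made up, and this is not the Google repo's actual code):

import glob, random
import numpy as np
import soundfile as sf

def load(path, length=16000):
    # read a clip and force it to one second at 16 kHz (crop or zero-pad)
    audio, _ = sf.read(path, dtype="float32")
    audio = audio[:length]
    return np.pad(audio, (0, length - len(audio)))

noise_clips = [load(f) for f in glob.glob("_background_noise_/*.wav")]

def add_noise(sample, level=0.1):
    # overlay one of the (pre-cropped) background noise clips at a low level
    return sample + level * random.choice(noise_clips)

dataset = []
for f in glob.glob("kw/*.wav"):        # keyword recorded with your voice, on your device, in your room
    dataset.append((add_noise(load(f)), 0))   # class 0 = KW
for f in glob.glob("not_kw/*.wav"):    # choice sentences split into words become !KW
    dataset.append((add_noise(load(f)), 1))   # class 1 = !KW
for noise in noise_clips:              # noise-only clips form the third 'silence' class
    dataset.append((0.3 * noise, 2))          # class 2 = 'silence'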

Atomic did a great implementation of the code, and broadcasting on KW until silence (remote VAD) should be very possible; his dataset choice and training methods were not so good, though, from the choice of KW to the dataset.
Which is great, as you could use his code but work on a better dataset and a choice of KW that is ‘your voice’ on ‘your device’ in ‘your room’.
The further you diverge from actual use, the less accuracy you get, but from playing with the Google KWS I can get 100% validation with models that are extremely tolerant of high levels of noise and vastly surpass any KW system in Rhasspy, and much of that is purely training.

Tensorflow ML is under heavy development, but the MFCC front end and models are part of the Google streaming KWS. I do have a couple of ESP32s, I just never got round to trying them, but on a Pi3A+ they are extremely light, running at less than 20% on a single core.
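For what it's worth, getting a trained Keras KWS model down to something the microcontroller runtime can load is roughly the following (a sketch only: kws_model and rep_data_gen are placeholders for your trained model and a generator of representative MFCC inputs):

import tensorflow as tf

# convert a trained Keras KWS model to a fully int8-quantised .tflite file
converter = tf.lite.TFLiteConverter.from_keras_model(kws_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
with open("kws_model.tflite", "wb") as f:
    f.write(converter.convert())
# then e.g. "xxd -i kws_model.tflite > model_data.cc" to embed the model in the firmware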

Probably, though, the new ESP32-S3, which has SIMD vector acceleration specifically for NN models, is likely to be a killer platform for satellite voice AI.

English phonetic pangrams

Pangrams which use all the phonemes, or phones, of English (rather than alphabetic characters):

  • “With tenure, Suzie’d have all the more leisure for yachting, but her publications are no good.” (for certain US accents and phonological analyses)
  • “Shaw, those twelve beige hooks are joined if I patch a young, gooey mouth.” (perfect for certain accents with the cot-caught merger)
  • “Are those shy Eurasian footwear, cowboy chaps, or jolly earthmoving headgear?” (perfect for certain Received Pronunciation accents)
  • “The beige hue on the waters of the loch impressed all, including the French queen, before she heard that symphony again, just as young Arthur wanted.” (a phonetic, not merely phonemic, pangram. It contains both nasals [m] and [ɱ] (as in ‘symphony’), the fricatives [x] (as in ‘loch’) and [ç] (as in ‘hue’), and the ‘dark L’ [ɫ] (as in ‘all’) – in other words, it contains different allophones.)

I have been trying to twist Atomic's arm to give it another go, but failing :slight_smile:

Thanks rolyan_trauts! What you write is fascinating, and I realize that I know nothing about it.
I understand that you have experience in implementing these models. Do you think you would be able to implement a compatible model for the ESP32, to be integrated into the ESP32-Rhasspy-Satellite program, so as to realize local wake word detection?

Nope; it's just my opinion, but the satellites/Hermes audio arrangement is just a really poor implementation and was done backwards, with the server and protocol first and satellites bolted on as an add-on.

Also, I wouldn't really want to hack anything, but it's already there in what Atomic did; it just needs to broadcast from the KW and be controlled to stop via remote VAD, rather than just a long timeout.

What Atomic did was great, but even that CNN, if used with a better dataset (or maybe the models from the Google streaming KWS), would be far more accurate.

As far as I can gather, to use remote VAD you need to communicate with Hermes audio, which is just such a brainfart: sending audio chunks over MQTT, encrypted, alone takes a lot of resources.
I think the HTTP chunk method doesn't exist, but I don't use it because of my dislike of a whole lot of pointlessness whose only rationale is some sort of branding I just don't get.
If you can do remote VAD over HTTP, with MQTT just being a message control protocol as it should be, then the answer is yes, but I would like an ESP32 whizz to do it rather than me hacking Atomic's code.

I have MS and my memory with coding is extremely frustrating, so I tend not to, but yeah, with better datasets (which is what I have been playing with out of curiosity) KWS can be far more accurate and tolerant given better dataset recipes.
I haven't tried TFL4M, but I see references to the microcontroller version quite often, as Google seem quite active there. We have all seen Atomic's code work, and I know that with a better dataset it would be more accurate.

Tensorflow Lite for Arm mobile can delegate layers out to any delegate, from full TF to another ‘engine’, but with the microcontroller version it's likely you are not going to have larger libs installed, so you use only the layers in TFL4M.
Google have a variation of a CNN that I think is streaming on the ESP32 and only a couple of % behind state-of-the-art NNs in accuracy, but so is any model with a better dataset.

A KWS should be outside of Rhasspy and be like any form of HMI, where it connects to a server that converts to whatever protocols are needed; there is no need to embed system protocols in any HMI, even a KWS, and I have said this a couple of times.

So the answer is no, not because an ESP32 KWS would not work, or work well, but purely because the Rhasspy infrastructure is whack.

I recently found out about the Knowles smart mics (IA611), thanks to Atomic14.
They are small digital MEMS microphones for a few dollars (not much more than the INMP441), but they integrate logic and a DSP to do the keyword recognition themselves (needing only about 2 mA in listen mode).

I thought about buying an eval kit (to have a board and not just a small SMD MEMS mic), but they do not provide the API and firmware openly on their website. You need to apply for it, and you can only apply as a company. I tried it with company “None”, but well, it's been a week and no answer. Not very DIY/maker friendly :frowning:

Atomic apparently is trying to get the dev kit for a review.

I have a PR open for this.

I need to implement it for the other wakeword systems as well.
Also, I am hoping that Porcupine creates a lib for the ESP32 so I can implement local hotword detection for the esp32-satellite.


@romkabouter
For local hotword detection, an alternative could be to implement a TensorflowLite model as Atomic14 did in his Alexa project. What do you think?

Yes, I was thinking about that, but I think it will require a lot of effort for users to be able to use it.
This is with regards to training and such, but I will have a better look. Maybe it is easier than I think :slight_smile: