DIY Alexa on ESP32 with INMP441

Did you properly declare the satellite id in your rhasspy instance?
Maybe post your rhasspy config for review.

I confirm I declared the satellite siteIds.
This is my config:

{
"dialogue": {
    "satellite_site_ids": "satellite",
    "system": "rhasspy"
},
"handle": {
    "satellite_site_ids": "satellite",
    "system": "hass"
},
"home_assistant": {
    "access_token": "xxx",
    "handle_type": "event",
    "url": "http://192.168.1.108:8123"
},
"intent": {
    "satellite_site_ids": "satellite",
    "system": "fsticuffs"
},
"microphone": {
    "command": {
        "record_arguments": "udpsrc port=12333 ! rawaudioparse use-sink-caps=false format=pcm pcm-format=s16le sample-rate=16000 num-channels=1 ! queue ! audioconvert ! audioresample ! filesink location=/dev/stdout",
        "record_program": "gst-launch-1.0"
    },
    "system": "hermes"
},
"mqtt": {
    "enabled": "true",
    "host": "192.168.1.108",
    "port": "1883"
},
"speech_to_text": {
    "satellite_site_ids": "satellite",
    "system": "kaldi"
},
"wake": {
    "debug": "true",
    "porcupine": {
        "keyword_path": "alexa_linux.ppn",
        "sensitivity": "0.5"
    },
    "raven": {
        "keywords": {
            "alexa": {
                "enabled": true
            }
        }
    },
    "satellite_site_ids": "satelitte",
    "system": "porcupine"
}
}

This is my settings.ini for ESP32-Rhasspy-Satellite:

[General]
hostname=192.168.1.108
deployhost=192.168.1.108
siteId=satellite
;supported: M5ATOMECHO=0, MATRIXVOICE=1, AUDIOKIT=2, INMP441=3
device_type=INMP441

[Wifi]
ssid=xxx
password=xxx

;uncomment next 4 lines and fill with your own values if you want to use a static ip address
ip=192.168.1.23
gateway=192.168.1.1
subnet=255.255.255.0
dns1=192.168.1.1
;optional: second dns server
;dns2=192.168.xxx.xxx

[OTA]
;supported: upload, ota, matrix
;-upload: device should be attached to computer via usb
;-ota: will use espota
;-matrix: matrix voice should be attached to a raspberry pi with matrix software.
;         deployhost should be set to ip of the pi
method=upload
password=OTApassword
port=3232

[MQTT]
hostname=192.168.1.108
port=1883
username=
password=

I have this error:

[DEBUG:2021-04-03 22:59:21,748] rhasspywake_porcupine_hermes: Enabled
[ERROR:2021-04-03 22:59:21,729] rhasspyserver_hermes:
Traceback (most recent call last):
  File "/usr/lib/rhasspy/.venv/lib/python3.7/site-packages/quart/app.py", line 1821, in full_dispatch_request
    result = await self.dispatch_request(request_context)
  File "/usr/lib/rhasspy/.venv/lib/python3.7/site-packages/quart/app.py", line 1869, in dispatch_request
    return await handler(**request_.view_args)
  File "/usr/lib/rhasspy/rhasspy-server-hermes/rhasspyserver_hermes/__main__.py", line 923, in api_listen_for_command
    async for response in core.publish_wait(handle_intent(), [], message_types):
  File "/usr/lib/rhasspy/rhasspy-server-hermes/rhasspyserver_hermes/__init__.py", line 985, in publish_wait
    result_awaitable, timeout=timeout_seconds
  File "/usr/lib/python3.7/asyncio/tasks.py", line 423, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError
[DEBUG:2021-04-03 22:59:21,752] rhasspyasr_kaldi_hermes: <- AsrStopListening(site_id='default', session_id='default-default-aa04462f-f0cf-40b0-97e7-8a199ee4884c')
[DEBUG:2021-04-03 22:59:21,755] rhasspyasr_kaldi_hermes: -> AsrRecordingFinished(site_id='default', session_id='default-default-aa04462f-f0cf-40b0-97e7-8a199ee4884c')

I believe the microphone on the Rhasspy instance should be disabled (not set to hermes). The way it is right now, it expects to get mic data for the site name of the Rhasspy instance rather than reading from “satellite”. Once you disable the mic in the Rhasspy instance, it will read from the satellite siteId. Also note that, from what I have observed, a Rhasspy instance can only do wake word detection for a single audio source.
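To make that concrete, a minimal sketch of the relevant parts of the base profile after that change (only the keys being discussed are shown; “dummy” is what Rhasspy writes for a service set to Disabled in the web UI, so verify against your own profile rather than copying this verbatim):

{
    "microphone": {
        "system": "dummy"
    },
    "speech_to_text": {
        "satellite_site_ids": "satellite",
        "system": "kaldi"
    },
    "wake": {
        "satellite_site_ids": "satellite",
        "system": "porcupine"
    }
}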

The device_type here should be 3, not INMP441. But apparently that works too :smiley:
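That is, using the numeric value from the comment in settings.ini:

[General]
device_type=3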

The mic should be set to hermes, but your server’s siteId should be something other than “satellite”, is it?
If you have one device, you can set the server’s siteId to satellite and remove all satellite site ids from the input boxes.

The yavilevich method is the way, thank you!

The default/satellite configuration is not working.

I have the latest version of Rhasspy (2.5.10) running in Docker on a QNAP NAS.


Glad you got it sorted. Hopefully you will share/publish your INMP441 work.

That is not accurate. The default satellite configuration is meant for a main and a satellite where both are Raspberry Pis, each running its own wake detection engine locally. I believe it works fine that way.

When the satellite is an ESP32, which doesn’t do its own wake detection, you will need a Rhasspy Raspberry Pi instance to do detection for the ESP32. Each Rhasspy instance can only do one wake detection process at a time, so you can’t have a main do both its own audio processing and the satellite’s. This is why you need to disable the processing of audio on the main and set it to process for the satellite by adding the satellite name. Perhaps having both use the same site id will also work; I have not tried that.

The reason why one rhasspy instance can only do one wake detection at a time seems to be historical. It is not a bug. I believe synesthesiam mentioned that he would want to refactor that someday. It probably needs an architectural change.

Thank you yavilevich !

Is there a possibility that the ESP32 can do local wake detection?

It should be possible, but I’m not sure about the quality of detection that can be achieved.

Atomic14 (the topic of this original post) got close. Check his solution.

I am not familiar with other implementations.

That seems a very interesting project.
Is there a lot of latency?
Could you make a small video like Atomic14’s?

You can on an ESP32, but the models you can run are limited, and from what I have seen much of the problem is poor choice of training data.
Many seem to use the Google command set, which really is a benchmark dataset and deliberately contains wide universal variation, with approx 10% bad samples.
It needs to be that way, as otherwise all state-of-the-art KWS would be registering 100% and its use as an accuracy benchmark would be of little use.

I have learnt some tricks from the Google-KWS repo https://github.com/google-research/google-research/tree/master/kws_streaming which adds a 3rd classification to KW & !KW that they call ‘silence’. It is just background noise, and it really helps accuracy: many low-volume signals that would otherwise be a lottery between KW & !KW have another classification to fall into, and that simple addition increases accuracy.
I thought it was some sort of VAD/SAD implementation, but it is really a catch-all for anything that is not KW or !KW.
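To make that three-way split concrete, here is a rough Keras sketch (not the Google repo’s actual code; the input shape and layer sizes are illustrative) where background noise gets its own ‘silence’ class next to KW and !KW:

import tensorflow as tf

# Three output classes: background noise gets its own label instead of
# being forced into keyword / not-keyword.
labels = ["silence", "not_kw", "kw"]

# Illustrative input: 49 frames x 40 MFCC coefficients for a 1 s clip at 16 kHz.
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(49, 40, 1)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(len(labels), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])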

If you are making a custom model, use the voices that will actually use it and record on the device of use: the KW recordings are obviously the KW, and some choice sentences split into words of that ‘voice of use’ become !KW.
When you choose a KW, it needs to be as unique as possible, ideally a word of 3 phones/syllables.
Also, they use the background_noise folder to produce the ‘silence’ category, but mix it into KW and !KW as well.

Atomic did a great implementation of the code, and broadcasting on KW until silence (remote VAD) should be very possible; his dataset choice and training methods were not so good, from the choice of KW to the dataset.
Which is great, as you could use his code but work on a better dataset and a choice of KW that is ‘your voice’ on ‘your device’ in ‘your room’.
The further you diverge from actual use, the less accuracy you get, but from playing with the Google KWS I can get 100% validation with models that are extremely tolerant of high levels of noise and vastly surpass any KW system in Rhasspy, and much of that is purely training.

TensorFlow ML is under heavy development, but the MFCC front end and the models are part of the Google streaming KWS. I do have a couple of ESP32s, I just never got round to trying it, but on a Pi3A+ they are extremely light, running at less than 20% of a single core.

Probably, though, the new ESP32-S3, which has SIMD vector acceleration specifically for NN models, is likely to be a killer platform for satellite voice AI.

English phonetic pangrams

Pangrams which use all the phonemes, or phones, of English (rather than alphabetic characters):

  • “With tenure, Suzie’d have all the more leisure for yachting, but her publications are no good.” (for certain US accents and phonological analyses)
  • “Shaw, those twelve beige hooks are joined if I patch a young, gooey mouth.” (perfect for certain accents with the cot-caught merger)
  • “Are those shy Eurasian footwear, cowboy chaps, or jolly earthmoving headgear?” (perfect for certain Received Pronunciation accents)
  • “The beige hue on the waters of the loch impressed all, including the French queen, before she heard that symphony again, just as young Arthur wanted.” (a phonetic, not merely phonemic, pangram. It contains both nasals [m] and [ɱ] (as in ‘symphony’), the fricatives [x] (as in ‘loch’) and [ç] (as in ‘hue’), and the ‘dark L’ [ɫ] (as in ‘all’) – in other words, it contains different allophones.)

I have been trying to twist Atomic’s arm to give it another go, but failing :slight_smile:

Thanks rolyan_trauts! What you write is fascinating, and I realize that I know nothing about it.
I understand that you have experience in implementing these models. Do you think you would be able to implement a compatible model for the ESP32, to be integrated into the ESP32-Rhasspy-Satellite program, so as to realize local wake word detection?

Nope. It’s just my opinion, but the manner of satellites/Hermes audio is just a really poor implementation; it was done backwards, server first, with the protocol and satellites being an add-on.

Also, I wouldn’t really want to hack anything, but it’s already there in what Atomic did; it just needs to broadcast from the KW and be controlled to stop via remote VAD, not just a long timeout.

What Atomic did was great, but even that CNN, used with a better dataset (or maybe the models from the Google streaming KWS), would be far more accurate.

As far as I can gather, to use remote VAD you need to communicate with Hermes audio, which is just such a brainfart: sending audio chunks over MQTT, encrypted, alone takes a lot of resources.
I think the HTTP chunk method doesn’t exist, but I don’t use it because of my dislike of a whole lot of pointlessness whose only rationale is some sort of branding I just don’t get.
If you can do remote VAD over HTTP, with MQTT being just a message control protocol, then the answer is yes, but I would like an ESP32 whizz rather than me hacking Atomic’s code.

I have MS and my memory with coding is extremely frustrating, so I tend not to, but yeah, with better datasets (which is what I have been playing with out of curiosity) KWS can be far more accurate and noise tolerant purely through better dataset recipes.
I haven’t tried TFL4M, but I see references to the microcontroller version quite often, as Google seem quite active there. We have all seen Atomic’s code work, and I know that with a better dataset it would be more accurate.

TensorFlow Lite for Arm mobile can delegate layers out to any delegate, from full TF to another ‘engine’, but on a microcontroller you are likely not going to have larger libs installed and will use only the layers in TFL4M.
Google have a variation of a CNN that I think is streaming on the ESP32 and only a couple of percent behind state-of-the-art NNs in accuracy, but so is any model with a better dataset.

A KWS should sit outside of Rhasspy and be like any form of HMI, where it connects to a server that converts to whatever protocols are needed; there is no need to embed system protocols in any HMI, even a KWS, and I have said this a couple of times.

So the answer is no, not because an ESP32 KWS would not work or would not work well, but purely because the Rhasspy infrastructure is whack.

I recently found out about the Knowles smart mics (IA611), thanks to Atomic14.
They are small digital MEMS microphones for a few dollars (not much more than the INMP441), but they integrate logic and a DSP to do the keyword recognition themselves (only needing about 2 mA in listen mode).

I thought about buying an eval kit (to have a board and not just a small SMD MEMS mic), but they do not provide the API and firmware openly on their website. You need to apply for it, and you can only apply as a company. I tried it with company “None”, but well, it’s been a week and no answer. Not very DIY/maker friendly :frowning:

Atomic is apparently trying to get the dev kit for a review.

I have a PR open for this.

I need to implement it for the other wake word systems as well.
Also, I am hoping that Porcupine creates a lib for the ESP32 so I can implement local hotword detection for the esp32-satellite.


@romkabouter
For local hotword detection, an alternative could be to implement a TensorFlow Lite model, as Atomic14 did in his Alexa project. What do you think?

Yes, I was thinking about that, but I think it will require a lot of effort for users to be able to use it.
This is with regard to training and such, but I will have a better look. Maybe it is easier than I think :slight_smile:

I agree with you. Anyway, it could be possible to provide more solutions for hotword detection, which a user could select by configuring the settings.ini file.

The only thing Atomic got wrong was the dataset, and that confuses everyone: the ‘Google command dataset’ is a benchmark dataset to test accuracy, and hence contains a high proportion of bad samples and extremely varied accent content.
If you are making a custom dataset, you should use the input voices only; why add native accents from around the world when there is no possibility they are coming to visit?

Also, he used ‘Marvin’, which is 2 syllables and fairly short, whilst 3 syllables filling the frame, as in ‘heymarvin’, would have been much better: it contains more phones and is more unique.

A couple of choice sentences can return 40 or more words, and the only hassle is repeating your KW the same number of times.
Record on the device of use with the mic of use, and then tools such as SoX can quickly augment that into 1-2k KW & !KW samples.
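As a rough illustration of that kind of augmentation (a sketch using numpy and soundfile rather than SoX itself; the folder names are made up and mono 16 kHz WAVs are assumed), mixing random background-noise slices into each recording at a few signal-to-noise ratios:

import glob, random
import numpy as np
import soundfile as sf

noise_files = glob.glob("background_noise/*.wav")

def augment(in_path, out_path, snr_db):
    """Mix a random background-noise slice into one mono recording at roughly snr_db."""
    speech, sr = sf.read(in_path, dtype="float32")
    noise, _ = sf.read(random.choice(noise_files), dtype="float32")
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    start = random.randint(0, len(noise) - len(speech))
    noise = noise[start:start + len(speech)]
    # Scale the noise so the mix sits at the requested signal-to-noise ratio.
    eps = 1e-9
    gain = np.sqrt((np.mean(speech ** 2) + eps) / (np.mean(noise ** 2) + eps)) / (10 ** (snr_db / 20))
    mixed = speech + gain * noise
    # Normalise only if the mix clips, then write the augmented copy.
    sf.write(out_path, mixed / max(1.0, float(np.max(np.abs(mixed)))), sr)

# Example: turn each raw keyword recording into several noisy copies.
for i, wav in enumerate(sorted(glob.glob("kw_raw/*.wav"))):
    for snr in (20, 10, 5):
        augment(wav, f"kw/{i:04d}_snr{snr}.wav", snr)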

Atomic also followed the basic TensorFlow audio example, which is more of an introduction than a supposedly working KWS, whereas Google have published a framework of current state-of-the-art KWS models.

I have published a repo just to make it easier to get started and install TensorFlow & TensorFlow Addons (not needed for the ESP32, as delegation is not supported).

I also created a sample repo on how to create datasets, as the last thing missing from Atomic’s KWS was a silence classification, which acts as a catch-all between spoken KW and spoken !KW and greatly increases accuracy.

https://www.tensorflow.org/lite/microcontrollers supports the ESP32, and the CNN examples in the G-KWS are perfect to export to a microcontroller, using the front-end ops for microcontrollers.
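For the export step, a common route (a sketch, not something from this thread; the representative-data generator below is a placeholder you would replace with batches of real MFCC inputs) is full-integer quantisation so TensorFlow Lite for Microcontrollers on the ESP32 can run the model:

import numpy as np
import tensorflow as tf

def representative_data():
    # Placeholder: yield batches of real MFCC inputs here, same shape as in training.
    for _ in range(100):
        yield [np.random.rand(1, 49, 40, 1).astype(np.float32)]

# 'model' is the trained Keras model (e.g. the sketch earlier in this thread).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("kws_model.tflite", "wb") as f:
    f.write(converter.convert())

The resulting .tflite flatbuffer is then embedded as a C array (for example with xxd -i) and loaded by the TFLite Micro interpreter on the ESP32.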

When you create your dataset, the quantities of samples greatly affect the classification weights, and with the 3 classifications of silence, !KW & KW the quantities they contain can be tweaked to attain the results required.
Training is in four stages, with probably 2k steps each being a minimum and 8k starting to reach the point of no real return.

You can get an extremely noise-resilient KWS that far surpasses ASR, which is not tolerant of noise.