DIY Alexa on ESP32 with INMP441


Just started with Home Assistant and already amazed at how advanced things are. I was away from the electronics hobby for a few years and was still in the Arduino / PIC dark ages.
I was wondering how to set up a homemade Alexa without any cloud interaction. Rhasspy seems to be a winner. Thanks for all the good work done by @synesthesian.
In my case Home Assistant is a virtual appliance installed in a VM without access to hardware. Also, I don’t find it practical to have to speak near the server. So the ESP32 seems a good idea for a satellite solution that passes voice commands to Rhasspy from more than one place.

Then I stumbled upon this video from Atomic14 on YouTube. Not sure I can post it here.

This is half the work done: it already has wake word detection and passes the WAV to a destination. I left a comment for Atomic14 to have a look at this forum. Hope this is helpful for everyone following this topic.

I feel grateful for getting in touch with people who have the same hobbies as I have.

Hope I’m not building castles in the air and this is all doable.

I just received my ESP32 dev kit + INMP441 and am setting up the Visual Studio Code install.

Count me in for any testing/developing on this branch of the project. Not a pro programmer, but will do my best.

Thank you!


Welcome and good luck!

Hi beared,

I had the same idea and found your post while searching for solutions. Thanks for the hint about the Atomic14 repo. At the moment I’m setting up Rhasspy on a faster desktop, with some Pi 1s I had lying around as satellites/wakeword clients with microphones, as a replacement for my Snips system.

As I only have 3 Pi 1s lying around and around 6 rooms to cover with voice control, I’m always looking for cost-effective alternatives to the Pi. With this setup a satellite would only cost around 20 dollars, which would allow me to even equip our basement with voice control ;-).

I don’t think these are castles in the air; about 90% of the job seems to have been done already by Atomic14.
I will also try to get the hardware components so I can try to get it working with Rhasspy and contribute some code snippets. (I’m a Java dev, so C++ is writable but not my speciality.) I will post the code repo here as soon as I have the hardware and the time to make some progress.

I also toyed with the idea a while ago:

For instance, I have this little piece of hardware lying on my desk:

Unfortunately I haven’t found the time yet to try to implement something. But I’d definitely like to have something cheaper and less power-hungry than a Raspberry Pi to work as a Rhasspy satellite.

@koan that is a nice device!
I wanted to have a small device as well and this seems to be exactly what I need.
My plan is to have a device like this in each room, with some wakewords.
For instance, when my daughter enters her bedroom, she can say “lights on” and the light in her room switches on.
This device would have:

  • lights on wakeword
  • lights off wakeword

Both of them take action right away, so it is not a voice assistant but rather a very simple system with very few commands.

Why? Because when you enter a small room, first activating the assistant and then asking it to switch on a light takes too much time. It is much faster to flick a switch.

The Matrix Voice is not suited for such use cases; mine is not even in use at the moment since I have no nice case for it.

So you would train it then to recognize just a limited set of commands/wakewords? That’s actually a nice idea, a sort of ‘audio switch’ instead of a tactile switch.

Yes indeed, most of the time you just want to turn on a couple of lights or activate a scene.
So, if you walk into your living room you can say “lights on” or something like that.
Rhasspy can be trained with Raven, and you should be able to get something going with MQTT and Node-RED to create actions that respond directly to wakeword activation, in this case turning a set of lights on or off.
Even better would be on-device wakeword detection, but that is harder to do.
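
As a rough illustration of the MQTT idea above, here is a minimal Python sketch that maps a wakeword detection message to a light action. It assumes Rhasspy's Hermes-style topic `hermes/hotword/<wakewordId>/detected` with a JSON payload carrying a `siteId`. The wakeword ids (`lights_on`, `lights_off`) and the `light.<siteId>` naming are made up for the example, and the MQTT client wiring is left out on purpose.

```python
import json

# Hypothetical mapping from wakeword id to the action it should trigger.
WAKEWORD_ACTIONS = {"lights_on": "turn_on", "lights_off": "turn_off"}


def route_hotword(topic, payload):
    """Return (entity, action) for a hotword message, or None if it is not handled."""
    parts = topic.split("/")
    # Expected shape (assumption): hermes/hotword/<wakewordId>/detected
    if len(parts) != 4 or parts[:2] != ["hermes", "hotword"] or parts[3] != "detected":
        return None
    action = WAKEWORD_ACTIONS.get(parts[2])
    if action is None:
        return None
    site_id = json.loads(payload).get("siteId", "default")
    # Assumption: one light entity per satellite, named after its site id.
    return ("light." + site_id, action)
```

Calling `route_hotword("hermes/hotword/lights_on/detected", '{"siteId": "bedroom"}')` gives `("light.bedroom", "turn_on")`, which a Node-RED flow or a small script could then forward to Home Assistant.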

Hi Cephos,

Have you tried to compile the code from the Atomic14 repo?
I’ve tried but got stuck on the fmax function not being included in the C compiler I’m using.

I installed Visual Studio Code and PlatformIO from scratch on Windows, but I’m missing something because there are many compile errors.

Bucket of cold water… as I was very interested in this.

It’s a really good idea to use a small hidden device instead of a Raspberry Pi.
Unfortunately I am not a programmer and cannot help. But here is a link to the Espressif voice project.

And a similar DIY project.

Haven’t really looked at I2S on the ESP32, but at $5 a pop with built-in flash they make great little devices.

I will have a look at that vid some time later.

I did notice it mentions the Google Command set, which is just a collection of audio files, and for some reason no one mentions this but it’s full of bad word recordings that really have an effect on accuracy.
I spent ages cleaning it up and removed almost 10% of samples from the ver 2.0 collection and got a huge accuracy increase.

Hopefully this won’t derail the thread - but couldn’t an ESP32 device simply function as a duplex audio streaming device and let the base process the stream for wake words/intents? That’s what the base already does, and it removes programming/wake word detection complexities from the ESP32, basically turning the device into a mic/speaker stream that can be consumed by the base station.

I can’t imagine the audio streaming network usage to be large (I’d guesstimate less than 15KB/s).
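
A quick back-of-the-envelope check of that figure. This assumes 16 kHz, 16-bit mono PCM (a common speech-recognition format; the thread does not say which format would actually be streamed) and, for comparison, the highest AMR-WB mode of 23.85 kbit/s.

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    """Raw (uncompressed) PCM data rate in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


# Assumed format: 16 kHz, 16-bit, mono.
raw_rate = pcm_bytes_per_second(16000, 16, 1)

# Highest AMR-WB mode is 23.85 kbit/s; divide by 8 for bytes.
amr_wb_rate = 23850 / 8

print(f"raw PCM: {raw_rate / 1000:.1f} kB/s")
print(f"AMR-WB:  {amr_wb_rate / 1000:.1f} kB/s")
```

Under these assumptions raw PCM comes out at about 32 kB/s, roughly double the guess above, while an AMR-WB stream would be around 3 kB/s. Either is small for a home network.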

If this is trivial enough to do, I guess it would be up to Rhasspy to support multiple simultaneous in/out audio streams that could be aliased like a satellite id (so different intents could trigger from the ids of different streams, so “turn the lights on” on stream1 is for one room, stream2 is for another, etc.)
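
One way to picture the "stream aliased by satellite id" idea is to put the id into the topic each audio chunk is published on. The sketch below assumes a Hermes-style topic `hermes/audioServer/<siteId>/audioFrame` carrying small WAV chunks; the function name and the 16 kHz / 16-bit mono format are assumptions for the example, and no MQTT client is included.

```python
import io
import wave


def make_audio_frame(site_id, pcm_bytes, sample_rate=16000):
    """Wrap raw 16-bit mono PCM in a WAV container and pick a per-satellite topic."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    # The satellite id lives in the topic, so the base can tell streams apart.
    topic = f"hermes/audioServer/{site_id}/audioFrame"
    return topic, buf.getvalue()


# Example: 10 ms of silence from a satellite called "kitchen".
topic, payload = make_audio_frame("kitchen", b"\x00\x00" * 160)
```

The base station can then subscribe to all satellites with a wildcard and read the room from the topic, so "turn the lights on" from `kitchen` and from `bedroom` can be handled differently.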

Hmm, interesting. Your post prompted me to search a bit about the INMP441 (I hadn’t heard of it before). There’s a YouTube video about using it with the ESP32 which you may already have seen.
I think I’ll order an INMP441 to play with!

Any I2S mic will do, but the INMP441s are very cheap on AliExpress.

I got some pre-Brexit, but it seems a lot of sellers are no longer supplying the UK.

The ESP32 could well function as a duplex device. RAM for ring buffers, and resources in general, are pretty limited, but quite a lot is possible.

Also, there are a few different types of ESP32; the Wrover version is often just a tad more expensive but has more RAM and is probably the more likely choice.
There are some dev boards loaded with mics and codecs, but really they are just a Wrover mounted with ancillaries at approx Pi Zero price or above, and probably a bit pointless as an I2S mic and amp are really dirt cheap now.

I really like what Atomic did in the above vid, and I have been banging on for a while about either a TensorFlow, Keras or PyTorch KWS, as for some reason the Precise implementation seems to be very heavy.

Be wary of the Google command set and the old adage of “Garbage in / Garbage out” as this seems very true of models.
I was using the Linto HMG with the “Visualise” key word from the Google Command set ver 2.0.
The GUI is just a handy tool as it shows false positives/negatives with an easy button to play and this led me to realise how bad many of the samples are in the Google Command set.
It is just really simple stuff: badly cut words, very bad recordings or pronunciation, which I had presumed would have already been trimmed from the dataset.
Not so: approx 10% is bad, and if you take the time to run through it and delete the bad samples your overall accuracy will skyrocket.
Adding a few of your own recordings, pitch-shifted slightly, trimmed and normalised with a touch of variation, and with background noise added, will create a quantity of samples that will also greatly increase accuracy.
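
The augmentation steps described above could look something like the following. This is only a sketch in plain Python on lists of float samples in [-1, 1]; the function names, the ±5% shift range and the 10 to 25 dB noise range are arbitrary choices for illustration, and the "pitch shift" here is a crude resampling that also changes the duration.

```python
import math
import random


def normalise(samples, peak=0.9):
    """Scale samples so the loudest one reaches `peak`."""
    loudest = max(abs(s) for s in samples)
    if loudest == 0:
        return list(samples)
    return [s * peak / loudest for s in samples]


def resample(samples, factor):
    """Crude pitch/speed shift by linear interpolation (factor > 1 raises pitch)."""
    out = []
    for i in range(int(len(samples) / factor)):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def add_noise(samples, noise, snr_db):
    """Mix `noise` (must not be all zeros) into `samples` at the given SNR in dB."""
    def rms(x):
        return math.sqrt(sum(v * v for v in x) / len(x))
    gain = rms(samples) / (rms(noise) * 10 ** (snr_db / 20))
    # The noise clip is looped if it is shorter than the recording.
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(samples)]


def augment(samples, noise, rng):
    """One randomised variant: slight pitch shift, normalise, add background noise."""
    shifted = resample(samples, rng.uniform(0.95, 1.05))
    levelled = normalise(shifted, peak=rng.uniform(0.7, 0.95))
    return add_noise(levelled, noise, snr_db=rng.uniform(10, 25))


# Example with a synthetic 440 Hz tone standing in for a recorded keyword.
rng = random.Random(42)
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noise = [rng.uniform(-1, 1) for _ in range(800)]
variants = [augment(tone, noise, rng) for _ in range(5)]
```

Running `augment` a few times per recording turns a handful of your own samples into many slightly different ones.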

Using a distinct 3 phoneme word like ‘visualise’ helps a lot, but it doesn’t have a snappy name like Marvin.

The ESP32 does have an AMR-WB encoder (or plain WAV), and quite a choice of decoders.
I really like what Atomic has done: if you could include the keyword hit score in the audio stream, you could broadcast from keyword hit to silence and use the hit score metadata to pick the best mic signal from an array of mics, which is far better than just an RMS target.

The $5 ESP32s are exceptionally cheap, don’t need an SD card to program, and there are also models with a U.FL antenna connector, which can greatly help with signal level compared to the on-board antenna types.
There is no far field if you can cheaply place a distributed array, which is an extremely beneficial option while we still lack open-source beamforming on Linux.
Picking the nearest mic on the strongest keyword hit doesn’t need any fancy algs.
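
To show how little is needed for that, here is a minimal sketch of choosing the satellite with the strongest keyword hit. The tuple layout `(site_id, score, timestamp_s)` and the 0.5 s grouping window are assumptions for the example; how the scores actually reach the base (MQTT, stream metadata, etc.) is left open.

```python
def pick_best_satellite(hits, window_s=0.5):
    """Pick the site with the highest keyword score among near-simultaneous hits.

    hits: list of (site_id, score, timestamp_s) tuples.
    Only hits within `window_s` of the earliest one count as the same utterance.
    """
    if not hits:
        return None
    first = min(t for _, _, t in hits)
    candidates = [h for h in hits if h[2] - first <= window_s]
    return max(candidates, key=lambda h: h[1])[0]


hits = [
    ("kitchen", 0.71, 10.00),
    ("hall", 0.93, 10.05),
    ("bedroom", 0.99, 12.00),  # too late, belongs to a different utterance
]
best = pick_best_satellite(hits)
```

Here `best` is `"hall"`: it heard the keyword most clearly within the window, so only its stream would be used.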

The Raspberry Pi is the same, and for audio in you can just wire up 2x I2S mics very simply to GPIO, but the ESP32 Wrover is actually much cheaper than a Pi Zero.

I did notice this design by the InvenSense engineers
Ignore the array and think of it as 2 mics that are part of an I2S pair, and I presume we could do as they did. Digitally it would just be subtracting the value of one mic from the value in the other mic’s delay buffer, but I still have to try.
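
The subtraction described above can be sketched in a few lines. This is only my reading of the idea (a delay-and-subtract pair), not the InvenSense design itself; the function name and the choice of which mic gets delayed are assumptions.

```python
def delay_and_subtract(front, rear, delay_samples):
    """Compute y[n] = front[n] - rear[n - delay_samples].

    Rear samples before the start of the buffer are treated as zero.
    A sound that reaches the rear mic `delay_samples` before the front mic
    cancels out, while sound from other directions does not.
    """
    out = []
    for n, f in enumerate(front):
        k = n - delay_samples
        delayed = rear[k] if 0 <= k < len(rear) else 0
        out.append(f - delayed)
    return out


# A pulse arriving from behind: the rear mic hears it 2 samples earlier.
rear = [1, 2, 3, 4, 0, 0]
front = [0, 0, 1, 2, 3, 4]
cancelled = delay_and_subtract(front, rear, delay_samples=2)
```

In this toy case `cancelled` is all zeros. At a 16 kHz sample rate one sample corresponds to roughly 2 cm of sound travel, so the mic spacing sets which whole-sample delays are usable.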

Just one update.
I did get it compiled in a Windows Visual Studio Code + PlatformIO installation with the latest code Atomic14 made available.
One blocking point was the mic interference I was getting because I ran the wires below the board. It was picking up the WiFi radio interference. After I changed that, the wake word detection rate increased a lot.
I also wired the output board (MAX98357) to the speaker. It works great.

Hi there,

Yes, time is not always available. I managed to get the HW setup up and running. Now I just need to get the coding going.

Hi Alexey,

Thank you for sharing that. Added to my library.
I’ll also take a look at that. But for now I want to follow this path till the end.

Hi there,

Not derailing the thread. I did think of that option, but after watching the video I shared, I loved the way it was laid out and explained, and just for that it made my day!
I’ll stick to this path till the end. The solution you pointed out can be another project or branch, as there are many branches in the Rhasspy project.
Although I can see that having e.g. 5 satellites streaming all the time to Rhasspy could have an impact on functionality and stop it working.
Wake word detection on the ESP32 is the way I want to go. (Done!)
Passing the audio stream to Rhasspy is the next step.
Keeping traffic low and under control makes a tidy network :-)
(I see what the reader is thinking: yes, we can implement VLANs…)

You don’t need to stream all the time, as you can use VAD to stream from KW hit till silence.
This also gives the option of using MQTT or another port to send the KW hit level to accompany the stream.

That way you can have distributed mics that do not broadcast all the time, and you can encode to AMR-WB to reduce bandwidth.

I wouldn’t want constantly streaming mics, but I have no problem with a sentence from KW hit till VAD silence or timeout.
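
The "stream from KW hit till VAD silence or timeout" logic is roughly the following. This sketch uses a simple RMS energy threshold as a stand-in for a real VAD, and the threshold, the silent-frame count and the frame limit are made-up values that would need tuning on real audio.

```python
def frames_to_stream(frames, hit_index, threshold=500.0, max_silent=10, max_frames=200):
    """Return the frames to send, starting at the keyword hit.

    Streaming stops after `max_silent` consecutive quiet frames (silence)
    or after `max_frames` frames in total (timeout), whichever comes first.
    Each frame is a non-empty list of integer samples.
    """
    def rms(frame):
        return (sum(s * s for s in frame) / len(frame)) ** 0.5

    sent = []
    silent_run = 0
    for frame in frames[hit_index:]:
        sent.append(frame)
        # Count consecutive quiet frames; any loud frame resets the count.
        silent_run = silent_run + 1 if rms(frame) < threshold else 0
        if silent_run >= max_silent or len(sent) >= max_frames:
            break
    return sent


# 5 quiet frames, keyword hit, 20 frames of speech, then the speaker stops.
loud = [1000] * 160
quiet = [0] * 160
frames = [quiet] * 5 + [loud] * 20 + [quiet] * 30
sent = frames_to_stream(frames, hit_index=5)
```

In this example 30 frames are sent (20 of speech plus the 10 quiet frames needed to decide it is silence), and nothing is transmitted before the hit or after the cut-off.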

I have a hunch you might see another instalment from atomic for a KWS on ESP32

Maybe have a look at my streamer for ideas:

Check also the rewrite branch, much cleaner already but stripped of some features (w.i.p.)