My experience so far

I’d like to share my experience over the last two months with Rhasspy in different configurations.

My main objective is to have a simple voice system (in the living room or kitchen) to control my home automation (via Home Assistant) and a simple shopping list (made custom by me).

I’ve known about Rhasspy for a long time but never had the occasion to use it.

I started with a simple Raspberry Pi 4 with a Jabra Speak 510 to test the installation, try the first configurations and understand how it works. All good.

All my tests were in a silent environment, so the wake word and the intents were recognized correctly. I found that the Home Assistant intent API is not available anymore, but I didn’t search for a solution since a simple automation via the REST API was good and simple enough.

Then I moved to a base/satellite configuration. The base is always a Rhasspy instance (Docker) on my central server (NUC i7).

For the Satellite I used 3 different configurations:

  1. Raspberry Pi 4 with both PS3 Eye and Jabra
  2. ESP32 with INMP441/MAX98357A (custom made)
  3. M5 Echo

For the last two setups I used the great ESP32-Rhasspy-Satellite made by Romkabouter (https://github.com/Romkabouter/ESP32-Rhasspy-Satellite/) with some small modifications of my own (I added a LED ring and a few other small things).

The Raspberry Pi 4 configuration was simple enough; I followed the official tutorial (Tutorials - Rhasspy).

I used the internal MQTT broker (in the Base instance) to avoid any interference with my main MQTT broker and have a dedicated one.

I think that a dedicated Raspberry Pi (especially these days, with the Raspberry Pi shortage) is quite oversized if used as a simple MQTT microphone.

So I moved to a simpler (hardware-wise) configuration using an ESP32 with an I2S mic and speaker: the response is good enough, but my tests were always in a silent environment. I have to test it in a louder room (I have 2 small children so…) and understand if the INMP441 (or maybe a dual-channel mic) is good enough as a “panoramic microphone”. Do you have any experience with this? I’m planning to build a 3D-printed enclosure; any tips for the microphone placement (top or bottom, orientation, …)?

The MAX98357A (with a small 3W speaker) was good enough for minimal sound feedback and speech (I tried some intents like “what time is it”).

I had a problem with Hermes MQTT: if the Base and the Satellite have different IDs, I didn’t find a way to redirect the sound output to the satellite, since the messages have different topics. On this topic I found this (Speech from Base to Satellite), where the solution was to mirror all the base MQTT messages to the satellite topic, but this won’t work if there is more than one satellite (all the satellites would play the message!). Do you have any thoughts on this? Maybe a dialogue machine?

The last test was the M5 Echo, but I think this device is good for educational purposes and completely useless in a real environment: the mic pickup range is very small and the speaker is very, very weak. Maybe with a battery, as a sort of personal device… don’t know. Anyway: it was fun to add this one to my electronic toys harem too lol

If any of you are interested in specific details please let me know. I wrote this just to share my thoughts and get some feedback from you.


The INMP441 are great low cost I2S mics.
What you can do is build a fixed broadside or endfire ‘beamformer’ with two of them to improve directionality. It can change the overall EQ, and the standard ESP32 runs short of clock cycles quite quickly with any complex DSP, but delay & sum is very low load, even if it is not the best method because of the EQ change.

https://github.com/42io/esp32_kws is a great repo for KWS on the ESP32, but eventually, when prices drop, the ESP32-S3 is likely to be preferable, as it can do SIMD vector instructions which give a huge performance increase even though its specs are fairly similar.

InvenSense do a great intro to basic delay-and-sum broadside / endfire arrays.
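For anyone who wants to play with the idea on a PC before touching the ESP32, here is a minimal delay-and-sum sketch in plain Python. It is only an illustration under assumptions: two mics, whole-sample delays, a speed of sound of 343 m/s, and function names (`steering_delay_samples`, `delay_and_sum`) that I made up for this post. It is not the code from any of the repos mentioned here.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for room temperature


def steering_delay_samples(spacing_m, angle_deg, sample_rate):
    """Whole-sample delay between two mics for a source at angle_deg.

    0 degrees is broadside (source perpendicular to the mic axis),
    90 degrees is endfire (source on the mic axis).
    """
    delay_s = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return round(delay_s * sample_rate)


def delay_and_sum(first_mic, second_mic, delay):
    """Delay the mic the sound reaches first, then average both channels.

    `delay` is a non-negative number of samples. Samples shifted in from
    before the start of the buffer are treated as silence.
    """
    if delay < 0:
        raise ValueError("delay must be >= 0; swap the channels instead")
    length = min(len(first_mic), len(second_mic))
    out = []
    for i in range(length):
        j = i - delay
        delayed = first_mic[j] if j >= 0 else 0.0
        out.append(0.5 * (delayed + second_mic[i]))
    return out


if __name__ == "__main__":
    # Endfire pair, 43 mm apart, 16 kHz audio.
    d = steering_delay_samples(0.043, 90, 16000)
    front = [0, 0, 1, 0, 0, 0, 0, 0]   # pulse reaches the front mic first
    rear = [0, 0, 0, 0, 1, 0, 0, 0]    # same pulse, d samples later
    print(d, delay_and_sum(front, rear, d))
```

Sound from the steered direction adds up coherently, while sound from other directions is partly cancelled. At 16 kHz a whole-sample delay is fairly coarse for small mic spacings, so a real implementation would probably need fractional delays or a higher sample rate.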

Pull Requests are always welcome :wink:

What does “the sound output” stand for in your case? If you use the siteId in your mqtt response, the output will be on the satellite. Can you give some more info on how you have this set up?

Have you read the wiki on this? Home · rhasspy/rhasspy Wiki · GitHub
It has pointers on how you can achieve what I think you want.

I think the main issue is that the mic input is very low. I have not yet found a fix, if there is any.

I saw the DSP and VHDL code used for the Matrix Voice. I don’t know if I really need to reach that kind of complexity, but it is something to keep in mind.

Thanks! I also found this, very similar (same mechanism on the kws part):

THIS is definitely something to look at, thanks!

Yep! :slight_smile: once tested :wink:
I made a new component that merges the INMP/MAX part with the blink/status part of the M5 Echo LED, and I’ll open a pull request once it’s ready!

The Satellite asks the Base “What time is it?”; the Base sends the TTS to a topic similar to hermes/audioServer/baseID/audioFrame (sorry, I don’t remember the correct topic and I’m at work now), but the satellite is subscribed to hermes/audioServer/satelliteID/audioFrame.
The only way to make it work is to change the baseID and the satelliteID to make them equal.
I need the Base to understand that the request came from satellite1 and publish the audio output on the topic to which satellite1 is subscribed.

Is it something related to this? I think that the small form factor can influence the overall performance… also the sound output is very poor.

I think you have some incorrect settings.
The base listens to the hermes/audioServer/<siteID>/audioFrame topics for every siteId you fill in (comma separated) in the “Satellite siteIds:” field in the various setting sections.
Make sure your base and sat have different siteIds.

Set the Audio Play to “Hermes MQTT”

Have a look at the DS (Delay sum) C/C++ code in my https://github.com/StuartIanNaylor/ProjectEars repo

It’s the opposite: I want the Satellite to play the TTS produced by the Base.
In the example, I ask the Satellite “What time is it?”. It sends the audio sentence to the Base, which triggers the automation (Home Assistant) that produces the answer. Home Assistant sends the time text to the Base for the TTS.
The Base publishes the produced TTS on the topic hermes/audioServer/Base/audioFrame.
But the Satellite is subscribed only to the topic hermes/audioServer/Satellite/audioFrame.
So no sound is produced!
It works only if the Satellite is subscribed to the hermes/audioServer/Base/audioFrame topic, hence the need to have the two identical IDs.

I’d like to find a way for the Base to publish the audio on the topic of the related Satellite, the one that triggered the request.

Thanks!
So maybe a Matrix Voice can be a faster solution? Do you have any experience with that?

I don’t know how Home Assistant handles that, but in general, any automation system should be able to address a response back via Rhasspy to the one satellite that was used as (voice) input.
Using the “base’s” resources to generate audio via its TTS system is possible, see once more Tutorials - Rhasspy, especially the graphic in “Shared MQTT Broker”.
As @romkabouter has already mentioned: you will have to put each satellite in the base’s list of satellites to serve with this specific service. Make sure each satellite has a unique name.

I also had some trouble at first because I was thinking of “publishing” as sending messages from one machine to another. But it’s actually easier than that.

Make sure your base and satellite machines are all given unique siteIds.

On your Base Rhasspy, under the settings for each service you want to call from your satellites, simply enter the satellites’ siteIds into the Satellite siteIds field.
This tells the base to listen for (and respond to) messages from any of these satellites.

That is it !! :slight_smile:

Behind the scenes

The satellite does NOT send any message to baseID - instead

  • The satellite publishes a message using the appropriate topic which includes its own satelliteID.
  • The base Rhasspy is subscribed to (listening to) messages for the satelliteID (and any other satellites you specified).
  • The base then publishes its response with a topic which includes the satelliteID.

Both satellite and base are publishing messages with the same siteID in the topic.
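To make the routing concrete, here is a small Python sketch of the idea (no MQTT client involved, just the topic logic). Everything named here is hypothetical and only for illustration: `site_id_from_topic`, `reply_topic_for` and the `served_site_ids` set are not Rhasspy internals. I also assume that playback goes out on `hermes/audioServer/<siteId>/playBytes/<requestId>`, which is my understanding of the Hermes protocol (the `audioFrame` topic mentioned earlier in the thread carries microphone audio), so please check that against the Rhasspy docs.

```python
def site_id_from_topic(topic):
    """Extract the siteId from a hermes/audioServer/<siteId>/... topic."""
    parts = topic.split("/")
    if len(parts) >= 4 and parts[0] == "hermes" and parts[1] == "audioServer":
        return parts[2]
    return None


def reply_topic_for(incoming_topic, served_site_ids, request_id):
    """Return the playback topic for the satellite that sent the audio.

    served_site_ids plays the role of the "Satellite siteIds" field on the
    base: audio from any other site is ignored (None is returned).
    """
    site_id = site_id_from_topic(incoming_topic)
    if site_id is None or site_id not in served_site_ids:
        return None
    # The reply reuses the satellite's own siteId, never the base's.
    return f"hermes/audioServer/{site_id}/playBytes/{request_id}"


if __name__ == "__main__":
    served = {"satellite1", "satellite2"}
    incoming = "hermes/audioServer/satellite1/audioFrame"
    print(reply_topic_for(incoming, served, "request-1"))
```

With two satellites in the list, each one only hears the replies to its own requests, which is why mirroring the base topics is not needed.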

How did you set up HA for this? That is most probably where your error is.

No, you need to set up HA correctly :wink:

Yes, this is described for various usages on the wiki

Reading the wiki, I made some mistakes… I’ll try a different configuration and retest… thanks!

On the hardware side, I’m kinda stuck for the satellite. I’m tempted to try the ESP32-LyraTD with the built-in FPGA but, even though it’s a two-year-old piece of hardware, I didn’t find any kind of test, experiment or blog post in the usual communities (like Hackaday). Do you think it’s worth a try? It’s cheaper than a Raspberry Pi lol.

Always, never stop tinkering!

I knew it was the wrong place to ask such a question lol
A Korvo1 or Korvo2 would be options too; all of them cost from 50 to 70€.
I first need to understand which one is best for my purposes.

The Raspberry Pi Zero 2, if it wasn’t for stock and scalpers, is only $15 and very hard to beat.
The standard ESP32 without PSRAM is very short of resources, and even with PSRAM it is still tight.
The newer ESP32-S3 has much more scope, but prices on those are much higher, and the esp32-box they made is, like the Zero, currently out of stock. I’m not sure what happened to the supposed esp32-box-lite release.

ESP-BOX… very cool piece of hardware.
It’s available on Mouser

It’s great that it’s a showcase of what you can do when pushing the newer ESP32-S3 chips to the max, but it contains quite a lot of Espressif closed-source blobs.
Personally I just want the ADC & ESP32-S3 on a low-cost dev board, as the rest is superfluous to my needs, but unfortunately that doesn’t exist.
I would check if Mouser is up to date, as I don’t think Espressif sees itself as a product supplier; it merely did some limited runs as a product showcase.

It was showcased, and we have the documented design along with software libs for building various low-cost wireless designs, but they would likely need to be fabbed. Still, try Mouser, as it may be in stock.

I have Rhasspy set to aplay and the sound comes out of the speaker plugged into the Matrix Voice. The broker is set to external. Here is an HA automation which, with help, I used to format the time and reply when the Matrix is asked “What is the time”.

service: mqtt.publish
data:
  topic: hermes/dialogueManager/endSession
  payload_template: |-
    {% if now().strftime("%M") | int == 0 %}
      {"sessionId": "{{ trigger.event.data._intent.sessionId }}", "text": "The time is {{ now().hour }} hundred "}
    {% elif now().strftime("%M") | int < 10 %}
      {"sessionId": "{{ trigger.event.data._intent.sessionId }}", "text": "The time is {{ now().hour }} oh {{ now().minute }} "}
    {% else %}
      {"sessionId": "{{ trigger.event.data._intent.sessionId }}", "text": "The time is {{ now().hour }} {{ now().minute }} "}
    {% endif %}
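
If it helps to see the same logic outside of Jinja, here is a rough Python equivalent of the template above. The function names (`spoken_time`, `end_session_payload`) are mine and purely illustrative; the key point is that the `sessionId` from the incoming intent is copied into the `hermes/dialogueManager/endSession` payload, so the reply is tied to the session (and therefore the satellite) that asked.

```python
import json
from datetime import datetime


def spoken_time(now):
    """Format the time the same way as the three template branches."""
    if now.minute == 0:
        return f"The time is {now.hour} hundred"
    if now.minute < 10:
        return f"The time is {now.hour} oh {now.minute}"
    return f"The time is {now.hour} {now.minute}"


def end_session_payload(session_id, now):
    """Build the JSON payload for hermes/dialogueManager/endSession."""
    return json.dumps({"sessionId": session_id, "text": spoken_time(now)})


if __name__ == "__main__":
    # "example-session" stands in for trigger.event.data._intent.sessionId
    print(end_session_payload("example-session", datetime.now()))
```

Building the payload with `json.dumps` also avoids hand-written quoting mistakes, which are easy to make inside a multi-line template.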

Thanks… you’re all right: my problem was in the automation part, I wasn’t handling the sessionId.

Anyway, I bought both the LyraTD and the ESP-BOX :slight_smile:

I’ll update the thread once the package arrives.


Cool, I thought everywhere was out of stock. I had stopped checking since I got myself an esp32-s3-box-lite.

I will have both when the lite arrives, as I already have the esp32-s3-box. The 4-channel ADC on the esp32-s3-box is hard to source; the lite version doesn’t have the dock and uses a 2-channel ADC, which is easy to source. I’ve been wondering if the firmware can be mangled to run on a standard esp32-s3 dev kit.
I can hack a bit of C and I am groaning at the idea, but I’m interested in whether there is any noticeable difference in recognition. So I can test that at least :slight_smile: