Proof of Concept working for open response intent handling

Hello All,

I’m relatively new to this community but have been diving in deep since I got here. My main goal for my entire home setup is to be completely offline capable. While my house (which includes the internal network, home automation, and everything that goes with that) has an internet connection, I want it to take advantage of those services, but I also want the house to keep functioning without one. To that end I have a full home server (soon to be expanded into a separate NAS and server setup), a locally managed wifi mesh, and Home Assistant to manage the home automation (slowly growing). Naturally, one of my next steps was voice interaction with it all. I’ve checked out several voice assistants over the years and recently came back around to Rhasspy. It fits me perfectly right now.

Within a week I had a base/satellite environment set up and connected to my Home Assistant instance, with several custom intent scripts handling various tasks for me. One of the first complications I came across, and it looks like several others have hit it too, was how to handle getting “random” response text back from the user. For example, the user may say “play a song” and want Rhasspy to ask “what song?”. Getting Rhasspy to recognize a song name that it wasn’t already trained to understand as an intent is difficult at best. Well, I’m pretty sure I have a working proof of concept for exactly this.

I am pretty sure it would be difficult to do in a single setup on a Pi. My setup involves passing almost all of the processing off to the base Rhasspy instance. I used the guide on the docs site for setting up a satellite, then went a few steps further. First I handed the MQTT processing off to the MQTT broker set up in HA. Then I set up a Mimic3 TTS server and pointed Rhasspy at that; currently the base points to Mimic3 and the satellite is set to Hermes MQTT. Not sure if that’s the “right” way, but it works well for me.

From there I started building my own skill based on the Rhasspy Hermes App library. Using a mix of the continue_session example app and a whole lot of trial and error, I was able to make a skill that asks me to say something, captures what I say, and then repeats it back to me (not by just playing back the WAV file, but by actually transcribing the speech to text and speaking it back). The problem at that point was it never really understood what I said: if I said “I’m saying something back” it would transcribe it into all sorts of garbage. But the session handling was working, mostly.

So the next step was to figure that out, and in comes Vosk. It took several attempts to get that working right; the standard setup within Rhasspy didn’t provide a complete enough model to understand everything. I messed with setting up my own Vosk server for a while and couldn’t quite get that right. In the end I realized all I needed to do was replace the model in the Rhasspy setup, so that’s what I did: pulled down the largest English model available and replaced the model in the Rhasspy profile. The next time I tried my skill it worked perfectly; it understood everything I said and repeated my words back exactly each time. It hasn’t been extensively tested yet, but so far things look really promising!!
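To give a rough idea of what the skill side looks like, here is a stripped-down sketch built on the Rhasspy Hermes App library. The intent name RepeatAfterMe and the prompt wording are just placeholders, and I’m assuming ContinueSession exposes the Hermes sendIntentNotRecognized flag and that the dialogue-level not-recognized decorator/message look like this; check the library docs for the exact signatures before copying it.

```python
# Rough sketch of the "repeat back what I say" proof of concept.
# Intent name, prompt text, and some signatures are assumptions (see above).
from rhasspyhermes.dialogue import DialogueIntentNotRecognized
from rhasspyhermes.nlu import NluIntent
from rhasspyhermes_app import ContinueSession, EndSession, HermesApp

app = HermesApp("RepeatApp")


@app.on_intent("RepeatAfterMe")
async def repeat_after_me(intent: NluIntent):
    """Triggered by a trained sentence such as 'repeat after me'."""
    return ContinueSession(
        text="Say something and I will repeat it back.",
        # Assumed to map to the Hermes sendIntentNotRecognized flag so the
        # free-form reply is forwarded to this app instead of ending the session.
        send_intent_not_recognized=True,
    )


@app.on_dialogue_intent_not_recognized
async def repeat_back(not_recognized: DialogueIntentNotRecognized):
    """The free-form reply lands here because no trained intent matched it."""
    return EndSession(f"You said: {not_recognized.input}")


app.run()
```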

So, things still to do/figure out:

  • Currently each response is recognized as a failed intent, so the error beep plays; I need to figure out how to handle that properly.
  • I need to turn this into a fully working feature. I’ll probably start with the play-a-song scenario, but I’ve also been playing with building a Grocy skill, so it may end up in a shopping list skill first.

I’d love to talk to people about this setup and see if anyone else has gotten things like this working, or hear any other thoughts and ideas.

I am on board and would be happy to collaborate with you to make this a reality.

I just got done writing a proof of concept that handles the wake word with a TensorFlow model and a minimal satellite with a YAML configuration. I have found Rhasspy’s capabilities to be rather limited, but they are at a good spot to build on.

The text-to-speech side works great, but the speech-to-text is lacking. (It’s possible this is a lack of understanding on my side; I only found and started using Rhasspy a few weeks ago.)

Ok, first let me start by saying I’m not sure I’m as advanced as you, so you may already be way ahead of me in some places. I’ve taken a look at the wake word handling and I think I could make something simple, but it would take some time to get up to what you’ve built. Still, I’m happy to have anyone take a look at what I am working on and help where possible.

I just uploaded the test file I’ve been playing with to my Grocy app repo.

It really is just a basic example of getting Rhasspy to repeat back what I say to it, which has been working in my setup. As mentioned in the first post, the setup is key to making this work. The intent code generally worked fine even before I switched to Vosk for speech-to-text; it would respond, it just never got the words right. After I switched to Vosk running the full model (that is key), it responds perfectly every time.

Pretty sure I’ve decided my first implementation of this is going to be in the Grocy app. Maybe creating new chores, or tasks, or something.

Nice,

Don’t worry about this; we are all here to help each other out.

I will review your code a little later tonight. I also had an idea along these lines, but it may require something out of band from what is currently supported by Rhasspy. Maybe if we prototype something they will incorporate it into the server side.

Ok I went through your example tonight; I see what you put together. What is your idea for handling the response?

I have code handling the response currently. The most recent version uses the on_dialogue_intent_not_recognized function decorator to handle it. Currently all it does is take the spoken text, repeat it back, and then ask for a new phrase. It repeats that until you say “No”, and then it stops responding. This seems to be working fine, except that the phrase I speak back is not recognized as an intent, so the satellite plays the unrecognized-intent sound even though my Rhasspy app handles the response and continues the session successfully. I think that is just a setup thing on the satellite. But I’m thinking that if I just shut off that sound in the config, I would want to handle unrecognized intents some other way to let the user know something wasn’t recognized. So maybe a custom unrecognized handler?
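For the curious, the repeat-until-“No” handler is roughly the following. Same caveats as the earlier sketch: the decorator name comes straight from the library, but the exact message fields and whether the handler can return ContinueSession like a normal intent handler are assumptions on my part, so treat this as an illustration rather than the literal code in my repo.

```python
# Rough sketch of the repeat-until-"No" response handler (assumed signatures).
from rhasspyhermes.dialogue import DialogueIntentNotRecognized
from rhasspyhermes_app import ContinueSession, EndSession, HermesApp

app = HermesApp("RepeatApp")


@app.on_dialogue_intent_not_recognized
async def handle_response(not_recognized: DialogueIntentNotRecognized):
    """Repeat whatever was said and ask again, until the user says 'no'."""
    text = (not_recognized.input or "").strip().lower()
    if text in ("no", "no thanks", "stop"):
        return EndSession("Ok, stopping.")
    return ContinueSession(
        text=f"You said {text}. Say something else, or say no to stop.",
        send_intent_not_recognized=True,  # keep routing free-form replies back here
    )


app.run()
```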

Do you use node red or home assistant?

I use home assistant mainly.

But lately I’ve been of the mind that certain integrations will work better as standalone apps, so I’ve been liking the path of Hermes MQTT apps for some things; currently I’m thinking of using this for a Jellyfin control app and the Grocy integration I’ve been working on. My thought here is that the integration HA provides for these services is not complete and will probably never be as complete as full access to the app’s own API. For example, with Jellyfin I couldn’t find a way to get a list of songs/media out of HA; it seems the media stuff hasn’t been added to the API yet? I don’t know, I couldn’t find it. But I do have a working slot program that pulls a list of songs from the Jellyfin API directly.
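In case it’s useful, the slot program itself is nothing fancy: Rhasspy runs the executable and reads one slot value per line from stdout. Something along these lines, where JELLYFIN_URL and JELLYFIN_TOKEN are placeholders for your server, and the /Items endpoint, query parameters, and X-Emby-Token header are my reading of the Jellyfin API, so verify them against your Jellyfin version:

```python
#!/usr/bin/env python3
# Sketch of a Rhasspy slot program that prints Jellyfin song names, one per line.
import os

import requests

JELLYFIN_URL = os.environ.get("JELLYFIN_URL", "http://localhost:8096")  # placeholder
JELLYFIN_TOKEN = os.environ.get("JELLYFIN_TOKEN", "")  # API key from the Jellyfin dashboard


def main() -> None:
    resp = requests.get(
        f"{JELLYFIN_URL}/Items",
        params={"IncludeItemTypes": "Audio", "Recursive": "true"},
        headers={"X-Emby-Token": JELLYFIN_TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json().get("Items", []):
        name = item.get("Name", "").strip()
        if name:
            print(name)  # Rhasspy reads one slot value per line from stdout


if __name__ == "__main__":
    main()
```

Drop it in the profile’s slot_programs directory, make it executable, and reference it in your sentences like a regular slot.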

All that being said, I do have intent scripts in my HA config, and I do intend to expand on them, though lately my focus has been on the Hermes app stuff. I think there are things HA should do: obviously all the IoT control, and I think I’ll end up using it for communication with various integrated web services too, like weather, things the home automation will need anyway. I’ve built a Hermes time app and I’m working on a better timer app as well, but I keep debating whether that was the right way to go or if I should just let HA handle it with intent scripts. I’m thinking that way I’d have multiple “interfaces” into the same “skill”, so if I want to set a timer from my tablet I can do it in the HA interface, or if I want to yell at Rhasspy I can do it that way.

Lots to think about. But my current focus is going to be figuring out this unrecognized intent thing and the Grocy Hermes app. I am going to try to work this open response concept into it. Perhaps a “workflow” to add a new product to Grocy?

Hey all, I uploaded my first real-world use of this concept to my Grocy skill repo today. I implemented an intent for creating new shopping lists; it lets you say a sentence like “create a new shopping list named groceries” (replace “groceries” with whatever word you want), and it will create a shopping list with that name, in this case “Groceries”.
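From memory, the core of it looks roughly like the sketch below; the repo is the source of truth if anything here doesn’t line up. In particular, the intent name, the way the name is pulled out of the raw transcription after “named”, and the Grocy /api/objects/shopping_lists endpoint with the GROCY-API-KEY header are a simplified retelling, and GROCY_URL / GROCY_API_KEY are placeholders for your own instance.

```python
# Simplified sketch of the "create a new shopping list named ..." intent.
import os

import requests
from rhasspyhermes.nlu import NluIntent
from rhasspyhermes_app import EndSession, HermesApp

GROCY_URL = os.environ.get("GROCY_URL", "http://localhost:9192")  # placeholder
GROCY_API_KEY = os.environ.get("GROCY_API_KEY", "")  # placeholder

app = HermesApp("GrocyApp")


@app.on_intent("CreateShoppingList")
async def create_shopping_list(intent: NluIntent):
    """Use whatever was spoken after 'named' as the new list's name."""
    raw = (intent.raw_input or intent.input or "").lower()
    name = raw.split("named", 1)[-1].strip().title() or "Shopping List"

    # Grocy's generic object endpoint; assumed to accept a bare {"name": ...} body.
    resp = requests.post(
        f"{GROCY_URL}/api/objects/shopping_lists",
        headers={"GROCY-API-KEY": GROCY_API_KEY},
        json={"name": name},
        timeout=10,
    )
    resp.raise_for_status()
    return EndSession(f"Created a new shopping list named {name}.")


app.run()
```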


Hi @JoeSherman, thanks a lot for sharing this!

Could you please provide a detailed step-by-step guide for a newbie like me, please!?
Thanks a lot man!

@giof94is

Sorry for the delays, the real world has been keeping me busy. :stuck_out_tongue:

What part of this are you looking for details on? Setting up Rhasspy with the full language model? Or setting up the skill?

Hi @JoeSherman, in this case I can say “the more the better” ahahah

Thanks for whatever you will provide…

Ok, I’ll try to sit down this weekend and document the Vosk setup with the full language model. Then I should be able to use the detailed page from my other skill as a base for a detailed page on this one.