A few ideas for rhasspy (that I'd like opinions on)

Evening,

while working with rhasspy I occasionally come across a few things that I think might be good ideas, but I am not sure if they really are, so I’d like some feedback on them before I consider turning them into more official feature requests. I can see how they might be beneficial, but it might only be my personal use case scenario.

1. Having an option, built into rhasspy, to save wakewords (and maybe also requests to rhasspy) for further training. I know raven can save wakewords, but raven is not as good as my somewhat trained precise model, even though the audio data it saves could be used for further training.

I did write a python script that buffers audio and saves the last 10 or so seconds once rhasspy detects the wakeword, but I had to switch audio over to mqtt instead of udp, which works but is not preferable for me. Since rhasspy needs to buffer audio anyway, I would think it should not be hard to include something like this for wakewords, and it would help everyone collect wakeword samples. An opt-in in the options (with an option to set how many seconds should be saved) would be best for this, I think.
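
To make the buffering idea concrete, here is a rough sketch of a rolling buffer that keeps only the last few seconds of audio and writes them to a WAV file on demand. It is not my actual script and not rhasspy code: the class name `RollingAudioBuffer` is made up, the audio is assumed to be raw 16 kHz, 16-bit mono PCM, and feeding chunks in from mqtt or udp as well as calling `save()` on wakeword detection is left out.

```python
import collections
import wave


class RollingAudioBuffer:
    """Keeps only the most recent `seconds` of raw PCM audio (hypothetical helper)."""

    def __init__(self, seconds=10.0, sample_rate=16000, sample_width=2, channels=1):
        # Assumed format: 16 kHz, 16-bit, mono unless told otherwise.
        self.sample_rate = sample_rate
        self.sample_width = sample_width
        self.channels = channels
        self.max_bytes = int(seconds * sample_rate) * sample_width * channels
        self._chunks = collections.deque()
        self._size = 0

    def add(self, chunk: bytes) -> None:
        """Append a chunk and drop the oldest audio beyond the limit.

        Chunks are assumed to contain whole frames, so trimming stays aligned.
        """
        self._chunks.append(chunk)
        self._size += len(chunk)
        while self._size > self.max_bytes:
            excess = self._size - self.max_bytes
            oldest = self._chunks[0]
            if len(oldest) <= excess:
                # The whole oldest chunk is too old: drop it.
                self._chunks.popleft()
                self._size -= len(oldest)
            else:
                # Only the start of the oldest chunk is too old: trim it.
                self._chunks[0] = oldest[excess:]
                self._size -= excess

    def save(self, path) -> None:
        """Write the buffered audio to a WAV file."""
        with wave.open(str(path), "wb") as wav:
            wav.setnchannels(self.channels)
            wav.setsampwidth(self.sample_width)
            wav.setframerate(self.sample_rate)
            wav.writeframes(b"".join(self._chunks))
```

The idea would be to call `add()` for every incoming audio chunk and `save()` once a wakeword detection comes in, so the file contains the seconds leading up to the detection.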

As for audio of requests, I think there should be an option to save those as well (maybe only for a limited time), along with the transcriptions. The latter would help greatly for debugging purposes when rhasspy really does not want to understand you. I realize the web frontend can be used for this, but I have a hard time reproducing the exact pronunciation, and it happens most often at times when I just don’t have the time to debug. As for the audio itself, besides the benefit of being able to compare audio with the transcript, it would also generate tons of not-wakeword data for wakeword training. Since I know this would end up using lots of space, I think an opt-in in the config as well as a setting for how long the data should be kept would be optimal, though this would mean rhasspy has to take the cleanup on itself, which might be a bit outside of what it is supposed to do. An alternative would be to clean up the data via a script run by a cronjob that the user has to write themselves, or we could come up with a community script that might make it into the documentation as a mention.
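
As a starting point for such a community script, a minimal sketch could look like the following. The function name, the `*.wav` pattern and the 7 day default are all made up for illustration, and nothing here relies on how rhasspy would actually store its recordings.

```python
import time
from pathlib import Path


def remove_old_recordings(directory, max_age_days=7.0, pattern="*.wav", now=None):
    """Delete files matching `pattern` that are older than `max_age_days`.

    Returns the list of removed paths. `now` can be passed in for testing.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for path in sorted(Path(directory).glob(pattern)):
        # Age is judged by the file's modification time.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

A cronjob could then simply call this once a day on whatever directory the recordings end up in.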

2. Intent combinations. I think someone asked about something similar a few weeks back, a way to tell rhasspy to do two things in the same sentence. The best result of that discussion was to have separate intents for different combinations, which would do the trick for a limited scope but is far from ideal. Another way suggested was having rhasspy ask questions, but that makes it needlessly long if you just want to turn off two lights. I thought of something different and I would love feedback on whether it could work.

As far as I am aware, each language has words used to combine two parts of a sentence in some way or another. For English I can think of “and” and “as well” as examples. I can’t see much need for those words to turn up in requests for rhasspy most of the time. Sure, I could have an intent to “turn on the tv and the playstation”, but I think what I am about to propose might reduce the need for those intents in most cases.

What if we had a list of all those combination words, and when rhasspy recognizes one, it splits the sentence into two intents: what was said before the word and what was said after. So if I said “Turn the light on and tell me the weather”, rhasspy would recognize the “and” and send two intents out instead of one. I think this would be a step in the right direction for stating multiple requests in as short a way as possible.

To further help with this idea, if it should be implemented, I think we need some way to mark which intents this should include or exclude, because there might be cases where you need to be able to include the word “and” without having to turn the whole system off. It would also need some thought on how to handle multiple intents per session, especially if all of them have tts output. The tts part might need to use the same combination words to combine the answers. The best I can think of would be using indexes for the intents, starting with zero and counting up for every combined intent found. Then rhasspy can collect tts output (with a configurable timeout period in case some script or other takes longer, so everyone could choose how long he or she is willing to wait for an answer) and then speak what was collected.
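
To illustrate the collecting part, here is a small sketch of gathering indexed answers until either all of them have arrived or the timeout runs out, then joining them with a combination word. Everything here is made up for illustration: the function names, the use of a plain `queue.Queue` as the place where answers arrive, and the `(index, text)` format. It says nothing about how rhasspy handles tts internally.

```python
import queue
import time


def collect_answers(answer_queue, expected, timeout=5.0):
    """Gather (index, text) answers until all arrived or the timeout expires.

    Returns the texts sorted by intent index; answers that are late are left out.
    """
    deadline = time.monotonic() + timeout
    answers = {}
    while len(answers) < expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            index, text = answer_queue.get(timeout=remaining)
        except queue.Empty:
            break
        answers[index] = text
    return [answers[i] for i in sorted(answers)]


def join_answers(texts, joiner="and"):
    """Combine several answers into one sentence using a combination word."""
    if len(texts) <= 1:
        return "".join(texts)
    return ", ".join(texts[:-1]) + f" {joiner} " + texts[-1]
```

Because the answers carry their index, it does not matter in which order the scripts respond; the spoken order still follows the order of the original sentence.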

A point I am not sure about is performance. I think two combined intents should not slow rhasspy down by much, if at all, but it might get worse the more parts are added. I might just not be thinking of it the right way, though. Without knowing the ins and outs of rhasspy, here is one way I can think of to get this working, including a filter of some kind for intents:

  1. Transcribe the text as normal

  2. Check for combining words against a list of them

  3. Split the sentence at the first find

  4. Save the part in front of the combining word in list, array, whatever, just preserve the order

  5. Repeat from two onwards until no combining word is found (It could be done with a single split in most programming languages, pretty sure in python also, but that would limit the words to split on to one and it would make for more natural language recognition if it can be a list)

  6. Send each part on to the intent recognition, while somehow saving the order as index

  7. Check if any of the returned intents are on the blacklist filter (or if all of them are on the whitelist)

  8. If nothing was on the blacklist (or everything on the whitelist) send the intents out to mqtt (and wherever else they need to go, hass for example) while saving the id. If something came up with the filters, just send the whole sentence to intent recognition and proceed as it is now

  9. Wait for however long is configured for the answers from all intents, then send the answers that came in to tts, in the order the intents appeared in the sentence, either separately or combined into one sentence with combination words (depending on the number of requests in a sentence and the length of the answers, this might be a bad idea; maybe add another filter for intents whose answers can be combined and those whose answers can’t. Combining a lengthy weather report with the feedback from having turned multiple lights on or off and the news into one sentence might be really bad)
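
Steps 2 to 8 could be sketched roughly like this. This is only an illustration of the idea, not something that plugs into rhasspy: the word list is an example, `recognize` is a stand-in for whatever intent recognition is in use (assumed here to return an intent name, or `None` if nothing matched), and only the blacklist variant of the filter is shown.

```python
import re

# Example list only; it would be defined per language and editable by the user.
COMBINATION_WORDS = ["and then", "as well as", "and"]


def split_request(text, words=COMBINATION_WORDS):
    """Split a transcription at every combination word, keeping the order.

    Returns a list of (index, part) tuples.
    """
    # Longest words first so that "and then" wins over plain "and".
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    # Require whitespace on both sides so words like "sand" are not split.
    pattern = re.compile(rf"\s+(?:{alternatives})\s+", re.IGNORECASE)
    parts = [part.strip() for part in pattern.split(text) if part.strip()]
    return list(enumerate(parts))


def recognize_combined(text, recognize, blacklist=frozenset()):
    """Recognize each part separately, or fall back to the whole sentence.

    `recognize` is a hypothetical callable: text -> intent name or None.
    """
    parts = split_request(text, COMBINATION_WORDS)
    if len(parts) > 1:
        intents = [(index, recognize(part)) for index, part in parts]
        # Only use the split result if every part matched an allowed intent.
        if all(name is not None and name not in blacklist for _, name in intents):
            return intents
    # Fallback: treat the sentence as a single request, as it works now.
    return [(0, recognize(text))]
```

With this, “Turn the light on and tell me the weather” is split into two indexed parts, while “add milk, butter and cheese to the shopping list” falls back to a single intent as long as the fragments are not recognized on their own or the shopping list intent is on the blacklist.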

I am not sure if what I describe here is feasible at all for rhasspy, especially on older pis, but I personally can’t think of much that might be a problem. If implemented, the whole system should be turned off by default, and the user should have a way to add words to or remove words from the list for the language in use. This way, some words could still be used in one intent without triggering the system. Also, the filter for which intents to use this on, either as a whitelist or a blacklist, would help with this, because if all I want is to change the state of multiple lights, or add multiple items to a shopping list, a single sentence that can fit multiple words is still better. “add milk, butter and cheese to the shopping list” instead of “add milk to the shopping list and add butter to the shopping list and add cheese to the shopping list” works much better if it is just multiple fields for one intent. Less repetition.

We would also need to come up with a way to use this with intents that start conversations, for example when a relevant slot is left out. I myself have never done this, but I have read here that other people do, and we would need either a way for rhasspy to ask all outstanding questions in order (and make clear which intent they are for, preferably in somewhat natural language) or to exclude conversation-starting intents from the system.

I do hope to hear feedback on both ideas, or either one,
Daenara

Hi @Daenara, thanks for the ideas :)

This should be pretty straightforward to do within the wake word and ASR services. I think it would make sense to add this functionality to a shared library so that it can be configured in one place in the web UI – maybe just a checkbox to enable wake/ASR audio to be saved, a target directory, and maybe some kind of disk space limit?

This one is more complicated, depending on how much “magic” there is (versus user configuration). The first problem is that ASR needs to know all of the possible sentences up front; not all written out, but in the graph form stored inside intent_graph.pickle.gz. Without this, transcriptions will never contain the second half of the combination (unless you do open transcription, of course).

Maybe we could just re-use the rule syntax and have users describe intent combinations?

[Intent1]
...

[Intent2]
...

[ComboIntent]
<Intent1> (and | then) <Intent2> [as well]

I could probably figure out an easy way to do this in fsticuffs, but it wouldn’t work with all of Rhasspy’s supported intent recognizers.

Something like that is what I had in mind, just not user configured. I thought of a list of “combination words” that are defined per language (and can be edited if needed) and then some kind of system that combines intents with this list. Having everything user defined can end up with pretty long lists, especially if quite a few intents should be combinable and in different orders as well.

How about an opt in (or out) system, where users can list all intents that should be combinable (or not combinable) and something to autogenerate those sentences in all orders?

So if I have intents 1, 2, 3 and 4, it would be pretty annoying having to write out a sentence for the combinations “1 and 2”, “2 and 1”, “1 and 3”, “3 and 1” and so forth. So if a system that just splits on “combination words” and then sends both parts through the ASR system is not possible, having a list of some kind of “combinable intents” and then something autogenerating those combinations as one of the first training steps could be possible. That way, they will still be trained into the graph, but users don’t have to write out all combinations.
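
Just to show what I mean by autogenerating, here is a small sketch that takes a list of combinable intent names and writes out every ordered pair in the combo rule form suggested above. The function name is made up, and the `<Intent1> (and | then) <Intent2>` form is only the syntax proposed in this thread, not something rhasspy necessarily accepts today.

```python
from itertools import permutations


def combo_rules(intents, joiners=("and", "then"), section="ComboIntent"):
    """Generate one rule line per ordered pair of combinable intents."""
    joiner_group = "(" + " | ".join(joiners) + ")"
    lines = [f"[{section}]"]
    # permutations(..., 2) yields both "1 and 2" and "2 and 1".
    for first, second in permutations(intents, 2):
        lines.append(f"<{first}> {joiner_group} <{second}>")
    return "\n".join(lines)
```

For four combinable intents this produces twelve rule lines, which is exactly the kind of list nobody wants to maintain by hand.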

I think working with the recommended intent recognizer should be enough, at least until someone can come up with a compelling reason why that one can’t be used in some scenario where combining intents is also a must. As far as I know, all intent recognizers do roughly the same thing for most use cases, so locking a feature down to one specific recognizer shouldn’t be that bad unless other features are locked behind other intent recognizers.