Add context setting to intents - easy building dialogues (PROBLEM and SOLUTION)

tigrat · February 28, 2021, 10:01am

Dear community and Rhasspy developers,

This time I would like to propose a feature that could make your product far superior, in results and usability, to any other software available in this field.

As an introduction, let me explain the source of the problem. We all know that direct voice commanding is the stone age of the home automation universe. At this stage we all want to communicate with our assistants as if they were living human beings. To implement this we usually build dialogues, so that one can set the lights or the climate in the form of a conversation. The need for this comes from a simple fact: we are human, our memory is not perfect, and none of us can keep all the commands we define in our heads; our family members cannot know which settings are available either.
So normally I start a conversation like this, for example:

Rhasspy, let’s do the lighting.
OK, where do you want me to set the lighting? No room was mentioned.
What rooms are there on the first floor?
There are rooms r1, r2… ri on the first floor.
What room am I in?
The name of your room is…
and so on. It is easy to teach a voice assistant to work out the location, the lights available in the room and the scenes available for those lights, so that it can tell you what is possible and you can decide on your action. So far Rhasspy does well!

But here comes the pitfall: it does not set a context for the incoming request. Let me explain. Say you have a light scene named ‘warm’, and you also want to set a temperature to make the room warm. When you give your command, every now and then Rhasspy will return the wrong intent: heating when you are talking about lighting, and lighting when you are talking about heating…

SOLUTION:
To avoid this, you need to introduce a ‘context setting command’: in your sentences file, among your other sentences, you have one (or several) marked with some flag that tells Rhasspy these are context setters. For example:
[context_setter]#context
(Set the context. let’s do the):doing (heating | lighting | climate | music)

So, when this command comes to Rhasspy, it remembers the exact outcome. Suppose you asked about heating, so it remembers: “doing heating”.

Then, as long as the context is set, it automatically ADDs “doing heating” to each recognised text and passes the modified string to the intent recognition module. For example, the command “make it warm in the livingroom” would be passed to Kaldi as “doing heating, make it warm in the livingroom”. This way you would never receive a wrong intent back!

Then it is enough to say “release the context” (or any other phrase you’ve defined for releasing the context) for Rhasspy to stop applying the context.

If you have forgotten to release the context, Rhasspy would return JSON with the intent “no intent recognised for that context”, so you know that you forgot to release the context and have to do it.

And one more important thing: it remembers the context for EVERY SATELLITE, so it would return context-bound ‘heating’ intents to the living room and context-bound ‘climate’ intents to the bedroom.
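The mechanism described above can be sketched outside Rhasspy, for example in a small script sitting between speech recognition and intent recognition. This is only an illustration of the proposal, not an existing Rhasspy feature: the class name `ContextTracker`, the trigger phrases and the list of contexts are all assumptions made for this example.

```python
from typing import Dict, Optional

# Hypothetical phrases; in the proposal these would come from the sentences file.
SET_PREFIX = "let's do the "
RELEASE_PHRASE = "release the context"
KNOWN_CONTEXTS = {"heating", "lighting", "climate", "music"}


class ContextTracker:
    """Remembers one active context per satellite (keyed by siteId)."""

    def __init__(self) -> None:
        self._active: Dict[str, str] = {}

    def process(self, site_id: str, text: str) -> Optional[str]:
        """Return the text to hand to intent recognition.

        Returns None when the utterance was a context command and has
        already been consumed here.
        """
        normalized = text.strip().lower()
        if normalized == RELEASE_PHRASE:
            self._active.pop(site_id, None)
            return None
        if normalized.startswith(SET_PREFIX):
            topic = normalized[len(SET_PREFIX):]
            if topic in KNOWN_CONTEXTS:
                self._active[site_id] = topic
                return None
        context = self._active.get(site_id)
        if context is None:
            return text  # no context set: pass the text through unchanged
        return f"doing {context}, {text}"
```

With this sketch, "make it warm" spoken at a satellite whose context is heating becomes "doing heating, make it warm", while the same words from a satellite without a context pass through unchanged.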

This feature is simple and quick to implement, since it requires no coding beyond what is already in Rhasspy; you only need to rearrange a bit of what you already have. It could make Rhasspy stand out from the many voice assistant packages, because it would make building dialogues with 99.(9)% accuracy so easy.

Please feel free to comment on my idea; hopefully we can see it implemented in the next release already!

Thanks for your time!

KiboOst · March 4, 2021, 11:03am

Context is a big subject of its own and may be what differentiates AI from humans…

But regarding your example, why ask which light, where it is, etc.? If no room is specified, you know which device you are talking to, and therefore where you are: the master in the living room, a satellite in the bedroom, etc.

For example, if I have no house_room slot, I take siteId from the payload.

If I just say “Turn on”, I have the siteId to know where I am. Then in Jeedom it triggers a scenario to turn on things (lights, music, hoover, heating and so on), and if nothing is found, it starts the scenario to turn on lights, deriving the house_room slot from the siteId to turn on the lights there.
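A minimal sketch of this fallback, assuming the intent payload carries a `siteId` and a `slots` list whose entries have `slotName` and a nested `value.value`. The mapping `SITE_TO_ROOM` and the function name are hypothetical, and the slot layout should be checked against your own payloads.

```python
from typing import Any, Dict, Optional

# Hypothetical mapping from satellite siteId to room name.
SITE_TO_ROOM: Dict[str, str] = {
    "master": "livingroom",
    "sat-bedroom": "bedroom",
}


def resolve_room(intent_payload: Dict[str, Any]) -> Optional[str]:
    """Prefer an explicit house_room slot, otherwise derive the room from siteId."""
    for slot in intent_payload.get("slots") or []:
        if slot.get("slotName") == "house_room":
            value = slot.get("value") or {}
            return value.get("value")
    # No room was spoken: fall back to where the request came from.
    return SITE_TO_ROOM.get(intent_payload.get("siteId", ""))
```

An unknown siteId yields None, so the handler can still decide to ask the user instead of guessing.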

Very easy and convenient setup without fancy stuff.

rejoe2 · April 27, 2021, 11:58am

Hi there,

first, I really appreciate the idea of building some kind of dialogue logic with Rhasspy.

Very often, simply linking a specific siteId to a specific room is a simple and effective way to get an appropriate result. Unfortunately, in quite a few cases this does not lead to an accurate result. That happens in at least two cases:

If there’s no device matching the intent, one has to decide (in the intent handling logic) whether nothing should happen or whether some other device should be addressed (which might be OK if there’s only one);
but what if there is more than one light device in the room and the intent is meant to switch just one of them, not the entire group? Which one should be switched? Or: no matching device in the room, but several outside?
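The two cases can be written down as a small decision function. This is a sketch under assumptions: the `Device` record, the function name and the wording of the questions are made up for illustration and do not come from FHEM or Rhasspy.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Device:
    name: str
    room: str
    kind: str  # e.g. "light", "media", "thermostat"


def choose_device(
    devices: List[Device], kind: str, room: str
) -> Tuple[Optional[Device], Optional[str]]:
    """Return (device, None) if the choice is unambiguous, else (None, question)."""
    in_room = [d for d in devices if d.kind == kind and d.room == room]
    if len(in_room) == 1:
        return in_room[0], None
    if len(in_room) > 1:
        names = ", ".join(d.name for d in in_room)
        return None, f"Which {kind} do you mean: {names}?"
    # Nothing in this room: look outside it.
    elsewhere = [d for d in devices if d.kind == kind]
    if len(elsewhere) == 1:
        return elsewhere[0], None
    if elsewhere:
        rooms = ", ".join(sorted({d.room for d in elsewhere}))
        return None, f"There is no {kind} here. Which room: {rooms}?"
    return None, f"I could not find any {kind}."
```

Whether a single device outside the room should be addressed silently is exactly the judgment call mentioned above; the sketch picks "yes", but returning a question there would be just as defensible.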

According to https://rhasspy.readthedocs.io/en/latest/reference/#dialogue-manager, one can at least use the intentFilter key to switch specific intents on or off. This is what I already use, to some extent, to ask for confirmation (or cancellation).
But how about limiting the possible answers with respect to the recognized words, e.g. to a subset of the already trained words?
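For reference, a sketch of how such a confirmation request with intentFilter could be assembled. The topic and the keys `sessionId`, `siteId`, `text` and `intentFilter` are the ones visible in the continueSession message quoted later in this thread; the function name is hypothetical, and actually publishing the message over MQTT is left out.

```python
import json
from typing import List, Tuple

CONTINUE_TOPIC = "hermes/dialogueManager/continueSession"


def build_continue_session(
    session_id: str, site_id: str, text: str, allowed_intents: List[str]
) -> Tuple[str, str]:
    """Return (topic, JSON payload) asking a question and limiting the answers."""
    payload = {
        "sessionId": session_id,
        "siteId": site_id,
        "text": text,  # spoken to the user before listening again
        "intentFilter": allowed_intents,  # only these intents are accepted next
    }
    return CONTINUE_TOPIC, json.dumps(payload, ensure_ascii=False)
```

Note that this restricts whole intents only; it does not narrow the words accepted inside an intent, which is the open question above.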

What is possible is a “slot injection” to change arbitrary slot content. But this would require retraining afterwards, right? (That would take quite some time and thus does not seem to be a realistic option.)

Is there any other option to build such a mechanism, or are there other aspects I might have missed while searching?
Or would this even be counterproductive, and would the best option be to just allow “unlimited answers” within, let’s say, all keywords that are possible as a choice?

Thanks for reading and for any hints in the right direction.

koan · April 27, 2021, 7:02pm

Can’t you use the custom data field in intents for at least a part of what you want? For instance:

rejoe2 · April 28, 2021, 8:19am

Thanks for the quick reply!
Using the custom data field indeed might be a good option to at least check newly received data against original options. As my problem atm is restricted to device selection (within a specific room) or room selection (if device cannot be identified via siteId->room), just adding two intents with all possible options might already lead to good results.
I’ll report back, but most likely, this will take quite some time.

rejoe2 · May 3, 2021, 10:24am

OK, so here are some first remarks on using “custom_data”.

First: in the JSON objects the respective element seems to be named “customData”, so after my first attempts with “custom_data” I changed the label.

Unfortunately, in both cases the “text” element was no longer spoken, so atm I use a different solution and store the JSON data from the first message on the home automation side.
This was enough to meet the basic requirement (requesting further info from the user).

So I’d be glad if anyone could explain why customData breaks TTS.

The second thing I found is that changing slot data (or activating/deactivating intents) at runtime seems to be too slow in Rhasspy atm; at least that is my personal understanding of

Re-training is fast enough to be done at runtime (usually < 5s), even up to millions of possible voice commands. This means you can change referenced slot values or add/remove intents on the fly.

(excerpt from sister project https://voice2json.org/, “Unique Features”)

koan · May 3, 2021, 12:36pm

Yes, you can find the JSON equivalent for all Hermes messages here:

I don’t understand why the text isn’t spoken in your situation. The example app that I posted above works for me. Could you give some more details about the exact messages you sent?

I haven’t tried changing slot data on-the-fly yet, do you have some numbers about your dataset? How many intents, sentences, …?

rejoe2 · May 4, 2021, 6:00am

As far as my notes go, the non-working “dialogue” looked like this on the MQTT side:

hermes/intent/de.fhem:Shortcuts {"input": "ton aus", "intent": {"intentName": "de.fhem:Shortcuts", "confidenceScore": 1.0}, "siteId": "motox", "id": null, "slots": [], "sessionId": "e86def78-beb3-81bc-3b99-750a2d53a257", "customData": null, "asrTokens": [[{"value": "ton", "confidence": 1.0, "rangeStart": 0, "rangeEnd": 3, "time": null}, {"value": "aus", "confidence": 1.0, "rangeStart": 4, "rangeEnd": 7, "time": null}]], "asrConfidence": null, "rawInput": "ton aus", "wakewordId": null, "lang": null}
hermes/dialogueManager/continueSession {"customData":{".ENABLED":["ConfirmAction","CancelAction"],"customData":null,"input":"ton aus","intent":"Shortcuts","probability":1,"rawInput":"ton aus","requestType":"voice","sessionId":"e86def78-beb3-81bc-3b99-750a2d53a257","siteId":"motox"},"intentFilter":["de.fhem:ConfirmAction","de.fhem:CancelAction"],"sessionId":"e86def78-beb3-81bc-3b99-750a2d53a257","siteId":"motox","text":"soll ich wirklich den verstärker stumm schalten"}

My setup atm basically consists of Rhasspy (deb install) on the same server as my home automation system (FHEM, no sound hardware attached) and a mobile-phone satellite for sound input and output, with some 15 devices prepared to interact with Rhasspy (it’s more of a testing setup atm). Some of the devices have several names in Rhasspy; the number of intents is also around 20. My (not yet optimized) main sentence file:

[de.fhem:SetNumeric]
\[ ( schalt | mach ) ] $de.fhem.Device-media{Device} [um] [(0..10){Value!int}] [dezibel{Unit}] ( lauter:volUp | leiser:volDown){Change}
[de.fhem:SetNumeric]
( mach | stelle ) $de.fhem.Device-thermostat{Device} [um] [(0..10){Value!int}] [grad{Unit}] ( wärmer:tempUp | kälter:tempDown ){Change}
( mach |schalt|schalte|stelle) $de.fhem.Device-light{Device} [um] [(0..100){Value}] [prozent{Unit:percent}] ( heller:lightUp | dunkler:lightDown){Change}
(schalt | schalte | stelle ) $de.fhem.Device-light{Device} auf (0..100){Value!float}
( mehr{Change:lightUp} | weniger{Change:lightDown} ) $de.fhem.Device-light{Device} [$de.fhem.Room{Room}]
( mach |schalt|schalte|stelle) $de.fhem.Device{Device} [um] [(0..100){Value}] [prozent{Unit:percent}] (heller){Change:lightUp}
( mach |schalt|schalte|stelle) $de.fhem.Device{Device} [um] [(0..100){Value}] [prozent{Unit:percent}] (dunkler){Change:lightDown}
(schalt | schalte | stelle ) $de.fhem.Device{Device} auf (0..100){Value!float}
( mehr{Change:lightUp} | weniger{Change:lightDown} ) $de.fhem.Device{Device} [$de.fhem.Room{Room}]

[de.fhem:SetNumericGroup]
\[(schalt|mach|fahr)] (alle | sämtliche ) $de.fhem.Group{Group} [im] [( überall:global | $de.fhem.Room){Room}] [um]  [(0..10){Value!int}] [dezibel{Unit}] (lauter|höher){Change:volUp}
\[(schalt|mach)] (alle | sämtliche ) $de.fhem.Group{Group} [im] [( überall:global | $de.fhem.Room){Room}] [um] [(0..10){Value!int}] [dezibel{Unit}] (leiser|niedriger){Change:volDown}
( mach | stelle ) (alle | sämtliche ) $de.fhem.Group{Group} [im] [( überall:global | $de.fhem.Room){Room}] [um] [(0..10){Value!int}] [grad{Unit}] (höher|wärmer){Change:tempUp}
( mach | stelle ) (alle | sämtliche ) $de.fhem.Group{Group} [im] [( überall:global | $de.fhem.Room){Room}] [um] [(0..10){Value!int}] [grad{Unit}] (niedriger|kälter){Change:tempDown}
( mach |schalt|schalte|stelle) (alle | sämtliche ) $de.fhem.Group{Group} [im] [( überall:global | $de.fhem.Room){Room}] [um] [(0..100){Value}] [prozent{Unit:percent}] (heller){Change:lightUp}
( mach |schalt|schalte|stelle) (alle | sämtliche ) $de.fhem.Group{Group} [im] [( überall:global | $de.fhem.Room){Room}] [um] [(0..100){Value}] [prozent{Unit:percent}] (dunkler){Change:lightDown}
(schalt | schalte | stelle ) (alle | sämtliche ) $de.fhem.Group{Group} [im] [( überall:global | $de.fhem.Room){Room}] auf (0..100){Value!float}
( mehr{Change:lightUp} | weniger{Change:lightDown} ) (alle | sämtliche ) $de.fhem.Group{Group} [im] [( überall:global | $de.fhem.Room){Room}]


[de.fhem:MediaControls]
(starte|start){Command:cmdPlay} [die wiedergabe] [$de.fhem.Device-media{Device}][im] [$de.fhem.Room{Room}]
(stoppe|stop){Command:cmdStop} [die wiedergabe] [$de.fhem.Device-media{Device}] [im][$de.fhem.Room{Room}]
(pausiere | halte){Command:cmdPause} [die wiedergabe] [$de.fhem.Device-media{Device}][im] [$de.fhem.Room{Room}] [an]
(nächstes|nächster){Command:cmdFwd} (lied|titel) [$de.fhem.Device-media{Device}] [im][$de.fhem.Room{Room}]
(vorheriges|voriges|vorheriger|voriger){Command:cmdBack} (lied|titel) [$de.fhem.Device-media{Device}][im] [$de.fhem.Room{Room}]

[de.fhem:GetWeekday]
\[bitte] weißt du [bitte] welcher Tag heute ist [bitte]
\[bitte] kannst du mir [bitte] sagen welcher Tag heute ist [bitte]
\[bitte] könntest du mir [bitte] sagen welcher Tag heute ist [bitte]
\[bitte] kannst du mir [bitte] den [heutigen] Tag sagen [bitte]
welcher [wochentag|tag] ist heute [bitte]
welchen [wochentag|tag] haben wir heute [bitte]

[de.fhem:GetTime]
wie spät [ist es]
sag mir die uhrzeit
wie schpät [isch es]

[de.fhem:GetNumeric]
((Solltemperatur | Wunschtemperatur | Zieltemperatur){Type:desired-temp} | ( warm | kalt | heiß | Temperatur ){Type:temperature}) [ist es | von | vom | ist ] ([(im|auf dem)]($de.fhem.Room){Room}|[das]($de.fhem.Device-thermostat | $de.fhem.Device){Device})
(wie laut | Lautstärke){Type:volume} [ist es | von | vom | ist ] ([(im|auf dem)]($de.fhem.Room){Room}|[das]($de.fhem.Device-media){Device})
wie ist die (luftfeuchtigkeit){Type:humidity} [ ( [ ( im | auf dem ) ] ($de.fhem.Room){Room} | [ vom ] ($de.fhem.Device-thermostat){Device} ) ]


[de.fhem:SetTimer]
labels=( Wecker | Eieruhr | Kartoffeltaimer | Teetaimer | Taimer)

# Timer auf eine Stunde, 20 Minuten und 3 Sekunden
# Timer auf eine Stunde
# Timer auf drei Minuten
\[<labels>{Label}] [in|im|in der|auf der] [$de.fhem.Room{Room}] (in|auf) [((1..60){Hour!int} (stunde|stunden))] [und] [((1..60){Min!int} (minute|minuten))] [und] [((1..60){Sec!int} (sekunde|sekunden))]

# Timer auf ein einviertel Stunden
\[<labels>{Label}] [in|im|in der|auf der] [$de.fhem.Room{Room}] (in|auf) (1..60){Hour!int} (einviertel{Min:15}|einhalb{Min:30}|dreiviertel{Min:45}) (stunde|stunden)

# Timer auf ein einhalb Minuten
\[<labels>{Label}] [in|im|in der|auf der] [$de.fhem.Room{Room}] (in|auf) (1..60){Min!int} (einviertel{Sec:15}|einhalb{Sec:30}|dreiviertel{Sec:45}) (minute|minuten)

# Timer auf 12 Uhr 15
\[<labels>{Label}] [in|im|in der|auf der] [$de.fhem.Room{Room}] (in|auf|um) (1..24){Hourabs!int} uhr [(1..60){Min!int}]

# Timer löschen
(lösche|entferne|stoppe){CancelTimer} [den|die] [<labels>{Label}]  [in|im|in der|auf der] [$de.fhem.Room{Room}]
\[<labels>{Label}] [in|im|in der|auf der] [$de.fhem.Room{Room}] (abbrechen|stoppen|löschen){CancelTimer}

# Timer auf eine viertel/halbe/dreiviertel Stunde
\[<labels>{Label}] [in|im|in der|auf der] [$de.fhem.Room{Room}] (in|auf) ((eine viertel){Min:15}|(eine halbe){Min:30}|(eine dreiviertel){Min:45}) (stunde|stunden)

# Timer auf eine viertel/halbe/dreiviertel Minute
\[<labels>{Label}] [in|im|in der|auf der] [$de.fhem.Room{Room}] (in|auf) ((eine viertel){Sec:15}|(eine halbe){Sec:30}|(eine dreiviertel){Sec:45}) (minute|minuten)


[de.fhem:ReSpeak]
was hast du gesagt

[de.fhem:SetMute]
(gute nacht){Value:on}
(guten morgen){Value:off}

[de.fhem:Shortcuts]
motor aus
auto an
motor an
ton aus
auto aus
ton an
du bisch cool

[de.fhem:ConfirmAction]
(ja mach | tu es | ist ok | aber gerne doch){Mode:OK}
(lieber doch nicht ){Mode}

[de.fhem:CancelAction]
(lass es | nein | abbrechen | abbruch ){Mode:Cancel}

[de.fhem:GetOnOff]
ist [der|die|das] $de.fhem.Device{Device} [$de.fhem.Room{Room}] (an|ein){State:on}
(läuft){State} $de.fhem.Device{Device} [$de.fhem.Room{Room}]

[de.fhem:GetState]
wie ist der status{State} $de.fhem.Device{Device} [(im|in der|auf der|draußen|auf dem)] [$de.fhem.Room{Room}]

[de.fhem:SetOnOff]
\[(schalt|mach|fahr)] [den|die|das] $de.fhem.Device{Device} [$de.fhem.Room{Room}] $OnOffValue{Value}

[de.fhem:SetOnOffGroup]
\[(schalt|mach|fahr)] (alle | sämtliche ) $de.fhem.Group{Group} [im] [( überall:global | $de.fhem.Room){Room}] $OnOffValue{Value}

[de.fhem:SetColor]
colors=( rot{Hue:0} | orangerot{Hue:15} | orange{Hue:30} | goldgelb{Hue:45} | ([(zitrus|zitronen)] gelb){Hue:60} | (gelb grün){Hue:75} | (grün gelb){Hue:90} | ((grass|hell) grün){Hue:105} | grün{Hue:120} | (dunkel grün){Hue:135} | (smaragd grün | wasser blau){Hue:150} | (türkis [grün] | grün blau ){Hue:165} | (türkis [blau] | blau grün ){Hue:180} | (azur [blau]){Hue:210}  | ([blau] violet){Hue:225} | ([marine] blau){Hue:240} | ([blau] violet){Hue:255} | (rosa){Hue:270} | (purpur [blau]){Hue:285} | (magenta [blau]){Hue:300} | (alt rosa){Hue:315} | (rubin rot){Hue:330} | (karmin rot){Hue:345} )
\[setze|färbe] $de.fhem.Device-light{Device} [$de.fhem.Room{Room}] [auf die Farbe] (<colors> | (warm weiss){Colortemp:100} | (kalt weiss){Colortemp:0} | (mittleres weiss){Colortemp:85} )

[de.fhem:SetColorGroup]
\[setze|färbe] (alle | sämtliche ) $de.fhem.Group{Group} [im] [( überall:global | $de.fhem.Room){Room}]  [auf die Farbe] (<de.fhem:SetColor.colors> | (warm weiss){Colortemp:100} | (kalt weiss){Colortemp:0} | (mittleres weiss){Colortemp:50} )


[de.fhem:siteId2room]
( Ortswechsel  | begib dich ) ( ins | in den ) $de.fhem.Room{Room}

[de.fhem:ChoiceRoom]
nimm das Gerät aus ( dem | der ) $de.fhem.MainRooms{Room}

[de.fhem:ChoiceDevice]
ich hätte gerne das Gerät $de.fhem.Aliases{Device}

I hope this is about the info you asked for; please let me know if something important is missing.
The dialogue handling in particular is still at a very early stage of development, and I have already found some inconsistencies in my code, but imo nothing related to the missing sound output when customData is filled.

koan · May 4, 2021, 7:31am

Your customData is quite a big object, even including another customData key:

{
   "customData":{
      ".ENABLED":[
         "ConfirmAction",
         "CancelAction"
      ],
      "customData":null,
      "input":"ton aus",
      "intent":"Shortcuts",
      "probability":1,
      "rawInput":"ton aus",
      "requestType":"voice",
      "sessionId":"e86def78-beb3-81bc-3b99-750a2d53a257",
      "siteId":"motox"
   },
   "intentFilter":[
      "de.fhem:ConfirmAction",
      "de.fhem:CancelAction"
   ],
   "sessionId":"e86def78-beb3-81bc-3b99-750a2d53a257",
   "siteId":"motox",
   "text":"soll ich wirklich den verstärker stumm schalten"
}

Is this intentional? Have you tried first with just a simple string in your customData?
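One way to try the simple-string variant is to serialize only the state that is really needed into a single JSON string and put that string into customData. This sketch rests on the assumption that customData is meant to carry a plain string; I have not verified that the nested object is what stopped the text from being spoken. The helper names are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def pack_custom_data(state: Dict[str, Any]) -> str:
    """Serialize a small state dict into one compact string."""
    return json.dumps(state, separators=(",", ":"), ensure_ascii=False)


def unpack_custom_data(custom_data: Optional[str]) -> Dict[str, Any]:
    """Inverse of pack_custom_data; an empty or missing value gives {}."""
    if not custom_data:
        return {}
    return json.loads(custom_data)


def build_confirmation_request(
    intent_msg: Dict[str, Any], question: str, allowed_intents: List[str]
) -> Dict[str, Any]:
    """Build a continueSession payload that keeps only the state needed later."""
    state = {
        "intent": intent_msg["intent"]["intentName"],
        "input": intent_msg["input"],
    }
    return {
        "sessionId": intent_msg["sessionId"],
        "siteId": intent_msg["siteId"],
        "text": question,
        "intentFilter": allowed_intents,
        "customData": pack_custom_data(state),  # a string, not a nested object
    }
```

When the follow-up intent arrives, `unpack_custom_data` restores the dict, so the handler does not need the full original payload echoed back.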

rejoe2 · May 4, 2021, 10:00am

The idea was just to send back the full original payload, extended with additional info about what might be reverted (or asked). Initially I wanted to avoid temporary storage on the FHEM side.
So I didn’t put much effort into reducing the data in customData (at whatever level), found that it didn’t work as expected, and then decided to take the other option and store the session data within FHEM. As this is necessary for other reasons anyway, there is no longer a need to use customData, so I stopped investigating variants that might also work…
Atm this doesn’t seem to cause any problems.

If you’re interested in how the code for confirmation and so on is designed now:
To get functionality similar to the example in “continuing-a-session”, a simple “is confirmed” flag is added to the original session data (in the “confirmation” handler function), and the extended data is then handed over to the respective intent handler code. The “selection” code (for device or room) is quite similar, but inserts the originally missing piece of info (provided by the second message) into the first payload.
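In outline, that design could look like the following. It is a Python sketch of the idea only, not the actual FHEM code; the class name, method names and the `confirmed` key are assumptions for illustration.

```python
from typing import Any, Dict, Optional


class SessionStore:
    """Keeps the first intent payload of a session until the user answers."""

    def __init__(self) -> None:
        self._pending: Dict[str, Dict[str, Any]] = {}

    def remember(self, payload: Dict[str, Any]) -> None:
        """Store a copy of the original payload under its sessionId."""
        self._pending[payload["sessionId"]] = dict(payload)

    def confirm(self, session_id: str) -> Optional[Dict[str, Any]]:
        """Confirmation handler: return the original payload plus a flag."""
        original = self._pending.pop(session_id, None)
        if original is None:
            return None
        original["confirmed"] = True
        return original

    def fill_in(self, session_id: str, key: str, value: Any) -> Optional[Dict[str, Any]]:
        """Selection handler: insert the missing room or device into the payload."""
        original = self._pending.pop(session_id, None)
        if original is None:
            return None
        original[key] = value
        return original

    def cancel(self, session_id: str) -> None:
        """Cancellation handler: forget the pending payload."""
        self._pending.pop(session_id, None)
```

The dict returned by `confirm` or `fill_in` is what would be handed back to the regular intent handler.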

Glad to hear if you have ideas on different (and maybe better) options for getting this part of the job done.

solyarisoftware · May 25, 2021, 6:54pm

Hi!

Generally speaking, this conversation example brings to mind the concept of context (to use your term), which I’d call domain. This domain implies knowledge of the world (the rooms of your home, the names of things, etc.), what we usually call a knowledge base, which you have probably taught interactively (through conversation again) in a previous session.

Yes. What is missing is probably the domain (the context you are talking about) and an inner-domain state tracking, where state means a turn-taking pointer into the dialog flow, just like a stack pointer in a computer program, if you like the metaphor…

SOLUTION:
To avoid this, you need to introduce a ‘context setting command’: in your sentences file, among your other sentences, you have one (or several) marked with some flag that tells Rhasspy these are context setters. For example:
[context_setter]#context
(Set the context. let’s do the):doing (heating | lighting | climate | music)

So, when this command comes to Rhasspy, it remembers the exact outcome. Suppose you asked about heating, so it remembers: “doing heating”.

Then, as long as the context is set, it automatically ADDs “doing heating” to each recognised text and passes the modified string to the intent recognition module. For example, the command “make it warm in the livingroom” would be passed to Kaldi as “doing heating, make it warm in the livingroom”. This way you would never receive a wrong intent back!

I’m perplexed by a solution that consists of modifying the user utterance before passing it to an “intent classifier”, but I see your point: the intent must be contextualized.

I call these “meta-commands”: you teach your dialog manager to move explicitly from one context to another. I’m not sure this is an optimal solution from a conversational design point of view. For a non-engineer end user I counter-propose implicit ways to change domain (context), using language conventions and, last but not least, timing (as in human-to-human dialogue: after 5 minutes of silence the conversation has probably ended).
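The timing idea can be sketched as a context that simply expires. The class name, the per-siteId keying and the injectable clock are assumptions made for this example; the 300 seconds correspond to the 5 minutes mentioned above.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringContext:
    """Tracks the current domain per siteId and drops it after a silence."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}

    def touch(self, site_id: str, domain: str) -> None:
        """Record that the user just spoke within this domain."""
        self._entries[site_id] = (domain, self._clock())

    def current(self, site_id: str) -> Optional[str]:
        """Return the active domain, or None if there is none or it expired."""
        entry = self._entries.get(site_id)
        if entry is None:
            return None
        domain, last_used = entry
        if self._clock() - last_used > self._ttl:
            del self._entries[site_id]
            return None
        return domain
```

Calling `touch` on every recognized intent keeps the domain alive while the conversation continues, so no explicit "release" command is needed.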

So, as others said, dialog management is not a trivial topic, even in the task-oriented/closed-domain realm of a home assistant.

BTW, I released an open-source draft dialog manager that may give you some ideas: https://github.com/solyarisoftware/naifjs

In Naif, a conversation unit (a sort of domain) is modeled as navigation through a graph of states. So, in my dialog manager, the context of intents lives inside each input state.
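To illustrate the general idea of a state graph in which each state accepts only its own intents, here is a generic sketch. It is not Naif's API; the graph, the state names and the intent names are invented for the example.

```python
from typing import Dict, List

# Hypothetical dialog graph: state -> {intent accepted there -> next state}.
DIALOG_GRAPH: Dict[str, Dict[str, str]] = {
    "idle": {"StartLighting": "lighting", "StartHeating": "heating"},
    "lighting": {"SetScene": "lighting", "Done": "idle"},
    "heating": {"SetTemperature": "heating", "Done": "idle"},
}


class DialogState:
    """Pointer into the dialog graph; the current state is the context."""

    def __init__(self, graph: Dict[str, Dict[str, str]], start: str = "idle") -> None:
        self._graph = graph
        self.state = start

    def allowed_intents(self) -> List[str]:
        """Intents that make sense in the current state."""
        return sorted(self._graph[self.state])

    def advance(self, intent: str) -> bool:
        """Follow the edge for this intent; return False if it is not allowed."""
        next_state = self._graph[self.state].get(intent)
        if next_state is None:
            return False
        self.state = next_state
        return True
```

The list from `allowed_intents` is the kind of value one could pass as an intent filter, so that "warm" in the lighting state can only resolve to a lighting intent.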

BTW, I have no truth in my pocket
and I have to think about how Naif could be used in Rhasspy (I chatted in the past with @synesthesiam; I’ll be back if I find the thread).

giorgio

rejoe2 · June 11, 2021, 10:35am

Hello everyone,
now that our implementation in FHEM has matured to a large extent, some users have asked for extended dialogue features.
So I once more had a closer look at this and ran into some problems, as described here.
As the focus is different from the thread starter’s suggestions, I’ve opened the above thread and kindly ask for your support.