"louder, please" - highlighting facets of the FHEM implementation

Hi all,

The thread “What’s the craziest thing you can or cannot do with Rhasspy?” prompted me to at least start some kind of write-up on the “really cool stuff” implemented in the FHEM plugin. So why could this be interesting reading if you’re not using FHEM? (Sorry, I don’t know how to make funny and instructive videos.)

Well, imo the code structure itself contains some aspects that might be worth taking into consideration, even if you use a completely different solution for your home automation…

So expect a wall of text that’s hard to understand :rofl:.

So let’s start with the spoken “louder, please” from the thread title:
Obviously, you can hardcode a sentence in your sentences.ini that points to a special intent, which then increases the volume on one specific amplifier. Oh no, how boring :stuck_out_tongue_winking_eye:

Whoever asks your automation to do that might expect the TV in the living room to be addressed, or the one in the bedroom? Or the speakers attached to the PC on your office desk? Who knows :upside_down_face:

As you can see from this example, taking some additional aspects into account before deciding what should finally happen makes the whole thing feel much more “natural” in terms of user experience.

As Rhasspy already provides the information where any input is coming from, imo it’s obvious that the code always has to take the siteId into account, apart from the cases where the speaker explicitly wants the command executed in a specific room.

So this is why this thread here started:

Additionally, for “louder, please”, the automation code should be able to consider

  • which “device” is capable of executing “volume” commands at all?
  • which one is running at this point in time?

All these things were solved a long time ago by some other smart people (I don’t know exactly who), so big thanks to you guys in case you’re reading this!

The code being that open also has a big disadvantage: you may get “too many” positive hits (devices that might be the right ones)… What to do when both the TV and the amplifier are turned on?
Well, you may

  • ask the current user (this is solved here)
  • set some priorities in the configuration.
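
To make the decision chain above a bit more tangible, here is a minimal sketch in Python (the plugin itself is not written in Python, and all names like `Device`, `can_volume` or `priority` are made up for illustration). It filters by room, capability and running state, then lets a configured priority break ties; if several candidates remain, the caller would have to ask the user.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    room: str
    can_volume: bool   # hypothetical flag: device understands "volume" commands
    is_on: bool        # hypothetical flag: device is currently running
    priority: int = 0  # hypothetical config value: higher wins

def volume_candidates(devices, site_room, spoken_room=None):
    """Return the best matching devices; more than one means 'ask the user'."""
    # An explicitly spoken room overrides the room derived from the siteId.
    room = spoken_room or site_room
    hits = [d for d in devices if d.room == room and d.can_volume and d.is_on]
    if not hits:
        return []
    best = max(d.priority for d in hits)
    return [d for d in hits if d.priority == best]
```

With a TV (priority 1) and an amplifier (priority 2) both running in the same room, only the amplifier is returned; with equal priorities both come back and a question to the user is due.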

So, what are the structuring elements that allow this kind of decision making:

  • keep the Rhasspy side (sentences.ini) as device agnostic as possible, use slots
  • use generic intents whenever possible (“louder, please” will trigger an intent called “SetNumeric” which also is used to set desired temperatures for heating devices, blind positions, …)
  • provide specific slots for specific (functional) tasks (“louder” only addresses media devices, so combining louder with a thermostat device should be avoided imo).
  • keep the code itself language-agnostic

The last point might be self-explanatory with an excerpt from my sentences:
<cmdmulti> [<den>] [$de.fhem.Device-media{Device}] [<rooms>] [<etwasLauter>] ( lauter:volUp | leiser:volDown ){Change}

(cmdmulti internally is optional, too)
So the basic info arriving in FHEM is “Change:volUp” plus the siteId, no matter whether you translate the sentence into English, French or whatever you like…
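
As a rough illustration of why this keeps the code language-agnostic, here is a small Python sketch (not the plugin’s actual code). The values “volUp”, “volDown” and “tempUp” are taken from the examples in this thread; “tempDown” and the structure of the lookup table are assumptions.

```python
# Language-independent "Change" values mapped to (command type, direction).
# "tempDown" is an assumed counterpart to "tempUp"; the table layout is made up.
CHANGE_MAP = {
    "volUp": ("volume", +1),
    "volDown": ("volume", -1),
    "tempUp": ("desired-temp", +1),
    "tempDown": ("desired-temp", -1),
}

def interpret_change(intent_data):
    """Turn the generic intent payload into a neutral request description."""
    kind, direction = CHANGE_MAP[intent_data["Change"]]
    return {"siteId": intent_data["siteId"], "type": kind, "direction": direction}
```

Whether the speaker said “lauter”, “louder” or “plus fort” no longer matters at this point; only the tag and the siteId arrive here.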

The same goes for color information and so on: for a “yellow” color to be set, the FHEM plugin just expects the number 60 (or something in that range at least) :wink:. Obviously, this can be standardized and distributed centrally for any possible language.

So the FHEM plugin treats the language itself more like a “skin” - just read a special file and all “standard responses” are translated (and e.g. the slot for “colors 2 numbers” is renewed)…
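
The “skin” idea could be sketched like this in Python. The dictionary layout, the function name and the English responses are purely illustrative; the hue numbers and the German responses are the ones appearing in this thread. It also assumes that generated slot entries may use the same word:substitution form as the sentence excerpt above.

```python
# Hypothetical per-language "skins": only words differ, the numbers stay the same.
SKINS = {
    "de": {
        "colors": {"rot": 0, "gelb": 60, "blau": 240},
        "responses": {"ok": ["zu diensten", "Gerne!"]},
    },
    "en": {
        "colors": {"red": 0, "yellow": 60, "blue": 240},
        "responses": {"ok": ["at your service", "Sure!"]},  # made-up translations
    },
}

def color_slot_lines(lang):
    """Build 'spoken word:number' entries for a colors-to-numbers slot."""
    return [f"{name}:{hue}" for name, hue in SKINS[lang]["colors"].items()]
```

Switching the language then means loading another entry and regenerating the slot; the intent handling code never sees a color name.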

Last point then is how to complete the link between the mass of devices in the automation system (FHEM) and Rhasspy. Just use some slots filled automatically…

So the final configuration on the FHEM side is to add some “labels” containing “names” (which need not be unique across the entire system, only within a “room”), the “room” and (in some rare cases) special information on how to treat weird devices…

OK, I hope you got the idea and a first impression?

So here are some links you might find useful:

  • If you are able to understand German (or have it translated), you might find RHASSPY/Schnellstart – FHEMWiki useful as a first starting point.
  • the “commandref” (pdf version, english; this is the “official” documentation).
  • The code and some additional files (e.g. the “german language skin”) are available from FHEM svn here.

OK, so here’s some more…

One cool feature is the “test mode”. You may feed RHASSPY-FHEM with a single sentence or a list of sentences and check (without negative external influences on the STT part like background noise, microphone quality, a mumbling speaker, whatever) what will happen once the STT system has recognized that specific sentence. The result is either shown online (single sentence) or written to the filesystem. Here are some examples that may also reveal a few more details :sunglasses::

  • “warmer, please” - a variation on “louder, please”. No specific device is mentioned; basically, only the “warmer” and the siteId are necessary to identify the device. Using a different satellite would lead to a different target device:

[RHASSPY] Input: mache etwas wärmer
SetNumeric {“Change”:“tempUp”,“confidence”:1,“customData”:null,“input”:“mache etwas tempUp”,“intent”:“SetNumeric”,“lang”:“de”,“rawInput”:“mache etwas wärmer”,“sessionId”:“defhem_108_testmode”,“siteId”:“defhem”}
Command: set Thermostat_Wohnzimmer_SSO_Clima desired-temp 23.
Response: zu diensten

I hope you get the idea despite it being in German? (“mache etwas wärmer” is “make it a bit warmer”, “zu diensten” is “at your service”.)

  • Two examples for individual color commands:

[RHASSPY] Input: färbe die stehlampe links rot
SetColor {“Device”:“stehlampe links”,“Hue”:“0”,“confidence”:1,“customData”:null,“input”:“färbe die stehlampe links 0”,“intent”:“SetColor”,“lang”:“de”,“rawInput”:“färbe die stehlampe links rot”,“sessionId”:“defhem_95_testmode”,“siteId”:“defhem”}
Command: set Licht_Stehlampe_links hue 0.
Response: zu diensten

[RHASSPY] Input: stelle die stehlampe rechts auf blau
SetColor {“Device”:“stehlampe rechts”,“Hue”:“240”,“confidence”:0.8,“customData”:null,“input”:“stelle die stehlampe rechts 240”,“intent”:“SetColor”,“lang”:“de”,“rawInput”:“stelle die stehlampe rechts auf blau”,“sessionId”:“defhem_96_testmode”,“siteId”:“defhem”}
Command: set Licht_Stehlampe_rechts hue 43690.
Response: Gerne!

As you may have noticed, the “ok” response need not always be the same (indeed, it’s randomized within a user-configurable set of possible alternatives).
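
Two details from the color logs above, sketched in Python under stated assumptions: the spoken color arrives as a hue in degrees (0 for “rot”, 240 for “blau”), and for the right lamp 240 ends up as 43690, which would match scaling onto a 0..65535 device range. That device maximum is my reading of the log, and the function names are made up. The second helper shows the randomized “ok” response.

```python
import random

def scale_hue(hue_deg, device_max):
    """Scale a hue given in degrees (0..360) onto a device-specific range."""
    return round(hue_deg * device_max / 360)

def pick_response(alternatives, rng=random):
    """Pick one of the configured confirmation phrases at random."""
    return rng.choice(alternatives)
```

For example, `scale_hue(240, 65535)` gives 43690, the value seen in the second command.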

  • And here’s something not working (in the given context):

[RHASSPY] Input: schalte die beleuchtung am esstisch an
Confidence not sufficient! SetOnOffGroup {“Group”:“beleuchtung”,“Value”:“on”,“confidence”:0.5,“customData”:null,“input”:“schalte die beleuchtung on”,“intent”:“SetOnOffGroup”,“lang”:“de”,“rawInput”:“schalte die beleuchtung am esstisch an”,“sessionId”:“defhem_90_testmode”,“siteId”:“defhem”}
Devices in group and room: Licht_Stehlampe_links,Licht_Stehlampe_rechts

This may be interesting for several reasons:
– First, there’s just “beleuchtung” as a group name, so the addition of “am esstisch” seems to weaken the confidence level.
– A group named “beleuchtung” really exists (so adding “wohnzimmer” as room identifier might have resulted in a higher confidence level), and RHASSPY-FHEM obviously is able to address commands to such groups. Groups may also be built “ad hoc” by speaking something like “switch device a and device b on” or variations using several devices, groups or rooms.
– The confidence level is also taken into account. If it’s “too low” (configurable per intent), nothing will happen, besides some feedback to the user (suppressed in the test case).
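
A per-intent confidence gate could look roughly like this in Python. The threshold numbers and names are invented for illustration; they are merely chosen so that the SetColor example above (0.8) passes and the SetOnOffGroup example (0.5) is rejected.

```python
# Hypothetical thresholds; in the plugin they are configurable per intent.
DEFAULT_MIN_CONFIDENCE = 0.66
MIN_CONFIDENCE = {"SetOnOffGroup": 0.7, "SetColor": 0.7}

def confident_enough(intent_data):
    """True if the recognized intent reaches the threshold for its intent name."""
    needed = MIN_CONFIDENCE.get(intent_data["intent"], DEFAULT_MIN_CONFIDENCE)
    return intent_data["confidence"] >= needed
```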

  • And here’s one more group command:

[RHASSPY] Input: schalt das licht am esstisch und das radio an
SetOnOff {“Device”:“licht am esstisch”,“Device1”:“radio”,“Value”:“on”,“confidence”:1,“customData”:null,“input”:“schalt das licht am esstisch und das radio on”,“intent”:“SetOnOff”,“lang”:“de”,“rawInput”:“schalt das licht am esstisch und das radio an”,“sessionId”:“defhem_291_testmode”,“siteId”:“defhem”}
Response: Bestätige schalt das licht am esstisch und das radio an

As RHASSPY in my early days of usage often switched the “wrong” devices (preferably my amplifier off :astonished:), one of my first additions to the code base was to request a user confirmation (“Bestätige”, i.e. “confirm”) prior to actually switching the device. Most likely this could have been prevented by taking the confidence level into account, but in those early days I wasn’t aware of all the possibilities (and most likely not yet capable of doing the required coding).

As the “radio” still requires a confirmation, the entire group will also do so.

Maybe sometime in the future I’ll change that and do, let’s say, a mixed model requiring a user confirmation only if the confidence level is below xy.z…

That’s it for this time, CU.
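
Just to illustrate the difference between the two models, a tiny Python sketch. It is one possible reading of the “mixed model” idea, not something that exists in the plugin; the names `confirm_devices` and `ask_below` are made up.

```python
def needs_confirmation(devices, confirm_devices, confidence, ask_below=None):
    """Decide whether the user has to confirm before anything is switched.

    devices:         device names addressed by the command (single or group)
    confirm_devices: set of device names flagged as 'confirm first'
    ask_below:       None = behaviour as described above (one flagged device
                     is enough); a number = possible mixed model, asking only
                     when the confidence is below that value
    """
    flagged = any(d in confirm_devices for d in devices)
    if ask_below is None:
        return flagged
    return flagged and confidence < ask_below
```

With the example above, “licht am esstisch” plus “radio” asks for confirmation today because the radio is flagged; in the mixed variant a confidence of 1 would skip the question.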

Doing the kind of testing described in the last post requires some logic that addresses the NLU interface in an appropriate way to get “intent recognition” done and derive the desired actions. So this is what today’s post is about:

Using “text interfaces” in the FHEM+Rhasspy context :sunglasses:.

There are three different cases in which FHEM interacts directly with the NLUI (so especially without prior usage of the Audio Server and ASR functionality):

  • Test sentence processing
  • “Dialogues” with a push messenger service (“messenger”, may be Telegram, Signal, Whatsapp, …)
  • Android devices using the “automagic” app. That app (or a second one) is able to forward ASR-preprocessed text to FHEM (and also receive text and do TTS for responses); let’s call this case “AMAD”.

All three require “FHEM” to be known as a satellite in Rhasspy’s intent recognition system. As you are all Rhasspy experts, this point is obvious to you :slightly_smiling_face:.

The process then is as follows:

  • a sessionId is built. This varies slightly from variant to variant. E.g. in the testmode case, it’s just the own siteId + current test sentence number + “testmode”
  • then the NLUI topic is queried under FHEM’s siteId
  • then either a “not recognized” message is received and analyzed (the siteId still forms part of the sessionId, so it’s possible to separate these otherwise somewhat “anonymous” messages from others!) or the regular intent handling code is called:
    – in the test case: nothing is executed, but the input is processed as far as possible
    – in the AMAD and messenger cases, commands are executed and the response is forwarded; in the AMAD case, TTS is executed on the Android device itself. The session may be kept open in case of requests to the user.
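
The first steps of this process can be sketched in Python as follows. The sessionId pattern is the one visible in the test logs (e.g. “defhem_108_testmode”); the topic name `hermes/nlu/query` and the payload fields are how I understand the Hermes protocol used by Rhasspy, so treat them as an assumption and check the Rhasspy reference. Nothing is actually published here; the function only prepares topic and payload.

```python
import json

def build_session_id(site_id, counter, mode):
    """E.g. own siteId + test sentence number + 'testmode'."""
    return f"{site_id}_{counter}_{mode}"

def belongs_to_site(session_id, site_id):
    """Recognize 'our' messages, e.g. a 'not recognized' reply, by their sessionId."""
    return session_id.startswith(site_id + "_")

def nlu_query(text, site_id, session_id):
    """Prepare an intent recognition request (topic and fields are assumptions)."""
    payload = {"input": text, "siteId": site_id, "sessionId": session_id}
    return "hermes/nlu/query", json.dumps(payload)
```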

The messenger case requires some additional session handling (executed on the FHEM side), as typically we want to separate “other” messages from those explicitly addressed to the NLUI. To do this, the user has to initiate every session with some kind of (per-user configurable) keyword and proceed with the dialogue within a (configurable) time; otherwise the FHEM session will not be opened at all, or will be closed. From Rhasspy’s perspective, each individual request within a FHEM session is treated as a separate session (apart from cases that would also be “continueSession” cases when using Rhasspy for the entire processing, starting with audio capturing).
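
The keyword-plus-timeout handling could be sketched like this in Python. This is only an illustration of the idea; class and parameter names are made up, and the real handling lives on the FHEM side.

```python
import time

class MessengerSessions:
    """Decide per user whether a chat message should go to intent recognition."""

    def __init__(self, keywords, timeout=60.0, clock=time.monotonic):
        self.keywords = keywords    # hypothetical config: user -> wake keyword
        self.timeout = timeout      # seconds a session stays open without input
        self.clock = clock
        self.open_until = {}        # user -> time at which the session expires

    def route(self, user, text):
        """Return the text to forward, or None if the message is not for us.

        An empty string means: session opened, but only the keyword was sent.
        """
        keyword = self.keywords.get(user)
        if keyword is None:
            return None
        now = self.clock()
        if user in self.open_until and self.open_until[user] > now:
            self.open_until[user] = now + self.timeout  # keep the dialogue alive
            return text
        self.open_until.pop(user, None)                 # expired or never opened
        stripped = text.strip()
        if stripped.lower().startswith(keyword.lower()):
            self.open_until[user] = now + self.timeout
            return stripped[len(keyword):].strip()
        return None
```

Each text returned by `route` would then be sent to Rhasspy as its own request, which matches the “separate session per request” view described above.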

Some background info: The original starting point was the messenger interface, as I really liked the idea of being able to request any available info from within my automation (ok, only a few of them are really interesting and made available the “RHASSPY way”). In the end, it was rather tricky to get that working, but then it was really easy to integrate the two other things.
Using the test option gave really deep insights into what’s still missing, what could be better configured and so on; just preventing the execution of the final commands was really a piece of cake!
An additional remark wrt. the AMAD feature: This was also quite easy to integrate, as only text input and output in combination with some standard event processing was required. When using this option for testing, it turned out to be quite a good option, as

  • TTS quality is better than Rhasspy’s default (but irritatingly slow in direct comparison)
  • STT quality is significantly superior to the results in Rhasspy (using more or less defaults here)
  • as there’s no link between sentences.ini and STT, “rubbish spoken content” will not lead to “mad results”, thanks to weak confidence levels (or even correct “not recognized” handling).
  • turning the microphone back on seems to be better synchronized with reality. Unlike with the Rhasspy mobile app, the device’s own spoken response does not seem to get captured.

The later experiments with AMAD showed some room for improvement in the Rhasspy settings, so I’m curious where my “Rhasspy voyage” will make its next stops… The final destination still is and will be: doing the entire process completely offline!

I hope you enjoyed reading, and please do not hesitate to ask questions on any of the aspects mentioned here.

So here are some cross references to other solutions that have been posted on this forum in the past. As this thread is more about ideas that might be of interest to other implementations as well, you may have a look at the “originals” as well:

  • In Some example sentences for German there was the idea to reinforce the “louder please” with “much louder” (etc.). @moqart handed over a static difference to get that done. In the FHEM implementation, the option to adjust the step width per “device” has always been there; this is how “louder” is interpreted: apply one step width. This mechanism is now enhanced by an additional field acting as a multiplier. That way, any numerical command can be modified using additions like “slightly{Factor:0.7}”, “much{Factor:1.5}”, “very much{Factor:2}”. Thanks to @moqart for this great idea :slightly_smiling_face:.
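
In numbers, the step width plus multiplier idea boils down to something like the following Python sketch. The function name and the clamping to a minimum and maximum are my additions for illustration, not necessarily what the plugin does.

```python
def new_value(current, step, direction, factor=1.0, lo=0.0, hi=100.0):
    """Apply one device-specific step, scaled by the spoken Factor.

    direction is +1 for 'louder'/'warmer' and -1 for the opposite;
    lo/hi limits are an assumed safety net.
    """
    target = current + direction * step * factor
    return max(lo, min(hi, target))
```

With a volume of 20 and a step width of 10, plain “louder” gives 30, while “much louder” (Factor 1.5) gives 35.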

  • @KiboOst also did some interesting things that have parallels in the FHEM plugin as well. To be honest, I didn’t take notice of most of that great stuff earlier, as Jeedom is a “foreign language” to me… So you might find that interesting as well (and the entire post!):

And one last for today:

FHEM-RHASSPY still needs some kind of “basic orientation” from the recognized intents, but imo it’s really a good idea to reduce them to a very basic level (especially for standard tasks like switching things on or turning up the volume). Atm. we are investigating options to (partly) eliminate the barriers between so-called group intents (“turn on the lights”) and single-device intents (“turn on the light”).
Explaining that in detail is tricky, so it’s more like: do not split up your sentences too much into many single intents; use other ways to find out what the speaker wants. That might turn out to be more flexible in the end.
This may be easier to understand with a simple example intent, let’s call it “GetState”: Any kind of request for information towards the home automation may start with “What’s …” or “Tell me …”.
You may then distinguish between “Time” and “Date” or “gazole”, “bread” or “today’s (tomorrow’s) weather” by labeling a “GetDate” or “GetTime” (and so on) intent on the Rhasspy side, but that’s less flexible than just sorting some specific stuff (like date and time) out and following one (or at least a few) common route(s) for the rest… So basically one “GetState” intent may be a good idea (but name it “GetInfo” right away :slightly_smiling_face:).
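
As a rough Python sketch of that idea: one generic intent, with a slot telling what is asked for. The slot name “Type”, the `readings` lookup and the fallback text are hypothetical; only date and time get special treatment, everything else follows one common route.

```python
from datetime import datetime

def get_info(slots, now=None, readings=None):
    """Handle a single generic 'GetInfo' intent instead of GetDate, GetTime, ..."""
    now = now or datetime.now()
    readings = readings or {}      # hypothetical: info type -> prepared answer
    kind = slots.get("Type")       # hypothetical slot name
    if kind == "time":
        return now.strftime("%H:%M")
    if kind == "date":
        return now.strftime("%Y-%m-%d")
    # Common route for everything else (weather, prices, ...).
    return str(readings.get(kind, "no information available"))
```

Adding a new kind of information then only needs a new slot value and a new entry on the automation side, not a new intent.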