I’ve just set up Rhasspy together with my existing Home Assistant installation. I’m surprised at how easy it was, and I’ll probably contribute back some code in the future when I feel I’m missing something (for instance, good Dutch TTS in the Docker image).
Now, I see many people working on the classic commands you can handle with Google’s, Apple’s or Alexa’s voice assistants. While that’s truly cool, I’ve always been of the opinion that free software can do better than them. So that’s my question: what did you do that you’re truly proud of? What did you imagine implementing, but have not yet? What do you think is still impossible?
I’ll kick off with something I did last night: I wrote a set of slot_programs that fetch names of movies, shows and music from my Jellyfin server, and populate $movies, $shows and $music slots. I can now say “Jarvis, speel Big Buck Bunny” and it’ll play that movie. I can say “Jarvis, speel Severed Fifth”, and it’ll shuffle the songs of that band. And I can call upon a TV show, and my TV will play the next unwatched episode. All with free software: Kodi, Jellyfin, Home Assistant, and Rhasspy. Best of all, my new best friend understands and speaks Dutch!
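To give an idea of the mechanism: Rhasspy runs each executable in the profile’s slot_programs/ directory and reads one slot value per line from its stdout. Here’s a minimal sketch along the lines of what I wrote (not my exact script; the Jellyfin URL, API key and item types are placeholders you’d adapt):

```python
#!/usr/bin/env python3
"""Minimal Rhasspy slot program: print one Jellyfin movie title per line.

The URL, API key and item types below are placeholders; adapt them to your
own server (the $shows and $music slots can be built the same way).
"""
import os
import requests

JELLYFIN_URL = os.environ.get("JELLYFIN_URL", "http://jellyfin.local:8096")
API_KEY = os.environ.get("JELLYFIN_API_KEY", "changeme")

resp = requests.get(
    f"{JELLYFIN_URL}/Items",
    headers={"X-Emby-Token": API_KEY},
    params={"IncludeItemTypes": "Movie", "Recursive": "true"},
    timeout=10,
)
resp.raise_for_status()

# Rhasspy reads slot values from stdout, one value per line.
for item in resp.json().get("Items", []):
    print(item["Name"])
```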
A few days ago, one of the users of our FHEM integration mentioned having abandoned one of the other “big” speech control systems in favour of Rhasspy. He didn’t like the “there’s no appropriate … near you” responses any longer. So one point where Rhasspy can do much better is taking into account where (and maybe also who) you are when deciding what should eventually be done. Taking that kind of information (and/or which wakeword has been used) into account is one thing that really might make a significant difference.
Basically, I’m very proud of my contribution to the rework of the entire FHEM integration. When I started that journey, there was already a very good code base doing a kind of “context dependent reverse search”, e.g. from a (non-unique) device name plus location info derived from the satellite. That search was/is used to find a specific actor. Automatically filling a couple of slots (containing e.g. the speakable device names and rooms) had also been included.
But the code was more or less unusable for languages other than German. Now, everything is standardized using either English (e.g. for on or off) or just numbers (e.g. also for colour values) as expected values and/or keywords in the JSON blob. This way, you may even have more than one language (or Rhasspy base) in parallel addressing the same actors, sensors, …
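Just to illustrate the idea (a made-up Python snippet, not FHEM’s actual data structures): spoken values in any language get normalized to English keywords or plain numbers before they reach the device.

```python
# Purely illustrative "language skin": spoken words are mapped to standardized
# values, so several Rhasspy instances in different languages can drive the
# same devices.
NORMALIZE = {
    "de": {"an": "on", "ein": "on", "aus": "off", "rot": 0, "grün": 120, "blau": 240},
    "nl": {"aan": "on", "uit": "off"},
}

def normalize(language, spoken_value):
    """Return the standardized keyword (or hue number) for a spoken value."""
    return NORMALIZE.get(language, {}).get(spoken_value, spoken_value)

print(normalize("de", "aus"))   # -> "off"
print(normalize("de", "grün"))  # -> 120 (hue in degrees)
```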
Also, the number of slots has been increased a lot to distinguish between e.g. blind and light types of actors. This makes it easier to build tailored sentences within Rhasspy.
There’s a lot more cool stuff included, e.g. you may also use a messenger service to “feed” Rhasspy and then get the (written) answer. But that’s a long story; maybe at some point I’ll make a longer write-up on the entire thing.
If you are able to understand (or translate) German, you might find RHASSPY/Schnellstart – FHEMWiki useful as a first starting point.
Here’s a link to the “commandref” (PDF version, English).
The code and some additional files (e.g. the “german language skin”) are available from FHEM svn here.
Other “cool stuff” to mention:
value mapping
venetian blind specials
e.g. you may let it interpret a “set blind xy to shading” command to open up the respective venetian blind to 40% and also turn the lamellas to that level…
Also, the “abilities” of an actor will be analyzed automatically, so just marking any actor as “light” might be sufficient to provide at least on/off and (depending on what’s possible in reality) additional brightness or colour commands (regardless of whether it understands “hue” (colour without brightness value) or “rgb” kinds of commands), colour temperature commands etc. in combination with Rhasspy.
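To make the venetian blind example above concrete, here is a purely hypothetical sketch of the value mapping idea (not the FHEM code, just the concept):

```python
# Hypothetical value mapping: a symbolic target like "shading" is translated
# into a blind position plus a lamella (slat) angle before being sent to the actor.
BLIND_PRESETS = {
    "shading": {"position": 40, "slat": 40},    # ~40 % open, lamellas at 40 %
    "closed":  {"position": 0,   "slat": 0},
    "open":    {"position": 100, "slat": 100},
}

def blind_command(target):
    """Translate a spoken target into concrete actuator values."""
    if target in BLIND_PRESETS:
        return BLIND_PRESETS[target]
    value = int(target)  # otherwise assume a plain percentage was spoken
    return {"position": value, "slat": value}

print(blind_command("shading"))  # -> {'position': 40, 'slat': 40}
```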
Well, I’m of the opinion that a voice assistant shouldn’t really say the same thing twice
Adding the site-id and wake word to the intent entities can basically fill this gap, no? This was on my mind too: you could say “turn on the light”, and depending on the site-id it would pick the correct one.
The part about who is speaking sounds quite a bit more involved. You can partly mitigate that by giving each person their own wake word, I suppose, but that’s a bit lame.
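Something like this would be the idea (a rough sketch only, assuming the Hermes intent payload carries a siteId; the intent name and entity ids are made up, and it’s written against the paho-mqtt 1.x API):

```python
#!/usr/bin/env python3
"""Rough sketch: pick the light matching the satellite that heard the command."""
import json
import paho.mqtt.client as mqtt

# Hand-written table: which light belongs to which satellite (made-up names).
SITE_TO_LIGHT = {
    "livingroom": "light.livingroom_ceiling",
    "bedroom": "light.bedroom_lamp",
}

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload)
    site = payload.get("siteId", "default")
    light = SITE_TO_LIGHT.get(site)
    if light is None:
        print(f"No light mapped for site '{site}'")
        return
    # Here you would call Home Assistant / FHEM to actually switch the entity.
    print(f"Turning on {light} because the command was heard on '{site}'")

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("hermes/intent/ChangeLightState")  # hypothetical intent name
client.loop_forever()
```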
It really looks very neat, and it addresses a grievance I have with Home Assistant. There’s a lot of repeated configuration going on around automations and integrations, to the point where I have been thinking about writing my own automation engine (using a mix of declarative and imperative language, taking inspiration from distributed computing research such as https://gitlab.soft.vub.ac.be/fmyter/triumvirate) that interfaces with Home Assistant. Sadly, I cannot afford another pet project.
I had been looking at Home Intent yesterday, but that felt a bit too “opinionated” to me.
siteId (and wakeword?) is/should be included in the JSON already; FHEM is actively using the siteId to fill that gap.
As I personally use the mobile app as my main input gadget, its siteId is “mobile”, so I use a custom automation to map this satellite to a specific room when required. This mechanism imo could also easily be used to work out who’s talking based on the wakeword used (a piece of cake, indeed, I think). Almost no delay due to those additional few lines of code!
But really doing a “static” analysis of the speaker most likely would be a really big step…
I originally started my home automation journey with FHEM, and everything I wanted I got working with that (apart from things where I didn’t get real access to the hardware itself). So I never did a deeper analysis of the “outer world”. There have been quite a few users in the past using FHEM for automation and especially Home Assistant as the UI, so perhaps this is an option for you, too? (It’s not that easy to “learn” FHEM if you don’t know much about IT, but if you are able to write your own system, this might be worth a try…)
I have a somewhat divided view on Rhasspy. This is due to the fact that, in my capacity as a professional researcher, I am part of several state-of-the-art NLP research projects. One example: in the MERLOT project we are developing a voice assistant to guide humans in arbitrary educational situations. MERLOT is funded with roughly €10 million, but around 70% will go into developing the underlying educational data space on top of the European Gaia-X initiative. For the development of the educational voice assistant we have only about €1 million.
Concerning Rhasspy: Nice work, I admire the flexibility. I am somewhat disappointed by the fact that its future is unclear, because I believe we need to establish certain standards in the field of NLP…
Coming to the question about the “craziest thing”: I have coupled Rhasspy to my home, which is controlled by FHEM. Intent recognition etc. is carried out in FHEM, but not via the Rhasspy module. I have another FHEM module (which I wrote myself a few years ago) called “Babble” (see Modul Babble – FHEMWiki for a hopelessly outdated description) based on semantic analysis of sentences (and therefore heavily tied to German as a language). It works pretty well, but so far this is nothing crazy. What makes this thing really interesting is that I have added a chatbot.
Whenever no intent is recognized by my FHEM system, the received sentence is passed to the chatbot, which is based on RiveScript (https://www.rivescript.com/). Of course, the main idea is to guide the user back to a controlled dialogue, but since part of my RiveScript files is an implementation of ELIZA, it can carry out completely open dialogues with a user.
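The fallback pattern itself is simple; in Python it would look roughly like this (a sketch of the idea using the rivescript package, not my actual FHEM/Babble code, and the tiny inline brain is only for illustration):

```python
"""Sketch of the "chatbot as intent fallback" pattern (not the real Babble code)."""
from rivescript import RiveScript

rs = RiveScript(utf8=True)
# Micro-brain for illustration; a real setup would load a directory of .rive
# files, including an ELIZA implementation for open-ended dialogue.
rs.stream("""
+ hello *
- Hello! How can I help you around the house?

+ *
- I did not quite get that. Could you rephrase it?
""")
rs.sort_replies()

def handle_sentence(sentence, recognized_intent=None):
    """If intent recognition failed, let the chatbot answer instead."""
    if recognized_intent is not None:
        return f"Executing intent '{recognized_intent}'"
    return rs.reply("localuser", sentence)

print(handle_sentence("hello there"))                            # chatbot takes over
print(handle_sentence("turn on the light", "ChangeLightState"))  # normal intent path
```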
The privacy sell is a crazy one, as ASR has gone offline because big data recognised they can still track you through services, and with Google the offline modes are tightly coupled to Android services.
It’s also a crazy sell because many privacy advocates strangely still employ online services: big data no longer has the voice recording of the intent, but still has the intent action data.
Having a voice assistant that lacks integrated audio processing, where the initial audio pipeline of beamforming/AEC/filtering/separation is mostly missing and not integrated with the other modules, has been a huge hole when dealing with noisy environments. The lack of attention this has received is pretty crazy, as it’s the input to the system and so obviously can have pretty drastic effects.
One of the craziest things most people do, because it’s the way Rhasspy works, is to broadcast raw audio over a lightweight control protocol such as MQTT, whilst software for existing standards with hugely better-supported pools of open source is shunned purely to put an ownership label on what is supposed to be open source. That is absolutely, totally crazy.
Same with audio delivery: if you are going to integrate into a wireless audio system, then surely it is better to adopt a wireless audio system that already exists and is supported than yet another proprietary branding that is being positioned purely because you own it. Once more, piggybacking audio onto MQTT, which effectively destroys MQTT as a lightweight control protocol, borders on absolutely batshit crazy.
I think the route @synesthesiam took to accommodate others in their effectively failed Snips attempt was crazy, even though commendable, as this great lightweight embedded local control platform became lost in this terrible peer-to-peer Hermes control protocol.
The lite nature of the initial Rhasspy, able to quickly and easily train a localised, GPIO-controlled voice interface with zero reliance on an internet connection other than for software distribution and upgrades, made it a pretty unique and useful piece of software. In its current stage I am confused about what, if anything, it is effective at.
I would love Rhasspy to return to the wonderful lite simplicity that @synesthesiam originally created; much of the higher-load additions are likely a much better fit for the Mycroft system, where the work could be partitioned and benefit from it, and any conflicting interests kept separate.
There is another thing I generally think is crazy: distributed network microphones are branded and referred to as satellites, whilst they should be treated like any other form of HMI. Or maybe we should have Rhasspy keyboards and Mycroft mice!?
This branding craziness of a plethora of disparate systems has seriously hampered open source smart assistants. There are clear partitions that should each be a system in themselves and promote interoperability, but for the most part we have the opposite. Know what you are good at, stick to it, and provide user choice, as is supposedly the nature of open source.
I’m noticing a trend of people in AI and NLP being on this forum. That’s a good thing, because it means there’s a bunch of critical thinkers on here. The fact that those same people use the software means that it lives somewhere on the spectrum between state of the art and usability. It is clear, however, that you have some frustration that research doesn’t trickle fast enough into these kinds of projects. I have the same feeling coming from cryptography.
Any specifics that you’d like to see trickle down, especially pertaining to my point “While that’s truly cool, I’ve always been of the opinion that free software can do better than them”?
Exactly what I’m after, cool stuff!
@rolyan_trauts, can I blatantly summarize your post as “Rhasspy is doing a lot of hammering screws (voice over MQTT)”, and it’s a typical mess of many open source components that fit together but aren’t really modular yet?
The simplicity is still there; what is it you feel has increased the complexity of Rhasspy?
It is still a tool to quickly and easily train a localised voice interface with zero reliance on the net, if you choose to.
@rubdos I guess you could look at it that way, but it’s naturally modular, as audio in becomes text out, which is then fed…
It was made proprietary and non-modular with the introduction of Hermes, which does little else than confuse, and being proprietary it has limited support by a few on here for no reason, as better, more standard, more modular, more supported protocols exist…
@romkabouter you know I hate Hermes, always did, and especially the satellite versions, but let’s not go over opinions worn thin.
The Pi0 is dead, long live the Pi02. And for @synesthesiam: Rhasspy for a long while did manage to just about work on a Pi0, and I expect it could quite easily be optimised for an embedded Pi02, as they are so cost effective, and that likely isn’t a conflict of interest with his Mycroft work.
Yeah I know, but Rhasspy on a single setup does not do anything with that. That’s why I ask.
There is no network broadcast or anything; it is just that single simple device, so I was wondering what made Rhasspy more complex to use for you than in the early days.
I agree with you that there are better ways for real-time audio.
Not much; what I think it should do is concentrate on that single setup and ditch much of the rest, as that is the bit I don’t like. A few later modules like Larynx are probably a tad heavy for that type of ‘embedded’ role and probably conflict slightly with what Mycroft might want to employ. Also, maybe trim out some of the modules where there are duplicates without any advantage.
It is really the network broadcast and the more complex satellite side that I didn’t like, plus the missing initial audio processing that would let it cope better with noise.
I am chipping away at the noise thing and very slowly picking up some C++ skills to try and minimise load, and if I do get some solutions I will share them.
Yeah, I also agree the single setup ‘embedded’ device mode does work quite well; I was just thinking the focus should be to make that side lean and give the Pi0-2 some attention, as it is just an incredibly cost-effective platform.
Even big data gets this wrong. I swapped my Google Nest Audio for 2x Amazon Echo Gen 4 because they have an aux in, so they can also be used as wired active speakers.
The Gen 4 is noticeably less accurate in the presence of noise than the Echo Dot Gen 3 that I have also tested (and supposedly the same goes for the full Gen 3), and it’s driving me crazy at times, as it’s much worse than the Google Nest.
I am not sure Rhasspy will ever make improvements in that area, as some of the biggest improvements big data makes come from dictating, controlling and integrating the hardware, whilst Rhasspy operates in a bring-your-own-mic-to-the-party style of operation that is near impossible to provide for because of how wide-ranging that can be.
Even though the Echo Gen 4 wipes the floor with Rhasspy in terms of noise, it’s still not great, and at a guesstimate the mics might be getting too much feedback due to some sort of isolation problem.
There is a huge amount of engineering that goes into them, with a lot of choices from the algorithms used and mic type to how you even assemble the thing; if you get these wrong or are merely unaware of them, you will get worse results.
Having a complete absence of any of the all-important audio processing setup in a project leads to the results it has.
In that regard it seems rudderless, even though it’s been an issue for years.
Have you, or are you willing to share your slot programs? I’m especially interested in your music slot generation, as I’m working on a voice interface to my homebuilt jukebox. TIA