Collaboration with Jaco-Assistant project

Hi @synesthesiam,

First of all, I would like to congratulate you on the great progress Rhasspy has made since last year.
After Snips shut down I was looking for a replacement that could be extended with custom skills, but at that time I wasn’t satisfied with the existing solutions, so I decided to build my own assistant called Jaco.
Like Rhasspy, the code is completely open source, but Jaco focuses on usage and extension with skills, in contrast to Rhasspy’s focus on smart home integration.

Currently Jaco can run on almost any Linux computer or a Raspberry Pi 4 and understands English, German, Spanish and French. It also showed great results in some benchmarks I ran, like this one:


I recently checked out Rhasspy again, and with the plans you described in your Master Plan and the ideas @maxbachmann and @koan mentioned for secure skills, I thought it would be a good idea to work together on problems we both face.

Regarding the different modules an assistant requires, like STT, NLU and TTS, I focused on one solution for each. For TTS I’m using PicoVoice, like Snips did, and for NLU I’m using Rasa, which I also ported to work on RasPis. For STT I trained my own models using DeepSpeech. All of those come with a prebuilt container image for easy installation.

For the skills I tried to make installation as well as creation as easy as possible, without restricting possibilities. You can add utterances in a simple Markdown style and run any code you want. Communication with the rest of the assistant is done over a small Python interface, using MQTT. To get rid of the restriction Snips had, that you can only use Python code, and to still keep things secure, each skill runs in a separate container, built locally on the device (with Podman, so that containers don’t run with root permissions).
The skills also use a permission system, so that they can only access special hardware features or external topics if they declare them in their config file.
Training of the STT language model and the NLU model is done locally on the device (it normally takes a few seconds on a computer or a few minutes on a RasPi).

For easy access to skills and sharing them with others, I built a skill store (it can also run offline if needed):


You can find the project here:

I tried to make the setup as easy as possible, but I believe it’s a bit more complex than Rhasspy’s, which has a nice GUI :) The installation takes about 30-60 min, depending on your hardware and internet speed. Jaco should also be able to use multiple satellites, but I haven’t tested this yet.


Regarding a collaboration, I would suggest that we first try to unify our utterance definition syntax, so that we can more easily interchange the different modules of the assistants. It’s already quite similar, but not to the full extent. Maybe we can then make the skills (or at least some of them) runnable with Rhasspy too.
I’m also interested in your Larynx TTS project, and I can offer support with STT training in exchange.


What do you think about it?

Greetings,
Daniel

I’m maintaining a Docker image for Rasa on ARM, which I’m using on my Raspberry Pi:

For now still pre-Rasa 2.0 and only 32-bit.

How are you running Rasa on a Raspberry Pi?

This all sounds awesome, exactly how I envisioned a secure app architecture should look! I’ll try to have a closer look at your project next week.

Yes, that’s what I envisioned for the app architecture as well. I really like that the store shows warnings for these things (access to different topics, internet connection, use of specific devices, memory usage …).

With a Containerfile of about 180 lines. This took me a lot of time :see_no_evil:
Thanks for your bazel-on-arm project by the way:)
You can check it out here: https://gitlab.com/Jaco-Assistant/Jaco-Master/-/tree/master/nlu-parser or download the prebuilt image from there: https://gitlab.com/Jaco-Assistant/Jaco-Master/container_registry


Dear @DANBER,

First thing to say… this skill store looks awesome :+1:

I would love to see a collaboration of Rhasspy and Jaco with the goal to fuse the knowledge and create very flexible and secure offline voice assistants with skill stores. I think this is a win-win situation and the result could inspire a lot of programming beginners to give it a try and start with their first voice project without the fear of having to create all the skills by themselves or of risking their privacy.


This is some incredible work, @DANBER!

I would definitely be interested in joining forces (perhaps @sepia-assistant would be interested as well). If we can make our utterance syntaxes compatible and our containers interoperable, this could be quite an impressive system :slight_smile: Your skill store also looks awesome; this is definitely something that could benefit the whole community.

Do you have a document describing your utterance syntax? Perhaps we could author a shared standard here. Part of Rhasspy’s syntax complexity comes from not having a very sophisticated NLU component (basic graph search). I went this direction to keep things more language independent, but it wastes a lot of time with things like numbers and dates.

If possible, how would you propose making our containers interoperable? It looks like Jaco is using something similar to the Hermes protocol. Maybe a translation service?

Your use of podman here is especially intriguing; I’d like to transition Rhasspy to it eventually and use podman compose instead of having one big container. I have some working code already that uses the settings from Rhasspy’s GUI to generate a Docker compose file. I imagine this could be made to work with podman too.

Larynx uses MozillaTTS, but I’ve been able to shrink down the models by using a small phoneme set per language instead of one big set across all languages. My hope is to train TTS models for all of Rhasspy’s STT languages.

It’s been a while since I’ve looked at DeepSpeech, and 0.9 with your models seems quite fast and accurate. I may be able to train some models myself; I’ve collected quite a bit of speech data, and have been training Kaldi STT models.


Let’s keep the discussion going here and see what we can accomplish.

@koan is the keeper of our Hermes messaging library. @maxbachmann wrote the very fast rapidfuzz NLU library and is our resident C++ badass. @RaspiManu has recently been helping verify German sentences for Larynx TTS so we can have volunteers contribute their voices.

Lots of great talent here :smiley:

You can find all the instructions to create new skills in the demo skill. I also think this is a good starting point for you to check out things we can combine.

There is some explanation in the riddles demo skill: https://gitlab.com/Jaco-Assistant/Skill-Dialogs/-/blob/master/dialog/nlu/en/nlu.md. I’m using the style Rasa suggested, but with some restrictions. I do like the style because you can see the utterances in a nice layout when looking at the files in the skills repository. Currently I’m not supporting named slots for slots of the same type like “start” and “destination” locations, but this should be easy to integrate.
With numbers and dates, I am using Duckling, which does all the conversion stuff and is integrated into Rasa.

It’s quite similar, mainly using different topic names. But most importantly, all topics are encrypted by default, because I wanted to ensure that skills can only access their own topics or external topics they mention in their config file. So I think a translation service (we could create an extra skill for this) would be the best way. For encryption and decryption I’m using a file with keys, from which the required keys are written to another file for each skill. So it’s easy to access them as a user or from a separate program (which is intended), but not from inside a skill’s container (a weather skill shouldn’t be able to access the microphone streaming topic).
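To make the per-skill key files a bit more concrete, here is a minimal sketch of the idea. It is not Jaco’s actual code: the function name `keys_for_skill`, the topic names and the placeholder key strings are assumptions made up for illustration.

```python
import json


def keys_for_skill(all_keys, own_prefix, extra_topics):
    """Return only the topic keys a skill is allowed to hold.

    all_keys: mapping of topic name -> key string (the global key file)
    own_prefix: topic prefix owned by the skill (hypothetical layout)
    extra_topics: external topics the skill declares in its config file
    """
    allowed = {}
    for topic, key in all_keys.items():
        if topic.startswith(own_prefix) or topic in extra_topics:
            allowed[topic] = key
    return allowed


# Hypothetical content of the global key file; the keys are placeholders.
all_keys = {
    "Jaco/Skills/Weather/SayText": "key-a",
    "Jaco/Intents/WeatherGetForecast": "key-b",
    "Jaco/Microphone/Stream": "key-c",
}

weather_keys = keys_for_skill(
    all_keys,
    own_prefix="Jaco/Skills/Weather/",
    extra_topics={"Jaco/Intents/WeatherGetForecast"},
)

# This is what would be written to the file mounted into the skill's container.
print(json.dumps(weather_keys, indent=2))
```

The weather skill ends up without the key for the microphone topic, so it can’t decrypt that stream even if it subscribes to it.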

I would really love to be able to use Rhasspy’s GUI to make the setup even easier. Currently there is one global setup file, where you have to fill in details like the microphone index by hand.

For running Jaco, I’m automatically creating a podman-compose file which can start all the modules and skills.

One important requirement for me would be that we can run the model on a RasPi 4, maybe 2x faster than real time.
I already integrated a MozillaTTS voice (in German), but it’s quite slow, even on good computers (not using the GPU).


Our initial idea was to use Mosquitto’s Access Control List (ACL) for this. This raises the question of how restrictive we want access to be.

Using Mosquitto ACL, the most we could do is restrict access at the topic level. However, we might want to restrict read/write as well, since e.g. each skill will require access to a topic like hermes/dialogueManager/endSession, which is used to end a dialog. Should each skill be able to read the content of this message when it is sent by a different intent? If not, this would require us to encrypt messages. Even when using Mosquitto ACL to prevent subscriptions to certain topics and encrypting messages on topics like hermes/dialogueManager/endSession, it would still be possible for skills to check when these messages are sent, which could be used to monitor the presence of a person.

So I would be really interested in the way Jaco handles this.

Some of the requirements I personally have for the solution:

  • Users should have a simple way to bypass all of this when they do not need this security. E.g. a common use case would be to install some skills from the store, but at the same time run some of your own skills outside of these containers, e.g. because you need to receive messages on a microcontroller (encryption might not be well supported and the device is trusted)
  • We should try to keep the performance impact as small as possible
  • It should be very simple to implement (or at least the complexity should be hidden away from the user e.g. in the Python skill library)

You’re in luck! My German Larynx container runs around 2X realtime on my Pi 4 (4 GB). Ignore the timing of the first sentence you try, as it’s loading the lexicon into memory. It’ll also cache sentences, so repeats will skip synthesis.

I’ve thought about ways to make the GUI more modular. It would be nice to have a description of options in a service or skill, and have HTML controls created for them. I’ve been wary of going down that rabbit hole, though, because there are going to be lots of corner cases and I’m not really a fan of web programming anyways :laughing:

This is an interesting approach. I like it overall, but for Rhasspy I think we’d need to lock topics down at the broker level by client. Lots of people use Rhasspy with NodeRED, and getting encryption keys into that would be tough for newcomers.

Your system seems secure and well thought out, though :+1:

Currently Jaco encrypts only the messages, so skills can’t read the content, but they would be able to subscribe to topics and see that something is sent. I’m not sure which metadata is included in the messages, but that would be readable too. Each skill has its own topic for sending the system text that should be spoken. Sessions are also handled differently, in that I skipped them completely. I didn’t like Snips’ approach here, because I have some skills that require long-running sessions, since they make a web or hardware request which could take half a minute. So currently I’m just blocking wake-word detection while the user or the system is speaking.

The encryption itself is done with Python’s cryptography library, and you just need to share the key string to decrypt it. So I think it’s quite easy to implement, even if you don’t want to use Jaco’s tools, which do everything automatically. I also don’t think this results in a big performance drop.

But I know this is not perfect, and Mosquitto ACL could be a good, or maybe better, solution for this. I’m open to a more elegant suggestion here.
My plan for some time in the future would be to drop MQTT completely and use a messaging system which runs in peer-to-peer mode, removing the central MQTT server. If you have some external system like Home Assistant, my idea was that you use a skill which serves as a communication interface. What do you think about this?

For me this was more a problem for the future, because my current priority is to further improve recognition performance, as well as supporting more languages.

Did you test if it’s also able to run on a Pi 3? And roughly how long did training the voice take?
Your Larynx container has a nice web interface, really good for a short test :+1:


Regarding the intent definition syntax, what do you think of something like this:

## lookup:city
city.txt  <- text file with one city per line 

## intent:travel_duration
- How long does it take to drive from [Augsburg](skill_travel_city:city_name_1) to [Berlin](skill_travel_city:city_name_2) by (train|car)?
`Comments like this`
`Rasa, and I think Rhasspy too, use something with {} brackets for named slots, but I think this breaks the style somewhat, because you don't get the blue link coloring then`

## synonyms
syn.json  <- Currently I'm using a list like with the intents above, but I think this would be more readable.
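To see how such lines could be read by a conversion script, here is a small sketch that pulls the `[value](label)` annotations out of an utterance. The function name `extract_entities` is hypothetical, and splitting the label at the colon into an entity and a slot name is only my reading of the proposal above, not an agreed format.

```python
import re

# Matches [value](label); a plain (train|car) group is not preceded by [...]
# and is therefore left untouched.
ENTITY = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def extract_entities(utterance):
    """Return (plain_text, entities) for one annotated utterance."""
    entities = []

    def strip_annotation(match):
        value, label = match.group(1), match.group(2)
        # Assumption: "entity:slot_name"; the slot name is empty if absent.
        entity, _, slot = label.partition(":")
        entities.append({"value": value, "entity": entity, "slot": slot})
        return value

    plain = ENTITY.sub(strip_annotation, utterance)
    return plain, entities


line = (
    "How long does it take to drive from "
    "[Augsburg](skill_travel_city:city_name_1) to "
    "[Berlin](skill_travel_city:city_name_2) by (train|car)?"
)
plain, entities = extract_entities(line)
print(plain)
for item in entities:
    print(item)
```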

I would say we are quite free in creating a syntax and should write a conversion script for each of the different tools. I like your (option1|option2) syntax, but it’s not supported by Rasa, so I would have to replace it with two sentences anyway.
We can focus on readability this way.
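Replacing the (option1|option2) groups with separate sentences is easy to do in such a conversion script. A rough sketch, assuming groups are not nested; `expand_alternatives` is a made-up name, not something from Rhasspy or Jaco:

```python
import itertools
import re

# A group needs at least one "|" so that entity labels like
# (skill_travel_city:city_name_1) are not treated as alternatives.
ALTERNATIVES = re.compile(r"\(([^()|]+(?:\|[^()|]+)+)\)")


def expand_alternatives(sentence):
    """Expand every (a|b|...) group into separate sentences."""
    # re.split with one capturing group alternates fixed text and groups.
    parts = ALTERNATIVES.split(sentence)
    fixed = parts[0::2]
    choices = [group.split("|") for group in parts[1::2]]

    results = []
    for combo in itertools.product(*choices):
        pieces = [fixed[0]]
        for option, text in zip(combo, fixed[1:]):
            pieces.append(option)
            pieces.append(text)
        results.append("".join(pieces))
    return results


for sentence in expand_alternatives("Is it faster by (train|car|bus)?"):
    print(sentence)
```

The number of sentences grows multiplicatively with every extra group, so for Rasa this is probably only practical for a handful of alternatives per utterance.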

Currently Jaco collects all dialog files from the skills and merges them into one single place, with skill-name prefixes you can use globally.
Those intent files are used later on to create a text file with possible sentences users can say, which is used to train the language model. I think Rhasspy uses a similar approach here, which should make replacing the STT service with a different one quite easy.

I just rechecked how Mosquitto ACL works and it appears to do exactly what we need. We could use MQTT over TLS with username/password authentication, where each skill gets its own pair of username and password. The username is the intent name used for the Hermes topic in hermes/intent/<intentName>. I remember we talked about the behaviour when multiple skills have the same name, but I do not remember how we decided to handle this (@synesthesiam, @koan, do you remember where this was discussed?).

Using ACL it is possible to set read, write and readwrite permissions, so the basic rules for a skill could be something along the lines of

user <skill_name>
topic read hermes/intent/<skill_name>
topic write hermes/dialogueManager/startSession
topic write hermes/dialogueManager/endSession
topic write hermes/dialogueManager/continueSession
...

What I like about this solution is:

  • TLS for MQTT is required anyways, so nobody else on the network can listen to the traffic (already tracked in issue 29)
  • we can derive usernames from the skill name and easily automate the generation of the ACL rules.
  • for self made skills the ACL rule can simply give readwrite permission to all topics
  • skills do not even know a message was sent, when they only have write access to a topic
  • All encryption components used are standard in MQTT -> most MQTT libraries support it
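Generating these rules automatically could look roughly like this. It is only a sketch: the function names and the `trusted` flag for self-made skills are my own invention, and the topic list is just the one from the example rules above.

```python
def acl_rules(skill_name, trusted=False):
    """Build the Mosquitto ACL lines for one skill."""
    lines = [f"user {skill_name}"]
    if trusted:
        # Self-made skills: full access to all topics.
        lines.append("topic readwrite #")
    else:
        lines.append(f"topic read hermes/intent/{skill_name}")
        for action in ("startSession", "endSession", "continueSession"):
            lines.append(f"topic write hermes/dialogueManager/{action}")
    return "\n".join(lines)


def acl_file(skills):
    """Build the whole ACL file from a mapping of skill name -> trusted flag."""
    return "\n\n".join(acl_rules(name, trusted) for name, trusted in skills.items()) + "\n"


# Hypothetical skills: one from the store, one written by the user.
print(acl_file({"weather": False, "my_own_skill": True}))
```

The output can be written to the file referenced by the `acl_file` option in mosquitto.conf.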

This should be quite easy to implement in Jaco. Instead of automatically distributing the encryption keys (which look like this: "Jaco/Intents/JacoMusicmasterNextSong": "64uLChtXiB8n00yZoBzzbPcfYSK_BKt17spuTn2jbtE="), we just have to share passwords and create a rules file for the broker. Skills use an interface for communication, so there should be no need for changes in the skill code.

Currently they would share the same keys if their intents have the same names too (I ignore upper and lower case, drop all special characters and convert the names to SkillNameIntentName format). Maybe we should do a short check in the skill installation process and print a big warning.
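Such a check during installation could be sketched like this. The exact normalization rules are an assumption based on the description above (split at special characters, capitalize each part), and `topic_name` and `find_collisions` are hypothetical names, not Jaco functions.

```python
import re
from collections import defaultdict


def topic_name(skill, intent):
    """Normalize skill and intent names to SkillNameIntentName format."""
    # Split at anything that is not a letter or digit, ignore the original case.
    words = re.findall(r"[A-Za-z0-9]+", f"{skill} {intent}")
    return "".join(word[:1].upper() + word[1:].lower() for word in words)


def find_collisions(intents):
    """Return normalized names that are produced by more than one intent.

    intents: iterable of (skill, intent) pairs from all installed skills
    """
    seen = defaultdict(list)
    for skill, intent in intents:
        seen[topic_name(skill, intent)].append((skill, intent))
    return {name: pairs for name, pairs in seen.items() if len(pairs) > 1}


installed = [
    ("Jaco-Musicmaster", "next_song"),
    ("jaco_musicmaster", "Next-Song"),  # e.g. a second skill from another repository
    ("Weather", "get_forecast"),
]
for name, pairs in find_collisions(installed).items():
    print(f"WARNING: {name} is used by {pairs}")
```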

Ah now I remember what we discussed. It was not about having multiple skills with a similar name, but multiple intents with a similar name. We decided to go for the syntax

hermes/intent/<skill_name>/<intent_name>

so it is also possible to subscribe e.g. to all topics of a skill using hermes/intent/<skill_name>/#
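For anyone not familiar with MQTT wildcards, this is roughly how a broker decides whether a topic matches such a subscription. A simplified sketch that ignores special cases like topics starting with `$`; the skill and intent names are made up.

```python
def topic_matches(subscription, topic):
    """Check whether an MQTT topic matches a subscription filter.

    "+" matches exactly one level, "#" matches the remaining levels
    (including none, so "a/#" also matches "a").
    """
    sub_levels = subscription.split("/")
    topic_levels = topic.split("/")
    for index, level in enumerate(sub_levels):
        if level == "#":
            return True
        if index >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[index]:
            return False
    return len(sub_levels) == len(topic_levels)


print(topic_matches("hermes/intent/music/#", "hermes/intent/music/next_song"))
print(topic_matches("hermes/intent/music/#", "hermes/intent/weather/get_forecast"))
print(topic_matches("hermes/intent/+/next_song", "hermes/intent/music/next_song"))
```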

I guess in the skill store this could not occur anyways, since skills would be forced to choose a unique name. So it should be enough to warn the user about this (it would probably mean they created a local skill themselves with a similar name).

Currently it’s possible; there is no check for name uniqueness right now. The skills can also have different names in different languages. What Jaco uses later as the name in the scripts is the folder name of the skill. You could have two skills with the same name here too, because you can download skills from any git repository, and you could theoretically have one TheSkill on GitHub and another TheSkill on GitLab.

Would you be interested in adding this yourself into Jaco, as a merge request? With some help of course.

In the next weeks I still have to get my string matching library RapidFuzz to v1.0.0 (getting closer :tada:) and add SSML support to Rhasspy. Possibly afterwards, but I would not count on it.

@synesthesiam what do you think about it?

(Sorry for the delay in a response. I’m spending a bit less time on the computer during the holidays :slight_smile:)

I think this would be a good start; there is a lot of overlap with Rhasspy’s existing format. I would probably keep this “cross-assistant” format separate and just convert it to Rhasspy’s internal format during training.

Can you provide some examples of synonyms? Are these any different than Rasa’s?

Currently the synonyms follow Rasa’s format:

## lookup:city
...

## intent:travel_duration
...

## synonym:red
- light red
- dark red

## synonym:blue
- light blue
- dark blue

In my opinion moving the definitions into an extra file (like with the lookup.txt) would improve readability if there are many synonyms.
In this file I would create a simple JSON structure:

{
  "red": ["light red", "dark red"],
  "blue": ["light blue", "dark blue"]
}
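Converting such a file back into the Rasa-style synonym sections is only a few lines. A small sketch, with `synonyms_to_markdown` as a made-up name and the JSON content passed in as a string instead of being read from syn.json:

```python
import json


def synonyms_to_markdown(synonyms):
    """Convert {"value": ["variant", ...]} into Rasa-style synonym sections."""
    blocks = []
    for canonical, variants in synonyms.items():
        lines = [f"## synonym:{canonical}"]
        lines += [f"- {variant}" for variant in variants]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


raw = '{"red": ["light red", "dark red"], "blue": ["light blue", "dark blue"]}'
markdown = synonyms_to_markdown(json.loads(raw))
print(markdown)
```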

This is a good approach I think. I would handle this the same way for Rasa. This should make switching NLU or STT services quite easy.
