And here’s the forum discussion where you see the evolution of its development:
If you look at the example code in the repository and the Usage page in the documentation, you can see that this indeed abstracts away all the MQTT topics. The main goal of the library is to make it as easy as possible to create Rhasspy apps in Python, without having to know low-level details and with as little boilerplate code as possible.
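For comparison, the time example built on rhasspy-hermes-app looks roughly like this (sketched from memory of the library's HermesApp / on_intent / EndSession API, so details may differ):

"""Example app to react to an intent to tell you the time."""
# Sketch of a rhasspy-hermes-app handler; exact API details may differ.
from datetime import datetime

from rhasspyhermes.nlu import NluIntent
from rhasspyhermes_app import EndSession, HermesApp

app = HermesApp("TimeApp")


@app.on_intent("GetTime")
async def get_time(intent: NluIntent):
    """Answer with the current time and end the session."""
    now = datetime.now().strftime("%H %M")
    return EndSession(f"It's {now}")


app.run()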
This is very similar to what Jaco is doing. The time example for Jaco would look like this:
"""Example app to react to an intent to tell you the time."""
from jacolib import assistant
from datetime import datetime
file_path = os.path.dirname(os.path.realpath(__file__)) + "/"
assist: assistant.Assistant
def callback_get_time(message):
now = datetime.now().strftime("%H %M")
assist.publish_answer(f"It's {now}", message["satellite"])
assist = assistant.Assistant(repo_path=file_path)
assist.add_topic_callback("get_time", callback_get_time)
assist.run()
Hi @synesthesiam, I made some progress using customized FSTs with the Scribosermo models. Building custom FSTs also allowed an integration of the NLU information directly into the decoding graph, which means it's possible to combine the STT+NLU steps into a single SLU model. I already tested it on some benchmarks, with very good results (competitive with Jaco's current solution with Rasa, as well as some other SLU approaches). And, what's very interesting for Jaco and Rhasspy, the models can be built in a few seconds, even on a RaspberryPi.
A hopefully understandable description of the approach can be found in the readme of the project:
In the near future I will update Jaco to use the new SLU model (it needs some additional features for number and date parsing in multiple languages, presumably with the help of Duckling).
This is some incredible work! I’m still digesting it all.
How will you incorporate numbers and dates into the FST? Will you pre-generate all of the possibilities during training, or use grammar FSTs?
By the way, you may be interested in the “unknown words” feature I’ve added to Rhasspy 2.5.11 (Docker preview is available). For each word in the grammar, I have a low-probability branch that goes through a collection of ~100 frequently-used words from the target language (and not used elsewhere in the grammar). This branch outputs <unk>, and acts as a way of “catching” misspoken words since (presumably) frequently-used words will be phonetically diverse. Pretty easy to code up, and seems to be effective!
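A minimal sketch of that idea (using pynini here just for illustration; the word list, weight, and grammar words below are made up and this is not Rhasspy's actual implementation):

import pynini

# A handful of frequently-used words; the real feature uses ~100 per language.
frequent_words = ["house", "water", "people", "because", "different"]

# Accept any of the frequent words, but output the <unk> token instead.
unk_branch = pynini.union(*(pynini.cross(w, "<unk>") for w in frequent_words))

# Give the branch a high cost (= low probability in the tropical semiring).
unk_branch = pynini.accep("", weight=10.0) + unk_branch

# Union the <unk> branch with the words the grammar actually expects here.
expected = pynini.union("play", "stop", "pause")
word_slot = pynini.union(expected, unk_branch).optimize()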
I would say it's a mix of both. The process for numbers looks like this (a rough sketch in code follows below the list):
all number possibilities are pre-generated in text form
the numbers are converted to a (grammar) Slot-FST
the FST is optimized → this makes it much smaller because different paths are merged (twenty-two and twenty-one both start with twenty, so after optimization the path only splits at one and two instead of duplicating twenty)
the Number-FST is inserted into the Intent-FST which contains the intent examples (what is [number1] plus [number2])
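As a rough illustration of those steps (not Jaco's actual code; pynini and num2words are just stand-ins here, and the number range is much smaller than in the real slot):

import pynini
from num2words import num2words

# 1) Pre-generate the number possibilities in text form.
number_words = [num2words(n) for n in range(0, 101)]

# 2) + 3) Convert them into a slot FST and optimize it, so shared prefixes
#    (e.g. "twenty-...") are merged and the paths only split at the end.
number_fst = pynini.union(*(pynini.accep(w) for w in number_words)).optimize()

# 4) Insert the number slot into an intent FST built from an example sentence
#    like "what is [number1] plus [number2]".
intent_fst = (
    pynini.accep("what is ") + number_fst + pynini.accep(" plus ") + number_fst
).optimize()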
I tested it with the Timers-And-Such benchmark, where finstreder was able to outperform the paper's baseline. Optimizing about 1M numbers (from -1000.00 to 1000.00 in different wordings) takes about 10 seconds. Benchmark code is here.
For the benchmark I wrote a custom word2num function for the opposite direction, but I think switching to Duckling makes things much easier, especially when supporting multiple languages.
Very interesting. What do you think is a use-case where the transcription benefits from it? I think it could be integrated into finstreder too, but I’m not sure where it could help.
With a restricted vocabulary/grammar in Kaldi, it seems to go to great lengths to match even garbage audio to a known sentence. I don't know if this is the case with Jaco's ASR; maybe you already have a good way to reject bad sentences.
Finally integrated finstreder into Jaco. Number and date parsing is done with Duckling (the RasaNLU module already did this too, but I had to rebuild the container, which was more complicated than planned).
Another change is that podman got replaced with docker, because podman doesn’t support Raspi’s arm32 architecture anymore. One benefit of the replacement is that I could integrate portainer, which creates a simple local website to interact with the module and skill containers.
In general Jaco has the same problem, but I found that often just the yes/no intents are matched, which normally trigger no reaction from the assistant. I will try to observe this further in the future. Maybe a simple option for Jaco would be to add a garbage intent with such words that doesn't trigger any actions.
You might be interested in the new Conformer models I've recently added to Scribosermo and Jaco. In general I've converted the pretrained models from Nvidia's NeMo to tensorflow and tflite to make them work on a RaspberryPi. I hope I can finetune them further in the future. Compared with the old QuartzNet models, the word error rate for greedy free speech recognition was cut in half, and with a language model it improved by about 20-40%. Maybe the models can help you with some ASR tasks as well. But unlike before, they now output sentencepiece-based tokens instead of alphabet-based single characters.
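In case it's useful, running one of these tflite models looks roughly like this with the TensorFlow Lite interpreter (the model file name and dummy input below are placeholders; the real models need the proper feature pipeline and a sentencepiece decoder on top):

import numpy as np
import tflite_runtime.interpreter as tflite

# Placeholder model path; a real setup would point at one of the released models.
interpreter = tflite.Interpreter(model_path="conformer_stt.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input just to show the call pattern; a real pipeline feeds audio features.
dummy = np.zeros(input_details[0]["shape"], dtype=np.float32)
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
logits = interpreter.get_tensor(output_details[0]["index"])
print(logits.shape)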
In combination with an updated finstreder SLU decoder, Jaco now achieves state-of-the-art recognition accuracy in most SLU benchmarks I’ve found.
As always all models can be found in Jaco’s repository: Jaco-Assistant / Jaco-Master · GitLab
Awesome work as always, @DANBER! I'd like to include some of these models and finstreder in the next version of Rhasspy. Have you encountered any problems getting sentencepiece running on the Pi?
No, running is no problem; the tflite STT models run about 3x faster than real time on a 64-bit Raspi 4. Training the finstreder SLU model also takes only a few seconds. The only problem might be the installation of the dependencies, which is quite resource-intensive. I think the easiest way would be to use my prebuilt docker images and run them as an external service.
Can you start other containers with Rhasspy? Or do you think it would be complicated to add?
In this case I could help you with the interface.
I can, but I’ve been considering other options for Rhasspy going forward. Rather than requiring containers, a more general approach may be something like nix or guix.
I'd be interested in building a few CLI programs for your STT models and finstreder, and packaging them with nix/guix. One program for training, and one for inference. What do you think?
I've never worked with nix before. From my short research just now, the main feature seems to be reproducible builds, which work quite well with docker too. And since the Jaco images are prebuilt, every user gets the same build. Where do you see the benefits of nix over the docker images?
With the installation of plugins/extensions. If you go the container route, you need an orchestration mechanism like pods or docker-compose to bring up all the right containers. Additionally, you need a network-based IPC method like mqtt or HTTP for communication.
My (currently untested) idea is to have a single Rhasspy container with nix installed and the nix store mounted externally. Plugins/extensions would be installed as nix packages, and accessed as regular programs by Rhasspy (Python subprocesses, etc.).
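A minimal sketch of that access pattern, assuming a hypothetical plugin program called stt-transcribe installed from the nix store:

import subprocess


def transcribe(wav_path: str) -> str:
    # "stt-transcribe" is a made-up program name standing in for a nix-installed plugin.
    result = subprocess.run(
        ["stt-transcribe", wav_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(transcribe("turn_on_the_lights.wav"))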
The dream would be for these packages to be highly reusable outside of Rhasspy, so we can get out of the N-by-M problem, where each of N plugins (wake word, STT, etc.) has to be integrated separately for each of M voice assistants (Jaco, Rhasspy, Mycroft, SEPIA, etc.).
I don't think containerization is the main reason for the incompatible modules problem. I would say it's more the different interfaces the modules and assistants use. For example, Rhasspy uses MQTT following Snips' Hermes protocol, and Jaco uses MQTT as well but with different topic names and encrypted topic contents.
Regarding the SLU module, if I understand this correctly, you would basically package the python packages of tensorflow + finstreder and some voice-assistant-specific scripts together into a single nix package. Then it could be called with a specific interface, which results in the same behavior as calling it as a containerized service…
And even where packaged modules already exist, like wake word detection with Porcupine, we still package them again to match our custom interfaces.
After reading some more about nix, I think the idea itself is quite elegant, but it also seems more complicated than using modularized docker images. I'm not sure if this should be the first problem we need to solve. So I think as a first step it would be easier and faster to add an interface translator between our MQTT topics.
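Such a translator could start out as a small script that re-publishes messages between the two topic layouts, roughly like this (the Jaco-side topic name here is made up, and since Jaco's real topic contents are encrypted, a real translator would also have to handle that):

import json

import paho.mqtt.client as mqtt

HERMES_INTENT_TOPIC = "hermes/intent/#"  # Rhasspy's Hermes intent topics
JACO_TOPIC_PREFIX = "jaco/intent/"       # hypothetical Jaco-side topic name


def on_message(client, userdata, msg):
    # Re-publish a recognized intent from the Hermes layout to the Jaco layout.
    intent_name = msg.topic.split("/")[-1]
    payload = json.loads(msg.payload)
    client.publish(JACO_TOPIC_PREFIX + intent_name, json.dumps(payload))


client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe(HERMES_INTENT_TOPIC)
client.loop_forever()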
One problem I don't understand yet is how you would solve the communication between satellites and the master if they run on different devices. You would still need some network communication like MQTT here, wouldn't you?
Another problem I'm not sure how nix solves is sandboxing; in Jaco's case it's mainly required for the skills. I would like skill devs to be able to implement almost anything they want, with any dependencies they need, in a preferably simple manner. Currently I solve this by letting each skill build a custom container if it needs one. All containers (modules + skills) are then started via an auto-generated docker-compose file. What would this look like in the nix architecture?
nix/guix by itself doesn't directly solve the problems you mentioned, but I think it would be a more flexible and composable starting point. For example, packaging up a nice command-line interface for porcupine with nix/guix would allow it to be installed on almost any Linux distribution, inside a Docker/Podman container, etc. And it would still be usable standalone.
My idea here is trying to reduce the number of "commitments" made upfront by each packaged module, which include things like a particular container technology, MQTT/HTTP/Websockets, and the specific protocol (Hermes, etc.). At a minimum, if a packaged module is a program with a "simple" command-line interface using stdin/stdout for communication (more on this if you're interested), then it can be used standalone or wrapped for something more specialized (Jaco, Rhasspy, etc.).
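As a rough illustration of that kind of interface (not the actual protocol), an NLU module could be a loop that reads one sentence per line on stdin and writes one JSON intent per line on stdout:

import json
import sys

# Toy NLU module: one sentence in per line, one JSON intent out per line.
# The intent logic here is a placeholder for a real recognizer.
for line in sys.stdin:
    text = line.strip()
    if not text:
        continue
    intent = {"intent": "GetTime" if "time" in text else "Unknown", "text": text}
    print(json.dumps(intent), flush=True)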
I think models and other artifacts would also work well as nix/guix packages, since everything is automatically hashed, and you can download files in the build process. I have a weak form of this in Rhasspy now, where STT models, etc. are downloaded in chunks from Github and their sizes (but not hashes) are checked.
Some dev news again:
Over the last few days I had some time to continue working on better cross-assistant integration and found a way to make Jaco's skills usable with Rhasspy:
The basic concept was to create a mapping to Rhasspy's sentences.ini file and between the different MQTT topics. To use a skill from Jaco's skill store, you download and install it with the tools from Jaco-Master. Then you run the preprocessing script, which builds the sentences.ini and slots files, and retrain Rhasspy. Finally you start all skill containers with the docker-compose file Jaco generated (running Jaco-Master as well is optional). You can find this interface skill here:
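To give a rough idea of the sentences.ini side of that mapping (the intent names and sentences below are invented for illustration, not taken from a real Jaco skill):

# Hypothetical skill data, as it might come out of a Jaco skill's config.
examples = {
    "get_time": ["what time is it", "tell me the time"],
    "set_timer": ["set a timer for ($number){minutes} minutes"],
}

# Write one sentences.ini section per intent, in Rhasspy's [Intent] format.
lines = []
for intent, sentences in examples.items():
    lines.append(f"[{intent}]")
    lines.extend(sentences)
    lines.append("")

with open("sentences.ini", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))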
As I already wrote at the beginning of this thread, the skill concept is a bit similar to the old Snips skills, but with several improvements regarding skill functionality, with fewer developer restrictions, and an improved security concept with topic encryption, a permission system and container isolation.
It would be great if you could test this out yourself, and if you like it, maybe add it as an official skill concept. In my opinion, skill interoperability between the open source assistants (adding support for Mycroft, Sepia and Alice through another interface skill shouldn't be complicated) is quite an important feature for everyone.
Did you continue working on the nix/guix packaging of the assistant modules? How is your progress?
This looks really cool! I'll try it out and let you know how it works.
I agree that skill interoperability is something we should strive for. It seems like for most cases you’d still need to run both systems, though. With Mycroft, for example, skills embed intents and dialogue responses as well as code for interacting with the microphone and TTS.
I’m still working on that, but for now I’m focusing on how Rhasspy (v3) talks to programs. I have a small protocol that works over stdin/stdout, and a set of “adapter” scripts that let you call existing programs. For example, you can use any STT program that accepts a WAV file and outputs the transcription as text.
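Such an adapter could be little more than a loop that reads one WAV path per line, calls the existing program, and prints the transcription (the program name below is a placeholder, and the real protocol details may differ):

import subprocess
import sys

STT_COMMAND = "my-stt-program"  # placeholder: takes a WAV file, prints text

for line in sys.stdin:
    wav_path = line.strip()
    if not wav_path:
        continue
    result = subprocess.run(
        [STT_COMMAND, wav_path], capture_output=True, text=True, check=True
    )
    print(result.stdout.strip(), flush=True)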
I’ll be writing a variety of adapter scripts, and I want to make it easy for others to contribute them.