Thanks for the links :)
Kaldi’s grammar syntax seems to be quite similar to what is described in the Snips paper. They also used Kaldi models, so this looks like a project we should try out.
I found a project called Eesen that already implements something similar, a WFST decoder for CTC models, but I’m still trying to understand it …
I also have another suggestion: in the threads of this forum I often see users trying to build custom skills for Rhasspy, but with quite different approaches. What do you think about making Jaco’s skills directly compatible with Rhasspy and vice versa? This would give all skills a similar structure and would also allow Rhasspy skills to be shared with other users through the skill store.
Since you have already built a conversion script for the dialog syntax, I think this should be quite easy. I know that not all features will be supported, but I could add an extra flag with which users can disable the automatic topic encryption, and we could restrict the additional requirements of those skills to Python libraries only (Jaco builds a container for each skill so that they can have arbitrary requirements, but I think running containers inside Rhasspy’s Docker container doesn’t work).
We only need to find an approach for intent-to-topic mapping. Jaco currently uses Jaco/Skills/SkillName/IntentName for intents and Jaco/Skills/SayText for answers.
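For illustration, the mapping could look roughly like this (the helper functions and the skill name are made up here, not Jaco’s actual internals):

def intent_topic(skill_name: str, intent_name: str) -> str:
    # Skill-owned intent topics follow the Jaco/Skills/<SkillName>/<IntentName> pattern
    return f"Jaco/Skills/{skill_name}/{intent_name}"

# All text answers are published to a single shared topic
ANSWER_TOPIC = "Jaco/Skills/SayText"

print(intent_topic("RiddleSkill", "get_riddle"))  # -> Jaco/Skills/RiddleSkill/get_riddle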
I suppose this could be abstracted away behind a common API. @koan already created a helper library for Rhasspy skills, which can be found here. The library already abstracts away all of the MQTT topics, since they are not relevant for the skill author anyway.
Jaco does something similar with skill-owned topics (you can listen to them with the intent name only: assist.add_topic_callback("get_riddle", callback_get_riddle)) and with specific other topics like text outputs (assist.publish_answer(result_sentence, message["satellite"])). But skills can also listen to other topics (system or other skills) or write to them, and I’m not sure if it makes sense to abstract those arbitrary topics.
And here’s the forum discussion where you see the evolution of its development:
If you look at the example code in the repository and the Usage page in the documentation, you can see that this indeed abstracts away all MQTT topics. The main goal of the library is to make it as easy as possible to create Rhasspy apps in Python, without having to know low-level details and with as little boilerplate code as possible.
This is very similar to what Jaco is doing. The time example for Jaco would look like this:
"""Example app to react to an intent to tell you the time."""
import os
from datetime import datetime

from jacolib import assistant

# Path to the skill's directory, which the assistant uses as its repository path
file_path = os.path.dirname(os.path.realpath(__file__)) + "/"

assist: assistant.Assistant

def callback_get_time(message):
    # Answer with the current time on the satellite that triggered the intent
    now = datetime.now().strftime("%H %M")
    assist.publish_answer(f"It's {now}", message["satellite"])

assist = assistant.Assistant(repo_path=file_path)
assist.add_topic_callback("get_time", callback_get_time)
assist.run()
Hi @synesthesiam, I made some progress using customized FSTs with the Scribosermo models. Building custom FSTs also allowed integrating the NLU information directly into the decoding graph, which means it’s possible to combine the STT+NLU steps into a single SLU model. I already tested it on some benchmarks, with very good results (competitive with Jaco’s current solution with Rasa, as well as with some other SLU approaches). And, what’s very interesting for Jaco and Rhasspy, the models can be built in a few seconds, even on a RaspberryPi.
A hopefully understandable description of the approach can be found in the readme of the project:
Next I will update Jaco to use the new SLU model (it still needs some additional features for number+date parsing in multiple languages, presumably with the help of Duckling).
This is some incredible work! I’m still digesting it all.
How will you incorporate numbers and dates into the FST? Will you pre-generate all of the possibilities during training, or use grammar FSTs?
By the way, you may be interested in the “unknown words” feature I’ve added to Rhasspy 2.5.11 (Docker preview is available). For each word in the grammar, I have a low-probability branch that goes through a collection of ~100 frequently-used words from the target language (and not used elsewhere in the grammar). This branch outputs <unk>, and acts as a way of “catching” misspoken words since (presumably) frequently-used words will be phonetically diverse. Pretty easy to code up, and seems to be effective!
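Roughly, the idea sketched in code (using pynini here purely for illustration, not the actual Rhasspy implementation):

import pynini

# ~100 frequently-used words in practice; shortened here
FREQUENT_WORDS = ["about", "other", "their", "would", "people"]

def word_with_unk(word: str, penalty: float = 10.0) -> pynini.Fst:
    normal = pynini.accep(word)
    # every filler word is mapped to the literal output token "<unk>"
    unk_branch = pynini.union(*(pynini.cross(w, "<unk>") for w in FREQUENT_WORDS))
    # prepend a weighted epsilon so the <unk> branch is only taken as a last resort
    penalized = pynini.accep("", weight=penalty) + unk_branch
    return pynini.union(normal, penalized).optimize()

# "turn" in the grammar now also accepts e.g. "their", but outputs <unk> for it
turn = word_with_unk("turn")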
I would say it’s a mix of both. The process for numbers looks like this (a rough code sketch follows the list):
all number possibilities are pre-generated in text form
the numbers are converted to a (grammar) Slot-FST
the FST is optimized → this makes it much smaller because different paths are merged (twenty-two and twenty-one both start with twenty, and after optimization the path splits at one and two instead of duplicating twenty)
the Number-FST is inserted into the Intent-FST which contains the intent examples (what is [number1] plus [number2])
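A very rough sketch of these steps (pynini and num2words are used here only for illustration; finstreder’s real implementation differs):

import pynini
from num2words import num2words

# 1. pre-generate all number possibilities in text form
number_words = [num2words(n) for n in range(0, 1001)]  # "zero" ... "one thousand"

# 2. convert the numbers to a (grammar) slot FST
number_fst = pynini.union(*(pynini.accep(w) for w in number_words))

# 3. optimize the FST: shared prefixes like "twenty-" are merged into one path
number_fst.optimize()

# 4. insert the number FST into the intent FST built from the example sentences
intent_fst = (
    pynini.accep("what is ")
    + number_fst
    + pynini.accep(" plus ")
    + number_fst
)
intent_fst.optimize()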
I did test it with the Timers-And-Such benchmark, where finstreder could outperform the paper’s baseline. Optimizing about 1M numbers (from -1000.00 to 1000.00 in different wordings) takes about 10 seconds. Benchmark code is here.
For the benchmark I wrote a custom word2num function for the opposite direction, but I think switching to Duckling makes things much easier, especially when supporting multiple languages.
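For example, with Duckling running as a local HTTP service, parsing could look roughly like this (assuming Duckling’s default port 8000 from its documentation; the rest is just a sketch):

import requests

def parse_with_duckling(text: str, locale: str = "en_US"):
    # Duckling expects form-encoded data with at least "locale" and "text"
    response = requests.post(
        "http://localhost:8000/parse",
        data={"locale": locale, "text": text},
    )
    response.raise_for_status()
    return response.json()

# Example: extract the numeric value from a spoken number
result = parse_with_duckling("twenty two")
# Each entity carries a dimension ("number", "time", ...) and a resolved value
numbers = [e["value"]["value"] for e in result if e["dim"] == "number"]
print(numbers)  # -> [22]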
Very interesting. What do you think is a use-case where the transcription benefits from it? I think it could be integrated into finstreder too, but I’m not sure where it could help.
With a restricted vocabulary/grammar in Kaldi, it seems to go to great lengths to match even garbage audio to a known sentence. I don’t know if this is the case with Jaco’s ASR; maybe you already have a good way to reject bad sentences.
Finally integrated finstreder into Jaco. Number+Date parsing is done with Duckling (the RasaNLU module already did this too, but I had to rebuild the container and this was more complicated than planned).
Another change is that podman was replaced with Docker, because podman doesn’t support the Raspi’s arm32 architecture anymore. One benefit of the replacement is that I could integrate Portainer, which creates a simple local website to interact with the module and skill containers.
In general Jaco has the same problem, but I found that often just the yes/no intents are matched, which normally don’t trigger any reaction from the assistant. I will keep observing this in the future. Maybe a simple option for Jaco would be to add a garbage intent with such words that doesn’t trigger any actions.
You might be interested in the new Conformer models I’ve recently added to Scribosermo and Jaco. Basically, I converted the pretrained models from Nvidia’s NeMo to TensorFlow and tflite to make them work on a RaspberryPi. I hope I can finetune them further in the future. Compared with the old QuartzNet models, the word error rate for greedy free speech recognition was cut in half, and with a language model it improved by about 20-40%. Maybe the models can help you with some ASR tasks as well. But unlike before, they now output sentencepiece-based tokens instead of alphabet-based single characters.
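Decoding those tokens back into text with the sentencepiece library looks roughly like this (the model path and token IDs below are placeholders, not Scribosermo’s actual values):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="scribosermo_tokens.model")

# The STT model now predicts sentencepiece token IDs instead of single characters
token_ids = [121, 54, 873]  # placeholder output of the acoustic model
text = sp.decode(token_ids)
print(text)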
In combination with an updated finstreder SLU decoder, Jaco now achieves state-of-the-art recognition accuracy in most SLU benchmarks I’ve found.
As always all models can be found in Jaco’s repository: Jaco-Assistant / Jaco-Master · GitLab
Awesome work as always, @DANBER! I’d like to include some of these models and finstreder in the next version of Rhasspy. Have you encountered any problems getting sentencepiece running on the Pi?
No, running is no problem: the tflite STT models run about 3x faster than realtime on a 64-bit Raspi-4, and training the finstreder SLU model also takes only a few seconds. The only problem might be the installation of the dependencies, which is quite resource-intensive. I think the easiest way would be to use my prebuilt Docker images and run them as an external service.
Can you start other containers from Rhasspy? Or do you think it would be complicated to add?
In this case I could help you with the interface.
I can, but I’ve been considering other options for Rhasspy going forward. Rather than requiring containers, a more general approach may be something like nix or guix.
I’d be interested in building a few CLI programs for your STT models and finstreder, and packaging them with nix/guix: one program for training, and one for inference. What do you think?
I’ve never worked with nix before. From what I’ve read in my short research just now, the main feature seems to be reproducible builds, which work quite well with Docker too. And since the Jaco images are prebuilt, every user gets the same build. Where do you see the benefits of nix over the Docker images?
With the installation of plugins/extensions. If you go the container route, you need an orchestration mechanism like pods or docker-compose to bring up all the right containers. Additionally, you need a network-based IPC method like mqtt or HTTP for communication.
My (currently untested) idea is to have a single Rhasspy container with nix installed and the nix store mounted externally. Plugins/extensions would be installed as nix packages and accessed as regular programs by Rhasspy (Python subprocesses, etc.).
The dream would be for these packages to be highly reusable outside of Rhasspy, so we can get out of the N-by-M problem, where N plugins (wakeword, STT, etc.) have to be integrated separately for each of M voice assistants (Jaco, Rhasspy, Mycroft, SEPIA, etc.).
I don’t think containerization is the main reason for the incompatible-modules problem. I would say it’s more the different interfaces the modules and assistants use. For example, Rhasspy uses MQTT following Snips’ Hermes protocol, and Jaco uses MQTT as well, but with different topic names and encrypted topic contents.
Regarding the SLU module, if I understand this correctly, you would basically package the Python packages of TensorFlow + finstreder and some voice-assistant-specific scripts together into a single nix package. Then it could be called through a specific interface, which results in the same behavior as calling it as a containerized service…
And even if there are already packaged modules, like wake-word detection with Porcupine, we still package them again to match our custom interfaces.
After reading some more about nix, I think the idea itself is quite elegant, but it also seems to be more complicated than using modularized Docker images. I’m not sure if this should be the first problem we need to solve. As a first step, I think it would be easier and faster to add an interface translator between our MQTT topics.
One problem I don’t understand yet is how you would solve the communication between satellites and the master if they run on different devices. You would still need some network communication like MQTT here, right?
Another problem where I’m not sure how nix would solve it is sandboxing; in Jaco’s case it’s mainly required for the skills. I would like skill devs to be able to implement almost anything they want, with any dependencies they need, in a preferably simple manner; currently I solve this by letting each skill build a custom container if it needs one. All containers (modules+skills) are then started via an auto-generated docker-compose file. What would this look like in the nix architecture?
nix/guix by itself doesn’t directly solve the problems you mentioned, but I think it would be a more flexible and composable starting point. For example, packaging up a nice command-line interface for Porcupine with nix/guix would allow it to be installed on almost any Linux distribution, inside a Docker/Podman container, etc. And it would still be usable standalone.
My idea here is to try to reduce the number of “commitments” made upfront by each packaged module, which include things like a particular container technology, MQTT/HTTP/WebSockets, and the specific protocol (Hermes, etc.). At a minimum, if a packaged module is a program with a “simple” command-line interface using stdin/stdout for communication (more on this if you’re interested), then it can be used standalone or wrapped for something more specialized (Jaco, Rhasspy, etc.).
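A rough sketch of how such a module could be wrapped from Python (the program name and its flag are purely hypothetical):

import subprocess

def transcribe(wav_bytes: bytes) -> str:
    # The packaged module is just a program on PATH: feed audio on stdin,
    # read the transcription from stdout. "stt-cli" is a placeholder name.
    proc = subprocess.run(
        ["stt-cli", "--format", "wav"],
        input=wav_bytes,
        capture_output=True,
        check=True,
    )
    return proc.stdout.decode("utf-8").strip()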
I think models and other artifacts would also work well as nix/guix packages, since everything is automatically hashed and you can download files in the build process. I have a weak form of this in Rhasspy now, where STT models, etc. are downloaded in chunks from GitHub and their sizes (but not hashes) are checked.