Collaboration with Jaco-Assistant project

Now it’s finished :)
The project got a new name, Scribosermo, and can now be found here:

The new models can be trained very fast (~3 days on 2x1080Ti to reach SOTA in German) and with comparatively small datasets (~280h for competitive results in Spanish). Using a bit more time and data, the following word error rates were achieved on the CommonVoice test set:

| German | English | Spanish | French |
|--------|---------|---------|--------|
| 7.2 %  | 3.7 %   | 10.0 %  | 11.7 % |

Training is even simpler than it was with DeepSpeech, and adding new languages is easy as well. After training, the models can be exported to tflite format for easier inference. They are able to run faster than real-time on a Raspberry Pi 4.
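For reference, running such an exported model with the tflite runtime could look roughly like the sketch below; the model path and the zero-filled input are placeholders, not taken from the Scribosermo documentation:

import numpy as np
import tflite_runtime.interpreter as tflite

# Load the exported model (placeholder path) and set up its tensors.
interpreter = tflite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Placeholder input; the real model expects preprocessed audio features.
features = np.zeros(inp["shape"], dtype=np.float32)
interpreter.set_tensor(inp["index"], features)
interpreter.invoke()
logits = interpreter.get_tensor(out["index"])  # per-timestep letter probabilities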

The only downside is that the models can’t be directly integrated into the DeepSpeech bindings (technically possible, but I had no need for it) and don’t support streaming anymore (at least until someone has the time to implement it). I don’t think the missing streaming feature should be a problem, because our inputs are quite short; they are usually processed in 1-2 seconds on a Raspi.


I already updated Jaco and ran the benchmarks again, which show that the new models perform really well:


@maxbachmann, if you have some free time left, I have an idea for a project, where your experience with RapidFuzz might be helpful.

Currently the different STT modules of Jaco and Rhasspy use n-gram language models (in the form of arpa or scorer files), which are based on plain sentences, to improve the predictions.
With the recent update of Jaco’s STT network to Scribosermo, the baseline performance of the model is much better than that of the Kaldi model used in Snips. But in the above SmartLights benchmark, the recognition performance of Jaco is only slightly better than that of Snips.

I think we should be able to improve the performance further if we replace the n-gram language model with a more task-specific one. The Snips paper (https://arxiv.org/pdf/1810.12735.pdf) describes a combination of an n-gram model with a pattern-based model, which I think would be a good starting point.
In general, this would replace the training sentences “turn on the light in the kitchen” and “turn on the light in the living room” with “turn on the light in the [ROOM]”. The resulting pattern can be used to build an n-gram model (which should be much smaller than before), and for the ROOM slot an extra matcher would be required, as sketched below.
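To make the pattern idea concrete, here is a toy sketch (the room list and function name are made up, this is not the Snips implementation):

# Collapse known slot values into a placeholder before building the n-gram
# model; the slot values themselves are handled by a separate matcher.
ROOMS = {"kitchen", "living room", "bedroom"}

def to_pattern(sentence: str) -> str:
    # Replace longer values first so "living room" wins over shorter overlaps.
    for room in sorted(ROOMS, key=len, reverse=True):
        sentence = sentence.replace(room, "[ROOM]")
    return sentence

print(to_pattern("turn on the light in the living room"))
# -> "turn on the light in the [ROOM]"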

The input to such a model would in our case be the direct output of the STT models. For CTC-based STT models (Scribosermo, DeepSpeech, and I think Kaldi too) this is, before rescoring with a LM, a letter probability distribution for each time step. The word “hello” might look like this: “hhellll-lllllllooooo” (using the letter with the highest probability at each time step). To get the original text, repeated letters are merged into a single letter, and then the blank symbol (-) is removed.
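The merging step itself is simple; a naive Python version (just to illustrate the rule, the real decoders do this in C++) could look like this:

def ctc_collapse(raw: str, blank: str = "-") -> str:
    # First merge runs of repeated letters, then drop the blank symbol.
    merged = []
    prev = None
    for ch in raw:
        if ch != prev:
            merged.append(ch)
        prev = ch
    return "".join(c for c in merged if c != blank)

print(ctc_collapse("hhellll-lllllllooooo"))  # -> "hello"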


What do you think?

I have absolutely no idea about machine learning, so I am probably unable to help you with that. For the character deduplication I would personally try a relatively naive implementation first and see whether it is fast enough. If really required, it could probably be improved performance-wise using a concept similar to https://lemire.me/blog/2017/04/10/removing-duplicates-from-lists-quickly/.

This wouldn’t require any knowledge of machine learning, but rather experience with fast C++ implementations for text-based problems :)
Merging “hhellll-lllllllooooo” into “hello” and combining it with an n-gram language model is already solved, for example by DeepSpeech’s ds-ctcdecoder. What I didn’t find is a library that can work with slot patterns like ROOM.

There are some papers/libraries which combine regex with edit distance measurements like the Levenshtein distance. See e.g. https://github.com/laurikari/tre
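For a quick feel of what this looks like, here is a toy example using Python’s third-party regex module, which supports approximate matching similar to TRE (pattern and input are made up):

import regex  # pip install regex

# Allow up to two edits (substitutions/insertions/deletions) in the match.
m = regex.search(r"(?:turn on the light){e<=2}", "please turn on the lihgt")
if m:
    print(m.fuzzy_counts)  # (substitutions, insertions, deletions)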

Very impressive! I will see if I can train a new model using Scribosermo :)

A few questions:

  • Do you think accuracy could be improved using phonemes instead of orthographic characters?
  • Would streaming require significant changes to the Quartznet architecture?

For Kaldi, Rhasspy also supports directly generating an FST with all of the possible sentences. This is the default, and is actually faster and more accurate than an n-gram language model. The downside, of course, is that it can never recognize sentences outside of the training set.

This sounds very close to Kaldi’s grammars, which you’re probably already familiar with. Instead of rolling our own, I wonder if we could take the output phoneme probability distributions from your DeepSpeech models and run them through a modified Kaldi FST. This would essentially swap out the acoustic model layer of Kaldi with DeepSpeech, but leave the rest intact.

Another idea I’ve had is to use a GPT-2 model in place of a traditional n-gram language model. GPT-2 is far better at tracking long-range dependencies, and may be fast enough to use as a DeepSpeech scorer (a rough scoring sketch follows after the list). Some big challenges are:

  • How can the model be tuned quickly for a specific domain?
  • Can new models be reasonably trained for other languages?
  • Are separate slots possible (e.g., “turn on the light in the [ROOM]”)?
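To make the scoring idea concrete, a rough sketch with the Hugging Face transformers library could look like this (the function name and rescoring setup are mine, this is not an existing DeepSpeech scorer API):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    # Average token log-likelihood under GPT-2; higher means more plausible.
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return -loss.item()

# Rescore CTC beam-search candidates with the neural LM:
for cand in ["turn on the light in the kitchen",
             "turn on the light in the chicken"]:
    print(cand, sentence_logprob(cand))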

No, that shouldn’t be very complicated; Nvidia already implemented an example script here. But I found that recognition is fast enough that you wouldn’t notice a great speedup with streaming.

It might be, but most of the recent STT papers use graphemes (letters) directly, so I’m not sure about that. Some use a subword approach instead of single characters, which seems to bring a small improvement.
If you want to experiment with that, you would need to retrain the English network; I used the one published by Nvidia, which made training much easier…
And I think we will get more performance improvement per invested development time if we optimize the language model rescoring and the NLU extraction.

Oh, I didn’t know about that; I’ve never trained a model with Kaldi, so I will take a look into it…

That sounds interesting, but as you already mentioned in your challenges, it has to be fast enough to be trained on a Raspi, so that we can build domain/skill-specific models directly on the device.


This seems to be what Snips used. @fastjack has pointed me to the kaldi-active-grammar project, which uses them to do dynamic decoding.

If we went with this approach, we could pre-generate grammars for numbers, dates, etc. like Snips did and stitch them into the final graph at runtime.


Thanks for the links :)
Kaldi’s grammar syntax seems to be quite similar to what is described in the Snips paper. They also used Kaldi models, so this looks like a project we should try out.


I found a project called Eesen which already implemented something similar, a WFST decoder for CTC models, but I’m still trying to understand it…


I also have another suggestion: in the threads of this forum I often see users trying to build custom skills for Rhasspy, but with quite different approaches. What do you think about making Jaco’s skills directly compatible with Rhasspy and vice versa? This would give all skills a similar structure and would also allow sharing Rhasspy skills with other users through the skill store.

Since you already have built a conversion script for the dialog syntax, I think this should be very easy. I know that not all features will be supported, but I could add an extra flag with which users can disable the automatic topic encryption, and we could restrict the additional requirements of those skills to Python libraries only (Jaco builds a container for each skill so that they can have arbitrary requirements, but I think running containers inside Rhasspy’s Docker container doesn’t work).
We only need to find an approach for intent-to-topic mapping. Jaco currently uses Jaco/Skills/SkillName/IntentName and Jaco/Skills/SayText for answers.

I suppose this could be abstracted away behind a common API. @koan already created a helper library for Rhasspy skills, which can be found here. The library already abstracts away all of the MQTT topics, since they are not relevant for the skill author anyway.

Jaco does something similar with skill-owned topics (you can listen to them with the intent name only: assist.add_topic_callback("get_riddle", callback_get_riddle)) and specific other topics like text outputs (assist.publish_answer(result_sentence, message["satellite"])). But skills can also listen to other topics (from the system or other skills) or write to them, and I’m not sure if it makes sense to abstract those arbitrary topics.

Example can be found here: https://gitlab.com/Jaco-Assistant/Skill-Riddles/-/blob/master/action-riddle.py

Thanks for the pointer, Max.

And here’s the forum discussion where you can see the evolution of its development:

If you look at the example code in the repository and the Usage page in the documentation, you see that this indeed abstracts away all MQTT topics. The main goal of the library is to make it as easy as possible to create Rhasspy apps in Python, without having to know low-level details and with as little boilerplate code as possible.

This is very similar to what Jaco is doing. The time example for Jaco would look like this:

"""Example app to react to an intent to tell you the time."""
from jacolib import assistant
from datetime import datetime

file_path = os.path.dirname(os.path.realpath(__file__)) + "/"
assist: assistant.Assistant

def callback_get_time(message):
    now = datetime.now().strftime("%H %M")
    assist.publish_answer(f"It's {now}", message["satellite"])

assist = assistant.Assistant(repo_path=file_path)
assist.add_topic_callback("get_time", callback_get_time)
assist.run()

Hi @synesthesiam, I made some progress using customized FSTs with the Scribosermo models. Building custom FSTs also allowed integrating the NLU information directly into the decoding graph, which means it’s possible to combine the STT+NLU steps into a single SLU model. I already tested it on some benchmarks, with very good results (competitive with Jaco’s current solution with Rasa, as well as with some other SLU approaches). And, what’s very interesting for Jaco and Rhasspy, the models can be built in a few seconds, even on a RaspberryPi.

A hopefully understandable description of the approach can be found in the readme of the project:

Next I will update Jaco to use the new SLU model (this needs some additional features for number+date parsing in multiple languages, presumably with the help of Duckling).


This is some incredible work! I’m still digesting it all.

How will you incorporate numbers and dates into the FST? Will you pre-generate all of the possibilities during training, or use grammar FSTs?

By the way, you may be interested in the “unknown words” feature I’ve added to Rhasspy 2.5.11 (Docker preview is available). For each word in the grammar, I have a low-probability branch that goes through a collection of ~100 frequently-used words from the target language (and not used elsewhere in the grammar). This branch outputs <unk>, and acts as a way of “catching” misspoken words since (presumably) frequently-used words will be phonetically diverse. Pretty easy to code up, and seems to be effective!
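If I understand the idea correctly, it could be sketched with pynini roughly like this (the weight value and word list are made up; Rhasspy’s actual implementation surely differs):

import pynini

grammar_word = pynini.accep("light")
frequent = pynini.union("time", "weather", "music")  # ~100 words in practice

# Map any frequent word to <unk> and penalize the branch (tropical weight,
# i.e. a negative log-probability) so it only wins on bad matches.
unk_branch = pynini.accep("", weight=4.0) + pynini.cross(frequent, "<unk>")
word_or_unk = (grammar_word | unk_branch).optimize()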


I would say it’s a mix of both. The process for numbers looks like this (a rough sketch follows after the list):

  1. all number possibilities are pre-generated in text form
  2. the numbers are converted to a (grammar) Slot-FST
  3. the FST is optimized → this makes it much smaller, because different paths are merged (twenty-one and twenty-two both start with twenty, so after optimization the path splits at one and two instead of duplicating twenty)
  4. the Number-FST is inserted into the Intent-FST, which contains the intent examples (what is [number1] plus [number2])
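A rough illustration of these steps with pynini (the real finstreder implementation may look quite different):

import pynini

# Steps 1+2: pre-generate the numbers in text form and compile a Slot-FST.
numbers = pynini.union("twenty one", "twenty two", "twenty three")

# Step 3: optimize (determinize/minimize), which merges the shared
# "twenty" prefix so the paths only split at "one"/"two"/"three".
numbers = numbers.optimize()

# Step 4: insert the Slot-FST into an intent template.
intent = pynini.accep("what is ") + numbers + pynini.accep(" plus ") + numbers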

I tested it with the Timers-And-Such benchmark, where finstreder could outperform the paper’s baseline. Optimizing about 1M numbers (from -1000.00 to 1000.00 in different wordings) takes about 10 seconds. The benchmark code is here.
For the benchmark I wrote a custom word2num function for the opposite direction, but I think switching to Duckling will make things much easier, especially when supporting multiple languages.

Very interesting. What do you think is a use case where the transcription benefits from it? I think it could be integrated into finstreder too, but I’m not sure where it would help.


With a restricted vocabulary/grammar, Kaldi seems to go to great lengths to match even garbage audio to a known sentence. I don’t know if this is the case with Jaco’s ASR; maybe you already have a good way to reject bad sentences :)

I finally integrated finstreder into Jaco. Number+date parsing is done with Duckling (the RasaNLU module already did this too, but I had to rebuild the container, and this was more complicated than planned).
Another change is that podman was replaced with docker, because podman doesn’t support the Raspi’s arm32 architecture anymore. One benefit of the replacement is that I could integrate portainer, which provides a simple local website to interact with the module and skill containers.


In general Jaco has the same problem, but I found that often just the yes/no intents are matched, which normally trigger no reaction from the assistant. I will try to observe this further in the future. Maybe a simple option for Jaco would be to add a garbage intent with such words that doesn’t trigger any actions.