Use Rhasspy in open transcription mode with Tock for intent recognition and externalized intent handling

Hello all,

I’m trying to do the following:

  • Use Rhasspy for wake word detection (WWS) + STT + TTS
  • Use Tock for intent recognition and entity extraction

About Tock: it’s an NLP platform created by SNCF, the French national railway company, and released as open source (http://doc.tock.ai/tock/en/)

Project description:

In my project, Tock will receive STT messages from Rhasspy, which points to Tock’s REST API (Intent Recognition > Remote HTTP).
By feeding Tock with STT messages, I can teach it how to match the relevant intents and extract entities.

For example:

  • I say “allume la cuisine” (turn on the kitchen light) to Rhasspy.
  • Tock will receive “allume la cuisine”.
  • I tell Tock that the corresponding intent is “lumieres” and there are 2 entities:
  • “action” --> “allume” (in English: “action” --> “turn on”)
  • “piece” --> “cuisine” (in English: “room” --> “kitchen”)

The next time I say “allume la cuisine”, Tock will answer back with the corresponding intent/entities for the sentence.
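For reference, here is roughly what that round trip looks like from the middleware side. This is only a minimal Python sketch: the Tock endpoint path, payload fields, and response shape below are assumptions to be checked against the Tock documentation for your version.

    import requests

    # Hypothetical Tock NLP endpoint and payload: verify both against your install.
    TOCK_NLP_URL = "http://tock-nlp:8888/rest/nlp"

    def query_tock(text):
        """Send an STT result to Tock and return the recognized intent/entities."""
        payload = {
            "namespace": "app",            # assumed namespace
            "applicationName": "rhasspy",  # assumed application name
            "language": "fr",
            "queries": [text],
        }
        response = requests.post(TOCK_NLP_URL, json=payload, timeout=5)
        response.raise_for_status()
        return response.json()

    print(query_tock("allume la cuisine"))
    # expected: intent "lumieres" with entities "action" and "piece"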

The major improvement for me comes here:
The more I speak with Tock, the more new sentences it will learn, and thus be able to handle.
After some time, Tock will be autonomous and precise enough to “guess” the correct intent and entities for new sentences (without needing me!).
So, even if the STT is not precise, Tock should still be able to “guess” the spoken order and apply the relevant answer.
This will be useful to decorrelate the way an order is phrased from the expected result, so the system will be able to understand different human speakers, even if each one has their own way of phrasing an order.
The corresponding action should, in theory, always be triggered.

The gains for me are:

  • No need to maintain sentences.ini on Rhasspy (everything will be centralized in Tock)
  • No restricted/closed sentence list anymore (as I would like to use open transcription)
  • No more need to anticipate, in sentences.ini, every way an order could be phrased to make sure it will match (Tock’s AI will manage this)
  • The system will become able to handle a virtually unlimited number of orders thanks to Tock’s AI, which will just need some time to learn.

Current project status:

  • Making Rhasspy and Tock interact is OK (using their respective APIs, with Node-RED in the middle)

  • Sending TTS feedback for orders is not done yet (using Node-RED and Rhasspy’s /tts/say API, I think it will not be too challenging; see the sketch after this list)

  • STT from Rhasspy + intent recognition in Tock + action triggering is OK (even with new sentences that Tock had to learn by itself), but only with the sentences described in sentences.ini
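For the TTS feedback part, here is a minimal sketch of what the Node-RED flow should boil down to, written as Python for readability. It uses Rhasspy’s /api/text-to-speech HTTP endpoint (the HTTP counterpart of the hermes/tts/say MQTT topic); the host, port, and siteId below are placeholders for my setup.

    import requests

    RHASSPY_URL = "http://rhasspy-master:12101"  # placeholder host/port

    def say(text, site_id="satellite1"):
        """Speak a confirmation on a given satellite via Rhasspy's HTTP API."""
        response = requests.post(
            RHASSPY_URL + "/api/text-to-speech",
            params={"siteId": site_id},   # which satellite should speak
            data=text.encode("utf-8"),
            timeout=10,
        )
        response.raise_for_status()

    say("J'allume la cuisine")  # spoken feedback once the intent is handled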

The problem I’m facing:

My setup is currently able to handle any sentence registered in sentences.ini (without intent/entity tags).
Example:

  • Before:

    [lumieres]
    (allumer | éteindre | allume | éteins){action} (le | la |) (toutes les lumières | salon | séjour | cuisine | chambre 1 | chambre une | buanderie | réserve | jardin | bureau | salle de bain | miroir | couloir | lumière de devant | escalier | wc | toilettes | palier){piece}

  • Current:

    [lumieres]
    (allumer | éteindre | allume | éteins) (le | la |) (toutes les lumières | salon | séjour | cuisine | chambre 1 | chambre une | buanderie | réserve | jardin | bureau | salle de bain | miroir | couloir | lumière de devant | escalier | wc | toilettes | palier)

So, since my orders are recognized and the relevant actions get triggered, I suppose my setup is working (am I wrong?).

The only thing I’m stuck on is that I cannot clear sentences.ini and switch to open transcription mode.
I’m using Kaldi. If I clear sentences.ini, select open transcription mode, set the mixed language model weight to 1, restart Rhasspy, download the open transcription models, and re-train, Rhasspy’s STT is still unable to transcribe sentences not registered in sentences.ini (I get irrelevant results, even when I say simple single words like “allume”).

It behaves as if the open transcription model were not being used.
Is what I’m trying to achieve not possible?
Is open transcription the right option to achieve what I need?

Thank you for your help

Best regards

Hi @jerome83136_tux,

The mixed weight is likely the problem here. This mode is not well tested. Have you tried just using open transcription without setting the mixed weight?

Going forward, it may be worth waiting a little bit until I add support for Vosk. It has an interesting in-between mode, where you give it all the possible words (vocabulary), and it will recognize sentences containing those words.
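For reference, a rough sketch of what that looks like with Vosk’s current Python API: the recognizer accepts an optional JSON list of allowed words and then constrains decoding to that vocabulary. The model path and WAV file below are placeholders (the WAV is assumed to be 16 kHz, 16-bit mono PCM):

    import json
    from vosk import Model, KaldiRecognizer

    model = Model("vosk-model-fr")  # placeholder: path to a French Vosk model

    # Constrain recognition to a fixed vocabulary; "[unk]" absorbs anything else.
    vocabulary = ["allume", "éteins", "le", "la", "cuisine", "séjour", "[unk]"]
    recognizer = KaldiRecognizer(model, 16000, json.dumps(vocabulary))

    with open("command.wav", "rb") as wav:
        wav.read(44)  # skip the WAV header
        while True:
            chunk = wav.read(4000)
            if not chunk:
                break
            recognizer.AcceptWaveform(chunk)

    print(recognizer.FinalResult())  # JSON with the recognized words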

Hi @synesthesiam,
Thank you for your reply.

I’ve tried both “Mixed Language Model Weight” values 0 and 1 with open transcription activated, with no luck.

I’m surprised by the training times I observed with/without open transcription:

kaldi + flst + weight = 0 --> Training completed in 1.69 second(s)
kaldi + flst + weight = 1 --> Training completed in 18.60 second(s)
kaldi + flst + open trans + weight = 0 --> Training completed in 2.59 second(s)

Without open transcription activated, I can see a big difference when changing the “Mixed Language Model Weight”. I think this is normal, because another language model is used for the mixing.

But when I activate open transcription, training time falls to ~3 seconds.
I find that strange, because open transcription is supposed to use a huge dictionary, so training time should be longer, right?

I wonder if I activated open transcription correctly. May I ask you to confirm, please?

"speech_to_text": {
    "kaldi": {
        "mix_weight": "0",
        "open_transcription": true
    },
    "satellite_site_ids": "master,satellite1,satellite2,satellite3,rhasspy-sejour,rhasspy-rdc,rhasspy-salon,rdc.respeakercore2,rdc.raspberrypi3,etage.echo1,etage.echo2",
    "system": "kaldi"
},

NB: I tried removing all the Kaldi files before each test (rm /rhasspy/profiles/fr/kaldi*).
Rhasspy asked to re-download the dictionaries every time, as expected, but it didn’t help.

Anyway, if you think it’s better to wait for the Vosk implementation in Rhasspy, I will :slight_smile: (and I’m impatient to test it)

Thank you again

Best regards


Training time for open transcription should be minimal, since it’s just using a pre-trained (large) model. Mixing absolutely takes a lot of time (mix weight > 0).

It looks like you’re setting open transcription properly. Are you still trying to use an intent recognizer? You should probably disable that, since it will mostly just cause “intent not recognized” errors.
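One quick way to verify which model is actually being used is to take the microphone out of the equation and POST a prerecorded WAV directly to the /api/speech-to-text endpoint. A minimal sketch, with host and file name as placeholders:

    import requests

    RHASSPY_URL = "http://localhost:12101"  # placeholder master host/port

    # Send a prerecorded WAV directly to the ASR to see what the
    # (hopefully open) transcription model returns for it.
    with open("allume.wav", "rb") as wav:
        response = requests.post(
            RHASSPY_URL + "/api/speech-to-text",
            data=wav.read(),
            headers={"Content-Type": "audio/wav"},
            timeout=30,
        )
    print(response.text)  # raw transcription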

OK, so it seems I didn’t really understand how open transcription works. Thanks for the update on training times.

I disabled intent recognition (it was enabled) and tried saying:

  • “commande une pizza” (order a pizza)
    The result was:

[DEBUG:2021-05-23 23:00:46,776] rhasspyasr_kaldi_hermes: Transcription result: Transcription(text='comment pizza', likelihood=0, transcribe_seconds=1.997281896066852, wav_seconds=2.112, tokens=[TranscriptionToken(token='comment', start_time=0.00188111, end_time=0.780435, likelihood=0.535767), TranscriptionToken(token='pizza', start_time=1.07749, end_time=1.1369, likelihood=0.417615)])

  • “commande”
    The result was:

[DEBUG:2021-05-23 23:03:07,318] rhasspyasr_kaldi_hermes: Transcription result: Transcription(text='comment', likelihood=0.9670972, transcribe_seconds=1.7813964639790356, wav_seconds=1.984, tokens=[TranscriptionToken(token='comment', start_time=0.00677444, end_time=1.08622, likelihood=0.998006)])

  • “pizza”
    The result was:

[DEBUG:2021-05-23 23:02:03,946] rhasspyasr_kaldi_hermes: Transcription result: Transcription(text='pizza', likelihood=0, transcribe_seconds=1.9157036510296166, wav_seconds=1.984, tokens=[TranscriptionToken(token='pizza', start_time=0.0, end_time=1.05, likelihood=0.19847)])

I looked in the dictionaries, and it seems the words “commande”, “une”, and “pizza” are all present:

/rhasspy/profiles/fr/kaldi/base_dictionary.txt:commande k O m A~ d
/rhasspy/profiles/fr/kaldi/dictionary.txt:commande k O m A~ d
/rhasspy/profiles/fr/kaldi/dictionary.txt:une y n
/mnt/docker-data/rhasspy/.config/rhasspy/profiles/fr/kaldi/base_dictionary.txt:pizza p i d z a
/mnt/docker-data/rhasspy/.config/rhasspy/profiles/fr/kaldi/dictionary.txt:pizza p i d z a

But it seems Kaldi is still unable to understand the full sentence.

I also tried something Kaldi understands without open transcription (from sentences.ini):

  • I said “allume le séjour” (turn on the living room light)

I got:

[DEBUG:2021-05-23 23:08:46,648] rhasspyasr_kaldi_hermes: Transcription result: Transcription(text='allo le séjour', likelihood=0.43633999999999995, transcribe_seconds=2.359589765081182, wav_seconds=2.432, tokens=[TranscriptionToken(token='allo', start_time=0.0, end_time=0.697323, likelihood=0.451953), TranscriptionToken(token='le', start_time=0.697323, end_time=1.20103, likelihood=1.0), TranscriptionToken(token='séjour', start_time=1.20184, end_time=2.43, likelihood=0.996929)])

Or this:
[DEBUG:2021-05-23 23:09:32,886] rhasspyasr_kaldi_hermes: Transcription result: Transcription(text='le séjour', likelihood=0.883427, transcribe_seconds=2.717036842019297, wav_seconds=2.88, tokens=[TranscriptionToken(token='le', start_time=0.101252, end_time=1.70726, likelihood=1.0), TranscriptionToken(token='séjour', start_time=1.70726, end_time=2.88, likelihood=0.99093)])

(“allume”, “le”, and “séjour” are also all in the Kaldi dictionaries)

Do I need to tune some Kaldi or voice recording parameters?

Thank you again

Best regards

Update:

By speaking louder, I was able to get “commande une pizza” understood by Kaldi.

So I think this is just some kind of “voice capture precision” problem.

Speaking louder for “allume le séjour” didn’t work, unfortunately.

Do you think I could get better results with some “Voice Command Settings” tuning?

Regards

You might get slightly better results by tuning, but it will probably never be very accurate.
This is the big problem with using purely open transcription.
It uses a large language model that was trained on a text corpus chosen to represent the language in general well.
This leads to two problems:
First, the corpus the model was trained on might not actually contain the words you want it to recognize. Even if it does, those words and sentences might not be very likely within the scope of the large model, so they will almost never be chosen by the Kaldi STT process, simply because other possibilities are more likely.
Second, the accuracy of the combination of acoustic model and language model used in modern offline STT systems like Kaldi is often not that great in an open, real-world setting, especially when the input deviates from the material the model was trained on: the microphone used, the environment, or even the accent or gender of the speaker.

Those are the reasons why Rhasspy and most other offline assistants train some kind of domain-specific language model for a limited set of sentences and vocabulary; right now, this is the only reliable way to improve accuracy in such offline settings.
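To make the first point concrete, here is a toy illustration in plain Python (no Kaldi involved; both corpora are made up for the example) of why a small domain corpus assigns much more probability to an in-domain command than a general corpus does:

    from collections import Counter

    def sentence_prob(sentence, corpus):
        """Toy unigram probability of a sentence under a corpus,
        with add-one smoothing so unseen words don't zero it out."""
        words = corpus.split()
        counts = Counter(words)
        vocab_size = len(counts) + 1
        total = len(words)
        prob = 1.0
        for w in sentence.split():
            prob *= (counts[w] + 1) / (total + vocab_size)
        return prob

    # A tiny "general" corpus vs. a tiny "domain" corpus.
    general = "comment vas tu je voudrais une pizza le temps est beau comment faire"
    domain = "allume le séjour éteins le séjour allume la cuisine éteins la cuisine"

    command = "allume le séjour"
    print(sentence_prob(command, general))  # tiny: "allume" never occurs here
    print(sentence_prob(command, domain))   # much larger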

Johannes

Hello,
Thank you for explaining.
I now understand why what I’m trying to achieve here will not be as easy as I expected.

So my setup is “technically” working thanks to @synesthesiam’s support, but it seems I will have to find another way to address the “open transcription” requirement.

Maybe with the upcoming Vosk implementation in Rhasspy :wink:

Thank you for your help

Best regards
