Multilingual profiles

Libbum · January 27, 2020, 10:22am

Hi all,

Just starting to set up Rhasspy now. One drawcard for me was the “Rhasspy is an offline, multilingual voice assistant toolkit” portion of the README. Perhaps, though, I interpreted this incorrectly.

I live in a multilingual household, and ultimately would like to have English, Swedish and Mandarin interchangeable. Seemingly, whilst rhasspy supports all three languages, I can only choose one profile from this set at a time.

Would it be possible to create a combined profile? Would it be as straightforward as editing, say, the Swedish profile.json to include the other two endpoints, and merging the profile folders/additional files together in the same manner?

Or is this something that fundamentally isn’t really possible with the current architecture of rhasspy?

synesthesiam · January 27, 2020, 10:24pm

Hi @Libbum, this is a great question. I actually ended up changing the wording in the docs to avoid this misconception (from “multilingual” to “support for many languages”). Looks like I forgot to update the README.

This is not possible with Rhasspy in a single profile, but could be done with some tweaking and by running multiple instances of Rhasspy. At a high level, there would be a backend Rhasspy instance per language, a satellite listening for wake words (one per language), and a program to direct audio at the right Rhasspy instance (NodeRED).

Here are the details of how I would do it:

1. Run a separate instance of Rhasspy for each language
   - Disable wakeword
   - Use MQTT and a unique siteId per instance
   - Use the Hermes microphone for audio input over MQTT
2. Run a wakeword service (snowboy, porcupine) on a satellite device with a different model for each language
3. Use the webhook feature to catch the wake-up event using a NodeRED flow
4. Switch on the wakewordId property and stream audio over MQTT to the corresponding siteId
   - See hermes-audio-server for an example
5. Catch hermes/intent/<INTENT_NAME> or hermes/nlu/intentNotRecognized MQTT events in NodeRED and shut audio off
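
The routing part of the steps above (switch on wakewordId, stream audio to the matching siteId, shut audio off once an intent result arrives) can be sketched as a small state machine. This is only an illustration in Python rather than a NodeRED flow: the wake word IDs, the site IDs and the satellite siteId `satellite` are made-up names, and the MQTT client that would call these methods is left out.

```python
from typing import Optional

# Hypothetical mapping from wake word model IDs to the siteId of the
# Rhasspy instance handling that language (all names are assumptions).
WAKEWORD_TO_SITE = {"hey_en": "en", "hey_sv": "sv", "hey_zh": "zh"}


class AudioRouter:
    """Tracks which Rhasspy instance, if any, should currently receive audio."""

    def __init__(self, satellite_site_id: str = "satellite") -> None:
        self.satellite_site_id = satellite_site_id
        self.active_site: Optional[str] = None

    def on_wake(self, wakeword_id: str) -> None:
        # Called when the wake-up event arrives: pick the instance for
        # this wake word (unknown IDs leave audio switched off).
        self.active_site = WAKEWORD_TO_SITE.get(wakeword_id)

    def route(self, topic: str) -> Optional[str]:
        """Return the topic to republish an audio frame on, or None to drop it."""
        source = f"hermes/audioServer/{self.satellite_site_id}/audioFrame"
        if topic != source or self.active_site is None:
            return None
        return f"hermes/audioServer/{self.active_site}/audioFrame"

    def on_message(self, topic: str) -> None:
        # Shut audio off once an intent was recognized or recognition failed.
        if topic.startswith("hermes/intent/") or topic == "hermes/nlu/intentNotRecognized":
            self.active_site = None
```

In a real flow, `on_wake` would be fed from the webhook payload and `route` would be called for every incoming audio frame, republishing the unchanged payload on the returned topic.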

esdeboer · February 9, 2020, 9:16am

@synesthesiam wouldn’t it be an idea to make the topic where Rhasspy listens for wake words configurable? You could then run multiple wakeword services with different models on the satellites, each posting on its own MQTT topic and therefore waking the right Rhasspy instance. This could also be used if you want to use different STT engines (for example, Kaldi with an open model or Google for open-ended questions, and Kaldi with a closed model for home automation).

Libbum · February 10, 2020, 10:14am

Thank you both for your input here. I’ll play around with a few methods in this case and see what I can come up with. May take some time, but will report back here once I have something cobbled together.

voice · April 9, 2020, 2:05am

Hey, how did it go? Did it work?

How about putting everything on the same machine, would it be possible?

gplaza · June 1, 2020, 4:40pm

Hello @synesthesiam,

I understand all of the first part:

2 instances ok
2 wake words ok
1 wake-up webhook ok

For me the difficulty is in number 4. A lot of the work by koan & Co is oriented towards creating apps with AppDaemon to manage intents, but managing the audio stream is not (for the moment) considered.

My concrete question: how do I send the audio stream to a specific master (switch between 2 instances)?

My first idea was just:

have the 2 instances on the same MQTT IP but with separate ports
have a Python app that detects an intent (e.g. “do you speak French?”), then sends a TTS say with “ok”, sends the satellite a new JSON config part with the new MQTT port via the HTTP API, and POSTs api/restart …

but changing the MQTT server is not a good idea, and it breaks the logic of a central MQTT server.

koan · June 1, 2020, 7:04pm

To create apps that can react to intents in multiple languages, I proposed to add a lang attribute to some ASR and NLU topics in the Hermes protocol. But the wake word component could be extended with the same attribute. Then you define a wake word model for each language, and the wake word component adds a lang attribute for this language to its Hermes topics. Maybe the dialogue manager could then use this attribute to activate the right Rhasspy instance with the profile for this language, so the right ASR and NLU profiles will be used. If every component only acts on messages with the language of their profile, you could run multiple instances of an ASR and NLU in parallel, each for their own language. This would need some changes in Rhasspy, but would this approach work, @synesthesiam?

synesthesiam · June 1, 2020, 9:12pm

This should work as long as we also add lang to the wakeword detected and ASR startListening messages too. Then the dialogue manager could simply copy lang from detected into startListening, then from textCaptured into NLU query, etc. The NLU system could copy lang into the outgoing intents (and intentParsed).
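
As a rough sketch of that copying step (not Rhasspy’s actual dialogue manager), the function below takes an incoming Hermes message and builds the follow-up message, carrying `lang` along untouched. The `lang` attribute is the proposal from this thread, and the payloads are reduced to a few fields for illustration.

```python
from typing import Any, Dict, Tuple

Message = Tuple[str, Dict[str, Any]]


def next_message(topic: str, payload: Dict[str, Any]) -> Message:
    """Return the follow-up (topic, payload), copying `lang` if present."""
    site_id = payload.get("siteId", "default")

    if topic.startswith("hermes/hotword/") and topic.endswith("/detected"):
        # Wake word detected: ask the ASR to start listening.
        out_topic = "hermes/asr/startListening"
        out: Dict[str, Any] = {"siteId": site_id}
    elif topic == "hermes/asr/textCaptured":
        # Transcription finished: hand the text to the NLU.
        out_topic = "hermes/nlu/query"
        out = {"input": payload["text"], "siteId": site_id}
    else:
        raise ValueError(f"no follow-up defined for topic {topic!r}")

    # `lang` is the proposed attribute; it is passed through unchanged.
    if "lang" in payload:
        out["lang"] = payload["lang"]
    return out_topic, out
```

Because the attribute is only copied, never interpreted, the dialogue manager itself stays language-agnostic; only the ASR and NLU components need to look at it.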

I can think of another way to do this, though, with only minor changes to the microphone/wake services:

Modify the microphone services to be able to output to multiple UDP hosts/ports
Modify the wake services to forward UDP audio to MQTT while awake (between ASR startListening and stopListening)
Run two copies of Rhasspy (different profiles/languages) with internal MQTT brokers
- Have audio recording disabled on both Rhasspy instances
- Set unique UDP ports for each Rhasspy’s wake word
Run the microphone service standalone, streaming over UDP to both instances at the same time

Now, when you say one of the wake words, that Rhasspy should wake up and start streaming audio to its ASR component. The whole ASR/NLU/TTS cycle will happen inside a single Rhasspy profile, so you don’t have to worry about other languages.
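
The first modification (a microphone service that outputs to multiple UDP hosts/ports) boils down to sending each audio chunk to every target. A minimal sketch with Python’s standard `socket` module, assuming the chunks are already-encoded opaque bytes; the function name and the example addresses are made up, and the framing Rhasspy expects on its UDP audio ports is not modelled here.

```python
import socket
from typing import Iterable, List, Tuple

Address = Tuple[str, int]


def fan_out(chunks: Iterable[bytes], targets: List[Address]) -> int:
    """Send every audio chunk to every UDP target; return datagrams sent."""
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for chunk in chunks:
            for target in targets:
                # Each Rhasspy instance gets its own copy of the chunk.
                sock.sendto(chunk, target)
                sent += 1
    return sent
```

With two instances on one machine, the call could look like `fan_out(mic_chunks, [("127.0.0.1", 12202), ("127.0.0.1", 12203)])`, where both ports are placeholders for whatever each instance’s wake word service listens on.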

Just an idea

koan · June 1, 2020, 9:48pm

But then you have two completely separated internal MQTT brokers, and you can’t react to the Hermes MQTT messages on other devices, right?

gplaza · June 2, 2020, 12:06am

Yes, for the moment I have two MQTT brokers (two Docker containers with distinct redirected internal MQTT ports) … but it’s a bad idea … I’m looking for a solution to switch the audio stream between two instances on the same MQTT broker.

french stream => master 1
spanish stream => master 2

In reality it’s based on spychokiller’s idea with Snips for switching between languages: use a particular intent, “do you speak xxx language?”, to reset the configuration of the satellite. This system works with only one wakeword.

Exactly! I always lose the possibility to react to messages from one or the other Hermes MQTT broker …

koan · June 2, 2020, 8:45am

The more I think about it, the more it seems to me that using one (external) MQTT broker and letting the dialogue manager copy a lang attribute from the hermes/hotword/<wakeword_id>/detected message all the way to hermes/intent/<intent_name> is the most flexible approach.

This gives the users and developers maximum architectural freedom. The two extremes are:

You could run one Rhasspy instance with all components for the fr language and the other one for en. Each component handles Hermes messages with the same language as its profile and ignores all other messages.
Or you could run all Rhasspy components independently as Docker containers. Some of the components, such as the NLU and ASR, will be duplicated: one with a fr profile, the other with an en profile. These components ignore Hermes messages from another language.

In practice, you would run a mix of these extreme scenarios. For instance, you would run separate NLU and ASR instances for each language, but the same TTS component could handle hermes/tts/say messages with fr or en and just switch internally to another language output. With the right approach, even Rhasspy apps handling the intents can internally switch on the fly to another language.

This approach is also general enough to pave the way for later additions such as automatic language identification (you can find examples with TensorFlow and I found an interesting paper too): then you wouldn’t need a separate wake word for each language (which is, after all, a clever but ugly hack), but Rhasspy would be able to recognize the language of your command and then forward the audio to the correct ASR component (e.g. French or English). The lang attribute would be copied over too in that case and the rest of the flow stays the same.

synesthesiam · June 2, 2020, 6:29pm

I agree: lang is probably the best approach. I can imagine a future Rhasspy ASR system that detects the language and then forwards audio to an appropriate sub-ASR system.

That reminds me, at some point we may want to add more structure to the notion of “site id”. One site may have multiple ASR systems, as in the example above…

koan · June 2, 2020, 6:37pm

Well, I used “forward”, but that’s actually not the correct way to describe it. As it’s all just MQTT messages going through the broker, each ASR just subscribes to the relevant MQTT messages and the French ASR ignores all messages except the ones with "lang": "fr", the English ASR ignores all messages except the ones with "lang": "en" and so on. So at least for multilingual profiles I don’t think we need more structure to the notion of a site ID: it seems perfectly possible to have multiple ASR systems running on the same site, as long as there are no two ASR components that react to the same language.
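
The filtering itself is tiny. A sketch of what each component could do with an incoming MQTT payload, assuming JSON payloads carrying the proposed `lang` attribute; the function name is hypothetical, and dropping messages that have no `lang` at all is a choice made for this example, not something settled in the thread.

```python
import json
from typing import Optional


def accept(raw_payload: bytes, profile_lang: str) -> Optional[dict]:
    """Return the decoded message if its `lang` matches this component's profile."""
    message = json.loads(raw_payload)
    if message.get("lang") != profile_lang:
        # Not our language: stay silent, another component will react.
        return None
    return message
```

A French ASR would call `accept(payload, "fr")` on every message it is subscribed to and only act when the result is not `None`.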

kusi · April 15, 2021, 8:55pm

Hello, I’m interested in a multi-language setup, too.

I’ve seen on this issue that the detected language will be specified in the JSON file.

What is now still missing for a multilingual setup with one wake word for each language?

Will it be possible to run Rhasspy in a single container, or is one container per language still needed?

Thanks a lot for any update, this is a really exciting project.

albertmon · November 3, 2022, 1:04pm

Hi,

I just succeeded in implementing multiple languages, but I am not sure this is a good approach.

I have a complete system with Rhasspy (siteId: nl = Dutch language), Domoticz, Mosquitto and Node-RED. Kodi runs on another Pi.
Config:

SiteId: nl
MQTT: External (on my pi-nl)
Audio recording: PyAudio
Wake Word: Porcupine (porcupine…ppn)
Speech to Text: Kaldi
Intent Recognition: Fsticuffs
Text to Speech: Espeak
Audio Playing: aplay
Dialogue Management: Rhasspy
Intent Handling: Local Command

I also wanted an English voice interface, so I used an old laptop running Debian to install another Rhasspy instance (siteId: en = English language).
Config:

SiteId: en
MQTT: External (on my pi-nl)
Audio recording: Hermes MQTT
Wake Word: Porcupine (blueberry…ppn)
Speech to Text: Kaldi
Intent Recognition: Fsticuffs
Text to Speech: Espeak
Audio Playing: Remote HTTP (http://pi-nl:12101/api/wav)
Dialogue Management: Rhasspy
Intent Handling: Local Command

In Node-RED I catch all MQTT messages with topic hermes/audioServer/nl/audioFrame and republish them to MQTT with topic hermes/audioServer/en/audioFrame
In Node-RED I catch all MQTT messages with topic hermes/audioServer/en/playBytes/# and replace the en in the topic with nl, then send the result to MQTT
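
For readers who do not use Node-RED, the two rewrites can be written as one pure function. This is a sketch of the topic mapping only, using the nl and en site IDs from this setup; the function name is made up, and the MQTT client that would subscribe and republish the unchanged payload is left out.

```python
from typing import Optional


def rewrite_topic(topic: str) -> Optional[str]:
    """Map a topic to the one it should be republished on, or None to ignore it."""
    parts = topic.split("/")
    # Expected shape: hermes/audioServer/<siteId>/<command>[/<requestId>]
    if len(parts) < 4 or parts[:2] != ["hermes", "audioServer"]:
        return None
    site_id, command = parts[2], parts[3]
    if site_id == "nl" and command == "audioFrame" and len(parts) == 4:
        parts[2] = "en"  # microphone audio: Dutch instance -> English instance
    elif site_id == "en" and command == "playBytes":
        parts[2] = "nl"  # audio output: English instance -> Dutch instance's speaker
    else:
        return None
    return "/".join(parts)
```

Note that frames republished on hermes/audioServer/en/audioFrame are not rewritten again, so the mapping cannot loop.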

The only differences in my English version are: SiteId, Audio Recording, Audio Playing and Wake Word, and of course the sentences.ini and slots files.

It works, is simple and no other config is necessary. But is it a correct solution?