Changing the language for part of the sentence?

Would it be possible to have a part of a sentence in another language?

The use case is controlling a music player (either locally through MPD or from a remote source, e.g. via youtube-dl). For instance, imagine that you speak a language other than English but want to ask for an English song name and album.

Example: Veuillez jouer la chanson “With or without you” de la band “U2”.


That is a great topic. Commercial voice assistants (like Amazon Echo or Google Home) do that quite well.

It is important for people living in a non-English-speaking country. Music, films etc. often have English titles.

To achieve this we would need a speech-to-text (STT) model trained on multiple languages (can we do that with Kaldi somehow?). For words with the same spelling in both languages but different pronunciation, we would need a way to tell the system in the sentence/slot which pronunciation to expect.

Another probably more realistic approach that may be possible today (with some tinkering):

Create an intent like “Play an (english|french) {language} song” and then handle that intent (from a Python script connecting to MQTT) with a DialogContinueSession saying “What song do you want to hear?” and feed the user's answer into an STT system chosen according to the language. I am not exactly sure how to configure the last part, but it should be doable (maybe with a second Rhasspy instance and some coding).
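A minimal sketch of the first half of that idea, in Python. It assumes that “DialogContinueSession” corresponds to the Hermes topic `hermes/dialogueManager/continueSession`, that the intent arrives as Hermes-style JSON with a `sessionId` and a `slots` list, and that the slot is named `language` as in the sentence above. The function only builds the reply; publishing it with an MQTT client and routing the follow-up audio to a language-specific STT are left out, since that is the part I am unsure about.

```python
import json

# Assumed Hermes topic for continuing an open dialogue session.
CONTINUE_TOPIC = "hermes/dialogueManager/continueSession"


def get_slot(intent, name, default=None):
    """Return the value of the first slot called `name`, or `default`."""
    for slot in intent.get("slots", []):
        if slot.get("slotName") == name:
            return slot.get("value", {}).get("value", default)
    return default


def build_continue_session(intent_payload, default_language="french"):
    """Build (topic, payload) asking the user for the song title.

    The chosen language is stored in customData so that the handler of the
    next message in this session knows which STT it should use.
    """
    intent = json.loads(intent_payload)
    language = get_slot(intent, "language", default_language)
    message = {
        "sessionId": intent["sessionId"],
        "text": "What song do you want to hear?",
        "customData": json.dumps({"song_language": language}),
    }
    return CONTINUE_TOPIC, json.dumps(message)
```

In a real script the returned topic and payload would be passed to the `publish` call of whatever MQTT client is in use, from inside the callback that receives the intent message.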

Simpler to set up: two Rhasspy instances, one with a French and one with an English profile, each using a different wake word. You would have to say the command in the same language as the song title though.

You could also try to generate custom words for every English title, providing the English phonetic pronunciation. I tried that with a few examples and it seems to kind of work. But without a simple way to generate the phonetic representations for a whole mixed-language song library, that seems practically impossible.
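For what it's worth, here is a rough Python sketch of how that generation could be automated. It assumes three things that are not given here: an English pronunciation dictionary loaded as a word-to-phonemes mapping, a hand-written table translating English phonemes into the phoneme set of the non-English profile, and a custom words file with one `word phoneme phoneme ...` entry per line. The dictionary entries and the mapping below are toy values for illustration only, not real profile data.

```python
def custom_word_lines(titles, english_dict, phoneme_map):
    """Return (lines, unknown_words) for a list of song titles.

    Titles are split on whitespace only; punctuation handling is left out.
    Words missing from the dictionary are collected so they can be
    transcribed by hand.
    """
    lines, unknown, seen = [], [], set()
    for title in titles:
        for word in title.lower().split():
            if word in seen:
                continue
            seen.add(word)
            phonemes = english_dict.get(word)
            if phonemes is None:
                unknown.append(word)
                continue
            # Phonemes without a mapping are passed through unchanged.
            mapped = [phoneme_map.get(p, p) for p in phonemes]
            lines.append(word + " " + " ".join(mapped))
    return lines, unknown


# Toy English dictionary entries (illustrative, ARPAbet-like symbols).
ENGLISH_DICT = {
    "with": ["W", "IH", "DH"],
    "or": ["AO", "R"],
    "you": ["Y", "UW"],
}

# Hypothetical mapping into the target profile's phoneme set.
PHONEME_MAP = {"W": "w", "IH": "i", "DH": "z", "AO": "o", "R": "r", "Y": "j", "UW": "u"}
```

Calling `custom_word_lines(["With or without you"], ENGLISH_DICT, PHONEME_MAP)` gives three lines for "with", "or" and "you" and reports "without" as unknown. The hard part is still the mapping table, which has to be written once per language pair.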

There is some overlap here with a recent discussion about multilingual profiles.

To use multiple languages in one sentence, a markup such as Speech Synthesis Markup Language (SSML) could be useful. Of course, the TTS engine would then have to support it, or Rhasspy would have to parse the markup and forward the parts in different languages to the text-to-speech engines of different language profiles.
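The parsing half of that could look roughly like the Python sketch below. It assumes the input is well-formed SSML in which language switches are marked with the standard `xml:lang` attribute (for example on a `<lang>` element); the function name and the default language are made up for the example. It only splits the text into per-language segments. Handing each segment to the matching TTS engine is not shown.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the built-in xml:lang attribute under this name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def split_by_language(ssml, default_lang="fr-FR"):
    """Split SSML into a list of (language, text) segments in reading order."""
    root = ET.fromstring(ssml)
    segments = []

    def walk(element, lang):
        # An element without xml:lang inherits the language of its parent.
        lang = element.get(XML_LANG, lang)
        if element.text and element.text.strip():
            segments.append((lang, element.text.strip()))
        for child in element:
            walk(child, lang)
            # Text following a child belongs to the current element's language.
            if child.tail and child.tail.strip():
                segments.append((lang, child.tail.strip()))

    walk(root, default_lang)
    return segments
```

For the sentence from the first post, marked up as `<speak xml:lang="fr-FR">Veuillez jouer la chanson <lang xml:lang="en-US">With or without you</lang> de U2.</speak>`, this gives a French segment, an English segment with the title, and a final French segment.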