Rhasspy 2.5.6 Released

After a few quiet months, Rhasspy 2.5.6 has finally been released! This release includes contributions from many community members, credited below :slight_smile:

Two of the biggest features in this release:

  • Multi-site support for dialogue manager
    • For MQTT base/satellite setups - allows each satellite to maintain an independent dialogue session. Previously, the newest session would cancel the previous one.
    • A possible bug has already been found
  • “Text FST” language model type for Kaldi
    • For very large Rhasspy grammars, the default (ARPA) language models may behave poorly by adding extra words
    • In “Text FST” mode, the ASR system can only produce sentences from your sentences.ini file (no extra words or different word order). This is faster to train and more accurate if you only care about using your exact voice commands.
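
For context, a minimal sentences.ini sketch (the intent name and wording here are made up for illustration) shows the kind of strict grammar that “Text FST” mode compiles directly into the recognizer:

```ini
[LightState]
states = (on | off)
turn <states>{state} [the] light
```

With this grammar in “Text FST” mode, only exact paths through it (“turn on the light”, “turn off light”, etc.) can ever be transcribed.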

Thanks to everyone for contributing, testing, answering questions, and helping to make Rhasspy better!

Lastly, as I mentioned in The Master Plan, we’ll be looking for volunteers soon to donate their voices for new Rhasspy text to speech voices. We have volunteers for English, Dutch, and German currently. If you speak a different language, have a good microphone (like a Blue Yeti Nano), and are willing to license your recordings as public domain or Creative Commons, please let me know :+1:


Changelog

Added

  • Multi-site support for dialogue manager
  • Add “Text FST” language model type for Kaldi for strict grammar-based recognition
  • UDP audio settings in web UI for Pocketsphinx wake word system
  • Rudimentary SSML support in Google Wavenet TTS (digitalfiz)

Changed

  • JSON output from all services is no longer forced to be ASCII
  • fuzzywuzzy performance improvement by using sqlite database (maxbachmann)
  • Lots of documentation improvements (koen)
  • Strip commas from replaced numbers (“one thousand, one hundred”)
  • Improve rhasspy-nlu performance (maxbachmann)
  • Simplify Google Wavenet voice selection UI (Romkabouter)
  • Fix local command when not using absolute path (DeadEnd)

Thank you for the good work.
After upgrading from 2.5.5 to 2.5.6 on a server (rpi 3b+) satellite (rpi rb+) installation, I get this error: NluException: file is not a database

Great, I have updated the Hassio Addon here:


Thanks for the latest update!
After upgrading from 2.5.5 to the latest (2.5.6), I got the following error message after saving a new sentence:
“TrainingFailedException: file is not a database”

Hello,

After a day of testing, it seems that the new satellite management (multi-site support for the dialogue manager) works well :slight_smile:

Thank you, once again !


What is being used to create the new TTS voices?

If you’re using fuzzywuzzy, you may need to delete your existing intent_examples.json file and retrain. A fix for this should be coming soon.
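
A sketch of that workaround, assuming the default profile location (an “en” profile under `~/.config/rhasspy/profiles`) and the default web port 12101; adjust both for your setup:

```python
# Delete the stale fuzzywuzzy examples file and retrain Rhasspy.
# Paths and port are assumptions based on a default install.
from pathlib import Path
from urllib import request

profile_dir = Path.home() / ".config" / "rhasspy" / "profiles" / "en"
examples = profile_dir / "intent_examples.json"

# Remove the stale examples file if present
if examples.exists():
    examples.unlink()

# Retrain through Rhasspy's HTTP API
try:
    request.urlopen(
        request.Request("http://localhost:12101/api/train", method="POST"),
        timeout=10,
    )
except OSError:
    print("Rhasspy is not reachable; retrain from the web UI instead")
```

You can of course do the same thing by deleting the file by hand and clicking “Train” in the web UI.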

A fork of MozillaTTS with the PyTorch backend.


Thanks, I will try it.


Thanks!!! :slight_smile: Do you have some links? I would love to play with some of that.

Hi Michael, all

That’s my first post here.

I love your project and I’m trying to disseminate it: https://twitter.com/solyarisoftware/status/1314151250716491778?s=20

I have a question. You say:

Multi-site support for dialogue manager

but what do you mean by “dialogue manager”?
I couldn’t find anything about it in the (great) documentation here:
https://rhasspy.readthedocs.io/. My fault?

BTW 1, I’m obsessed with dialog management and I’m working on NaifJs, my open-source state-machine-based DM.

BTW 2, I’m from Italy and I’ll try to contribute as a volunteer as soon as I understand the call to action in practice.

Thanks again for your beautiful project!
Giorgio


Hi Giorgio, good to see you here :slight_smile: Thanks for your help in promoting Rhasspy!

The “dialogue manager” in Rhasspy is very close to what Snips had, which is not very sophisticated. It is essentially a session manager, where a client application can choose which intents are active at each step and invoke the text to speech system to prompt the user.

In Snips and previously in Rhasspy, only one session could be active at a time. As of 2.5.6, there can now be one session active per site (each client reports a siteId). This is supposed to allow people with multiple Rhasspy satellites in different rooms to interact with them simultaneously, assuming there’s a central server running Rhasspy’s speech-to-text, intent recognition, etc. services.
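
A conceptual sketch of that per-site bookkeeping (this is not Rhasspy’s actual code; the class and field names are made up):

```python
# Sessions keyed by siteId: starting a session on one satellite
# no longer cancels a session running on a different satellite.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DialogueSession:
    session_id: str
    site_id: str
    active_intents: list = field(default_factory=list)


class DialogueManager:
    def __init__(self) -> None:
        # One active session per site instead of one global session
        self._sessions: Dict[str, DialogueSession] = {}

    def start_session(self, site_id: str, session_id: str) -> DialogueSession:
        # A new session only replaces the session on the *same* site
        session = DialogueSession(session_id=session_id, site_id=site_id)
        self._sessions[site_id] = session
        return session

    def get_session(self, site_id: str) -> Optional[DialogueSession]:
        return self._sessions.get(site_id)

    def end_session(self, site_id: str) -> None:
        self._sessions.pop(site_id, None)


manager = DialogueManager()
manager.start_session("livingroom", "abc-123")
manager.start_session("kitchen", "def-456")  # living room session keeps running
```

Before 2.5.6, the behavior was effectively a single-slot version of this dictionary, which is why the newest session used to cancel the previous one.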

This looks pretty cool. Rhasspy works well with Node-RED, so it would seem that a NaifJs bot could be made to work by listening to the websocket events. With open transcription enabled, you could catch the spoken text at /api/events/text (websocket) and send the response back via /api/text-to-speech (HTTP).
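
A sketch of that bridge in Python, assuming the default Rhasspy port, that the websocket event is JSON with `text` and `siteId` fields, and the third-party `websockets` and `requests` libraries; `my_bot_response` stands in for a real dialogue engine:

```python
# Listen for transcriptions on /api/events/text (websocket) and
# reply through /api/text-to-speech (HTTP POST).
import json

RHASSPY_URL = "http://localhost:12101"  # assumed default Rhasspy web port


def parse_text_event(raw: str):
    """Extract the transcription and site id from a /api/events/text message."""
    event = json.loads(raw)
    return event.get("text", ""), event.get("siteId", "default")


def my_bot_response(text: str) -> str:
    """Placeholder for a real dialogue engine (e.g. a NaifJs state machine)."""
    return "You said: " + text


async def bridge():
    # Third-party libraries, assumed installed: pip install websockets requests
    import requests
    import websockets

    ws_url = RHASSPY_URL.replace("http://", "ws://") + "/api/events/text"
    async with websockets.connect(ws_url) as ws:
        async for raw in ws:
            text, site_id = parse_text_event(raw)
            # Speak the bot's reply on the satellite that heard the command
            requests.post(
                RHASSPY_URL + "/api/text-to-speech",
                params={"siteId": site_id},
                data=my_bot_response(text).encode("utf-8"),
            )
```

Run `bridge()` under `asyncio.run()` with open transcription enabled, and anything the user says gets routed through your bot and spoken back on the originating satellite.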

I’ve almost completed a new Italian Kaldi speech model, thanks to a connection made by @adrianofoschi and public speech data from Common Voice and M-AI Labs.

My “call to action” for Italian is for single-person recordings to make a nice text-to-speech voice using a fork of MozillaTTS (in progress). I have a process for finding seemingly good sentences/prompts for a person to read, but I need help to verify that they aren’t weird or offensive – they come from the Internet after all.

Here are some samples of the Dutch voice I’m working on now: https://drive.google.com/file/d/1Fp9lc5eGpe5Xw8ACHXIgVGX5zTow7rpM/view?usp=sharing (credit to @rdh). Forgive the slightly robotic sound, it’s only been training for a day and is maybe 10-15% done :slight_smile:


Yes, I finally pushed a version up today! I’m calling it larynx after the technical term for the human voice box.

This uses my MozillaTTS fork as a git submodule. The only big changes to MozillaTTS are to use gruut for cleaning/phonemizing text. Note: right now only U.S. English and Dutch are supported. More languages are coming soon!


Hi Michael,
thanks for your detailed feedback!

Thanks for your help in promoting Rhasspy!

I “have to” give back. Promoting is just a first step :slight_smile:

  • Your work on no-cloud ASR/TTS “unification” of other open-source projects fills a gap.
  • You did a great job simplifying and popularizing the basic concepts, with beautiful graphics.
  • Rhasspy “resurrects” the good but defunct :frowning: Snips and, last but not least, it could act as a common OS for open-source/open-hardware smart speakers.

The “dialogue manager” in Rhasspy is very close to what Snips had, which is not very sophisticated. It is essentially a session manager, where a client application can choose which intents are active at each step and invoke the text to speech system to prompt the user.

In Snips and previously in Rhasspy, only one session could be active at a time. As of 2.5.6, there can now be one session active per site (each client reports a siteId). This is supposed to allow people with multiple Rhasspy satellites in different rooms to interact with them simultaneously, assuming there’s a central server running Rhasspy’s speech-to-text, intent recognition, etc. services.

Thanks for the clarification. I like the multi-room architecture!

This looks pretty cool. Rhasspy works well with Node-RED, so it would seem that a NaifJs bot could be made to work by listening to the websocket events. With open transcription enabled, you could catch the spoken text at /api/events/text (websocket) and send the response back via /api/text-to-speech (HTTP).

Well, I just open-sourced NaifJs. Currently the project is still very rough, at a very alpha stage. The architecture I foresee would be pretty independent of the “caller” channel, so yes, I’ll build an HTTP / websocket (or maybe Hermes protocol, using github.com/snipsco/hermes-protocol/tree/master/platforms/hermes-javascript) channel adapter as you suggest. I’ll look into the Node-RED integration. :ok_hand:

Digression: in NaifJs there is something similar to the intent-based approach (which you integrated into Rhasspy as an option), but it’s intended to be embedded in the internal nodes’ (states’) pattern matching (in practice, I offer simple regexps as an option to dialogue developers). I will think about the integration and give you feedback in a separate post :slight_smile:

I’ve almost completed a new Italian Kaldi speech model, thanks to a connection made by @adrianofoschi and public speech data from Common Voice and M-AI Labs.

That’s great! I didn’t know.

My “call to action” for Italian is for single-person recordings to make a nice text-to-speech voice using a fork of MozillaTTS (in progress). I have a process for finding seemingly good sentences/prompts for a person to read, but I need help to verify that they aren’t weird or offensive – they come from the Internet after all.

So it seems to me that you are looking for a voice actor who will record a list of utterances you prepared, right?

I guess that different voice signatures (female/male/other) would be a bonus.

Anyway, I’m personally available (and I could involve friends) to double-check and validate the content of the recorded Italian voices. Please let me know. I might even be willing to record a voice… (not sure) :face_with_hand_over_mouth:

Maybe a manifesto or detailed description of that “call to action” would help. I’m available to share your needs on social media and on my blog, convcomp dot it, with a dedicated article.

Sorry again for digressions here.
Thanks
giorgio


Any tutorials on how to create models? I work for a TV station and actually have access to TONS of closed-captioned media, and I would be interested in seeing if I can create a voice model from some of this data.

Which language are you planning to work with?

Programming language or speaking language?
I have no real preference; typically something like Python is where I lean for Linux stuff.

I meant speaking language :wink:

Would be English :slight_smile:


OK, I’ve added a Larynx tutorial to get you started. This is a first draft, so hopefully everything works!
