The Master Plan

Development has slowed down a bit lately, but don’t worry: Rhasspy is alive and well!

I’ve been learning more about computational linguistics, and formulating a broad Master Plan™ for Rhasspy’s future. At the heart of this is a simple truth: open source voice assistant projects (like Rhasspy) end up being model scavengers. We wait patiently for others to produce or update our speech/language models. Most of these models cannot be reproduced by the community because the data and training methods are not made public.

I believe we’ve reached a point in time where closed models and data are no longer necessary for open source voice assistants. Going forward, I’d like to transition Rhasspy to using speech/language models trained entirely from public data using open source software. My ideas are based on what Zamia Speech did years ago for English and German, but generalized to work for many more languages.

In short, the plan is to:

Importantly, this gives just about anyone a way to contribute! Add new words to Wiktionary, record yourself for Common Voice, or volunteer to help us train a voice for your language. Contributions will benefit Rhasspy as well as many other open source projects.

I describe some of the new projects I’m working on to support this effort below. No matter how it turns out, it’ll be an interesting journey.

gruut

A new, soon-to-be-released project for the Master Plan is named “gruut” (pronounced /ɡruːt/). This project will clean up text and look up or guess phonemic word pronunciations for all of Rhasspy’s supported languages. Its tasks include:

  • Tokenizing text into words with spaCy
  • Expanding numbers into words using babel and num2words
  • Looking up IPA pronunciations from downloaded Wiktionary lexicons or guessing them with pre-trained phonetisaurus models
  • Converting pronunciations to espeak-ng format for verification
  • Ensuring word pronunciations fit within a given language’s phonological inventory (phonemes)
    • These same phonemes will be used to train Kaldi speech models and MozillaTTS voices
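To make the steps above more concrete, here is a rough, stdlib-only sketch of the same pipeline shape. It is not gruut's actual code or API: the function names, the tiny number table, the toy lexicon, and the phoneme inventory are all made up for illustration, and a plain regular expression stands in for spaCy, babel/num2words, and phonetisaurus.

```python
import re

# Hypothetical toy data, for illustration only (not real Wiktionary output).
NUMBER_WORDS = {"2": "two", "3": "three"}
LEXICON = {
    "set": ["s", "ɛ", "t"],
    "two": ["t", "uː"],
    "timers": ["t", "aɪ", "m", "ɚ", "z"],
}
# Assumed phoneme inventory for the language.
PHONEMES = {"s", "ɛ", "t", "uː", "aɪ", "m", "ɚ", "z"}


def tokenize(text):
    """Lowercase the text and split it into word and number tokens."""
    return re.findall(r"[a-z']+|\d+", text.lower())


def expand_numbers(tokens):
    """Replace digit tokens with words using the toy number table."""
    return [NUMBER_WORDS.get(token, token) for token in tokens]


def pronounce(tokens):
    """Look up each word and check its phonemes against the inventory."""
    result = []
    for word in tokens:
        phonemes = LEXICON.get(word)
        if phonemes is None:
            # A real system would guess the pronunciation here instead.
            raise KeyError(f"no pronunciation for {word!r}")
        unknown = [p for p in phonemes if p not in PHONEMES]
        if unknown:
            raise ValueError(f"{word!r} uses phonemes outside the inventory: {unknown}")
        result.append((word, phonemes))
    return result


words = pronounce(expand_numbers(tokenize("Set 2 timers")))
for word, phonemes in words:
    print(word, " ".join(phonemes))
```

The point of the final inventory check is that every downstream consumer (speech model or voice) only ever sees phonemes from one agreed-upon set.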

ipa2kaldi

Training new speech models will involve a second new project whose job is to generate a Kaldi nnet3 recipe for each language using gruut. Transcribed audio from public speech data will have transcriptions cleaned with gruut. Word pronunciations will be looked up or guessed using gruut as well, and the Kaldi recipe’s phonemes will match accordingly.
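As a rough illustration of what "the recipe’s phonemes will match" could mean in practice, the sketch below turns a pronunciation lexicon into lines in the plain `word phone1 phone2 ...` layout that Kaldi dictionary files conventionally use, and derives the phoneme list from the same data. The function names and the toy lexicon are hypothetical; this is not ipa2kaldi's code.

```python
def kaldi_lexicon_lines(lexicon):
    """Yield one 'word p1 p2 ...' line per pronunciation, sorted by word."""
    for word in sorted(lexicon):
        for phonemes in lexicon[word]:
            yield f"{word} {' '.join(phonemes)}"


def phoneme_inventory(lexicon):
    """Collect every phoneme used anywhere in the lexicon."""
    return {p for prons in lexicon.values() for pron in prons for p in pron}


# Hypothetical toy lexicon: each word maps to a list of pronunciations.
lexicon = {
    "read": [["ɹ", "iː", "d"], ["ɹ", "ɛ", "d"]],
    "i": [["aɪ"]],
}

lines = list(kaldi_lexicon_lines(lexicon))
phones = phoneme_inventory(lexicon)

print("\n".join(lines))
print(sorted(phones))
```

Because the phoneme list is derived from the lexicon itself, the two cannot drift apart, which is the property the recipe generator needs.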

This same approach could also be used to train a new Pocketsphinx or DeepSpeech model. I’ll be focusing on Kaldi initially, but the goal will be to support additional speech to text systems in time.

ipa-tts

The last new project is a fork of MozillaTTS that uses gruut instead of phonemizer for word pronunciations. There are a few reasons for this:

  1. TTS models can be made smaller and trained faster by using a small set of language-specific phonemes
  2. Multiple pronunciations for the same word can be supported (think “I will read” and “I had read” in English)
  3. Text cleaning, number to word expansions, and pronunciation guessing are all handled by gruut

This last reason is important for supporting many different languages. ipa-tts can be greatly simplified because it can delegate text pre-processing to gruut and just focus on IPA. A side benefit is that you automatically get a text to speech voice that can directly speak phonetic pronunciations in IPA!

New Voices

In order to create new text to speech voices, we will need people reading sentences (also called prompts) with good microphones. The sentences they read are important, since they must provide enough examples of the different sounds present in a given language. A good set of sentences is “phonetically rich” if it covers a large number of a language’s possible sounds.

An additional feature of ipa-tts (or a sub-project) will be to help generate small sets of phonetically rich sentences that still do a good job of covering a language’s sounds (technically, pairs of sounds called diphones). The set should be as small as possible so a volunteer isn’t burdened too much. Ideally, the sentences should also be pleasant to read, which is why public domain books are often mined. I’m planning to use sentences from the OSCAR corpus, however, because public domain books tend to be quite old and may use words or phrases differently than a modern speaker (see: language drift).
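One common way to pick a small, phonetically rich set is a greedy coverage search: repeatedly take the sentence that adds the most not-yet-covered diphones. The sketch below shows that idea only; it is an assumption about how such a tool might work, not the planned implementation. The function names are hypothetical, and single letters stand in for phonemes in the toy data.

```python
def diphones(phonemes):
    """Return the set of adjacent phoneme pairs in a phoneme sequence."""
    return set(zip(phonemes, phonemes[1:]))


def select_sentences(candidates, max_sentences):
    """Greedily choose sentences that add the most uncovered diphones.

    candidates maps each sentence to its phoneme sequence.
    """
    remaining = {s: diphones(p) for s, p in candidates.items()}
    covered = set()
    chosen = []
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Toy data: letters stand in for phonemes.
candidates = {
    "sentence one": list("abcd"),
    "sentence two": list("abc"),
    "sentence three": list("dea"),
}

chosen, covered = select_sentences(candidates, max_sentences=3)
print(chosen)
print(len(covered), "diphones covered")
```

Here "sentence two" is never chosen because everything it contains is already covered by "sentence one", which is exactly how the prompt list stays short for volunteers.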

Feedback

Let me know what you think. Thanks for reading!


Sounds awesome. It’ll be a pleasure to volunteer to record sentences :wink:

Thanks for your work.


This is all kind of awesome! AMAZING! :+1:


Fantastic! Like the clarity of where things are heading. Exciting times ahead! :smiley:


I really love this vision! This confirms my belief that I made the right choice by choosing Rhasspy as THE open source voice assistant project to support.


Awesome! What do you consider a good microphone (brand/model)? What software do we need to record sentences?


This is the software to record sentences:

It has been tested on Linux and macOS.


Thanks everyone for your support! Good times ahead :slight_smile:

I bought a Blue Yeti Nano for about 100 USD, and it’s worked quite well for me. I’m not an audio expert, so maybe someone else could chime in with a suggestion.

Very promising, exciting,…
I have one very big fear: you are open-minded, which is cool, but you are also so efficient that I hope no big company will promise you tons of money to win over your loyalty.
So please, stay with us, stay open.
Thanks a lot for what you’ve done so far, and be sure that you have tons of people behind you: not money, only people, which (if you have enough to live on) may be more valuable.
Thanks again :+1:


I understand your concern, and appreciate your thanks :slight_smile:

I’ve been offered a number of jobs since creating Rhasspy, mostly from start-ups. But I don’t need to make any more money, and am not interested in a job that will take time from my family.

Rhasspy and open source give me personal fulfillment, so I’ll happily stay with my current job as long as they support me in those efforts :+1: As I’ve mentioned to @koan, though, I still think it’s important to ensure the Rhasspy project could function without me in case I get hurt or something.


The microphone that you use in your VoiceAI is always the best microphone to record voice with for datasets.

There is no such thing as a really good microphone for recording speech datasets, as the audio is first downsampled to 16 kHz anyway.
Generally, a cheap microphone with reasonable sensitivity will do, so that the recording software can provide consistent normalisation and trimming.
A cheap microphone is also likely to have similar tonal qualities to the cheap microphone in a VoiceAI.

It’s extremely rare to have a high-class studio mic in a VoiceAI, and strangely it’s probably not the best option for recording either, as it’s the exception to the normal input on a VoiceAI.

Which languages are supported so far besides English? The workflow isn’t crystal clear to me yet. I mean, I follow the small program, and then what? Does it send everything you need somewhere to you?

Well, you speak the sentences that voice-recorder shows you, and they are recorded in separate wav files for each sentence. Then you still have to send these files to @synesthesiam so he can use them to train a TTS model. You can already try the program: it comes with a set of sentences for English.

@synesthesiam has been working with @hugocoolens and me to create a phonetically rich set of sentences for Dutch. If you want to help with creating such a set for another language or reading the voice prompts for another language, please let us know. But the complete workflow hasn’t been put into practice yet: we haven’t recorded the Dutch voices yet, so we still have to see if this works (but the preliminary results are very promising).


@synesthesiam Thank you very much for informing us about the great future of Rhasspy. I really like where we are going.

After reading, I had one question in mind… If we record our voices and create TTS models out of them, will there also be male voice models for most of the languages or did I get something wrong about how this works? I would like to use a deeper male voice instead of a voice that sounds like a car navigation system.


If you have a deep male voice yourself, you can contribute :wink:

I would like to contribute to a male German TTS model. But there are two questions that need to get an answer first:

  1. Is it possible to do this directly on the Pi that’s running Rhasspy? I have no other Linux or Mac machine.
  2. Will this model be a mixture of some voices or will my voice be reproducible? I would see this as a security risk. What do you think about this, @koan? I saw that you are a security expert (in your book :blush:).

I haven’t tried it, but I believe it should be possible. Are you able to install and run https://github.com/synesthesiam/voice-recorder on your Raspberry Pi? If not, it will probably be easy to patch the program. Note that you need the desktop version of Raspberry Pi OS for this, and you should still use a high-quality microphone.

It will be a model based on your voice alone. I know about voice cloning and deepfakes, so I understand your hesitation. I’m actually not sure whether the output of this TTS model will be close enough to your voice to reproduce it accurately and convince others that it’s you talking. @synesthesiam do you have an idea about this? It’s a risk I hadn’t considered yet :slight_smile:


Btw, where are these recordings saved, and who has access to them? I suppose they would be required again for future models.

The wav files are just saved locally by the program. We should probably create a repository where we make them available to create models.

Well then the answer to

would be that the recordings are public -> you could definitely reproduce the voice :wink: