The Master Plan

synesthesiam · September 18, 2020, 8:46pm

Development has slowed down a bit lately, but don’t worry: Rhasspy is alive and well!

I’ve been learning more about computational linguistics, and formulating a broad Master Plan™ for Rhasspy’s future. At the heart of this is a simple truth: open source voice assistant projects (like Rhasspy) end up being model scavengers. We wait patiently for others to produce or update our speech/language models. Most of these models cannot be reproduced by the community because the data and training methods are not made public.

I believe we’ve reached a point in time where closed models and data are no longer necessary for open source voice assistants. Going forward, I’d like to transition Rhasspy to using speech/language models trained entirely from public data using open source software. My ideas are based on what Zamia Speech did years ago for English and German, but generalized to work for many more languages.

In short, the plan is to:

Use Wiktionary and the international phonetic alphabet (IPA) for pronunciation dictionaries
Get text data from the OSCAR corpus and tokenize it with spaCy
Download all available public speech data to train new Kaldi models
Work with volunteers to record new public speech datasets for training text to speech voices using MozillaTTS

Importantly, this gives just about anyone a way to contribute! Add new words to Wiktionary, record yourself for Common Voice, or volunteer to help us train a voice for your language. Contributions will benefit Rhasspy as well as many other open source projects.

I describe some of the new projects I’m working on to support this effort below. No matter how it turns out, it’ll be an interesting journey.

gruut

A new soon-to-be-released project for the Master Plan is named “gruut” (pronounced /ɡruːt/). This project will clean up text and look up or guess phonemic word pronunciations for all of Rhasspy’s supported languages. Its tasks include:

Tokenizing text into words with spaCy
Expanding numbers into words using babel and num2words
Looking up IPA pronunciations from downloaded Wiktionary lexicons or guessing them with pre-trained phonetisaurus models
Converting pronunciations to espeak-ng format for verification
Ensuring word pronunciations fit within a given language’s phonological inventory (phonemes)
- These same phonemes will be used to train Kaldi speech models and MozillaTTS voices
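
As a rough illustration of the tasks listed above (not gruut’s actual code or API), the flow can be sketched in plain Python. Everything here is an assumption made for the example: the toy `LEXICON`, the `INVENTORY` set, the `NUMBER_WORDS` table standing in for num2words, and the regex tokenizer standing in for spaCy.

```python
import re

# Hypothetical toy lexicon: word -> IPA phonemes (a real one would come from Wiktionary).
LEXICON = {
    "turn": ["t", "ɜ", "ɹ", "n"],
    "on": ["ɑ", "n"],
    "the": ["ð", "ə"],
    "two": ["t", "u"],
    "lights": ["l", "aɪ", "t", "s"],
}

# Hypothetical phoneme inventory for the language.
INVENTORY = {"t", "ɜ", "ɹ", "n", "ɑ", "ð", "ə", "u", "l", "aɪ", "s"}

# Stand-in for a number-to-words library; only covers what the example needs.
NUMBER_WORDS = {"2": "two"}


def tokenize(text):
    """Lower-case the text and split it into word-like tokens."""
    return re.findall(r"[\w']+", text.lower())


def expand_numbers(tokens):
    """Replace digit tokens with their spelled-out form when known."""
    return [NUMBER_WORDS.get(token, token) for token in tokens]


def pronounce(word):
    """Look the word up; None means a pronunciation would have to be guessed."""
    return LEXICON.get(word)


def check_inventory(phonemes):
    """Return the phonemes that fall outside the language's inventory."""
    return [p for p in phonemes if p not in INVENTORY]


def process(text):
    """Return (word, phonemes-or-None) pairs for a sentence."""
    return [(word, pronounce(word)) for word in expand_numbers(tokenize(text))]


if __name__ == "__main__":
    for word, phonemes in process("Turn on the 2 lights!"):
        print(word, phonemes)
```

Words that come back with `None` are the ones a grapheme-to-phoneme model (phonetisaurus in the plan above) would need to guess.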

ipa2kaldi

Training new speech models will involve a second new project whose job is to generate a Kaldi nnet3 recipe for each language using gruut. Transcribed audio from public speech data will have transcriptions cleaned with gruut. Word pronunciations will be looked up or guessed using gruut as well, and the Kaldi recipe’s phonemes will match accordingly.
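
To make the lexicon part of this concrete, here is a minimal sketch of turning word pronunciations into the plain-text format Kaldi recipes conventionally use for `lexicon.txt` (one `word phone phone ...` entry per line) and into a phone list. The function names and the sample lexicon are hypothetical, and whether IPA symbols are written directly or mapped to ASCII names is left open; this is not ipa2kaldi’s actual output.

```python
def kaldi_lexicon_lines(lexicon):
    """Build lexicon.txt-style lines from word -> list of pronunciations.

    Each pronunciation is a list of phonemes; a word with several
    pronunciations simply gets several lines.
    """
    lines = []
    for word in sorted(lexicon):
        for pronunciation in lexicon[word]:
            lines.append(f"{word} {' '.join(pronunciation)}")
    return lines


def nonsilence_phones(lexicon):
    """Collect every phoneme used in the lexicon, one per entry."""
    return sorted(
        {
            phoneme
            for pronunciations in lexicon.values()
            for pronunciation in pronunciations
            for phoneme in pronunciation
        }
    )


if __name__ == "__main__":
    # Hypothetical sample: "read" has two pronunciations.
    sample = {
        "read": [["ɹ", "i", "d"], ["ɹ", "ɛ", "d"]],
        "on": [["ɑ", "n"]],
    }
    print("\n".join(kaldi_lexicon_lines(sample)))
    print(nonsilence_phones(sample))
```

Deriving the phone list from the lexicon itself is one simple way to keep the recipe’s phonemes and the pronunciations in agreement.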

This same approach could also be used to train a new Pocketsphinx or DeepSpeech model. I’ll be focusing on Kaldi initially, but the goal will be to support additional speech to text systems in time.

ipa-tts

The last new project is a fork of MozillaTTS that uses gruut instead of phonemizer for word pronunciations. There are a few reasons for this:

TTS models can be made smaller and trained faster by using a small set of language-specific phonemes
Multiple pronunciations for the same word can be supported (think “I will read” and “I had read” in English)
Text cleaning, number to word expansions, and pronunciation guessing are all handled by gruut

This last reason is important for supporting many different languages. ipa-tts can be greatly simplified because it can delegate text pre-processing to gruut and just focus on IPA. A side benefit is that you automatically get a text to speech voice that can directly speak phonetic pronunciations in IPA!

New Voices

In order to create new text to speech voices, we will need people reading sentences (also called prompts) with good microphones. The sentences they read are important, since they must provide enough examples of the different sounds present in a given language. A set of sentences is “phonetically rich” if it covers a large number of a language’s possible sounds.

An additional feature of ipa-tts (or a sub-project) will be to help generate small sets of phonetically rich sentences that still do a good job of covering a language’s sounds (technically, pairs of sounds called diphones). The set should be as small as possible so a volunteer isn’t burdened too much. Ideally, the sentences should also be pleasant to read, which is why public domain books are often mined. I’m planning to use sentences from the OSCAR corpus, however, because public domain books tend to be quite old and may use words or phrases differently than a modern speaker (see: language drift).
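
One common way to pick such a small set is a greedy coverage heuristic: repeatedly choose the sentence that adds the most diphones not yet covered. The sketch below assumes that approach purely for illustration (the post does not say which method will be used), and the function names and sample data are hypothetical.

```python
def diphones(phonemes):
    """Return the set of adjacent phoneme pairs in a phoneme sequence."""
    return set(zip(phonemes, phonemes[1:]))


def select_prompts(candidates, max_prompts=None):
    """Greedily pick sentences that add the most uncovered diphones.

    candidates maps each sentence to its list of phonemes.
    Returns the chosen sentences (in pick order) and the covered diphones.
    """
    remaining = {sentence: diphones(p) for sentence, p in candidates.items()}
    covered = set()
    chosen = []
    while remaining and (max_prompts is None or len(chosen) < max_prompts):
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


if __name__ == "__main__":
    # Hypothetical candidates, with letters standing in for phonemes.
    sample = {
        "s1": list("abc"),
        "s2": list("abcd"),
        "s3": list("bc"),
        "s4": list("xy"),
    }
    prompts, covered = select_prompts(sample)
    print(prompts, len(covered))
```

The greedy choice does not guarantee the smallest possible set, but it is simple and keeps the number of prompts a volunteer has to read low.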

Feedback

Let me know what you think. Thanks for reading!

Bozor · September 18, 2020, 9:42pm

Sounds awesome. It’ll be a pleasure to volunteer to record sentences.

Thanks for your work.

fastjack · September 19, 2020, 6:16am

This is all kind of awesome! AMAZING!

Andrew49 · September 19, 2020, 7:21am

Fantastic! Like the clarity of where things are heading. Exciting times ahead!

koan · September 19, 2020, 8:27am

I really love this vision! This confirms my belief that I made the right choice by choosing Rhasspy as THE open source voice assistant project to support.

Rad · September 19, 2020, 12:05pm

Awesome! What do you consider a good microphone (brand/model)? What software do we need to record sentences?

koan · September 19, 2020, 5:35pm

This is the software to record sentences: https://github.com/synesthesiam/voice-recorder

It has been tested on Linux and macOS.

synesthesiam · September 20, 2020, 1:33am

Thanks everyone for your support! Good times ahead.

I bought a Blue Yeti Nano for about 100 USD, and it’s worked quite well for me. I’m not an audio expert, so maybe someone else could chime in with a suggestion.

urbatecte · September 20, 2020, 2:01pm

Very promising, exciting,…
I have one very big fear: you are open-minded, which is cool, but you are also so efficient that I hope no big company will promise you tons of money and take away all your faith.
So please, stay with us, stay open.
Thanks a lot for what you’ve done so far, and be sure that you have tons of people behind you: not money, only people, which (if you have enough to live on) may be more valuable.
Thanks again

synesthesiam · September 20, 2020, 3:13pm

I understand your concern, and appreciate your thanks.

I’ve been offered a number of jobs since creating Rhasspy, mostly from start-ups. But I don’t need to make any more money, and am not interested in a job that will take time from my family.

Rhasspy and open source give me personal fulfillment, so I’ll happily stay with my current job as long as they support me in those efforts. As I’ve mentioned to @koan, though, I still think it’s important to ensure the Rhasspy project could function without me in case I get hurt or something.

rolyan_trauts · September 21, 2020, 12:37am

The microphone that you use in your VoiceAi is always the best microphone to record voice with for datasets.

There is no such thing as a really good microphone for recording speech datasets, since the audio is first downsampled to 16 kHz anyway.
Generally, a cheap microphone with reasonable sensitivity will do, so that the recording software can provide consistent normalisation and trimming.
Generally, a cheap microphone is likely to have similar tonal qualities to a cheap microphone in a VoiceAI.

It’s extremely rare to have a high-class studio mic in a VoiceAI, and, strangely, it is probably not the best option, as it is likely the exception to the normal input on a VoiceAI.

Bozor · September 21, 2020, 11:03am

Which languages are supported so far besides English? It’s not crystal clear to me yet what the workflow is. I mean, you follow the small program, and then what? Does it send everything you need somewhere to you?

koan · September 21, 2020, 11:40am

Well, you speak the sentences that voice-recorder shows you, and they are recorded in separate wav files for each sentence. Then you still have to send these files to @synesthesiam so he can use them to train a TTS model. You can already try the program: it comes with a set of sentences for English.

@synesthesiam has been working with @hugocoolens and me to create a phonetically rich set of sentences for Dutch. If you want to help with creating such a set for another language or reading the voice prompts for another language, please let us know. But the complete workflow hasn’t been put into practice yet, and we haven’t recorded the Dutch voices yet, so we still have to see if this works (but the preliminary results are very promising).

RaspiManu · September 26, 2020, 7:29am

@synesthesiam Thank you very much for informing us about the great future of Rhasspy. I really like where we are going.

After reading, I had one question in mind… If we record our voices and create TTS models out of them, will there also be male voice models for most of the languages or did I get something wrong about how this works? I would like to use a deeper male voice instead of a voice that sounds like a car navigation system.

koan · September 26, 2020, 8:29am

If you have a deep male voice yourself, you can contribute.

RaspiManu · September 26, 2020, 11:34am

I would like to contribute to a male German TTS model. But there are two questions that need an answer first:

Is it possible to do this directly on the Pi that’s running Rhasspy? I have no other Linux or Mac machine.
Will this model be a mixture of some voices or will my voice be reproducible? I would see this as a security risk. What do you think about this, @koan? I saw that you are a security expert (in your book).

koan · September 26, 2020, 11:43am

I haven’t tried it, but I believe it should be possible. Are you able to install and run https://github.com/synesthesiam/voice-recorder on your Raspberry Pi? If not, it will probably be easy to patch the program. Note that you need the desktop version of Raspberry Pi OS for this, and you should still use a high-quality microphone.

It will be a model based on your voice alone. I know about voice cloning and deepfakes, so I understand your hesitation. I’m actually not sure whether the output of this TTS model will be close enough to your voice to reproduce it accurately and convince others that it’s you talking. @synesthesiam do you have an idea about this? It’s a risk I hadn’t considered yet.

maxbachmann · September 26, 2020, 12:36pm

Btw, where are these recordings saved, and who has access to them? I suppose they would be required again for future models.

koan · September 26, 2020, 12:51pm

The wav files are just saved locally by the program. We should probably create a repository where we make them available to create models.

maxbachmann · September 26, 2020, 12:54pm

Well then, the answer to the question of whether your voice would be reproducible is that the recordings are public, so you could definitely reproduce the voice.