I would like to contribute to a male German TTS model, but two questions need to be answered first:
Is it possible to do this directly on the Pi that’s running Rhasspy? I have no other Linux or Mac machine.
Will this model be a mixture of several voices, or will my voice be reproducible from it? I would see the latter as a security risk. What do you think about this, @koan? I saw that you are a security expert (from your book).
I haven’t tried it, but I believe it should be possible. Are you able to install and run https://github.com/synesthesiam/voice-recorder on your Raspberry Pi? If not, it will probably be easy to patch the program. Note that you need the desktop version of Raspberry Pi OS for this, and you should still use a high-quality microphone.
It will be a model based on your voice alone. I know about voice cloning and deepfakes, so I understand your hesitation. I’m actually not sure whether the output of this TTS model will be close enough to your voice to reproduce it accurately and convince others that it’s you talking. @synesthesiam, do you have an idea about this? It’s a risk I hadn’t considered yet.
Well, this is the whole point of a TTS, isn’t it? Anyone who has been on the radio or TV, and any public figure, takes the same risk if their voice can be cloned. I haven’t tried these voice cloning systems myself, but this is a risk assessment that everyone should make for themselves. Of course, if your voice becomes famous as “Rhasspy’s voice in language X”, no one will try to deepfake you because it’s just too obvious.
Yes, I was just asking because you discussed whether it was possible to recreate the original voice from the model, which is not really necessary when the original recording is already public.
Yes, the intention is to make the recordings public domain (or Creative Commons licensed) for others to use as well.
If you’d like to train a private TTS voice, that should be fully possible with the right hardware (I’m just using a GTX 1060 6GB). The things you need are:
A good microphone, like the Blue Yeti nano
A set of sentences to read (gruut will be able to generate these)
Recording software like my voice-recorder that will create WAV files and text files (with transcriptions)
A GPU that is supported by PyTorch
The ipa-tts software I’m working on now (a fork of MozillaTTS)
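Before training, it may help to check that every recording actually has a transcription. Here is a minimal sketch in Python, assuming each `name.wav` sits next to a `name.txt` holding its text. That layout, the function names and the pipe-separated output format are assumptions for illustration; they are not necessarily what voice-recorder writes or what ipa-tts expects.

```python
import wave
from pathlib import Path


def collect_pairs(recordings_dir):
    """Return (wav_path, transcript, seconds) for each WAV that has a .txt next to it."""
    pairs = []
    for wav_path in sorted(Path(recordings_dir).glob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if not txt_path.exists():
            continue  # skip recordings without a transcription
        text = txt_path.read_text(encoding="utf-8").strip()
        with wave.open(str(wav_path), "rb") as wav_file:
            seconds = wav_file.getnframes() / wav_file.getframerate()
        pairs.append((wav_path, text, seconds))
    return pairs


def write_metadata(pairs, metadata_path):
    """Write one 'file id|transcript' line per recording (hypothetical format)."""
    with open(metadata_path, "w", encoding="utf-8") as out:
        for wav_path, text, _seconds in pairs:
            out.write(f"{wav_path.stem}|{text}\n")
```

Summing the `seconds` values gives a quick idea of how much audio has been recorded so far.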
It definitely is; I’ve been able to do this by having a different TTS system read the “phonetically rich” sentences (approximately 1000 examples). So if you’re concerned about the security of your voice, I would not recommend releasing your model!
Yeah, we may just start with a GitHub repo for those who are willing to release their recorded WAV files. If not, as @koan mentioned, the files are saved locally so people can train their own private voices as they please.
Couldn’t the security issue be solved if we collect files from, let’s say, 10 people, have one guy train the model, and upload it? Then no one could simulate a specific person…
The text to speech models are trained on one voice (I don’t know how to do the multi-speaker models). We could do what you’re saying with a speech to text model, however.
@synesthesiam Would it be possible to record the WAV files for a model and then use some voice effects to modify them a bit before training the model?
I’m thinking about giving the person who creates a model the option of applying something like three adjustable effects to modify the recorded voice, so that it still sounds good but not exactly like their own voice. A nice side effect would be that you don’t feel like you’re talking to yourself if you make a private model.
I am by no means an expert in sound technology and I have never worked with SoX, but I remember having a small guitar amplifier as a kid. It had rotary knobs for gain, bass, middle and treble. Maybe simple, basic effects like these could help change voices, too. I looked at the SoX effects, and these basics seem to be available.
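To make the "three knobs" idea concrete, here is a rough sketch in Python that builds a SoX command line from three settings. The SoX effects `pitch` (shift in cents), `bass` and `treble` (gain in dB) are real, but the function names, the choice of these three effects and the default values are only assumptions for illustration. Whether the result still trains into a good voice is untested.

```python
import shutil
import subprocess


def build_sox_command(in_wav, out_wav, pitch_cents=0, bass_db=0, treble_db=0):
    """Build a SoX command; an effect is only added when its knob is non-zero."""
    cmd = ["sox", str(in_wav), str(out_wav)]
    if pitch_cents:
        cmd += ["pitch", str(pitch_cents)]  # shift in cents (100 = one semitone)
    if bass_db:
        cmd += ["bass", str(bass_db)]  # boost or cut low frequencies, in dB
    if treble_db:
        cmd += ["treble", str(treble_db)]  # boost or cut high frequencies, in dB
    return cmd


def disguise(in_wav, out_wav, **knobs):
    """Run SoX on one recording; requires the sox executable on PATH."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox executable not found on PATH")
    subprocess.run(build_sox_command(in_wav, out_wav, **knobs), check=True)
```

For example, `disguise("0001.wav", "0001_mod.wav", pitch_cents=-150, bass_db=3)` would lower the pitch by one and a half semitones and add a little bass. The same settings would have to be applied to every recording so the model sees one consistent voice.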
So far, it’s only been since September, but I’d say it’s going pretty well! We now have a Dutch voice (thanks @rdh), support for Czech, and a new Italian model.
That would be great, thanks! I will get started with French and create a set of “phonetically rich” sentences to read. Would you have time to review the sentences and let me know if they make sense and are not vulgar or offensive (they come from the internet, so…yeah)?
I believe @fastjack was also asking about a French voice. Maybe he’d be willing to help too?
I haven’t used SoX effects so far. I’m discovering that it offers a lot of interesting processing functions.
I think the time-domain functions, such as vad (voice activity detection) or gain/normalize, could be useful. I’m hesitant about the frequency-domain functions, because I fear you would lose or distort the “grain” of the original voice recording.
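As an illustration of why a time-domain step like normalization leaves the voice character alone, here is a small pure-Python sketch of peak normalization: every sample is multiplied by the same factor, so only the loudness changes. It is not SoX's implementation, it only handles 16-bit PCM WAV files, and the function name and the `target_peak` default are assumptions.

```python
import array
import sys
import wave


def peak_normalize(in_wav, out_wav, target_peak=0.95):
    """Scale all samples so the loudest one reaches target_peak of full scale.

    Returns the original peak sample value.
    """
    with wave.open(str(in_wav), "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    if peak > 0:
        scale = target_peak * 32767 / peak
        # Apply one constant gain to every sample, clamped to the 16-bit range
        samples = array.array(
            "h", (max(-32768, min(32767, round(s * scale))) for s in samples)
        )
    if sys.byteorder == "big":
        samples.byteswap()
    with wave.open(str(out_wav), "wb") as dst:
        dst.setparams(params)
        dst.writeframes(samples.tobytes())
    return peak
```

Because the gain is constant over the whole file, the waveform shape and its frequency content stay the same, which is the property that makes this kind of processing safe for the “grain” of the voice.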
@Bozor told me that you sent him phonetically rich sentences for German. I would like to help him with reviewing. You can message me the sentences, too.
I came across the thorsten German dataset recently, so I’m training a model on it as a test of my system with German. If it works, I’ll create a Docker image like the Dutch model from @rdh’s data for everyone to try out.