The Master Plan

koan · September 26, 2020, 1:09pm

Well, this is the whole point of a TTS, isn’t it? Everyone who has been on the radio or TV, or any public person, takes the same risk if their voice can be cloned. I haven’t tried these voice cloning systems myself, but this is a risk assessment that everyone should do themselves. Of course, if your voice becomes famous as “Rhasspy’s voice in language X”, no one will try to deepfake you because it’s just too obvious.

maxbachmann · September 26, 2020, 1:12pm

Yes, I was just asking because you discussed whether it was possible to recreate the original voice from the model, which is not really needed when the original recording is already public.

synesthesiam · September 26, 2020, 3:24pm

Yes, the intention is to make the recordings public domain (or Creative Commons licensed) for others to use as well.

If you’d like to train a private TTS voice, that should be fully possible with the right hardware (I’m just using a GTX 1060 6GB). The things you need are:

A good microphone, like the Blue Yeti Nano
A set of sentences to read (gruut will be able to generate these)
Recording software like my voice-recorder that will create WAV files and text files (with transcriptions)
A GPU that is supported by PyTorch
The ipa-tts software I’m working on now (a fork of MozillaTTS)
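
Before training, it can help to check that every recording has its transcription. The following is only a minimal sketch: it assumes each clip is saved as `<name>.wav` with a matching `<name>.txt` in the same directory, which is an illustrative layout and not necessarily what voice-recorder writes. The function name `find_training_pairs` is hypothetical.

```python
from pathlib import Path


def find_training_pairs(recordings_dir):
    """Return (wav_path, transcription) pairs plus the WAVs that lack a transcript.

    Assumption: each clip is stored as <name>.wav next to <name>.txt.
    """
    pairs = []
    missing = []
    for wav_path in sorted(Path(recordings_dir).glob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if txt_path.is_file():
            # Keep the transcription text, without surrounding whitespace.
            text = txt_path.read_text(encoding="utf-8").strip()
            pairs.append((wav_path, text))
        else:
            missing.append(wav_path)
    return pairs, missing
```

Calling `find_training_pairs("recordings")` on such a directory returns the usable pairs and a list of WAV files that still need a transcription.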

It definitely is; I’ve been able to do this by having a different TTS system read the “phonetically rich” sentences (approximately 1000 examples). So if you’re concerned about the security of your voice, I would not recommend releasing your model!

Yeah, we may just start with a GitHub repo for those who are willing to release their recorded WAV files. If not, as @koan mentioned, the files are saved locally so people can train their own private voices as they please.

No_one · September 26, 2020, 9:46pm

Couldn’t the security issue be solved if we collect files from, let’s say, 10 people, have one person train the model and upload it? Then no one could simulate a specific person…

synesthesiam · September 27, 2020, 12:07am

The text to speech models are trained on one voice (I don’t know how to do the multi-speaker models). We could do what you’re saying with a speech to text model, however.

RaspiManu · September 27, 2020, 9:22am

@synesthesiam Would it be possible to record the wav files for a model and then use some voice effects to modify them a bit before training the model?

I’m thinking about giving the person who creates a model the opportunity to use something like 3 scalable effects to modify the recorded voice, so it still sounds good but not exactly like their own voice. A nice side effect would be that you don’t have the feeling of talking to yourself if you make a private model.

synesthesiam · September 27, 2020, 9:48pm

I know sox can do lots of interesting effects. Anyone know some good ones for human voices that wouldn’t distort them too badly?

RaspiManu · September 28, 2020, 6:05pm

I am absolutely no expert when it comes to sound technology and I have never worked with SoX, but I remember that I had a small guitar amplifier when I was a kid. It had rotary knobs for gain, bass, middle and treble. Maybe these kinds of simple, basic effects could help with changing voices, too. I looked at the SoX effects and these basics seem to be available.
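
As a rough sketch of the idea (untested on real voice recordings), the knobs could map onto the SoX `pitch`, `bass` and `treble` effects. The function names and default values below are made up for illustration; `pitch` takes a shift in cents, and `bass`/`treble` take a gain in dB. Which settings still sound natural is something that would need to be found by ear.

```python
import subprocess


def build_sox_command(src, dst, pitch_cents=0, bass_db=0, treble_db=0):
    """Build a sox command line applying mild pitch, bass and treble changes.

    Effects left at 0 are omitted from the command.
    """
    cmd = ["sox", str(src), str(dst)]
    if pitch_cents:
        cmd += ["pitch", str(pitch_cents)]  # shift in cents (100 = one semitone)
    if bass_db:
        cmd += ["bass", str(bass_db)]       # boost or cut low frequencies, in dB
    if treble_db:
        cmd += ["treble", str(treble_db)]   # boost or cut high frequencies, in dB
    return cmd


def apply_effects(src, dst, **settings):
    """Run sox on one file; requires the sox binary to be installed and on PATH."""
    subprocess.run(build_sox_command(src, dst, **settings), check=True)
```

For example, `apply_effects("in.wav", "out.wav", pitch_cents=-150, bass_db=3)` would lower the voice by one and a half semitones and add a little bass.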

tjiho · October 22, 2020, 2:34pm

Wow, I’m amazed by this plan. How is this plan going two years later?

I’m French, so I could contribute by reading French sentences. It would be nice to have a better TTS than PicoTTS.

synesthesiam · October 22, 2020, 3:16pm

So far, it’s only been since September, but I’d say it’s going pretty well! We now have a Dutch voice (thanks @rdh), support for Czech, and a new Italian model.

That would be great, thanks! I will get started with French and create a set of “phonetically rich” sentences to read. Would you have time to review the sentences and let me know if they make sense and are not vulgar or offensive (they come from the internet, so…yeah)?

I believe @fastjack was also asking about a French voice. Maybe he’d be willing to help too?

tjiho · October 22, 2020, 3:23pm

Sure, I’d be happy to review them.

solyarisoftware · October 23, 2020, 7:08am

I haven’t used SoX effects so far. I’m discovering that it offers a lot of interesting processing functions.
I think the time-domain functions, such as vad (voice activity detection) or gain/normalize, could be useful. I’m hesitant about using frequency-domain functions, because I fear you would lose the “grain” of the original voice recording or introduce artifacts.
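
A minimal sketch of such a time-domain clean-up, assuming the sox binary is installed: `norm` normalizes the peak level, and the `vad reverse vad reverse` chain is the idiom from the SoX manual for trimming silence from both ends of a clip. The function names and the -3 dB default are only illustrative choices.

```python
import subprocess


def build_cleanup_command(src, dst, norm_db=-3):
    """Build a sox command that normalizes a clip and trims silence at both ends."""
    return [
        "sox", str(src), str(dst),
        "norm", str(norm_db),  # normalize the peak level to norm_db dB
        "vad",                 # trim leading silence (voice activity detection)
        "reverse", "vad",      # reverse the clip and trim what was trailing silence
        "reverse",             # restore the original direction
    ]


def clean_recording(src, dst, norm_db=-3):
    """Run the clean-up on one file; requires sox on PATH."""
    subprocess.run(build_cleanup_command(src, dst, norm_db), check=True)
```

Since these steps only change level and length, they should leave the character of the voice untouched.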

RaspiManu · October 24, 2020, 5:12pm

Hey @synesthesiam,

@Bozor told me that you sent him phonetically rich sentences for German. I would like to help him with reviewing. You can message me the sentences, too.

Bozor · October 24, 2020, 6:03pm

I’ll send you everything.

synesthesiam · October 26, 2020, 1:00am

I came across the thorsten German dataset recently, so I’m training a model on it as a test of my system with German. If it works, I’ll create a Docker image like the Dutch model from @rdh’s data for everyone to try out.