GLaDOS as TTS voice

Syntox · June 3, 2021, 8:00am

I have no idea how easy or hard this would be, or whether it has already been done, but I think I speak for a few others when I say I would love to hear GLaDOS as a TTS voice. The only solution I have found is this GLaDOS voice generator project, but the response time is way too slow:

I guess there are enough voice samples to train the model? And if something is missing, one could use the generator I mentioned. I would be happy to help, but I have no clue how any of this is done.
Thanks

rolyan_trauts · June 4, 2021, 12:38am

If you can find a big enough training set, you can create any voice; the voice you get is determined by the dataset you choose.

Training TTS takes some oomph, or at least quite some time.

Syntox · June 4, 2021, 6:15am

I guess a training set would be a collection of voice samples plus the text spoken in each one. But I have no idea what to train or even how. I’ve looked at the glow-tts repos of rhasspy and jaywalnut310, but again I have no clue what all that means. Do you know of a step-by-step guide to training such a model?

koan · June 4, 2021, 6:57am

@synesthesiam has created a balanced set of sentences which you can record to create a suitable collection of voice samples:

He has also created a web-based tool to help with recording the set, vox-check, as well as a tool that runs on your computer, voice-recorder. You should speak to him if you want to contribute your voice samples to train a TTS model.

rolyan_trauts · June 4, 2021, 10:17am

There are some older Tacotron training details here: https://google.github.io/tacotron/publications/semisupervised/

The idea that you are going to record your own TTS dataset seems a little crazy when such a large collection of transcribed audio already exists. But really a single book or excerpt is probably nowhere near enough for modern TTS. The Thorsten dataset that is often used for German is 23 hours of audio, and more is better, with https://keithito.com/LJ-Speech-Dataset/ sort of being an accepted minimum.
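To get a feel for how far a collection of clips is from those sizes, you can add up the audio length yourself. This is only a rough sketch using Python's standard `wave` module; it assumes the clips are uncompressed PCM `.wav` files, and the function names are made up for illustration.

```python
import wave
from pathlib import Path


def wav_seconds(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def dataset_hours(root):
    """Sum the durations of all .wav files found under root, in hours."""
    total_seconds = sum(wav_seconds(p) for p in Path(root).rglob("*.wav"))
    return total_seconds / 3600.0
```

Calling `dataset_hours("my_dataset")` on your own folder (the folder name here is hypothetical) gives a number you can compare against the 23 hours mentioned above.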

TensorFlowTTS is pretty much state-of-the-art and is documented quite well, but apart from needing to record quite huge volumes of audio, training can lock up a pretty much state-of-the-art machine for several days if you use good-sized datasets.

From LibriSpeech to others, there is so much available in common languages that it's likely you could cherry-pick existing audio and augment it, but really training TTS or ASR is a much bigger undertaking than, say, single-word KWS.

SpeechBrain has some processing classes that can work directly with datasets; otherwise SoX is a good candidate.
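As a rough idea of what the SoX route could look like, here is a small Python wrapper that builds a command to bring a clip to one uniform format. The target of 22050 Hz, mono, 16-bit is an assumption (it matches LJ Speech, but your TTS recipe may want something else), and the function names are made up for illustration. Running it needs the `sox` binary installed.

```python
import subprocess


def sox_convert_command(src, dst, rate=22050):
    """Build a SoX command that writes dst as mono 16-bit audio at `rate` Hz."""
    # Options placed before the output file name apply to the output file.
    # The rate/channels/bit depth here are assumed targets, not a requirement.
    return ["sox", str(src), "-r", str(rate), "-c", "1", "-b", "16", str(dst)]


def convert(src, dst, rate=22050):
    """Run the conversion; requires the `sox` binary on PATH."""
    subprocess.run(sox_convert_command(src, dst, rate), check=True)
```

Building the argument list separately from running it makes it easy to print and check the command before converting a whole dataset.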

But for, say, GLaDOS, the approach would be to try to find female audiobooks, maybe augment them with a bit of pitch and reverb, train a TTS model on the result, and see how you go.
Modern TTS models very much pick up the characteristics of the dataset, so you just need to create that dataset.
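To make the "bit of pitch and reverb" idea concrete, here is a toy sketch in plain Python that works on a list of sample values. It is not what SoX or any TTS toolkit does internally: the pitch change is done by simple resampling (which also changes the clip length), and the "reverb" is a single feedback echo. Both function names and the default values are made up for illustration.

```python
def pitch_shift_by_resampling(samples, factor):
    """Resample with linear interpolation.

    factor > 1 raises the pitch and shortens the clip;
    factor < 1 lowers the pitch and lengthens it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    out_len = max(1, int(len(samples) / factor))
    out = []
    for i in range(out_len):
        pos = i * factor                      # fractional read position
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)    # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def simple_reverb(samples, delay, decay=0.4):
    """Feedback echo: add a decayed copy of the output from `delay` samples earlier."""
    if delay <= 0:
        raise ValueError("delay must be positive")
    out = list(samples)
    for i in range(delay, len(out)):
        out[i] += decay * out[i - delay]
    return out
```

For real datasets you would more likely apply effects with a dedicated audio tool and keep the transcripts unchanged, since only the sound of the voice is being altered.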