Opentts custom voice.... weird request

CrankyCoder · February 2, 2022, 7:34pm

I have a very strange request. I have been working with a Rhasspy + satellite configuration, and it is working fine. I have OpenTTS running to give some better voices (outstanding, by the way)…

But someone else in my family has made 2 requests and I am unsure about how to go about either of these really.

“Daddy, I want to be able to call the thing ______” - I am guessing this one falls under custom wake words, which I have seen discussed before. I haven’t really done this and am hoping someone can point me in the right direction. I believe Snowboy and Raven both support multiple wake words.
Here is the tricky part: “and I want the thing to sound like so-and-so from some cartoon” - OK, so this is where I am LOST LOST in the woods. I can pull tons of audio clips of whoever this character is and pull the closed captioning to match up the text… BUT I have zero knowledge of how to train a new voice. I don’t know how difficult it is, what is entailed, or anything. I am also not sure if it’s possible to contribute a new voice to OpenTTS if I do get one created.
This is where I hope @synesthesiam has already done his magic. If I have multiple wake words and by some chance get this other voice created… is it possible to pass a parameter through the process saying which wake word it woke up to, and then pass that to the OpenTTS TTS endpoint so it uses a specific voice based on the wake word? I have zero interest in hearing my morning reports and notifications in my office in a cartoon voice lol.

Thanks!!

synesthesiam · February 5, 2022, 8:22pm

Hi @CrankyCoder,

I haven’t done my magic with this yet, but I like the idea

Over time, we’ve included the wake word id in more messages, so maybe this is a good case for adding it to the tts/say message. Then, the TTS services could be extended to switch voices depending on the wake word.
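
As a rough illustration of that idea, here is a minimal Python sketch of picking a voice from the wake word id and building a request URL for the OpenTTS `/api/tts` endpoint. The `wakewordId` field on the tts/say payload is hypothetical (it is the proposed addition, not something that exists today), and the voice names, wake word ids, and base URL are placeholders to adjust for your own setup.

```python
from urllib.parse import urlencode

# Placeholder mapping from wake word id to OpenTTS voice name.
# Both the ids and the voice names are assumptions for illustration.
VOICE_BY_WAKEWORD = {
    "default": "larynx:harvard",
    "cartoon": "larynx:custom_cartoon",  # hypothetical custom voice
}


def voice_for(wakeword_id, fallback="default"):
    """Return the voice for a wake word id, or the fallback voice if unknown."""
    return VOICE_BY_WAKEWORD.get(wakeword_id, VOICE_BY_WAKEWORD[fallback])


def tts_url(say_payload, base_url="http://localhost:5500"):
    """Build an OpenTTS GET URL from a tts/say-style payload (a dict).

    "wakewordId" is a hypothetical field; if it is missing, the
    default voice is used.
    """
    voice = voice_for(say_payload.get("wakewordId"))
    query = urlencode({"voice": voice, "text": say_payload["text"]})
    return f"{base_url}/api/tts?{query}"


if __name__ == "__main__":
    print(tts_url({"text": "Good morning", "wakewordId": "cartoon"}))
    print(tts_url({"text": "Good morning"}))
```

With a mapping like this, office notifications that carry no wake word id (or an unknown one) would keep the default voice, and only sessions started by the custom wake word would get the cartoon voice.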

If you can get at least an hour of them speaking (with transcriptions), I might be able to train a voice

repole · February 5, 2022, 9:09pm

@synesthesiam - Some documentation on how to train a custom voice (presumably for Larynx) would be really, really awesome.

I suspect you could run into some licensing issues with people wanting to train on voices from a TV show and redistribute that voice, but giving users the tools to do so themselves (provided they’ve pulled the audio and mapped those clips to transcriptions) would open up lots of options. It would also prevent a bunch of people from asking you to train custom voices for them.

CrankyCoder · February 8, 2022, 7:02pm

I like that. Do you think it’s something that could be incorporated into the OpenTTS stuff? (That’s what I’m using in place of MaryTTS, as per your documentation.)

What kind of format does it need to be in? I’m not sure how the transcription matches up to the audio or if timecodes etc. are needed.

@repole mentions in the next post that having some docs on how to start getting things together for training voices would be awesome!

synesthesiam · February 9, 2022, 2:48am

Do you have a GPU, @CrankyCoder?

CrankyCoder · February 9, 2022, 2:43pm

I do. Also looking at some coral TPU stuff as well. But not sure if that helps what you are asking lol.

synesthesiam · February 18, 2022, 9:09pm

The Coral TPUs seem limited in the kinds of neural network layers they can accelerate. I haven’t taken the time to see if they would be useful with something like Larynx yet (transformer-based).

@repole The training process is a bit complex right now, which is why I haven’t taken the time to write documentation yet. Unlike other TTS systems, I do a few extra pre-processing steps before training:

Transform raw text to a spoken form (numbers expanded from “1” to “one”, punctuation removed, etc.)
Forced alignment between spoken audio and spoken text, so phonemes match what is said and silence/pauses can be explicitly encoded
All training audio is converted to PyTorch objects (spectrograms, etc.)
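
To give a feel for the first step only, here is a small Python sketch of turning raw text into a spoken form. It is a toy illustration, not the actual Larynx pre-processing: the function names are made up, it only handles English, and it only spells out plain whole numbers. Forced alignment and spectrogram conversion are not shown.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out a non-negative integer; 1000 and above are read digit by digit."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = "" if rest == 0 else " " + number_to_words(rest)
        return ONES[hundreds] + " hundred" + tail
    return " ".join(ONES[int(digit)] for digit in str(n))


def to_spoken_form(text):
    """Expand digits to words, drop punctuation, lowercase, collapse spaces."""
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    text = re.sub(r"[^\w\s']", " ", text)  # keep letters, digits, apostrophes
    return " ".join(text.lower().split())


if __name__ == "__main__":
    print(to_spoken_form("I have 2 cats, and 41 dogs!"))
```

A real pipeline also has to deal with dates, currency, abbreviations, and so on, which is part of why mistakes at this stage are hard to track down later.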

If anything goes wrong in any of these stages, it’s very difficult to debug and correct. But if it works, training is very fast and usually results in a working model within a few hours (when fine-tuning)