Mimic 3 TTS Preview

Using mimic3-server on a 64-bit Pi 4.

m-ailabs_low:

  • On short sentences or single words, there is usually a strange noise at the end of the voice output. Not sure why this happens. If the text gets longer, everything is usually OK.
  • There is a nice and realistic speaking break after a comma, but no break at the end of a sentence. That makes the text hard to understand.

Should the API of mimic3-server already be functional?

I tested German and English voices on a smartphone running Mobian (Pocophone F1). Impressive speed and quality!

I think Siwis_low is the best one, but this voice is very close to the voice used on public transport in Toulouse, so I don’t like it much. In fact it sounds like an actress reading a book. So it’s a good voice, but not a natural one.

Otherwise I would say zeckout; I feel like I’m listening to an old science teacher.

Edit: I re-listened to the voices with a good headset; Siwis_low is better than the others in terms of quality.

I don’t think there is anything special about those words, it’s just that they happen to be in my voice outputs and Larynx has issues with them. There are probably many more words with issues, I’m just not using them in any voice prompts.

I’d agree in terms of audio quality. Many of the voices I trained from the M-AILabs dataset don’t have great audio quality since they were recorded by volunteers for Librivox using whatever hardware they had.

Thank you! Was it difficult to get working on the phone at all?

This seems to be a general problem with the TTS model I’m using. If the dataset doesn’t contain the speaker saying single words or very short phrases, the model has a hard time producing them. For now, I think I’ll have to consider the M-AILabs voices as intended for reading long-form text only :confused:

I think I can at least fix the pausing issues after a period for now :+1:

Yes, if you’re running it locally you can check out http://localhost:59125/openapi/ to see what’s available. It should also be compatible with anything that’s meant to talk to MaryTTS. You just have to make sure your “MaryTTS voice” is something like “en_UK/apope_low”.
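
If you want to poke at it from a script, something like the sketch below should work. I’m assuming the MaryTTS-compatible route is /process with INPUT_TEXT and VOICE query parameters, as in stock MaryTTS; check /openapi/ for the exact routes your build exposes.

    # Minimal sketch: fetch speech from the MaryTTS-compatible endpoint.
    # Assumption: the /process route and the INPUT_TEXT/VOICE parameters
    # follow the usual MaryTTS convention; verify against /openapi/.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    params = urlencode({
        "INPUT_TEXT": "Hello from Mimic 3.",
        "VOICE": "en_UK/apope_low",  # same "MaryTTS voice" format as above
    })

    with urlopen("http://localhost:59125/process?" + params) as response:
        wav_bytes = response.read()

    with open("hello.wav", "wb") as f:
        f.write(wav_bytes)  # should be a RIFF/WAVE file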

Was it difficult to get working on the phone at all?

No. It was as easy as for a Debian or Ubuntu computer.

Hi there, got the server up and running (manually for now).

Settings are saved (with respect to MaryTTS) as follows:

    "text_to_speech": {
        "marytts": {
            "voice": "thorsten_low"
        },
        "system": "marytts"
    },

The other keys mentioned in the docs (text-to-speech/#marytts) are not explicitly stored in the JSON, but they are visible in the Rhasspy UI (de_DE, thorsten_low).
Putting that combined value into the “Voice” field doesn’t help; changing it back leads to the locale also being stored in the JSON, but it still results in

TtsException: file does not start with RIFF id

What did I miss, or what could I do better?

Tests with “http://external-ip:59125/” work quite well; calling it with the “openapi” suffix results in a 404 error…
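
For anyone debugging the same thing: the RIFF error means Rhasspy received something that isn’t a WAV file (often an error page instead of audio). Here is a quick way to see what the server actually returns for that voice, assuming the MaryTTS-style /process route with INPUT_TEXT and VOICE parameters:

    # Debug sketch: inspect what the server returns for the configured voice.
    # Assumption: the MaryTTS-style /process route with INPUT_TEXT/VOICE
    # parameters; adjust the URL if your routes differ.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    params = urlencode({
        "INPUT_TEXT": "Das ist ein Test.",
        "VOICE": "de_DE/thorsten_low",  # locale/voice combined, as described above
    })

    with urlopen("http://localhost:59125/process?" + params) as response:
        print(response.status, response.headers.get("Content-Type"))
        data = response.read()

    print(data[:4])  # a valid WAV starts with b'RIFF'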

@synesthesiam
I like the announcement! :slightly_smiling_face:
Is there already a date for when the mimic3 repository will go online on GitHub? I would like to test the Debian packages.

Greetings, Jens

Wow, how could Portuguese (Brazilian) not be on the list?

I’ll have to check this myself. The Mimic 3 server should also work with Rhasspy’s “remote TTS” option, but I need to double check I haven’t broken anything with that either!

Hopefully next month, but I sent you a link with the beta packages :slight_smile:

It was, but people told me that the voice I trained wasn’t understandable. I used this dataset: https://github.com/Edresson/TTS-Portuguese-Corpus

Do you know of any other TTS Portuguese datasets?

Sorry, no. I’m clueless about language models, data, etc.

Did you look on Hugging Face?

Nice! :+1:

    "text_to_speech": {
        "command": {
            "say_arguments": " --ssml --voice 'de_DE/m-ailabs_low#rebecca_braunert_plunkett' ",
            "say_program": "mimic3"
        },
        "satellite_site_ids": "default",
        "system": "command"
    },
    Das ist ein Test in deutsch <voice name="en_US/vctk_low#p236">and this is an test in english.</voice>

… and Rhasspy speaks two languages in one sentence - cool. :sunglasses:
It runs a bit slow on my old machine without a GPU. With enough power and caching it will definitely get better.

Greetings, Jens

I didn’t, but I don’t see any useful data there :frowning:

Awesome! The way to speed this up is to run mimic3-server as a service (check the source code for a systemd unit example), and then use mimic3 --remote ... so it will use the web server instead.
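
For reference, a bare-bones unit along these lines should do the job; this is only a sketch, and the unit shipped in the Mimic 3 source may use different paths and options:

    # /etc/systemd/system/mimic3-server.service -- sketch only; the ExecStart
    # path is an assumption, and the unit in the Mimic 3 source may differ.
    [Unit]
    Description=Mimic 3 TTS web server
    After=network.target

    [Service]
    ExecStart=/usr/bin/mimic3-server
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

Then enable it with sudo systemctl enable --now mimic3-server and use mimic3 --remote as described above.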

Calling it via the web interface wasn’t any faster either. Now I have to beef up my base station a bit first…

Btw, I think Mycroft should link to some demos in their Mimic 3 blog post announcement.
If people could hear how good the TTS sounds, they’d presumably be more likely to sign up and get involved. My 2 cents.

Will this be a drop-in replacement?

Hello,

unfortunately I cannot send PMs as a new user and therefore can’t test RTFs for different architectures. Can someone give hints about RTFs, maybe for ARM?

Thanks

Hi @The1And0, on 64-bit ARM you can get an RTF (real-time factor: synthesis time divided by audio duration, so lower is better) of around 0.5. 32-bit ARM is slower, around 1.2 or 1.3. If you’re on a 64-bit x86_64 machine, though, it can be 10x faster than ARM :slight_smile:

Try it out for yourself: https://github.com/mycroftAI/mimic3
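
If you want to measure the RTF on your own hardware, it’s just synthesis time divided by the duration of the generated audio. A small sketch, assuming the mimic3 CLI reads text on stdin and writes a WAV to stdout (the same behaviour the “command” config above relies on); the voice name is only an example:

    # Rough RTF measurement: synthesis time / duration of the generated audio.
    # Assumption: the mimic3 CLI reads text on stdin and writes WAV to stdout.
    # Note: this includes model-loading time, so repeat runs give a better number.
    import io
    import subprocess
    import time
    import wave

    text = "The quick brown fox jumps over the lazy dog."
    voice = "en_UK/apope_low"  # example voice from earlier in the thread

    start = time.perf_counter()
    result = subprocess.run(
        ["mimic3", "--voice", voice],
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    elapsed = time.perf_counter() - start

    with wave.open(io.BytesIO(result.stdout)) as wav:
        audio_seconds = wav.getnframes() / wav.getframerate()

    print(f"RTF = {elapsed / audio_seconds:.2f}")  # below 1.0 means faster than real time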

Oh, there’s a Docker image now. :+1: (Although apparently without harvard-glow_tts yet?)

Is it compatible with the “Remote HTTP” TTS option of Rhasspy?