Today, Mycroft is opening up a beta test for Mimic 3 TTS. It’s a spiritual successor to Larynx, with better performance and more supported languages. It runs about 2x faster than real-time on a Pi 4 (64-bit OS), and I’m hoping it will replace Google TTS for a lot of people.
I’m looking for feedback, specifically on the non-English voices (there are a total of 25 languages). You can sign up for the beta using the link above, or just listen to some voice samples. PM me if you’d like to get access to the code or Debian packages to try out locally. There will also be a Docker image available when it officially launches (probably next month).
If you’d like to learn more, the documentation is already up. And here is the list of currently supported languages:
Hi there, the German examples sound interesting too, so I’d also like to take part in testing.
What kind of additional information should be provided in the PM (besides saying that one wants to participate)? Rhasspy is running as a Debian package here, so I’d need access to the (amd64) deb package…
I don’t know about the other languages, but some of the English ones are really good. My favorite was hifi-tts_low ID 2. But cmu-arctic_low 7, 9, and 11 were also great, in my opinion. I didn’t listen to all of the last group. There were a ton of them.
One thing that seems consistent is problems with gruut for non-English languages (French especially). Because Mimic 3 is licensed under AGPLv3, I’m considering re-training some voices with eSpeak phonemes instead. @fastjack do you think eSpeak pronunciations are better for French than gruut?
No additional info needed; I’ll just send you a Google Drive link. The project is open source, so there’s nothing to hide. I’m just trying to keep the feedback at a level where I can respond and fix things quickly.
Thank you! I will get this fixed right away.
Thanks! Let me know if you’d like to try it locally. It runs pretty well on a Pi 4 (64-bit), and very fast on a desktop/laptop.
The mispronunciation of “de” is one of the problems, but I’m sure there are others. I was just curious if it would be worth it to use eSpeak as a phonemizer instead of gruut.
Listening to the Mimic 3 French samples, the only other issue I can hear is the missing liaison between “C'est” (s e) and “un” (œ̃) in “C'est un” (s e / œ̃), which should vocalize the “t” (s e t / œ̃) when it is followed by a vowel.
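If it helps, this is a quick way to compare what the two phonemizers produce for that phrase. It’s only a rough sketch: it assumes the gruut Python package and the espeak-ng command-line tool are installed locally, and the calls are written from memory, so double-check against their docs.

```python
# Rough comparison of gruut vs. espeak-ng phonemes for a French phrase.
# Assumes the gruut Python package and the espeak-ng CLI are installed.
import subprocess

from gruut import sentences

text = "C'est un exemple."

# gruut: iterate over words and print the phonemes it assigns to each one
print("gruut:")
for sent in sentences(text, lang="fr"):
    for word in sent:
        if word.phonemes:
            print(" ", word.text, "->", " ".join(word.phonemes))

# espeak-ng: -q = no audio, --ipa = print IPA phonemes, -v fr = French voice
ipa = subprocess.run(
    ["espeak-ng", "-q", "--ipa", "-v", "fr", text],
    capture_output=True,
    text=True,
).stdout.strip()
print("espeak-ng:", ipa)
```

Whether espeak-ng actually applies the liaison here is worth checking before deciding anything about re-training.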
Hm, what happened to the harvard-glow_tts voice? That’s the one I’m using and it’s a bit difficult to compare quality between different voices.
I found a couple of possible voices and they seem to have improved where Larynx had trouble, like “thirteen” coming out as “thirthy” and the name “Maria” coming out as “May”, but each of them is somewhat unique and they might have other issues. Also some of them seem to still have trouble with “thirteen” and “Maria”.
I think when the Docker image comes out I’ll do further tests with my actual Rhasspy. I tried putting the beta URL as Remote HTTP but it didn’t like that at all. It crashed on restart until I removed the URL from profile.json.
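For reference, the profile.json section I was editing looked roughly like this (key names are from memory and the URL is just a placeholder, so double-check against the Rhasspy docs):

```json
"text_to_speech": {
  "system": "remote",
  "remote": {
    "url": "http://<beta-host>/api/text-to-speech"
  }
}
```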
I haven’t trained that one just yet. I can put it on the list, though.
Thanks for testing the other voices. I wonder what it is about “thirteen” and “Maria”.
Thanks! Dare I ask your favorite French voice?
Yeah, I’m having a lot of trouble with those voices for some reason. I think they may just work better split out as single speaker models, maybe with eSpeak.
This is certainly very interesting. Although I’m not a Russian native speaker, I did take a little listen to the first Russian sample. I’m just a student, and the monologue was beyond my level, but it sounded in the ballpark to me.
I am curious, is the eventual plan to provide this as one of the options in Rhasspy, or is it something that would be used externally?
On short sentences or single words, there is most of the time a strange noise at the end of the voice output. Not sure why this happens. If the text gets longer, everything is fine (most of the time).
There is a nice, realistic speaking break after a comma, but no break after the end of a sentence. That makes the text hard to understand.
Should the mimic3-server API already be functional?
I think Siwis_low is the best one. But this voice is very close to the one used in public transport in Toulouse, so I don’t like it much. In fact, it sounds like an actress reading a book. So it’s a good voice, but not a natural one.
Otherwise, I would say zeckout; I feel like I’m listening to an old science teacher.
Edit: I re-listened to the voices with a good headset, and Siwis_low is better than the others in terms of quality.
I don’t think there is anything special about those words, it’s just that they happen to be in my voice outputs and Larynx has issues with them. There are probably many more words with issues, I’m just not using them in any voice prompts.
I’d agree in terms of audio quality. Many of the voices I trained from the M-AILabs dataset don’t have great audio quality since they were recorded by volunteers for Librivox using whatever hardware they had.
Thank you! Was it difficult to get working on the phone at all?
This seems to be a general problem with the TTS model I’m using. If the dataset doesn’t contain the speaker saying single words or very short phrases, the model has a hard time producing them. For now, I think I’ll have to consider the M-AILabs voices as intended for reading long-form text only.
I think I can at least fix the pausing issues after a period for now.
Yes, if you’re running it locally you can check out http://localhost:59125/openapi/ to see what’s available. It should also be compatible with anything that’s meant to talk to MaryTTS. You just have to make sure your “MaryTTS voice” is something like “en_UK/apope_low”.
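For example, a MaryTTS-style request could look roughly like this. It’s a sketch using the standard MaryTTS /process parameters; the server may only pay attention to a subset of them (such as INPUT_TEXT and VOICE), so check the /openapi/ page for what it actually supports.

```python
# Minimal MaryTTS-style request against a local mimic3-server.
# Parameter names are the standard MaryTTS ones; the server may ignore some.
import requests

params = {
    "INPUT_TEXT": "Hello from Mimic 3.",
    "INPUT_TYPE": "TEXT",
    "OUTPUT_TYPE": "AUDIO",
    "AUDIO": "WAVE_FILE",
    "VOICE": "en_UK/apope_low",  # the "MaryTTS voice" mentioned above
}
resp = requests.get("http://localhost:59125/process", params=params)
resp.raise_for_status()

# Save the returned WAV audio to a file
with open("mimic3.wav", "wb") as f:
    f.write(resp.content)
```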