Mimic 3 TTS Preview

Hi everyone,

Today, Mycroft is opening up a beta test for Mimic 3 TTS. This is a spiritual successor to Larynx, with better performance and more supported languages. It runs about 2x faster than real-time on a Pi 4 (64-bit OS), and I’m hoping this will replace Google TTS for a lot of people :slight_smile:

I’m looking for feedback, specifically on the non-English voices (there are a total of 25 languages). You can sign up for the beta using the link above, or just listen to some voice samples. PM me if you’d like to get access to the code or Debian packages to try out locally. There will also be a Docker image available when it officially launches (probably next month).

If you’d like to learn more, the documentation is already up. And here is the list of currently supported languages:

  • Afrikaans
  • Bengali
  • Dutch
  • English
  • Farsi
  • Finnish
  • French
  • German
  • Greek
  • Gujarati
  • Hausa
  • Hungarian
  • Italian
  • Javanese
  • Kiswahili
  • Korean
  • Nepali
  • Polish
  • Russian
  • Setswana
  • Spanish
  • Telugu
  • Ukrainian
  • Vietnamese
  • Yoruba

Thanks,
Mike

5 Likes

Good work, the Dutch voices are pretty good. I’ll check on the documentation!

1 Like

Hi there, the German examples also sound interesting, so I’d like to take part in testing too.

What additional information should be provided in the PM (besides the fact that one wants to participate)? Rhasspy is running as a Debian package here, so I’d need access to the (amd64) deb package…

1 Like

@synesthesiam Awesome! :+1:

Small issue with the French voices for the word “de” though.

C'est un arc de cercle is read as C'est un arc dam cercle.

In the Rhasspy French Kaldi base_dictionary.txt, the word “de” has these pronunciations:

de d a m
de d e
de d ə
de d e ø

The first one (d a m) looks incorrect to me and probably is the cause of this issue.
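For anyone who wants to check other words the same way, here is a minimal sketch that groups the pronunciations per word. It assumes the dictionary uses the "word phoneme phoneme ..." layout shown in the four lines above; `load_pronunciations` is a hypothetical helper name, not part of Rhasspy or gruut.

```python
from collections import defaultdict


def load_pronunciations(lines):
    """Map each word to its list of pronunciations (each a list of phonemes)."""
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, phonemes = parts[0], parts[1:]
        lexicon[word].append(phonemes)
    return dict(lexicon)


# The four entries quoted above, in the assumed whitespace-separated layout
sample = [
    "de d a m",
    "de d e",
    "de d ə",
    "de d e ø",
]

lexicon = load_pronunciations(sample)
for phonemes in lexicon["de"]:
    print("de ->", " ".join(phonemes))
```

To run it on a real file, the `sample` list could be replaced with the lines of the dictionary opened with `encoding="utf-8"`.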

Hope this helps.

3 Likes

I don’t know about the other languages, but some of the English ones are really good. My favorite was hifi-tts_low ID 2. But cmu-arctic_low 7, 9, and 11 were also great, in my opinion. I didn’t listen to all of the last group. There were a ton of them.

1 Like

Thanks everyone for testing and the feedback!

One thing that seems consistent is problems with gruut for non-English languages (French especially). Because Mimic 3 is licensed under AGPLv3, I’m considering re-training some voices with eSpeak phonemes instead. @fastjack do you think eSpeak pronunciations are better for French than gruut?

No additional info, I’ll just send you a Google Drive link. The project is open source, so there’s nothing to hide. I’m just trying to keep the feedback at a level where I can respond and fix things quickly :slight_smile:

Thank you! I will get this fixed right away.

Thanks! Let me know if you’d like to try it locally. It runs pretty well on a Pi 4 (64-bit), and very fast on a desktop/laptop.

1 Like

I’ve never noticed any specific issue with IPA phonemes (using the Rhasspy Kaldi French profile).

What kind of problems did you encounter with gruut for the French language?

The mispronunciation of “de” is one of the problems, but I’m sure there are others. I was just curious if it would be worth it to use eSpeak as a phonemizer instead of gruut.

1 Like

Listening to the Mimic 3 French samples, the only other issue I can hear is the missing liaison between the C'est (s e) and the un (œ̃'') in C'est un (s e / œ̃''), which should vocalize the t (s e t / œ̃'') when it is followed by a vowel.
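To make the liaison rule concrete, here is a toy sketch of it. This is not how gruut or eSpeak handle liaison: the `LIAISON_CONSONANT` table, the vowel list, and `apply_liaison` are all hypothetical, and real French liaison depends on grammar and has many exceptions.

```python
# Toy illustration only; real liaison rules are far more involved.

# Base characters of vowel phonemes; nasal marks are ignored by only
# looking at the first code point of a phoneme.
VOWEL_BASES = set("aeiouyøœɛɑɔə")

# Hypothetical table: words whose silent final consonant can surface in liaison
LIAISON_CONSONANT = {"c'est": "t"}


def apply_liaison(words):
    """words: list of (spelling, phoneme list). Returns a list of phoneme lists."""
    result = []
    for i, (spelling, phonemes) in enumerate(words):
        phonemes = list(phonemes)  # copy so the input is not modified
        latent = LIAISON_CONSONANT.get(spelling.lower())
        if latent and i + 1 < len(words):
            next_phonemes = words[i + 1][1]
            # Pronounce the latent consonant only before a vowel sound
            if next_phonemes and next_phonemes[0][0] in VOWEL_BASES:
                phonemes.append(latent)
        result.append(phonemes)
    return result


sentence = [("C'est", ["s", "e"]), ("un", ["œ̃"])]
print(apply_liaison(sentence))
```

With a following consonant, as in "C'est bon", the t stays silent and the phonemes come back unchanged.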

2 Likes

Sure, I’ll try it on my Chromebook Tablet.

1 Like

Hm, what happened to the harvard-glow_tts voice? That’s the one I’m using and it’s a bit difficult to compare quality between different voices.

I found a couple of possible voices and they seem to have improved where Larynx had trouble, like “thirteen” coming out as “thirthy” and the name “Maria” coming out as “May”, but each of them is somewhat unique and they might have other issues. Also some of them seem to still have trouble with “thirteen” and “Maria”.

I think when the Docker image comes out I’ll do further tests with my actual Rhasspy. I tried putting the beta URL as Remote HTTP but it didn’t like that at all. It crashed on restart until I removed the URL from profile.json.

1 Like

Nice work! Well done!
I tested the French voices.

I agree there are some phoneme issues due to gruut. Trying another phonemizer could be a solution.

Also, the first voices, m-ailabs_low, end too abruptly. The sound of the last letter is cut off.

Otherwise, the voices are really nice.

1 Like

I hadn’t trained that one just yet. I can put it on the list, though :+1:
Thanks for testing the other voices. I wonder what it is about “thirteen” and “Maria”.

Thanks! Dare I ask your favorite French voice :wink: ?

Yeah, I’m having a lot of trouble with those voices for some reason. I think they may just work better split out as single speaker models, maybe with eSpeak.

This is certainly very interesting. Although I’m not a native Russian speaker, I did take a little listen to the first Russian sample. I’m just a student, and the monologue was beyond my level, but it sounded in the ballpark to me.

I am curious, is the eventual plan to provide this as one of the options in Rhasspy, or is it something that would be used externally?

“thorsten drunk” - best voice ever :laughing: :heart:

1 Like

Using mimic3-server on a Pi 4 (64-bit).

m-ailabs_low:

  • On short sentences or just single words, most of the time there is a strange noise at the end of the voice output. Not sure why this happens. If the text gets longer, everything is OK (most of the time).
  • There is a nice and realistic pause after a comma, but no pause at the end of a sentence. That makes the text hard to understand.

Should the API of mimic3-server already be functional?

I tested German and English voices on a smartphone running Mobian (Pocophone F1). Impressive speed and quality!

I think Siwis_low is the best one. But this voice is very close to the voice used in public transport in Toulouse, so I don’t like it much. In fact it sounds like an actress reading a book. So it’s a good voice, but not a natural one.

Otherwise, I would say zeckout; I feel like I’m listening to an old science teacher.

Edit: I re-listened to the voices with a good headset, and Siwis low is better than the others in terms of quality.

I don’t think there is anything special about those words; it’s just that they happen to be in my voice outputs and Larynx has issues with them. There are probably many more words with issues, but I’m just not using them in any voice prompts.

1 Like

I’d agree in terms of audio quality. Many of the voices I trained from the M-AILabs dataset don’t have great audio quality since they were recorded by volunteers for Librivox using whatever hardware they had.

Thank you! Was it difficult to get working on the phone at all?

This seems to be a general problem with the TTS model I’m using. If the dataset doesn’t contain the speaker saying single words or very short phrases, the model has a hard time producing them. For now, I think I’ll have to consider the M-AILabs voices as intended for reading long-form text only :confused:

I think I can at least fix the pausing issues after a period for now :+1:

Yes, if you’re running it locally you can check out http://localhost:59125/openapi/ to see what’s available. It should also be compatible with anything that’s meant to talk to MaryTTS. You just have to make sure your “MaryTTS voice” is something like “en_UK/apope_low”.
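As a rough sketch of what a MaryTTS-style client request looks like, the snippet below builds the URL only. The `/process` path and the upper-case parameter names are MaryTTS's conventional ones; whether the beta server accepts exactly this set is an assumption, so check the openapi page first. `build_marytts_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_marytts_url(text, voice, base_url="http://localhost:59125"):
    """Build a MaryTTS-style GET URL for synthesizing `text` with `voice`."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "VOICE": voice,  # e.g. the "MaryTTS voice" setting mentioned above
    }
    # urlencode escapes spaces and the "/" inside the voice name
    return f"{base_url}/process?{urlencode(params)}"


url = build_marytts_url("Hello world.", "en_UK/apope_low")
print(url)

# To fetch audio from a running local server (not executed here):
# from urllib.request import urlopen
# with urlopen(url) as response, open("hello.wav", "wb") as wav_file:
#     wav_file.write(response.read())
```

Keeping the URL construction separate from the download makes it easy to print the request and compare it with what another MaryTTS client sends.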