Today, Mycroft is opening up a beta test for Mimic 3 TTS. It’s a spiritual successor to Larynx, with better performance and more supported languages. It runs about 2x faster than real-time on a Pi 4 (64-bit OS), and I’m hoping it will replace Google TTS for a lot of people.
I’m looking for feedback, specifically on the non-English voices (there are a total of 25 languages). You can sign up for the beta using the link above, or just listen to some voice samples. PM me if you’d like to get access to the code or Debian packages to try out locally. There will also be a Docker image available when it officially launches (probably next month).
If you’d like to learn more, the documentation is already up. And here is the list of currently supported languages:
Hi there, the German examples sound interesting too, so I’d also like to take part in testing.
What kind of additional information should be provided in the PM (besides saying that one wants to participate)? Rhasspy is running as a Debian package here, so I’d need access to the (amd64) deb package…
I don’t know about the other languages, but some of the English ones are really good. My favorite was hifi-tts_low ID 2. But cmu-arctic_low 7, 9, and 11 were also great, in my opinion. I didn’t listen to all of the last group. There were a ton of them.
One thing that seems consistent is problems with gruut for non-English languages (French especially). Because Mimic 3 is licensed under AGPLv3, I’m considering re-training some voices with eSpeak phonemes instead. @fastjack do you think eSpeak pronunciations are better for French than gruut?
No additional info needed; I’ll just send you a Google Drive link. The project is open source, so there’s nothing to hide. I’m just trying to keep the feedback at a level where I can respond and fix things quickly.
Thank you! I will get this fixed right away.
Thanks! Let me know if you’d like to try it locally. It runs pretty well on a Pi 4 (64-bit), and very fast on a desktop/laptop.
The mispronunciation of “de” is one of the problems, but I’m sure there are others. I was just curious if it would be worth it to use eSpeak as a phonemizer instead of gruut.
Listening to the Mimic 3 French samples, the only other issue I can hear is the missing liaison between “C'est” (s e) and “un” (œ̃) in “C'est un” (s e / œ̃), which should vocalize the “t” (s e t / œ̃) when it is followed by a vowel.
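If it helps, this is a quick way to compare what the two phonemizers produce for that phrase. It’s only a rough sketch: it assumes the gruut Python package and the espeak-ng command-line tool are installed locally, and the calls are written from memory, so double-check against their docs.

```python
# Rough comparison of gruut vs. espeak-ng phonemes for a French phrase.
# Assumes the gruut Python package and the espeak-ng CLI are installed.
import subprocess

from gruut import sentences

text = "C'est un exemple."

# gruut: iterate over words and print the phonemes it assigns to each one
print("gruut:")
for sent in sentences(text, lang="fr"):
    for word in sent:
        if word.phonemes:
            print(" ", word.text, "->", " ".join(word.phonemes))

# espeak-ng: -q = no audio, --ipa = print IPA phonemes, -v fr = French voice
ipa = subprocess.run(
    ["espeak-ng", "-q", "--ipa", "-v", "fr", text],
    capture_output=True,
    text=True,
).stdout.strip()
print("espeak-ng:", ipa)
```

Whether espeak-ng actually applies the liaison here is worth checking before deciding anything about re-training.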
Hm, what happened to the harvard-glow_tts voice? That’s the one I’m using and it’s a bit difficult to compare quality between different voices.
I found a couple of possible voices and they seem to have improved where Larynx had trouble, like “thirteen” coming out as “thirthy” and the name “Maria” coming out as “May”, but each of them is somewhat unique and they might have other issues. Also some of them seem to still have trouble with “thirteen” and “Maria”.
I think when the Docker image comes out I’ll do further tests with my actual Rhasspy. I tried putting the beta URL as Remote HTTP but it didn’t like that at all. It crashed on restart until I removed the URL from profile.json.
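For reference, the profile.json section I was editing looked roughly like this (key names are from memory and the URL is just a placeholder, so double-check against the Rhasspy docs):

```json
"text_to_speech": {
  "system": "remote",
  "remote": {
    "url": "http://<beta-host>/api/text-to-speech"
  }
}
```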
I haven’t trained that one just yet. I can put it on the list, though.
Thanks for testing the other voices. I wonder what it is about “thirteen” and “Maria”.
Thanks! Dare I ask your favorite French voice?
Yeah, I’m having a lot of trouble with those voices for some reason. I think they may just work better split out as single speaker models, maybe with eSpeak.
This is certainly very interesting. Although I’m not a Russian native speaker, I did take a little listen to the first Russian sample. I’m just a student, and the monologue was beyond my level, but it sounded in the ballpark to me.
I am curious, is the eventual plan to provide this as one of the options in Rhasspy, or is it something that would be used externally?
On short sentences or single words, there is most of the time a strange noise at the end of the voice output. Not sure why this happens. If the text gets longer, everything is fine (most of the time).
There is a nice, realistic speaking break after a comma, but no break after the end of a sentence. That makes the text hard to understand.
Should the mimic3-server API already be functional?
I think Siwis_low is the best one. But this voice is very close to the one used in public transport in Toulouse, so I don’t like it much. In fact, it sounds like an actress reading a book. So it’s a good voice, but not a natural one.
Otherwise, I would say zeckout; I feel like I’m listening to an old science teacher.
Edit: I re-listened to the voices with a good headset, and Siwis_low is better than the others in terms of quality.
I don’t think there is anything special about those words, it’s just that they happen to be in my voice outputs and Larynx has issues with them. There are probably many more words with issues, I’m just not using them in any voice prompts.
I’d agree in terms of audio quality. Many of the voices I trained from the M-AILabs dataset don’t have great audio quality since they were recorded by volunteers for Librivox using whatever hardware they had.
Thank you! Was it difficult to get working on the phone at all?
This seems to be a general problem with the TTS model I’m using. If the dataset doesn’t contain the speaker saying single words or very short phrases, the model has a hard time producing them. For now, I think I’ll have to consider the M-AILabs voices as intended for reading long-form text only.
I think I can at least fix the pausing issues after a period for now.
Yes, if you’re running it locally you can check out http://localhost:59125/openapi/ to see what’s available. It should also be compatible with anything that’s meant to talk to MaryTTS. You just have to make sure your “MaryTTS voice” is something like “en_UK/apope_low”.
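For example, a MaryTTS-style request could look roughly like this. It’s a sketch using the standard MaryTTS /process parameters; the server may only pay attention to a subset of them (such as INPUT_TEXT and VOICE), so check the /openapi/ page for what it actually supports.

```python
# Minimal MaryTTS-style request against a local mimic3-server.
# Parameter names are the standard MaryTTS ones; the server may ignore some.
import requests

params = {
    "INPUT_TEXT": "Hello from Mimic 3.",
    "INPUT_TYPE": "TEXT",
    "OUTPUT_TYPE": "AUDIO",
    "AUDIO": "WAVE_FILE",
    "VOICE": "en_UK/apope_low",  # the "MaryTTS voice" mentioned above
}
resp = requests.get("http://localhost:59125/process", params=params)
resp.raise_for_status()

# Save the returned WAV audio to a file
with open("mimic3.wav", "wb") as f:
    f.write(resp.content)
```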