Preview of New TTS Voices

Hi @nicolas_Rodriguez, welcome. I’m not sure I understand what you’re asking. Do you want to contribute a new voice?

Yes, I want to contribute a new French female voice.


Larynx looks really cool! I’ve got a collection of WAV files and CSVs mapping each file to some text that I’m hoping to use to generate a custom voice. If I wanted to train a model for use with Larynx, any recommendations on where/how to get started?

How do you get the real-time factor under 1?

Best I can get on a Pi 4 is 1.9 (with harvard and vocoder low):

(audio=2.58 sec, infer=4.90 sec)
Full log

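(Note that the log still uses the wrong real-time factor calculation: it reports 0.53, which is audio time divided by inference time. The actual real-time factor, inference time divided by audio time, is 1.9.)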

[DEBUG:2021-09-24 17:40:57,728] rhasspyserver_hermes: TTS timeout will be 30 second(s)
[DEBUG:2021-09-24 17:40:57,732] rhasspyserver_hermes: -> TtsSay(text='This is a test that is a bit longer 2', site_id='default', lang=None, id='2188d73f-5a17-4b5d-ac46-54fd32ca6429', session_id='', volume=1.0)
[DEBUG:2021-09-24 17:40:57,732] rhasspyserver_hermes: Publishing 162 bytes(s) to hermes/tts/say
[DEBUG:2021-09-24 17:40:57,742] rhasspytts_larynx_hermes: <- TtsSay(text='This is a test that is a bit longer 2', site_id='default', lang=None, id='2188d73f-5a17-4b5d-ac46-54fd32ca6429', session_id='', volume=1.0)
[DEBUG:2021-09-24 17:40:57,743] rhasspytts_larynx_hermes: Synthesizing 'This is a test that is a bit longer 2' (voice=harvard)
[DEBUG:2021-09-24 17:40:58,042] larynx: Words for 'This is a test that is a bit longer 2': ['this', 'is', 'a', 'test', 'that', 'is', 'a', 'bit', 'longer', 'two']
[DEBUG:2021-09-24 17:40:58,042] larynx: Phonemes for 'This is a test that is a bit longer 2': ['#', 'ð', 'ˈ', 'ɪ', 's', '#', 'ˈ', 'ɪ', 'z', '#', 'ə', '#', 't', 'ˈ', 'ɛ', 's', 't', '#', 'ð', 'ˈ', 'æ', 't', '#', 'ˈ', 'ɪ', 'z', '#', 'ə', '#', 'b', 'ˈ', 'ɪ', 't', '#', 'l', 'ˈ', 'ɔ', 'ŋ', 'ɡ', 'ɚ', '#', 't', 'ˈ', 'u', '#', '‖', '‖']
[DEBUG:2021-09-24 17:40:58,044] larynx: Running text to speech model (GlowTextToSpeech)
[DEBUG:2021-09-24 17:40:59,322] larynx: Got mels in 1.2776429069999722 second(s) (shape=(1, 80, 222))
[DEBUG:2021-09-24 17:40:59,330] larynx: Running vocoder model (HiFiGanVocoder)
[DEBUG:2021-09-24 17:41:02,652] hifi_gan: Running denoiser (strength=0.001)
[DEBUG:2021-09-24 17:41:02,949] larynx: Got audio in 3.6182490159990266 second(s) (shape=(56832,))
[DEBUG:2021-09-24 17:41:02,952] larynx: Real-time factor: 0.53 (audio=2.58 sec, infer=4.90 sec)
[DEBUG:2021-09-24 17:41:02,956] rhasspytts_larynx_hermes: Got 113708 byte(s) of WAV data
[DEBUG:2021-09-24 17:41:02,957] rhasspytts_larynx_hermes: -> AudioPlayBytes(113708 byte(s)) to hermes/audioServer/default/playBytes/2188d73f-5a17-4b5d-ac46-54fd32ca6429
[DEBUG:2021-09-24 17:41:02,962] rhasspytts_larynx_hermes: Waiting for play finished (timeout=2.8274149659863945)
[DEBUG:2021-09-24 17:41:02,970] rhasspyserver_hermes: Handling AudioPlayBytes (topic=hermes/audioServer/default/playBytes/2188d73f-5a17-4b5d-ac46-54fd32ca6429, id=4d629c6c-df45-4d42-baf4-2824e1173137)
[WARNING:2021-09-24 17:41:05,794] rhasspytts_larynx_hermes: Did not receive playFinished before timeout
[DEBUG:2021-09-24 17:41:05,796] rhasspytts_larynx_hermes: -> TtsSayFinished(site_id='default', id='2188d73f-5a17-4b5d-ac46-54fd32ca6429', session_id='')
[DEBUG:2021-09-24 17:41:05,797] rhasspytts_larynx_hermes: Publishing 84 bytes(s) to hermes/tts/sayFinished
[DEBUG:2021-09-24 17:41:05,804] rhasspydialogue_hermes: <- TtsSayFinished(site_id='default', id='2188d73f-5a17-4b5d-ac46-54fd32ca6429', session_id='')
[DEBUG:2021-09-24 17:41:05,804] rhasspyserver_hermes: Handling TtsSayFinished (topic=hermes/tts/sayFinished, id=4d629c6c-df45-4d42-baf4-2824e1173137)
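For anyone checking the numbers: the corrected figure can be reproduced from the two timings in the log. This is only a small sketch. The function name `real_time_factor` is made up for illustration and is not part of Larynx.

```python
def real_time_factor(infer_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of audio produced.

    Values below 1 mean synthesis is faster than playback.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return infer_seconds / audio_seconds


# Timings copied from the log line "(audio=2.58 sec, infer=4.90 sec)".
audio_seconds = 2.58
infer_seconds = 4.90

# Corrected real-time factor: inference time / audio time.
print(f"real-time factor: {real_time_factor(infer_seconds, audio_seconds):.2f}")

# The value printed in the log is the inverse (audio time / inference time).
print(f"value in the log: {audio_seconds / infer_seconds:.2f}")
```

So 0.53 in the log and 1.9 here describe the same run, just with the ratio flipped.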

I noticed that in the README it says “Platform: aarch64”. If my Docker image is "Architecture": "arm", "Variant": "v7", that’s a different one, right? Maybe that’s the reason? Did you run it on Raspberry Pi OS or did you install a custom OS?

By the way, even with a real-time factor under 1 it’s not actually real-time because we need to wait for the full text to be synthesized, correct?

Yes, this needs a 64-bit Raspberry Pi OS (aarch64) to get a real-time factor under 1. Plus, as you mentioned, the vocoder quality needs to be set to low.

A more recent version of Larynx does get the true real-time factor under 1 for text with multiple sentences. As one sentence is being played, multiple sentences down the line are synthesized and queued.
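The idea of synthesizing ahead while earlier audio plays can be sketched with a bounded queue and a background thread. This is not Larynx's actual code. `synthesize` and `play` are hypothetical placeholders that only sleep, and the regex sentence splitter is a simplifying assumption.

```python
import queue
import re
import threading
import time


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def synthesize(sentence):
    # Hypothetical stand-in for a TTS call; returns fake "audio" bytes.
    time.sleep(0.01)
    return sentence.encode("utf-8")


def play(audio, played):
    # Hypothetical stand-in for audio playback; records what was played.
    time.sleep(0.01)
    played.append(audio.decode("utf-8"))


def speak(text, max_queued=2):
    """Synthesize sentences in the background while earlier ones play."""
    audio_queue = queue.Queue(maxsize=max_queued)  # bounds the look-ahead
    played = []

    def producer():
        for sentence in split_sentences(text):
            audio_queue.put(synthesize(sentence))  # blocks when queue is full
        audio_queue.put(None)  # sentinel: nothing more to play

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        audio = audio_queue.get()
        if audio is None:
            break
        play(audio, played)
    worker.join()
    return played


print(speak("First sentence. Second one! And a third?"))
```

With a single producer and a FIFO queue, playback order matches the text order. The listener only waits for the first sentence, as long as each later sentence is synthesized faster than the previous one plays.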

Oh well, maybe I didn’t even have to buy a Pi 4 (had a Pi 3)? :grinning_face_with_smiling_eyes: Time for a reinstall.

Well, in the context of Rhasspy (/api/text-to-speech), there is only one sentence, I think? Or can I somehow split the request into sentences to synthesize them in parallel and still play them in the correct order?
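On the ordering part of that question, here is one possible client-side approach, independent of what Rhasspy itself supports. `Executor.map` from Python's standard library returns results in input order even when later items finish first. `fake_synthesize` is a hypothetical placeholder, not a Rhasspy or Larynx call, and the regex splitter is a rough assumption.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_into_sentences(text):
    # Rough splitter (assumption): break after ., ! or ? followed by whitespace.
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def fake_synthesize(sentence):
    # Hypothetical placeholder for a per-sentence synthesis request.
    return f"<audio:{sentence}>"


def synthesize_in_order(text, workers=2):
    """Synthesize sentences concurrently, returning clips in text order."""
    sentences = split_into_sentences(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of the inputs, not completion order.
        return list(pool.map(fake_synthesize, sentences))


print(synthesize_in_order("Hello there. How are you?"))
```

Whether running syntheses in parallel helps on a Pi depends on how many cores the model already uses, so this is only the ordering half of the answer.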

For people finding this, there is also a 64-bit version of Raspberry Pi OS lite, which is better suited for (Rhasspy) servers: https://downloads.raspberrypi.org/raspios_lite_arm64/images/
