Preview of New TTS Voices

I’ve had my head down coding for the past few months on a new version of Larynx for text to speech. My main goal has been speed, since the last version (based on MozillaTTS) was painfully slow, even on a desktop machine.

I’m happy to say that the new version of Larynx is much faster. Even on a Pi 4, you can get faster than realtime speech using the lower quality settings!

I’m planning to release Rhasspy 2.5.10 with the new Larynx this upcoming week, but I wanted to give everyone a preview of the 35 voices (8 languages) that will be available:

Samples for all 35 voices:

  • English (en-us, 20 voices)
    • blizzard_fls (F, accent)
    • cmu_aew (M)
    • cmu_ahw (M)
    • cmu_aup (M, accent)
    • cmu_bdl (M)
    • cmu_clb (F)
    • cmu_eey (F)
    • cmu_fem (M)
    • cmu_jmk (M)
    • cmu_ksp (M, accent)
    • cmu_ljm (F)
    • cmu_lnh (F)
    • cmu_rms (M)
    • cmu_rxr (M)
    • cmu_slp (F, accent)
    • cmu_slt (F)
    • ek (F, accent)
    • harvard (F, accent)
    • kathleen (F)
    • ljspeech (F)
  • German (de-de, 1 voice)
    • thorsten (M)
  • French (fr-fr, 3 voices)
    • gilles_le_blanc (M)
    • siwis (F)
    • tom (M)
  • Spanish (es-es, 2 voices)
    • carlfm (M)
    • karen_savage (F)
  • Dutch (nl, 3 voices)
    • bart_de_leeuw (M)
    • flemishguy (M)
    • rdh (M)
  • Italian (it-it, 2 voices)
    • lisa (F)
    • riccardo_fasol (M)
  • Swedish (sv-se, 1 voice)
    • talesyntese (M)
  • Russian (ru-ru, 3 voices)
    • hajdurova (F)
    • nikolaev (M)
    • minaev (M)

If you hear a problem with any voice, or would like to donate your own, please let me know! In most cases, I can train a new voice with about 1.5 hours of quality audio.

These voices were possible thanks to:

  • Public audio datasets, some donated by Rhasspy users!
  • Feedback from Rhasspy users on language-specific sentences and pronunciations
  • My mini GPU “cluster”, with one GPU donated by a Rhasspy user :slight_smile:
12 Likes

Under which license will these voices be released?

This is really great work! :star_struck:

I have been playing around with the new Larynx on my laptop a bit and was astonished by the quality of the (mainly German) TTS output.

Eager to include it in my smart home, I tried the Docker container first on my laptop, which worked flawlessly. But starting it on my small (Intel Celeron) server failed, which I attributed to missing AVX support. As you mentioned in this issue, there seems to be some problem with pytorch and AVX, which should be fixed with pytorch 1.7.0.

Is the Dockerfile for rhasspy/larynx available somewhere, so that I could try to build a noavx version myself?

1 Like

On the French voices (not mine, but the first two), the sound /k/ is missing: they pronounce “allergique” /a.lɛʁ.ʒi/ instead of /a.lɛʁ.ʒik/.

I agree with @oscaropen, it sounds very good (at least the French voices). A friend told me, “Wow, it’s your voice!”

1 Like

Each voice has a README with a link to the original dataset, so they will be under whatever license the author has set. This is usually either public domain (CC-0) or something very relaxed/open. I will update the Larynx README to point specifically to the licenses :+1:

I will upload a new set of Docker images this week too. I haven’t built the Dockerfile for this version of Larynx just yet. As luck would have it, the switch away from MozillaTTS to the Onnx Runtime eliminated the AVX problem! So there will be no need for a “noavx” version.

Thanks! I’ve observed this kind of thing at the end of sentences with the Russian and Italian voices too. The input phonemes are correct, but the voice cuts off the last sound for some reason. Maybe we can figure this out over time after everything is released :slight_smile:

3 Likes

Sure we can figure it out after the release.

I tested it on my computer and found another bug with French: with Larynx, the word de is pronounced /dam/. It should be pronounced /də/.
For example, in the sentence Ils sont totalement isolées de la métropole.

Out of interest, does Larynx benefit from running on Aarch64, like tensorflow and especially tensorflow-lite do, with the Pi 4 we have available?

I was really impressed with the Pi 400 running https://github.com/TensorSpeech/TensorFlowTTS, which only gives approx 1.3x realtime and needs streaming output.
Out of interest, what does Larynx manage, and does it have streaming output?

1 Like

Larynx uses Onnx, which seems to be pretty similar to Tensorflow Lite. There’s an official wheel for Aarch64, while I had to build my own for 32-bit ARM without optimizations enabled (it crashed with them). So there seems to be an advantage on 64-bit ARM.

I was also impressed with TensorFlowTTS, and I would have preferred to go with them instead of going the PyTorch to Onnx route. But then I read this odd license paragraph:

Overall, Almost models here are licensed under the Apache 2.0 for all countries in the world, except in Viet Nam this framework cannot be used for production in any way without permission from TensorFlowTTS’s Authors. There is an exception, Tacotron-2 can be used with any purpose. If you are Vietnamese and want to use this framework for production, you Must contact us in advance.

I really don’t know what to make of that, but I didn’t like it.

It does a kind of “streaming” output by working on sentences in parallel. So you don’t have to wait for an entire paragraph to finish synthesizing before getting audio output (on the command-line).

When I start really training Tacotron2 models, though, there’s a chance for real streaming output. Tacotron2 produces mel frames in chunks, which can be forwarded to a vocoder. When I tested this, I got scratchy audio artifacts and decided to put it on hold for now. I really need to get back to KW detection :wink:
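The chunked mel-to-vocoder idea can be sketched in plain Python. Everything here is illustrative and not Larynx’s actual API: `fake_vocoder` stands in for a real vocoder, and the overlap trick (rendering a few extra context frames per chunk, then trimming their audio) is one common way to reduce seam artifacts like the crackling mentioned above.

```python
CHUNK = 32    # mel frames per vocoder call (illustrative value)
OVERLAP = 4   # frames of context carried over between chunks
HOP = 256     # audio samples per mel frame (typical for 22 kHz models)

def fake_vocoder(mel_frames):
    # Stand-in for a real vocoder: each mel frame becomes HOP samples.
    return [0.0] * (len(mel_frames) * HOP)

def stream_synthesize(mel, chunk=CHUNK, overlap=OVERLAP):
    """Yield audio chunk-by-chunk instead of waiting for the whole mel."""
    start = 0
    while start < len(mel):
        end = min(start + chunk, len(mel))
        # Include a few earlier frames as context, then trim their audio,
        # so chunk boundaries line up and crackle less.
        ctx_start = max(0, start - overlap)
        audio = fake_vocoder(mel[ctx_start:end])
        yield audio[(start - ctx_start) * HOP:]
        start = end

mel = [[0.0] * 80 for _ in range(100)]  # 100 fake 80-band mel frames
audio = [s for part in stream_synthesize(mel) for s in part]
assert len(audio) == 100 * HOP  # trimming keeps the total length exact
```

The first audio chunk is available after one vocoder call, which is what makes streaming playback possible while the rest of the sentence is still synthesizing.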

Yeah, I could not work the Vietnam thing out, so not being Vietnamese I wasn’t so bothered.
Not bothered either way, but I was really curious about performance figures.

When you’re throwing around loads of tensors, the wider the data bus, the more you can handle simultaneously.
Armv7 -> Aarch64 really is a 2-3x perf improvement, as with TF it has all been optimized for 64-bit; it’s faster since it’s predominantly math libs.

When I have been playing with that Google-KWS, it gives accuracy results for TF vs TFL, but the speed increase of a TFL quantised model is also really big.

I0329 19:49:03.096672 139843685021504 test.py:495] tf test accuracy, stream model state external = 100.00% 200 out of 609
I0329 19:49:40.055655 139843685021504 test.py:495] tf test accuracy, stream model state external = 100.00% 400 out of 609
I0329 19:50:17.611867 139843685021504 test.py:495] tf test accuracy, stream model state external = 100.00% 600 out of 609
I0329 19:50:19.069626 139843685021504 test.py:500] TF Final test accuracy of stream model state external = 100.00% (N=609)

INFO: TfLiteFlexDelegate delegate: 2 nodes delegated out of 34 nodes with 1 partitions.

I0329 19:52:51.021229 139843685021504 test.py:619] tflite test accuracy, stream model state external = 100.000000 200 out of 609
I0329 19:52:55.191242 139843685021504 test.py:619] tflite test accuracy, stream model state external = 100.000000 400 out of 609
I0329 19:52:59.372943 139843685021504 test.py:619] tflite test accuracy, stream model state external = 100.000000 600 out of 609
I0329 19:52:59.534713 139843685021504 test.py:624] tflite Final test accuracy, stream model state external = 100.00% (N=609)

It’s not all running on TFL, as 2 nodes do delegate out to run TF, but the speed increases are pretty huge.

ONNX Runtime Mobile can execute all standard ONNX models, but what exactly that means I don’t know. Just scratching the surface with tensorflow & tensorflow-lite and flex delegates, all I have gathered is how confusing it is to delegate out, and how constraining the basic functions of the ‘lite’ runtimes can be.

I presume the benefits of 64-bit would be the same, and running ONNX Mobile probably has similar results to the above.

If you have the processing time of a Pi 4 producing an approx 10-second sentence, it would be interesting to compare, as model vs model / framework vs framework is so confusing that I don’t think there is really any metric you can use.

Tensorflow tends to have faster optimised versions, as it’s a static graph lib vs dynamic libs like pytorch, so it’s far less flexible; that’s why pytorch garners so much research, as there’s no need to write and compile a static graph due to its dynamic nature.
I think onnx training can be either, as it can be used with both TF and pytorch, and it’s down to how training and models have been implemented whether static optimisation is used.

I guess with the Pi 4 it doesn’t matter so much, but a whole rake of different frameworks can eat quite a bit of memory, as opposed to several uses of one.

I would have a read of https://github.com/TensorSpeech/TensorFlowTTS/issues/522, as that also suffers from a ‘crackling’ sound, but it seems if you overlap slightly and feed a queue it can be done.

I thought you were ‘too busy for kw’?

Just for comparison: in my Speech-to-Text project, a quantized tflite model gives a prediction speedup of about 2x compared to the non-quantized model, both running on a Raspi 4 using the tflite runtime only.

But as you already mentioned, tflite functions are a bit constraining, so some layers need a bit of a workaround to get rid of the flex delegation.

@DANBER or @rolyan_trauts, do you know if TFLite can be installed without the full Tensorflow package?

I’ve struggled with this on the Onnx side too. PyTorch will happily convert a model to Onnx, only for you to find out that half the operations aren’t supported. I got incredibly lucky that the NVIDIA Tacotron2 code and the original GlowTTS models had fully-supported operations.

Everything else I’ve tried has failed, and I don’t understand the models well enough to swap things out.

Yes, I’m using the following for my container:
RUN pip3 install --no-cache-dir --extra-index-url https://google-coral.github.io/py-repo/ tflite_runtime

In this case you can only use fully supported operations, but if the net worked with onnx it might work with tflite too.

1 Like

I don’t think it really matters, as that is what delegation is for; when I am running TF-Lite with the Google-KWS, 2 nodes are delegated out.
Yeah, you have to install full TF, which is what I am using, but in my code:

import tensorflow as tf
# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="/home/pi/google-kws/models2/crnn_state/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")
interpreter1.allocate_tensors()

You got lucky, as the fused LSTM RNN layers obviously work for Tacotron2. But imagine, say, developing your own custom C++ GRU and then compiling and distributing your own custom TF; that is why special ops and delegation were created.

I don’t really care that I am installing the full TF on a Pi, as I am using the TF Lite methods, not the full TF calls. With a correctly converted model it results in 200mb mem and a near x10 speed increase over the TF model, which I don’t use for inference, though the training end of the test does.

With tensorflow it’s RNNs that are the problem: it supports ‘fused LSTMs’, and the rest is on the roadmap, but common layers like a GRU are not supported. You can just delegate those out, though, and often the higher order (nearer input) layers can run on TF-Lite, so by the time the delegated layers run, the hard work of reducing parameters has already been done.

You don’t have to have just TFLite installed on your machine, as it’s part of the full TF code; it’s the other way round, where the TFLite-only binary is just the TFLite code minus the rest of TF.

When I ran a 20ms streaming KWS at near 80% load on a Pi Zero, I was running the full TF binary but using the TFL call methods.
The Zero just sucks, as it has no Neon, which TF & TFL are optimised for, and the 2-3x speed-up of 64-bit is also not possible, so you get this huge 80% load. In the context of a Pi 3A+ that ends up as just less than 20% of a single core, so the TF load is no problem when you can provide 20x the TF operations, but it’s essential to have Neon, and the 2-3x speed of Aarch64 shouldn’t really be ignored.

You got any perf figures on Larynx? Wav duration vs processing time? Pi 4?

Woooow, it’s really amazing!
French voices are really good. The first one has a typical accent :wink:
When you mentioned “I can train a new voice with about 1.5 hours of quality audio”, do you mean that we need to record a dedicated text with a duration of 1.5 hours?
If so, is it possible to split it up? :upside_down_face: If it is, I’m in :+1:

Thanks for your hard work. My house is very excited to speak flawlessly :nerd_face:

1 Like

That’s great, thank you :slight_smile: The sentences for French are available here. These were chosen to maximize phonetic coverage while keeping the number of sentences low. @tjiho read poems (~2700 sentences).
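Choosing a small sentence set that still maximizes phonetic coverage is essentially a greedy set-cover problem. A minimal sketch of that idea, purely illustrative and not the script actually used for the corpus (the function and names here are mine):

```python
def greedy_phoneme_cover(sentences, phonemize):
    """Greedily pick sentences until no new phonemes are added.

    `phonemize` maps a sentence to the set of phonemes it contains.
    Illustrative only -- not Larynx's actual selection code.
    """
    covered, chosen = set(), []
    remaining = list(sentences)
    while remaining:
        # Pick the sentence contributing the most unseen phonemes.
        best = max(remaining, key=lambda s: len(phonemize(s) - covered))
        gain = phonemize(best) - covered
        if not gain:
            break  # nothing left adds coverage
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen

# Toy example with letters standing in for phonemes:
phonemes = {"abc": set("abc"), "ab": set("ab"), "cd": set("cd"), "d": set("d")}
picked = greedy_phoneme_cover(phonemes, phonemes.get)
assert picked == ["abc", "cd"]  # two sentences cover all four "phonemes"
```

The greedy heuristic doesn’t guarantee the minimum set, but it gets close in practice and keeps the recording session short.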

I have a website set up for contributions. PM me for a link (it’s in beta) if you’d like to do that. Or I have some local software.

Sure. This is on a Pi 4 with the highest quality setting:

DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Got mels in 1.214732290000029 second(s) (shape=(1, 80, 600))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Running denoiser (strength=0.005)
DEBUG:larynx:Got audio in 25.65960519300006 second(s) (shape=(153600,))
DEBUG:larynx:Real-time factor: 0.26 (audio=6.97 sec, infer=26.88 sec)
DEBUG:larynx:Synthesized 307244 byte(s) in 26.891741037368774 second(s)

and this is with the lowest quality setting:

DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Got mels in 1.1592592529996182 second(s) (shape=(1, 80, 600))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Initializing denoiser
DEBUG:hifi_gan:Running denoiser (strength=0.005)
DEBUG:larynx:Got audio in 3.2229672579996986 second(s) (shape=(153600,))
DEBUG:larynx:Real-time factor: 1.59 (audio=6.97 sec, infer=4.39 sec)
DEBUG:larynx:Synthesized 307244 byte(s) in 4.393189907073975 second(s)

You can see that the vocoder takes the bulk of the time, and that reducing the vocoder quality takes it from 1/4 real-time to about 1.5x.
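For reference, the real-time factor in those logs is just audio duration divided by inference time; a quick check against the numbers above (the helper name is mine, not Larynx’s):

```python
# Real-time factor as reported in the Larynx debug logs:
# rtf = seconds of audio produced / seconds spent inferring.
# Values >= 1.0 mean faster than realtime.
def real_time_factor(audio_sec, infer_sec):
    return audio_sec / infer_sec

# Numbers from the two Pi 4 runs above:
high_quality = real_time_factor(6.97, 26.88)
low_quality = real_time_factor(6.97, 4.39)
assert round(high_quality, 2) == 0.26  # about 1/4 real time
assert round(low_quality, 2) == 1.59   # about 1.6x faster than real time
```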

This is with the default onnxruntime Python wheel and optimizations turned on.

Is there a big difference, in terms of listening, between the x4 realtime posted and the x0.75 lowest vocoder quality?

I made some improvements during my recording session, like adding key bindings to the buttons, so I didn’t have to use my mouse while recording.
When I have some time I’ll commit it and open a pull request.

2 Likes

I can hear a difference with my headphones, but it’s not as noticeable over speakers. It also seems to vary by voice and by the strength of the denoiser.

1 Like