Preview of New TTS Voices

Out of interest, does Larynx benefit from running on Aarch64 like tensorflow, and especially tensorflow-lite, does on the Pi4 we have available?

I was really impressed with what the Pi400 manages, though it only gives approx 1.3x realtime and needs streaming output.
Out of interest, what does Larynx manage, and does it also have a streaming output?

Larynx uses Onnx, which seems to be pretty similar to Tensorflow Lite. There’s an official wheel for Aarch64, while I had to build my own for 32-bit ARM without optimizations enabled (it crashed with them). So there seems to be an advantage on 64-bit ARM.

I was also impressed with TensorFlowTTS, and I would have preferred to go with them instead of going the PyTorch to Onnx route. But then I read this odd license paragraph:

Overall, Almost models here are licensed under the Apache 2.0 for all countries in the world, except in Viet Nam this framework cannot be used for production in any way without permission from TensorFlowTTS’s Authors. There is an exception, Tacotron-2 can be used with any purpose. If you are Vietnamese and want to use this framework for production, you Must contact us in advance.

I really don’t know what to make of that, but I didn’t like it.

It does a kind of “streaming” output by working on sentences in parallel. So you don’t have to wait for an entire paragraph to finish synthesizing before getting audio output (on the command-line).
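That sentence-level parallelism can be sketched in a few lines of Python. Here `synthesize` is a hypothetical stand-in for the actual model call, not Larynx’s API:

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize(sentence):
    # Hypothetical stand-in for the real TTS call, which would return
    # audio samples; here it just tags the sentence for illustration.
    return f"<audio for: {sentence}>"

def speak(paragraph):
    sentences = [s.strip() + "." for s in paragraph.split(".") if s.strip()]
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() yields results in submission order, so audio for the first
        # sentence can play while later sentences are still synthesizing.
        yield from pool.map(synthesize, sentences)

chunks = list(speak("First sentence. Second sentence. Third sentence."))
```

The key point is that output order is preserved even though the sentences are synthesized concurrently.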

When I start really training Tacotron2 models, though, there’s a chance for real streaming output. Tacotron2 produces mel frames in chunks, which can be forwarded to a vocoder. When I tested this, however, I got scratchy audio artifacts and decided to put it on hold for now. I really need to get back to KW detection :wink:

Yeah, I could not work the Vietnam thing out, so not being Vietnamese I wasn’t so bothered.
Not bothered which framework, either, but I was really curious about performance figures.

When you are throwing around loads of tensors, the wider the databus, the more you can handle simultaneously.
Armv7 -> Aarch64 really is a 2-3x perf improvement with TF, as it has all been optimized for 64-bit, it being predominately math libs that run faster that way.

When I have been playing with that Google-KWS, it gives accuracy results for TF vs TFL, and the speed increase of a TFL quantised model is also really big.
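For anyone wondering where the quantisation speed-up comes from: TFLite’s full-integer mode stores weights as int8 with an affine scale/zero-point, per its quantization spec. A toy sketch of that arithmetic (the scale value here is made up for illustration):

```python
def quantize(values, scale, zero_point):
    """Affine int8 quantization as in the TFLite spec: q = round(x/scale) + zp."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Recover approximate reals: x ~= (q - zp) * scale."""
    return [(q - zero_point) * scale for q in q_values]

weights = [0.5, -1.25, 0.03, 2.0]
scale, zero_point = 2.0 / 127, 0   # illustrative scale covering [-2, 2]
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
# Each weight now fits in one byte instead of four, with a small
# round-off error bounded by half the scale step; integer maths is
# also much cheaper on ARM, hence the speed-up.
```

The accuracy hit is the round-off error above, which is why the quantised model in the logs can still score the same as the float one.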

I0329 19:49:03.096672 139843685021504] tf test accuracy, stream model state external = 100.00% 200 out of 609
I0329 19:49:40.055655 139843685021504] tf test accuracy, stream model state external = 100.00% 400 out of 609
I0329 19:50:17.611867 139843685021504] tf test accuracy, stream model state external = 100.00% 600 out of 609
I0329 19:50:19.069626 139843685021504] TF Final test accuracy of stream model state external = 100.00% (N=609)

INFO: TfLiteFlexDelegate delegate: 2 nodes delegated out of 34 nodes with 1 partitions.

I0329 19:52:51.021229 139843685021504] tflite test accuracy, stream model state external = 100.000000 200 out of 609
I0329 19:52:55.191242 139843685021504] tflite test accuracy, stream model state external = 100.000000 400 out of 609
I0329 19:52:59.372943 139843685021504] tflite test accuracy, stream model state external = 100.000000 600 out of 609
I0329 19:52:59.534713 139843685021504] tflite Final test accuracy, stream model state external = 100.00% (N=609)

It’s not all running on TFL, as 2 nodes delegate out to run on TF, but the speed increases are pretty huge.

ONNX Runtime mobile can execute all standard ONNX models, but what that exactly means I don’t know. Just scratching the surface of tensorflow & tensorflow-lite with flex delegates, all I have gathered is how confusing it is to delegate out, and how constraining the basic functions of the ‘lite’ runtimes can often be.

I presume the benefits of 64-bit would be the same when running ONNX mobile, and that, as above, it probably has similar results.

If you take the processing time of a Pi4 producing an approx 10 sec sentence, it would be interesting to compare, as model vs model / framework vs framework is so confusing that I don’t think there really is any metric you can use.

Tensorflow tends to have faster optimised versions as it is a static graph lib vs dynamic libs like pytorch, so it is far less flexible; that is why pytorch garners so much research, as there is no need to write and compile out a graph due to its dynamic nature.
I think onnx training can be either, as it can be used with TF and pytorch, and it is down to how training and models have been implemented whether static optimisation is applied.

I guess with the Pi4 it doesn’t matter so much, but a whole rake of different frameworks can eat quite a bit of memory, as opposed to several uses of one.

I would have a read, as that also suffers from a ‘crackling’ sound, but it seems that if you overlap slightly and feed a queue it can be done.
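The overlap idea can be sketched as a simple linear crossfade between consecutive chunks; plain Python lists stand in for audio sample arrays here:

```python
def crossfade(chunk_a, chunk_b, overlap):
    """Join two audio chunks, linearly fading across `overlap` samples
    to hide the discontinuity that causes the crackle."""
    head, tail = chunk_a[:-overlap], chunk_a[-overlap:]
    mixed = [
        a * (1 - i / overlap) + b * (i / overlap)
        for i, (a, b) in enumerate(zip(tail, chunk_b[:overlap]))
    ]
    return head + mixed + chunk_b[overlap:]

# Toy example: a chunk of ones fading into a chunk of zeros.
a = [1.0] * 6
b = [0.0] * 6
out = crossfade(a, b, 4)
```

The joined signal is `len(a) + len(b) - overlap` samples long, with a smooth ramp instead of a hard edge at the seam.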

I thought you were ‘too busy for KW’?

Just for comparison: in my Speech-to-Text project, a quantized tflite model gives a prediction speedup of about 2x compared to the non-quantized model, both running on a Raspi 4 using the tflite runtime only.

But as you already mentioned, tflite functions are a bit constraining, so some layers need a bit of workaround to get rid of the flex-delegation.

@DANBER or @rolyan_trauts, do you know if TFLite can be installed without the full Tensorflow package?

I’ve struggled with this on the Onnx side too. PyTorch will happily convert a model to Onnx, only for you to find out that half the operations aren’t supported. I got incredibly lucky that the NVIDIA Tacotron2 code and the original GlowTTS models had fully-supported operations.

Everything else I’ve tried has failed, and I don’t understand the models well enough to swap things out.

Yes, I’m using the following for my container:
RUN pip3 install --no-cache-dir --extra-index-url tflite_runtime

But in this case you can only use fully supported operations. On the other hand, if the net worked with onnx, it might work with tflite too.

I don’t think it really matters, as that is what delegation is for; when I am running TF-Lite with the Google-KWS, 2 nodes are delegated out.
Yeah, you have to install the full TF, which is what I am using, but in my code:

import tensorflow as tf
# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="/home/pi/google-kws/models2/crnn_state/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")

You got lucky, as the fused LSTM RNN layers obviously work for Tacotron2; but imagine, say, developing your own custom C++ GRU and compiling and distributing your own custom TF. That is why special ops and delegation were created.

I don’t really care that I am installing the full TF on a Pi, as I am using the TF.Lite methods, not the full TF calls. With the correctly converted model it results in 200 MB of memory and a near x10 speed increase over the TF model, which I don’t use for inference, though the training end of the test does use it.

With tensorflow it’s RNNs that are the problem: it supports ‘fused LSTMs’, and wider support is part of the roadmap, but common layers like a GRU are not supported. You can just delegate those out, though, and often the higher-order (nearer input) layers can run TF-Lite, so by the time the delegated layers run, the hard work of reducing parameters has already been done.

You don’t have to have just TFLite installed on your machine, as it is part of the full TF code. It’s the other way round: the TFLite-only binary is just the TFLite code minus the rest of TF.

When I ran a 20ms streaming KWS at near 80% load on a Pi Zero, I was running the full TF binary but using the TFL call methods.
The Zero just sucks, as it has no Neon, which TF & TFL are optimised for, and the 2-3x speed-up of 64-bit is also not possible, so you get this huge 80% load. In the context of a Pi3A+ that ends up at just less than 20% of a single core, so the TF load is no problem when you can provide 20x the TF operations; but it is essential to have Neon, and the 2-3x speed of Aarch64 shouldn’t really be ignored.

You got any perf figures on Larynx? Wav duration vs processing time? Pi 4?

Woooow, it’s really amazing!
The French voices are really good. The first one has a typical accent :wink:
When you mentioned "I can train a new voice with about 1.5 hours of quality audio.", do you mean that we need to record a dedicated text with a duration of 1.5 hours?
If so, is it possible to cut it up? :upside_down_face: If it is, I’m in :+1:

Thanks for your hard work. My house is very excited to speak flawlessly :nerd_face:

That’s great, thank you :slight_smile: The sentences for French are available here. These were chosen to maximize phonetic coverage while keeping the number of sentences low. @tjiho read poems (~2700 sentences).

I have a website set up for contributions. PM me for a link (it’s in beta) if you’d like to do that. Or I have some local software.

Sure. This is on a Pi 4 with the highest quality setting:

DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Got mels in 1.214732290000029 second(s) (shape=(1, 80, 600))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Running denoiser (strength=0.005)
DEBUG:larynx:Got audio in 25.65960519300006 second(s) (shape=(153600,))
DEBUG:larynx:Real-time factor: 0.26 (audio=6.97 sec, infer=26.88 sec)
DEBUG:larynx:Synthesized 307244 byte(s) in 26.891741037368774 second(s)

and this is with the lowest quality setting:

DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Got mels in 1.1592592529996182 second(s) (shape=(1, 80, 600))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Initializing denoiser
DEBUG:hifi_gan:Running denoiser (strength=0.005)
DEBUG:larynx:Got audio in 3.2229672579996986 second(s) (shape=(153600,))
DEBUG:larynx:Real-time factor: 1.59 (audio=6.97 sec, infer=4.39 sec)
DEBUG:larynx:Synthesized 307244 byte(s) in 4.393189907073975 second(s)

You can see that the vocoder takes the bulk of the time, and that reducing the vocoder quality takes it from 1/4 real-time to about 1.5x.

This is with the default onnxruntime Python wheel and optimizations turned on.

Is there a big difference, in terms of listening, between the x4 realtime posted and x0.75 at the lowest vocoder quality?

I made some improvements during my recording session, like adding key bindings to the buttons, so I didn’t have to use my mouse while recording.
When I have some time I’ll commit it and open a pull request.


I can hear a difference with my headphones, but it’s not as noticeable over speakers. It also seems to vary by voice and by the strength of the denoiser.

hello, can you give me the link to vox-check to record a new sexy French female voice please

Hi @nicolas_Rodriguez, welcome. I’m not sure I understand what you’re asking. Do you want to contribute a new voice?

yes, I want to contribute a new French female voice


Larynx looks really cool! I’ve got a collection of WAV files and CSVs mapping each file to some text that I’m hoping to use to generate a custom voice. If I wanted to train a model for use with Larynx, any recommendations on where/how to get started?

How do you get the real-time factor under 1?

Best I can get on a Pi 4 is 1.9 (with harvard and vocoder low):

(audio=2.58 sec, infer=4.90 sec)
Full log

(Note that this is still the wrong real-time factor calculation, the actual real-time factor is 1.9.)

[DEBUG:2021-09-24 17:40:57,728] rhasspyserver_hermes: TTS timeout will be 30 second(s)
[DEBUG:2021-09-24 17:40:57,732] rhasspyserver_hermes: -> TtsSay(text='This is a test that is a bit longer 2', site_id='default', lang=None, id='2188d73f-5a17-4b5d-ac46-54fd32ca6429', session_id='', volume=1.0)
[DEBUG:2021-09-24 17:40:57,732] rhasspyserver_hermes: Publishing 162 bytes(s) to hermes/tts/say
[DEBUG:2021-09-24 17:40:57,742] rhasspytts_larynx_hermes: <- TtsSay(text='This is a test that is a bit longer 2', site_id='default', lang=None, id='2188d73f-5a17-4b5d-ac46-54fd32ca6429', session_id='', volume=1.0)
[DEBUG:2021-09-24 17:40:57,743] rhasspytts_larynx_hermes: Synthesizing 'This is a test that is a bit longer 2' (voice=harvard)
[DEBUG:2021-09-24 17:40:58,042] larynx: Words for 'This is a test that is a bit longer 2': ['this', 'is', 'a', 'test', 'that', 'is', 'a', 'bit', 'longer', 'two']
[DEBUG:2021-09-24 17:40:58,042] larynx: Phonemes for 'This is a test that is a bit longer 2': ['#', 'ð', 'ˈ', 'ɪ', 's', '#', 'ˈ', 'ɪ', 'z', '#', 'ə', '#', 't', 'ˈ', 'ɛ', 's', 't', '#', 'ð', 'ˈ', 'æ', 't', '#', 'ˈ', 'ɪ', 'z', '#', 'ə', '#', 'b', 'ˈ', 'ɪ', 't', '#', 'l', 'ˈ', 'ɔ', 'ŋ', 'ɡ', 'ɚ', '#', 't', 'ˈ', 'u', '#', '‖', '‖']
[DEBUG:2021-09-24 17:40:58,044] larynx: Running text to speech model (GlowTextToSpeech)
[DEBUG:2021-09-24 17:40:59,322] larynx: Got mels in 1.2776429069999722 second(s) (shape=(1, 80, 222))
[DEBUG:2021-09-24 17:40:59,330] larynx: Running vocoder model (HiFiGanVocoder)
[DEBUG:2021-09-24 17:41:02,652] hifi_gan: Running denoiser (strength=0.001)
[DEBUG:2021-09-24 17:41:02,949] larynx: Got audio in 3.6182490159990266 second(s) (shape=(56832,))
[DEBUG:2021-09-24 17:41:02,952] larynx: Real-time factor: 0.53 (audio=2.58 sec, infer=4.90 sec)
[DEBUG:2021-09-24 17:41:02,956] rhasspytts_larynx_hermes: Got 113708 byte(s) of WAV data
[DEBUG:2021-09-24 17:41:02,957] rhasspytts_larynx_hermes: -> AudioPlayBytes(113708 byte(s)) to hermes/audioServer/default/playBytes/2188d73f-5a17-4b5d-ac46-54fd32ca6429
[DEBUG:2021-09-24 17:41:02,962] rhasspytts_larynx_hermes: Waiting for play finished (timeout=2.8274149659863945)
[DEBUG:2021-09-24 17:41:02,970] rhasspyserver_hermes: Handling AudioPlayBytes (topic=hermes/audioServer/default/playBytes/2188d73f-5a17-4b5d-ac46-54fd32ca6429, id=4d629c6c-df45-4d42-baf4-2824e1173137)
[WARNING:2021-09-24 17:41:05,794] rhasspytts_larynx_hermes: Did not receive playFinished before timeout
[DEBUG:2021-09-24 17:41:05,796] rhasspytts_larynx_hermes: -> TtsSayFinished(site_id='default', id='2188d73f-5a17-4b5d-ac46-54fd32ca6429', session_id='')
[DEBUG:2021-09-24 17:41:05,797] rhasspytts_larynx_hermes: Publishing 84 bytes(s) to hermes/tts/sayFinished
[DEBUG:2021-09-24 17:41:05,804] rhasspydialogue_hermes: <- TtsSayFinished(site_id='default', id='2188d73f-5a17-4b5d-ac46-54fd32ca6429', session_id='')
[DEBUG:2021-09-24 17:41:05,804] rhasspyserver_hermes: Handling TtsSayFinished (topic=hermes/tts/sayFinished, id=4d629c6c-df45-4d42-baf4-2824e1173137)
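Just to spell out the two calculations, using the audio/infer numbers from the log above (0.53 is what the log prints; 1.9 is the conventional figure):

```python
audio_sec, infer_sec = 2.58, 4.90  # from the log line above

# What this Larynx version logs: audio / infer (smaller looks "better").
logged_rtf = audio_sec / infer_sec

# Conventional real-time factor: infer / audio. Under 1 means synthesis
# is faster than real time; over 1 means the Pi cannot keep up.
actual_rtf = infer_sec / audio_sec
```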

I noticed that the README says “Platform: aarch64”. If my Docker image is "Architecture": "arm", "Variant": "v7", that’s a different one, right? Maybe that’s the reason? Did you run it on Raspberry Pi OS, or did you install a custom OS?

By the way, even with a real-time factor under 1 it’s not actually real-time because we need to wait for the full text to be synthesized, correct?

Yes, this needs a 64-bit Raspberry Pi OS (aarch64) to get a real-time factor under 1. Plus, as you mentioned, the vocoder quality set to low.

A more recent version of Larynx does get the true real-time factor under 1 for text with multiple sentences. As one sentence is being played, multiple sentences down the line are synthesized and queued.
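That play-while-synthesizing-ahead behaviour is a classic bounded producer/consumer queue. A minimal sketch, where `synthesize` and `play` are hypothetical stand-ins for the real TTS and audio-output calls:

```python
import queue
import threading

def stream_speak(sentences, synthesize, play, lookahead=2):
    """Synthesize ahead of playback: a worker thread fills a bounded queue
    while the caller plays finished audio in order."""
    q = queue.Queue(maxsize=lookahead)

    def worker():
        for s in sentences:
            q.put(synthesize(s))  # blocks once `lookahead` chunks are queued
        q.put(None)               # end-of-stream marker

    threading.Thread(target=worker, daemon=True).start()
    while (chunk := q.get()) is not None:
        play(chunk)

# Toy usage with stand-in synth/playback functions.
played = []
stream_speak(["one", "two", "three"],
             synthesize=lambda s: s.upper(),
             play=played.append)
```

The bounded `maxsize` keeps memory flat on a Pi: the worker pauses once it is a couple of sentences ahead of playback.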

Oh well, maybe I didn’t even have to buy a Pi 4 (had a Pi 3)? :grinning_face_with_smiling_eyes: Time for a reinstall.

Well in the context of Rhasspy (/api/text-to-speech) there is only one sentence I think? Or can I somehow split the request into sentences to synthesize them in parallel and still play them in the correct order?

For people finding this, there is also a 64-bit version of Raspberry Pi OS lite, which is better suited for (Rhasspy) servers:
