This is not really related to Rhasspy, but I am interested in using this text-to-speech engine, as the quality is very good. My problem is the rendering speed. For example, this phrase:
curl -G --output - --data-urlencode 'text=Welcome to the world of speech synthesis!' 'http://192.168.0.61:5002/api/tts' | aplay
This takes about 10 seconds to render before playback on the synesthesiam mozilla-tts container, running on an AMD FX-6300 6-core processor with 16 GB of RAM.
I know it's quite old hardware, but has anyone else used this TTS service and obtained near-instant results (a delay of 1-2 seconds or less) on newer-generation processors? Or are there any tweaks I can make to speed up rendering on my hardware?
I’ve been working on adding a fork of Mozilla TTS to Rhasspy, called Larynx. Most of the models I’ve trained are smaller and faster than the LJSpeech one included in that Docker image.
Larynx is available in 2.5.8, but I don’t have an English voice trained just yet. When I do, you might want to give that a try and see if it works better on your system.
Just tried 2.5.9 with Larynx and the en-us kathleen voice set, as I wanted to get away from Google WaveNet and all the other voices are too robotic for my girlfriend.
The processing still takes a lot of time on my server, though.
I'm running Rhasspy on a 4-core / 8-thread Xeon with 16 GB of RAM, and it still takes around 8 seconds for your example text to “render”.
Any way to speed this up more? (Rhasspy is running in Docker.)
I’m working on this now
The kathleen voice is a Tacotron2 model, which is more CPU intensive than the other voices (they use GlowTTS). I’m working to pre-compile the models using PyTorch, so they can actually be run outside of Python entirely.
Additionally, I’m going to add multiple options for vocoders (the post-processing step that turns the TTS model output into WAV audio). If I can get those pre-compiled as well, it should be quite a speed-up.
Thanks for the info. All this deep learning TTS stuff is new to me, but it's great to hear there is room for improvement.
Do you think it will reach the performance required for real-time usage, or is it possible to speed this up with additional hardware (GPU/CUDA)?
A GPU (with CUDA) definitely speeds things up, but accessing it through Docker is more painful.
I think there’s plenty of room for CPU performance improvement. Besides JIT compilation (“pre-compiled”), there’s also quantization – where fewer bits are used to hold/process the model’s weights. I’m new to all of this, so it will take some time to sort out what works well for Rhasspy’s use cases.
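To make the quantization idea concrete, here is a minimal sketch of symmetric 8-bit quantization on a plain list of numbers. It is only an illustration of "fewer bits per weight"; it is not Larynx or PyTorch code, and the function names `quantize_int8` and `dequantize` are hypothetical.

```python
# Illustrative only: symmetric 8-bit quantization of a list of float weights.
# Not Larynx or PyTorch code; the function names are hypothetical.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in q_weights]


weights = [0.5, -1.27, 0.003, 0.9]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# The rounding error per weight is at most half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q_weights, scale, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error. Whether that error is audible in a TTS model is something that has to be tested per model.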
For me it is definitely way faster on two-year-old desktop hardware (Intel i7-8565U, 8 hyperthreads):
time curl -G --output /dev/null --data-urlencode 'text=Welcome to the world of speech synthesis!' 'http://127.0.0.1:5002/api/tts'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  137k  100  137k    0     0   357k      0 --:--:-- --:--:-- --:--:--  357k
noglob curl --proto-default https -G --output /dev/null --data-urlencode 0,00s user 0,00s system 1% cpu 0,392 total
I am using glow-tts, which is faster than the Tacotron model:
tts-server --model_name tts_models/en/ljspeech/glow-tts --vocoder_name vocoder_models/en/ljspeech/multiband-melgan
I am using tts 0.9
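For comparing machines, the timing above can be turned into a rough real-time factor (render time divided by audio length). This is a sketch under assumptions that are not stated in the thread: that the response is 16-bit mono WAV at 22050 Hz (the LJSpeech sample rate) with a 44-byte header, and that curl's "137k" means 137 × 1024 bytes. The function name is made up for illustration.

```python
# Rough real-time factor from download size and wall-clock time.
# Assumptions (not from the thread): 16-bit mono WAV, 22050 Hz, 44-byte header.

def real_time_factor(wav_bytes, seconds_taken,
                     sample_rate=22050, bytes_per_sample=2, header_bytes=44):
    """Return render time divided by audio duration (below 1.0 is faster than real time)."""
    audio_seconds = (wav_bytes - header_bytes) / (sample_rate * bytes_per_sample)
    return seconds_taken / audio_seconds


# Numbers taken from the curl output above: 137k received in 0.392 s.
rtf = real_time_factor(137 * 1024, 0.392)
print(f"real-time factor: {rtf:.2f}")
```

Under those assumptions the factor comes out at roughly 0.12, meaning the sentence renders several times faster than it takes to play back.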
I’m working on a new version of Larynx that uses versions of GlowTTS and Tacotron2 that can be exported to the ONNX Runtime (kind of like TFLite). Not only is this faster, but I think I’ve found a way to stream the audio so it doesn’t have to wait until the end of the sentence to start speaking…
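The post does not say how the streaming works. One plausible approach, shown here purely as an assumption and not as Larynx's actual method, is to run the vocoder on small groups of spectrogram frames and hand each piece of audio to the player as soon as it is ready. In this sketch `stream_audio` and `fake_vocode` are hypothetical stand-ins, and the "frames" are just integers.

```python
# Illustrative sketch of chunked streaming; not Larynx's actual implementation.

def stream_audio(frames, vocode, chunk_size=4):
    """Yield audio for each group of frames instead of waiting for all of them."""
    chunk = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield vocode(chunk)
            chunk = []
    if chunk:
        # Flush the final, possibly shorter, group of frames.
        yield vocode(chunk)


def fake_vocode(chunk):
    # Placeholder for a real vocoder: returns the frames unchanged.
    return list(chunk)


consumed = []

def frame_source(n):
    # Stand-in for a TTS model that produces frames one at a time.
    for i in range(n):
        consumed.append(i)
        yield i


stream = stream_audio(frame_source(10), fake_vocode, chunk_size=4)
first_chunk = next(stream)            # available after only 4 frames
frames_used_for_first = len(consumed)
remaining_chunks = list(stream)
print(first_chunk, frames_used_for_first, remaining_chunks)
```

The point of the generator is that playback of the first chunk can begin while later frames are still being computed, so the perceived delay depends on the chunk size rather than the sentence length.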