I’m having performance issues when generating audio samples with Larynx and MaryTTS. When I switch back to NanoTTS, performance is very good and sample generation is almost instantaneous. But I would like better quality, so I tried both Larynx and MaryTTS.
When using the MaryTTS web UI, samples generate very quickly, at what feels like real-time or near-real-time speed. However, when going through Rhasspy, generating the same sample takes many seconds longer.
Larynx is even slower. Generating the sample for “what time is it?” on the low quality setting takes about 7 seconds, and I see the CPU spiking throughout that period.
This is all on a Pi 4.
I have two questions:
- Is this expected behaviour for Larynx? I thought I had read somewhere that performance is close to real time on a Pi 4.
- Does anyone have any idea how I might explain or address the performance difference I am seeing between the MaryTTS web UI and using MaryTTS through Rhasspy?
Any pointers would be much appreciated! Thanks!
The latest version of Larynx in 2.5.10 should be faster (after the first sentence is spoken; it still has to load the models on the first request).
How are you running MaryTTS? Rhasspy uses the same API as the web UI, so something else must be going on.
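One way to narrow this down is to time a raw request against the MaryTTS HTTP API directly, bypassing Rhasspy entirely. A minimal sketch, assuming the Docker image's default port 59125 and an installed voice — the host, port, and voice name here are placeholders for your setup:

```python
# Time a raw MaryTTS /process request (the same endpoint the web UI uses).
import time
from urllib.parse import urlencode
from urllib.request import urlopen

def marytts_url(text, host="localhost", port=59125, voice="cmu-slt-hsmm"):
    """Build the same /process request the MaryTTS web UI sends."""
    params = urlencode({
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": "en_US",
        "VOICE": voice,
    })
    return f"http://{host}:{port}/process?{params}"

def time_marytts(text):
    """Fetch the WAV and return (byte count, seconds taken)."""
    start = time.monotonic()
    wav = urlopen(marytts_url(text)).read()
    return len(wav), time.monotonic() - start

# Example (requires a running MaryTTS server):
#   n_bytes, seconds = time_marytts("What time is it?")
```

If this is fast but the same sentence is slow through Rhasspy, the time is being lost somewhere in Rhasspy's plumbing rather than in MaryTTS itself.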
Interesting, thanks for the reply. Good to know the first call will load the model.
I’m running MaryTTS through the Docker image that you maintain. It works really well through the web UI, and is a tad slower through Rhasspy.
Larynx I only tried through Rhasspy itself, but now I am considering installing the Docker version as well to compare performance. I am beginning to think the slowness I am seeing has nothing to do with the TTS system itself, but rather with how TTS is called and how the audio is finally sent to my audio output.
I am using MQTT for audio playback. I have determined that the delay is not caused by the system that handles the MQTT events and plays the audio, but perhaps a delay is incurred between TTS generation and the sending of the audio data over MQTT?
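For reference, this is how I understand the Hermes topic layout from the debug log; a small sketch (the helper names are just mine, for illustration) that could be used to timestamp where the delay occurs:

```python
# Hermes MQTT topics involved in audio playback, as seen in the Rhasspy log:
# the generated WAV is published on hermes/audioServer/<siteId>/playBytes/<requestId>.
import re

def play_bytes_topic(site_id, request_id):
    """Topic on which Rhasspy publishes the generated WAV bytes."""
    return f"hermes/audioServer/{site_id}/playBytes/{request_id}"

PLAY_BYTES_RE = re.compile(r"^hermes/audioServer/([^/]+)/playBytes/([^/]+)$")

def parse_play_bytes(topic):
    """Return (site_id, request_id) if this is an audio-playback topic, else None."""
    m = PLAY_BYTES_RE.match(topic)
    return m.groups() if m else None

# With an MQTT client (e.g. paho-mqtt, not shown here) you could subscribe to
# hermes/tts/say and hermes/audioServer/+/playBytes/# and record
# time.monotonic() on each message; the gap between the TtsSay publish and the
# matching AudioPlayBytes is the TTS generation plus transfer time.
```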
I was really hoping Larynx would be fast enough on a Pi 4, since it sounds so much better than any of the other systems I’ve tried (except Wavenet, but that isn’t a fair comparison of course).
To make my question a bit more concrete, here’s what I’m seeing on a subsequent TTS request (so after the initial model load) using Larynx with the low-quality vocoder and the cmu_jmk voice.
[DEBUG:2021-05-12 09:13:43,976] rhasspyserver_hermes: Handling TtsSayFinished (topic=hermes/tts/sayFinished, id=61f33d88-8a6b-44be-bedd-23a23ac99769)
[DEBUG:2021-05-12 09:13:43,976] rhasspyserver_hermes: Handling TtsSayFinished (topic=hermes/tts/sayFinished, id=8bcb35cb-1161-4c4e-8199-b56e92e1ba83)
[DEBUG:2021-05-12 09:13:41,853] rhasspyserver_hermes: Handling AudioPlayBytes (topic=hermes/audioServer/default/playBytes/9fe892c5-6f56-47e8-bddb-1d0cf9f0e442, id=61f33d88-8a6b-44be-bedd-23a23ac99769)
[DEBUG:2021-05-12 09:13:41,851] rhasspyserver_hermes: Handling AudioPlayBytes (topic=hermes/audioServer/default/playBytes/9fe892c5-6f56-47e8-bddb-1d0cf9f0e442, id=8bcb35cb-1161-4c4e-8199-b56e92e1ba83)
[DEBUG:2021-05-12 09:13:37,877] rhasspyserver_hermes: Publishing 152 bytes(s) to hermes/tts/say
[DEBUG:2021-05-12 09:13:37,877] rhasspyserver_hermes: -> TtsSay(text='What a wonderful day it is.', site_id='default', lang=None, id='9fe892c5-6f56-47e8-bddb-1d0cf9f0e442', session_id='', volume=1.0)
[DEBUG:2021-05-12 09:13:37,875] rhasspyserver_hermes: TTS timeout will be 30 second(s)
So it looks to me like generating “What a wonderful day it is.” takes about 4 seconds, which is about half of real-time speed, so it feels slow. But I’m curious what others are seeing in terms of performance here?
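For what it’s worth, the 4 seconds comes straight from the log timestamps — a quick sketch that computes the delta, using the timestamp format as Rhasspy prints it:

```python
# Compute the TTS latency from two Rhasspy debug-log timestamps.
from datetime import datetime

def log_delta(start, end, fmt="%Y-%m-%d %H:%M:%S,%f"):
    """Seconds elapsed between two Rhasspy log timestamps."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()

# TtsSay published vs. AudioPlayBytes received, from the log above:
print(log_delta("2021-05-12 09:13:37,877", "2021-05-12 09:13:41,853"))  # → 3.976
```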
Another test, now comparing Larynx through Rhasspy with Larynx through its web UI. I used a longer sentence this time to see if that also increases the difference.
Via Rhasspy: 24 seconds
[DEBUG:2021-05-12 10:03:37,763] rhasspyserver_hermes: Handling AudioPlayBytes (topic=hermes/audioServer/default/playBytes/bce170e6-380d-465e-9d50-20c080323c98, id=5d203bfb-bee5-4e11-962a-7321c5816769)
[DEBUG:2021-05-12 10:03:37,760] rhasspyserver_hermes: Handling AudioPlayBytes (topic=hermes/audioServer/default/playBytes/bce170e6-380d-465e-9d50-20c080323c98, id=8bcb35cb-1161-4c4e-8199-b56e92e1ba83)
[DEBUG:2021-05-12 10:03:13,883] rhasspyserver_hermes: Publishing 301 bytes(s) to hermes/tts/say
[DEBUG:2021-05-12 10:03:13,882] rhasspyserver_hermes: -> TtsSay(text='Analytic propositions are true or not true solely by virtue of their meaning, whereas synthetic propositions truth, if any, derives from how their meaning relates to the world.', site_id='default', lang=None, id='bce170e6-380d-465e-9d50-20c080323c98', session_id='', volume=1.0)
Via Larynx web ui: 20 seconds
I cannot explain the 4-second difference between Rhasspy and Larynx. But also, given that this sample plays in 9 seconds, generation runs at less than half of real time.
It feels to me like something is off, but I am not sure how I might address this at all…
Any pointers @synesthesiam?
Ah, I think I figured it out. My Rpi 4 has the 64-bit Raspberry Pi OS install. The library I use for neural network execution (onnx) is optimized for 64-bit; I had to compile the 32-bit version myself without optimizations. I hadn’t realized it would make such a difference even on Rpi 4 hardware.
So I need to update the docs to say that the advertised “faster than realtime” performance needs a 64-bit OS.
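For anyone wanting to check their own install, a quick sketch — assuming `platform.machine()` reports `aarch64` on 64-bit Raspberry Pi OS and `armv7l` on the 32-bit image:

```python
# Check whether this OS is 64-bit ARM, where the optimized onnx runtime
# builds apply. 64-bit Raspberry Pi OS reports aarch64; the 32-bit image
# reports armv7l.
import platform

def is_64bit_arm(machine):
    return machine in ("aarch64", "arm64")

machine = platform.machine()
print(machine, "->", "64-bit ARM" if is_64bit_arm(machine) else "not 64-bit ARM")
```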
@synesthesiam thanks for your response! I switched to a 64-bit OS and that reduced the 24 seconds to 11 seconds. Really nice.
It would be good to also mention this in the getting started tutorial for Raspberry Pi, which explicitly mentions installing a 32-bit OS.
Glad this worked, @bastiaanterhorst! I’ve made a note in the docs that I’ll push up soon.