When using a TTS system other than eSpeak, the playback is super fast

bwong · July 26, 2020, 11:58pm

Hello. A total n00b here. When I use eSpeak as my TTS, the playback is fine. When I choose something else, e.g. PicoTTS, the playback is insanely fast and I cannot make out any words.

I installed the pico2wave utils and such on the RPi, then generated a WAV file via the command line and transferred it to my personal laptop, where it plays correctly in VLC. But when I try to use the TTS within the Rhasspy web UI, it is super fast, so fast that you can’t hear any words, just high-pitched gibberish as if someone was fast-forwarding the audio.

How can I debug this/fix this?

Thank you!

DeadEnded · July 31, 2020, 1:02pm

I think I’m looking for the same solution - using NanoTTS on 2.5.4 and was hoping to find a setting to adjust the playback speed. NanoTTS has it in the CLI but I don’t think that I can set it from the Rhasspy GUI.

If this isn’t something available yet, I might submit a feature request to have it added. It might be difficult though if each TTS piece has a different way of doing it… maybe just a field where you can manually enter the CLI commands…

Cheers!
DeadEnd

fastjack · July 31, 2020, 1:08pm

This looks like an ALSA sample rate issue. Your WAV file is 16 kHz but is probably being played on a 44.1 kHz or 48 kHz output without resampling. Try to prefer plughw: over hw: audio output devices, since the plug layer converts sample rates automatically.
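To check whether a rate mismatch like this is plausible, a minimal sketch along these lines reads the sample rate from the WAV header and works out how much faster the audio would sound if its samples were clocked out at the device rate with no resampling. The function names are made up for illustration, and the 44.1/48 kHz device rates are only the guesses mentioned above, not values read from any particular sound card.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in the WAV file's header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def speedup_factor(file_rate_hz, device_rate_hz):
    """Speed-up heard when samples recorded at file_rate_hz are played
    at device_rate_hz without any sample rate conversion."""
    return device_rate_hz / file_rate_hz
```

For a 16 kHz file, `speedup_factor(16000, 48000)` gives 3.0, i.e. three times too fast and correspondingly higher in pitch, which would match the "fast-forwarded" description.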

bwong · August 1, 2020, 3:05am

Mmhmm, but then how does eSpeak work fine while nothing else works? Is there a way to set the playback rate correctly?

I set the default device as plughw; if I just use hw, no audio is played at all.

I guess the question is how people set up TTS with things other than eSpeak.

edit: I am using a Jabra SPEAK 510 USB if that is of any relevance.

bwong · August 15, 2020, 4:12am

Did you make any progress on solving this on your end? I can’t seem to figure out the recording mechanism that would cause it to be recorded at the wrong frequency. I would think that if it’s an audio system configuration problem, it would be wrong for eSpeak as well. But eSpeak is the only one that works.

DeadEnded · August 15, 2020, 4:20am

No I haven’t had time to dig into it.
Playback for me isn’t as bad as yours… just a little fast, but normal I think for the program.
I just wanted to slow it down a touch (<10%) to make it a little more understandable. It’s not like the fast-forwarding you’re having.

How do you have Rhasspy installed? Mine is the Docker container on a server, not on the Pi… that could be a difference… also, what speaker are you using?

bwong · August 15, 2020, 6:29am

Mmm. I was doing a test using arecord on the default device: it would record at 16 kHz, and playback using aplay would be ‘fast-forwarded’. The only way to remedy that was changing the recording to --rate 48000; then the playback would be correct. What I am failing to understand is what rate the TTS systems use by default, and why eSpeak is the only one seemingly outputting the correct WAV file.
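If the output device really only plays correctly at 48 kHz, one possible workaround is to resample the generated WAV before playing it. The following is only a rough sketch using the Python standard library: it assumes a 16-bit mono input (which is what pico2wave is believed to produce, but check your own file), uses plain linear interpolation rather than a proper resampling filter, and the function names are invented for illustration. It is not how Rhasspy itself handles audio.

```python
import array
import sys
import wave


def resample_linear(samples, src_rate, dst_rate):
    """Linearly interpolate mono samples from src_rate to dst_rate."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(int(round(samples[left] * (1 - frac) + samples[right] * frac)))
    return out


def resample_wav(src_path, dst_path, dst_rate=48000):
    """Write a copy of a 16-bit mono WAV file resampled to dst_rate."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2 or src.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono WAV files")
        src_rate = src.getframerate()
        pcm = array.array("h")
        pcm.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()                          # WAV samples are little-endian
    out = array.array("h", resample_linear(pcm, src_rate, dst_rate))
    if sys.byteorder == "big":
        out.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(dst_rate)
        dst.writeframes(out.tobytes())
```

Calling `resample_wav("tts.wav", "tts_48k.wav")` (both file names are placeholders) would produce a 48 kHz file of the same duration. A plughw: device should make this unnecessary, so treat it as a debugging aid rather than a fix.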

I have Rhasspy installed in a Docker container on an RPi 3. I am using a USB Jabra 510 speaker/microphone combo.

I installed the pico2wave system and generated a WAV file, which plays correctly using aplay, but on Rhasspy playback (the call to TTS) it is super fast-forwarded and unintelligible.

I’ve been searching in vain on this forum for a solution. I wonder if it’s a hardware (mic/speaker) issue, but that seems unlikely, and as someone said, it’s most likely a configuration issue. I don’t know where to start debugging that, or at least how to find more information that would lead me to a potential solution.

DeadEnded · August 15, 2020, 5:12pm

Just some ideas that are going through my mind:

Are you doing this inside the Rhasspy container or on the host?
If in the container, are you recording and playing using CLI for both, or CLI for one, and Rhasspy for the other (I doubt this)?

I just googled the aplay/arecord settings and it seems that somehow your defaults are not the same:

-r, --rate=#<Hz>
Sampling rate in Hertz. The default rate is 8000 Hertz.

So it should be defaulting to 8000… not sure why your system would be different, but I am no expert on the matter. I did see something about setting -f and that it could change the rate… so if you are setting the format, it could affect it.
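One way to take the defaults out of the equation is to pass the format, rate and channel count explicitly when recording a test file. A small sketch that builds such an arecord command line is shown here; `-D`, `-f`, `-r`, `-c` and `-d` are standard arecord options, but the device name `plughw:1,0` is only a placeholder (use whatever `arecord -l` reports for your card), and the helper name is made up.

```python
import shlex


def arecord_command(out_path, device="plughw:1,0", rate_hz=16000, seconds=5):
    """Build an arecord command with explicit format, rate and channels,
    so the result does not depend on arecord's defaults."""
    return [
        "arecord",
        "-D", device,          # placeholder device name, adjust for your card
        "-f", "S16_LE",        # signed 16-bit little-endian samples
        "-r", str(rate_hz),    # sample rate in Hz
        "-c", "1",             # mono
        "-d", str(seconds),    # stop after this many seconds
        out_path,
    ]


# Print the command so it can be pasted into a shell on the host or container.
print(shlex.join(arecord_command("test.wav")))
```

Recording once at 16000 and once at 48000 and playing both back through the same device should show quickly whether the device is ignoring the requested rate.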

Just some thoughts…

DeadEnd

bwong · August 16, 2020, 8:24am

I was running aplay and arecord on the host itself. I haven’t tried running the commands via CLI in the Rhasspy container.

I haven’t dived into the code to understand how the hermes-tts portion plays back the WAV generated by whichever TTS system the user has chosen in the Rhasspy UI. Does it use arecord? If it does, then figuring out how that fits into the system may help me understand the fast-forwarded playback when testing the speak function within the Rhasspy UI.

Thanks for the thoughts. I’ll continue to see if there are any tidbits on setup that I may have missed somewhere along the way. I thought it would be a fairly straightforward plug-and-play mic/speaker. Like I said, it works for eSpeak, and I was surprised it didn’t just automatically work when I switched to a different TTS system.

bwong · April 21, 2021, 6:32am

A bit late, but the fix for me was to reclone and run a clean Docker install, and then everything just worked out of the box. Not sure what happened with the initial install/run.