Custom Text to speech

Hi there,

I am trying to adapt a custom TTS script written for Snips: jarvis_says
And I get this error:

[DEBUG:2020-06-23 14:08:01,062] rhasspytts_cli_hermes: Got 554 byte(s) of WAV data
[DEBUG:2020-06-23 14:08:01,063] rhasspytts_cli_hermes: -> AudioPlayBytes(554 byte(s))
[ERROR:2020-06-23 14:08:01,064] rhasspytts_cli_hermes: handle_say
Traceback (most recent call last):
File "rhasspy-tts-cli-hermes/rhasspytts_cli_hermes/__init__.py", line 159, in handle_say
File "rhasspy-tts-cli-hermes/rhasspytts_cli_hermes/utils.py", line 9, in get_wav_duration
File "wave.py", line 510, in open
File "wave.py", line 164, in __init__
File "wave.py", line 131, in initfp
wave.Error: file does not start with RIFF id
[DEBUG:2020-06-23 14:08:01,069] rhasspytts_cli_hermes: -> TtsError(error='file does not start with RIFF id', site_id='default', context=None, session_id=None)

I found a workaround by adding aplay at the end of the script to play the output directly, but I would like to know what the issue could be. The script generates WAV files as standard output using mpg123 -w.

The custom TTS command should output WAV data on standard out. The error message says it got 554 bytes, which seems low for WAV data.

Looking at the script, it prints out some information and ultimately writes the WAV to a separate file. An easy fix would be to (1) change the echo lines to print to standard error instead:

echo "..." >&2

and (2) output the WAV file to standard out at the end:

cat "${outfile}"
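
To illustrate why text on standard out leads to this particular error, here is a minimal Python sketch using only the standard wave module that appears in the traceback. The helper names make_wav_bytes and try_open are made up for this example and are not part of Rhasspy. It builds a short silent WAV in memory, then prepends one line of text the way a stray echo would.

```python
import io
import wave


def make_wav_bytes(num_frames=1600, sample_rate=16000):
    """Build a short silent mono 16-bit WAV in memory (illustrative helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * num_frames)
    return buffer.getvalue()


def try_open(data):
    """Return None if `data` parses as WAV, otherwise the wave.Error message."""
    try:
        with wave.open(io.BytesIO(data), "rb"):
            return None
    except wave.Error as error:
        return str(error)


clean = make_wav_bytes()
# Assumption for illustration: this is what an echo to standard out would add.
polluted = b"Lang: en-US\n" + clean

print(try_open(clean))
print(try_open(polluted))
```

The WAV data itself is identical in both cases; only the bytes in front of it differ, which is why a file that plays fine on disk can still be rejected when it arrives through standard out.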

Thank you, but I still get the error:

[DEBUG:2020-06-25 06:31:32,386] rhasspytts_cli_hermes: Got 69718 byte(s) of WAV data
[DEBUG:2020-06-25 06:31:32,388] rhasspytts_cli_hermes: -> AudioPlayBytes(69718 byte(s))
[ERROR:2020-06-25 06:31:32,391] rhasspytts_cli_hermes: handle_say
Traceback (most recent call last):
File "rhasspy-tts-cli-hermes/rhasspytts_cli_hermes/__init__.py", line 159, in handle_say
File "rhasspy-tts-cli-hermes/rhasspytts_cli_hermes/utils.py", line 9, in get_wav_duration
File "wave.py", line 510, in open
File "wave.py", line 164, in __init__
File "wave.py", line 131, in initfp
wave.Error: file does not start with RIFF id
[DEBUG:2020-06-25 06:31:32,396] rhasspytts_cli_hermes: -> TtsError(error='file does not start with RIFF id', site_id='default', context=None, session_id=None)

Can you save the output file somewhere and then run file your.wav from the command line?
The error says that something is wrong with the WAV headers, or that they are missing.
The file command should give you output like:

your.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz 

if the headers are correct.
If you have sox installed, you could also run play -V3 your.wav for more detailed output, like this:

play -V3 your.wav
play WARN alsa: can't encode 0-bit Unknown or not applicable
play:      SoX v14.4.2
play INFO formats: detected file format type `wav'

Input File     : 'your.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:02.00 = 32000 samples ~ 150 CDDA sectors
File Size      : 64.0k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
Endian Type    : little
Reverse Nibbles: no
Reverse Bits   : no


Output File    : 'default' (alsa)
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:02.00 = 32000 samples ~ 150 CDDA sectors
Sample Encoding: 16-bit Signed Integer PCM
Endian Type    : little
Reverse Nibbles: no
Reverse Bits   : no

play INFO sox: effects chain: input        16000Hz  1 channels
play INFO sox: effects chain: output       16000Hz  1 channels
In:100%  00:00:02.00 [00:00:00.00] Out:32.0k [  -===|===-  ]        Clip:0    
Done.
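
The same header information can also be read from Python. The sketch below is only an illustration: describe_wav and has_riff_header are hypothetical helper names, and the demo file is a silent 2-second clip written to a temporary folder so the example runs on its own.

```python
import os
import tempfile
import wave


def has_riff_header(path):
    """Check the magic bytes: 'RIFF' at offset 0 and 'WAVE' at offset 8."""
    with open(path, "rb") as wav_file:
        header = wav_file.read(12)
    return header[:4] == b"RIFF" and header[8:12] == b"WAVE"


def describe_wav(path):
    """Read basic format information from a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": rate,
            "bits": wav_file.getsampwidth() * 8,
            "duration_seconds": frames / float(rate),
        }


# Demo file (an assumption for this example): 32000 silent samples at 16000 Hz.
demo_path = os.path.join(tempfile.mkdtemp(), "your.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 32000)

header_ok = has_riff_header(demo_path)
info = describe_wav(demo_path)
print(header_ok, info)
```

The duration is simply the number of samples divided by the sample rate, which matches the sox line above: 32000 samples at 16000 Hz is 2 seconds.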

play -V3 BbELUSn33n4YTQHSwMME.wav
play WARN alsa: can't encode 0-bit Unknown or not applicable
play:      SoX v14.4.2
play INFO formats: detected file format type `wav'

Input File     : 'BbELUSn33n4YTQHSwMME.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:01.41 = 31104 samples ~ 105.796 CDDA sectors
File Size      : 62.3k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM
Endian Type    : little
Reverse Nibbles: no
Reverse Bits   : no


Output File    : 'default' (alsa)
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:01.41 = 31104 samples ~ 105.796 CDDA sectors
Sample Encoding: 16-bit Signed Integer PCM
Endian Type    : little
Reverse Nibbles: no
Reverse Bits   : no

play INFO sox: effects chain: input        22050Hz  1 channels
play INFO sox: effects chain: output       22050Hz  1 channels
In:100%  00:00:01.41 [00:00:00.00] Out:31.1k [  ====|====  ]        Clip:0    
Done.

Maybe the sample rate is the issue?

Instead of the cat, you could use sox your.wav -L -e signed-integer -c 1 -r 16000 -b 16 -t wav - which should write the WAV to standard out in the right format. I'm not sure whether you will get a header length warning that way, but it should work.

With sox I get a missing filename error.

sox BbELUSn33n4YTQHSwMME.wav -L -e signed-integer -c 1 -r 16000 -b 16 -t wav
sox:      SoX v14.4.2
sox FAIL sox: missing filename

Let me research. Thank you

You are missing a - right at the end of the sox command. That is what tells it to write to standard out instead of a file.

rhasspy:/home/pi/rhasspy/profiles/en/poly# python

Python 2.7.16 (default, Oct 10 2019, 22:02:15)

>>> import wave
>>> wave.open('/home/pi/rhasspy/profiles/en/poly/oLrrJmfetpSrcDKTcDNw.wav')
<wave.Wave_read instance at 0x76712850>

So I can open the file in Python, which tells me that the audio file is OK.

Are there any other commands in your script that are printing stuff to standard out? This would explain the behavior, since your WAV file is fine but Rhasspy is getting back text + WAV data from the script.
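
One way to check this without reading the whole script is to run the command and look at the first bytes it writes to standard out. The following is a rough sketch, not something Rhasspy provides: check_tts_stdout is a hypothetical helper, and the two commands are small Python stand-ins for a noisy and a clean TTS script. In practice you would pass your own command instead, for example the path to your script plus a test sentence.

```python
import subprocess
import sys


def check_tts_stdout(command):
    """Run `command` and report whether its standard out starts with 'RIFF'.

    Returns (ok, first_bytes) so any stray text in front of the WAV is visible.
    """
    result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output = result.stdout
    return output[:4] == b"RIFF", output[:40]


# Stand-in for a script that prints a message before the WAV data.
noisy_command = [
    sys.executable, "-c",
    "import sys; sys.stdout.buffer.write(b'Output file: x.wav\\nRIFF')",
]
# Stand-in for a script that writes only WAV data to standard out.
clean_command = [
    sys.executable, "-c",
    "import sys; sys.stdout.buffer.write(b'RIFF\\x00\\x00\\x00\\x00WAVE')",
]

noisy_ok, noisy_start = check_tts_stdout(noisy_command)
clean_ok, clean_start = check_tts_stdout(clean_command)
print(noisy_ok, noisy_start)
print(clean_ok, clean_start)
```

If the first bytes are readable text rather than RIFF, that text usually points straight at the command that needs its output redirected with >&2.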

I found the problem. I had to redirect the output of the "awscli" part to standard error with >&2.

Glad you got it solved! :) What should we add to the documentation to help others?

Here is the full script and how to use it. (I didn't develop this script, I just adapted it for use with Rhasspy.)

If you are running Rhasspy in a Docker environment, copy the script to the profile folder and then adjust the path.

#!/bin/bash
# Shell script to replace TTS in Rhasspy with AWS polly
#
# Install and configure aws cli as per https://docs.aws.amazon.com/polly/latest/dg/getting-started-cli.html
# Installed in /home/<user>/.local/bin, configure with aws configure and provide key, secret, etc.
#
# Under Rhasspy Web UI change TTS config to Local Command and provide the path to the script, e.g.: /home/<user>/..rhasspy/profiles/en/rhasspy_says.sh
# Make the script executable (chmod a+x /home/<user>/..rhasspy/profiles/en/rhasspy_says.sh)
# Install mpg123 (apt-get install mpg123) for the mp3->wav conversion
#
# Update the following:
# 1. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
# 2. Path to cache folder, e.g.: /home/<user>/rhasspy/poly (depending on your usage and available space you may need to clean the content of this folder)
# 3. Path to awscli (which aws)
# 4. Your favorite voice to use (https://docs.aws.amazon.com/polly/latest/dg/voicelist.html)
# 5. Your language
#
# Input text and parameters will be used to calculate a hash for caching the mp3 files so only
# "new speech" will call polly, existing mp3s will be transformed in wav files directly

export AWS_ACCESS_KEY_ID="A...."
export AWS_SECRET_ACCESS_KEY="D.....m"
export AWS_DEFAULT_REGION="us-east-1"

# Folder to cache the files - this also contains the .txt file with all generated mp3
cache="/home/pi/rhasspy/profiles/en/poly/"

# Path to aws binary
awscli='/home/pi/.local/bin/aws'

# Voice to use
voice="Emma"

# Lang to use
lang="en-US"
echo 'Lang: ' $lang >&2

###### Should not need to change parameters below this

# format to use
format="mp3"

# Sample rate to use
samplerate="22050"

# passed text string
text="<speak><lang xml:lang=\"$lang\">$1</lang></speak>"
echo 'Input text:' $text >&2

name=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 20 | head -n 1)

# target file to return to rhasspy-tts (wav)
outfile="${cache}/${name}.wav"

echo 'Output file:' $outfile >&2

# check/create cache if needed
# (mkdir -v prints a message when it creates the folder, so keep it off standard out)
mkdir -pv "$cache" >&2

# hash for the string based on params and text
md5string="$text""_""$voice""_""$format""_""$samplerate"
echo 'Using string for hash': $md5string >&2

hash="$(echo -n "$md5string" | md5sum | sed 's/ .*$//')"
echo 'Calculated hash:' $hash  >&2

cachefile="$cache""$hash".mp3
echo 'Cache file:' $cachefile >&2

# do we have this?
if [ -f "$cachefile" ]
then
    echo "$cachefile found." >&2
    # convert the cached mp3 to wav
    mpg123 -w "$outfile" "$cachefile"
    # only the WAV data may go to standard out
    cat "${outfile}"
else
    echo "$cachefile not found, running polly" >&2
    # execute polly to get mp3 - check paths, voice set to $voice
    # (the aws command prints to standard out, so redirect it to standard error)
    "$awscli" polly synthesize-speech --output-format "$format" --voice-id "$voice" \
        --sample-rate "$samplerate" --text-type ssml --text "$text" "$cachefile" >&2
    # update index
    echo "$hash" "$md5string" >> "$cache"index.txt
    # execute conversion to wav
    mpg123 -w "$outfile" "$cachefile"
    # only the WAV data may go to standard out
    cat "${outfile}"
fi
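
For anyone curious how the cache lookup in the script works, here is a small Python sketch of the same idea. It is an illustration only: cache_key is a hypothetical name, and the sample text and the second voice are assumptions for the demo. It joins the SSML text, voice, format and sample rate with underscores and takes the md5 of that string, which is what the echo -n ... | md5sum line does.

```python
import hashlib


def cache_key(text, voice, fmt, sample_rate):
    """Mirror the script's hash: md5 of "<text>_<voice>_<format>_<samplerate>"."""
    md5string = "_".join([text, voice, fmt, str(sample_rate)])
    return hashlib.md5(md5string.encode("utf-8")).hexdigest()


# Same wrapping as the script applies to its first argument.
ssml = '<speak><lang xml:lang="en-US">hello</lang></speak>'

key = cache_key(ssml, "Emma", "mp3", 22050)
other_voice_key = cache_key(ssml, "Joanna", "mp3", 22050)

# The script stores the mp3 as "<cache folder><hash>.mp3".
print(key + ".mp3")
print(other_voice_key + ".mp3")
```

Because the voice, format and sample rate are part of the hashed string, changing any of them produces a new cache file instead of replaying audio generated with the old settings.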