Hi, I have Rhasspy up and running in a Docker container on my Raspberry Pi, together with the Node-RED integration. I can control all my smart devices via speech, very cool!!
The only downside is the text-to-speech quality. I have set up eSpeak for that, which sounds like a robot. You can hardly understand what it says, sorry…
On my phone I use the Google text-to-speech engine com.google.android.tts:nl-nl in Tasker, which reads my notifications out loud. That’s perfect and sounds very natural.
Question 1: Is there a comparably natural Dutch female voice available offline for Rhasspy?
Question 2: I read something about MBROLA voices and found a female Dutch voice there, but the installation is not very clear to me yet. Do I only have to copy the file into my Docker container? No further config?
Question 3: Otherwise I have to use the same as is available on android, the online Google voice engine.
Q2: I suppose that is the Google WaveNet implementation?
I hope some people already have some knowledge about this and can inform me about the current status before I try every possible speech engine.
@romkabouter great to hear about the used cache.
By removing some (unnecessary) variables from my current notifications I don’t need to trigger the online service as much, so it’s then almost offline. Great solution, I will set it up that way!
Even with variables, it is probably a limited set and no random text every time.
So the cache builds, until every combination is in the set.
Changing the voice and/or sample rate triggers a new call to the service (the cache is keyed by an MD5 hash that includes the voice and sample rate), so choose wisely.
I suggest using 16000 or 22050 as output.
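To illustrate why a changed voice or sample rate misses the cache, here is a minimal sketch of such a cache key. This is not the actual Rhasspy code: the function name cache_key, the separator and the field order are my assumptions, and the voice name is only an example value.

```python
import hashlib


def cache_key(text: str, voice: str, sample_rate: int) -> str:
    """Hypothetical cache key: MD5 over sentence, voice and sample rate."""
    raw = f"{text}|{voice}|{sample_rate}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Same sentence, same settings: same key, so the cached WAV can be reused.
first = cache_key("De lamp is aan", "nl-NL-Wavenet-A", 22050)
again = cache_key("De lamp is aan", "nl-NL-Wavenet-A", 22050)

# Same sentence, different sample rate: different key, so a new online call.
other_rate = cache_key("De lamp is aan", "nl-NL-Wavenet-A", 16000)

print(first == again)       # keys match
print(first == other_rate)  # keys differ
```

So every sentence has to be fetched once per voice and sample rate combination before it can be served from the cache.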
I use temperature values with 2 decimals in my notifications, so it takes a while before all those unique combinations are indexed.
So I will change those to whole integer values.
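A rough sketch of how much this shrinks the set of sentences that has to be cached. The sentence wording and the 18 to 22 degree range are made-up examples, not taken from my real notifications.

```python
def notification(temp_c: float, decimals: int) -> str:
    """Build the spoken sentence; the wording is a made-up example."""
    return f"The temperature is {temp_c:.{decimals}f} degrees"


# Assumed sensor range: 18.00 .. 22.00 in steps of 0.01
readings = [i / 100 for i in range(1800, 2201)]

two_decimals = {notification(t, 2) for t in readings}
whole_degrees = {notification(t, 0) for t in readings}

print(len(two_decimals), "unique sentences with 2 decimals")
print(len(whole_degrees), "unique sentences with whole degrees")
```

Every unique sentence is one call to the online service, so fewer unique sentences means the cache is complete much sooner.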
I’m a bit further.
I enabled the Google service, created the JSON file and placed the generated file in my local profile location.
I also had to run MaryTTS.
I defined the right voice, matching my language, from https://cloud.google.com/text-to-speech/docs/voices (can I also use the voice type Standard?)
But now this, what goes wrong?
This is the error I got.
Traceback (most recent call last):
  File "/usr/lib/rhasspy/rhasspy-tts-wavenet-hermes/rhasspytts_wavenet_hermes/__init__.py", line 125, in handle_say
    "audio_config": audio_config,
  File "/usr/lib/rhasspy/.venv/lib/python3.7/site-packages/google/cloud/texttospeech_v1/services/text_to_speech/client.py", line 353, in synthesize_speech
    response = rpc(request, retry=retry, timeout=timeout, metadata=metadata)
  File "/usr/lib/rhasspy/.venv/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 145, in __call__
    return wrapped_func(*args, **kwargs)
  File "/usr/lib/rhasspy/.venv/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.InvalidArgument: 400 Request contains an invalid argument.
[DEBUG:2021-01-25 00:04:32,945] rhasspytts_wavenet_hermes: -> TtsError(error='400 Request contains an invalid argument.', site_id='default', context='9ca50550-71d2-4c27-9e20-473677fc00841', session_id='')
Must the site_id be a number, or should the session_id not be empty?
Before, the pulldown with voices was empty. Now it’s filled (maybe it was some setting in Google WaveNet). I selected the correct voice, set the sample rate, and now it works! Great!
What difference does the sample rate make? I don’t hear it. Is it the quality (and WAV file size)?
Yes, with a higher sample rate the file size is bigger and the quality higher.
But, like you already found, 16000 or maybe 22050 is good enough for voice and there will not be a lot of quality gain with 44100.
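To put numbers on the size difference, here is a small sketch that assumes uncompressed 16-bit mono PCM in a WAV container (that format is my assumption, check what your setup actually produces). The helper names pcm_bytes and silent_wav are made up for the example.

```python
import io
import wave


def pcm_bytes(sample_rate: int, seconds: float,
              sample_width: int = 2, channels: int = 1) -> int:
    """Raw audio size: samples per second * bytes per sample * channels."""
    return int(sample_rate * seconds) * sample_width * channels


def silent_wav(sample_rate: int, seconds: float) -> bytes:
    """Write a silent 16-bit mono WAV into memory and return its bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes = 16 bit
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buf.getvalue()


# Compare a 3 second sentence at the three sample rates mentioned above.
for rate in (16000, 22050, 44100):
    print(rate, "Hz:", pcm_bytes(rate, 3.0), "bytes of audio,",
          len(silent_wav(rate, 3.0)), "bytes as WAV file")
```

The size grows linearly with the sample rate, so 44100 costs roughly 2.75 times the storage of 16000 for the same sentence.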