Google Wavenet is only called for new sentences (a unique combination of text, language and sample rate).
The audio is cached and replayed from the cache if the same text needs to be spoken again.
But if Google Wavenet shuts down AND you want to speak new text, that will not be possible.
It has a free tier.
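As a rough sketch of the idea (not Rhasspy's actual code; the `tts_cache` directory and the `call_wavenet_api` helper are made up for this example), the caching amounts to keying the synthesized audio on the text/language/sample-rate combination and only hitting the cloud API on a cache miss:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("tts_cache")  # made-up cache location for this example
CACHE_DIR.mkdir(exist_ok=True)

def call_wavenet_api(text: str, language: str, samplerate: int) -> bytes:
    """Placeholder for the real Google Cloud TTS request."""
    raise NotImplementedError("stand-in for the actual Wavenet call")

def speak(text: str, language: str, samplerate: int) -> bytes:
    """Return WAV audio for a sentence, calling Wavenet only on a cache miss."""
    # The cache key is the combination of text, language and sample rate.
    key = hashlib.sha1(f"{language}|{samplerate}|{text}".encode("utf-8")).hexdigest()
    cached = CACHE_DIR / f"{key}.wav"

    if cached.exists():
        # Same sentence as before: replay from cache, no API call needed.
        return cached.read_bytes()

    # New sentence: this is the only case that actually needs Google Wavenet,
    # so previously cached sentences keep working even if the API is unreachable.
    audio = call_wavenet_api(text, language, samplerate)
    cached.write_bytes(audio)
    return audio
```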
Check here for how to get started with venv, and for all the other installation methods.
Which language will you be using Rhasspy for? In general, I recommend Kaldi for STT and nanoTTS for TTS on a Raspberry Pi. If you have a Pi 4, you might give Larynx TTS a try (use “Low Quality” for speed).
If you follow the docs to create a virtual environment, a rhasspy.sh script gets generated which automatically activates the environment.
I’d highly recommend sticking with Docker, though. It’s much easier to set up and to keep updated.
By default, Kaldi is set to generate a “Text FST” instead of something called an n-gram model (which is what DeepSpeech uses).
Kaldi’s text FST will only ever recognize sentences from your trained voice commands, whereas the n-gram model will also accept “similar” sentences. Depending on your use case, you may want the extra strictness of the text FST.
Not similar words, but similar sentences. The words are fixed by what’s in sentences.ini, but with the n-gram model, the sentences are matched according to a 3-word context window.
So if you have lots of slots with phrases like “the living room” and “the downstairs bathroom”, it might accept “the downstairs room” even if you never had that in a slot. With Kaldi + text FST, that will not happen.
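Here’s a tiny illustration of why that happens (my own sketch, not Rhasspy’s or Kaldi’s internals; the training sentences and the `ngram_plausible` check are invented for the example). A text FST only matches the exact trained sentences, while an n-gram model with backoff gives any recombination of known words a nonzero score, so “the downstairs room” can slip through:

```python
from collections import Counter

TRAINED = [
    "turn on the living room light",
    "turn on the downstairs bathroom light",
]

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Collect 1-, 2- and 3-gram counts from the trained sentences.
counts = Counter()
for sent in TRAINED:
    words = ["<s>", "<s>"] + sent.split() + ["</s>"]
    for n in (1, 2, 3):
        counts.update(ngrams(words, n))

def fst_accepts(sentence: str) -> bool:
    """Text FST behaviour: only the exact trained sentences match."""
    return sentence in TRAINED

def ngram_plausible(sentence: str) -> bool:
    """Crude stand-in for an n-gram LM with backoff: every word just has to be
    predictable from *some* known context (trigram, bigram, or on its own)."""
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    for i in range(2, len(words)):
        tri = tuple(words[i - 2:i + 1])
        bi = tuple(words[i - 1:i + 1])
        uni = (words[i],)
        if not (counts[tri] or counts[bi] or counts[uni]):
            return False  # a word that was never seen at all
    return True

novel = "turn on the downstairs room light"
print(fst_accepts(novel))      # False: never appeared in sentences.ini
print(ngram_plausible(novel))  # True: every word is known, so backoff
                               # gives the recombination a nonzero score
```

In the real decoder this shows up as probabilities rather than a hard yes/no, but the effect is the same: the text FST can never output a sentence you didn’t train, while the n-gram model can.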