Before I dive in - does Rhasspy support my use case?

I’m running around in circles: every time I think I’ve found a solution, it turns out the product is EOL or no longer produced, like the Matrix Voice or ReSpeaker. The more I read, the more my brain shuts down. I hope you can give me a little push in the right direction.

So I’m building my own home voice assistant, but I want to go fully custom and develop my own (locally hosted) server that handles the speech processing. What I need is a wake-word-only system that simply relays all the audio in real time to my server (as a stream or in chunks), including the part that came before the wake word. The server streams back a real-time audio response, which should be played back on my own speaker.
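For the “before the wake word” part, I assume the device will need to keep a small rolling buffer so the last couple of seconds of audio are always available. Here’s a rough sketch of what I mean (the sizes, the `wake_word_active` flag, and the `send` callback are all made up):

```python
from collections import deque

SAMPLE_RATE = 16000          # assumed: 16 kHz, 16-bit mono audio
CHUNK_BYTES = 1024           # assumed: size of one audio chunk from the mic
PREROLL_SECONDS = 2          # how much audio to keep from before the wake word

# Each chunk holds CHUNK_BYTES / 2 samples (16-bit), so:
chunks_per_second = SAMPLE_RATE * 2 // CHUNK_BYTES
preroll = deque(maxlen=PREROLL_SECONDS * chunks_per_second)

def on_audio_chunk(chunk: bytes, wake_word_active: bool, send):
    """Keep a rolling pre-roll; once the wake word fires, flush it, then stream live."""
    if not wake_word_active:
        preroll.append(chunk)        # oldest chunks fall off automatically
    else:
        while preroll:
            send(preroll.popleft())  # ship the pre-wake-word audio first
        send(chunk)                  # then keep streaming in real time
```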

I think I can turn to Rhasspy, but I’m not entirely sure about the part where I completely bypass Rhasspy’s core functionality and only use the wake word mechanism (Porcupine seems great). I did look a bit through the docs, but couldn’t directly find an answer. Here’s what ChatGPT had to say about it:

To have Rhasspy relay the audio stream in real-time to your server, you’d need to customize its pipeline, as Rhasspy primarily handles wake word detection and local voice processing. Here’s a potential method to set it up:

Steps to Relay Audio from Rhasspy to Your Server:

  1. Custom Audio Forwarding Script:
  • You can modify Rhasspy’s pipeline by creating a custom Python or Node.js script that intercepts the audio as soon as the wake word is detected and forwards it to your server via a WebSocket or REST API.
  • Rhasspy has support for custom command scripts when a wake word is detected. You can hook into this to send the recorded audio to your server in real time.
  2. Enable Streaming via MQTT or WebSocket:
  • Rhasspy communicates using MQTT and WebSocket protocols for various events. You could set up Rhasspy to stream or send chunks of audio through one of these protocols (see the MQTT sketch after this list).
  • Have your server subscribe to these events or streams, processing the audio in real time as it’s sent from Rhasspy.
  3. Use External Audio Handlers:
  • Instead of using Rhasspy’s internal speech-to-text processing, you could configure Rhasspy to act as an audio collector that pushes audio data to an external service (your server). You would need to modify its configuration to disable internal speech recognition and instead trigger a streaming action to your server when it detects the wake word.
  4. Rhasspy Remote HTTP Integration:
  • Rhasspy supports remote HTTP endpoints for speech recognition. After detecting the wake word, Rhasspy could send the recorded audio via HTTP POST to your server. Your server would then handle the processing of the audio in real time (a minimal receiving endpoint is sketched after this list).
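For step 2, Rhasspy’s Hermes MQTT protocol publishes microphone audio as WAV chunks on `hermes/audioServer/<siteId>/audioFrame` and wake word detections on `hermes/hotword/<wakewordId>/detected`. A rough sketch of a forwarder with paho-mqtt (assuming Rhasspy is configured to publish audio over MQTT; the site ID, broker address, and `forward_to_my_server` are placeholders):

```python
import json

import paho.mqtt.client as mqtt

SITE_ID = "default"   # assumed Rhasspy site ID
listening = False     # becomes True once the wake word fires

def forward_to_my_server(wav_chunk: bytes):
    # Placeholder: push the chunk to your own server (WebSocket, HTTP, ...).
    pass

def on_message(client, userdata, msg):
    global listening
    if msg.topic.startswith("hermes/hotword/") and msg.topic.endswith("/detected"):
        if json.loads(msg.payload).get("siteId") == SITE_ID:
            listening = True  # real code would reset this after end of utterance
    elif msg.topic == f"hermes/audioServer/{SITE_ID}/audioFrame" and listening:
        forward_to_my_server(msg.payload)  # payload is a small WAV chunk

client = mqtt.Client()  # paho-mqtt 1.x style constructor
client.on_message = on_message
client.connect("localhost", 1883)  # assumed broker address
client.subscribe(f"hermes/audioServer/{SITE_ID}/audioFrame")
client.subscribe("hermes/hotword/+/detected")
client.loop_forever()
```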
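And for step 4, Rhasspy’s “Remote HTTP” speech-to-text option POSTs the recorded WAV to a URL you configure. A minimal receiving endpoint could look like the Flask sketch below; the endpoint path and `my_recognizer` are my own placeholders, and I believe Rhasspy accepts a plain-text transcript back, but check the docs:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/api/speech-to-text", methods=["POST"])  # use whatever URL you configure in Rhasspy
def speech_to_text():
    wav_data = request.data               # Rhasspy POSTs the recorded WAV as the body
    transcript = my_recognizer(wav_data)  # hand off to your own recognizer
    return transcript                     # plain-text transcript back to Rhasspy

def my_recognizer(wav_data: bytes) -> str:
    # Placeholder: plug in your own speech recognition here.
    return ""

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```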

Is this accurate?

If you look at the satellite and only configure the wakeword, ASR, and TTS services, I think you have what you want. You can create your own ASR.

Continuous recognition and output will take some work, as words don’t arrive aligned to audio buffer chunk boundaries.

TTS is done at the endpoint, so that is a challenge too, as the response text might not come in partials but all at once… but that is the request service engine’s problem.

For my ASR, using Google Cloud recognition, I need to figure out how to create a stream for recognition input in Python and feed the audio_chunk buffers in; then I could get partial responses out.
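A minimal sketch of how that could look with the google-cloud-speech Python client, assuming 16 kHz, 16-bit mono PCM chunks arriving on a queue (the queue and the chunk format are assumptions):

```python
import queue

from google.cloud import speech

# Some audio source (e.g. Rhasspy's audio_chunk frames) fills this queue
# with raw 16 kHz, 16-bit mono PCM bytes; None signals end of stream.
audio_queue: "queue.Queue[bytes | None]" = queue.Queue()

def request_generator():
    """Yield streaming requests as audio chunks arrive."""
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            return
        yield speech.StreamingRecognizeRequest(audio_content=chunk)

client = speech.SpeechClient()

streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    ),
    interim_results=True,  # ask for partial hypotheses
)

# Blocks and yields responses as recognition progresses.
for response in client.streaming_recognize(streaming_config, request_generator()):
    for result in response.results:
        transcript = result.alternatives[0].transcript
        print("final:" if result.is_final else "partial:", transcript)
```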

Well, yesterday I discovered the ESP32-S3-Korvo-1, which seems to be current. I guess I’m going the ESP32-S3 route after all. Your reply encourages me that this was the right choice.

I don’t know if I’m too late to answer your question here, but I wanted to add my 2 cents. I am currently using Rhasspy in a very similar way to what you described.

Everything runs locally on a server I set up. I have satellites, which are RPis. All the audio is sent to my local server and processed there. If my internet goes out, everything still works; but if my router loses power, things break, since the satellites and the server can no longer reach each other over the LAN.

For processing the audio, I use Porcupine, which is installed on the server. So when I talk to a satellite, the audio chunks are sent to the server for processing. After the wake word is triggered, the following audio is again processed by the server to see if it matches a sentence I have pre-programmed. If it matches, a Python script decides what to do, i.e., what the speaker should say back. The response message is sent to the satellite over MQTT, and the speaker then says the response. So text-to-speech is handled on the individual satellite; the server does not do that. Not sure if that’s a deal-breaker for you.
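In case it helps, the wake word part of a server-side loop with pvporcupine looks roughly like this (simplified; the access key and the frame source are placeholders, and real code would re-buffer incoming chunks to Porcupine’s frame size):

```python
import struct

import pvporcupine

# Newer pvporcupine releases require a free Picovoice access key.
porcupine = pvporcupine.create(
    access_key="YOUR_ACCESS_KEY",  # placeholder
    keywords=["porcupine"],        # built-in keyword; custom .ppn files work too
)

def frames_from_satellite():
    """Placeholder: yield raw 16-bit PCM frames, porcupine.frame_length samples each."""
    return iter(())  # real code would read audio chunks off MQTT here

for frame in frames_from_satellite():
    # Porcupine expects exactly frame_length samples at porcupine.sample_rate (16 kHz).
    pcm = struct.unpack_from(f"{porcupine.frame_length}h", frame)
    if porcupine.process(pcm) >= 0:
        print("wake word detected; start matching the following audio against sentences")
```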

One of the good things about Rhasspy is that it’s highly customizable, so it can fit pretty much any use case. I think you would be making the right choice going with Rhasspy.