This will allow for a more modular approach and ease the separation of concerns between Rhasspy components.
This should also ease the base/satellite configuration.
The communication layer can be either a built-in WebSocket or HTTP/2 endpoint, or an external MQTT server.
The Hermes/Hermod protocol has some flaws but it should be a good starting point for a complete Rhasspy communication protocol.
Plus, it will let all Snipsters move over to Rhasspy easily and allow interaction with projects like Alice.
What do you think?
I agree, there is a lot more to the protocol than is used now.
I agree, too. I’d like to see MQTT (Hermes/Hermod) for the complete internal and external (Skills) communication.
I agree. I have started documenting the current state of Rhasspy’s Hermes implementation on the reference page. This collects information that was already there but scattered around Rhasspy’s documentation. So now you can easily see what has been implemented already and what has not.
Great work, I have made a small comment on the playBytes topic.
Here’s my plan for this so far:
I’ve been adding messages to the rhasspy-hermes library. This is just a set of classes that, when JSON-ified, will match the Hermes messages. We can extend Hermes by adding extra fields to these classes (or new classes!).
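To make the idea concrete, here is a minimal sketch of such a message class. It is not the actual rhasspy-hermes code: the class names, the helper methods, and the extra `lang` field are hypothetical, and the field names follow my reading of the Hermes NLU query message, so treat them as assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class NluQuery:
    """Hypothetical stand-in for a Hermes NLU query message."""

    input: str                       # text to run intent recognition on
    siteId: str = "default"          # site the request came from
    sessionId: Optional[str] = None  # dialogue session, if any

    @classmethod
    def topic(cls) -> str:
        # MQTT topic this message would be published on (assumed)
        return "hermes/nlu/query"

    def payload(self) -> str:
        # "JSON-ify" the message: field names become the JSON keys
        return json.dumps(asdict(self))

    @classmethod
    def from_payload(cls, payload: str) -> "NluQuery":
        # Rebuild the message from a received JSON payload
        return cls(**json.loads(payload))


@dataclass
class NluQueryWithLang(NluQuery):
    """Extending the protocol: one extra, non-Hermes field (hypothetical)."""

    lang: Optional[str] = None
```

A service would publish `NluQueryWithLang(input="turn on the light", lang="en").payload()` on `NluQuery.topic()`. A receiver that only knows plain Hermes would need to ignore the unknown `lang` key for this to stay compatible.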
A library like rhasspy-nlu will build on rhasspy-hermes and become a service like rhasspy-nlu-hermes. These Hermes services should be standalone MQTT services with a command-line interface, and can be installed via source, Debian package, and Docker image.
Here’s the state of things so far:
- Need to handle entity injection
- Audio Server
  - Service for MQTT -> speakers
  - Service for microphone -> MQTT (like @koan’s)
- Dialogue Manager
  - rhasspy-dialogue-hermes service that will manage sessions and coordinate other components
I’m not sure if the Rhasspy web interface should be combined with the Dialogue Manager, or if the two should be separate services.
I would definitely separate the web interface, which is an interface, and the dialogue manager, which is a core behavioural component.
I think separation is better. I see the web interface more as a settings manager (plus some extras).
The dialogue manager is, like koan said, a core component.
I agree with @koan and @romkabouter
Separation of concerns is instrumental for future improvement and maintenance.
That’s the way it should be. With the steps @synesthesiam has planned, we come very close to a Snips replacement. But Snips always lacked a management web console like the one Rhasspy has. If the individual components run as separate services, communicate with each other via MQTT, and there is a separate management web console, the “new” Rhasspy will be very pleasant to work with.
OK, I have a basic set of Hermes-compatible services:
- Dialogue Manager
All of these services take --siteId parameters to configure MQTT. I’d also recommend passing --debug to see what’s going on.
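For anyone writing a new service, a command-line front end in that style might look like the sketch below. Only `--siteId` and `--debug` come from the post above; the program name, the `--host` and `--port` options, and all defaults are assumptions made for illustration.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    """Command-line options for a hypothetical Hermes service."""
    parser = argparse.ArgumentParser(prog="rhasspy-example-hermes")
    # --host and --port are assumed names for the MQTT broker settings
    parser.add_argument("--host", default="localhost", help="MQTT broker host")
    parser.add_argument("--port", type=int, default=1883, help="MQTT broker port")
    parser.add_argument("--siteId", default="default", help="Hermes siteId of this service")
    parser.add_argument("--debug", action="store_true", help="Log DEBUG messages")
    return parser


def configure(argv=None) -> argparse.Namespace:
    """Parse arguments and set the log level accordingly."""
    args = build_parser().parse_args(argv)
    logging.basicConfig(level=logging.DEBUG if args.debug else logging.INFO)
    return args
```

Calling `configure(["--siteId", "kitchen", "--debug"])` returns a namespace with `args.siteId == "kitchen"` and switches logging to DEBUG.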
Going forward, I need help to figure out how we should deploy these services and how to incorporate them into the main Rhasspy interface.
As it stands, I’ve been able to get each of the services above to build with PyInstaller, so they can be deployed as Debian packages, Docker images, or source code (virtual environment). I still need to add builds for ARM, of course.
- Should Rhasspy become just a call out to supervisord or Docker Compose? We can always bundle mosquitto if users don’t want to install an MQTT broker.
- How do we handle profile files and training? Is this just a separate service? What if different services run on different machines, but need access to specific profile files?
- What can we do to make it easier for people to understand and submit pull requests for individual services (or add new ones)?
Rhasspy’s main documentation and main repository could have links to all repositories of the services. General issues can still be opened on the main repository.
If all services run in their own container and the Rhasspy interface too, starting Rhasspy can come down to running Docker Compose indeed. We can even publish Docker Compose files for satellites, for specific combinations of services, and so on.
This is awesome!
The rhasspy-hotword service should not listen to MQTT-streamed audio frames, because in a base/satellite configuration this will spam the MQTT broker all the time.
I think there should be a rhasspy-satellite service that handles the audio input AND output and does the wake word detection. This is pretty much what Amazon, Google and Snips are doing, and it is the single requirement for a satellite to work.
Only when the wake word is detected will this service notify the MQTT broker and start streaming audio frames.
This service should also play audio (feedback sounds, TTS-generated WAV files, etc.) received over MQTT when asked.
The base services like ASR and NLU will have to use the same profile files as they are pretty closely related.
The rest of the services are not related to each other as long as they use the Hermes protocol so they can be easily separated.
Installing a base or satellite via Docker Compose is perfect.
How can we choose which ASR, NLU and TTS we want for the base then? Should the base include multiple systems like today (kaldi, pocketsphinx, deepspeech, etc)? If they are separated (which sounds better) maybe there can be a simple docker compose generator script to select these?
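Such a generator could be quite small. The sketch below is only an illustration: the image names under `example/` are made up (no such images are confirmed to exist), and it assumes one container per chosen engine plus a mosquitto broker.

```python
from typing import Dict


def compose_yaml(choices: Dict[str, str], broker_image: str = "eclipse-mosquitto") -> str:
    """Return Docker Compose YAML for a broker plus one service per role.

    `choices` maps a role ("asr", "nlu", "tts", ...) to the selected engine.
    Image names are hypothetical placeholders, not published images.
    """
    lines = ["services:", "  mqtt:", f"    image: {broker_image}"]
    for role, engine in sorted(choices.items()):
        lines += [
            f"  {role}:",
            f"    image: example/rhasspy-{role}-{engine}-hermes",
            "    depends_on:",
            "      - mqtt",
        ]
    return "\n".join(lines) + "\n"


def write_compose(path: str, choices: Dict[str, str]) -> None:
    """Write the generated YAML to `path` (e.g. docker-compose.yml)."""
    with open(path, "w", encoding="utf-8") as compose_file:
        compose_file.write(compose_yaml(choices))
```

For example, `compose_yaml({"asr": "kaldi"})` produces a file with an `mqtt` service and an `asr` service that starts after the broker.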
What do you think?
I think this is not correct if we see the Hermes services as a Snips replacement.
The MQTT hotword service should listen to the streamed audio, because that is what it’s for.
The audio stream itself comes from the microphone service, which is fed by the internal mic or other sources. In Snips, all of this runs against an internal broker, so we should definitely deploy mosquitto with Rhasspy if we want this.
The audio service does indeed spam the broker, but this only affects the network if an external broker is used or the audio comes from other sources (like my streamer).
The satellite service that Snips used is basically a hotword service and an audio server. It contains some code so that, when a hotword is detected on the internal MQTT audio stream, the service starts relaying the stream to the base broker.
For Snips, the main component is the DialogueManager. This service is the one putting the different services to work via MQTT messages.
A nice picture of the interactions can be found here: https://snips.gitbook.io/tutorials/t/technical-guides/listening-to-intents-over-mqtt-using-python
This is probably a little outdated, but in essence still correct.
I hadn’t thought of an internal MQTT broker between the wake word and the audio server…
That’ll work, but constantly packing, unpacking, and pushing/pulling audio frames to and from this internal broker might consume additional resources (CPU and memory) and increase installation complexity… Maybe this can be simpler if these services run in the same process: take the audio frames from the recording input, pass them to the wake word engine, and push them to the base MQTT broker once the wake word is detected… Although I really like the idea of having separate input and output handlers…
Both have pros and cons I guess…
Since I think modularity should be one of the main objectives of Rhasspy, I’ll rally to the internal MQTT broker proposal. The resource usage can be addressed later if required (premature optimization…).
So we’ll have a base with:
- the main MQTT broker
- ASR of choice
- NLU of choice
- TTS of choice
And a satellite with:
- an internal MQTT broker
- Input of choice
- Output of choice
- Wakeword of choice
- SatelliteManager (can handle satellite registration and bridge both MQTT brokers to avoid maintaining many connections)
- the web UI that connects to the main manager via an HTTP REST API?
Does this look correct?
It seems Rhasspy is heading down the MQTT road, which is pretty cool.
This sounds like a pretty good overview of the architecture, @fastjack and @romkabouter!
I will leave this to the voice specialists, but I just want to be sure that wake word detection happens on the satellite. No network flooding when no wake word is detected, please. Maybe a ping here and there so the base knows the satellite is still up, or to update some settings for example, but no continuous audio streaming between the two.
What I understand from Snips is that the satellite listens, detects the wake word, and only then streams the audio to the base, so the base can do ASR, then NLU, then intent recognition.
The satellite could also have a minimal HTTP API, so we can check how it is doing, restart it (or even reboot the Pi), and send TTS speech to it. Or we send TTS speech to the base with a siteId, and the base forwards it to the right satellite. That way anything can send speech commands to any device via HTTP without having MQTT itself.
And related to this : https://github.com/synesthesiam/rhasspy/issues/102
We would need /api/devices, even if only on the base, to get all the Rhasspy devices via HTTP: every base and satellite with its siteId, IP, and port. Then a single HTTP request reveals the whole setup, so we know which siteId to send speech commands to (or we send them to the base with the right siteId).
I also think the siteId should be a setting of its own, separate from the MQTT settings. I don’t actually have MQTT, and I don’t need it to get notified of intent recognition in my Jeedom plugin or to send speech commands. This would make it possible, even if the base and satellites talk via MQTT, to control everything from another external device without MQTT. It would allow powerful, lightweight control of the entire setup, and let any smart home solution handle a complete Rhasspy setup with all its satellites without having MQTT itself.
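To illustrate what /api/devices could return, here is a rough sketch of a device registry on the base. The endpoint does not exist yet, so everything here is a proposal: the `Device` fields mirror the siteId, IP, and port mentioned above, while the `role` field and the class and method names are my own assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class Device:
    siteId: str
    ip: str
    port: int
    role: str  # "base" or "satellite" (assumed field)


class DeviceRegistry:
    """In-memory list of known Rhasspy devices, keyed by siteId."""

    def __init__(self) -> None:
        self._devices: Dict[str, Device] = {}

    def register(self, device: Device) -> None:
        # A device registering again simply overwrites its old entry
        self._devices[device.siteId] = device

    def find(self, site_id: str) -> Optional[Device]:
        # Used to decide where a speech command for `site_id` should go
        return self._devices.get(site_id)

    def to_json(self) -> str:
        # Body a hypothetical GET /api/devices handler would return
        return json.dumps([asdict(d) for d in self._devices.values()])
```

A smart home plugin could fetch this list once over HTTP, then address each satellite by its siteId without ever connecting to MQTT.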
That’s what the ‘internal’ MQTT broker that @romkabouter and @fastjack are talking about is for: the hotword detector and audio service run on the satellite device together with an MQTT broker and a satellite manager, and these four services each run in a separate Docker container, connected to an internal (virtual) network. So no network flooding happens, because the ‘network’ is contained on the device.
At the same time, these containers are also connected to the ‘real’ network, so when the hotword detector detects a hotword, a rhasspy-satellite service starts relaying the audio messages received on the internal MQTT broker to the external MQTT broker on your home network, and your Rhasspy server starts receiving audio for command handling. When the text is captured, the satellite manager stops relaying the audio to the external MQTT broker.
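The relaying logic described above can be sketched as a small state machine. This is not existing Rhasspy code: the class and method names are hypothetical, the MQTT client is replaced by a plain `publish_external` callback so the sketch stays self-contained, and the topic names follow my understanding of the Hermes protocol.

```python
import json
from typing import Callable


class SatelliteRelay:
    """Forward audio frames to the external broker only after a hotword."""

    def __init__(self, site_id: str, publish_external: Callable[[str, bytes], None]) -> None:
        self.site_id = site_id
        self.publish_external = publish_external
        self.relaying = False
        self.audio_topic = f"hermes/audioServer/{site_id}/audioFrame"

    def on_internal_message(self, topic: str, payload: bytes) -> None:
        """Handle a message received from the satellite's internal broker."""
        if topic.startswith("hermes/hotword/") and topic.endswith("/detected"):
            # Hotword detected: tell the base and start relaying audio
            self.relaying = True
            self.publish_external(topic, payload)
        elif topic == self.audio_topic and self.relaying:
            self.publish_external(topic, payload)
        # Audio frames arriving while idle never leave the device

    def on_external_message(self, topic: str, payload: bytes) -> None:
        """Handle a message received from the base's broker."""
        if topic == "hermes/asr/textCaptured":
            # Stop relaying once the base has captured text for this site
            if json.loads(payload).get("siteId") == self.site_id:
                self.relaying = False
```

In a real service, the two handlers would be wired to the message callbacks of two MQTT clients, one per broker. Keeping the decision logic free of network code makes it easy to test on its own.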