Fully support the Hermes or Hermod protocol

fastjack · December 17, 2019, 5:52pm

This will allow for a more modular approach and ease the separation of concerns between Rhasspy components.

This should also ease the base/satellite configuration.

The communication layer can be either a built-in WebSocket or HTTP/2 endpoint, or an external MQTT server.

The Hermes/Hermod protocol has some flaws but it should be a good starting point for a complete Rhasspy communication protocol.

Plus, it will allow all Snipsters to move over to Rhasspy easily and make it possible to interact with projects like Alice.

What do you think?

romkabouter · December 17, 2019, 6:47pm

I agree, there is a lot more to the protocol than is used now.

thinker · December 17, 2019, 8:27pm

I agree, too. I’d like to see MQTT (Hermes/Hermod) used for the complete internal and external (Skills) communication.

koan · December 18, 2019, 8:46am

I agree. I have started documenting the current state of Rhasspy’s Hermes implementation on the reference page. This collects information that was already there but scattered around Rhasspy’s documentation. So now you can easily see what has been implemented already and what has not.

romkabouter · December 18, 2019, 12:35pm

Great work, I have made a small comment on the playBytes topic.

synesthesiam · December 18, 2019, 2:37pm

Here’s my plan for this so far:

I’ve been adding messages to the rhasspy-hermes library. This is just a set of classes that, when JSON-ified, will match the Hermes messages. We can extend Hermes by adding extra fields to these classes (or new classes!).
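As a rough illustration of the idea (not the actual rhasspy-hermes code), a message class could be a plain dataclass whose field names double as the Hermes JSON keys. The class name `NluQuery`, its `topic()` and `payload()` helpers, and the extra `confidenceThreshold` field are all assumptions invented for this sketch; `hermes/nlu/query` is the Hermes topic for NLU queries.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class NluQuery:
    """Hypothetical message class; field names are used as the JSON keys."""

    input: str
    siteId: str = "default"
    sessionId: Optional[str] = None
    # Example of extending Hermes with an extra, non-standard field (assumption).
    confidenceThreshold: Optional[float] = None

    @classmethod
    def topic(cls) -> str:
        return "hermes/nlu/query"

    def payload(self) -> str:
        # Drop unset fields so the JSON stays close to a plain Hermes message.
        data = {key: value for key, value in asdict(self).items() if value is not None}
        return json.dumps(data)


query = NluQuery(input="turn on the light", siteId="kitchen")
print(NluQuery.topic(), query.payload())
```

A service would then publish `query.payload()` on `NluQuery.topic()` with whatever MQTT client it uses.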

A library like rhasspy-nlu will build on rhasspy-hermes and become a service like rhasspy-nlu-hermes. These Hermes services should be standalone MQTT services with a command-line interface, and can be installed via source, Debian package, and Docker image.

Here’s the state of things so far:

NLU
- rhasspy-nlu-hermes built on rhasspy-nlu
ASR
- rhasspy-asr-kaldi-hermes built on rhasspy-asr-kaldi and rhasspy-silence

TODO:

ASR
- rhasspy-asr-pocketsphinx and rhasspy-asr-pocketsphinx-hermes
NLU
- Need to handle entity injection
Wake
- rhasspy-wake-*
TTS
- rhasspy-tts-*
Audio Server
- Service for MQTT -> speakers
- Service for microphone -> MQTT (like @koan’s)
Dialogue Manager
- rhasspy-dialogue-hermes service that will manage sessions and coordinate other components

I’m not sure if the Rhasspy web interface should be combined with the Dialogue Manager, or if the two should be separate services.

Thoughts?

koan · December 18, 2019, 2:41pm

I would definitely separate the web interface, which is an interface, and the dialogue manager, which is a core behavioural component.

romkabouter · December 18, 2019, 2:45pm

I think separation is better. I see the web interface more as a settings manager (plus some extras).
The dialogue manager is, like koan said, a core component.

fastjack · December 18, 2019, 2:49pm

I agree with @koan and @romkabouter.
Separation of concerns is instrumental for future improvement and maintenance.

thinker · December 18, 2019, 7:24pm

That’s the way it should be. With the steps @synesthesiam has planned, we come very close to a Snips replacement. But Snips always lacked a management web console like the one Rhasspy has. If the individual components run as separate services, communicate with each other via the MQTT protocol, and there is a separate management web console, it will be very pleasant to work with the “new” Rhasspy.

synesthesiam · December 20, 2019, 10:22pm

OK, I have a basic set of Hermes compatible services:

ASR
- rhasspy-asr-pocketsphinx-hermes (Docker image)
- Uses acoustic_model, dictionary.txt, and language_model.txt from Rhasspy profile
NLU
- rhasspy-nlu-hermes (Docker image)
- Uses intent.json from Rhasspy profile
Dialogue Manager
- rhasspy-dialogue-hermes (Docker image)
- Handles sessions and listens for wake word detected event
TTS
- rhasspy-tts-cli-hermes (Docker image)
- Calls external program for text to speech (optionally plays audio too)
Microphone
- rhasspy-microphone-cli-hermes (Docker image)
- Streams raw audio from external program (e.g., arecord)
Hotword
- rhasspy-wake-porcupine-hermes (Docker image)
- Listens for MQTT audio and does wake word detection

All of these services take --host, --port, and --siteId parameters to configure MQTT. I’d also recommend passing --debug to see what’s going on.
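For readers who want to see what that shared command-line surface might look like, here is a minimal argparse sketch. Only the four flag names come from the post above; the defaults (`localhost`, `1883`, `default`) and the `build_parser` helper are assumptions, not the services' actual code.

```python
import argparse


def build_parser(prog: str) -> argparse.ArgumentParser:
    """Hypothetical parser with the MQTT options every service shares."""
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument("--host", default="localhost", help="MQTT broker host")
    parser.add_argument("--port", type=int, default=1883, help="MQTT broker port")
    parser.add_argument("--siteId", default="default", help="Hermes siteId of this service")
    parser.add_argument("--debug", action="store_true", help="Print debug messages")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser("rhasspy-nlu-hermes").parse_args(
    ["--host", "192.168.1.10", "--siteId", "kitchen", "--debug"]
)
print(args.host, args.port, args.siteId, args.debug)
```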

Next Steps

Going forward, I need help to figure out how we should deploy these services and how to incorporate them into the main Rhasspy interface.

As it stands, I’ve been able to get each of the services above to build with PyInstaller, so they can be deployed as Debian packages, Docker images, or source code (virtual environment). I still need to add builds for ARM, of course.

Questions

- Should Rhasspy become just a call out to supervisord or Docker Compose? We can always bundle mosquitto if users don’t want to install an MQTT broker.
- How do we handle profile files and training? Is this just a separate service? What if different services run on different machines, but need access to specific profile files?
- What can we do to make it easier for people to understand and submit pull requests for individual services (or add new ones)?

koan · December 20, 2019, 10:45pm

Rhasspy’s main documentation and main repository could have links to all repositories of the services. General issues can still be opened on the main repository.

koan · December 20, 2019, 10:51pm

If all services run in their own container and the Rhasspy interface too, starting Rhasspy can come down to running Docker Compose indeed. We can even publish Docker Compose files for satellites, for specific combinations of services, and so on.

fastjack · December 21, 2019, 8:12am

This is awesome!

The rhasspy-hotword service should not listen to MQTT-streamed audio frames, because in a base/satellite configuration this will spam the MQTT broker all the time.

I think there should be a rhasspy-satellite service that handles the audio input AND output and does the wake word detection. This is pretty much what Amazon, Google and Snips are doing, and it is the single requirement for a satellite to work.

Only when the wake word is detected will this service notify the MQTT broker and start streaming audio frames.

This service should also output audio (feedback sounds, TTS-generated WAV files, etc.) from MQTT when asked.

The base services like ASR and NLU will have to use the same profile files as they are pretty closely related.

The rest of the services are not related to each other as long as they use the Hermes protocol so they can be easily separated.

Installing a base or satellite via Docker Compose is perfect.

How can we choose which ASR, NLU and TTS we want for the base, then? Should the base include multiple systems like today (Kaldi, Pocketsphinx, DeepSpeech, etc.)? If they are separated (which sounds better), maybe there could be a simple Docker Compose generator script to select these?
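A generator script along those lines could look roughly like the sketch below. The service names are the ones mentioned in this thread, but using them directly as Docker image names is an assumption, as are the `CHOICES` table and the `base_compose` function; broker configuration and profile volumes are left out entirely.

```python
import json

# Selectable services per role, using service names mentioned in this thread.
# Treating them as Docker image names is an assumption of this sketch.
CHOICES = {
    "asr": ["rhasspy-asr-kaldi-hermes", "rhasspy-asr-pocketsphinx-hermes"],
    "nlu": ["rhasspy-nlu-hermes"],
    "tts": ["rhasspy-tts-cli-hermes"],
}


def base_compose(asr: str, nlu: str, tts: str, site_id: str = "default") -> dict:
    """Build a Compose-style dict for a base: broker, dialogue manager, chosen services."""
    selected = {"asr": asr, "nlu": nlu, "tts": tts}
    for role, name in selected.items():
        if name not in CHOICES[role]:
            raise ValueError(f"unknown {role} service: {name}")

    services = {"mqtt": {"image": "eclipse-mosquitto", "ports": ["1883:1883"]}}
    for name in ["rhasspy-dialogue-hermes", *selected.values()]:
        services[name] = {
            "image": name,
            # Every service gets the shared MQTT options; "mqtt" is the broker's service name.
            "command": ["--host", "mqtt", "--port", "1883", "--siteId", site_id],
            "depends_on": ["mqtt"],
        }
    return {"services": services}


compose = base_compose(
    "rhasspy-asr-kaldi-hermes", "rhasspy-nlu-hermes", "rhasspy-tts-cli-hermes"
)
# JSON is also valid YAML, so this output should be usable as a Compose file.
print(json.dumps(compose, indent=2))
```

A satellite variant would work the same way, with input, output and wake word roles instead of ASR, NLU and TTS.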

What do you think?

romkabouter · December 21, 2019, 8:46am

I think this is not correct if we see the Hermes services as a Snips replacement.
The MQTT hotword service should listen to the streamed audio, because that is what it’s for.

The audio stream itself comes from the microphone service, which is fed by the internal mic or other sources. In Snips, all of this runs against an internal broker, so we should definitely deploy mosquitto with Rhasspy if we want this.
The audio service does indeed spam the broker, but this only affects the network if an external broker is used or if the audio comes from other sources (like my streamer).

The satellite service that Snips used is basically a hotword service plus an audio server. It contains some code so that, when a hotword is detected on the internal MQTT audio stream, the service starts relaying the stream to the base broker.

For Snips, the main component is the DialogueManager. This service is the one putting the different services to work via MQTT messages.
A nice picture of the interactions can be found here: https://snips.gitbook.io/tutorials/t/technical-guides/listening-to-intents-over-mqtt-using-python

This is probably a little outdated, but in essence still correct.

fastjack · December 21, 2019, 9:43am

I hadn’t thought of an internal MQTT broker between the wake word and the audio server…

That’ll work, but it might consume additional resources (CPU and memory) to constantly pack, unpack and push/pull the audio frames to/from this internal broker, and it increases installation complexity… Maybe this could be simpler if these services ran in the same process: take the audio frames from the recording input, pass them to the wake word engine and, once the wake word is detected, push them to the base MQTT broker… Although I really like the idea of having a separate input handler and output handler…

Both have pros and cons I guess…

Since I think modularity should be one of the main objectives of Rhasspy, I’ll rally behind the internal MQTT broker proposal. The resource usage can be addressed later if required (premature optimization…).

So we’ll have a base with:

- the main MQTT broker
- ASR of choice
- NLU of choice
- TTS of choice
- DialogueManager/MainManager

And a satellite with:

- an internal MQTT broker
- Input of choice
- Output of choice
- Wakeword of choice
- SatelliteManager (can handle satellite registration and bridge both MQTT brokers to avoid maintaining many connections)

And somewhere:

- the web UI that connects to the main manager via an HTTP REST API?

Does this look correct?

It seems Rhasspy is heading down the MQTT road, which is pretty cool.

koan · December 21, 2019, 9:50am

This sounds like a pretty good overview of the architecture, @fastjack and @romkabouter!

KiboOst · December 21, 2019, 10:07am

I will leave this to the voice specialists, but I just want to be sure that wake word detection happens on the satellite. No network flooding when no wake word is detected, please. Maybe a ping here and there so the base knows the satellite is still up, or to update some settings for example, but no continuous audio streaming between the two.

What I understand from Snips is that the satellite listens, detects the wake word, and only then streams the audio to the base, so the base can do ASR, then NLU, then intent recognition.

Also, the satellite could have a minimal HTTP API so we can see how it is doing, restart it (or even reboot the Pi) and send TTS speech to it. Or we send TTS speech to the base with a siteId and the base forwards it to the right satellite. That way anything can send speech commands to any device via HTTP without having MQTT itself.

And related to this: https://github.com/synesthesiam/rhasspy/issues/102

We would need /api/devices, even if only on the base, to get all the Rhasspy devices via HTTP: every base and satellite with its siteId, IP and port. Then a single HTTP request tells us the whole setup, so we later know which siteId to send speech commands to (or we send them to the base with the right siteId).
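To make the request a bit more concrete, here is one possible shape for the data behind such an endpoint. Nothing here exists in Rhasspy: the `Device` fields, the `DeviceRegistry` class and the example addresses are assumptions made up for illustration, and the HTTP layer itself is left out.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class Device:
    siteId: str
    role: str  # "base" or "satellite"
    ip: str
    port: int


class DeviceRegistry:
    """Hypothetical in-memory registry a base could serve at /api/devices."""

    def __init__(self) -> None:
        self._devices: Dict[str, Device] = {}

    def register(self, device: Device) -> None:
        # Registering the same siteId again overwrites the previous entry.
        self._devices[device.siteId] = device

    def to_json(self) -> str:
        return json.dumps([asdict(device) for device in self._devices.values()])


registry = DeviceRegistry()
registry.register(Device("base", "base", "192.168.1.10", 12101))
registry.register(Device("kitchen", "satellite", "192.168.1.21", 12101))
print(registry.to_json())
```

An external tool without MQTT could then read this list once and address speech commands by siteId.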

I also think siteId should be in the settings separately from the MQTT settings. Right now I don’t have MQTT, and I don’t need it to get notified of recognized intents in my Jeedom plugin or to send speech commands. This would make it possible, even if the base and satellites talk via MQTT, to control all of this from another external device without MQTT. It would give powerful, lightweight control over the entire setup and let any smart home solution handle a whole Rhasspy setup, with all its satellites, without having MQTT itself.

koan · December 21, 2019, 10:26am

That’s what the ‘internal’ MQTT broker that @romkabouter and @fastjack are talking about is for: the hotword detector and audio service on the satellite run on the same device together with an MQTT broker and a satellite manager. These four services each run in a separate Docker container and are connected to an internal (virtual) network. So no network flooding is happening, because the ‘network’ is contained on the device.

At the same time, these containers are also connected to the ‘real’ network, so when the hotword detector detects a hotword, a rhasspy-satellite service starts relaying the audio messages received on the internal MQTT broker to the external MQTT broker on your home network, and your Rhasspy server starts receiving audio for command handling. When the text is captured, the satellite manager stops relaying the audio to the external MQTT broker.
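The relaying logic described above can be sketched as a small state machine, independent of any MQTT client library. The `SatelliteRelay` class and its method names are hypothetical; the topics follow the Hermes naming for audio frames, hotword detection and captured text. Matching `textCaptured` on `siteId` is an assumption about how a satellite would know the message is meant for it.

```python
import json
from typing import List, Tuple

Message = Tuple[str, bytes]


class SatelliteRelay:
    """Decides which internal-broker messages are forwarded to the base broker."""

    def __init__(self, site_id: str) -> None:
        self.site_id = site_id
        self.audio_topic = f"hermes/audioServer/{site_id}/audioFrame"
        self.relaying = False

    def on_internal(self, topic: str, payload: bytes) -> List[Message]:
        """Handle a message from the internal broker; return messages for the base."""
        if topic.startswith("hermes/hotword/") and topic.endswith("/detected"):
            self.relaying = True
            return [(topic, payload)]  # tell the base that a wake word was heard
        if topic == self.audio_topic and self.relaying:
            return [(topic, payload)]
        return []  # while idle, audio stays on the device

    def on_base(self, topic: str, payload: bytes) -> None:
        """Handle a message from the base broker."""
        if topic == "hermes/asr/textCaptured":
            if json.loads(payload).get("siteId") == self.site_id:
                self.relaying = False


relay = SatelliteRelay("kitchen")
frame = ("hermes/audioServer/kitchen/audioFrame", b"fake-wav-chunk")
idle = relay.on_internal(*frame)
relay.on_internal("hermes/hotword/default/detected", b'{"siteId": "kitchen"}')
active = relay.on_internal(*frame)
relay.on_base("hermes/asr/textCaptured", b'{"text": "turn on the light", "siteId": "kitchen"}')
after = relay.on_internal(*frame)
print(len(idle), len(active), len(after))
```

In a real service, `on_internal` and `on_base` would be called from the message callbacks of two MQTT client connections, one per broker.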