Fully support the Hermes or Hermod protocol

Great work! I have made a small comment on the playBytes topic.

Here’s my plan for this so far:

I’ve been adding messages to the rhasspy-hermes library. This is just a set of classes that, when JSON-ified, will match the Hermes messages. We can extend Hermes by adding extra fields to these classes (or new classes!).
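For example, a hotword-detected message could look something like this. This is just a sketch of the idea — the exact field names and helpers in rhasspy-hermes may differ:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical sketch of a Hermes message class; the fields loosely follow the
# Snips hotword-detected payload, but the real rhasspy-hermes classes may differ.
@dataclass
class HotwordDetected:
    siteId: str = "default"
    modelId: str = "default"

    def topic(self, wakewordId: str = "default") -> str:
        # hermes/hotword/<wakewordId>/detected
        return f"hermes/hotword/{wakewordId}/detected"

    def payload(self) -> str:
        # JSON-ify the dataclass so it matches the Hermes message on the wire
        return json.dumps(asdict(self))

print(HotwordDetected(siteId="livingroom").payload())
# {"siteId": "livingroom", "modelId": "default"}
```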

A library like rhasspy-nlu will build on rhasspy-hermes and become a service like rhasspy-nlu-hermes. These Hermes services should be standalone MQTT services with a command-line interface, and should be installable from source, as a Debian package, or as a Docker image.
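As a rough sketch of what such a standalone service could look like (using the paho-mqtt 1.x API; the NLU handling is just a placeholder, and the exact payload fields may differ from the real services):

```python
import json
import paho.mqtt.client as mqtt

# Minimal sketch of a standalone Hermes service (here: NLU), assuming paho-mqtt 1.x.
# The "recognition" below is a placeholder for whatever rhasspy-nlu would actually do.

def on_connect(client, userdata, flags, rc):
    client.subscribe("hermes/nlu/query")

def on_message(client, userdata, msg):
    query = json.loads(msg.payload)
    # Pretend we recognized an intent; a real service would run the NLU engine here.
    result = {
        "input": query.get("input", ""),
        "intent": {"intentName": "ExampleIntent", "confidenceScore": 1.0},
        "slots": [],
        "id": query.get("id"),
        "siteId": query.get("siteId", "default"),
        "sessionId": query.get("sessionId"),
    }
    client.publish("hermes/nlu/intentParsed", json.dumps(result))

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)
client.loop_forever()
```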

Here’s the state of things so far:

TODO:

  • ASR
    • rhasspy-asr-pocketsphinx and rhasspy-asr-pocketsphinx-hermes
  • NLU
    • Need to handle entity injection
  • Wake
    • rhasspy-wake-*
  • TTS
    • rhasspy-tts-*
  • Audio Server
    • Service for MQTT -> speakers
    • Service for microphone -> MQTT (like @koan’s; see the sketch after this list)
  • Dialogue Manager
    • rhasspy-dialogue-hermes service that will manage sessions and coordinate other components
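For the microphone -> MQTT piece, here is a rough sketch of the idea, assuming PyAudio and paho-mqtt; the siteId, chunk size, and audio format below are arbitrary choices:

```python
import io
import wave
import pyaudio
import paho.mqtt.client as mqtt

# Sketch of a "microphone -> MQTT" audio service: read raw frames from the default
# input device, wrap each chunk in a small WAV container, and publish it to the
# Hermes audio frame topic. siteId and chunk size are arbitrary choices here.
SITE_ID = "default"
TOPIC = f"hermes/audioServer/{SITE_ID}/audioFrame"
RATE, CHANNELS, WIDTH, CHUNK = 16000, 1, 2, 2048

client = mqtt.Client()
client.connect("localhost", 1883)
client.loop_start()

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=CHANNELS,
                    rate=RATE, input=True, frames_per_buffer=CHUNK)

while True:
    frames = stream.read(CHUNK)
    with io.BytesIO() as wav_buffer:
        with wave.open(wav_buffer, "wb") as wav_file:
            wav_file.setnchannels(CHANNELS)
            wav_file.setsampwidth(WIDTH)
            wav_file.setframerate(RATE)
            wav_file.writeframes(frames)
        client.publish(TOPIC, wav_buffer.getvalue())
```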

I’m not sure if the Rhasspy web interface should be combined with the Dialogue Manager, or if the two should be separate services.

Thoughts?

1 Like

I would definitely separate the web interface, which is an interface, and the dialogue manager, which is a core behavioural component.

3 Likes

I think separation is better. I see the web interface more as a settings manager (plus some extras).
The dialogue manager is, like koan said, a core component.

2 Likes

I agree with @koan and @romkabouter.
Separation of concerns is instrumental for future improvement and maintenance.

3 posts were split to a new topic: Streamline Rhasspy development on GitHub

That’s the way it should be. With the planned steps of @synesthesiam we come very close to a Snips replacement. But Snips always lacked a management web console like the one Rhasspy has. If the individual components run as separate services, communicate with each other via the MQTT protocol, and there is a separate management web console, it will be very pleasant to work with the “new” Rhasspy. :heart_eyes:

2 Likes

OK, I have a basic set of Hermes-compatible services:

All of these services take --host, --port, and --siteId parameters to configure MQTT. I’d also recommend passing --debug to see what’s going on.
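For reference, the argument handling in each service is roughly like this (a sketch only; the real services differ in details beyond the flags mentioned above):

```python
import argparse
import logging
import paho.mqtt.client as mqtt

# Sketch of the shared command-line interface: --host/--port/--siteId plus --debug.
parser = argparse.ArgumentParser(prog="rhasspy-example-hermes")
parser.add_argument("--host", default="localhost", help="MQTT broker host")
parser.add_argument("--port", type=int, default=1883, help="MQTT broker port")
parser.add_argument("--siteId", default="default", help="Hermes siteId to answer for")
parser.add_argument("--debug", action="store_true", help="Enable debug logging")
args = parser.parse_args()

logging.basicConfig(level=logging.DEBUG if args.debug else logging.INFO)
logging.debug("Connecting to %s:%s as site %s", args.host, args.port, args.siteId)

client = mqtt.Client()
client.connect(args.host, args.port)
client.loop_forever()
```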

Next Steps

Going forward, I need help to figure out how we should deploy these services and how to incorporate them into the main Rhasspy interface.

As it stands, I’ve been able to get each of the services above to build with PyInstaller, so they can be deployed as Debian packages, Docker images, or source code (virtual environment). I still need to add builds for ARM, of course.

Questions

  • Should Rhasspy become just a call out to supervisord or Docker Compose? We can always bundle mosquitto if users don’t want to install an MQTT broker themselves.
  • How do we handle profile files and training? Is this just a separate service? What if different services run on different machines, but need access to specific profile files?
  • What can we do to make it easier for people to understand and submit pull requests for individual services (or add new ones)?
4 Likes

Rhasspy’s main documentation and main repository could have links to all repositories of the services. General issues can still be opened on the main repository.

1 Like

If all services run in their own containers and the Rhasspy interface does too, starting Rhasspy can indeed come down to running Docker Compose. We can even publish Docker Compose files for satellites, for specific combinations of services, and so on.

1 Like

This is awesome!

The rhasspy-hotword service should not listen to MQTT-streamed audio frames, as in a base/satellite configuration this will spam the MQTT broker all the time.

I think there should be a rhasspy-satellite service that handles the audio input AND output and does the wake word detection. This is pretty much what Amazon, Google, and Snips are doing, and it is the single requirement for a satellite to work.

Only when the wake word is detected will this service notify the MQTT broker and start streaming audio frames.

This service should also output audio (feedback sounds, TTS-generated WAV files, etc.) from MQTT when asked.

The base services like ASR and NLU will have to use the same profile files, as they are pretty closely related.

The rest of the services are not tied to each other as long as they use the Hermes protocol, so they can easily be separated.

Installing a base or satellite via Docker Compose is perfect.

How can we choose which ASR, NLU, and TTS we want for the base then? Should the base include multiple systems like today (Kaldi, Pocketsphinx, DeepSpeech, etc.)? If they are separated (which sounds better), maybe there could be a simple Docker Compose generator script to select them?
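Something like this toy generator could work. The image names and choices below are made-up placeholders, just to show the idea:

```python
import yaml

# Toy sketch of a Docker Compose generator: pick one service per category and
# write a compose file. The image names here are hypothetical placeholders.
CHOICES = {
    "asr": {"kaldi": "rhasspy/rhasspy-asr-kaldi-hermes",
            "pocketsphinx": "rhasspy/rhasspy-asr-pocketsphinx-hermes"},
    "nlu": {"fsticuffs": "rhasspy/rhasspy-nlu-hermes"},
    "tts": {"espeak": "rhasspy/rhasspy-tts-cli-hermes"},
}

def generate(selection: dict) -> str:
    services = {"mqtt": {"image": "eclipse-mosquitto", "ports": ["1883:1883"]}}
    for category, system in selection.items():
        services[category] = {
            "image": CHOICES[category][system],
            "command": ["--host", "mqtt", "--port", "1883"],
            "depends_on": ["mqtt"],
        }
    return yaml.safe_dump({"version": "3", "services": services})

print(generate({"asr": "kaldi", "nlu": "fsticuffs", "tts": "espeak"}))
```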

What do you think?

2 Likes

I think this is not correct if we see the Hermes services as a Snips replacement.
The MQTT hotword service should listen to the streamed audio, because that is what it is for.

The audio stream itself comes from the microphone service, which is fed by the internal mic or other sources. In Snips, all of this runs over an internal broker, so we should definitely deploy mosquitto with Rhasspy if we want this.
The audio service indeed spams the broker, but this only affects the network if an external broker is used or if audio comes from other sources (like my streamer).

The satellite service which Snips used is basically a hotword detector plus an audio server. It contains code so that when a hotword is detected on the internal MQTT audio stream, the service starts relaying the stream to the base broker.

For Snips, the main component is the DialogueManager. This service is the one putting the different services to work via MQTT messages.
A nice picture of the interactions can be found here: https://snips.gitbook.io/tutorials/t/technical-guides/listening-to-intents-over-mqtt-using-python

This is probably a little outdated, but in its basics it is still correct.

3 Likes

I hadn’t thought of an internal MQTT broker between the wake word and the audio server…

That will work, but it might consume additional resources (CPU and memory) to constantly pack, unpack, and push/pull the audio frames to/from this internal broker, and it increases installation complexity… Maybe this could be simpler if these services lived in the same process: take the audio frames from the recording input, pass them to the wake word engine, and push them to the base MQTT broker when a detection occurs… Although I really like the idea of having separate input and output handlers…

Both have pros and cons I guess…

As I think that modularity should be one of the main objectives of Rhasspy, I’ll rally behind the internal MQTT broker proposition. The resources part can be addressed later if required (premature optimization…). :+1:

So we’ll have a base with:

  • the main MQTT broker
  • ASR of choice
  • NLU of choice
  • TTS of choice
  • DialogueManager/MainManager

And a satellite with:

  • an internal MQTT broker
  • Input of choice
  • Output of choice
  • Wakeword of choice
  • SatelliteManager (can handle satellite registration and bridge both MQTT brokers to avoid maintaining many connections)

And somewhere:

  • the web UI that connects to the main manager via an HTTP REST API?

Does this look correct?

It seems Rhasspy is heading down the MQTT road, which is pretty cool :blush:.

5 Likes

This sounds like a pretty good overview of the architecture, @fastjack and @romkabouter!

2 Likes

I will leave this to the vocal specialists, but I would just make sure that wake word detection is on the satellite. No network flooding when no wake word is detected, please. Maybe a ping here and there so the base knows the satellite is still up, or to update some settings for example, but no continual audio streaming between the two.

What I understand from Snips is that the satellite listens, detects the wake word, and only then streams the audio to the base, so the base can do ASR, then NLU, then intent recognition.

Also, the satellite may have a minimal HTTP API so we can see how it is doing, restart it (even reboot the Pi), and send TTS speech to it. Or we send TTS speech to the base with a siteId and the base forwards it to the right satellite. That way anything can send speech commands to any device via HTTP without needing MQTT itself.

And related to this: https://github.com/synesthesiam/rhasspy/issues/102

We would need /api/devices, even if only on the base, to get all the Rhasspy devices via HTTP: all bases and satellites with siteId, IP, and port. That way we can just do an HTTP request to discover the whole setup, and later know which siteId to send speech commands to (or send them to the base with the right siteId).
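As a purely hypothetical sketch of that endpoint (using aiohttp; neither the route nor the device registry exists yet, the fields are just the ones proposed above):

```python
from aiohttp import web

# Hypothetical sketch of the proposed /api/devices endpoint: the base keeps a
# registry of itself and its satellites and returns it over HTTP. The field
# names (siteId, host, port) and the sample values are only illustrative.
DEVICES = [
    {"siteId": "base", "host": "192.168.1.10", "port": 12101},
    {"siteId": "livingroom", "host": "192.168.1.20", "port": 12101},
]

async def devices(request: web.Request) -> web.Response:
    return web.json_response(DEVICES)

app = web.Application()
app.add_routes([web.get("/api/devices", devices)])

if __name__ == "__main__":
    web.run_app(app, port=12101)
```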

I also think siteId should be in the settings, apart from the MQTT settings. Actually, I don’t have MQTT; I don’t need it to get notified in the Jeedom plugin about intent recognition and to send speech commands. Even if the base and satellites talk via MQTT, this would allow controlling all of it from another external device without MQTT. That would allow powerful, lightweight control of the entire setup, and let any smart home solution handle an entire Rhasspy setup with all its satellites without needing MQTT itself.

4 Likes

That’s what the ‘internal’ MQTT broker @romkabouter and @fastjack are talking about is for: the hotword detector and audio service run on the satellite device together with an MQTT broker and a satellite manager, and these four services each run in a separate Docker container connected to an internal (virtual) network. So no network flooding is happening, because the ‘network’ is contained on the device.

At the same time, these containers are also connected to the ‘real’ network, so when the hotword detector detects a hotword, a rhasspy-satellite service starts relaying the audio messages received on the internal MQTT broker to the external MQTT broker on your home network, and your Rhasspy server starts receiving audio for command handling. When the text is captured, the satellite manager stops relaying the audio to the external MQTT broker.
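A rough sketch of that relay logic, assuming paho-mqtt, the standard Hermes topics, and a made-up base.local hostname (the stop condition is simplified to hermes/asr/textCaptured):

```python
import paho.mqtt.client as mqtt

# Rough sketch of the satellite relay described above, assuming paho-mqtt.
# Audio frames from the internal broker are only forwarded to the external
# broker between a hotword detection and an ASR textCaptured message.
SITE_ID = "livingroom"
AUDIO_TOPIC = f"hermes/audioServer/{SITE_ID}/audioFrame"

relaying = False

external = mqtt.Client()
external.connect("base.local", 1883)   # hostname of the base broker (assumed)
external.loop_start()

def on_internal_message(client, userdata, msg):
    global relaying
    if msg.topic.startswith("hermes/hotword/") and msg.topic.endswith("/detected"):
        relaying = True                       # start streaming to the base
        external.publish(msg.topic, msg.payload)
    elif msg.topic == AUDIO_TOPIC and relaying:
        external.publish(msg.topic, msg.payload)

def on_external_message(client, userdata, msg):
    global relaying
    if msg.topic == "hermes/asr/textCaptured":
        relaying = False                      # command captured, stop streaming

external.on_message = on_external_message
external.subscribe("hermes/asr/textCaptured")

internal = mqtt.Client()
internal.on_message = on_internal_message
internal.connect("localhost", 1883)
internal.subscribe([("hermes/hotword/+/detected", 0), (AUDIO_TOPIC, 0)])
internal.loop_forever()
```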

2 Likes

This sounds great, @fastjack, @romkabouter, and @koan!

For the non-Docker case, we could have the hotword service listen to the local (internal) MQTT broker and publish to a remote one. We could even create a special topic on the internal broker meant for raw audio chunks instead of tiny WAV files, to avoid overhead. Another option is to accept raw audio directly from an external program or over UDP with GStreamer.

It seems like the web UI could be split into the part that lets you test and train, and the configuration portion. I’m not sure what configuration becomes when Rhasspy is split into services. Does the web server generate a Docker Compose file or a supervisord.conf file?

1 Like

I hadn’t thought of that, awesome idea!

1 Like

Maybe we can have a common shared space on the host that every Docker service uses to read/write configuration files.
HassIO does that as well, with folders like config, share, ssl and such.
I use proximanager for certificate auto-renewal and the files are placed in /share.
My HassIO instance uses these same SSL files.
That way, you could have one UI controlling the configuration files, while the services each use their own specific config just like now.

As much as I like the MQTT support, I still believe that the existing excellent ways of using Rhasspy should be kept, so that MQTT is always optional.

1 Like

The underlying communication layer can be either websocket events on a single endpoint or an MQTT broker. Both work pretty much the same way. They can be secured using TLS and credentials, so it should be fine.

As long as the websocket events carry the same topics and messages as MQTT, it should be easy.

The base DialogueManager/MainManager can provide a default websocket endpoint for all the other services (ASR, NLU, TTS, satellite) to subscribe to (just like MQTT).

The satellite manager can provide the same for local communication.

An additional MQTT bridge service could also forward websocket events to an MQTT broker and relay MQTT messages from that broker back as websocket events, for Hermes-over-MQTT compatibility if required.
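A very rough one-way sketch of such a bridge (websocket -> MQTT only), assuming the websockets and paho-mqtt libraries and an invented {"topic": ..., "payload": ...} event format, which is not an existing Rhasspy or Hermes convention:

```python
import asyncio
import json
import websockets
import paho.mqtt.client as mqtt

# Very rough sketch of a websocket -> MQTT bridge. Each websocket event is a
# JSON object {"topic": ..., "payload": ...} (an assumption made here) and is
# re-published on the MQTT broker. The reverse direction is omitted for brevity.
mqtt_client = mqtt.Client()
mqtt_client.connect("localhost", 1883)
mqtt_client.loop_start()

async def handle(websocket):
    async for message in websocket:
        event = json.loads(message)
        # Re-publish each websocket event as an MQTT message
        mqtt_client.publish(event["topic"], json.dumps(event["payload"]))

async def main():
    async with websockets.serve(handle, "0.0.0.0", 12183):  # port chosen arbitrarily
        await asyncio.Future()  # run forever

asyncio.run(main())
```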

That way Rhasspy does not depend on another piece of software to work out of the box.

I also like MQTT a lot, but it does not seem like a strictly required dependency.

2 Likes