Base / Satellite configuration

Dunno if it’s possible yet, I didn’t see anything in the documentation.

Would it be possible to have:

  • A Base device, being the main Rhasspy with all settings, intents, slots and such.
  • A Satellite device, ideally running on a Pi Zero, with a minimal installation (just the wake word) that could stream the audio to the base, so the base gets the origin site_id and does the intent recognition and handling? Then we could handle answering on the origin site_id.

:hugs:

1 Like

I wrote Hermes Audio Server exactly for this purpose: I’m using it as a satellite for Rhasspy. It only listens to the site ID you configured for its audio output, and the audio input is published on the MQTT topic with its site ID. So as soon as Rhasspy supports multiple independently working site IDs, you can have multiple satellites this way.

Of course this requires you to have an MQTT server running, and it doesn’t include a wake word service. I already noticed you don’t like MQTT, but I don’t mind it: my whole home automation system uses MQTT and it’s very flexible.

You could also use this approach:

You then install Rhasspy on both the ‘base’ and the ‘satellite’, but only use the wake word and audio services on the satellite. Note that Rhasspy currently doesn’t support the Raspberry Pi Zero.

2 Likes

It’s not that I don’t like MQTT, it is really powerful and works great. I’m just not fond of multiplying the services running for this and that :wink:

A base/sat configuration would work nicely with MQTT I think.

I will wait for further Rhasspy development regarding this setup. Still a long way to go before replacing Snips in production, but it seems to be on the right track :kissing:

Is there something in the queue regarding an easy base/satellite setup then?

I have asked about an /api/devices endpoint here: https://github.com/synesthesiam/rhasspy/issues/102

So we can ask the base to return its dependent satellites for integration in plugins, like the beta Jeedom plugin. Currently it doesn’t need any dependencies like MQTT and works great, so having the dependent satellites with their IP, site_id, etc. would allow sending speech commands back to the right device and keep the plugin extremely lightweight and reliable.
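
For illustration, roughly how a plugin could consume such an endpoint. This is only a sketch: the endpoint, hostname and response fields are assumptions taken from the issue above, nothing of this exists yet (12101 is Rhasspy’s default web port).

```python
import requests

# Hypothetical /api/devices endpoint as proposed in issue #102;
# the response shape is an assumption, e.g.:
# [{"siteId": "salle", "ip": "192.168.1.42", "services": ["wakeword", "audio"]}]
resp = requests.get("http://rhasspy-base:12101/api/devices")
resp.raise_for_status()
for satellite in resp.json():
    print(satellite["siteId"], satellite["ip"])
```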

@KiboOst, this is EXACTLY what I’m after as well. I know multi-sat is in the works, and I’ve been giving my “ME TOO” to all of those conversations over on the HA forums, but you very simply outlined exactly how this should work IMO. Thanks!

The separation of Rhasspy into services (ASR, NLU, Dialogue, TTS, Audio IN, Audio OUT, wakeword) seems to be under way.

The use of the Hermes protocol over MQTT between all these services as well.

What remains is the definition of what the “base” and “satellite” services will do apart from providing an embedded MQTT broker like hbmqtt.

Base

  • Should be responsible for Rhasspy’s “brain” and handle things that cannot be easily distributed, like training, intents/slots management, intent unit tests, etc.
  • Can handle satellite registration and state
  • Can handle configuration deployment to base services and satellites.

Satellite

  • Should be responsible for interaction with the outside world at a specific site and avoid spamming the base with audio frames unless it has been awakened by a wake word.

Thoughts?

2 Likes

Agree with that.
Dunno if the satellite should have a limited interface to set the wake word and TTS services and settings, tell it which one is the master, its siteId, etc.

Also, should we send text to /api/text-to-speech on the satellite, or only use /api/text-to-speech on the master with a ?siteId= parameter so the master can forward it to the corresponding satellite?
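
To make the second option concrete, a sketch of what the master call could look like, assuming the ?siteId= parameter gets implemented (host, port and parameter are assumptions, not an existing API):

```python
import requests

# Hypothetical: one call to the master, which forwards the speech
# to the satellite registered under siteId "salle".
requests.post(
    "http://rhasspy-master:12101/api/text-to-speech",
    params={"siteId": "salle"},
    data="Bonjour !".encode("utf-8"),
)
```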

Also, the master will need an /api/devices endpoint that returns the list of its satellites with their siteId, so we have an external way to discover the Rhasspy mesh.

Regarding MQTT, I guess we could start from the Snips MQTT topics and implement the same ones under their Rhasspy names?
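
For reference, a few of the Snips Hermes topics that could be carried over more or less as-is (quoted from memory of the Snips docs, so double-check the exact names):

```python
# Snips Hermes topics (siteId/wakewordId/intentName are placeholders):
TOPICS = [
    "hermes/audioServer/{siteId}/audioFrame",  # raw audio frames from a site
    "hermes/hotword/{wakewordId}/detected",    # wake word detected at a site
    "hermes/asr/textCaptured",                 # transcription result
    "hermes/nlu/intentParsed",                 # intent recognition result
    "hermes/intent/{intentName}",              # final intent, per intent name
    "hermes/tts/say",                          # request speech output at a site
]
```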

Oh, don’t forget built-in slots :hugs:

As interactions between base and satellite use the Hermes protocol over MQTT, I guess the best approach would be to use the base as the entry point to the Rhasspy network (directly via the base MQTT broker, or via the HTTP<->MQTT rhasspy-server-hermes service @synesthesiam created).

So, I wonder whether it is right to think of this as a “base” and “satellite”. I know that I have used these terms before, but in a future universe, I wonder if it makes sense to think of things in these terms. Maybe it is better to think of Rhasspy as a collection of “services” that can be mixed and matched, such as:

  • Admin interface
    • Hosts an HTTP web user interface
  • API gateway
    • HTTP API
    • Webhooks
    • Websocket API
  • Wakeup
    • Wake word detection
    • Other wakeup mechanisms
  • VAD
  • STT
  • NLU
  • TTS
  • Audio I/O
    • Audio input
    • Audio output
  • Dialog manager
  • Configuration service
    • Service configuration
  • Display cards (visual displays)

No “base/hub/master” or “satellite” per se. Just a collection of services that map onto one or more devices. How these services map to devices is what is interesting; it can differ between device types and evolve over time. For example, one or more services might map onto a stand-alone device (such as what @romkabouter is working on with the Matrix ESP32 Voice: audio I/O and wakeup, and maybe eventually VAD). Other services might best map onto a central host (e.g. admin and other core processing services). Alternative implementations of services requiring different compute capabilities might also be interesting, trading functionality against host resource requirements. @synesthesiam has expressed his concern about performance on a Raspberry Pi, but one could envision a future where some services have alternative implementations that require more compute power than a Pi has but provide more functionality; it’s up to the end user to choose what works for them. In fact, it seems as if he is already moving in the direction of alternative implementations based upon the underlying technology to be used.

We are certainly faced with a decision as to whether the communication “fabric” that these services rely upon should be MQTT or something else (not sure what that would be, as enterprise-level frameworks seem like overkill for this purpose). It seems plausible that we can hide this underlying communication mechanism behind an API gateway service for those that don’t want to be exposed to (e.g.) MQTT or whatever underlying communication mechanism is chosen.

It seems that this shouldn’t be a concern of a so-called satellite or any device hosting (one or more) services. It’s the services that define the interface, not a particular device.
There doesn’t need to be a “master”; in fact we should avoid that. A device might host one or more services, so defining a “satellite” interface assumes a specific mapping of services to a device type. Clients shouldn’t have to know about network topology, nor should devices be limited or fixed in what services they might implement. A service responds to commands and emits “events” which other services can consume. The service in question shouldn’t care about the clients that issue commands or the “downstream” services that consume events. The client shouldn’t care about which device implements a named service. In fact, it seems that the mapping of services to a particular device isn’t constant, the example being @romkabouter’s work on the Matrix Voice (he keeps adding more “services” that the “device” implements). We shouldn’t constrain ourselves to a particular service/device mapping.

Rather than targeting a specific “device” (satellite), wouldn’t it be better to target a named service, e.g. wake up a particular symbolic ID, via an API gateway or directly via an MQTT topic? The client making the request then does not have to know the underlying network topology or device capability (i.e. a device URL or what services a particular device at a particular URL might implement) or what the actual communication mechanism is. Related services can be logically “grouped” together under a symbolic ID such as a site ID, independent of whether individual services are deployed onto a single device or multiple devices. It’s up to the underlying communication fabric (e.g. MQTT) to facilitate this. Just make a request to the gateway that targets some symbolic ID, or publish a payload to the appropriate MQTT topic. The service receiving the request doesn’t have to care about the client making the request, or about what other services might be interested in the “events” that result from that request.
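
A minimal sketch of what that looks like from the client side, using paho-mqtt (the broker hostname is an assumption; topic and payload follow the Hermes tts/say convention):

```python
import json
import paho.mqtt.publish as publish

# Target the symbolic site ID "salle": whatever service handles TTS for that
# site picks this up; the client never knows which host or device that is.
publish.single(
    "hermes/tts/say",
    payload=json.dumps({"text": "Hello", "siteId": "salle"}),
    hostname="mqtt-broker.local",
)
```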

4 Likes

@banderson This would be awesome! I agree that the services should only handle their task without knowledge of the others.

@Harvester mentioned centralized logging. If services could publish their logs to MQTT, a log service could help make sense of what is happening in the entire service network.

A few questions though:
  • How would the installation of all these services work?
  • How would the local/global MQTT brokers be handled to avoid spamming the network with audio frames?
  • Should we deploy an MQTT broker on each device ourselves?
  • Should services respond to a “ping” topic with their state so they can register and be manageable/monitored by a UI or API?

:blush:

Seems MQTT is the right choice of underlying framework for discussions between all Rhasspy devices.

Anyway, each device must know who the master is, and I don’t know if we can install the actual Rhasspy on a Pi Zero. So it may be hard not to have a master version and a satellite version of Rhasspy.

Maybe we can install Rhasspy on another device, set it as a satellite, then on the master interface have a new tab for master/satellite settings and tell the master that this IP is a satellite; the master would then set it up as such, using MQTT settings topics to set the wake word services/files/settings on the satellite. But it seems hard anyway not to differentiate a master from a satellite. We could also have two master/satellite setups in a house with different wake words, for example, to handle different sorts of things.

Snips handled this as a satellite setup, editing the toml file on both master and satellite so they know each other and talk over MQTT. If we don’t want to flood the network, each device must know which state it is in and who its master / its satellites are, no?

Hmm, maybe. I wonder if there might be better solutions for this problem, such as syslog on Linux systems. Not sure how syslog would work on an embedded device though, might be worthy of some research…does anyone know a good way to accomplish this? I’ll add a “log service” to the list of services in the notes I am keeping. Of course, a “log service” doesn’t necessarily have to be implemented using MQTT.

I assume you are asking how an end user would “install” services onto target devices? Not necessarily what the service “package” would look like (Docker, dpkg, dmg, etc.)? Would this possibly be part of the administrative service and web interface, or maybe an installation service? The end user might then specify a device (and type), select an appropriate service package for that device type, install it, and finally configure it? Hmm.

I’m thinking that talking about a “local” MQTT broker is really an implementation detail of how a logical collection of related services that might reside on a single device (or highly coupled devices) might choose to interact over a private API and not really in the scope of the overall architecture. It seems like these services still need to conform to the public API of each service, regardless of their implementation details. Talking about a “local” MQTT broker might be confusing the issue.

I’m also not sure I completely understand the details yet of when and how audio frames need to be visible between services. It does seem like an embedded device implementation of wakeup and audio recording for consumption by another service such as STT might not need to “share” audio frames prior to the publication of audio that is suitable for STT? Is event publication sufficient over a public API (via MQTT topic)? Does wakeup need to publish audio frames via an API at all? What is the use case for that? Is it possible that we can avoid “publishing” audio frames via MQTT by (e.g.) having the audio producer service simply publish a (stream) URL to an MQTT topic where it is then the responsibility of any client interested in that audio to connect to the URL and consume the audio stream? Optimizing an implementation (e.g. on a single device) might then just be sharing audio frames over a private, on-device mechanism without the need to push the audio out globally via MQTT. The service could still offer other clients an audio stream URL, but that doesn’t mean it has to be used if all the services that might require that audio stream are co-resident on a device. This still allows for flexibility if some future client was for some reason interested in that audio.
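
To make the stream URL idea concrete, a sketch of what the producer could publish (topic name, payload shape and hostnames are assumptions, not an existing Rhasspy API):

```python
import json
import paho.mqtt.publish as publish

# The audio producer only announces where its stream can be pulled from;
# interested services connect to the URL, co-resident ones never have to.
publish.single(
    "rhasspy/audioServer/salle/streamAvailable",  # hypothetical topic
    payload=json.dumps({
        "siteId": "salle",
        "url": "tcp://salle-pi.local:12333",      # hypothetical raw PCM stream
        "format": {"rate": 16000, "width": 2, "channels": 1},
    }),
    hostname="mqtt-broker.local",
)
```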

Why would we need to do this? Seems to me there is one global MQTT broker that serves as the communication fabric between services. If an implementation of a set of logically related services could be optimized for deployment on a single device (or even on a set of tightly coupled devices) with an internal MQTT broker, then so be it. I’m thinking that using an internal MQTT broker for this purpose is probably overkill, and that there are better ways of doing this.

Isn’t this simply publishing state changes to a retained MQTT topic? When a client subscribes to this state topic, it receives the current state and is subsequently informed of state changes via the topic subscription.
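
Something like this sketch, for example (topic layout and broker hostname are assumptions for illustration):

```python
import json
import paho.mqtt.publish as publish

# A service keeps its last known state on a retained topic: any UI that
# subscribes later immediately receives this message, then live updates.
publish.single(
    "rhasspy/services/wakeword/salle/state",  # hypothetical topic layout
    payload=json.dumps({"state": "listening"}),
    retain=True,
    hostname="mqtt-broker.local",
)
```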

Would a Pi Zero really be the target for all Rhasspy services? Wouldn’t one just install a small set of appropriate services on a Pi Zero and delegate other services to more capable hosts in the network? Why does a “device” need to know who the master is? I’m not sure I understand the use case for this, or why there needs to be any master. In my mind, there is just a set of services, each of which may be interested in “events” published by other services (e.g. on MQTT topics). Is there a need for a “workflow” manager that orchestrates the process, or can this be done in a more decentralized manner where individual services take action when a triggering event occurs in the system (as published on various MQTT topics)? It seems like it would be highly advantageous to avoid a “master” and/or an “orchestration” service. I think that would be hard to maintain and would unnecessarily constrain the flexibility of the system. But I may certainly be missing something important here. I’m thinking this is mostly concerned with how STT audio input is constructed via wakeup and VAD and the desire to avoid unnecessarily publishing audio frames?

This is not to say that there isn’t some configuration “plumbing” that needs to occur so that individual services are aware of other services. For example, an STT service probably needs to know the ID of each audio service so that it can subscribe to the MQTT topics to which the audio services publish their audio-available events.
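
Although with topic wildcards even that plumbing can be minimal. A sketch (paho-mqtt 1.x style; the broker hostname is an assumption): instead of being configured with every audio service ID, the STT side subscribes with a wildcard and learns site IDs from the topic itself.

```python
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # topic layout: hermes/audioServer/<siteId>/audioFrame
    site_id = msg.topic.split("/")[2]
    print(f"{len(msg.payload)} bytes of audio from site '{site_id}'")

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt-broker.local")
client.subscribe("hermes/audioServer/+/audioFrame")
client.loop_forever()
```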

Gosh, sorry about how my posts seem to end up being so long winded. There just seems to be a lot of stuff to discuss. This is a good discussion!

1 Like

I agree wholeheartedly with your focus on a decentralized architecture (and this is the way Rhasspy is going anyway), but I think that in practice people will use these terms because they’re quite clear, and because most installations will use one system for the heavy processing and a couple of ‘light’ systems just for audio and the wake words. So I find it perfectly acceptable to think of “base” and “satellite” devices, as these are the common installation types.

3 Likes

As we already have a global MQTT broker, why not use it to also publish logs? Running an additional syslog service would require each of the other services to know about two endpoints (MQTT and syslog)… Seems overly complicated as we already have a message bus running…

In my mind too :wink:

I think a workflow manager may indeed not be required. For the communication between the audio IN and wake word services though, the question still stands, as they require local message sharing…

Audio frames should leave the satellite device (a device that does not do its own ASR/NLU) only when a wake word is detected, to avoid flooding the MQTT network with thousands of unneeded audio frame messages all the time. The more satellites we have, the more audio frames per second. This is not good. What was proposed is to keep the audio frames “local” (on a local/internal MQTT broker shared by the audio IN and wake word services, to keep a common protocol foundation, the Hermes protocol, between services) and only broadcast them globally when a wake word is detected, stopping when ASR is done.
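
A sketch of that gating, assuming the Hermes topics above (get_audio_frame() is a hypothetical local capture helper; the broker hostname is an assumption):

```python
import paho.mqtt.client as mqtt

SITE_ID = "salle"
streaming = False

def on_message(client, userdata, msg):
    global streaming
    if msg.topic.endswith("/detected"):
        streaming = True   # wake word fired: start forwarding frames
    elif msg.topic == "hermes/asr/textCaptured":
        streaming = False  # utterance decoded: go quiet again

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt-broker.local")
client.subscribe([("hermes/hotword/+/detected", 0), ("hermes/asr/textCaptured", 0)])
client.loop_start()

while True:
    frame = get_audio_frame()  # hypothetical: read one chunk from the local mic
    if streaming:
        client.publish(f"hermes/audioServer/{SITE_ID}/audioFrame", frame)
```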

Services that need them (ASR, utterance recording, etc.) should get the frames as soon as possible to start “online” decoding. I’m not a fan of publishing a stream URL, as it requires another connection to be opened, maintained and closed. We already maintain a permanent connection to the broker, so why not simply broadcast the data to whatever services need it?

Maybe… I was thinking about some kind of service “auto discovery”. The Web UI could send a “ping” and get all the services to respond with their state and config, so we can easily have a complete overview of the service topology and what is going on. Just an idea though (sometimes I get lost in those :stuck_out_tongue:)

Very interesting discussion indeed, and in my opinion quite an essential one, as this will shape the future Rhasspy services :slight_smile:

1 Like

To avoid broadcasting to the entire network, we could have a simple way in the interface to add a satellite with its IP. Then the master would have it with its siteId in its config and know everyone to talk/listen to.

Install a new satellite, configure its services (wake word, audio in/out), then add it on the master, and the master will set up everything needed. It will also be able to answer /api/devices with the list of satellite IPs/siteIds :hugs:

Fair enough. I think we agree then that there is no “master”. To reiterate, a satellite “typically” hosts audio-related stuff (wakeup, audio in) whilst a base hosts all of the other services. Hopefully we don’t let these concepts creep into the architecture, as I can see a future where a more capable satellite might host additional services (VAD?) and there may be “compute” hosts that can run more sophisticated and resource-demanding services that might not reside on the “base”.

Sure. Keeping it simple with fewer system dependencies sounds like a Good Thing™!
As long as we can avoid more traffic than the MQTT broker can reasonably handle on the host and network where it is deployed.

Understood. Maybe we have the service decomposition wrong? Might it be possible to avoid having to share audio frames across services by decomposing functionality differently?

Maybe the wakeup service is responsible for “listening” on an audio channel/device until it detects a wake word, keeping all of the audio local to that service. When a wake word is detected, it publishes an event, whereupon an “audio record” service is triggered and begins recording the audio for consumption by other services. No sharing of audio frames across services is necessary until wakeup has occurred and recording has started. On a stand-alone device, an optimized implementation might do both wakeup and audio record in a single process (or firmware binary), avoiding unnecessary round trips to the MQTT broker. The wakeup and record services can still implement the public service interface, but not rely upon that interface (and the MQTT round trip) for internal coordination. I don’t think we necessarily need an “internal” MQTT broker to accomplish this in an efficient manner; that just seems like overkill and an overly complicated deployment.
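
A sketch of that decomposition (detect_wake_word() is a hypothetical blocking call against the local microphone; topic and broker hostname are assumptions following the Hermes naming):

```python
import json
import paho.mqtt.publish as publish

SITE_ID = "salle"

while True:
    detect_wake_word()  # hypothetical; all audio stays on-device until here
    publish.single(
        "hermes/hotword/default/detected",
        payload=json.dumps({"siteId": SITE_ID, "modelId": "default"}),
        hostname="mqtt-broker.local",
    )
    # A co-resident "audio record" service reacts to this event; on a single
    # device that reaction can be a direct function call, not an MQTT round trip.
```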

Hmm, maybe. I guess there are some tradeoffs here… latency between wakeup and audio being available to a client consumer, and the subsequent overhead of publishing and consuming frames using the broker vs. a direct link between the audio producer and consumer via a stream. It does seem important to consider the efficiency of audio frame delivery once the recording process has started: publishing frames to the MQTT broker vs. stream efficiency. Are the frames “retained”? How does one synchronize between producer and consumer? Will frames be “lost”? Does anyone have experience with this approach or an example implementation? Also, teardown of the stream doesn’t really seem to be an issue, since it wouldn’t affect the latency between wakeup and the audio being available at the client. Or maybe a UDP approach would be better and avoid the stream setup overhead entirely? Probably better for multiple clients too, as one could use multicast rather than having to manage multiple connections. I guess I’m hoping that we can avoid using MQTT for audio frames, as it feels like it could be pretty high overhead and there might be synchronization issues.
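
For the UDP idea, the sender side really is minimal; a sketch (multicast group, port and frame size are arbitrary assumptions):

```python
import socket

# Any number of consumers can join the group; no per-client connections.
MCAST_GRP, MCAST_PORT = "239.255.42.1", 5004
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
frame = b"\x00" * 640  # one 20 ms frame at 16 kHz, 16-bit mono
sock.sendto(frame, (MCAST_GRP, MCAST_PORT))
```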

As a side note, I have been avoiding reading the Hermes protocol so as not to be biased by its design. At some point though…I should!

Zeroconf/Bonjour/mDNS for auto discovery? I don’t know how HASS discovery over MQTT works, but that also might be worth researching. It still might be possible to do this discovery passively via MQTT by using topic wildcards, such that you just see the devices that have registered themselves under some well-defined topic hierarchy.
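
A sketch of the mDNS route using the python-zeroconf package (service type and properties are assumptions for illustration): a satellite advertises itself so the base or a UI can discover it without a fixed registry.

```python
import socket
from zeroconf import ServiceInfo, Zeroconf

info = ServiceInfo(
    "_rhasspy._tcp.local.",  # hypothetical service type
    "salle._rhasspy._tcp.local.",
    addresses=[socket.inet_aton("192.168.1.42")],
    port=12101,
    properties={"siteId": "salle", "services": "wakeword,audio"},
)
zc = Zeroconf()
zc.register_service(info)  # stays advertised until zc.close()
```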

You should :slight_smile: Because a lot of what you’re talking about now is already in there :wink:

1 Like

Haven’t tried it yet, but it could be nice for debugging :grin:

I also like MQTT.fx:

https://mqttfx.jensd.de/

1 Like

Wow!

I just got the portable Windows version, started it, set the IP of my Snips master, and it is constantly receiving audioFrame messages on the hermes/audioServer/salle topic!!!

I thought Snips’ last updates had solved this!

So Snips is flooding my network :exploding_head:

Please get this sorted out in Rhasspy! No wake word detected, no listening, no streaming!!
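
By the way, you can measure the flood without any GUI tool; a rough sketch with paho-mqtt (the broker hostname is an assumption):

```python
import time
import paho.mqtt.client as mqtt

# Count audioFrame messages per second across all sites for 10 seconds.
count = 0

def on_message(client, userdata, msg):
    global count
    count += 1

client = mqtt.Client()
client.on_message = on_message
client.connect("snips-master.local")
client.subscribe("hermes/audioServer/+/audioFrame")
client.loop_start()
time.sleep(10)
print(f"{count / 10:.1f} audio frames/s on the bus")
client.loop_stop()
```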