New user here...been a lurker for a while. I like how this is proceeding! I am not a domain expert in voice systems, but I have been interested in this technology for quite some time. I have a Matrix Voice and a ReSpeaker gathering dust that I hope to put to use soon! Especially the Matrix Voice.
Some comments about the architecture...
It seems that things are moving in the direction of a "distributed" system. That is, one (or more?) "base" stations whose responsibility is to consume audio from a satellite, perform ASR, NLU, TTS (sending the audio back to a satellite for output) and possibly dialog management, whilst the satellite's responsibility is to produce audio (as input to ASR) and consume audio (presumably from TTS). From an architectural point of view, it seems important to keep in mind that the satellite might (in the future) be a standalone device with limited resources (such as a Matrix Voice).
With this in mind, whilst implementing a satellite node around an "internal" MQTT broker might be the path of least resistance for an initial implementation, it certainly seems like overkill in the long run for a "local" service that only manages the satellite. As long as the internal MQTT broker is an implementation detail of the satellite and does not architecturally "leak" outside the boundaries of the satellite, I don't see a problem. We should keep in mind the possibility of implementing a satellite on a standalone device with more limited resources, such as the Matrix Voice ESP32. As long as the satellite implements the "interfaces" that the base station requires, we keep the option open to reimplement the satellite functionality in a more compact and resource-efficient manner.
It occurs to me that exposing wake word, LED control, etc. outside of the satellite is really unnecessary and inappropriate, as this is arguably an implementation detail of a "particular type of satellite" and need not be globally exposed. One might envision other types of satellites that don't operate the same way (with a wake word), such as activation via face recognition, a button, or presence detection. I'm not saying that these necessarily make sense right now, but the architecture should not preclude other types of satellites. Thus, there should be no need to expose these details outside of the satellite...it's just a producer of recorded audio and a consumer of audio for playback. One might envision a sort of "plugin" architecture for the satellite that lets one plug in various "local" functionality in order to operate some arbitrary satellite device. Home Assistant's plugin and event architecture comes to mind as one possible architectural example. Purpose-built software for (e.g.) a Matrix Voice is another example of how one might implement a satellite.
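To make the plugin idea a bit more concrete, here is a rough Python sketch of what I have in mind. All of the names (`ActivationPlugin`, `Satellite`, `produce_audio`, `consume_audio`) are made up for illustration and are not taken from any existing code; the recording and playback functions are plain callables standing in for real hardware.

```python
from typing import Callable, Protocol


class ActivationPlugin(Protocol):
    """Hypothetical plugin interface: decides when the satellite starts recording."""

    def wait_for_activation(self) -> str:
        """Block until activated and return a label used only inside the satellite."""
        ...


class WakeWordActivation:
    def __init__(self, detector: Callable[[], bool]) -> None:
        self._detector = detector  # stand-in for a real wake word engine

    def wait_for_activation(self) -> str:
        # A real plugin would block on audio frames rather than poll.
        while not self._detector():
            pass
        return "wake_word"


class ButtonActivation:
    def __init__(self, is_pressed: Callable[[], bool]) -> None:
        self._is_pressed = is_pressed  # stand-in for a GPIO read

    def wait_for_activation(self) -> str:
        while not self._is_pressed():
            pass
        return "button"


class Satellite:
    """Seen from outside: a producer of recorded audio and a consumer of playback audio."""

    def __init__(
        self,
        activation: ActivationPlugin,
        record: Callable[[], bytes],
        play: Callable[[bytes], None],
    ) -> None:
        self._activation = activation
        self._record = record
        self._play = play

    def produce_audio(self) -> bytes:
        # How the satellite was activated never leaves this method.
        self._activation.wait_for_activation()
        return self._record()

    def consume_audio(self, audio: bytes) -> None:
        self._play(audio)
```

The base station would only ever call `produce_audio` and `consume_audio` (or their network equivalents), so swapping a wake word for a button or presence detection changes nothing outside the satellite.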
There was also a question about how to "configure" all of the various "nodes" in the system. Why not via the system-wide "global" MQTT broker? The broker is easily accessible to all nodes in the system (base and satellite), and it also provides a simple form of persistence for the configuration. That means each satellite (and base station) only needs a small "bootstrap" configuration to operate, i.e. its assigned ID (node ID/site ID?) and the URL and credentials necessary to access the global MQTT broker. Using the MQTT broker would also allow the web interface to easily author and view the configuration for each satellite (and base station) and publish it to an appropriate MQTT topic. If the satellite is listening to that topic, it can easily reload the config dynamically. Same for the web interface: all it needs is the URL and credentials to access the global MQTT broker in order to configure the satellites and the base station.
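Here is a small sketch of that configuration flow. I am assuming the "persistence" would come from MQTT retained messages, where the broker keeps the last retained payload on a topic and hands it to anyone who subscribes later. To keep the example runnable without a broker, `InMemoryBroker` is a toy stand-in that models only that behaviour; the `voice/config/<node_id>` topic layout, the `Bootstrap` fields, and the config keys are all invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Bootstrap:
    """The only settings a node has to store locally (hypothetical field names)."""
    node_id: str
    broker_url: str  # a real client would connect here; unused by the toy broker
    username: str
    password: str


class InMemoryBroker:
    """Toy stand-in for the global MQTT broker; models retained messages only."""

    def __init__(self) -> None:
        self._retained: Dict[str, bytes] = {}
        self._subscribers: Dict[str, List[Callable[[bytes], None]]] = {}

    def publish(self, topic: str, payload: bytes) -> None:
        self._retained[topic] = payload  # behaves like publishing with retain set
        for callback in self._subscribers.get(topic, []):
            callback(payload)

    def subscribe(self, topic: str, callback: Callable[[bytes], None]) -> None:
        self._subscribers.setdefault(topic, []).append(callback)
        if topic in self._retained:
            callback(self._retained[topic])  # late subscribers get the last config


def config_topic(node_id: str) -> str:
    return f"voice/config/{node_id}"  # hypothetical topic layout


class Node:
    """A satellite or base station that takes its full config from the broker."""

    def __init__(self, bootstrap: Bootstrap, broker: InMemoryBroker) -> None:
        self.bootstrap = bootstrap
        self.config: dict = {}
        broker.subscribe(config_topic(bootstrap.node_id), self._on_config)

    def _on_config(self, payload: bytes) -> None:
        self.config = json.loads(payload)  # reload whenever a new config arrives


# The web interface publishes a config; the node picks it up when it starts.
broker = InMemoryBroker()
broker.publish(config_topic("kitchen"), json.dumps({"volume": 5}).encode())
kitchen = Node(Bootstrap("kitchen", "mqtt://broker.example:1883", "user", "secret"), broker)
```

With a real broker the same shape should carry over: the web interface publishes with the retain flag set, and each node subscribes to its own config topic after connecting with its bootstrap credentials.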
I know that some of the above assumes a global MQTT broker is part of the core architecture. A pub/sub broker such as this is pretty useful in a distributed system, and MQTT is a proven protocol. Maybe there are ways of doing the same thing using HTTP and/or websockets, but I'm not familiar with how these might be used in the same manner. This is not to say that some HTTP or websocket APIs might not also be useful, but it does seem as if MQTT is a reasonable choice to base the architecture on.
My thoughts. Keep up the good work! This is pretty cool!
ba