Question: multiple room satellites, sharing a single CPU

Alextrical · March 24, 2024, 5:39pm

Is it possible to use multiple USB conference-room microphones sharing a single central processor?

I.e. in a 5-room setup, would it be possible to have a conference-room mic/speaker in each room and have them all work independently, using one central PC for processing?

synesthesiam · March 24, 2024, 6:53pm

What kind of processing do you mean?

Alextrical · March 24, 2024, 7:47pm

Sorry for the vagueness
I meant: is there a way to set up effectively 5 satellite devices that can pick up audio separately from multiple rooms, and act as though they were 5 ESP32-S3-BOX-3 devices that send audio back to a central server for STT processing?

Possibly using multiple Docker containers?

Sorry for the vagueness, I’m just trying to assess if the open source project I’m working on could be of use to this community (or adapted in a way that could be)

synesthesiam · March 24, 2024, 11:47pm

Sure, this can be done with just one Docker container or one HA instance actually. You only need a satellite device that’s powerful enough to connect to WiFi and send TCP packets. I’m assuming that wake word detection is being done on the satellite, but HA supports streaming wake word detection. This could be replicated with a second Docker container, though.

Without HA, you need to spin up a Docker container with wyoming-faster-whisper or wyoming-vosk on your central server. These do STT for any number of TCP clients using the Wyoming protocol (specifically the STT event flow). You will need to detect the end of the voice command on the satellite right now, but I plan to add this feature to the server in the future (HA already does this).

So the flow from a satellite looks like this:

1. Create a TCP connection to the STT Docker container
2. Send the audio-start event, a series of audio-chunk events, then audio-stop when the voice command is over
3. Receive the transcript event and do something with the text
4. Close the connection and wait for the wake word again
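
The flow above can be sketched in plain Python, with no Wyoming library involved. This is a rough sketch under assumptions, not a reference client: it assumes each Wyoming event is a single JSON header line (with "type", optional "data", and optional "data_length" / "payload_length"), followed by any extra JSON data bytes and then the raw audio payload. The helper names (encode_event, stt_events, read_event, transcribe) and the 16 kHz, 16-bit mono defaults are made up for illustration, and the host and port are whatever your STT container listens on.

```python
import json
import socket


def encode_event(event_type, data=None, payload=b""):
    """Frame one event: a JSON header line, then optional raw payload bytes."""
    header = {"type": event_type, "data": data or {}}
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + payload


def stt_events(pcm, rate=16000, width=2, channels=1, chunk_bytes=2048):
    """Yield framed audio-start, audio-chunk(s), audio-stop for one voice command."""
    fmt = {"rate": rate, "width": width, "channels": channels}  # assumed audio format
    yield encode_event("audio-start", fmt)
    for i in range(0, len(pcm), chunk_bytes):
        yield encode_event("audio-chunk", fmt, pcm[i:i + chunk_bytes])
    yield encode_event("audio-stop")


def read_event(stream):
    """Read one event from a binary file-like object as (type, data, payload)."""
    line = stream.readline()
    if not line:
        raise ConnectionError("connection closed before an event arrived")
    header = json.loads(line)
    data = header.get("data") or {}
    if header.get("data_length"):
        # Some senders put the JSON data after the header line instead of in it.
        data.update(json.loads(stream.read(header["data_length"])))
    payload = stream.read(header.get("payload_length") or 0)
    return header["type"], data, payload


def transcribe(host, port, pcm):
    """Steps 1-4: connect, send the audio events, wait for the transcript, close."""
    with socket.create_connection((host, port)) as sock:
        for message in stt_events(pcm):
            sock.sendall(message)
        with sock.makefile("rb") as stream:
            while True:
                event_type, data, _ = read_event(stream)
                if event_type == "transcript":
                    return data.get("text", "")
```

Each of the 5 rooms would call something like transcribe(host, port, pcm) after its own wake word fires. Every call opens its own TCP connection, so the rooms stay independent while sharing one STT server.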

With HA, you can use the websocket API to do everything (wake word, speech-to-text, intent-recognition, text-to-speech) or just some of those things. You do need a long-lived access token, however.
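
As a rough illustration of the websocket route, the messages a satellite would send might be built like this. The shapes here are assumptions to check against the Home Assistant WebSocket API and Assist pipeline developer docs: an "auth" message carrying the long-lived access token, an "assist_pipeline/run" command with a start and end stage, and binary audio frames prefixed with a one-byte handler id that HA reports once the run starts. The function names are hypothetical, and the actual websocket connection is left out.

```python
import json


def auth_message(token):
    """Reply to HA's auth_required message with the long-lived access token."""
    return json.dumps({"type": "auth", "access_token": token})


def run_pipeline_message(message_id, start_stage="stt", end_stage="tts",
                         sample_rate=16000):
    """Ask HA to run part of a pipeline, e.g. from speech-to-text to text-to-speech."""
    return json.dumps({
        "id": message_id,  # each command on the connection needs its own id
        "type": "assist_pipeline/run",
        "start_stage": start_stage,  # assumed stage names: wake_word, stt, intent, tts
        "end_stage": end_stage,
        "input": {"sample_rate": sample_rate},
    })


def audio_frame(handler_id, pcm_chunk=b""):
    """Binary frame: one handler-id byte, then raw audio (no audio = end of stream)."""
    return bytes([handler_id]) + pcm_chunk
```

Starting at "wake_word" instead of "stt" would hand wake word detection to HA as well, at the cost of streaming audio continuously from the room.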

Another option with HA is to make each satellite a Wyoming satellite. While I use Raspberry Pis in all my examples, the same simple Wyoming events are used (JSON and sometimes JSON + raw audio). So anything that can do TCP and handle JSON can become a satellite.

Hope this helps, let me know if you have more questions!

Alextrical · March 25, 2024, 10:33am

Effectively, the thought that crossed my mind was that I have a Proxmox server (an HP T730) hosting the Home Assistant server and a few other services, while still having plenty of additional resources available.

As it’s possible to use a £3.50 USB CAT5 extender to locate a USB 2.0 device up to 150ft away, we could remove some of the complexity of having to wirelessly stream audio from scattered low-powered nodes around the home (be it an ESP32 or RPi)

I suppose this is more of a thought experiment, to try and judge if using an ESP32-S3 as a remote node was the best option, compared against using a remote RPi or a USB cable linking it back directly to the Home Assistant server for local processing

synesthesiam · March 26, 2024, 8:42pm

I’d guess that the cost of running the cables (not just the cables themselves) would be far more than the individual nodes.

If you have good local wake word detection, the streaming would be very minimal unless you’re playing a lot of music.