I’m running Rhasspy on multiple satellites and I’m trying to avoid simultaneous activation of the same wake word. I’ve been able to solve that through rate-limiting in node-red, but the first satellite to process a wake word isn’t necessarily the closest satellite to the speaker.
Is there any way to get the VAD or Current Energy Data (like in the audio statistics section of speech to text) so that I can compare the satellites and choose the one with the greatest measurement? All my intent processing is in node-red, so if possible it would be through mqtt or websockets.
My satellites are running on Raspberry Pi Zero 2 Ws and my microphones are ReSpeaker Mic Array v2.0s (they have VAD, so maybe I can somehow tap into that instead?).
Likely even better than VAD energy, the actual argmax of the current KW trigger could be used.
I have never really understood the satellite peer-to-peer-like infrastructure, or broadcasting multiple audio streams over MQTT. You do have a stream, so you could calculate it from that, but by then the main identifier, the KW itself, has already gone. It is the argmax of the KW, or the audio statistics of the KW, that should select the best command stream, and the others can just be dropped.
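As a rough sketch of that "pick the best, drop the rest" idea, something like the following could sit wherever the detections arrive. It assumes each satellite can report some score with its wake word event (KW argmax or energy); the `Detection` fields, the `pick_best` name and the 0.5 s window are all made up for illustration and are not part of Rhasspy.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Detection:
    site_id: str      # satellite that reported the wake word
    score: float      # hypothetical KW argmax or energy value
    timestamp: float  # seconds, when the detection arrived


def pick_best(detections: Iterable[Detection],
              window_s: float = 0.5) -> Optional[Detection]:
    """Return the highest-scoring detection that arrived within
    window_s seconds of the earliest one; later ones are ignored."""
    items = sorted(detections, key=lambda d: d.timestamp)
    if not items:
        return None
    first = items[0].timestamp
    candidates = [d for d in items if d.timestamp - first <= window_s]
    return max(candidates, key=lambda d: d.score)


if __name__ == "__main__":
    heard = [
        Detection("kitchen", 0.71, 10.00),
        Detection("lounge", 0.93, 10.12),
    ]
    winner = pick_best(heard)
    print(winner.site_id if winner else "no detection")
```

The window trades latency for fairness: the first satellite to report no longer wins automatically, but every command waits up to `window_s` before one stream is chosen.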
For V3 I have been advocating a KWS server, as there is no need for any KWS to be system aware. It needs a minimum of commands, ‘Start/Stop’, and then 2 payloads: the binary audio and the API metadata. ‘Start/stop’ plus the current KW threshold value, be it based on argmax or energy, is all that is needed.
Also, a low-latency 1-to-1 connection like websockets is likely better than broadcasting packet headers to every node in that protocol network.
A KWS server is just an abstraction bridge, so any KWS can be connected, as only the backend has Rhasspy specifics.
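To show how small that protocol could be, here is a minimal sketch of the framing only, with no actual socket code. It assumes JSON text frames for the ‘start/stop’ metadata and raw binary frames for the audio; the field names (`type`, `site_id`, `kw_score`, `sample_rate`) are my own invention, not an existing Rhasspy or V3 API.

```python
import json
from typing import Union


def make_start(site_id: str, kw_score: float,
               sample_rate: int = 16000) -> str:
    """Control frame sent once when the KWS fires (hypothetical fields)."""
    return json.dumps({
        "type": "start",
        "site_id": site_id,
        "kw_score": kw_score,      # argmax or energy, whatever the KWS has
        "sample_rate": sample_rate,
    })


def make_stop(site_id: str) -> str:
    """Control frame sent when the satellite stops streaming."""
    return json.dumps({"type": "stop", "site_id": site_id})


def parse_frame(frame: Union[str, bytes, bytearray]) -> dict:
    """Text frames carry control metadata, binary frames carry audio."""
    if isinstance(frame, (bytes, bytearray)):
        return {"type": "audio", "payload": bytes(frame)}
    msg = json.loads(frame)
    if msg.get("type") not in ("start", "stop"):
        raise ValueError("unknown control frame")
    return msg


if __name__ == "__main__":
    print(parse_frame(make_start("kitchen", 0.93)))
    print(parse_frame(b"\x00\x01\x02\x03")["type"])
    print(parse_frame(make_stop("kitchen")))
```

Websockets already distinguish text from binary messages, so the split above maps onto them directly, and the backend only ever needs to understand these three frame kinds regardless of which KWS sits in front.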
There are some great KWS out there, such as Picovoice, where you can set a sensitivity, but sadly the API only returns a boolean saying the threshold has been exceeded and not the argmax value. You could strap VAD on there, though.
Also, you could even try a purely VAD or personal VAD based system and transmit a full capture where the KW is part of the ASR predicate, to allow for a dynamic, programmable KW.
It is the same thing in operation: it transmits a quality metric from a ring buffer to compare against the others in a zone, so you can select the best and drop the others. With concurrent zones, it will queue one for ASR until the first is finished.
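A quality metric from a ring buffer could be as simple as RMS energy over the last few frames. The sketch below assumes 16-bit mono PCM in the host's byte order; the `EnergyRing` class and its frame count are hypothetical, and RMS is just one possible metric, standing in for whatever the KWS or VAD can actually provide.

```python
import math
from array import array
from collections import deque


class EnergyRing:
    """Keeps the most recent PCM frames and reports their RMS energy."""

    def __init__(self, max_frames: int = 50) -> None:
        # deque with maxlen silently drops the oldest frame when full
        self.frames = deque(maxlen=max_frames)

    def push(self, pcm: bytes) -> None:
        """Add one frame of 16-bit PCM (assumed mono, native byte order)."""
        self.frames.append(pcm)

    def rms(self) -> float:
        """RMS over every sample currently held in the buffer."""
        samples = array("h")
        for frame in self.frames:
            samples.frombytes(frame)
        if not samples:
            return 0.0
        return math.sqrt(sum(s * s for s in samples) / len(samples))


if __name__ == "__main__":
    near, far = EnergyRing(), EnergyRing()
    near.push(array("h", [3000] * 160).tobytes())
    far.push(array("h", [500] * 160).tobytes())
    best = "near" if near.rms() > far.rms() else "far"
    print(best, near.rms(), far.rms())
```

Each satellite would send only the single number from `rms()` alongside its capture, so comparing streams within a zone costs almost nothing on the server side.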
There are a couple of PersonalVad projects on GitHub, but I haven’t tested how well they operate.
They are not massively accurate, but in conjunction with the KW being part of the ASR predicate the combination should be accurate, and you only need to train PersonalVad and not the KW.
The longer you do feature extraction the better, and purely for selecting the best stream they would likely be good enough. It likely doesn’t matter anyway, as with many CTC beam-search-style ASRs you want file-based ASR rather than streaming, since they work on overall context.
I think the implementation is a really important and relatively simple step: rather than trying to beat physics with high-priced array microphones, simpler and more cost-effective distributed microphones can be used.
You can also do it with a single client that has multiple microphones or arrays, or even multiples of those, but looks and acts as a single device with its output.
I am waiting with curiosity to see how V3 will pan out, but actually it’s pretty simple: there are many examples for websockets, and audio chains are really simple serial chains, like GStreamer.