Master/Satellite Setup vs Individual Instances of Rhasspy - What Do You Use?

maleko · November 30, 2020, 12:11pm

I have been playing with Rhasspy 2.5 for a couple of weeks and it seems pretty sweet. Currently I am on 2.5.8.

I set up Rhasspy on a single Pi 4 with a Respeaker 4 mic array. It worked pretty well, but I found that the range of the Respeaker wasn’t that great: from more than 2 metres away it failed to pick up commands unless I screamed them, and if there was any sort of background noise from my oven, TV or projector then there was no hope of it working.

Because of this, I bought a couple of Pi Zero Ws, a couple of Respeaker 2 mic arrays and Respeaker 6 mic array kits to experiment with different master/satellite setups. My master device was a Windows 10 PC, the Pi 4 with the 4 mic array was one satellite and the 2 Pi Zeros were the other 2 satellites (I tried both the 2 and 6 mic arrays with these). Setup of each device was as per the docs.

The following outlines my experience when using the Master/Satellite setup, where all 3 satellites and master are in a single large room. Note that Master does not have any mic setup attached to it as it is just used for processing power:

Command recognition seems much poorer than when using a single device - often when I speak a command, the completely wrong (and not remotely similarly phrased) intent is executed - like I say turn on tv and the microwave turns off.
For some reason it seemed to favour off commands - I would say blah blah on and instead it would almost always execute blah blah off. Getting the corresponding on command to execute was much harder and maybe only happened one in five attempts.
It never complained about an unrecognised intent. Instead it just seemed to execute another intent, I guess whatever it reckoned was the closest match. More often than not it turned off my TV, despite many of the commands I tried not being close to turn off tv. On my single Pi 4 setup (i.e. not using Master/Satellite) Hermes LED Control flashes red on an unrecognised command (pattern=projectalice), but it didn’t when using the satellite. Instead it always flashed blue (i.e. it successfully recognised the intent) and then executed a command (but often not the right command).
The main issue I experienced was that despite there being significant distance between satellites (over 8 metres between 2 of them is the largest distance) when I issued a command, all 3 of the satellites would execute a command - but they would execute different commands and often the further away satellites would take much longer to execute - say 20 - 30 seconds after the first command had executed. This resulted in totally unpredictable behaviour.

As a result of the above testing, I think I will move back to using single instances of Rhasspy, say each instance running on either a pi 3b+ or pi 4 (2 or 4gb).

I was just wondering what people’s general experiences were running individual instances of Rhasspy vs a Master/Satellite setup? Are the problems I am experiencing normal? What fixes or workarounds did you implement to overcome them?
I have read a number of posts on here and on reddit, and even found this https://laurentchervet.wordpress.com/2019/03/, however the solution seemed Snips specific and I didn’t really understand how to implement it.

Questions:

If I switch back to using individual instances of Rhasspy then will I still experience the same issue whereby multiple devices execute different commands? (I was hoping not, as the range seemed to be more limited and commands seemed better recognised vs using the Master/Satellite setup)

If my issues are caused by using Respeakers, then what mic array would you recommend instead? I bought Respeakers as they seemed to be used by most people online, however the performance seems really so-so (pretty decent for the 4 mic, apart from the previously mentioned range issue; in my experience the 2 and 6 mic array models were awful).

The performance of the pi zero seems pretty dire - does this have enough power to be an effective satellite?

If anyone has successfully implemented multiple devices in a single room then would you mind discussing it here? What is your setup and what steps did you have to take to prevent any of the issues that I discussed?

Thanks

moqart · November 30, 2020, 1:26pm

Are you using the same TTS and intent recognition as before you changed to a satellite setup?
Not sure which TTS option is the most accurate one for English as I am running Rhasspy with a German profile.

I don’t have an active satellite at the moment because I don’t need one, but I have set one up before on a Pi Zero W and it worked fine. Unfortunately it is not able to run Mycroft Precise for local wakeword recognition, which I use on my master (Pi 4).

The Pi Zero W struggles to do local wakeword recognition depending on the system you choose. Otherwise there is no problem with using it as a satellite.
At the beginning I also had the issue of random intents getting recognized when the wakeword accidentally triggered and it heard some words, but that became less of an issue once my sentences contained a larger number of words. Then the random words it recognized did not form a valid intent anymore.

The issue of multiple satellites picking up what you say is more difficult to solve. Did you make sure to give them different siteIds? I find it a bit unexpected that with 8m between them both react to the same wakeword. Using multiple instances would not reduce this issue of multiple devices triggering at once unless you use different wakewords for them (which you could also do in the case of a satellite setup).

maleko · December 30, 2020, 5:48pm

Hi moqart,

Apologies for the delay in coming back to you.

Before switching to a master / satellite setup I was using Pocketsphinx and Fsticuffs. When I switched to a satellite setup I switched to using Kaldi and Fsticuffs on the base station and Remote HTTP for both on the satellite.

I have since switched to operating multiple distinct instances of Rhasspy, all running on Pi 4B+ with Respeaker 4. I found that in this setup, using Kaldi for STT worked better in a noisy home environment (when I used Pocketsphinx my wakeword often struggled to be detected over low level music or TV, whereas Kaldi seems to work better). I also seem to get fewer wrongly recognised intents with this setup which is good. I have also switched to using Mycroft Precise instead of Porcupine which seems to be working better.

As you said, switching to separate instances does not fix the issue with the same command being picked up by multiple devices. On the plus side it is less of a problem, in that now at least most of the time the same command is recognised by both devices. Occasionally when I go to turn on a light it goes on and then immediately off again, as the closer Pi detects the correct “on” command and the far away one detects the incorrect “off” command, but it’s not a huge issue most of the time. Also, to answer your question, I always make sure to give all my devices a unique siteId.

Sometimes the Pi right beside me doesn’t detect the command (I think due to interference from the TV playing) but the one at the far end of the room does, which I agree is quite surprising. It seems that the Respeaker 4 is very easily affected by background noise.

What would be awesome would be some way to score / rate a command so that if multiple devices picked up the same command then only the one with the highest score (ie the closest one) would execute the command.
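A rough sketch of what such scoring could look like, in Python. It assumes each device reports its site ID, the recognised intent, a confidence value and a timestamp; the `Heard` record, its field names and the 2 second window are all hypothetical, not Rhasspy's actual message format. Confidence is only a stand-in for "closest device" and is not guaranteed to track distance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Heard:
    """One recognised command as reported by one device (hypothetical record)."""
    site_id: str
    intent: str
    confidence: float  # 0.0 .. 1.0, higher is better
    timestamp: float   # seconds


def arbitrate(events: List[Heard], window_s: float = 2.0) -> List[Heard]:
    """Treat events within window_s of the first event in a group as the
    same utterance and keep only the highest-confidence one per group."""
    winners: List[Heard] = []
    group: List[Heard] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        # An event too far from the start of the current group starts a new one.
        if group and ev.timestamp - group[0].timestamp > window_s:
            winners.append(max(group, key=lambda e: e.confidence))
            group = []
        group.append(ev)
    if group:
        winners.append(max(group, key=lambda e: e.confidence))
    return winners


events = [
    Heard("near-pi", "LightOn", 0.92, 10.0),
    Heard("far-pi", "LightOff", 0.41, 10.6),
    Heard("near-pi", "TvOff", 0.80, 30.0),
]
winners = arbitrate(events)
print([(w.site_id, w.intent) for w in winners])
# [('near-pi', 'LightOn'), ('near-pi', 'TvOff')]
```

Something like this would have to sit between the devices and whatever executes the commands, and it has to wait for the window to close before acting, so it adds a delay. It also would not catch a device that responds 20 to 30 seconds late.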

I think that I might need to bite the bullet and look at a more expensive mic setup with some sort of noise cancellation and directionality to it. So far I have tried Respeaker 2, 4 and 6 mic arrays and have been pretty unimpressed with the performance. The 4 at least kind of works, the 2 and 6 were useless.

Thanks

rolyan_trauts · December 30, 2020, 6:33pm

I don’t use any because my opinion is the current infrastructure is a whole load of bloat and we are still lacking a good KWS.

The same goes for MEMS arrays without included DSP and beamforming: because of the spacing between the mics, just summing the channels creates some unwanted high pass filters.
If you have an array, use a single channel. Unless you can get some algorithms for your array that at least compensate for the effects an endfire or broadside array will produce, use a single mic.
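To illustrate why plain summing is frequency dependent, here is a minimal Python sketch. It assumes the simplest case: two omnidirectional mics, a far-field source, and the two channels averaged with no steering delay. The function names and the example spacing are made up for illustration, and the exact filter shape of a real array depends on its geometry and on how the channels are combined. In this case the averaged gain is |cos(π f τ)|, where τ = d·sin(θ)/c is the extra travel time to the second mic, so off-axis sound gets notched at f = 1/(2τ).

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature


def arrival_delay_s(spacing_m: float, angle_deg: float) -> float:
    """Extra travel time to the second mic for a far-field source.
    angle_deg is measured from broadside (0 = reaches both mics together)."""
    return spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND


def averaged_pair_gain(freq_hz: float, spacing_m: float, angle_deg: float) -> float:
    """Gain, relative to one mic, of two mics averaged without steering:
    |1 + exp(-j*2*pi*f*tau)| / 2 = |cos(pi*f*tau)|."""
    tau = arrival_delay_s(spacing_m, angle_deg)
    return abs(math.cos(math.pi * freq_hz * tau))


def first_null_hz(spacing_m: float, angle_deg: float) -> float:
    """Lowest frequency that cancels completely for this arrival angle."""
    tau = abs(arrival_delay_s(spacing_m, angle_deg))
    return math.inf if tau == 0 else 1.0 / (2.0 * tau)


# Hypothetical 6 cm spacing, sound arriving along the line of the two mics.
spacing = 0.06
for freq in (500, 1000, 2000, 3000):
    print(freq, round(averaged_pair_gain(freq, spacing, 90.0), 3))
print("first null (Hz):", round(first_null_hz(spacing, 90.0)))
```

Sound from straight ahead (0 degrees) passes unchanged at every frequency, while sound from the side loses part of the speech band. That is the kind of unintended filtering a beamforming algorithm would have to compensate for.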

What you should do is position the array at a distance from you, speak at normal levels, record some examples and post them to an online drive somewhere so we can maybe provide input.
But you are right, as there is a huge gulf between a single mic and DSP embedded versions such as Matrix Voice, Acusis S, Respeaker USB or even my Anker Powerconf.
But even then they are still extremely prone to noise problems, and for all of them there is a threshold where the near noise : far voice ratio can be extremely low and they will fail.
Even big data DSP, if placed badly, will fail with common noise above a threshold.
They do have algorithms and noise reduction, and they train their systems all with identical inputs, whilst we try to provide studio recordings and studio level DSP, which is why we fail.

Beamformers generally do a better job than what is essentially a pointless, redundant collection of mics without DSP.

There has been this focus on copying Amazon/Google without thought to what they actually employ: they have embedded DSP in low cost silicon and we do not.
In fact the emulation of Amazon/Google creates units that have a semblance of the looks but in use are extremely poor imitations.

If we had a decent KWS, which we do not, we could use multiple unidirectional mics with multiple instances of that KWS and instantly create low cost non-DSP beamforming.
A satellite itself is firstly a voice Human Machine Interface: however many KWS instances are running, on a KW hit it broadcasts the mic with the best confidence hit and delivers that KW confidence via MQTT, until silence.
The audio server groups remote satellites as rooms, and the best KW confidence hit is used for a room session.
Distributed arrays can cope with noise because, through placement, at least one mic can have a good ratio of far noise : near voice.
Pixel rings and audio are outputs, not inputs. They are distinct and separate HMIs that are associated together via a GUI with a base default of IP, but this whole bloat and definition of a specific system is fubar to me.
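The room grouping described above could be sketched roughly like this in Python. It is only an illustration of the idea, not an existing audio server: the function name, the satellite-to-room mapping and the confidence values are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def best_satellite_per_room(
    hits: Iterable[Tuple[str, float]],
    rooms: Dict[str, str],
) -> Dict[str, str]:
    """Given (satellite_id, kw_confidence) hits for one utterance, return
    the satellite with the best KW confidence in each room."""
    best: Dict[str, Tuple[str, float]] = {}
    for satellite_id, confidence in hits:
        # A satellite with no room assigned is treated as its own room.
        room = rooms.get(satellite_id, satellite_id)
        if room not in best or confidence > best[room][1]:
            best[room] = (satellite_id, confidence)
    return {room: satellite_id for room, (satellite_id, _) in best.items()}


# Hypothetical layout: two satellites share the lounge, one is in the office.
rooms = {"sofa": "lounge", "tv": "lounge", "desk": "office"}
hits = [("sofa", 0.81), ("tv", 0.64), ("desk", 0.70)]
print(best_satellite_per_room(hits, rooms))
# {'lounge': 'sofa', 'office': 'desk'}
```

Only the winning satellite's audio would then be streamed for that room's session, so two devices in the same room never both act on one utterance.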

Pixel rings, audio out and KWS should be simple and interoperable, and in the New Year, after saying so for long enough, I intend to do something.
They will not be pointlessly branded as some specific system, just purely interoperable modules that work with any ASR/Intent system via a front of house audio & session server that can create an asynchronous audio queue and play to a loopback adapter on a single queue if the ASR will stream.