I’ve managed to do some testing with a PlayStation Eye microphone (4-mic array) and some PulseAudio filters.
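For anyone who wants to reproduce this: beamforming in PulseAudio is exposed through `module-echo-cancel` with the WebRTC canceller. A sketch of what that could look like in `default.pa` is below — the `mic_geometry` values are an assumption for the PS Eye's linear 4-mic array (roughly 2 cm spacing, positions in metres) and may need tuning for your unit:

```
# default.pa sketch (values are assumptions, not a verified config)
load-module module-echo-cancel aec_method=webrtc aec_args="beamforming=1 mic_geometry=-0.03,0,0,-0.01,0,0,0.01,0,0,0.03,0,0"
```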
My use case is Rhasspy for a media player. The user asks it to play something (stored music collection, internet radio stations, etc.) and maybe asks for some information, like the current time. The main problem is that while music is playing it is hard for Rhasspy to recognize the wake word, and even harder for it to understand a command.
In my tests, I’ve found that when beamforming is used, the system removes almost all of the background music that is playing. The resulting user voice sounds a little distorted, but the wake word phase still works.
So the idea for discussion is the following:
- Use all PulseAudio filters, including beamforming, for capturing the wake word;
- Train a custom wake word (for instance, with Mycroft Precise) using all the above filters, so that the wake word phase works as well as it can;
- After the wake word is detected, use a script to turn down the volume and also turn off beamforming (so the distortion goes away, but the volume is low enough that the ASR doesn’t suffer);
- After the command is interpreted, turn the volume back up to where it was and turn beamforming back on to wait for the next wake word.
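The duck/restore steps above could be sketched as a pair of shell hooks fired on the wake word and end-of-command events. Everything here is a hypothetical placeholder for my setup: the raw source name, the fixed volume percentages (a real script would save and restore the previous volume), and the `mic_geometry` values would all need adjusting:

```shell
#!/bin/sh
# Hooks around wake word detection: duck music + drop beamforming,
# then restore both afterwards. All names/values below are assumptions.

# Raw (unfiltered) PS Eye source — placeholder name, check `pactl list sources`.
RAW_SOURCE="alsa_input.usb-OmniVision_USB_Camera-01.multichannel-input"
MUSIC_SINK="@DEFAULT_SINK@"

# Called when the wake word fires: duck the music and unload the
# echo-cancel module so ASR gets the undistorted (but quiet) signal.
on_wake() {
    pactl set-sink-volume "$MUSIC_SINK" 20%
    pactl unload-module module-echo-cancel
}

# Called after the command is handled: restore the volume and reload
# beamforming for the next wake word. Fixed 70% is a simplification;
# step 4 above would restore the previously saved level instead.
on_done() {
    pactl set-sink-volume "$MUSIC_SINK" 70%
    pactl load-module module-echo-cancel \
        source_master="$RAW_SOURCE" \
        aec_method=webrtc \
        aec_args='"beamforming=1 mic_geometry=-0.03,0,0,-0.01,0,0,0.01,0,0,0.03,0,0"'
}
```

The extra quoting on `aec_args` is needed because the module argument itself contains spaces and has to survive the shell's word splitting.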
This solution would be interesting for anyone who has more than one microphone.