Yeah, but I have a hunch that Porcupine is a 'clean' model, maybe expecting clean input that needs preprocessing as you say.
I have been playing with the Google-KWS state-of-the-art NN models, and you actually can train a KW to be resilient to noise if you train the noise in.
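Just to show what I mean by training the noise in, here is a rough sketch of mixing household noise into keyword captures at a random SNR before training. The file paths, mono assumption and SNR range are only illustrative (the Google KWS pipeline does its own background-noise mixing, so this is the idea rather than their exact code):

```python
# Rough sketch of noise augmentation for KWS training samples.
# Paths, mono audio and the SNR range are just examples, not from any repo.
import numpy as np
import soundfile as sf

def mix_noise(clean, noise, snr_db):
    """Mix a mono noise clip into a clean KW sample at a given SNR (dB)."""
    # Loop/trim the noise so it matches the keyword clip length
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_rms = np.sqrt(np.mean(clean ** 2) + 1e-12)
    noise_rms = np.sqrt(np.mean(noise ** 2) + 1e-12)
    # Scale the noise to hit the requested SNR relative to the keyword
    gain = clean_rms / (noise_rms * 10 ** (snr_db / 20))
    return np.clip(clean + gain * noise, -1.0, 1.0)

clean, sr = sf.read("keyword/sample_001.wav")    # a Raven/mic capture
noise, _ = sf.read("noise/kitchen_radio.wav")    # a household noise clip
augmented = mix_noise(clean, noise, snr_db=np.random.uniform(0, 20))
sf.write("keyword_aug/sample_001_noisy.wav", augmented, sr)
```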
I am sort of getting sick of watching my paltry 1050 Ti plod through a training run, but I have had some really good results with a single mic.
Also, reverting from state-of-the-art array mics back to old unidirectional electrets, when running a single mic, means you can harness the natural 'beamforming' they have.
Most people running a single mic probably have the wrong type: an omnidirectional that will pick up sound (noise) equally from all directions.
Raven doesn't cope with noise, as it just captures a few samples of your voice and goes, but it should be great as a quick start for capturing a dataset.
Someone somewhere really needs to spruce up the capture logic, as the ASR 'command' voice is not part of the capture; that leaves you with maybe adding the universal 'Google command set' for the not-voice samples.
Raven plus the returned ASR audio would make state-of-the-art custom datasets, because the dataset is exactly how you use voice commands.
With that type of dataset you could make an extremely resilient KWS with something you can train, such as Precise.
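For anyone wanting to try that, a hedged sketch of what the dataset layout could look like: Raven captures as the wake word and a random slice of the Google command set as the 'not voice' negatives. The folder names and counts are just examples, so check what Precise (or whatever you train) actually expects:

```python
# Hypothetical layout: turn Raven captures plus the Google Speech Commands
# set into a wake-word / not-wake-word dataset. Folder names are examples.
import random, shutil
from pathlib import Path

raven_dir = Path("raven_captures")            # wav files Raven recorded
speech_commands = Path("speech_commands_v2")  # extracted Google dataset
out = Path("dataset")

(out / "wake-word").mkdir(parents=True, exist_ok=True)
(out / "not-wake-word").mkdir(parents=True, exist_ok=True)

# All Raven captures become positive samples
for wav in raven_dir.glob("*.wav"):
    shutil.copy(wav, out / "wake-word" / wav.name)

# A random slice of the Google command words becomes the negatives
all_negs = list(speech_commands.rglob("*.wav"))
for i, wav in enumerate(random.sample(all_negs, k=min(2000, len(all_negs)))):
    shutil.copy(wav, out / "not-wake-word" / f"neg_{i:05d}.wav")
```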
As I say, I have been playing with the streaming CRNN tensorflow-lite model; it blows away Precise on accuracy & load, and I have been getting to grips with model creation and parameters.
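If anyone wants to poke at one of those streaming tflite models, this is roughly how you drive it frame by frame. It assumes an external-state export where the audio frame is input 0 and the recurrent states are the remaining inputs, which is not guaranteed for every export, so check your own model's input/output details (model file name and keyword class index are placeholders):

```python
# Hedged sketch of running a streaming tflite KWS model frame by frame.
# Assumes a float32 external-state export; ordering of states may differ.
import numpy as np
import tflite_runtime.interpreter as tflite   # or tf.lite.Interpreter on desktop

interpreter = tflite.Interpreter(model_path="crnn_stream.tflite")
interpreter.allocate_tensors()
inputs = interpreter.get_input_details()
outputs = interpreter.get_output_details()

KEYWORD_CLASS = -1   # index of the KW in your label list; depends on training

# Zero the recurrent states before the first frame
states = [np.zeros(d["shape"], dtype=np.float32) for d in inputs[1:]]

def kw_probability(frame):
    """frame: one 20ms block of float32 audio shaped like inputs[0]."""
    interpreter.set_tensor(inputs[0]["index"], frame)
    for detail, state in zip(inputs[1:], states):
        interpreter.set_tensor(detail["index"], state)
    interpreter.invoke()
    # Carry the updated states forward to the next frame
    for i, detail in enumerate(outputs[1:]):
        states[i] = interpreter.get_tensor(detail["index"])
    scores = interpreter.get_tensor(outputs[0]["index"])
    return float(scores[0][KEYWORD_CLASS])
```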
I did expect to see some Arm boards with embedded DSP by now, and I'm sort of wondering if Covid has derailed much, as the DSP is not expensive, or whether the big guys in the game have control over the market.
Rockchip & Allwinner did threaten some SoCs, but I haven't seen anything available in a form we can use.
Raven is a special case for quick KW & dataset capture, if anyone does make it more robust, but a major part of the problem is that the KWS we have, and the models they employ, are far from state-of-the-art or self-learning, as they come with a static model.
State-of-the-art is returning to a single mic so manufacturers can make next-gen phones without the need for multiple mics and sound access holes.
State-of-the-art room pickup is multiple distributed beamforming/BSS mics, as far-field pickup in high-noise environments is just pointless, trying to find that meagre ripple in the air.
Also, it's worth looking at what the noise actually is: in a domestic situation it's mainly media, and the solution to that is merely volume ducking and AEC.
The above KWS tensorflow-lite model can do 20ms streaming, provide an extremely accurate envelope, and run at less than 20% single-core load on a Pi3A+.
So yeah, you can run 2x KWS on a Pi3 and more, just not with the KWS you have. Porcupine is an OK KWS, but it gives zero metric out other than that a KW has been hit, so it's relatively useless for this.
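To show why that per-frame envelope matters compared to a bare hit flag, a tiny illustrative sketch, with made-up window and threshold numbers:

```python
# Illustrative only: using the per-frame KW probability as an envelope
# rather than a bare "hit" flag. Threshold and window are made-up numbers.
from collections import deque

WINDOW = 10        # ~200ms of 20ms frames
TRIGGER = 0.85     # smoothed probability needed to fire

recent = deque(maxlen=WINDOW)

def update(frame_prob):
    """Feed each 20ms probability; returns True when the KW should fire."""
    recent.append(frame_prob)
    smoothed = sum(recent) / len(recent)
    # The envelope also tells you *how good* the hit was, which a
    # binary detector never exposes.
    return len(recent) == WINDOW and smoothed >= TRIGGER
```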
State-of-the-art with KWS is https://github.com/google-research/google-research/tree/master/kws_streaming and my repo is just an adaptation of that, to try and make it a little easier to use.
State-of-the-art is likely to be a single brain fed from multiple distributed mics/arrays, with the best signal picked and likely further preprocessing, but that is probably a strange focus when the KWS we have are so far from state-of-the-art, and even the ASR is, as state-of-the-art ASR can understand through noise.
One of the latest speech toolkits has tools for beamforming, but it is Intel/Nvidia based, so likely Jetson or above, and not for the likes of anything with 'Rhass' in the name, as for voice AI the Raspberry is really the wrong platform.