It's WER (word error rate): https://github.com/Picovoice/wake-word-benchmark#results but I get the feeling PV ran that test in pretty clean conditions.
Pocketsphinx and Snowboy (retired and now released free as open source) are pretty poor KWS engines by today's standards.
Validation of KWS models has state-of-the-art results pushing 98% on benchmark datasets such as the Google Speech Commands set, which deliberately contains a lot of variation and bad data (roughly 10%) so that it works as a benchmark; otherwise every model would just return near 100%.
With custom datasets (record your own) it's relatively easy to reach 100% validation and create noise-tolerant models.
I can have 70 dB of noise (music or voice) and, as long as I raise my voice to a similar level, a modern model such as a CRNN will still detect the keyword.
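That kind of noise tolerance usually comes from training-time augmentation: mixing background noise into the clean keyword recordings at controlled signal-to-noise ratios. A minimal sketch of that idea (the function name and parameters are my own, not from any particular KWS toolkit):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise clip into a speech clip at a target SNR (dB).

    Both inputs are float arrays of equal length, roughly in [-1, 1].
    The noise is rescaled so that
    10 * log10(speech_power / scaled_noise_power) == snr_db.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # scale factor that puts the noise at the requested level below the speech
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    # renormalize only if the mix would clip
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```

During dataset generation you would call this with randomly chosen noise clips and SNRs (e.g. from roughly 0 to 20 dB) so the model sees the keyword under many noise conditions.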
I don't know if all streaming KWS engines run many inferences at 20 ms intervals, but in my tests even a simple sum over the inference envelope, rather than a single snapshot against a threshold, keeps false positives and false negatives extremely low. False negatives are not much of a problem anyway, since often you just repeat the keyword, maybe louder or more clearly. If you train properly and use a more modern model, it will make Porcupine look as antiquated as Porcupine makes Snowboy and Pocketsphinx look.
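The envelope idea above can be sketched very simply: instead of firing when a single 20 ms frame crosses a threshold, average the keyword posterior over a sliding window so a brief spike can't trigger but a sustained detection can. This is my own illustrative version, not code from any specific engine:

```python
import numpy as np

def detect_kw(frame_probs, window=25, threshold=0.85):
    """Decide keyword presence from a stream of per-frame KWS posteriors.

    frame_probs: keyword probability per inference frame
                 (e.g. one inference every 20 ms).
    window:      number of frames to average (25 frames ~ 500 ms here).
    threshold:   minimum windowed average to count as a detection.
    """
    probs = np.asarray(frame_probs, dtype=float)
    if len(probs) < window:
        return False
    # moving average of the posterior envelope over the window
    kernel = np.ones(window) / window
    envelope = np.convolve(probs, kernel, mode="valid")
    return bool(envelope.max() >= threshold)
```

A single-frame spike of 0.99 in an otherwise quiet stream averages out well below the threshold, while a genuine keyword that holds a high posterior across the window triggers reliably, which is why this kind of smoothing suppresses false positives without hurting true detections much.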
Still, the problem from there is that the KW gets recognized but the ASR, not so much.