First, a huge thanks to @synesthesiam for this amazing project! Hopefully the issues I am experiencing are either due to misconfiguration by a newbie (me) or work-in-progress bugs with this release.
After a couple of days I have reached the point where everything works perfectly, except for the slight matter of reliably issuing voice commands. (There is an old joke among doctors: “The operation was a success and then the patient died.”)
There appear to be two stages where things are going wrong:

1. Dialog management – the hand-off from the wake-word stage to the speech recognition stage seems fragile
2. Speech recognition – very unreliable real-time speech recognition (may be related to 1, see below)
Speech recognition when uploading pre-recorded WAV files using the same mic and listening conditions works better than speech recognition triggered either manually using Rhasspy’s “Wake Up” button or as a result of wake-word processing. So maybe there is something about overall dialog management that is giving the appearance of unreliable STT?
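For anyone trying to reproduce the comparison: before uploading, I sanity-check that my pre-recorded WAVs are in the 16 kHz / 16-bit / mono PCM format the recognizers expect, using a small stdlib helper (the function name and defaults are my own, not part of Rhasspy):

```python
import wave

def check_wav(path, rate=16000, width=2, channels=1):
    """Return (ok, (framerate, sample_width, channels)) for a WAV file,
    where ok is True iff it matches the expected PCM format."""
    with wave.open(path, "rb") as w:
        params = (w.getframerate(), w.getsampwidth(), w.getnchannels())
        return params == (rate, width, channels), params
```

That way I know any difference between uploaded and live recognition isn't just a format mismatch.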
Wake-word recognition is very reliable even under less-than-ideal listening conditions using either Mycroft Precise or Porcupine. However, Mycroft Precise often seems to trigger dialog management twice in rapid succession for a single utterance. That, in turn, seems to cause issues with speech recognition. I have never experienced that with Porcupine (so far, but it's early days!)
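If the double triggers turn out to be the culprit, my fallback plan is to debounce the wake events myself on the MQTT side (e.g. in a Node-RED flow or a small script subscribed to the wake-word topics). This is just a sketch of the idea, not anything Rhasspy provides; the class name and two-second window are my own choices:

```python
import time

class WakeDebouncer:
    """Ignore wake triggers arriving within `window` seconds of the last
    accepted one, so a single utterance can't start two dialog sessions."""

    def __init__(self, window=2.0, clock=time.monotonic):
        self.window = window
        self.clock = clock       # injectable for testing
        self._last = None        # timestamp of last accepted trigger

    def accept(self):
        """Return True if this trigger should be acted on, False if it
        falls inside the debounce window and should be dropped."""
        now = self.clock()
        if self._last is not None and now - self._last < self.window:
            return False
        self._last = now
        return True
```

The handler would only forward a wake event downstream when `accept()` returns True.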
Speech recognition using Pocketsphinx seems rather hit-or-miss at the best of times. Speech recognition with Kaldi seems pretty reliable when operating on uploaded WAV files, even when they were recorded with a certain amount of background noise. Kaldi is also reasonably reliable when operating in real time and in conjunction with Porcupine when the listening conditions are absolutely ideal. However, the tiniest bit of background noise (e.g. the low hum of central air conditioning) seems to prevent it from working. The reliability is low enough in real world conditions that I could not switch my household from Google Assistant to Rhasspy, as dearly as I would love to!
My set-up:
- Docker container built from rhasspy/rhasspy:latest
- Raspberry Pi 4B (production) & Ubuntu laptop (testing / staging)
- ReSpeaker Mic Array V2.0 (production) & built-in laptop mic (testing / staging)
I get effectively identical results on both the Pi and the much more powerful laptop, so it seems as if the issue is somewhere in the software stack rather than simply a matter of the Pi not having enough horsepower.
I am trying to understand the many configuration options for the ReSpeaker mic array (which I have had in my possession for a whopping 18 hours at the time of this writing, some of which I spent sleeping) to see if its built-in audio processing could help with the background noise, but no luck so far. For the record, I have managed to configure it so that WAV files recorded using arecord sound acceptably good to human ears, but that doesn't seem to have helped Rhasspy much when operating in real time.
To be clear, we are not talking about a machine room or the like. The level of background noise that is causing a problem is that of a normal home central air conditioning unit that produces a “hum” that isn’t terribly intrusive to human ears but appears to be enough to completely stymie Rhasspy’s core feature set.
Here is profile.json from the Pi:
{
    "command": {
        "webrtcvad": {
            "vad_mode": "1"
        }
    },
    "dialogue": {
        "system": "rhasspy"
    },
    "intent": {
        "system": "fsticuffs"
    },
    "microphone": {
        "command": {
            "record_arguments": "udpsrc port=12333 ! rawaudioparse use-sink-caps=false format=pcm pcm-format=s16le sample-rate=16000 num-channels=1 ! queue ! audioconvert ! audioresample ! filesink location=/dev/stdout",
            "record_program": "gstreamer"
        },
        "pyaudio": {
            "device": "0"
        },
        "system": "pyaudio"
    },
    "mqtt": {
        "enabled": "true",
        "host": "office-pi4",
        "site_id": "office-pi4"
    },
    "sounds": {
        "aplay": {
            "device": "plughw:CARD=b1,DEV=0"
        },
        "system": "aplay"
    },
    "speech_to_text": {
        "system": "kaldi"
    },
    "text_to_speech": {
        "system": "espeak"
    },
    "wake": {
        "system": "porcupine"
    }
}
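One knob I plan to experiment with for the background-noise problem is the webrtcvad aggressiveness: if I understand the library correctly, it accepts modes 0–3, with higher values filtering non-speech more aggressively, so bumping `vad_mode` up from 1 might help it ignore the A/C hum. For example:

```json
{
    "command": {
        "webrtcvad": {
            "vad_mode": "3"
        }
    }
}
```

I have not verified yet whether this is the right fix, so corrections welcome.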
Note that the external MQTT broker is a Mosquitto instance I already use with other bits of my home automation system including Node-RED. When intent processing is successful, the rest of the features work extremely well with my overall home automation setup.
Any advice from more seasoned Rhasspy users as to where to look in the configuration to improve the stability of dialog management and the reliability of speech recognition would be greatly appreciated!
Thanks,
Kirk (a.k.a. parasaurolophus)