General usability issues with 2.5.0

parasaurolophus · July 1, 2020, 7:03pm

First, a huge thanks to @synesthesiam for this amazing project! Hopefully the issues I am experiencing are either due to misconfiguration by a newbie (me) or work-in-progress bugs with this release.

After a couple of days I have reached the point where everything works perfectly, except for the slight matter of reliably issuing voice commands. (There is an old joke among doctors: “The operation was a success and then the patient died.”)

There appear to be two stages where things are going wrong:

Dialog management – hand off from the wake word to speech recognition stages seems fragile
Speech recognition – very unreliable real-time speech recognition (may be related to 1, see below)

Speech recognition when uploading pre-recorded WAV files using the same mic and listening conditions works better than speech recognition triggered either manually using Rhasspy’s “Wake Up” button or as a result of wake-word processing. So maybe there is something about overall dialog management that is giving the appearance of unreliable STT?

Wake-word recognition is very reliable even under less-than-ideal listening conditions using either Mycroft Precise or Porcupine. However, when using Mycroft it seems often to trigger dialog management twice in rapid succession for a single utterance. That, in turn, seems to cause issues with speech recognition. I have never experienced that with Porcupine (so far, but early days!)

Speech recognition using Pocketsphinx seems rather hit-or-miss at the best of times. Speech recognition with Kaldi seems pretty reliable when operating on uploaded WAV files, even when they were recorded with a certain amount of background noise. Kaldi is also reasonably reliable when operating in real time and in conjunction with Porcupine when the listening conditions are absolutely ideal. However, the tiniest bit of background noise (e.g. the low hum of central air conditioning) seems to prevent it from working. The reliability is low enough in real world conditions that I could not switch my household from Google Assistant to Rhasspy, as dearly as I would love to!

My set-up:

Docker container built from rhasspy/rhasspy:latest
Raspberry Pi 4B (production) & Ubuntu laptop (testing / staging)
ReSpeaker Mic Array V2.0 (production) & built-in laptop mic (testing / staging)

I get effectively identical results on both the Pi and the much more powerful laptop, so it seems as if the issue is somewhere in the software stack rather than simply a matter of the Pi not having enough horse power.

I am trying to understand the many configuration options for the ReSpeaker mic array (which I have had in my possession for a whopping 18 hours at the time of this writing, some of which I was sleeping ) to see if its built-in audio processing could help with the background noise but no luck so far. For the record, I have managed to configure it so that WAV files recorded using arecord sound acceptably good to human ears, but that doesn’t seem to have helped Rhasspy much when operating in real time.

To be clear, we are not talking about a machine room or the like. The level of background noise that is causing a problem is that of a normal home central air conditioning unit that produces a “hum” that isn’t terribly intrusive to human ears but appears to be enough to completely stymie Rhasspy’s core feature set.

Here is profile.json from the Pi:

{
    "command": {
        "webrtcvad": {
            "vad_mode": "1"
        }
    },
    "dialogue": {
        "system": "rhasspy"
    },
    "intent": {
        "system": "fsticuffs"
    },
    "microphone": {
        "command": {
            "record_arguments": "udpsrc port=12333 ! rawaudioparse use-sink-caps=false format=pcm pcm-format=s16le sample-rate=16000 num-channels=1 ! queue ! audioconvert ! audioresample ! filesink location=/dev/stdout",
            "record_program": "gstreamer"
        },
        "pyaudio": {
            "device": "0"
        },
        "system": "pyaudio"
    },
    "mqtt": {
        "enabled": "true",
        "host": "office-pi4",
        "site_id": "office-pi4"
    },
    "sounds": {
        "aplay": {
            "device": "plughw:CARD=b1,DEV=0"
        },
        "system": "aplay"
    },
    "speech_to_text": {
        "system": "kaldi"
    },
    "text_to_speech": {
        "system": "espeak"
    },
    "wake": {
        "system": "porcupine"
    }
}

Note that the external MQTT broker is a Mosquitto instance I already use with other bits of my home automation system including Node-RED. When intent processing is successful, the rest of the features work extremely well with my overall home automation setup.

Any advice from more seasoned Rhasspy users as to where to look in the configuration to improve the stability of dialog management and the reliability speech recognition would be greatly appreciated!

Thanks,
Kirk (a.ka. parasaurolophus)

fastjack · July 1, 2020, 7:15pm

I can use Kaldi ASR with good enough accuracy for an utterance from a distance of a few meters with TV playback less than a meter away from the Respeaker Mic Array v2. The hum of the air conditioning should not impact the accuracy that much.

What firmware have you flashed to the Respeaker? The 1 channel one or the 6 channels one?

parasaurolophus · July 1, 2020, 7:18pm

1 channel firmware.

Specifcally, I followed the instructions at https://wiki.seeedstudio.com/ReSpeaker_Mic_Array_v2.0/#faq using the

sudo python dfu.py --download 1_channel_firmware.bin

option

fastjack · July 1, 2020, 7:43pm

This is a good start.

Recording from the 6 channels firmware (without configuring ALSA to limit to the AEC channel) mix the AEC channel with the raw mic channels and highly degrade accuracy.

Using the 1 channel firmware is much more simple

If you record an utterance with the Respeaker using arecord and upload it in Rhasspy GUI test interface do you get the correct intent?

parasaurolophus · July 1, 2020, 8:09pm

Yep, I shudder to think what a STT algorithm would make of multichannel audio.

And, yes, uploading WAV files captured using the same mic with arecord works far more reliably than real time capture.

fastjack · July 1, 2020, 8:14pm

Using Rhasspy utterance replay feature, can you check that the first words are correctly recorded when uttering something in real time?

Also what NLU service are you using? Fsticuff or fuzzywuzzy?

parasaurolophus · July 1, 2020, 8:32pm

Fsticuff but I can easily try fuzzywuzzy to see if that changes anything.

I’ve never observed a case where recognition failed due to a partial recording. There seem to be two modes: either it works successfully, or it just hangs in the speech recognition phase waiting for audio input as if it is hearing silence even though I am speaking and the mic led’s are reporting activity.

I.e. when voice commands fail they do not appear to get as far as intent processing. Rather, no speech gets recorded to start with.

fastjack · July 1, 2020, 8:44pm

This looks like an issue with webrtc silence detection… maybe the air conditioning hum is giving the silence detection a hard time…

Some users on this forum have tinkered with webrtc VAD parameters in Rhasspy configuration. Maybe this can help.

parasaurolophus · July 1, 2020, 9:19pm

Thanks @fastjack for your suggestions.

Fiddling with respeaker settings for things like automatic gain control and noise cancellation seems to be having some positive effect, as well. Not quite to the point yet where I would unleash this on unwary family members, but still definitely excited about trying to get this set up to work.

fastjack · July 1, 2020, 9:41pm

Yes, I also noticed that in some situation AGC and DSP filters in general can alter the audio and lead to poorer recognition.

I’ve used the Respeaker Mic Array v2 a lot and never had to change the default configuration to get it to work correctly though (never used it with air conditioning nearby either… so… ).

Maybe webrtc VAD is calibrated with too much noise from the air conditioning. Some noises can be wrongly detected as speech and the far field sensibility of the Respeaker may be so good that it capture the hum. Then the VAD does not detect the real speech loudly enough to trigger the end of the utterance, hanging in speech recognition listening mode… The dialogue manager will eventually kill the session resulting in no intent detected at all.

I’d be curious to know if turning off the air conditioning helps… Does it still happen in a quiet room?

Can you post the logs of a failed session?

Also: Change the Wake WAVs

HorizonKane · July 2, 2020, 7:41am

This thread:

might be very useful for you.

parasaurolophus · July 2, 2020, 7:32pm

Thanks again, @HorizonKane and @fastjack for the tips.

Slowly seeing some improvement with a good deal of experimentation. Not all the way there yet, but can see that this tunnel might actually have some light at its end.

The three things that seemed to have helped the most are

The suggestion to add

"command": {
    "webrtcvad": {
        "throwaway_buffers": 0
    }
},
"rhasspy": {
    "listen_on_start": true
}

to profile.json contained in the thread referenced by @HorizonKane

Moving the mic to a different location (which, counter-intuitively, seems as if it would be more subject to noise than my office desk but seems to have improved things – go figure!)
Learning to speak in ways in which the Porcupine / Kaldi combo seem to perform better

Keys to 3. include speaking loudly, slowly, and with a noticeable caesura between the wake word and the command. This is in line with what other users have reported including the suggestion by some to include an extra word to the beginning of the command utterance.

This is not actually different in kind from using Google Assistant or Alexa, just in degree. I am hopeful that with more profile and mic settings tweaking I might reduce that performance gap to the point where my husband might agree to let me wean us off Google once and for all.

My biggest concern is that I do continue to see cases similar to other users’ reports that once the dialog management flow goes awry, it can stay in a problematic state for some time. During my experimentation I have found that Rhasspy sometimes gets into a state where I have to let it run for a time without trying to issue voice commands before it settles down to the point of being able go through the wake word -> speech recognition -> intent processing cycle successfully even after it has been running for a while and had earlier responded well.

I.e. I still think that the overall dialog management flow could use some hardening.

Thanks again, all!

-Kirk

HorizonKane · July 2, 2020, 9:10pm

I am talking quite naturally with rhasspy and its working fine. Did you experiment with the audio gain value mentioned in the thread?

You might also want to try snowboy as wakeword it has really good performance when you record a hood wakeword with audacity like described in the other thread.

parasaurolophus · July 2, 2020, 10:36pm

The problem I’ve experienced has never been with wake word detection. That has always been very reliable in my experience. The problem has been with the subsequent STT stage. In fact, I experience identical behavior using both wake word processing and manually clicking the “Wake Up” button.

fastjack · July 3, 2020, 12:52am

Do you have an audio output service running?

Do you hear the feedback sounds playing when you say the wake word?

It might be caused by a timeout issue between wake word detection and the ASR listening phase.

The ASR only starts listening after the feedback sound has finished playing. If no audio output is running it cannot send the playback finished event and the dialogue manager has to wait some amount of time before assuming it can safely request the ASR to listen.

We’ll need to see the logs of a failed session (with the timestamps) to hopefully find what is causing this.

Can you post the full logs of a failed session so we can have a look?

Cheers

parasaurolophus · July 9, 2020, 3:07am

Sorry for the delay in responding. Just returned from spending the US Independence Day holiday out of town.

I use the audio output of the ReSpeaker mic array. I do hear the feedback sounds. Part of “training users to speak the way Rhasspy understands” to which I referred previously is to wait a small beat after the acknowledgement beep but not so long that it times out before fully issuing a command.

I’ll try to capture some logs with various failed test scenarios over the next few days. In the meantime, after updating to the latest Docker image as of the afternoon of July 8, 2020 I can report that after only a few tests:

I still have to speak loundly and distinctly to get a wake word or command utterance recognized
Using kaldi and fsticffs, it seems pretty good at recognizing commands that match entries in sentences.ini
However, it seems pretty liberal about “recognizing” invalid commands (false positive speech recognition) with the result that it performs fairly random actions which might be rather unfortunate in some cases (i.e. fstifuffs doesn’t seem a lot more strict than fuzzywuzzy, which I would have expected it to be)

For example, “Do something wonderful” (not in sentences.ini) was recognized as “set guest nightstand one purple” (which is in sentences.ini). Had someone been staying with us, they might have gotten a bit of a surprise.

Here is the log for the case where I would have expected “not recognized” but got a false positive:

[DEBUG:2020-07-08 19:41:57,717] rhasspyserver_hermes: Sent 825 char(s) to websocket
[DEBUG:2020-07-08 19:41:57,712] rhasspyserver_hermes: <- NluIntent(input='set 26 purple', intent=Intent(intent_name='ChangeLightColor', confidence_score=1.0), site_id='cheznous', id=None, slots=[Slot(entity='lights', value={'kind': 'Unknown', 'value': '26'}, slot_name='light', raw_value='guest nightstand one', confidence=1.0, range=SlotRange(start=4, end=6, raw_start=4, raw_end=24)), Slot(entity='colors', value={'kind': 'Unknown', 'value': 'purple'}, slot_name='color', raw_value='purple', confidence=1.0, range=SlotRange(start=7, end=13, raw_start=25, raw_end=31))], session_id='cheznous-porcupine-7673b211-15cb-4b8f-8692-6dc0afe5a553', custom_data=None, asr_tokens=[[AsrToken(value='set', confidence=1.0, range_start=0, range_end=3, time=None), AsrToken(value='26', confidence=1.0, range_start=4, range_end=6, time=None), AsrToken(value='purple', confidence=1.0, range_start=7, range_end=13, time=None)]], asr_confidence=None, raw_input='set guest nightstand one purple', wakeword_id='porcupine', lang=None)
[DEBUG:2020-07-08 19:41:52,430] rhasspyserver_hermes: <- HotwordDetected(model_id='/usr/lib/rhasspy/rhasspy-wake-porcupine-hermes/rhasspywake_porcupine_hermes/porcupine/resources/keyword_files/raspberrypi/porcupine.ppn', model_version='', model_type='personal', current_sensitivity=0.65, site_id='cheznous', session_id=None, send_audio_captured=None, lang=None)

Of course, all the log shows is it working as expected had I actually told it to set the given light to the given color.

On the up side, I can report that bool conversion does now seem to work as expected.

fastjack · July 9, 2020, 4:25am

This is a known issue. Capturing the « none » intent with a limited language model is kind of a challenge. Introducing some noise into the language model might help to avoid triggering an incorrect intent but might worsen accuracy. It is still a work in progress.

See:

parasaurolophus · July 9, 2020, 5:29pm

Using the latest and greatest Docker image as of the time of this writing and with a bit more fiddling with mic placement, I have not (so far) had a repetition of the dialog management issues I reported previously.

As a final summary, the configuration that seems to be running well for me on a Pi 4 (8gb):

PyAudio input using a ReSpeaker Mic Array V2 (USB)
aplay output using a Logitech Z50 self-powered speaker connected to the 3.5 output of the mic array
Mycroft Precise (though Porcupine seems to work at least as well)
Kaldi
Fsticuffs

I currently have internal Rhasspy intent processing turned off, but can imagine it coming in handy for some scenarios in the future.

For now, I have a rather small sentences.ini that references rather larger slots files to map things like Hue device, group and light names to their underlying Hue API id’s. Similarly for friendly-name-to-value-id mapping for a number of Z-Wave devices. Intents get shipped off as JSON to an external Mosquitto instance used by Node-RED.

I intend to experiment soon with setting up a master STT / intent processing server with a number of Pi 3 and Pi 4 satellites in various rooms doing the local audio I/O and wake word processing. Toujours l’audace!

As I start to replicate the hardware setup described above in the bedrooms, it would be nice to keep both the cost and the device footprint a bit smaller.

I wonder if anyone has had success using a more modestly priced integrated conferencing mic / speaker “puck” in place of the ReSpeaker / external speaker combo?

Thanks again for all the support.

-Kirk

fastjack · July 9, 2020, 6:19pm

Nice !

Regarding mic setup:

If you do not need audio playback (music, radio, etc) directly through the box, you can use the 10$ Respeaker 2 mics Hat. It has pretty good far field capacity for its price. It also has LEDs and a small button on top.

For a more « hardcore » approach, you can take a look at @rolyan_trauts posts on this forum.

Following his lead, I’ve just ordered small MEMS I2S mic boards (<4$ each) to test their far field capacity with software AEC and eventually replace the 65$ Respeaker Mic Array.

Cheers

daniele_athome · July 10, 2020, 6:54pm

I’m really interested in that. Please do share your findings

I’m also a user of that same Respeaker USB Mic Array. Maybe a little offtopic here, but how have you configured it? I mean the firmware parameters.