General usability issues with 2.5.0

I can get good enough accuracy from Kaldi ASR for an utterance spoken a few meters away, with TV playback less than a meter from the ReSpeaker Mic Array v2. The hum of the air conditioning should not impact the accuracy that much.

What firmware have you flashed to the ReSpeaker? The 1-channel one or the 6-channel one?

The 1-channel firmware.

Specifically, I followed the instructions at https://wiki.seeedstudio.com/ReSpeaker_Mic_Array_v2.0/#faq using the

sudo python dfu.py --download 1_channel_firmware.bin

option

This is a good start.

Recording with the 6-channel firmware (without configuring ALSA to limit capture to the AEC channel) mixes the AEC channel with the raw mic channels and badly degrades accuracy.
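For reference, on the 6-channel firmware the processed (AEC) audio is on the first channel and the raw mic channels follow it. If you want to stay on that firmware, a rough sketch (untested here; check your card name with arecord -l) is to record all six channels and keep only the first, for example with sox:

# record all six channels from the array (replace plughw:1,0 with your device)
arecord -D plughw:1,0 -f S16_LE -r 16000 -c 6 raw6.wav
# keep only the first channel (the processed one) for ASR
sox raw6.wav aec.wav remix 1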

Using the 1-channel firmware is much simpler :+1:

If you record an utterance with the ReSpeaker using arecord and upload it in the Rhasspy GUI test interface, do you get the correct intent?
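Something like this should give you a WAV the GUI will accept (Rhasspy expects 16 kHz, 16-bit mono; adjust the device to whatever arecord -l reports for the array):

# record a 16 kHz 16-bit mono test utterance from the ReSpeaker
arecord -D plughw:1,0 -f S16_LE -r 16000 -c 1 test.wav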


Yep, I shudder to think what an STT algorithm would make of multichannel audio.

And, yes, uploading WAV files captured using the same mic with arecord works far more reliably than real time capture.

Using Rhasspy's utterance replay feature, can you check that the first words are correctly recorded when uttering something in real time?

Also, what NLU service are you using? Fsticuffs or fuzzywuzzy?

Fsticuffs, but I can easily try fuzzywuzzy to see if that changes anything.

I’ve never observed a case where recognition failed due to a partial recording. There seem to be two modes: either it works successfully, or it just hangs in the speech recognition phase waiting for audio input as if it is hearing silence, even though I am speaking and the mic LEDs are reporting activity.

I.e. when voice commands fail they do not appear to get as far as intent processing. Rather, no speech gets recorded to start with.

This looks like an issue with webrtc silence detection… maybe the air conditioning hum is giving the silence detection a hard time…

Some users on this forum have tinkered with webrtc VAD parameters in Rhasspy configuration. Maybe this can help.
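If you want to see how the VAD reacts to the hum outside of Rhasspy, here is a minimal sketch using the webrtcvad Python package (the same library Rhasspy's silence detection builds on; the file name is just an example — record a clip of the room noise with arecord as above):

import wave

import webrtcvad  # pip install webrtcvad

# Aggressiveness goes from 0 (least strict) to 3 (most strict about what counts as speech)
vad = webrtcvad.Vad(2)

# A capture of just the room noise: must be 16 kHz, 16-bit, mono
wav = wave.open("room_noise.wav", "rb")
assert wav.getframerate() == 16000 and wav.getnchannels() == 1 and wav.getsampwidth() == 2

FRAME_MS = 30  # webrtcvad only accepts 10, 20 or 30 ms frames
frame_bytes = int(16000 * FRAME_MS / 1000) * 2  # 2 bytes per 16-bit sample

audio = wav.readframes(wav.getnframes())
speech = total = 0
for i in range(0, len(audio) - frame_bytes + 1, frame_bytes):
    total += 1
    if vad.is_speech(audio[i:i + frame_bytes], 16000):
        speech += 1

# Many "speech" frames in a recording of the hum alone means the VAD is being fooled
print(f"{speech}/{total} frames flagged as speech")

If the hum alone trips it even at aggressiveness 3, no amount of Rhasspy tuning will help much, and treating the noise at the source (or with the mic's filters) is probably the way to go.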

Thanks @fastjack for your suggestions.

Fiddling with ReSpeaker settings for things like automatic gain control and noise cancellation seems to be having some positive effect as well. Not quite to the point yet where I would unleash this on unwary family members, but still definitely excited about trying to get this setup to work.

Yes, I have also noticed that in some situations AGC and DSP filters in general can alter the audio and lead to poorer recognition.

I’ve used the ReSpeaker Mic Array v2 a lot and never had to change the default configuration to get it to work correctly though (never used it with air conditioning nearby either… so… :hugs:).

Maybe the webrtc VAD is being calibrated with too much noise from the air conditioning. Some noises can be wrongly detected as speech, and the far-field sensitivity of the ReSpeaker may be good enough that it captures the hum. The VAD then never detects the silence needed to mark the end of the utterance, so it hangs in the speech recognition listening mode… The dialogue manager will eventually kill the session, resulting in no intent detected at all.

I’d be curious to know if turning off the air conditioning helps… Does it still happen in a quiet room?

Can you post the logs of a failed session?

Also, this thread might be very useful for you: Change the Wake WAVs

Thanks again, @HorizonKane and @fastjack for the tips.

Slowly seeing some improvement with a good deal of experimentation. Not all the way there yet, but can see that this tunnel might actually have some light at its end. :slight_smile:

The three things that seemed to have helped the most are:

  1. The suggestion to add
"command": {
    "webrtcvad": {
        "throwaway_buffers": 0
    }
},
"rhasspy": {
    "listen_on_start": true
}

to profile.json, as described in the thread referenced by @HorizonKane

  2. Moving the mic to a different location (which, counter-intuitively, seems as if it would be more subject to noise than my office desk but seems to have improved things – go figure!)

  3. Learning to speak in ways in which the Porcupine / Kaldi combo seems to perform better

Keys to 3. include speaking loudly, slowly, and with a noticeable caesura between the wake word and the command. This is in line with what other users have reported, including the suggestion by some to add an extra word at the beginning of the command utterance.

This is not actually different in kind from using Google Assistant or Alexa, just in degree. I am hopeful that with more profile and mic settings tweaking I might reduce that performance gap to the point where my husband might agree to let me wean us off Google once and for all. :crossed_fingers:

My biggest concern is that I do continue to see cases, similar to other users’ reports, where once the dialog management flow goes awry, it can stay in a problematic state for some time. During my experimentation I have found that Rhasspy sometimes gets into a state where I have to let it run for a time without issuing voice commands before it settles down to the point of being able to go through the wake word -> speech recognition -> intent processing cycle successfully, even when it has been running for a while and had earlier responded well.

I.e. I still think that the overall dialog management flow could use some hardening.

Thanks again, all!

-Kirk

I am talking quite naturally with Rhasspy and it's working fine. Did you experiment with the audio gain value mentioned in the thread?

You might also want to try snowboy as the wake word; it has really good performance when you record a good wake word with Audacity, as described in the other thread.

The problem I’ve experienced has never been with wake word detection. That has always been very reliable in my experience. The problem has been with the subsequent STT stage. In fact, I experience identical behavior using both wake word processing and manually clicking the “Wake Up” button.

Do you have an audio output service running?

Do you hear the feedback sounds playing when you say the wake word?

It might be caused by a timeout issue between wake word detection and the ASR listening phase.

The ASR only starts listening after the feedback sound has finished playing. If no audio output service is running, nothing can send the playback-finished event, and the dialogue manager has to wait some amount of time before assuming it can safely request the ASR to listen.
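One way to check this is to watch the Hermes topics on the broker while saying the wake word (assuming the standard Hermes topic layout and that the mosquitto clients are installed):

# watch feedback-sound completion and dialogue manager traffic
mosquitto_sub -v -t 'hermes/audioServer/+/playFinished' -t 'hermes/dialogueManager/#'

A long gap between the wake word detection and the playFinished message would point at the audio output side.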

We’ll need to see the logs of a failed session (with the timestamps) to hopefully find what is causing this.

Can you post the full logs of a failed session so we can have a look?

Cheers :blush:

Sorry for the delay in responding. Just returned from spending the US Independence Day holiday out of town.

I use the audio output of the ReSpeaker mic array. I do hear the feedback sounds. Part of the “training users to speak the way Rhasspy understands” that I referred to previously is waiting a small beat after the acknowledgement beep, but not so long that it times out before the command is fully issued.

I’ll try to capture some logs with various failed test scenarios over the next few days. In the meantime, after updating to the latest Docker image as of the afternoon of July 8, 2020, I can report that after only a few tests:

  1. I still have to speak loudly and distinctly to get a wake word or command utterance recognized

  2. Using kaldi and fsticuffs, it seems pretty good at recognizing commands that match entries in sentences.ini

  3. However, it seems pretty liberal about “recognizing” invalid commands (false positive speech recognition), with the result that it performs fairly random actions, which might be rather unfortunate in some cases (i.e. fsticuffs doesn’t seem a lot more strict than fuzzywuzzy, which I would have expected it to be)

For example, “Do something wonderful” (not in sentences.ini) was recognized as “set guest nightstand one purple” (which is in sentences.ini). Had someone been staying with us, they might have gotten a bit of a surprise. :slight_smile:

Here is the log for the case where I would have expected “not recognized” but got a false positive:

[DEBUG:2020-07-08 19:41:57,717] rhasspyserver_hermes: Sent 825 char(s) to websocket
[DEBUG:2020-07-08 19:41:57,712] rhasspyserver_hermes: <- NluIntent(input='set 26 purple', intent=Intent(intent_name='ChangeLightColor', confidence_score=1.0), site_id='cheznous', id=None, slots=[Slot(entity='lights', value={'kind': 'Unknown', 'value': '26'}, slot_name='light', raw_value='guest nightstand one', confidence=1.0, range=SlotRange(start=4, end=6, raw_start=4, raw_end=24)), Slot(entity='colors', value={'kind': 'Unknown', 'value': 'purple'}, slot_name='color', raw_value='purple', confidence=1.0, range=SlotRange(start=7, end=13, raw_start=25, raw_end=31))], session_id='cheznous-porcupine-7673b211-15cb-4b8f-8692-6dc0afe5a553', custom_data=None, asr_tokens=[[AsrToken(value='set', confidence=1.0, range_start=0, range_end=3, time=None), AsrToken(value='26', confidence=1.0, range_start=4, range_end=6, time=None), AsrToken(value='purple', confidence=1.0, range_start=7, range_end=13, time=None)]], asr_confidence=None, raw_input='set guest nightstand one purple', wakeword_id='porcupine', lang=None)
[DEBUG:2020-07-08 19:41:52,430] rhasspyserver_hermes: <- HotwordDetected(model_id='/usr/lib/rhasspy/rhasspy-wake-porcupine-hermes/rhasspywake_porcupine_hermes/porcupine/resources/keyword_files/raspberrypi/porcupine.ppn', model_version='', model_type='personal', current_sensitivity=0.65, site_id='cheznous', session_id=None, send_audio_captured=None, lang=None)

Of course, all the log shows is it working as expected, just as if I had actually told it to set the given light to the given color.

On the up side, I can report that bool conversion does now seem to work as expected. :+1:

This is a known issue. Capturing the “none” intent with a limited language model is kind of a challenge. Introducing some noise into the language model might help to avoid triggering an incorrect intent, but might worsen accuracy. It is still a work in progress.
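One partial mitigation, if I remember the profile keys correctly (this is from memory, so please check the docs): give the base language model a small mix weight so that out-of-grammar speech has somewhere to go other than your intents, at the cost of some accuracy and training time:

"speech_to_text": {
    "system": "kaldi",
    "kaldi": {
        "mix_weight": 0.05
    }
}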

See:

Using the latest and greatest Docker image as of the time of this writing and with a bit more fiddling with mic placement, I have not (so far) had a repetition of the dialog management issues I reported previously.

As a final summary, the configuration that seems to be running well for me on a Pi 4 (8 GB):

  • PyAudio input using a ReSpeaker Mic Array V2 (USB)
  • aplay output using a Logitech Z50 self-powered speaker connected to the 3.5 mm output of the mic array
  • Mycroft Precise (though Porcupine seems to work at least as well)
  • Kaldi
  • Fsticuffs

I currently have internal Rhasspy intent processing turned off, but can imagine it coming in handy for some scenarios in the future.

For now, I have a rather small sentences.ini that references rather larger slots files to map things like Hue device, group, and light names to their underlying Hue API IDs, and similarly for friendly-name-to-value-ID mapping for a number of Z-Wave devices. Intents get shipped off as JSON to an external Mosquitto instance used by Node-RED.
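For anyone curious, the shape of it is roughly this (entries invented for illustration; the real slots files are much longer, and the spoken-name-to-ID substitution syntax is my best recollection, so double-check it against the docs):

sentences.ini:

[ChangeLightColor]
set ($lights){light} ($colors){color}

slots/lights:

(guest nightstand one):26
(guest nightstand two):27

slots/colors:

purple
red
blue

With this, “set guest nightstand one purple” comes through as intent ChangeLightColor with light = 26 and color = purple, which is what you can see in the log above.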

I intend to experiment soon with setting up a master STT / intent processing server with a number of Pi 3 and Pi 4 satellites in various rooms doing the local audio I/O and wake word processing. Toujours l’audace! :slight_smile:

As I start to replicate the hardware setup described above in the bedrooms, it would be nice to keep both the cost and the device footprint a bit smaller.

I wonder if anyone has had success using a more modestly priced integrated conferencing mic / speaker “puck” in place of the ReSpeaker / external speaker combo?

Thanks again for all the support.

-Kirk

Nice!

Regarding mic setup:

If you do not need audio playback (music, radio, etc.) directly through the box, you can use the $10 ReSpeaker 2-Mics Pi HAT. It has pretty good far-field capability for its price. It also has LEDs and a small button on top.

For a more “hardcore” approach, you can take a look at @rolyan_trauts’ posts on this forum.

Following his lead, I’ve just ordered some small MEMS I2S mic boards (<$4 each) to test their far-field capability with software AEC, to eventually replace the $65 ReSpeaker Mic Array.

Cheers

I’m really interested in that. Please do share your findings :slight_smile:

I’m also a user of that same ReSpeaker USB Mic Array. Maybe a little off-topic here, but how have you configured it? I mean the firmware parameters.

I experimented with a number of parameters related to things like AGC. In the end, I didn’t find that any of them had much bearing on the visible behavior in Rhasspy. I’ve been running for a while with all factory default settings, except CNIONOFF 0 (i.e. comfort noise insertion disabled). Overall, I found that mic placement had more effect than any of its settings.
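For the record, I set that with the tuning script from Seeed's usb_4_mic_array repository (run it with just the parameter name to read the current value):

# read the current comfort-noise setting
python tuning.py CNIONOFF
# disable comfort noise insertion
python tuning.py CNIONOFF 0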

“Your mileage may vary,” of course. :wink:
