Reasons for slow intent execution?

Hey there everyone! Lately I’ve been trying to localize all my smart home stuff, and being able to execute basic voice commands (starting with just turning lights on and off, and eventually doing things like timers) is pretty important, as I currently use the cloud versions (Google/Alexa) of those functions quite a bit. Last night, through a good bit of trial and error, I finally got my first voice command to execute (turning a Z-Wave outlet with a lamp connected to it on and off), which was exciting! Problem is, it consistently took about ten seconds between when the command was given and when it actually executed, which rides the line between inconvenient and unusable.

Reading through tutorials, posts, etc., it seems there’s sometimes conflicting info about best practices, which is to be expected with such a quickly developing product (seriously, major kudos to everyone working on this project). I was hoping, then, that someone could take a look at my setup and let me know if there are other settings or workflow tweaks I could make to get it working better.

Setup:

  • Home server (i3-13100, 32 GB RAM, NVMe as the main/boot drive) with Ubuntu 23.04 desktop and an updated kernel, running Home Assistant, Mosquitto, zwavejs2mqtt, Rhasspy, and some other stuff in their own Docker containers. Home Assistant is version 2023.7.3; Rhasspy is version 2.5.11.

  • Not using any satellites for now - figured I’d get one machine working before trying multiple.

  • Using an external MQTT broker.

  • Using arecord for sound recording. The device is “default:CARD=CMTECK”, a cheap USB microphone plugged into the host machine, with no UDP settings entered. I couldn’t get PyAudio to pick up any working devices for some reason.

  • Using Porcupine for wake word handling. Default settings.

  • Using Kaldi for voice to text. Default settings.

  • Using fsticuffs for intent recognition. Default settings.

  • Using Rhasspy for dialogue management. Default settings.

  • Edit: text to speech and audio playing both disabled since I don’t have a speaker attached.

  • Using Home Assistant for intent handling, with an API access token and “Send intents to Home Assistant (/api/intent/handle)” selected. (A rough sketch of how this all looks in profile.json is below.)
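For reference, here’s roughly what I think the relevant parts of my profile.json look like. The angle-bracket values are placeholders and the // comments are just my annotations (not valid JSON), so don’t copy it verbatim; double-check against what the web UI actually wrote to your profile:

```json
{
  "microphone": {
    "system": "arecord",
    "arecord": { "device": "default:CARD=CMTECK" }   // the cheap USB mic
  },
  "wake": { "system": "porcupine" },
  "speech_to_text": { "system": "kaldi" },
  "intent": { "system": "fsticuffs" },
  "dialogue": { "system": "rhasspy" },
  "handle": { "system": "hass" },                    // hand intents to Home Assistant
  "home_assistant": {
    "url": "http://<ha-host>:8123",                  // placeholder
    "access_token": "<long-lived-access-token>"      // placeholder
  },
  "mqtt": {
    "enabled": "true",                               // external broker (the Mosquitto container)
    "host": "<broker-host>",
    "port": "1883"
  }
}
```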

A few other notes:

  • When it picks it up (I’ll need to do some fine tuning on sensitivity, and maybe look at a better mic, I think), recognizing the text and generating an intent seems to take only a couple of seconds. The big pause seems to be after there’s an intent and before the action actually happens. This was wrong, see the next post: it appears to be speech to text that’s the problem.

  • Turning the light on and off directly through home assistant rather than through voice is instant.

  • I have Node-RED installed as a Docker instance (I thought I’d need it before realizing that Rhasspy could pass basic intents directly to Home Assistant), but it doesn’t have anything deployed.

If anyone can recommend any different settings or some troubleshooting steps, that would be super appreciated! I’m away from home now but hoping to build a little list of things to try.

Update - it turns out I was wrong about what was slow. It’s actually the voice to text part.

  • If I type in the command “turn on corner lamp” and hit Recognize with the “Handle” [lightning bolt] option checked, it happens pretty much immediately.

  • If I speak the command “snowboy, turn on the corner lamp” (using the Porcupine system, but with “snowboy” as the wake word), it takes about 8 seconds from the time I finish speaking. (One way to see where that time goes is the MQTT timing check below.)
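Since everything in Rhasspy 2.5 rides on MQTT (the Hermes protocol), one thing I plan to try is watching the pipeline topics with timestamps, to see whether the 8 seconds is spent before hermes/asr/textCaptured (silence detection plus transcription) or after it (intent recognition and handling). A sketch, assuming the Mosquitto container is reachable on localhost:1883:

```sh
# Watch the wake/ASR/NLU/intent stages and prefix each message with a timestamp.
# hermes/audioServer/# is skipped on purpose: it streams raw audio frames and floods the output.
mosquitto_sub -h localhost -p 1883 -v \
  -t 'hermes/hotword/#' -t 'hermes/asr/#' -t 'hermes/nlu/#' -t 'hermes/intent/#' |
while IFS= read -r line; do
  printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
done
```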

I think the next thing to try is messing with the voice to text settings. Trying Pocketsphinx and Mozilla DeepSpeech didn’t work at all.

I feel like there could be two things happening: (1) it’s taking too long to realize that the command is over, or (2) the actual speech to text is taking a long time.

I’m going to try tweaking VAD sensitivity and seeing what happens. Or maybe silence probability? If either helps, then it’s probably (1).
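For anyone following along, these are the knobs I mean. Here’s the relevant fragment of profile.json with the webrtcvad recorder settings; the key names are my reading of the Rhasspy 2.5 docs, so verify against your own profile (// comments are annotations only):

```json
"command": {
  "system": "webrtcvad",
  "webrtcvad": {
    "vad_mode": 3,           // 0 = least aggressive filtering, 3 = most aggressive
    "silence_seconds": 0.5,  // trailing silence required before the command is considered over
    "speech_seconds": 0.3,   // how much speech must be heard before recording "counts"
    "min_seconds": 1         // shortest allowed voice command
  }
}
```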

For (2) I guess I can try switching from Text FST to ARPA, though I’m not really sure what that does (as far as I can tell, the FST restricts recognition to exactly the sentences you trained, while ARPA builds a looser n-gram language model from them). I don’t think processing power is the issue, as from reading around, people seem to be getting better performance from a Pi 3B+, which has a lot less.

If it’s number (1)…I wonder if there’s some way for it to just assume silence if it’s already got all the words it needs to construct a full intent. Like can I tell it that if I say the word “off” and then later say the word “lights” I’m done speaking every time, or if I say {color} followed by “light” I’m done every time.
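For context, the sentences.ini I’m testing with is basically just this (intent and entity names are mine):

```ini
[ChangeLightState]
light_name = (corner lamp | bedroom light){name}
light_state = (on | off){state}
turn <light_state> [the] <light_name>
turn [the] <light_name> <light_state>
```

From what I can tell, though, fsticuffs only matches the finished transcription against these templates after the fact, so the grammar can’t cut listening short; the VAD’s silence detection is what decides when I’m done speaking.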

Tried changing between ARPA and Text FST, tried changing the VAD sensitivity, and tried changing the silence probability. Still an 8 second lag :pensive:

A couple things that I think were getting in the way:

Poor mic quality, leading to audio that was harder to transcribe (quick check below).

Not making sure to turn off all remote sound playback settings when I had an RDP session going.

After sorting out both, it’s working well! Just thought I’d post an update in case anyone else stumbles upon this.
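If anyone else lands here with the same symptom, the sanity check that would have saved me a lot of time is just recording and playing back a short clip so you can hear what the recognizer hears (that’s my mic’s ALSA name below; arecord -L lists yours):

```sh
# Record 5 seconds in the 16 kHz / 16-bit / mono format Rhasspy expects, then play it back
arecord -D default:CARD=CMTECK -f S16_LE -r 16000 -c 1 -d 5 test.wav
aplay test.wav
```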

Yes, there are definitely areas which could be improved on; at the moment it requires a fair bit of trial and error to find the best settings :frowning:

Mic quality, definitely; and the erroneous assumption that 2 (or 4) mics are automagically better than one. I do find that I try to deliberately speak clearly when giving commands to Porcupine.

Background noise is a big one. We really take for granted how good our brains are at filtering out the sound we are not interested in (so we can have a conversation in a crowded bar or restaurant). Computers just hear a bunch of noise. I think that voices from a television in the background will be a real issue for a long time.

Several members here are using Jabra and other high quality business conferencing mics … but even these are intended to be used in closed meeting rooms to minimise background noise.


don,

I appreciate the response. I had meant to come back with an update, and thought I had, but was apparently mistaken.

Your thoughts were correct; it was 100% a sound quality issue. The USB microphone I was using was cheap and apparently did not do a good enough job. I’m surprised it led to long delays rather than a complete failure, but it did. I’ve since moved on to using a Jabra 410 as you mentioned, and am having much better results, with intents (actually automations now) executing near instantly.

When funds allow, I hope to pick up another and try to get a satellite working with an Orange Pi Zero 3.

I’ll also say that having a speaker/mic combo rather than just a mic has been helpful for diagnostic purposes, even if you ignore all the other uses of a speaker.