Latency after wake word detection and at start of text-to-speech

I’m running the latest git Rhasspy (v2.5.1) on an i3-6100 CPU @ 3.70GHz. It’s pretty responsive, but I see a delay of about half a second between HotwordDetected and AsrStartListening. This means that when I say something like “Computer, what time is it”, the transcription only includes “is it” because I spoke “what time” too soon. From the logs it looks like a lot of that time is spent playing the Wake WAV, and disabling that sound does indeed decrease the delay significantly. Would it be possible to make the audio play asynchronously?

Also, sometimes I notice the beginning of the text-to-speech audio is cut off. I thought it might be the same cause as above, but with the Recorded WAV causing the delay. I tried disabling that, but it didn’t help. Finally, I opened the WAV file from googlewavenet directly; the beginning of the audio is cut off in the file itself.

I came here looking for a solution to this as well. The latency is annoying; it’s difficult to remember to pause, and unnatural to say the wake word and then wait a second before saying the command.

Disabling the feedback sounds will indeed speed up the ASR listening phase but you will still have the detection and dialogue management handling delay.

Commercial products like Echo or Google devices use some kind of replay technique to allow for instantaneous command following the wake word. This is not implemented in Rhasspy (yet) but might be in the future.

Hope this helps.

It’s possible, but then the ASR picks up the feedback sound and it messes with the speech recognition. So Rhasspy goes out of its way to not engage the ASR until the sound has finished playing, resulting in the delay.

As @fastjack mentioned, a kind of “echo cancellation” could be used to filter out the feedback sound from the audio. But nothing is implemented for this yet, and the variety of hardware Rhasspy works on makes it harder.

I think that’s why Echo and Google devices don’t play feedback sounds.

When the wake word is detected, the audio buffers are stored and flushed the moment ASR starts listening to compensate for the delay between detection and ASR listening. Snips did something similar with the “replay” feature of their audio server.

This can only work if no feedback sounds are played because as @synesthesiam said it will mess with the ASR.
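
For what it’s worth, the store-and-flush idea could look roughly like this; all the names below (the callbacks, the asr object and its feed() method) are hypothetical placeholders for illustration, not real Rhasspy APIs:

```python
# Minimal sketch of "store and flush": hold microphone frames from the moment
# the hotword fires, then replay them into the ASR once it starts listening.
from collections import deque

pending_frames = deque()
hotword_detected = False
asr_listening = False

def on_hotword_detected():
    # Hotword fired: start holding on to microphone frames.
    global hotword_detected
    hotword_detected = True

def on_asr_start_listening(asr):
    # ASR is finally ready: replay everything captured since detection.
    global asr_listening
    asr_listening = True
    while pending_frames:
        asr.feed(pending_frames.popleft())

def on_audio_frame(frame, asr):
    # Called for every microphone frame.
    if asr_listening:
        asr.feed(frame)
    elif hotword_detected:
        pending_frames.append(frame)  # words spoken "too soon" are not lost
```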

Maybe the delay, if it’s mostly artificial, could be made an adjustable configuration option. People who want the wake sound could set it to wait for however long the sound takes to play before recording starts, while people who turn the wake sound off in favor of speed and/or a visual cue like Alexa/Google devices could speak to it more naturally.

If you disable the feedback sounds by just clearing the text boxes in the web UI, there is almost no delay.

In my testing, though, there still is a little delay. One way this could be mitigated is to have the ASR recording a bit before it gets the official “go ahead” signal, and have the various wake word services report some timestamp of when the hotword was actually detected. The ASR could then go back in its audio buffer and start “recording” from there.

The SpeechRecognition library did this using PyAudio and snowboy.

Yeah, I had a project a few years ago where I kept a short amount of audio in a “rolling buffer” and waited for snowboy to report the wake word, then kept adding to the buffer until the command was done. The wake word often ended up in the audio file, but removing it afterwards worked fine.
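
Something along these lines, I think; a rough sketch assuming a 16 kHz, 16-bit mono mic via PyAudio and a detector object with a snowboy-style detect(frame) method (both placeholders):

```python
import collections
import pyaudio

RATE, FRAME = 16000, 480                             # 30 ms frames at 16 kHz
preroll = collections.deque(maxlen=RATE // FRAME)    # keep roughly 1 s of audio

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=FRAME)

def capture_command(detector, seconds=5.0):
    """Wait for the wake word, then return pre-roll + command audio as bytes."""
    while True:
        frame = stream.read(FRAME, exception_on_overflow=False)
        preroll.append(frame)
        if detector.detect(frame):                   # wake word heard
            audio = list(preroll)                    # still contains the wake word
            for _ in range(int(seconds * RATE / FRAME)):
                audio.append(stream.read(FRAME, exception_on_overflow=False))
            return b"".join(audio)                   # strip the wake word afterwards
```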

I am super new to this platform, but I would love to help.

There may be a chance to do better in Rhasspy. Audio frames are broken up by the microphone services according to the Hermes protocol. If we could somehow timestamp them, then the wake word services could note the timestamp when the detection occurred, and the ASR could rewind back to just after that point.

The audio frames are just WAV data, which can apparently hold metadata! As long as the timestamp is monotonic (always increases), I think this would be possible.
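
For example, a monotonic counter could ride along as a small custom RIFF chunk appended to each WAV frame; standard WAV readers skip chunks they don’t recognize. A hedged sketch (the “cntr” chunk id is made up purely for illustration):

```python
import struct

def add_counter(wav_bytes, counter):
    """Append a 'cntr' chunk holding an 8-byte counter to a WAV frame."""
    out = wav_bytes + struct.pack("<4sIQ", b"cntr", 8, counter)
    # Patch the RIFF size field (offset 4) so the container stays consistent.
    return out[:4] + struct.pack("<I", len(out) - 8) + out[8:]

def read_counter(wav_bytes):
    """Walk the chunks after the 12-byte RIFF/WAVE header; return the counter."""
    pos = 12
    while pos + 8 <= len(wav_bytes):
        chunk_id, size = struct.unpack_from("<4sI", wav_bytes, pos)
        if chunk_id == b"cntr":
            return struct.unpack_from("<Q", wav_bytes, pos + 8)[0]
        pos += 8 + size + (size % 2)   # chunks are padded to even length
    return None
```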

So the basic idea is:

  1. Microphone services add timestamps to WAV metadata (or just a counter that always increases)
  2. ASR services keep some number of audio frames before being told to start listening
  3. Hotword services report the timestamp/counter of the audio frame where detection occurred
  4. ASR services look in their buffer for the audio frame right after the detection, and start there

#4 could be accomplished by the dialogue manager slipping an extra field in the asr/startListening message.
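
On the ASR side, steps 2 and 4 might boil down to something like this; the detected_at field on asr/startListening is the hypothetical extra field mentioned above, not an existing Hermes one:

```python
from collections import OrderedDict

class AsrPrebuffer:
    """Keep the last N audio frames keyed by their monotonic counter."""

    def __init__(self, max_frames=100):
        self.frames = OrderedDict()
        self.max_frames = max_frames

    def on_audio_frame(self, counter, wav_frame):
        # Step 2: always hold a rolling window of recent frames.
        self.frames[counter] = wav_frame
        while len(self.frames) > self.max_frames:
            self.frames.popitem(last=False)

    def on_start_listening(self, detected_at):
        # Step 4: transcription starts from the frame right after detection.
        return [frame for counter, frame in self.frames.items()
                if counter > detected_at]
```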

Thoughts?

Removing the latency would certainly be great, though as I understand it, it would be mutually exclusive with the beep sounds that signify recording start and so on. So there would be a few different options:

  • beep sounds and delay, like right now
  • beep sound and no delay, without filtering (the beep is recorded and could mess up recognition)
  • beep sound and no delay, with filtering (if there is a way to filter out that specific sound, this would be the perfect solution)
  • no beep sound and no delay

So unless the ideal solution is possible, Rhasspy should support both the beep-with-delay version and the no-delay, no-beep version. Just scrapping the sound seems wrong; I am sure there are quite a few cases where it is useful, even beyond testing whether the wake word was recognized.

I hope this was understandable; 4 am is not the best time to type things in English, but I just saw this and thought I would respond.

There already are settings for adding delays, so the user should be able to adjust to the recording basically just starting sooner. If you have a 1 s wake sound, you would set the delay to 1 s. I think the relevant setting in the STT section is “Skip Before”.

Yes, but as I understand it, the point is that you can speak a full sentence directly after the wake word, and the audio is “rewound” to just after the wake word once Rhasspy is ready to listen. If there is a beep after the wake word, the beep will be recorded in the audio while you are speaking and will also end up in the rewound portion, hence the beep being somewhat exclusive with speaking without delay.

I think what you’re describing is how Snips worked :slightly_smiling_face:

See the in-code documentation of the metadata Snips used for audio streaming. The Snips developers documented this when I asked for clarification about the format while working on hermes-audio-server last year.

https://github.com/badaix/snapcast does a really good job of latency synchronization and embeds a timestamp in the stream, as you describe.
I keep thinking it could be amazing and really simple for distributed network mics in a room.
But if you want to see a really good implementation of timestamp-based latency correction, Snapcast is the one.

Really it’s not timecode or latency; the word we need is ‘stream’. Both DeepSpeech and Kaldi can be used in streaming mode.
DeepSpeech is going all out with some sort of asynchronous KWS interface for streaming, since the problem you have is when your ASR is slower than realtime.

If your ASR is slower than realtime you are going to get latency, but streaming would massively improve things, and if you are using a Pi 4 or above then DeepSpeech is faster than realtime.
Also, I don’t know how, but the currently single-threaded decoder is supposedly going to multi-thread a stream; how that works is beyond me.

But basically, as said, ‘stream’ is the magic word, and for an async stream ‘ring buffer’ is probably at least as magical, especially if your ASR is slower than realtime.
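
To illustrate the streaming point, DeepSpeech’s Python API (0.7 and later) can decode incrementally as frames arrive instead of only after end-of-utterance silence; the model path and the frame source below are placeholders:

```python
import numpy as np
import deepspeech

model = deepspeech.Model("deepspeech-0.9.3-models.tflite")  # placeholder path
stream = model.createStream()

def feed(frame):
    """Feed one 16-bit, 16 kHz mono frame as it arrives from the mic."""
    stream.feedAudioContent(np.frombuffer(frame, dtype=np.int16))
    return stream.intermediateDecode()   # partial transcript so far

def finish():
    """Call when VAD reports silence (or when the KWS aborts)."""
    return stream.finishStream()
```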

Waiting for the ASR’s end-of-utterance silence will always incur considerable latency.

But actually, isn’t it really simple: VAD-detected audio is just streamed to the ASR, and because you know the keyword, you just post-process the STT output and remove the keyword prefix.
Then all the KWS needs to do is send an abort on a false KWS/VAD initiation.
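
The post-processing part could be as simple as the sketch below; the wake phrases are examples only:

```python
WAKE_WORDS = ("computer", "hey computer")   # example phrases, adjust to your setup

def strip_wake_word(transcript):
    """Drop the wake word from the front of an STT transcript, if present."""
    text = transcript.strip().lower()
    for wake in WAKE_WORDS:
        if text.startswith(wake):
            return text[len(wake):].lstrip(" ,")
    return text

# strip_wake_word("computer what time is it")  ->  "what time is it"
```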

Or, when the KWS fires, stream audio to a loopback adapter and only stream post-KWS audio, and just learn your system’s pause for the KWS latency; otherwise a ring buffer or some other delay, rather than a natural one, will have to be in the stream.
Delay libraries/utilities and ring buffers already exist, but hey, just learn to pause to match your system’s speed.

As for timecodes, again: it just seems that Hermes audio is garnering more function and bloat, when all the ASR really needs is a stream, which could be as simple and low-load as a loopback adapter.

https://github.com/voice-engine/alsa_plugin_fifo is one option, but ALSA also has file plugins:
https://www.alsa-project.org/alsa-doc/alsa-lib/pcm_plugins.html

@synesthesiam I think you had it, but it doesn’t even need a timestamp: just stream the audio and abort the ASR on a KWS failure.