Just a question, but with Google Assistant you can say in one go "Ok Google, turn on the light."
Why, with Snips, Rhasspy etc., do we have to trigger the wakeword, wait for listening to start, and finally talk?
Why can't the ASR have the entire stream, so that when the wakeword is recognized it immediately decodes what was said right after the wakeword and stops when silence is detected?
I guess it's not doable, otherwise it would already work that way, but why can Google do it and not open source assistants?
It is actually doable, just a bit harder. The Python SpeechRecognition library does this with Snowboy, for example.
You can add this as an enhancement on GitHub, and we can look into it.
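To make the idea concrete, here is a minimal sketch (my own illustration, not code from SpeechRecognition or Rhasspy) of the technique being asked about: the service listens continuously, keeps recent frames in a rolling buffer, and when the hotword detector fires (typically a few frames late), it "rewinds" into that buffer so the command frames spoken right after the wakeword are not lost. All names and the `detector_lag` value are assumptions for the demo.

```python
from collections import deque

def decode_command(frames, detect_wakeword, is_silence, detector_lag=2):
    """Return the frames the ASR should decode: everything after the
    wakeword frame, up to the first silence frame.

    detect_wakeword(frame) is True on the frame containing the hotword;
    we simulate a detector that only *reports* the hit `detector_lag`
    frames later, which is why the rolling buffer matters: without it,
    the frames spoken in the meantime would be lost.
    """
    buffer = deque(maxlen=50)   # rolling buffer of recent (index, frame) pairs
    hit_index = None            # index of the frame where the hotword was seen
    command = []
    for i, frame in enumerate(frames):
        buffer.append((i, frame))
        if hit_index is None and detect_wakeword(frame):
            hit_index = i
        if hit_index is not None and i >= hit_index + detector_lag:
            # "rewind": replay buffered frames that came after the hotword
            for j, f in buffer:
                if j > hit_index:
                    if is_silence(f):
                        return command
                    command.append(f)
            # then keep consuming live frames until silence
            for f in frames[i + 1:]:
                if is_silence(f):
                    break
                command.append(f)
            return command
    return command
```

For example, with a toy stream where each "frame" is a word, `decode_command(["noise", "hey", "turn", "on", "light", ""], lambda f: f == "hey", lambda f: f == "")` recovers `["turn", "on", "light"]` even though the detector reported the hotword two frames late.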
I think Snips could do that as well.
Internally, they use some kind of rewind feature on the Hermes protocol, but I don't know any details.
Indeed, Snips did it by adding extra metadata in the audio messages, which I bumped into when I was developing hermes-audio-server:
[…] the rewind and replay is what allow us to reduce drastically the necessary gap between hotword detection and asr start of decoding […]
Since then, Snips has documented the format in their source code.
But I haven’t looked at how the wakeword and ASR components use this metadata. I don’t speak Rust (yet), and I don’t think Snips has open-sourced the code needed to investigate this.
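Since the actual Snips format isn't something I've dug into, here is only a rough sketch of the general idea: each published audio chunk carries a monotonically increasing frame index as metadata, and the hotword-detected message references the index where the hotword ended, so the ASR can rewind in its own buffer and start decoding exactly there. The message layout and names below are invented for illustration and do not match the real format.

```python
import json

def make_audio_message(frame_index, pcm_bytes):
    """Wrap a PCM chunk with a frame index as a small JSON header
    (length-prefixed), loosely mimicking per-chunk metadata."""
    header = json.dumps({"frame": frame_index}).encode("ascii")
    return len(header).to_bytes(2, "big") + header + pcm_bytes

def parse_audio_message(message):
    """Split a message back into (frame_index, pcm_bytes)."""
    hlen = int.from_bytes(message[:2], "big")
    meta = json.loads(message[2:2 + hlen])
    return meta["frame"], message[2 + hlen:]

class RewindingAsr:
    """Toy ASR front end that buffers indexed chunks so it can
    rewind to the frame where the hotword ended."""
    def __init__(self):
        self.frames = {}

    def on_audio(self, message):
        idx, pcm = parse_audio_message(message)
        self.frames[idx] = pcm

    def on_hotword(self, end_frame):
        # decode everything received after the hotword ended
        return b"".join(pcm for idx, pcm in sorted(self.frames.items())
                        if idx > end_frame)
```

With this kind of indexing, the gap between hotword detection and the start of ASR decoding stops mattering: the ASR already holds the audio and just needs to know which index to start from.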
I think Google (and probably Amazon) do the same thing with their devices. Someone posted about recordings containing words spoken before the wakeword "Ok Google".
Yes, that is most probably the reason.
I would also love this feature.
I would love to contribute, so if someone is adding this feature, I'm willing to help.