Just a question, but with Google Assistant you can say in one go "Ok Google, turn on the light."
Why, with Snips, Rhasspy etc., do we have to trigger the wakeword, wait for listening to start, and finally talk?
Why can't the ASR have the entire stream, so that when the wakeword is recognized it immediately decodes what was said right after the wakeword and stops when silence is detected?
I guess it's not doable, otherwise it would already work that way, but why can Google do it and not open source assistants?
It is actually doable, just a bit harder. The Python SpeechRecognition library does this with Snowboy, for example.
You can add this as an enhancement on GitHub, and we can look into it.
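To make the idea concrete, here is a minimal sketch (my own illustration, not code from SpeechRecognition or Rhasspy) of the technique being asked about: the service listens continuously, keeps recent frames in a rolling buffer, and when the hotword detector fires (typically a few frames late), it "rewinds" into that buffer so the command frames spoken right after the wakeword are not lost. All names and the `detector_lag` value are assumptions for the demo.

```python
from collections import deque

def decode_command(frames, detect_wakeword, is_silence, detector_lag=2):
    """Return the frames the ASR should decode: everything after the
    wakeword frame, up to the first silence frame.

    detect_wakeword(frame) is True on the frame containing the hotword;
    we simulate a detector that only *reports* the hit `detector_lag`
    frames later, which is why the rolling buffer matters: without it,
    the frames spoken in the meantime would be lost.
    """
    buffer = deque(maxlen=50)   # rolling buffer of recent (index, frame) pairs
    hit_index = None            # index of the frame where the hotword was seen
    command = []
    for i, frame in enumerate(frames):
        buffer.append((i, frame))
        if hit_index is None and detect_wakeword(frame):
            hit_index = i
        if hit_index is not None and i >= hit_index + detector_lag:
            # "rewind": replay buffered frames that came after the hotword
            for j, f in buffer:
                if j > hit_index:
                    if is_silence(f):
                        return command
                    command.append(f)
            # then keep consuming live frames until silence
            for f in frames[i + 1:]:
                if is_silence(f):
                    break
                command.append(f)
            return command
    return command
```

For example, with a toy stream where each "frame" is a word, `decode_command(["noise", "hey", "turn", "on", "light", ""], lambda f: f == "hey", lambda f: f == "")` recovers `["turn", "on", "light"]` even though the detector reported the hotword two frames late.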
I think Snips could do that as well.
Internally, they use some kind of rewind feature on the Hermes protocol, but I don't know any details.
Indeed, Snips did it by adding extra metadata in the audio messages, which I bumped into when I was developing hermes-audio-server:
[…] the rewind and replay is what allow us to reduce drastically the necessary gap between hotword detection and asr start of decoding […]
Since then, Snips has documented the format in their source code.
But I haven’t looked at how the wakeword and ASR components use this metadata. I don’t speak Rust (yet), and I don’t think Snips has open-sourced the code needed to investigate this.
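Since the actual Snips format isn't something I've dug into, here is only a rough sketch of the general idea: each published audio chunk carries a monotonically increasing frame index as metadata, and the hotword-detected message references the index where the hotword ended, so the ASR can rewind in its own buffer and start decoding exactly there. The message layout and names below are invented for illustration and do not match the real format.

```python
import json

def make_audio_message(frame_index, pcm_bytes):
    """Wrap a PCM chunk with a frame index as a small JSON header
    (length-prefixed), loosely mimicking per-chunk metadata."""
    header = json.dumps({"frame": frame_index}).encode("ascii")
    return len(header).to_bytes(2, "big") + header + pcm_bytes

def parse_audio_message(message):
    """Split a message back into (frame_index, pcm_bytes)."""
    hlen = int.from_bytes(message[:2], "big")
    meta = json.loads(message[2:2 + hlen])
    return meta["frame"], message[2 + hlen:]

class RewindingAsr:
    """Toy ASR front end that buffers indexed chunks so it can
    rewind to the frame where the hotword ended."""
    def __init__(self):
        self.frames = {}

    def on_audio(self, message):
        idx, pcm = parse_audio_message(message)
        self.frames[idx] = pcm

    def on_hotword(self, end_frame):
        # decode everything received after the hotword ended
        return b"".join(pcm for idx, pcm in sorted(self.frames.items())
                        if idx > end_frame)
```

With this kind of indexing, the gap between hotword detection and the start of ASR decoding stops mattering: the ASR already holds the audio and just needs to know which index to start from.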
I think Google (and probably Amazon) do the same thing with their devices. Someone posted about recordings containing words spoken before the wakeword "Ok Google".
Yes, that is most probably the reason.
I would also love this feature.
I would love to contribute, so if someone is adding this feature, I'm willing to help.