Speech and Intent recognition for voice commands of duration > 30 seconds

allaeddin_ben_cheikh · January 7, 2021, 10:17am

Hello everybody,

I need some help to troubleshoot the following problem, please:

I am using Rhasspy 2.5 running in Docker, and I am trying to perform speech and intent recognition on audio chunks of about 40 seconds that contain a few moments of silence.

I am using Mozilla DeepSpeech for speech recognition and fsticuffs for intent recognition.

Here is an example of the structure of my audio chunk

[speech_1(3 secs) , silence_1(15 secs), speech_2(5 secs), silence_2(10 secs), speech_3(5 secs)].

And here are the parameters selected for DeepSpeech.

I have managed to obtain good results for audio chunks shorter than 30 seconds that don’t contain very long moments of silence (silence ~= 5 seconds).

But if my audio chunk is longer than 30 seconds, Rhasspy returns a TimeoutError.

I guess that the timeout is 30 seconds. I have tried to change this parameter using http://127.0.0.1:12101/api/listen-for-command?timeout=40, but it failed.
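For reference, here is a minimal sketch of how that request could be sent from Python instead of a browser. It assumes the endpoint expects an HTTP POST (a plain browser visit sends GET) and uses the host, port and `timeout` query parameter from the URL above. Whether Rhasspy 2.5 honours a value above 30 seconds is not confirmed here, and the helper names `build_listen_request` and `listen_for_command` are hypothetical.

```python
import urllib.parse
import urllib.request

RHASSPY_URL = "http://127.0.0.1:12101"  # host and port taken from the post


def build_listen_request(timeout_sec, base_url=RHASSPY_URL):
    """Build a POST request for /api/listen-for-command with a timeout query."""
    query = urllib.parse.urlencode({"timeout": timeout_sec})
    url = f"{base_url}/api/listen-for-command?{query}"
    # Empty body, explicit POST (assumption: the endpoint expects POST).
    return urllib.request.Request(url, data=b"", method="POST")


def listen_for_command(timeout_sec, client_timeout_sec=60):
    """Send the request and return the response body as text."""
    request = build_listen_request(timeout_sec)
    # The client-side timeout should be longer than the voice command itself,
    # otherwise the HTTP client gives up before Rhasspy answers.
    with urllib.request.urlopen(request, timeout=client_timeout_sec) as response:
        return response.read().decode("utf-8")
```

Calling `listen_for_command(40)` against a running Rhasspy instance would return whatever the server answers, so the actual error message becomes visible.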

I have noticed that in Rhasspy 2.4 there is a parameter named timeout_sec alongside the other parameters (min_sec, silence_sec, …), but I didn’t find this parameter in Rhasspy 2.5.

Do you have any idea how to make Rhasspy listen for a voice command for longer than 30 seconds?

I have also noticed that the results of speech and intent recognition are better when I use Upload WAV File with the audio chunk rather than using Wake Up and saying the voice command.

Do you know what the difference is between Upload WAV File and Wake Up?

I know that when I use Wake Up, Rhasspy uses rhasspy-silence to identify the beginning and the end of a voice command. Is the same mechanism used when I use Upload WAV File?

Thanks for your help !

Enc3ph4l0n · January 7, 2021, 12:14pm

I had a very quick glance at the code and I think the 30 seconds is hard coded and not currently configurable, but I could be wrong and someone will surely correct me if so.

Wake up just triggers “Hotword Detected” and starts a session, just as if you spoke the hotword yourself. When you upload a WAV file, Rhasspy processes the audio as though you had spoken it during a session and recognises the intent.

I presume you’re downloading the last recording and then uploading? I suspect, but don’t know for sure, that by uploading you bypass all the timing variables (silence, etc.) and it processes the whole file. But again, I could be wrong; I haven’t had time to check before posting to you.
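If it helps to reproduce the upload path outside the web UI, here is a minimal sketch. It assumes the `/api/speech-to-intent` endpoint described in the Rhasspy HTTP API docs accepts raw WAV bytes in a POST body with an `audio/wav` content type; the host and port are the ones from the first post, and the function names are hypothetical.

```python
import urllib.request

RHASSPY_URL = "http://127.0.0.1:12101"  # host and port from the first post


def build_speech_to_intent_request(wav_bytes, base_url=RHASSPY_URL):
    """Build a POST request that sends WAV audio to /api/speech-to-intent."""
    return urllib.request.Request(
        f"{base_url}/api/speech-to-intent",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},  # assumed content type
        method="POST",
    )


def recognize_wav_file(path, client_timeout_sec=120):
    """Read a WAV file from disk, upload it, and return the response text."""
    with open(path, "rb") as wav_file:
        request = build_speech_to_intent_request(wav_file.read())
    with urllib.request.urlopen(request, timeout=client_timeout_sec) as response:
        return response.read().decode("utf-8")
```

Running the same recording through this path and through a live session would show whether the difference comes from the silence detection or from the audio itself.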

What do your audio files contain to need such a requirement of 30+ seconds with such speech segmentation? I suspect your minimum duration of 29 seconds will cause undesired results.

allaeddin_ben_cheikh · January 7, 2021, 2:48pm

Hello, thank you for your clear reply!

My audio files contain the characteristics of a product.
Here is the text that an audio file may contain:

[Red , silence_1(15 secs), Flap closure Interior zip pocket, silence_2(10 secs), 23 * 15]

The audio file is so long because it contains long moments of silence.
I noticed the difference between Wake Up and Upload WAV File through the following test:

I used Wake Up and said my voice command, which lasted 25 s. The result was: no intent recognized.
I recorded my voice command to a WAV file using another app and then uploaded it using Upload WAV File. In this case, Rhasspy recognized the intent, although the command had the same length and the same moments of silence as the first one.

And can you explain why a minimum duration of 29 seconds can cause undesired results, please?

Thanks

Enc3ph4l0n · January 7, 2021, 3:26pm

Your voice command here is 25s, but you’ve set your configuration to expect a minimum of 29s…

Forgive me, I still don’t quite understand your use case here. Do you have a collection of audio files already that you want to process? Have you seen Rhasspy’s sister project? I wonder if that would be more appropriate for your needs:

If you absolutely need to use Rhasspy, we can break your issues down further. But because you’re recording your WAV files on a different device and then uploading them into Rhasspy, you could have numerous configuration issues which may not need solving if the above project fits your needs better.

allaeddin_ben_cheikh · January 7, 2021, 3:53pm

No, I don’t have a collection of audio files already recorded. I tried uploading the recorded WAV file directly in order to understand why Rhasspy doesn’t recognize intents from voice commands that contain long moments of silence.

The best approach for my case is to use Wake Up and say the voice command, not to upload the WAV file directly.

I have managed to reduce the length of my voice commands to 20 seconds. So I set min_sec to 20 seconds and tried to tweak the parameters of rhasspy-silence ( https://github.com/rhasspy/rhasspy-silence ), but I can’t obtain good results when there is a long moment of silence in my command, even though I set silence_sec to 10 seconds. (The silence in a voice command can last 8 seconds, for example.)

What I understood from the documentation is that Rhasspy listens to my voice command for a minimum time of min_sec. After that, it starts to listen for silence: if no word is said during silence_sec, it stops listening and starts processing the voice command.
Is that exactly what Rhasspy does, or did I misunderstand the doc?

Thanks !
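To make that reading of the documentation concrete, here is a small simulation of it. This is a simplified model of the behaviour as described in this post, not the actual rhasspy-silence implementation, which works on short audio frames from a voice activity detector. The names `min_sec` and `silence_sec` come from the thread; `max_sec` is a hypothetical stand-in for the 30-second limit discussed above, and `end_of_command` is an illustrative helper.

```python
def end_of_command(segments, min_sec, silence_sec, max_sec=30.0):
    """Return the time in seconds at which listening would stop.

    segments is a list of (kind, duration) pairs, where kind is "speech"
    or "silence". If the segments run out before any stop condition is
    met, the total duration is returned.
    """
    elapsed = 0.0
    for kind, duration in segments:
        if kind == "silence":
            # Silence only starts to count once min_sec has passed.
            silence_start = max(elapsed, min_sec)
            stop_at = silence_start + silence_sec
            if stop_at <= elapsed + duration:
                return min(stop_at, max_sec)
        elapsed += duration
        if elapsed >= max_sec:
            # Hard upper limit reached before enough silence was heard.
            return max_sec
    return min(elapsed, max_sec)


# Structure of the 38-second audio chunk from the first post.
chunk = [
    ("speech", 3), ("silence", 15), ("speech", 5),
    ("silence", 10), ("speech", 5),
]

print(end_of_command(chunk, min_sec=1, silence_sec=5))    # 8.0
print(end_of_command(chunk, min_sec=29, silence_sec=10))  # 30.0
```

Under this model, a short `silence_sec` cuts the command off during the first long pause, while a `min_sec` of 29 pushes the end of the command past the 30-second limit, which would match the TimeoutError reported earlier.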