Finetuning Rhasspy

TyrionWarMage · October 24, 2021, 2:37pm

I have had Rhasspy running for a while now, but I still have too many issues that prevent an acceptable WAF.

My setup:

Intel Celeron N4100
4GB Ram
Ubuntu (Rhasspy direct install, no docker)
PSEye Mic
AEC with PulseAudio WebRTC
Porcupine with “snowboy”, confidence 0.7
Kaldi with ARPA, no open transcript, DE model, confidence 0

My specific problems:

1. Wakeword and intent recognition require very clear pronunciation
2. Intent recognition barely works when speaking fast
3. Intent recognition requires a short delay after the wakeword, so no “fluent speaking”
4. Wakeword has a very high false negative rate when watching movies/series

I already tried snowboy as the wakeword engine. It has a somewhat lower false negative rate (especially when watching movies), but a substantially higher false positive rate. I know that really solving (3) would require a feature that Rhasspy does not have: buffering the audio stream and applying Kaldi starting a few ms in the “past”. For (4), I should probably go for a better AEC algorithm (maybe https://github.com/SaneBow/PiDTLN ?).

Hence, my questions:

1. Has anyone encountered similar issues (especially 1 and 2) and fine-tuned their setup accordingly?
2. Is there a way to resolve 3?
3. Does anyone have a really good AEC setup running (for 4)?

synesthesiam · November 5, 2021, 9:17pm

Hi @TyrionWarMage, welcome.

You’ve hit on most of the major issues Rhasspy has:

1. This may be partially the microphone at fault. @rolyan_trauts can give some good suggestions for a better mic (like this one though I haven’t tried it yet)
2. Could also be the mic; I’m hoping to add Kaldi fine-tuning in a future release
3. This is one of the most requested features; I have ideas of how to implement it [1], but haven’t gotten around to it yet, unfortunately.
4. I haven’t played with AEC yet, but I know some people here have. If you don’t use voice commands while a movie is playing, a hack might be to toggle the wake word on/off automatically using the HTTP API based on the movie player state.
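
The wake word toggle mentioned in the last point could look roughly like the sketch below. It assumes Rhasspy's web server is reachable on the default port 12101 and that the `/api/listen-for-wake` endpoint accepts a plain-text body of `on` or `off`. The player state names and the `should_listen` helper are hypothetical; how you learn the movie player's state depends on your player.

```python
import urllib.request

# Assumption: Rhasspy runs on the same machine with its default web port.
RHASSPY_URL = "http://localhost:12101"


def wake_word_request(enabled: bool, base_url: str = RHASSPY_URL) -> urllib.request.Request:
    """Build the POST request that switches wake word listening on or off."""
    body = b"on" if enabled else b"off"
    return urllib.request.Request(
        f"{base_url}/api/listen-for-wake",
        data=body,
        method="POST",
        headers={"Content-Type": "text/plain"},
    )


def set_wake_word(enabled: bool, base_url: str = RHASSPY_URL) -> int:
    """Send the request and return the HTTP status code."""
    request = wake_word_request(enabled, base_url)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def should_listen(player_state: str) -> bool:
    """Hypothetical policy: disable the wake word only while a movie plays."""
    return player_state != "playing"
```

A player hook would then call `set_wake_word(should_listen(state))` whenever the playback state changes.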

[1] Buffering the audio in the ASR service, and then having the wake word service send an audio timestamp at detection
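
A minimal sketch of the buffering idea in [1], not Rhasspy's actual implementation: the ASR side keeps the last few seconds of audio chunks with timestamps, and when the wake word service reports the timestamp at which detection happened, transcription starts from that point in the buffer instead of from "now". The class name, chunk sizes, and timestamp format are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Tuple


class RollingAudioBuffer:
    """Keeps the most recent audio chunks together with their start times."""

    def __init__(self, max_seconds: float, chunk_seconds: float) -> None:
        # Oldest chunks fall out automatically once the deque is full.
        max_chunks = max(1, round(max_seconds / chunk_seconds))
        self._chunks: Deque[Tuple[float, bytes]] = deque(maxlen=max_chunks)

    def push(self, timestamp: float, chunk: bytes) -> None:
        """Store one chunk; timestamp is the time the chunk starts."""
        self._chunks.append((timestamp, chunk))

    def since(self, timestamp: float) -> bytes:
        """Return buffered audio from every chunk starting at or after timestamp."""
        return b"".join(chunk for start, chunk in self._chunks if start >= timestamp)


# Usage: the wake word service reports a detection at t=2.0, and the ASR
# starts with audio that was already captured after that moment.
buffer = RollingAudioBuffer(max_seconds=2.0, chunk_seconds=0.5)
for index, payload in enumerate([b"a", b"b", b"c", b"d", b"e", b"f"]):
    buffer.push(index * 0.5, payload)
audio_for_asr = buffer.since(2.0)
```

With this approach the command spoken directly after the wake word is not lost while the ASR session is being started, which is what problem (3) is about.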

rolyan_trauts · November 6, 2021, 10:22am

PulseAudio AEC is very weak: it cancels perfectly at low levels but fails completely above a relatively low volume threshold.
SaneBow's AEC is awesome but does create load.
Speexdsp works better than PulseAudio at higher levels and has lower load.

PSeye mic array is pointless as we only use a single channel.

I am not sure how well SaneBow's copes with clock drift, but really you need input and output audio on the same device so you don’t get clock drift; for Speexdsp it's essential.
Clock drift happens if you have two different devices: the slight drift makes cancellation less accurate, whilst using the same device means one clock that is obviously in sync.
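
To get a feel for the scale of the problem, here is a small back-of-the-envelope calculation. The 50 ppm clock offset is an assumed, illustrative figure and not a measurement of any particular device. Real drift also varies over time, so this only shows the order of magnitude.

```python
def drift_samples(sample_rate_hz: int, ppm_offset: float, seconds: float) -> float:
    """Samples by which two clocks diverge, given a constant ppm offset."""
    return sample_rate_hz * ppm_offset * 1e-6 * seconds


def drift_milliseconds(ppm_offset: float, seconds: float) -> float:
    """The same divergence expressed in milliseconds of audio."""
    return ppm_offset * 1e-6 * seconds * 1000.0


# Assumed values: 16 kHz audio, 50 ppm offset between the two devices.
samples_after_one_minute = drift_samples(16000, 50, 60)   # roughly 48 samples
ms_after_one_minute = drift_milliseconds(50, 60)          # roughly 3 ms
```

Even a few milliseconds of growing misalignment between the playback reference and the microphone signal is enough to degrade an echo canceller that expects the two streams to stay aligned.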

PS3 Eye or USB-only mics are not a good idea if you want to use AEC.

Mics can be cheap, such as this one on eBay: Mini Unidirectional 3.5mm Recording Microphone for Mobile Phone/Notebook Computer

Or if you have a month to a year…

USB-C mics are just power with signal but easy to convert to 3.5mm

TyrionWarMage · November 12, 2021, 10:02am

@synesthesiam Thanks for the info. [1] sounds great and I will have a look to see if I can contribute something. Plenty of Python experience, but next to none with sound/stream handling … so no promises

@rolyan_trauts thanks, I will try something that connects directly to the 3.5mm jack. However, I have a receiver hooked up to the HDMI output, so I assume I will have clock drift anyhow? I was always wondering: is there a way to measure the drift and set the delay manually?
Concerning SaneBow, I hope the N4100 will be able to handle the load, but we’ll see …

rolyan_trauts · November 14, 2021, 2:05am

It's drift, not latency: much smaller than latency, but random, so no.
It's why there is no software that uses or advocates recording across multiple sound cards; semi-pro and professional setups use a single multichannel sound card instead.

WallyDW · November 17, 2021, 9:45am

For wakeword I use Raven with a Seeed Respeaker 4 Mic Linear Array, and that catches it pretty decently.
I can even play music on the speaker connected to the Respeaker and it still catches my wakeword.

Regarding PulseAudio, it looks like PipeWire might be a better way to go, since it aims to replace PulseAudio with something closer to JACK, and it looks like it will be the future of audio servers on non-professional setups and even on some professional ones.