Hey all, just looking for some guidance from those familiar with the code base. I’m looking to build out some features for my personal Rhasspy setup as a passion project that I’d love to give back to the community. Specifically: speaker recognition (i.e. knowing that Sam is the one talking; downstream I already have it set up so only certain voice commands in HA can be issued if the speaker is Sam), and arbitrary-length input, so I can have a sentence like wakeword > search {query} and have everything after “search” treated as a query that gets sent to a remote API, with the response then sent back to the Rhasspy TTS engine to read out loud. That last part with the TTS and the remote API I also have implemented.
With that said, the two parts I’d like some guidance or discussion on are:
The first is how to pass the audio payload, after a certain intent is recognized (like “turn off the alarm”), to a downstream script that performs the speaker analysis. I’ve worked out the speaker analysis portion, just not how to hook into the audio and pass it along after an intent has been identified. From the code base reading I’ve already done, I think this would be done through Hermes, right? (A rough sketch of what I have in mind is below, after the second question.)
The second is probably similar to the first, I imagine: how to implement arbitrary-length input that terminates on silence and, if the first keyword “search” is identified, passes the rest along to a separate Python script, potentially? At least that’s the approach to arbitrary-length input I’m leaning towards, having searched the forums. This would also be done through Hermes, right?
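To make the first question concrete, here is the rough shape of what I have in mind: a minimal, untested sketch assuming the standard Hermes topics (hermes/hotword/<wakewordId>/detected, hermes/audioServer/<siteId>/audioFrame, hermes/asr/textCaptured, hermes/intent/#) and a placeholder identify_speaker() standing in for my existing analysis script.

```python
# Rough sketch only (untested): buffer the Hermes audio frames between the hotword
# and the end of ASR, then run the speaker analysis once an intent is recognized.
# Assumes paho-mqtt 1.x and a placeholder identify_speaker() for the existing script.
import json
import paho.mqtt.client as mqtt

buffering = False
frames = []  # raw WAV chunks from hermes/audioServer/<siteId>/audioFrame


def identify_speaker(wav_chunks):
    """Placeholder for the speaker-analysis script I already have."""
    return "sam"


def on_connect(client, userdata, flags, rc):
    client.subscribe("hermes/hotword/+/detected")
    client.subscribe("hermes/audioServer/+/audioFrame")
    client.subscribe("hermes/asr/textCaptured")
    client.subscribe("hermes/intent/#")


def on_message(client, userdata, msg):
    global buffering, frames
    if msg.topic.startswith("hermes/hotword/") and msg.topic.endswith("/detected"):
        buffering, frames = True, []       # wake word heard: start capturing audio
    elif msg.topic.endswith("/audioFrame") and buffering:
        frames.append(msg.payload)         # each payload is a small WAV chunk
    elif msg.topic == "hermes/asr/textCaptured":
        buffering = False                  # silence ended the command: stop capturing
    elif msg.topic.startswith("hermes/intent/"):
        intent = json.loads(msg.payload)
        if intent["intent"]["intentName"] == "TurnOffAlarm":  # example intent name
            print("speaker:", identify_speaker(frames))


client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 12183)  # Rhasspy's internal broker, or your own MQTT broker
client.loop_forever()
```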
Anyway, I’ve used Rhasspy for a year and love it, and I hope someone with more knowledge of the code base will point me in the right direction so I can give something back. Thank you!
If you look at the satellite code it will give you a good view of what’s going on…
The problem with passing the audio after the intent depends on what you mean… prior audio or upcoming…
The audio is passed on frame by frame / buffer by buffer and then forgotten… so there is nothing prior to pass.
The current speaker recognition is done by external services… my HA knows I am present in a room (funny you picked Sam… as that’s me!) because my phone is detected by a sensor in each room, not from the voice stream…
But you COULD send the audio after the hotword to the speaker detector AT THE SAME TIME as it goes on to the ASR…
I think that is how it works now… silence ends the input to the ASR transcription…
But I have not seen any ASR do word-by-word detection responses… there is an enhancement issue open on that.
I’ve looked at what it would take with my Google Speech ASR, and it’s non-trivial… the ASR is getting buffer-by-buffer audio… with no idea where the ‘speech’ happens in the stream…
Your overall objective,
wakeword > search {query} and have everything after search treated as a query that gets sent to a remote API and then the response sent back to the rhasspy TTS
is fundamentally how it works… you just have to figure out what the ‘intent’ is and what the parameter to that intent is… (rough sketch of that at the end of this reply)
The satellite does the hotword locally, streams the audio to the server and intent handler, and receives text back for playback.
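For the search {query} part, something like this untested sketch is the general shape… it assumes unmatched sentences come through on hermes/nlu/intentNotRecognized with the raw text (you could equally subscribe to your own intent and read its input field), and the remote API URL is just a placeholder…

```python
# untested sketch... treat everything captured after "search" as the query,
# send it to a remote API (placeholder URL), and have Rhasspy speak the answer.
import json
import requests
import paho.mqtt.client as mqtt

SEARCH_API = "https://example.com/search"  # placeholder endpoint


def on_connect(client, userdata, flags, rc):
    client.subscribe("hermes/nlu/intentNotRecognized")


def on_message(client, userdata, msg):
    payload = json.loads(msg.payload)
    text = (payload.get("input") or "").lower()
    if not text.startswith("search "):
        return
    query = text[len("search "):]
    answer = requests.get(SEARCH_API, params={"q": query}).text
    client.publish(
        "hermes/tts/say",
        json.dumps({"text": answer, "siteId": payload.get("siteId", "default")}),
    )


client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 12183)  # or your own MQTT broker
client.loop_forever()
```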
I’m working on a speaker recognition/verification system also. Would you like to talk about what you’ve come up with so far? I’m working on a different smart-speaker system, but the underlying algorithms would be the same. I’ve got speaker recognition working with pretty good confidence, and am working on a registration system.
I have a prototype of a very crude speaker recognition system, based on the template matching algorithm used in Raven. While Raven struggled to do wake word recognition (its primary purpose), it turns out that you can combine it with another (better) wake word system to do this:
Have users record a few samples each of the wake word
Use the better wake word system to start
When the wake word is detected, feed the last 2 seconds or so of audio into Raven and see which template has the highest probability – this is your speaker
This has the advantage of not really needing training, just a few samples from each speaker. The downside, of course, is that it may fail to detect that a speaker is outside of the template set – this can be mitigated by setting a lower bound on the template probability, at least.
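Roughly, the matching step looks like this (a simplified sketch of the idea, not Raven’s actual code, using python_speech_features for MFCCs and a plain DTW; the distance threshold is arbitrary):

```python
# Simplified sketch (not Raven's actual code): match the wake word audio against
# a few MFCC templates per speaker with dynamic time warping, pick the closest.
import numpy as np
from python_speech_features import mfcc  # pip install python_speech_features


def dtw_distance(a, b):
    """Plain O(n*m) DTW over two MFCC sequences (frames x coefficients)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)  # crude length normalization


def identify_speaker(wake_audio, templates, sample_rate=16000, max_distance=50.0):
    """templates: {"sam": [mfcc_array, ...], "alex": [...]}, a few recordings each."""
    features = mfcc(wake_audio, samplerate=sample_rate)
    best_name, best_dist = None, np.inf
    for name, speaker_templates in templates.items():
        for template in speaker_templates:
            dist = dtw_distance(features, template)
            if dist < best_dist:
                best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None  # None = unknown speaker
```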
Wow, thank you for responding @synesthesiam, I’m a big fan of your work. I honestly have no idea how you found the time to work on all of this!
Here’s what I’m working on. I was trying to use VOSK speaker identification (https://alphacephei.com/vosk/models/vosk-model-spk-0.4.zip). This is an x-vector system, and the idea, as I understand it, is that you train a classifier on as much data as you can, then strip off the final one-hot classification layer, and you are left with the vector of features that the network was using to identify speakers. The assumption is that these values will match pretty closely across recordings by the same speaker, so the usual recommendation is the cosine-distance measure that was generally used to turn GMM i-vectors into an identification.

Unfortunately, this just does not work well enough for more than a couple of people, and even then you have to adjust the sensitivity based on the number of samples you have. My goal is to use this system for a family assistant, where there might be four or five core users (the household) and a number of friends and relatives (maybe 50 people). Voices within this group, brothers or sisters for example, might be fairly similar.
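For anyone following along, the basic VOSK x-vector extraction plus cosine distance looks roughly like this (adapted from the vosk-api speaker example; model paths and file names are placeholders):

```python
# Extract x-vectors with VOSK and compare them with cosine distance
# (adapted from the vosk-api speaker example; paths are placeholders).
import json
import wave
import numpy as np
from vosk import Model, SpkModel, KaldiRecognizer

model = Model("model")                      # regular VOSK ASR model directory
spk_model = SpkModel("vosk-model-spk-0.4")  # speaker model directory


def cosine_distance(x, y):
    x, y = np.asarray(x), np.asarray(y)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


def extract_xvector(wav_path):
    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(model, wf.getframerate(), spk_model)
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)
    result = json.loads(rec.FinalResult())
    return result.get("spk")  # 128-dim x-vector (None if too little speech)


enrolled_sam = extract_xvector("sam_enrollment.wav")
unknown = extract_xvector("unknown_command.wav")
print("distance to Sam:", cosine_distance(enrolled_sam, unknown))
```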
I like your idea of using the wake word. If this is truly capturing the wake word, then it would allow you to use text-dependent recognition, which should have some advantages. In my case, I’m capturing blocks of audio and then processing them, so I’m looking to do text-independent recognition first, then eventually even diarization.
I started from this project: Portfolio/Voice Classification at master · jurgenarias/Portfolio · GitHub. The author achieved a very impressive 99.8% recognition rate on LibriSpeech data, and I was able to reproduce the results in TensorFlow thanks to the author kindly including their Jupyter notebooks. I’m translating the project from TensorFlow to Torch right now. The next step will be to try to produce a set of x-vectors, and then I am planning to build a small perceptron network to replace the missing categorization layer. Finally, I want to see how applying the same treatment to the x-vectors produced by VOSK works, anticipating that, since the VOSK team has a lot more resources and experience, it should produce even better results.
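The perceptron head I have in mind is nothing fancy, just a small classifier over the fixed-size x-vectors. Something like this in Torch (the 128-dim input and five speakers are assumptions, not final):

```python
# Small perceptron head over fixed-size speaker embeddings (x-vectors).
# Dimensions are assumptions: 128-dim vectors, a handful of known speakers.
import torch
import torch.nn as nn


class SpeakerHead(nn.Module):
    def __init__(self, embedding_dim=128, num_speakers=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, num_speakers),
        )

    def forward(self, x):
        return self.net(x)  # raw logits; use CrossEntropyLoss during training


def train(head, xvectors, labels, epochs=50):
    """xvectors: FloatTensor (N, 128); labels: LongTensor (N,) of speaker indices."""
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(xvectors), labels)
        loss.backward()
        opt.step()
    return head
```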
Beyond that, I am working on a registration system where the computer will ask the user who they are if the voice doesn’t appear to match any of the currently enrolled voices, and then use voice samples from that interaction to train the perceptron layer.
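In rough form, the registration flow I’m aiming for looks like this (the similarity threshold and the ask_user_for_name prompt are placeholders):

```python
# Sketch of the registration flow: if no enrolled speaker is close enough,
# ask who is speaking and enroll the new embedding. Threshold is a placeholder.
import numpy as np

enrolled = {}  # name -> list of embeddings (np.ndarray)


def cosine_similarity(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


def best_match(embedding):
    scores = {
        name: max(cosine_similarity(embedding, e) for e in vectors)
        for name, vectors in enrolled.items()
    }
    return max(scores.items(), key=lambda kv: kv[1]) if scores else (None, -1.0)


def handle_utterance(embedding, ask_user_for_name, threshold=0.75):
    name, score = best_match(embedding)
    if score < threshold:
        name = ask_user_for_name()  # e.g. prompt via TTS: "Who is speaking?"
    enrolled.setdefault(name, []).append(embedding)  # keep samples for later retraining
    return name
```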
Thanks! I probably spend too much time working on these kinds of things instead of responding to people
Your description gave me an idea: consider using the Google speech embedding model that openWakeWord uses as features for speaker classification. I also have an implementation with models. This embedding model was trained on a massive amount of audio, and produces highly salient features for voice tasks.
The basic flow is audio → mels → embeddings. For wake word detection, a linear model is trained at the end with some number of embeddings (representing a window of audio) to output a probability. In your case, this would just need to be a classifier over possible speakers.
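Concretely, once you have the embeddings for each enrollment clip (however you extract them, e.g. with openWakeWord’s feature utilities), the speaker classifier can be as simple as a linear model over a flattened window of them. A quick sketch with made-up shapes and random stand-in data:

```python
# Sketch: given per-clip embedding windows already extracted with the speech
# embedding model, speaker classification is a linear model over the flattened window.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed shapes: each clip gives a (16, 96) window of embeddings
# (16 embedding frames, 96 features each); flatten to one vector per clip.
def to_vector(embedding_window):
    return np.asarray(embedding_window).flatten()


# embeddings_by_clip / labels would come from your enrollment recordings;
# random arrays here are only stand-ins so the sketch runs end to end.
embeddings_by_clip = [np.random.rand(16, 96) for _ in range(6)]
labels = ["sam", "sam", "sam", "alex", "alex", "alex"]

X = np.stack([to_vector(e) for e in embeddings_by_clip])
clf = LogisticRegression(max_iter=1000).fit(X, labels)

new_clip = np.random.rand(16, 96)  # stand-in for a new utterance's embeddings
print(clf.predict([to_vector(new_clip)]))
```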