Speaker identification

Hi all!
Has anyone ever had any experience with speaker recognition/identification/verification?

Kaldi provides examples to do this using i-vectors and x-vectors, but it's a bit above my pay grade :sweat_smile:

Anyway, this could be an awesome addition to Rhasspy’s features, allowing for user control and keeping the kids from wreaking havoc in the house.

Cheers :blush:


I haven’t tested this yet, but with Snips I have three wake words, even with the same phonemes but different voices, and I can run different actions depending on who asks for something.

Using Snips or Snowboy personal models is indeed a way to do this, but I’m not comfortable using a solution that is not open source.

I’d like a self-trainable solution without dependencies on a remote third-party service that can disappear at any time. Snips is gone (even if its packages are still functional for now) and Snowboy does not seem to be maintained anymore. How long until their APIs are deprecated? And as the kids grow up, their voice prints are surely going to change.

Kaldi’s recipe is worth a try, but maybe someone with expertise in this field can help create a new Rhasspy service for it.

Wasn’t aware Snowboy wasn’t maintained!
Indeed, we need a viable, robust solution!

Look at this:

I tried it but was not able to make it work, unfortunately… Have you succeeded in using it?

I also had a problem; I think it’s the VAD that’s the trouble.
In a recordings folder:
- create a marie folder and put a WAV of Marie’s voice in it
- create a polo folder and put a WAV of Polo’s voice in it
All without silence (with sox I do:
/usr/bin/sox -t alsa MONmic /home/poppy/MyProgram/tmp/marie.wav silence 1 0.1 1% 2 1.0 5% 4t
changing MONmic to your mic device), then try it with Python on a test.wav.
Example to adapt:

import os
from piwho import recognition

CURRENT_DIR_PATH = os.path.dirname(os.path.realpath(__file__))
DATA_DIR_PATH = os.path.join(CURRENT_DIR_PATH, 'recordings/')

def find_speaker():
    # record a 5-second test WAV next to the training recordings
    os.system("arecord -d 5 -f cd -t wav " + DATA_DIR_PATH + 'test.wav')
    recog = recognition.SpeakerRecognizer()
    name = recog.identify_speaker(DATA_DIR_PATH + 'test.wav')
    print(name[0])  # recognized speaker
    print(name[1])  # second-best speaker
    dictn = recog.get_speaker_scores()
    print(dictn)
    if name[0] == 'polo':
        print("c'est polo qui parle!")  # "it's Polo talking!"
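For clarity, the enrollment layout described above (one folder per speaker under recordings/) can be sketched like this; the folder names follow the post, and the empty files are only stand-ins for real silence-trimmed WAVs:

```python
# Hypothetical sketch of the per-speaker enrollment layout piwho trains from.
import os

for speaker in ("marie", "polo"):
    os.makedirs(os.path.join("recordings", speaker), exist_ok=True)
    # in practice this would be a sox-trimmed recording of that speaker
    open(os.path.join("recordings", speaker, speaker + ".wav"), "wb").close()

print(sorted(os.listdir("recordings")))  # ['marie', 'polo']
```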

Thanks. I’ll try that :blush:

Mycroft Precise only works for a single wake word and does not distinguish between speakers.

I’m thinking that a NN trained on a wake word uttered by each speaker should be able to identify, in a single shot, both the triggering of the wake word and the associated speaker (even if the wake word is identical for multiple speakers).

I’m experimenting with a CNN model, feeding it 160 frames (1 s) of 13-dimensional Mel spectrogram features… My model trains, but detection is really poor. I’ll be adding more data to my dataset to see whether that improves accuracy or whether it’s just a dead end.
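For reference, the input shape described above (1 s of audio as 160 frames of 13 features) can be sketched in plain NumPy. This is only an illustrative stand-in: the function name, the frame/hop sizes, and the simple FFT-band pooling are assumptions, not a real mel filterbank or the poster's actual TensorFlow.js pipeline.

```python
# Sketch: frame 1 s of 16 kHz audio into the (160, 13) feature matrix the
# post describes, using crude log band energies as mel-spectrogram stand-ins.
import numpy as np

SAMPLE_RATE = 16000
N_FRAMES = 160                      # 160 frames per second, as in the post
HOP = SAMPLE_RATE // N_FRAMES       # 100 samples between frame starts
FRAME_LEN = 400                     # 25 ms analysis window (assumed)
N_BANDS = 13                        # feature dimension from the post

def mel_like_features(signal):
    """Return a (160, 13) matrix of log band energies for a 1 s signal."""
    feats = np.zeros((N_FRAMES, N_BANDS))
    spec_bins = FRAME_LEN // 2 + 1
    # split the FFT bins into 13 contiguous bands (a real pipeline would
    # use overlapping triangular mel filters instead)
    edges = np.linspace(0, spec_bins, N_BANDS + 1, dtype=int)
    for t in range(N_FRAMES):
        frame = signal[t * HOP:t * HOP + FRAME_LEN]
        if len(frame) < FRAME_LEN:                      # pad the last frames
            frame = np.pad(frame, (0, FRAME_LEN - len(frame)))
        mag = np.abs(np.fft.rfft(frame * np.hanning(FRAME_LEN)))
        for b in range(N_BANDS):
            feats[t, b] = np.log(mag[edges[b]:edges[b + 1]].sum() + 1e-8)
    return feats

# 1 s of noise stands in for a recorded wake word
audio = np.random.default_rng(0).standard_normal(SAMPLE_RATE)
features = mel_like_features(audio)
print(features.shape)  # (160, 13) -- the input shape fed to the CNN
```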

If anyone has some expertise in TensorFlow (I’m using TensorFlow.js), Mel spectrograms, MFCCs, etc., please chime in, as I’m a little out of my league :sweat_smile:.

VoxCeleb runs a competition every year and also provides datasets.

But the hard parts are feature extraction and classification.


Some dude wrote a little about it in this.


It’s a bit of a puzzle where to put the classification system; KWS would seem logical, since that’s what initiates everything.
But KWS is supposed to be a light, always-on listener, so maybe it should just pass the keyword audio on to a profile classification stage.
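The split suggested above (a light always-on KWS that hands the keyword audio to a separate profile classifier) could look roughly like this. Both stages are deliberately trivial stubs (energy-based), and all the names are made up; real implementations would wrap actual KWS and speaker models:

```python
# Sketch: KWS runs on every audio chunk; only on a detection is the same
# buffer passed downstream to the (heavier) speaker classification stage.
import numpy as np

class KeywordSpotter:
    """Stub 'light always-listening' detector: fires on loud audio."""
    def __init__(self, threshold=0.1):
        self.threshold = threshold

    def detect(self, chunk):
        return np.sqrt(np.mean(chunk ** 2)) > self.threshold

class SpeakerClassifier:
    """Stub profile classifier: picks the enrolled profile whose stored
    reference energy is closest to the keyword audio's energy."""
    def __init__(self, profiles):
        self.profiles = profiles  # name -> reference RMS energy

    def classify(self, keyword_audio):
        energy = np.sqrt(np.mean(keyword_audio ** 2))
        return min(self.profiles, key=lambda n: abs(self.profiles[n] - energy))

def on_audio(chunk, kws, classifier):
    """KWS first; only a keyword hit reaches the classification stage."""
    if kws.detect(chunk):
        return classifier.classify(chunk)
    return None

kws = KeywordSpotter(threshold=0.1)
clf = SpeakerClassifier({"marie": 0.3, "polo": 0.9})
loud = np.full(16000, 0.8)       # stand-in for a spoken keyword
print(on_audio(loud, kws, clf))  # -> polo (closest stored energy)
```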

I can’t remember where I saw it (or the URL), but there was a dataset of genders and ages; I think it was one of the emotion dataset providers.
But gender classification, if there are just two people of opposite sex, should work really well.

TensorFlow and MobileNet have some examples that seem quite easy to implement; if you convert audio to MFCC spectrograms, you could likely use the code verbatim.

As a follow up, see: KWS small lean and just enough for fit for purpose