Speaker identification

Hi all!
Has anyone ever had any experience with speaker recognition/identification/verification?

Kaldi provides examples to do this using i-vectors and x-vectors, but it is a bit above my pay grade :sweat_smile:

Anyway, this could be an awesome addition to Rhasspy’s features, allowing per-user control and keeping the kids from wreaking havoc in the house.

Cheers :blush:


I’ve not tested this yet, but with Snips I have three wake words — even the same phonemes, just different voices — and I can run different actions according to who asks for something.

Using Snips or Snowboy personal models is indeed a way to do this, but I’m not comfortable using a solution that is not open source.

I’d like a self-trainable solution without dependencies on a remote third-party service that can disappear at any time. Snips is gone (even if its packages are still functional for now) and Snowboy does not seem to be maintained anymore. How long until their APIs are deprecated? As the kids grow up, their voice prints are surely going to change.

Kaldi’s recipe is worth a try but maybe someone with expertise in this field can help create a new Rhasspy service for this.

Wasn’t aware Snowboy wasn’t maintained!
Indeed, we need a viable, robust solution!

Hi,
Have a look at this:

I tried it and was not able to make it work unfortunately… Have you succeeded in using it?

I also had a problem; I think it’s the VAD causing it.
In a recordings folder:
create a marie folder and put a WAV of Marie’s voice in it
create a polo folder and put a WAV of Polo’s voice in it
all without silence (with sox I do:
/usr/bin/sox -t alsa MONmic /home/poppy/MyProgram/tmp/marie.wav silence 1 0.1 1% 2 1.0 5% 4t
and you change MONmic to your own mic device).
Then try it from Python with a test.wav.
Example to adapt:

import os

from piwho import recognition

CURRENT_DIR_PATH = os.path.dirname(os.path.realpath(__file__))
DATA_DIR_PATH = os.path.join(CURRENT_DIR_PATH, 'recordings/')

def find_speaker():
    # Record a 5-second WAV file to test with.
    wavefile = os.path.join(DATA_DIR_PATH, 'test.wav')
    os.system('arecord -d 5 -f cd -t wav ' + wavefile)
    recog = recognition.SpeakerRecognizer()
    name = recog.identify_speaker(wavefile)
    print(name[0])  # Recognized speaker
    print(name[1])  # Second best speaker
    dictn = recog.get_speaker_scores()
    print(dictn)
    # {'ABC': '0.838262623', 'CDF': '1.939837286'}
    if name[0] == 'polo':
        print("It's Polo speaking!")

Thanks. I’ll try that :blush:

Mycroft Precise only works with a single wakeword and does not distinguish between speakers.

I’m thinking that a NN trained on a wakeword uttered by each speaker should be able to identify, in a single shot, both the triggering of the wakeword and the associated speaker (even if the wakeword is identical for multiple speakers).

I’m experimenting with a CNN model, feeding it 160 frames (1 s) of a 13-dimensional Mel spectrogram… My model trains but the detection is really poor. I will be adding more data to my dataset to see if it improves the accuracy or if it is just a dead end.

If anyone has some expertise in TensorFlow (I’m using TensorFlow.js), Mel spectrograms, MFCCs, etc., any help would be welcome, as I’m a little out of my league :sweat_smile:.
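For reference, here is a minimal sketch (in Python/Keras rather than TensorFlow.js, and with made-up class names and layer sizes, not the poster’s actual model) of the kind of small CNN described above, taking a 160 x 13 Mel feature window and classifying it as wakeword-by-speaker or not-the-wakeword:

# Hedged sketch of a small CNN for joint wakeword + speaker classification.
# Class names, layer sizes and hyperparameters are illustrative assumptions.
import tensorflow as tf

NUM_FRAMES = 160    # ~1 second of audio at a 10 ms hop
NUM_FEATURES = 13   # Mel / MFCC coefficients per frame
CLASSES = ["kw_marie", "kw_polo", "not_kw"]  # hypothetical labels

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FRAMES, NUM_FEATURES, 1)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(len(CLASSES), activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

The softmax head over per-speaker wakeword classes is what would let a single model do detection and identification in one shot.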

VoxCeleb run a competition every year, as well as providing datasets.
http://www.robots.ox.ac.uk/~vgg/data/voxceleb/

For feature extraction and classification:

https://librosa.github.io/librosa/feature.html

Some dude wrote a little about it in this.

http://dl.booktolearn.com/ebooks2/computer/python/9781725656659_Introduction_to_Voice_Computing_in_Python_7a37.pdf

It’s a bit unclear where the classification system should sit. KWS would seem logical, as that is what initiates everything.
But KWS is supposed to be a lightweight always-on listener, so maybe it should pass the keyword audio on to a separate profile classification scheme.

I cannot remember where I saw it or the URL, but there was a dataset of genders and ages; I think it was one of the emotion dataset providers.
Gender classification, if there are just two speakers of opposite sex, should work really well.

TensorFlow and MobileNet have some examples that seem quite easy to implement; if you convert audio to MFCC spectrograms, you could likely use the code verbatim.
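As a rough illustration (the file name and parameter choices are just assumptions), converting a WAV into an MFCC "image" with librosa, linked above, looks something like this:

# Rough sketch: turn a short WAV into a normalised MFCC matrix so it can be
# fed to an image-style CNN. Paths and parameters are only example values.
import librosa
import numpy as np

wav_path = "test.wav"                         # hypothetical input file
audio, sr = librosa.load(wav_path, sr=16000)  # resample to 16 kHz mono

mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
# Normalise to 0..1 so it can be treated like a grayscale image.
mfcc = (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min() + 1e-9)

print(mfcc.shape)  # (13, number_of_frames)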

As a follow up, see: KWS small lean and just enough for fit for purpose

:+1:

I know that this subject is over a year old, but are there any plans for this?

I have played with piwho in the past, but I have no idea how to integrate it, as piwho expects a single WAV and I do not know how to turn that into a stream listener via MQTT. I was wondering if there are ways to save the audio from the wakeword service and from the ASR so that piwho could parse it?

Regards,

Richard

I have been keeping a lazy eye on speaker verification; some in-process methods do speaker verification, but the error rate is high.
I think the gist of the last state-of-the-art paper I read was essentially "speaker identification is hard".

Like @fastjack, I think much can be attained with a known KW: either individual models per speaker, or a single model with multiple KWs, one per speaker.

It is pretty easy to capture the last wakeword, as most systems pass around and operate on WAVs as NumPy arrays. Connect to the source and keep a ring buffer, so that when the KWS triggers you have a WAV sample of the last KW utterance; on a KW hit you just transfer that to piwho.
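A rough sketch of that capture-and-identify flow, assuming 16 kHz 16-bit mono audio and the piwho API shown earlier in the thread (buffer length and file paths are made up):

# Hedged sketch: keep the last ~1.5 s of audio in memory, and on a KW hit
# write it out as a WAV and hand it to piwho for identification.
import collections
import wave

import numpy as np
from piwho import recognition

SAMPLE_RATE = 16000
BUFFER_SECONDS = 1.5
ring = collections.deque(maxlen=int(SAMPLE_RATE * BUFFER_SECONDS))

def feed_audio(chunk_int16):
    """Call this for every incoming audio chunk (int16 NumPy array)."""
    ring.extend(chunk_int16)

def on_wakeword_hit(wav_path="last_kw.wav"):
    """On a KW hit, dump the ring buffer to a WAV and identify the speaker."""
    samples = np.array(ring, dtype=np.int16)
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(samples.tobytes())
    recog = recognition.SpeakerRecognizer()
    name = recog.identify_speaker(wav_path)
    return name[0]                   # best-matching speaker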

But if the KWS is based on custom training, it can be made to provide identification as well, or multiple models (one per speaker) can, by taking the best KW hit.
TensorFlow Lite and the Google tooling now create highly accurate models that run at less than 20% load of a single core on a Pi 3, so there is plenty of scope for running multiple models.

You can try GitHub - StuartIanNaylor/Dataset-builder: KWS dataset builder for Google-streaming-kws. Like you, I no longer develop much anymore, but I have spent some time thinking about how much can be provided simply by custom-trained models.
It prompts words on screen, records them for a dataset, and augments what is a small number of recordings into the several thousand samples often used to train a model.
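The augmentation step amounts to something along these lines (the function and parameters here are only illustrative, not the Dataset-builder code):

# Illustrative sketch: stretch a small set of KW recordings into many training
# samples with a random time shift plus background noise at a random level.
import numpy as np

def augment(kw, noise, max_shift=1600, noise_gain_range=(0.05, 0.3)):
    """kw and noise are float32 arrays at the same rate; noise is at least as long as kw."""
    # Random circular time shift of up to max_shift samples (0.1 s at 16 kHz).
    shift = np.random.randint(-max_shift, max_shift + 1)
    shifted = np.roll(kw, shift)
    # Cut a random slice of background noise the same length as the keyword.
    start = np.random.randint(0, len(noise) - len(kw) + 1)
    gain = np.random.uniform(*noise_gain_range)
    mixed = shifted + gain * noise[start:start + len(kw)]
    # Keep the result in the valid -1..1 range.
    return np.clip(mixed, -1.0, 1.0)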

The clever bits are TensorFlow and the Google Research streaming KWS framework, which create small-footprint, low-resource models.
I did a bit of a repo just to help with the install, GitHub - StuartIanNaylor/g-kws: Google Streaming KWS Arm Install Guide, but really all you need is google-research/kws_streaming at master · google-research/google-research · GitHub

There is so much that you can provide with a KW and a streaming KWS model; I have also been thinking you can steer a beamformer by KW in a very similar way to identification.
The streaming KWS passes back a pretty accurate hit envelope that can be used to extract a WAV from a ring buffer and pass it to a TDOA estimator to get steering coordinates.
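For the TDOA step, a textbook GCC-PHAT estimate between two mic channels is roughly this (not code from the repo, just the standard formulation):

# Minimal GCC-PHAT sketch: estimate the delay of one mic channel relative to
# another for the extracted KW window.
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None):
    """Return the estimated time delay (seconds) of sig relative to ref."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # Phase transform: keep only phase information in the cross-spectrum.
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)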

You need to record on the capture device you will actually use, as any deviation reduces accuracy, but custom datasets and training, especially for KW, are quite easy and relatively quick compared with the likes of ASR and TTS.

You can create a simple ring buffer with numpy.roll: just shift left and tack the new audio on the right.
But, as usual in Python, someone with much more skill has created a better-optimised module, such as numpy_ringbuffer · PyPI
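A tiny illustration of both approaches, with arbitrary example sizes:

# numpy.roll approach vs. the numpy_ringbuffer package mentioned above.
# Buffer length and chunk size are arbitrary example values.
import numpy as np
from numpy_ringbuffer import RingBuffer

BUFFER_LEN = 16000          # 1 s at 16 kHz
buf = np.zeros(BUFFER_LEN, dtype=np.int16)

def push_with_roll(chunk):
    """Shift the buffer left and tack the new chunk on the right."""
    global buf
    buf = np.roll(buf, -len(chunk))
    buf[-len(chunk):] = chunk

# The same idea with the optimised module:
rb = RingBuffer(capacity=BUFFER_LEN, dtype=np.int16)
rb.extend(np.zeros(512, dtype=np.int16))   # append a chunk
latest = np.array(rb)                       # snapshot of the buffered audio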

g-kws/tfl-stream.py at main · StuartIanNaylor/g-kws · GitHub is one of my non-pythonic hacks chunking audio into a model.

Sanebow has created a much more elegant and pythonic API wrapper.