Speaker identification

fastjack · February 29, 2020, 2:25pm

Hi all!
Has anyone ever had any experience with speaker recognition/identification/verification?

Kaldi provides examples to do this Using i-vectors and x-vectors but it is a bit over my pay grade

Anyway, this could be an awesome addition to Rhasspy’s features to allow for user control and avoid the kids from wreaking havoc in the house.

Cheers

KiboOst · February 29, 2020, 2:30pm

I’ve not tested this yet but with snips I have three wake words, even same phonèmes but different voice and I can run different actions according to who ask something.

fastjack · February 29, 2020, 2:45pm

Using Snips or Snowboy personal models are indeed a way to do this but I’m not confortable using a solution that is not open sourced.

I’d like a self trainable solution without dependencies on a remote third party service that can disappear at any time. Snips is gone (even if their packages are still functional for now) and Snowboy does not seem to be maintained anymore. How long until their API is deprecated? With the kids growing up their voice print is surely gonna change.

Kaldi’s recipe is worth a try but maybe someone with expertise in this field can help create a new Rhasspy service for this.

KiboOst · February 29, 2020, 3:10pm

Wasn’t aware snowboy wasn’t maintained !!
Indeed we need a viable robust solution !

kookic · February 29, 2020, 4:57pm

Hi,
Look this

fastjack · February 29, 2020, 6:16pm

I tried it and was not able to make it work unfortunately… Have you succeeded in using it?

kookic · February 29, 2020, 7:10pm

I also had a problem I think it’s vad that shit.
in a recordings folder
create marie folder and put a wav of marie’s voice
create a polo folder and put a polo voice wav
all without silence (with sox I do:
/ usr / bin / sox -t alsa MONmic /home/poppy/MyProgram/tmp/marie.wav silence 1 0.1 1% 2 1.0 5% 4t
change MONmic)
and try with python with a test.wav
Example to adapt

import …
from piwho import recognition
CURRENT_DIR_PATH = os.path.dirname(os.path.realpath(file))
DATA_DIR_PATH = os.path.join(CURRENT_DIR_PATH, ‘recordings/’)

def find_speaker():
# save WAV file
wavefile = “test.wav”
os.system(“arecord -d 5 -f cd -t wav “+DATA_DIR_PATH+‘test.wav’)
recog = recognition.SpeakerRecognizer()
name = []
name = recog.identify_speaker(DATA_DIR_PATH+‘test.wav’)
print(name[0]) # Recognized speaker
print(name[1]) # Second best speaker
dictn = recog.get_speaker_scores()
print (dictn)
#{‘ABC’:‘0.838262623’,‘CDF’:‘1.939837286’}
if name[0] == ‘polo’:
print (” c’est polo qui parle!”)

fastjack · February 29, 2020, 8:39pm

Thanks. I’ll try that

fastjack · May 9, 2020, 7:02pm

Mycroft Precise only work for a single wakeword and does not segregate speakers.

I’m thinking that a NN trained on a wakeword uttered by each speaker should be able to identify in a single shot the triggering of a wakeword and the associated speaker (even if the wakeword are identical for multiple speakers).

I’m experimenting with a CNN model feeding it 160 frames (1s) of 13 dimensions Mel spectrogram… My model trains but the detection is really poor. I will be adding more data to my dataset to see if it improves the accuracy or if it is just a dead end.

If anyone has some expertise in Tensorflow (I’m using Tensorflow.js), Mel Spectrograms, MFCCs etc as I’m a little out of my league .

rolyan_trauts · May 9, 2020, 8:15pm

Voxceleb do a comp every year aswell as datasets.
http://www.robots.ox.ac.uk/~vgg/data/voxceleb/

But feature extraction and classification.

https://librosa.github.io/librosa/feature.html

Some dude wrote a little about it in this.

http://dl.booktolearn.com/ebooks2/computer/python/9781725656659_Introduction_to_Voice_Computing_in_Python_7a37.pdf

Its a bit of a confusion where to have the classification system as KWS would seem logical as that initiates.
But KWS is supposed to be this light always listener so maybe it should pass the keyword on to profile classification scheme.

I can not remember where I saw it or URL but there was a dataset of gender and ages think it was one of the emotion dataset providers.
But gender classification if there is just 2 of oppisite sex should work really well.

Tensorflow and mobilenet have some examples and seem quite easy to implement if you convert audio to MFFC spectograms then likely you could use the code verbatum.

fastjack · June 17, 2020, 3:03pm

As a follow up, see: KWS small lean and just enough for fit for purpose

Vandaag · June 16, 2021, 8:04pm

I known that this subject is over a year old, but are there any plans for this.

I have played with piwho in the past, but I have no idea how to integrate it as piwho expect a single wav and I do not known how to turn that to a stream listener via mqtt. I was wondering if there are ways to save the audio from the wakeword service and from the asr so that piwho could parse it?

Regards,

Richard

rolyan_trauts · June 16, 2021, 9:42pm

I have been keeping a lazy eye on speaker verification and some in process methods do speaker verification but the error rate is high.
I think the last state-of-art method I read was ‘Speaker identification is hard’.

I think like @fastjack that much can be attained by a known KW and that either individual models of speakers runs or a model with multiple KW of speakers.

It is pretty easy to capture the last wakeword as most pass and operate on wavs as numpy arrays so connect to the source and create a ringbuffer so that when its triggered you have a wav sample of the last KW utterance as on KW hit you just transfer that to Piwho.

But if the KWS is custom based on custom training then it can be made to also provide identification or multiple models with best kw hit can.
Tensorflow-lite and google now create highly accurate models that run in less than 20% load of a single core on a Pi3 so there is plenty of scope for multiple models.

You can try GitHub - StuartIanNaylor/Dataset-builder: KWS dataset builder for Google-streaming-kws as like you I no longer develop anymore but have spent some time thinking of that much can be provided by simply providing custom trained models.
So it prompts on screen words and records for a dataset and augments what is a small number to create the often several thousand used with a model.

The clever bits are tensorflow and the Googleresearch streaming KWS framework that create small footprint, low resource models.
I did a bit of a repo just to help with install GitHub - StuartIanNaylor/g-kws: Google Streaming KWS Arm Install Guide but really all you need is google-research/kws_streaming at master · google-research/google-research · GitHub

There is so much that you can provide by KW and a streaming KWS model as I have also been thinking you can steer a beamformer by KW in a very similar way as identification.
The streaming KWS envelope passes back a pretty accurate hit envelope that can be used to extract a wav from a ringbuffer to pass to a TDOA to get steering co-ords.

You need to record on the capture device of use as any deviation reduces accuracy but custom datasets and training especially for KW is quite easy and relatively quick when compared with the likes of ASR & TTS.

You can create a simple ringbuffer with a numpy.roll and just shift left and tack on the new audio on the right.
But as in python someone with much more skill has created a much better optimised module such as numpy_ringbuffer · PyPI

g-kws/tfl-stream.py at main · StuartIanNaylor/g-kws · GitHub is one of my non pythonic hacks chunking audio to a model.

Sanebow has created a much more elegant and pythonic API wrapper.