Wake word creation -- Snowman anyone?

Please try and keep discussions civil.

For most users, I expect they will need a KWS system that is well integrated into Rhasspy – where they can just pick it from a list and click a few buttons. This is always the goal, as most people are not Linux command-line ninjas.

But it’s also fair to point out that there are great options outside of what’s currently supported in Rhasspy. And since @greg_dickson didn’t hint at their level of technical expertise, it’s possible that something like gKWS could be just what they’re looking for.

1 Like

There is absolutely no point in this rude answer, man.

A question is asked from a USER perspective and you post about something not supported by Rhasspy.
That, imo, is really what is pointless here.

If you cannot even read what the user is actually asking, please do not respond with this KWS nonsense.

@synesthesiam
That is what I don’t get: Rhasspy is web based, so a KWS shouldn’t need to be forced through the horrid mess of Hermes audio just to be part of the web control.
Continuing to force the highly integrated Rhasspy protocol on a few users just negates what could be a more interoperable and modular system, one with the advantage of being used by a much larger herd.

G-kws is by Google, and the same datasets can run on an ESP32, which is another reason I keep mentioning it: it is pointless to keep supporting a whole load of KWS engines that each have their own hardware and specifics, whilst a single framework can run from a single dataset and work on all hardware.

It’s the datasets that are important, and they are something we can do quite easily. But just as forcing highly integrated code ties a common function to a single platform and a small number of users, offering a rake of alternatives splits those pools even smaller, which is poorer for the user.

Thanks guys. Sorry to trigger an argument. Yes everyone has their own path.
My level of technical understanding is pretty high, however I am new to voice and machine learning. We all have our own path to learning and our own passions. No one was born with the knowledge.
My goal would be to have a basic wake word that eventually would allow a voice activated introduction.
It would have general but lowish accuracy at first, plus a training trigger so a new user could train it to their voice.
I thank you all for your input and will get a better understanding as I travel down this rabbit hole.
Thanks again,
Greg

1 Like

This argument is nothing new and happens in quite a few threads concerning wakewords. Don’t be bothered by having started it again.

As for the original topic: since you said you have the technical understanding, I suggest training your own model. One of the starting points I can suggest is using Raven for a while; it gives pretty decent recognition, but its strongest point is saving the wakewords if you configure it right. That will help you out with getting a small dataset started, and you can supplement it with the dataset builder mentioned above.

Once you have a dataset to play with, you can go the more involved route of a Google TensorFlow model like @rolyan_trauts advertises. This has the advantage of being lightweight and portable since it also works on phones, so the resulting model might even work with the Rhasspy mobile app, but with the disadvantage of not being supported by Rhasspy (yet, if I read @synesthesiam’s post right).

The other way would be to play around with Mycroft Precise, which can be used by Rhasspy, but the scripts are outdated and badly maintained, so if you want to use it, it is of utmost importance to play around on an outdated Linux (I use an Ubuntu 18.04 VM; anything newer breaks stuff). I advise looking up the training thread in this forum; I remember correcting the guide in its first post to account for all the problems resulting from the bad maintenance. There is a fork/pull request that says it is a working updated version, and while I could use it without encountering errors, the model the conversion script returned did not work with Rhasspy. The script might be broken, the resulting model too new, or I might have messed up, but I advise not putting too much time into that fork.

Whatever way you go, building a good dataset is always a good thing to do. To further augment my own dataset I built myself a Python script that reads the audio buffer from the MQTT server and saves it once the wakeword is detected, so every activation grows the dataset. If you want to look it up you can find it here, but I am pretty sure it could be done better. The disadvantage of MQTT is that the audio traffic has to go through MQTT, so the option of using UDP until the wakeword is detected is ruled out. It would be great if Rhasspy could save all wakewords and not just Raven’s, but for now we have to make do.
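The core of the idea is only a few lines. A rough sketch (not my actual script): subscribe to the Hermes topics Rhasspy uses, keep a short rolling buffer of audio frames, and dump it whenever a hotword-detected message arrives. The siteId, the buffer length and the paho-mqtt 1.x style callbacks are assumptions to adjust for your own setup.

import collections
import time
from pathlib import Path

import paho.mqtt.client as mqtt

SITE_ID = "default"
AUDIO_TOPIC = f"hermes/audioServer/{SITE_ID}/audioFrame"
DETECT_TOPIC = "hermes/hotword/+/detected"
OUT_DIR = Path("captured_wakewords")
OUT_DIR.mkdir(exist_ok=True)

frames = collections.deque(maxlen=150)   # roughly 3 s if frames arrive every 20 ms

def on_connect(client, userdata, flags, rc):
    client.subscribe([(AUDIO_TOPIC, 0), (DETECT_TOPIC, 0)])

def on_message(client, userdata, msg):
    if msg.topic == AUDIO_TOPIC:
        frames.append(msg.payload)         # each payload is a small WAV chunk, header included
    elif msg.topic.endswith("/detected"):
        out = OUT_DIR / f"kw_{int(time.time())}.raw"
        out.write_bytes(b"".join(frames))  # concatenated chunks; merge/convert when post-processing
        frames.clear()

client = mqtt.Client()                     # paho-mqtt 1.x style client
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)
client.loop_forever()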

1 Like

Raven uses quite an old best-fit style algorithm that is not very accurate and pretty much instantly hits a ceiling in terms of accuracy, while more recordings just add load.

What it is great for is being able to record just a few KW samples and go, and then use it to store KW and !KW samples from actual use.
Currently it can store KW but unfortunately doesn’t keep the following command sentence that could be used for !KW. Maybe the script @Daenara mentions could supplement that.

Most models based on custom datasets can provide really high accuracy; it is the quest to provide a model and KWS that fits all, like the commercial versions, that is probably beyond a small community.

It’s all about your dataset getting enough variation to stop false negatives, and understanding that the dataset is the ultimate decider on accuracy, which can make huge differences. Model technologies have moved from the early DNN of, say, Snowboy at approx 87% accuracy to the latest and greatest, say transformer models, pushing 98%, but they will still return dross with a bad dataset.

GitHub - linto-ai/linto-desktoptools-hmg: GUI Tool to create, manage and test Keyword Spotting models using TF 2.0 was the best visual introduction for me, as it runs through the dataset and provides click-through links to the sample items producing false positives/negatives. It wasn’t much practical use, but it was extremely educational.

I champion the google-kws framework because it comes from a technical source far beyond any of our capabilities and is a working framework that just leaves you to provide a dataset.
TensorFlow uses a static, compiled graph as opposed to PyTorch’s dynamic graph; PyTorch gets far more research work done with it because, unlike TensorFlow, new functionality doesn’t need to be hardcoded in C into layers.
Because it is static, TensorFlow seems to provide smaller, more efficient models, and because Google and Google Research have been specifically focusing on low-end embedded hardware for KWS, for me it makes sense to use what they have released as open source.
From an ESP32 to a CUDA-GPU-armed x86 machine, the G-kws framework can produce a working model from the same dataset, with various model options.

The CRNN streaming model they mention here from the framework google-research/kws_experiments_paper_12_labels.md at master · google-research/google-research · GitHub for me seems the best bet for Arm64.
That is another consideration: really, any project that is heavily AI based should be 64-bit, because it can operate on twice as many 16-bit tensors per clock tick, which results in real-world 2-3x speed improvements just by starting with a different image than Arm32.

If you are going to create a custom dataset and custom KW, always record on the mic (and ideally the device) you will actually use, and you’re going the right way.

I think because of the nature of voice AI, which is idle the majority of the time, it’s very easy to collate a usage dataset, train away while idle, and provide a constant flow of model updates whose accuracy improves through use.

I don’t think any KWS system should be constrained by the protocols and requirements of an ASR/STT/NLU system; it is a project that can feed them all.
It just needs a Rhasspy connector module to convert an audio stream and metadata, so if you wish to stream audio over MQTT you can, whether it’s a bad idea or not (bad, but hey).
The overall open-source voice-AI herd is already far too diluted into small pools of specific systems doing the same task, with far-too-small communities repeating and redeveloping the same thing, where the only real unique feature is the branding applied.

I would have a look at Dataset-builder, as it merely uses sox to apply pitch, tempo and reverb augmentation and so produce multiple dataset samples from a single recorded sample. That is the main problem for KWS: we have a huge wealth of recorded sentence datasets but very few word-based ones for KW systems.
What I have done is concentrate on mixing in background noise at high levels whilst always leaving a predominant foreground sample, but really the inbuilt augmentation of g-kws doesn’t fare much worse. Any dataset builder might be of interest though, as recording and augmenting loads of samples by hand is just a pain.
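For a feel of what that augmentation amounts to, here is a rough sketch of the sox side of it (not the actual Dataset-builder code; the pitch/tempo/reverb ranges are just illustrative, and sox needs to be on the PATH):

import itertools
import subprocess
from pathlib import Path

SOURCE = Path("my_keyword.wav")       # one clean recording of the keyword
OUT_DIR = Path("augmented")
OUT_DIR.mkdir(exist_ok=True)

pitches = [-200, -100, 0, 100, 200]   # pitch shift in cents
tempos = [0.9, 1.0, 1.1]              # speed factor, pitch preserved
reverbs = [0, 20]                     # reverberance percent

for i, (p, t, r) in enumerate(itertools.product(pitches, tempos, reverbs)):
    out = OUT_DIR / f"{SOURCE.stem}_{i:03d}.wav"
    subprocess.run(["sox", str(SOURCE), str(out),
                    "pitch", str(p), "tempo", str(t), "reverb", str(r)],
                   check=True)

Thirty variants from one recording is obviously no substitute for real recordings, but it gets a small dataset off the ground.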

2 Likes

This looks like a nice clean simple forward looking solution.
That just needs work.
I will work with that.
I don’t plan to delve too deeply into this rabbit hole but I will help where I can. Also not wishing to start another argument I will just say I am not a python coder. But once again I will do what I can.
Integration seems to be key here.

OK, just to be clear here, after a very brief look at the system:
if I choose local command in the wake word setup,
I write an executable that simply runs until it gets a valid wake trigger, then fully exits.

So Rhasspy will start up, set itself up, run its web interface, then run my executable and effectively sleep (maintaining the web interface).
When my executable exits, Rhasspy will then move to the next step, i.e. start the recording…

So any wake word / trigger executable could easily be incorporated into the system.

Have I got that correct?
That seems a little too easy. :smiley:
I must have something wrong…

Never tried it but looks like so.

Someone will have to prompt me on how a stream on stdin should be handled, as I’ve just never done it.

import sys
data = sys.stdin.readline()

?

wakewordId to standard out and exit
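Roughly something like this should do, assuming the command system streams raw 16-bit 16 kHz mono PCM to stdin (check the wake-word command docs for the exact format): read fixed-size binary chunks from sys.stdin.buffer rather than readline(), and is_wakeword() is just a hypothetical stand-in for whatever detector you plug in.

import sys
import numpy as np

CHUNK = 2048                # bytes per read: 1024 16-bit samples, about 64 ms at 16 kHz

def is_wakeword(samples):
    # stand-in: replace with a real detector (Raven, Precise, a tflite model...)
    return False

while True:
    raw = sys.stdin.buffer.read(CHUNK)
    if len(raw) < CHUNK:
        break                                # stream closed
    samples = np.frombuffer(raw, dtype=np.int16)
    if is_wakeword(samples):
        print("default", flush=True)         # wakewordId to stdout ('default' is just a placeholder)
        sys.exit(0)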

Thanks Rolyan.
Yes, after a little RTFM:
https://rhasspy.readthedocs.io/en/latest/wake-word/#command
Also, after looking at sleep.sh, it appears to be that easy.
jq tricked me for a moment, but for those that follow:

aptitude show jq

Package: jq
Version: 1.5+dfsg-2
State: not installed
Multi-Arch: foreign
Priority: optional
Section: universe/utils
Maintainer: Ubuntu Developers ubuntu-devel-discuss@lists.ubuntu.com
Architecture: amd64
Uncompressed Size: 90.1 k
Depends: libjq1 (= 1.5+dfsg-2), libc6 (>= 2.4)
Conflicts: jq:i386
Provides: jq:i386 (= 1.5+dfsg-2)
Provided by: jq:i386 (1.5+dfsg-2)
Description: lightweight and flexible command-line JSON processor
jq is like sed for JSON data – you can use it to slice and filter and map and transform
structured data with the same ease that sed, awk, grep and friends let you play with text.

It is written in portable C, and it has minimal runtime dependencies.

jq can mangle the data format that you have into the one that you want with very little effort,
and the program to do so is often shorter and simpler than you’d expect.
Homepage: GitHub - stedolan/jq: Command-line JSON processor

I’m testing Linto, thanks, it’s a good project!!! How can I use the models it produces with Rhasspy or other external software?

The model is a bit more problematic, in that I think a GRU like Precise doesn’t run on TensorFlow Lite.
But basically, to run it on TensorFlow you can use my example tfl-stream as a base, though you will have to create an MFCC frontend feed; the g-kws models embed the MFCC in the model, so with those you can just forward chunked audio.
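Roughly, an external MFCC front end could look like this. It is just a sketch with tf.signal; the window, stride, mel-bin and DCT counts mirror the training script further down, the 20 Hz lower edge and 1024-point FFT are assumptions, and everything would need to match whatever the external model was actually trained with.

import numpy as np
import tensorflow as tf

SAMPLE_RATE = 16000
WINDOW = int(0.040 * SAMPLE_RATE)   # 640 samples = 40 ms
STRIDE = int(0.020 * SAMPLE_RATE)   # 320 samples = 20 ms

def mfcc_frames(audio_float32):
    # audio_float32: 1-D numpy array of mono 16 kHz audio scaled to [-1, 1]
    stfts = tf.signal.stft(audio_float32, frame_length=WINDOW,
                           frame_step=STRIDE, fft_length=1024)
    spectrograms = tf.abs(stfts)
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=40, num_spectrogram_bins=stfts.shape[-1],
        sample_rate=SAMPLE_RATE, lower_edge_hertz=20.0, upper_edge_hertz=7600.0)
    log_mel = tf.math.log(tf.tensordot(spectrograms, mel_matrix, 1) + 1e-6)
    # keep the first 20 DCT coefficients, as with --dct_num_features 20
    return tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :20]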

That Linto HMG is great to get a feel of how samples can affect the model, as it’s extremely interesting and enlightening to see which ones fail.
It’s through Linto HMG that I discovered how many bad samples are in the Google command set, and then it clicked that the Google command set is a benchmark dataset and not necessarily a good dataset to create a model from unless you trim out the bad ones.

The Linto project has an MFCC routine and there is info in their repos; here is mine with just a chunked audio stream.

I could probably knock you up an MFCC version to test that GRU, but after playing with an easy GUI I would suggest going a bit more hardcore with the G-kws models, as the TensorFlow Lite models run with much less load; the difference is pretty huge.

mkdir g-kws
cd g-kws
git clone https://github.com/google-research/google-research.git
mv google-research/kws_streaming .

You can delete the rest of the google-research dir if you wish, as I don’t know why they put everything in one repo.

#!/bin/bash

# Train a streaming CRNN on Speech Commands v2 with 12 labels

KWS_PATH=$PWD
DATA_PATH=$KWS_PATH/data2
MODELS_PATH=$KWS_PATH/models_data_v2_12_labels
CMD_TRAIN="python -m kws_streaming.train.model_train_eval"


$CMD_TRAIN \
--data_url '' \
--data_dir $DATA_PATH/ \
--train_dir $MODELS_PATH/crnn_state/ \
--mel_upper_edge_hertz 7600 \
--how_many_training_steps 2000,2000,2000,2000 \
--learning_rate 0.001,0.0005,0.0001,0.00002 \
--window_size_ms 40.0 \
--window_stride_ms 20.0 \
--mel_num_bins 40 \
--dct_num_features 20 \
--resample 0.15 \
--alsologtostderr \
--train 1 \
--lr_schedule 'exp' \
--use_spec_augment 1 \
--time_masks_number 2 \
--time_mask_max_size 10 \
--frequency_masks_number 2 \
--frequency_mask_max_size 5 \
crnn \
--cnn_filters '16,16' \
--cnn_kernel_size '(3,3),(5,3)' \
--cnn_act "'relu','relu'" \
--cnn_dilation_rate '(1,1),(1,1)' \
--cnn_strides '(1,1),(1,1)' \
--gru_units 256 \
--return_sequences 0 \
--dropout1 0.5 \
--units1 '128,256' \
--act1 "'linear','relu'" \
--stateful 1

You just need to create a dataset and do something like I did in tfl-stream.py g-kws/tfl-stream.py at main · StuartIanNaylor/g-kws · GitHub

You can use the google command set or create your own with the dataset-builder GitHub - StuartIanNaylor/Dataset-builder: KWS dataset builder for Google-streaming-kws or another
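For reference, the Google command set unpacks into one folder per label plus a _background_noise_ folder, and a custom dataset for the framework wants roughly the same shape (the folder names become the labels):

data2/
  _background_noise_/      <- wav clips of noise mixed in during training
  yes/
  no/
  marvin/
  ...                      <- one folder of ~1 s 16 kHz wavs per word
  testing_list.txt
  validation_list.txt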

Have a read of google-research/kws_streaming at master · google-research/google-research · GitHub as it’s extremely well documented and contains just about every current state-of-the-art model for KWS.
The CRNN-state is probably a good start, and google-research/base_parser.py at master · google-research/google-research · GitHub contains all the parameters that, after a bit of head-scratching, should get you going.

PS: if you have 2 mics, I have been playing with the PulseAudio beamformer, which I had ruled out as you could not steer it.
I have had a rethink and can now steer it, so as well as a cutting-edge model you could probably also add beamforming if interested.

google-research/kws_experiments_paper_12_labels.md at master · google-research/google-research · GitHub gives a pretty good guide, and the command set for testing can be downloaded:

# download and set up path to data set V2 and set it up
wget https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz
mkdir data2
mv ./speech_commands_v0.02.tar.gz ./data2
cd ./data2
tar -xf ./speech_commands_v0.02.tar.gz
cd ../

or

# download and set up path to data set V1 and set it up
wget http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
mkdir data1
mv ./speech_commands_v0.01.tar.gz ./data1
cd ./data1
tar -xf ./speech_commands_v0.01.tar.gz
cd ../

But have a look at my dataset-builder repo: I posted as many datasets and noise files as I could find, so even if the builder itself is of no use it might be a good source of datasets, even if there are only a few.

Sanebow’s code is far more polished than anything I have provided.

2 Likes

Argh! Too advanced for me, I will have to study a lot. :slight_smile:

1 Like

Suck it and see is often far more educational.
Just grab the g-kws framework:

mkdir g-kws
cd g-kws
git clone https://github.com/google-research/google-research.git
mv google-research/kws_streaming .

Go with command set 2

# download and set up path to data set V2 and set it up
wget https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz
mkdir data2
mv ./speech_commands_v0.02.tar.gz ./data2
cd ./data2
tar -xf ./speech_commands_v0.02.tar.gz
cd ../

save and run this script

#!/bin/bash

# Train a streaming CRNN on Speech Commands v2 with 12 labels

KWS_PATH=$PWD
DATA_PATH=$KWS_PATH/data2
MODELS_PATH=$KWS_PATH/models_data_v2_12_labels
CMD_TRAIN="python -m kws_streaming.train.model_train_eval"


$CMD_TRAIN \
--data_url '' \
--data_dir $DATA_PATH/ \
--train_dir $MODELS_PATH/crnn_state/ \
--mel_upper_edge_hertz 7600 \
--how_many_training_steps 2000,2000,2000,2000 \
--learning_rate 0.001,0.0005,0.0001,0.00002 \
--window_size_ms 40.0 \
--window_stride_ms 20.0 \
--mel_num_bins 40 \
--dct_num_features 20 \
--resample 0.15 \
--alsologtostderr \
--train 1 \
--lr_schedule 'exp' \
--use_spec_augment 1 \
--time_masks_number 2 \
--time_mask_max_size 10 \
--frequency_masks_number 2 \
--frequency_mask_max_size 5 \
crnn \
--cnn_filters '16,16' \
--cnn_kernel_size '(3,3),(5,3)' \
--cnn_act "'relu','relu'" \
--cnn_dilation_rate '(1,1),(1,1)' \
--cnn_strides '(1,1),(1,1)' \
--gru_units 256 \
--return_sequences 0 \
--dropout1 0.5 \
--units1 '128,256' \
--act1 "'linear','relu'" \
--stateful 1

Just make sure the path to the tflite model is right and run, or just download:

import tensorflow as tf
import sounddevice as sd
import numpy as np
import threading


def sd_callback(rec, frames, time, status):
    global input_details1
    global output_details1
    global inputs1
    global kw_count
    global kw_sum
    global kw_hit
    global kw_avg
    global kw_probability
    global not_kw
    global silence_count
    global silence_hit
    
    # Notify if errors
    if status:
        print('Error:', status)
    
    rec = np.reshape(rec, (1, 320))
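    # 20 ms of 16 kHz mono audio = 320 float32 samples, i.e. one streaming step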
    
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
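    # Class indices below assume the 12-label speech-commands layout:
    # 0 = silence, 1 = unknown, 2 = first wanted word (treated as the keyword here).
    # A custom dataset may order its labels differently.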
     
    if np.argmax(output_data[0]) == 2:
      print(output_data[0][0], output_data[0][1], output_data[0][2], kw_avg, kw_probability, kw_count)
      kw_count += 1
      kw_sum = kw_sum + output_data[0][2]
      kw_avg = kw_sum / kw_count
      kw_probability = kw_avg / 7.5
      silence_count = 0
      if silence_hit == True:
        print('Silence hit')
        silence_hit = False
      if kw_probability > 0.5 and kw_count >= 15:
        kw_hit = True
    elif np.argmax(output_data[0]) == 1:
      not_kw = True
      silence_count = 0
      if silence_hit == True:
        print('Silence hit')
        silence_hit = False
    elif np.argmax(output_data[0]) == 0:
      not_kw = True
      silence_count += 1
      if silence_count >= 100:
        silence_hit = True
      
    if not_kw == True:
      if kw_hit == True:
        print("Kw threshold hit", kw_avg, kw_probability, kw_count)
      kw_count = 0
      kw_sum = 0
      kw_hit = False
      kw_max = 0
      kw_probability = 0
      not_kw = False




# Parameters
word_threshold = 7.5
word_duration = 10
rec_duration = 0.020
sample_rate = 16000
num_channels = 1
kw_avg = 0
kw_count = 0
kw_sum = 0
kw_probability = 0
kw_hit = False
not_kw = False
silence_count = 0
silence_hit = False

sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')


# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="/home/pi/g-kws/models2/crnn_state/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite")
interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    
kw_count = 0
not_kw_count = 0
kw_sum = 0
kw_hit = False



    


# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    threading.Event().wait()

To be honest, their training script and layers are a complete confusion to me, but we don’t need to know what google-research do.
The model just fires back a tensor where each classification is an element in an array; I use argmax() to get the biggest, and that’s it really, apart from building your own dataset.
I have been wondering if there is a better way to get a single probability score from a 1 sec envelope of 20 ms steps, but the above will do.
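One possible alternative, just a sketch (the window length and threshold are guesses that would need tuning against a real dataset): smooth the keyword output over a fixed sliding window instead of the running average and counter above, and fire when the windowed mean clears a threshold.

from collections import deque
import numpy as np

KW_INDEX = 2          # index of the keyword class in the model output
WINDOW_STEPS = 25     # 25 x 20 ms = 0.5 s of streaming steps
THRESHOLD = 5.0       # smoothed-output threshold, tune empirically

recent_kw = deque(maxlen=WINDOW_STEPS)

def smoothed_hit(output_data):
    # call once per inference with the raw output tensor; returns True on a hit
    recent_kw.append(float(output_data[0][KW_INDEX]))
    return len(recent_kw) == WINDOW_STEPS and float(np.mean(recent_kw)) > THRESHOLD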

1 Like

So to put it in a Rhasspy context,
you could use your Python as the external command by simply exiting the script on a valid response (printing the wakewordId to stdout on exit, of course).

Is that correct?
It still sends the WAV over HTTP, but if the script is on the same machine as Rhasspy the overhead is quite small, AFAIK.

Yeah, when it comes to an all-in-one, Rhasspy is pretty good, so apologies for confusing you with opinions about external networked KWS.

I am not sure why Rhasspy focuses on a ‘singular unit’, as to be honest the commercial guys have us beat hands down for much less than we can build one for.
But if you were looking at creating a single brain with multiple ears of distributed KWS, IMO chunking raw WAVs over MQTT so they are broadcast to all subscribers is just plain bad, and that always brings up a contentious argument.
A few don’t like my opinion that mic arrays without DSP are useless, and if I call it Herpes audio it says enough about what I think without further tedium.

If you are going to use the host, pipe in the audio, and pipe out ‘default’ as the wakewordId, it is super easy.

But that is the real point. I find the cost to my privacy way too high, so any private system is cheaper than the alternatives.
It all depends on how far down the particular rabbit hole you wish to travel.
The current incarnation of Rhasspy works. It is easy to configure and run and can be as complex as you like. The external command opens up some doors, and if you wanted satellites you could simply implement a server on the main host that each satellite connects to, independent of Rhasspy. The wake word server could be on the same machine as Rhasspy, so the MQTT stuff is always local. Rhasspy starts the wake word server, which then deals with the wake state however it likes.
Like, say, polling a pin for a push-button event.
Polling over another bus, say I2C, to a movement sensor.
Or even WiFi to another distant trigger client like a phone.
The fact that Rhasspy sends its WAV to the server is only one possible input.
You can implement whatever you like and Rhasspy stays simple and usable within its own little world.
From what I can see, Rhasspy responds specifically to each wakewordId, so the possibilities and flexibility are still there.
I am only new to this and have not got all the facts yet, but it seems to me that Rhasspy offers the best of both worlds.
Even though she might have herpes, as you say. :smiley:
Having herpes myself I can attest to the fact that one can still function normally with the disease. :upside_down_face: You just have to do things differently.

I have had some problems, as for some reason Docker doesn’t match my Pi 4 with the correct image.
I understand this is still beta, but for now I will try another path.

However it is clear to me that rhasspy has a bright future ahead.

Thanks everyone there is some very useful information in this thread alone.
Thanks Daenara I will try Raven just to get me started.
Every time I explore the docs and re-read this thread I learn so much more.
Wow this is so cool.

2 Likes

The KWS is OK, now I am doing some tests. It is still not clear to me how to interface it with Rhasspy; does your latest script do this? I am a programmer, but at a microcontroller level; with high-level languages I am poorly prepared.

Stop right there! Did you say ‘microcontroller level’ !!! :slight_smile:

Oh yeah, and as I will handhold as much as it takes: don’t do a CRNN but a CNN for a microcontroller, see google-research/kws_experiments_paper_12_labels.md at master · google-research/google-research · GitHub

As, if you’re armed with tensorflow/tensorflow/lite/experimental/microfrontend at master · tensorflow/tensorflow · GitHub, then google-kws will train for that front end, or at least there are parameters and code to do so; I’ve just never tried.

And maybe we could be the 1st to squirt a model down, say, an ESP32-WROVER with an I2S mic module?

If you look at bemused client https://github.com/rhasspy/bemused-client/blob/master/bemused_client/main.py
It takes the audio from stdin, and the code is in the above.

Why bemused-client takes a streaming KWS framework and creates a non-streaming model is bemusing to me, and hence you might realise how it got that name.
Probably Sanebow’s is a better option; just steal the stdin methods.

But the same datasets can be trained by the same google-kws-streaming framework with a choice of models, converted and quantised for tflite.
But you can also take the next step, employ the microcontroller code of the Audio “frontend” TensorFlow operations for feature generation, train for that front end, and convert the tflite with XNN to a binary for the ESP32.

I keep thinking this so should be done, as you can get an ESP32-WROVER (it does have the comfort of the 8 MB PSRAM) for < $5 and an I2S mic module for < $3.

So yeah, test away; a Raspberry Pi and embedded Linux is probably the easiest way to get a feel for things.

Bemused-client is just another KWS that only has host capability, but any KWS code can use that simple stdin method.

1 Like