Raven in production


After several tests with Raven, with music playing and so on, I thought I had something good for hotword detection.
So I finally decided to test it in production!

Well, it lasted less than 30 minutes before it was back in the test room, turned off…

I really don’t know what to do.
Everything is working nicely regarding intents. The LEDs have some problems with MQTT messages and HLC, but let’s see.

The problem is that I can’t get a wake word to run reliably.
I was at around 0.55 for sensitivity, VAD 2 so it can hear me without my standing 50 cm from the mic, average templates true, minimum matches 1.

Then a normal conversation between three people triggered the wake word every minute. I went as high as 0.85 for sensitivity and still got a lot of false positives.

Does anyone use Raven in a production environment, and what are your settings? Can you listen to music without false positives? Can you trigger the wake word from 4 or 5 meters, like with Snips?
I’m on an RPi 4 with a ReSpeaker 2-Mic and three custom Raven wake words.

Rhasspy really seems like an amazing solution, but without a good wake word engine there is no way to use it. Snips is back in production :cold_face:

OK, after more investigation, I may have found a bug!

"wake": {
        "raven": {
            "keywords": {
                "xxx": {
                "nicolas": {
                    "average_templates": true,
                    "minimum_matches": 1,
                    "probability_threshold": 1
            "probability_threshold": "0.41",
            "udp_audio": "localhost:12240:salle",
            "vad_sensitivity": "1"
        "system": "raven"

3 custom wake words, but:
sensitivities set to 1 in the wake word params
general Raven probability_threshold set to 0.41
-> everything always triggers

sensitivities set to 1 in the wake word params
general Raven probability_threshold set to 1
-> nothing triggers

sensitivities set to 0.4 in the wake word params
general Raven probability_threshold set to 1
-> nothing triggers

So it looks like the general Raven probability_threshold is used as the sensitivity for all custom wake words. The per-wake-word param should be used instead, so we could have a different sensitivity for each.

Also, the general one is a string (from saving in the UI) and the wake word ones are integers (from the doc example).

@synesthesiam would it be possible to get clarification on this, please?

From the doc:

The wake.raven.keywords object contains a key for each wake/keyword and their individual settings. If you don't specify a setting, the value under wake.raven is used instead.

From which I understood that the wake.raven.keywords setting has priority.
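
The precedence described in that doc excerpt can be sketched roughly as below. This is only an illustration of the documented behaviour, not Rhasspy's actual code: `effective_setting` is a hypothetical helper, and the keyword "other" is made up to show the fallback case. It also converts values to float, since the UI saves strings while the doc example uses plain numbers.

```python
def effective_setting(profile, keyword, name, default=None):
    """Return the per-keyword value if present, else the value under wake.raven.

    Hypothetical helper that mirrors the documented precedence only.
    """
    raven = profile.get("wake", {}).get("raven", {})
    keyword_settings = raven.get("keywords", {}).get(keyword, {})
    value = keyword_settings.get(name, raven.get(name, default))
    # The web UI saves numbers as strings ("0.41"), so normalise to float.
    return None if value is None else float(value)


profile = {
    "wake": {
        "raven": {
            "keywords": {
                "nicolas": {"probability_threshold": 1},
                "other": {},  # made-up keyword with no individual settings
            },
            "probability_threshold": "0.41",
        },
        "system": "raven",
    }
}

print(effective_setting(profile, "nicolas", "probability_threshold"))  # 1.0
print(effective_setting(profile, "other", "probability_threshold"))    # 0.41
```

With this reading, "nicolas" should get its own threshold and only keywords without a setting should fall back to the general value, which is the opposite of what the tests above showed.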



I can confirm your findings. A few weeks ago when I was training my different wake words, I observed the same issue (but didn’t post it yet).
So in order to get it running, I manually removed the individual settings for each wake word; my config now looks like the one below.
Of course that does not solve the problem described above.

"wake": {
    "raven": {
        "keywords": {
            "dagobert": {},
            "default": {},
            "elrond": {},
            "katecholamin": {},
            "oberon": {}
        },
        "minimum_matches": "1",
        "probability_threshold": "0.5",
        "udp_audio": "",
        "vad_sensitivity": "1"
    },
    "satellite_site_ids": "RhasspySat1",
    "system": "raven"
}

Another strange thing.

Here is the log from the Rhasspy UI:

[DEBUG:2020-12-18 11:23:08,154] rhasspyserver_hermes: <- HotwordDetected(model_id='/profiles/fr/raven/nicolas/example-0.wav', model_version='', model_type='personal', current_sensitivity=0.5, site_id='studio', session_id=None, send_audio_captured=None, lang=None)

Note the current_sensitivity=0.5.

I have no general Raven sensitivity set, and my per-wake-word sensitivity is set to 0.42 :face_with_monocle:

@synesthesiam is it just a logging thing, or are all Raven wake word sensitivities simply ignored?

I should be reporting the probability threshold (sensitivity) for each individual wake word. There may be a bug here; I will need to investigate more.




I have a Pi 4 as master and a Pi 3B+ as satellite.
Both run up-to-date Raspbian Buster with Rhasspy 2.5.9 in Docker.
Both use Raven with 3 custom wake words, average templates, minimum matches 1, VAD sensitivity 1, and UDP audio.

The Pi 4 reacts in less than a second after I say the wake word; the Pi 3 takes around 4 to 5 seconds to turn on the LEDs, play the sound, and go into the listening state.

Is this normal? After dropping the Pi Zero, should I also drop the Pi 3 for Rhasspy :face_with_monocle:

EDIT: after some time, Raven wakes as fast as on the Pi 4. Strange, maybe some loading takes more time on the Pi 3.

Instead of dropping the Pi3, I need to rewrite Raven to be faster. I’ve asked around, but no one has been interested in porting it to C++ (or something like Julia).

I might get to it myself one of these days; I’m currently down the rabbit hole with speeding up Larynx’s text to speech so it’s actually usable :stuck_out_tongue:



I don’t think python-speech-features is that performant for MFCC. It’s a vague memory from Sonopy perf testing, which also compared against Librosa, but I think python-speech-features was the worst.
I never did check requirements.txt to see which version of python-speech-features was used, but it was the slowest.

If you created an MFCC pipeline that was used first by the VAD and then passed to Raven, you would get the benefits of a neural-net VAD with less overhead than now, as a low-latency CNN will likely be less load than WebRtcVad.

There is a duplication in audio processing between Raven’s MFCC and the very similar filterbanks of webrtcvad; this could be done once, in an MFCC pipeline that both use.
You would get less load and a better VAD that can also be custom trained.
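
A rough sketch of that "compute the features once, use them twice" idea, under loud assumptions: this is not Raven or webrtcvad code, a log power spectrum stands in for real MFCCs, and `SharedFrontEnd`, `energy_vad` and `dummy_kws` are hypothetical names made up for the illustration.

```python
import numpy as np


def frame_features(audio, frame_length=400, frame_step=160):
    """Split audio into overlapping frames and return log power spectra.

    Stand-in for an MFCC front end; a real pipeline would add mel
    filterbanks and a DCT on top of this.
    """
    n_frames = 1 + (len(audio) - frame_length) // frame_step
    idx = np.arange(frame_length)[None, :] + frame_step * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(frame_length)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)


class SharedFrontEnd:
    """Runs feature extraction once, then hands the same array to VAD and KWS."""

    def __init__(self, vad, kws):
        self.vad = vad
        self.kws = kws
        self.feature_passes = 0  # counts how often features were computed

    def process(self, audio):
        features = frame_features(audio)
        self.feature_passes += 1
        if not self.vad(features):
            return False  # no voice: the keyword spotter never runs
        return self.kws(features)


def energy_vad(features):
    # Placeholder VAD: a neural net would go here.
    return float(features.mean()) > -5.0


def dummy_kws(features):
    # Placeholder keyword spotter that accepts anything the VAD lets through.
    return True


rng = np.random.default_rng(0)
front_end = SharedFrontEnd(energy_vad, dummy_kws)
silence = np.zeros(16000)
noise = rng.normal(0.0, 0.1, size=16000)
print(front_end.process(silence), front_end.process(noise), front_end.feature_passes)
```

The point of the design is only that the expensive framing and FFT happen once per chunk, whatever ends up consuming the features.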

For MFCC, both TensorFlow and PyTorch now have built-in math libs. I haven’t done any specific perf tests, but I’m pretty sure they are much lighter.

As for pytorch.audio, hopefully this is wrong, but last time I tested it needed either intel_mkl or Nvidia math libs, and on ARM without a Jetson you’re stuffed.
Maybe I was compiling it wrong, but for me it kept bombing out on a Pi, complaining about missing intel_mkl libs.



There are some things that still puzzle me. Precise as a KWS is confusing, as it uses full TensorFlow for inference rather than converting and quantising for tensorflow-lite.
In essence it’s using the full training framework for KW inference, while the accuracy loss is minuscule compared to the load reduction of using tensorflow-lite.

I am also completely bemused as to why you would port Raven to C++ or Julia, as the brightest minds of the biggest corporations and academic institutes of the world shelved the methods of Raven several years ago.
Raven is not a clever Snips method; it’s just something they adopted, as I have seen similar methods that predate the Snips era.
It’s great to be able to quickly record your voice and create an instant profile, but in terms of accuracy and load it’s a cul-de-sac, which is why that style of method has generally been abandoned.
It’s also confusing because the full TensorFlow framework does run on a Pi, so you can train on a Pi, and a small dataset with a limited training run still beats the WER (Word Error Rate) of Raven.

You can take, say, 10 recordings of the KW and, with pitch shift, stretch, padding movement and noise, quickly create several hundred KW samples.
The training time for a TensorFlow GRU NN to beat Raven is not long at all, and from that point it’s not a cul-de-sac, as the user can choose more involved training for more accuracy.
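
A minimal sketch of that kind of augmentation, assuming 1-second clips at 16 kHz held as NumPy arrays. The helper names are hypothetical and sine tones stand in for real recordings. Note that `change_speed` is a crude resampler that changes pitch and duration together; independent pitch shift and time stretch would need something like a phase vocoder.

```python
import numpy as np


def shift_with_padding(clip, offset):
    """Move the keyword inside the clip, padding the gap with silence."""
    out = np.zeros_like(clip)
    if offset >= 0:
        out[offset:] = clip[:len(clip) - offset]
    else:
        out[:offset] = clip[-offset:]
    return out


def change_speed(clip, rate):
    """Resample by linear interpolation: rate > 1 is faster and higher pitched.

    The result is trimmed or zero-padded back to the original length.
    """
    n = len(clip)
    positions = np.arange(0, n - 1, rate)
    stretched = np.interp(positions, np.arange(n), clip)
    out = np.zeros(n, dtype=clip.dtype)
    m = min(n, len(stretched))
    out[:m] = stretched[:m]
    return out


def augment(clip, n_variants, rng, max_shift=1600, rates=(0.9, 1.0, 1.1), noise_std=0.005):
    """Create n_variants copies with random speed, position and added noise."""
    variants = []
    for _ in range(n_variants):
        v = change_speed(clip, rng.choice(rates))
        v = shift_with_padding(v, int(rng.integers(-max_shift, max_shift + 1)))
        v = v + rng.normal(0.0, noise_std, size=len(v)).astype(clip.dtype)
        variants.append(v)
    return np.stack(variants)


rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
# Placeholder "recordings": 10 sine tones instead of real keyword takes.
recordings = [np.sin(2 * np.pi * (200 + 20 * i) * t).astype(np.float32) for i in range(10)]
dataset = np.concatenate([augment(r, 30, rng) for r in recordings])
print(dataset.shape)  # (300, 16000)
```

Ten takes times thirty variants gives the "several hundred" samples mentioned above; the shift range, speed rates and noise level are arbitrary values to tune by ear.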

Also, training is a stepwise process that can be paused and restarted at any time, and it could be triggered as a constant idle task that uses usage data for constant improvement.

It’s absolutely bat shit crazy to port Raven when the best minds have ported the state of the art to C++ in frameworks such as TensorFlow and PyTorch.
Apols Batman, but why you would port Raven is as much a bemusement as Precise running on full TensorFlow while running on embedded Linux.

@rolyan_trauts When there is a solution that:

  • works better than Raven
  • runs on small devices
  • is as simple to use as Raven

This would make me very happy. Sadly right now we do not have this solution and so we work with the things we do have.
Raven faces a performance problem that could probably be improved by porting a big part of it to a compiled language, so it is an improvement over the current status quo. I personally do not think it is crazy to improve the current situation just because there might be a better solution at some point in time.


An equally easy-to-use alternative, if you’re using Node-RED, is node-red-contrib-personal-wake-word, which wraps the great node-personal-wakeword by @fastjack, which itself is based on the Snips personal wake word principles. (Disclaimer: I’m the maintainer of the Node-RED integration.)


Yes Raven is mostly a Python port of node-personal-wakeword


I know, just wanted to throw it into the mix :+1:t2:

That’s my point: there are solutions that work better than Raven, namely tensorflow-lite or PyTorch Mobile. They run on small devices and would work exactly the same, without the cul-de-sac of bad WER.
One thing is for sure: Raven will not run on small devices, as it would have to be ported to each specific platform, while the major frameworks already exist on nearly all platforms and all you have to do is create your model.

We do have this solution. We all have it, as the likes of Google and Facebook provide it with their frameworks; it’s just not being used.

It’s crazy, as the methods Raven uses are antiquated and there has been a whole boom of AI technology that is better and more accurate, with less load.
So yeah, spend time porting something worse and ignoring what is better, while believing that is rational, if you wish.

Mycroft Precise would be far leaner than Raven; all it needs is for the model to be converted to tensorflow-lite and quantised down.
It will not run on TF4Micro, as RNNs such as LSTM and GRU are not in the supported model subset.
Other models will.

PS: When it comes to small devices, the ESP32-S3 software libs are looking really impressive.

@rolyan_trauts, I agree that a neural net solution is going to ultimately be the best, especially with the performance improvements TensorFlow and PyTorch have gotten on ARM (mostly arm64).

Have you specifically found a TF/PyTorch KWS project that actually trains on a dozen or so examples on a Pi 3/4, and has good performance (both CPU cost and low false alarms)? Everything I see ends up needing one or more of:

  • A large dataset (usually Google Speech Commands)
  • A good amount of “not wake word” examples too (Precise)
  • An hour or more of training for good accuracy on a desktop CPU (Precise)

Maybe Precise has gotten better recently, but training its GRU from scratch needed quite a few examples. I actually Skyped with Mycroft’s CEO and the creator of Precise last year, and they said this was the number one complaint they had with it. Of course, they’re going for “universal” vs. “personal” wake words with it mostly.

I know you were looking into Linto at one point; is this what you were able to train on a Pi quickly with few examples?


Also, if you don’t believe I’m all in on the neural networks, I’m currently installing water blocks on my new and used GPUs :stuck_out_tongue:


Yes I find their libs really impressive as well. Sadly last time I checked it was not possible to create wakewords yourself. (Things might have changed since I looked at it around a year ago)

Yeah, that is a big problem, as esp-sr, esp-who and esp-skainet are not open source, which is also batshit crazy; we will have to wait and see.

The vector instructions on the ESP32-S3 look like they may even be capable of running inference simultaneously on both cores, and the LX7 looks like it hits and exceeds a sweet spot for edge KWS.


I think there is something hooky with their dataset or something, as with a good unique KW and a loss bias weighted towards false negatives you should easily be able to create an accurate KWS.
The problem is that people believe the Google Speech Commands dataset makes a good training set. It doesn’t, as it’s specifically a testing set: it contains approx. 10% bad recordings, trims and padding, so that you can compare how well different KWS implementations cope.

That is not the point, though: that dataset and Mycroft try to encompass all users, genders and accents in a single model, but with Raven you don’t have to do that, and you can create a much smaller dataset from the specific samples provided.

I have stayed with TF, as PyTorch seems to be making some odd vendor lib choices that are biased towards Intel and Nvidia; at least pytorch.audio seems to be that way.

There is a whole repo from Google with just about every KWS model you can think of:
Streaming Aware neural network models

With tensorflow-lite they all run, as TFL is 100% like for like. It’s tensorflow4microcontrollers that is a subset, and so far, due to its lack of recurrent layers, a DS-CNN is my only tested and working model there.
A Pi can run either full TensorFlow or tensorflow-lite, and recurrent LSTM and GRU layers are supported.
But a basic CNN from the current TensorFlow tutorial can outperform Raven’s WER if it’s presented with a decent dataset, which might be a specific recording or the Google Command Set with inference run on itself and bad samples pruned out.
There has also been some really bad methodology, with unnormalised noise and KW mixed without any knowledge of the SNR levels produced, and I think much of their results is down to bad dataset and training methodology.
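
For contrast, mixing at a known SNR only takes a few lines. A minimal sketch, assuming keyword and noise are equal-length NumPy arrays; `mix_at_snr` and `measured_snr_db` are hypothetical helper names, and the signals are synthetic stand-ins.

```python
import numpy as np


def mix_at_snr(keyword, noise, snr_db):
    """Scale the noise so keyword power / noise power equals snr_db, then add it.

    Returns the mixture and the scaled noise that was added.
    """
    kw_power = np.mean(keyword ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = kw_power / (10 ** (snr_db / 10))
    scaled = noise * np.sqrt(target_noise_power / noise_power)
    return keyword + scaled, scaled


def measured_snr_db(keyword, noise):
    """SNR in dB from the mean power of each signal."""
    return 10 * np.log10(np.mean(keyword ** 2) / np.mean(noise ** 2))


rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
keyword = 0.3 * np.sin(2 * np.pi * 300 * t)   # stand-in for a keyword clip
noise = rng.normal(0.0, 1.0, size=16000)      # unnormalised noise, any level

mixture, scaled_noise = mix_at_snr(keyword, noise, snr_db=10)
print(round(measured_snr_db(keyword, scaled_noise), 3))
```

Because the noise is rescaled relative to the keyword's own power, the original level of the noise recording no longer matters, and each sample can be labelled with the SNR it was mixed at.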

Yeah, I really like the HMG of Linto because it has a GUI with visual results and tests for false negatives/positives that actually give you feedback on where you might be going wrong.

90% of the training of a model is eking out a 1-2% increase over what the first 10% might achieve, but on a desktop Linto HMG will produce a Google Command Set based KWS in 5-10 minutes, depending on your hardware.
Keras optimises as it goes along, but mainly it keeps retraining and merely picks the lotto winner with the best results.
The Mycroft MFCC lib Sonopy is actually broken and always has been, so it’s no wonder.
Run it yourself and play with the parameters: you can get it to beat Librosa by 400% in performance, and that’s not because it’s wonder code, it’s because it’s broken, and you had a wasted Skype call.

Precise is the most oddly named KWS, but for training you should balance your KW and !KW in number. Your KW should be fairly unique with a good number of phones; 3 is good, as 'Hey Mycroft', 'Hey Google' and 'Alexa' all are.
Your !KW should try to contain all phones for your language, but if you are going to start with personal, as you do with Raven, you should be able to proceed with an extremely small dataset. The length of training is a choice, as some might not be so bothered about those last fractions of accuracy.

The WER with Raven is really high and its ability to cope with noise is low, and you could quickly train a NN with a small dataset that would be comparable to Raven.
Read out your KW and a couple of phonetic pangrams for your language, and extract the words for !KW.

Again, it’s down to methodology, but if you are doing personal, that means both KW and !KW, and with some pitch and padding shifting you can quickly make a dataset to match your KW.

(venv) pi@raspberrypi:~/googlekws/simple_audio_tensorflow $ python3 simple_audio_mfcc_frame_length1024_frame_step512.py
Commands: ['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
Number of total examples: 10114
Number of examples per label: 1000
Example file tensor: tf.Tensor(b'data/mini_speech_commands/left/4beff0c5_nohash_1.wav', shape=(), dtype=string)
Training set size 6400
Validation set size 800
Test set size 800
Run time 0.13212920299997677
Run time 0.6761154660000557
2021-02-14 10:06:37.656765: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-02-14 10:06:37.659695: W tensorflow/core/platform/profile_utils/cpu_utils.cc:116] Failed to find bogomips or clock in /proc/cpuinfo; cannot determine CPU frequency
Input shape: (30, 13, 1)
Run time 5.733527788999936
Model: "sequential"
Layer (type)                 Output Shape              Param #
normalization (Normalization (None, 30, 13, 1)         3
conv2d (Conv2D)              (None, 28, 11, 32)        320
conv2d_1 (Conv2D)            (None, 26, 9, 64)         18496
max_pooling2d (MaxPooling2D) (None, 13, 4, 64)         0
dropout (Dropout)            (None, 13, 4, 64)         0
flatten (Flatten)            (None, 3328)              0
dense (Dense)                (None, 128)               426112
dropout_1 (Dropout)          (None, 128)               0
dense_1 (Dense)              (None, 10)                1290
Total params: 446,221
Trainable params: 446,218
Non-trainable params: 3
Epoch 1/1000
100/100 [==============================] - 63s 601ms/step - loss: 1.8292 - accuracy: 0.3305 - val_loss: 1.0994 - val_accuracy: 0.6363
Epoch 2/1000
100/100 [==============================] - 30s 299ms/step - loss: 1.1535 - accuracy: 0.5955 - val_loss: 0.8443 - val_accuracy: 0.7175
Epoch 3/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.9190 - accuracy: 0.6740 - val_loss: 0.6876 - val_accuracy: 0.7775
Epoch 4/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.7892 - accuracy: 0.7273 - val_loss: 0.6035 - val_accuracy: 0.7987
Epoch 5/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.6533 - accuracy: 0.7606 - val_loss: 0.5486 - val_accuracy: 0.8100
Epoch 6/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.6117 - accuracy: 0.7831 - val_loss: 0.4823 - val_accuracy: 0.8500
Epoch 7/1000
100/100 [==============================] - 30s 300ms/step - loss: 0.5309 - accuracy: 0.8207 - val_loss: 0.4395 - val_accuracy: 0.8612
Epoch 8/1000
100/100 [==============================] - 30s 300ms/step - loss: 0.4771 - accuracy: 0.8333 - val_loss: 0.4316 - val_accuracy: 0.8612
Epoch 9/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.4371 - accuracy: 0.8485 - val_loss: 0.3950 - val_accuracy: 0.8763
Epoch 10/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3972 - accuracy: 0.8630 - val_loss: 0.3770 - val_accuracy: 0.8850
Epoch 11/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3604 - accuracy: 0.8745 - val_loss: 0.3590 - val_accuracy: 0.8938
Epoch 12/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3476 - accuracy: 0.8784 - val_loss: 0.3630 - val_accuracy: 0.8850
Epoch 13/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3174 - accuracy: 0.8832 - val_loss: 0.3481 - val_accuracy: 0.8888
Epoch 14/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.3106 - accuracy: 0.8928 - val_loss: 0.3483 - val_accuracy: 0.9050
Epoch 15/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2803 - accuracy: 0.9049 - val_loss: 0.3573 - val_accuracy: 0.8875
Epoch 16/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2600 - accuracy: 0.9064 - val_loss: 0.3422 - val_accuracy: 0.9025
Epoch 17/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2419 - accuracy: 0.9138 - val_loss: 0.3672 - val_accuracy: 0.8900
Epoch 18/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2296 - accuracy: 0.9213 - val_loss: 0.3688 - val_accuracy: 0.8900
Epoch 19/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.2125 - accuracy: 0.9234 - val_loss: 0.3620 - val_accuracy: 0.8975
Epoch 20/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1991 - accuracy: 0.9227 - val_loss: 0.3705 - val_accuracy: 0.8963
Epoch 21/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1880 - accuracy: 0.9331 - val_loss: 0.3890 - val_accuracy: 0.9000
Epoch 22/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1780 - accuracy: 0.9355 - val_loss: 0.3813 - val_accuracy: 0.9013
Epoch 23/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1744 - accuracy: 0.9380 - val_loss: 0.3512 - val_accuracy: 0.9087
Epoch 24/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1588 - accuracy: 0.9452 - val_loss: 0.3666 - val_accuracy: 0.8938
Epoch 25/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1539 - accuracy: 0.9453 - val_loss: 0.3481 - val_accuracy: 0.9025
Epoch 26/1000
100/100 [==============================] - 30s 299ms/step - loss: 0.1572 - accuracy: 0.9423 - val_loss: 0.3882 - val_accuracy: 0.9050
Epoch 00026: early stopping
Run time 856.116808603
Test set accuracy: 90%
Predictions for "no"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
 'duration.py'] tf.Tensor(
[9.5505527e-07 7.6688975e-02 4.6016984e-03 8.7443608e-01 6.1507470e-09
 4.1621499e-02 1.2028771e-03 7.7026243e-08 1.4477677e-03 6.7970650e-11], shape=(10,), dtype=float32)
Predictions for "hey-marvin"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
 'duration.py'] tf.Tensor(
[1.4215793e-23 3.6277120e-28 7.7683000e-33 4.2175242e-28 1.0000000e+00
 1.9384911e-25 4.4126632e-22 6.7758306e-20 2.7613047e-26 1.4793614e-35], shape=(10,), dtype=float32)
Predictions for "left"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
 'duration.py'] tf.Tensor(
[6.2980646e-08 4.8209145e-08 7.2652750e-07 7.3871712e-07 3.9954323e-10
 9.1625368e-10 9.9980372e-01 7.3275419e-06 1.8732932e-04 1.6446810e-13], shape=(10,), dtype=float32)
Predictions for "go"
['up' 'down' 'stop' 'no' 'hey-marvin' 'go' 'left' 'right' 'yes'
 'duration.py'] tf.Tensor(
[2.7044329e-11 7.6081965e-06 3.2807577e-05 4.8282389e-03 4.0133222e-12
 9.9512523e-01 5.3841272e-06 2.5338602e-09 6.5998506e-07 1.4413574e-15], shape=(10,), dtype=float32)
Run time 869.929035417

simple_audio_mfcc_frame_length1024_frame_step512.py was just a rough test I did out of interest in the new math libs of TensorFlow.
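
The numbers in the log can be reproduced with a bit of arithmetic. This sketch assumes 1-second clips at 16 kHz and takes the frame length and step from the script name; the 3x3 kernels, 'valid' padding and 2x2 pooling are inferred from the output shapes in the model summary, not read from the script itself. The tenth class is the stray 'duration.py' entry visible in the label list.

```python
SAMPLE_RATE = 16000   # assumption: 1 s clips at 16 kHz
FRAME_LENGTH = 1024   # from the script name
FRAME_STEP = 512      # from the script name
N_MFCC = 13           # second dimension of the logged input shape
N_CLASSES = 10        # 9 commands plus the stray 'duration.py' label


def n_frames(n_samples, frame_length, frame_step):
    """Number of full frames when the signal is not padded at the end."""
    return 1 + (n_samples - frame_length) // frame_step


def conv_out(size, kernel=3):
    """Output size of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a square 2D convolution."""
    return kernel * kernel * in_ch * out_ch + out_ch


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


frames = n_frames(SAMPLE_RATE, FRAME_LENGTH, FRAME_STEP)       # 30
h = conv_out(conv_out(frames)) // 2                            # 26 -> 13 after pooling
w = conv_out(conv_out(N_MFCC)) // 2                            # 9 -> 4 after pooling
flat = h * w * 64                                              # 3328

trainable = (
    conv_params(3, 1, 32)
    + conv_params(3, 32, 64)
    + dense_params(flat, 128)
    + dense_params(128, N_CLASSES)
)
print(frames, flat, trainable)  # 30 3328 446218
```

Almost all of the parameters sit in the first dense layer (426,112 of 446,218), so that is where a smaller model or quantisation would pay off most.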

But the above is a CNN running on a Pi 4, and yeah, with 1000 of each label it takes approx. 14 minutes to train.
It uses the Keras framework with accuracy as the metric and early stopping with a patience of 10, meaning that if there is no improvement over a run of 10 epochs, it stops and takes the last best.
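
The patience rule can be checked against the log above. `stopped_epoch` is a hypothetical helper that counts the way a patience-based early stop does (assuming a strict improvement is required); the two lists are the val_loss and val_accuracy columns copied from the 26 epochs in the log.

```python
def stopped_epoch(values, patience, mode="min"):
    """Return (epoch at which training stops or None, epoch of the best value)."""
    best = None
    best_epoch = 0
    wait = 0
    for epoch, value in enumerate(values, start=1):
        if mode == "min":
            improved = best is None or value < best
        else:
            improved = best is None or value > best
        if improved:
            best, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


val_loss = [
    1.0994, 0.8443, 0.6876, 0.6035, 0.5486, 0.4823, 0.4395, 0.4316, 0.3950,
    0.3770, 0.3590, 0.3630, 0.3481, 0.3483, 0.3573, 0.3422, 0.3672, 0.3688,
    0.3620, 0.3705, 0.3890, 0.3813, 0.3512, 0.3666, 0.3481, 0.3882,
]
val_accuracy = [
    0.6363, 0.7175, 0.7775, 0.7987, 0.8100, 0.8500, 0.8612, 0.8612, 0.8763,
    0.8850, 0.8938, 0.8850, 0.8888, 0.9050, 0.8875, 0.9025, 0.8900, 0.8900,
    0.8975, 0.8963, 0.9000, 0.9013, 0.9087, 0.8938, 0.9025, 0.9050,
]

print(stopped_epoch(val_loss, patience=10, mode="min"))      # (26, 16)
print(stopped_epoch(val_accuracy, patience=10, mode="max"))  # (None, 23)
```

Only the val_loss column reproduces the "Epoch 00026: early stopping" line, so the run looks like it was watching validation loss (the default for Keras `EarlyStopping`) rather than accuracy, with the best loss at epoch 16.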

It’s a one-off training need and yes, it is 14 minutes of your life, but it’s a lot less painful than the cost of water-cooled GPUs, and obviously it is also much faster on a desktop even without a GPU.

Once that model is trained, it can be converted to tensorflow-lite and quantised down so it’s extremely efficient and fast on a Pi.
That model can be shipped and reused, so it only needs a single training for multiple devices, and people could share models if they so wished.
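
To give a feel for what "quantised down" buys, here is a small NumPy illustration of 8-bit affine quantisation. It is not how the tensorflow-lite converter is invoked or implemented; random weights stand in for the trained dense layer from the summary above (3328 x 128 = 425,984 weights, the layer's 426,112 parameters minus its 128 biases), and the helper names are hypothetical.

```python
import numpy as np


def quantise_int8(weights):
    """Affine-quantise a float array to int8; returns (q, scale, zero_point)."""
    w_min = min(float(weights.min()), 0.0)
    w_max = max(float(weights.max()), 0.0)
    scale = (w_max - w_min) / 255.0
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantise(q, scale, zero_point):
    """Map int8 values back to approximate float32 weights."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
# Stand-in for the trained dense layer weights.
weights = rng.normal(0.0, 0.05, size=(3328, 128)).astype(np.float32)

q, scale, zero_point = quantise_int8(weights)
restored = dequantise(q, scale, zero_point)
max_error = float(np.abs(restored - weights).max())

print(weights.nbytes, q.nbytes)   # 1703936 425984
print(max_error < scale)
```

The stored weights shrink to a quarter of their size, and each restored weight is off by at most about one quantisation step, which is why the accuracy loss tends to be small.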

It could use usage data to constantly add to a dataset and retrain as a background idle task.

It’s both KW and VAD, as both could be neural nets, which has many advantages from load to accuracy. They could support supplied universal models or specific custom ones, with a custom NN VAD not only being able to distinguish voice but ‘your’ voice.

After training with TensorFlow, convert the model to tensorflow-lite and use TFL for inference.

Official releases are a bit slow and here is another community repo.



https://github.com/linto-ai/linto-desktoptools-hmg/tree/master got updated: they dropped Sonopy and now use Librosa.
There is also talk that they may adopt multiple NN types (CRNN) with it, but currently it’s a GRU like Precise, and the difference it makes compared to Precise is extremely interesting.
Also to be added are some noise addition routines; that is currently a manual process.

I think you could install it on a Pi with a desktop, but it’s aimed at x86; I have had it working on Ubuntu 20.04, and it’s also really interesting to get the visual feedback on the Google Command Set.
If you train and test and weed out badly labelled items, it’s also interesting to be able to click, play and listen to why the model thinks an item is bad.
You get a real feel for what can be detrimental to accuracy, and a few pruning runs will greatly increase overall accuracy.

In the Google Command Set ver. 2.0 only the label ‘visualise’ is 3 phones long, but maybe my ‘Hey Marvin’ might be a good choice.
Just like the Google Command Set it needs some pruning, so don’t assume everything is correct; I think some clips are somehow longer than 1 sec, but that is easy to correct.



There are also the raspberry samples added to “Record 'Raspberry' & 'RaspberryPi' for a distributed dataset”; it would be great if others would add a few more raspberries to the collection :slight_smile:
