2x Matrix Voice ESP32

I do not actually see a disconnect in this log; I see Rhasspy responding on matrix71, and I also see audio streams from mv75 and mv71.

Your first log, with the timeouts, was the better one.
I have two ESP32 devices, so I will try it shortly :slight_smile:

I have tried with two satellites, but I do not get any disconnects from the MQTT broker.
However, the hotword detection does not work when both devices are connected.
When I disconnect either one, it works as expected.

This seems to be a problem with Rhasspy to me.

hi romkabouter,
Good to hear that I am not the only one for whom two Matrix satellites do not work with Rhasspy. I am already quite desperate. The hotword detection does not work for me either, but at first I thought it was a configuration error in the settings or just a communication error.

Did you see anything in your Rhasspy logs? According to the logs I have seen, no wake word arrives at Rhasspy. Rhasspy is not responding on my end.

Do you think this is a bug in Rhasspy, or do the satellites have to be included in the Rhasspy settings differently? Could the group_separator perhaps remedy this? I have ruled this out for now, as it is meant for satellites that will be used in the same room.

I think Rhasspy does not handle multiple input sources from Hermes correctly; the group_separator is indeed just for grouping.
I will try this with 2.5.10 as well; maybe the issue no longer occurs.

Indeed, no wake word detection in the logs. As soon as I unplug one device, it starts functioning correctly again, without any restarts or anything.

I have tested it with version 2.5.10, unfortunately without success. There is still no hotword detection when both satellites are active. Do you think it would be better if I open a new thread in the error section of the forum, or is this one already the right place?
I will also try whether it helps to mute the microphones of the individual satellites. That is possible in the provided software, and I can also trigger it via MQTT, as you have written on your GitHub page. I will report back. :slightly_smiling_face:

I have not tried with 1 ESP32 and 1 Pi+mic yet.
It seems to work OK for everybody having two or more sats.
The ESP32 software is not much more than an audioFrame pumper :wink:

Hey guys, currently a single Rhasspy instance can only do wake word detection for one satellite at a time. I have mentioned this in DIY Alexa on ESP32 with INMP441.
Perhaps the developers will improve on this.

For now I found a work-around where I run a dedicated Rhasspy instance for each satellite (so it is 1 to 1). All of those report to a single Rhasspy instance for the actual STT and intent processing. You can spin up several Docker containers, so it shouldn't be too difficult.
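For illustration only, here is a minimal sketch of spinning those per-satellite containers up, assuming the rhasspy/rhasspy Docker image and made-up siteIds, ports and volume names; each instance still needs its MQTT and base-station settings configured in its own profile:

```python
# Illustrative sketch only: one Rhasspy container per satellite, plus one base
# instance elsewhere for STT/intent handling. Ports, names and volumes are
# placeholders, not a recommended layout.
import subprocess

SATELLITES = ["matrix71", "matrix75"]  # hypothetical siteIds

for i, site in enumerate(SATELLITES):
    port = 12101 + i  # Rhasspy's default web port, offset per instance
    subprocess.run(
        [
            "docker", "run", "-d",
            "--name", f"rhasspy-{site}",
            "-p", f"{port}:12101",
            "-v", f"rhasspy-{site}-profiles:/profiles",
            "rhasspy/rhasspy",
            "--user-profiles", "/profiles",
            "--profile", "en",
        ],
        check=True,
    )
```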

Let me know if this helps.

Yeah, you did a great job on that, but there is still no ordering by confidence level so that the best satellite signal is used. Because of the first-in nature it depends more on which satellite has the better network connection, and there will likely always be one that is first in, even when its signal is poor in comparison to the others.

The length of time it took someone like you to do something as simple as attaching a voice-AI HMI to a voice-AI system surely must be a WTF about the current methods available.

The total brainfart of pumping uncompressed WAVs into a control & messaging protocol, which needs to be secure and therefore encrypted, becomes an absolute toilet of an idea when you think about multiple mics in multiple rooms.

The ESP32 software is not much more than an audioFrame pumper because its resources are sapped by streaming encrypted 16 kHz uncompressed audio, and because of that there is no headroom left for KWS.
With the plethora of audio protocols and codecs that are far in advance of chunking raw PCM around a system, an approach whose technological origins lie somewhere between the gramophone and the iPad, it might be wiser just to adopt what already exists than this awful branding of herpes.

Satellites need to be partitioned out of Rhasspy; they should have zero requirement to know any proprietary protocol, and should act as general-purpose voice HMIs that can work with any system and be free of system updates and changes.

There should be a server that accepts chunked HTTP connections and runs a separate, isolated MQTT broker for satellite messaging and control, where a templating system allows the protocol exchange to work with anything.
The idea of a satellite should be broken down into the parts that make up a satellite, where display, audio in and audio out are distinct functional objects sharing a common IP.
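Purely to illustrate the proposal (none of this exists in Rhasspy), a minimal chunked-HTTP ingest endpoint might look like this, assuming aiohttp and a made-up zone/mic URL scheme:

```python
# Illustrative sketch only: a chunked-HTTP audio ingest endpoint, assuming
# aiohttp. The URL layout and zone naming are hypothetical, not part of Rhasspy.
from aiohttp import web

async def ingest_audio(request: web.Request) -> web.Response:
    zone = request.match_info["zone"]
    mic = request.match_info["mic"]
    total = 0
    # Read the chunked request body as it arrives instead of buffering a WAV.
    async for chunk in request.content.iter_chunked(4096):
        total += len(chunk)
        # hand the chunk to the KWS / ASR pipeline for this zone here
    return web.Response(text=f"{zone}/{mic}: received {total} bytes")

app = web.Application()
app.add_routes([web.post("/zones/{zone}/{mic}/audio", ingest_audio)])

if __name__ == "__main__":
    web.run_app(app, port=8080)
```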

A simple zone system of folders can represent rooms, and a satellite is just a text file of objects.
You do not need to encrypt command messages from KW to silence, as who cares; and if someone does care, it is their job to employ stunnel or similar, not a requirement that bloats the HMI.
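Again just illustrating the idea, with entirely made-up field names: a zone as a folder and a satellite as a small text file of objects sharing one IP:

```python
# Purely illustrative: a zone is a folder, a satellite is a text file of
# objects sharing one IP. All names and fields here are made up.
import json
from pathlib import Path

satellite = {
    "ip": "192.168.1.42",
    "objects": {
        "mic1":     {"type": "audio_in",  "codec": "opus"},
        "speaker1": {"type": "audio_out", "protocol": "snapcast"},
        "display1": {"type": "display"},
    },
}

zone = Path("zones/livingroom")
zone.mkdir(parents=True, exist_ok=True)
(zone / "satellite1.json").write_text(json.dumps(satellite, indent=2))
```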

Confidence level from a mic object is absolutely essential to make any form of effective multi-mic room setup, and currently it is an afterthought… !!!

Audio out should likely use a protocol such as Squeezelite, Snapcast, AirPlay or another of the many that have had a lot of development in latency adjustment, bandwidth, QoS and control.
Then all the great tools and functions those providers offer are available to the community, and all Rhasspy does is use MQTT to set channel & volume.
Audio in has to have a compression codec, even if it is just Opus. As a system grows it cannot keep chunking many streams of uncompressed 16 kHz PCM continuously, especially when the main carrier is WiFi, and the audio has zero need to be broadcast to everything the way MQTT does.
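As a rough sketch of what compressed audio-in could look like, assuming the opuslib and paho-mqtt Python packages and a hypothetical topic layout (20 ms Opus frames at 16 kHz):

```python
# Hypothetical sketch: compress 16 kHz mono PCM frames with Opus before
# publishing them over MQTT, instead of sending raw WAV chunks.
# Library choices (opuslib, paho-mqtt) and the topic name are assumptions,
# not part of Rhasspy or the ESP32 firmware.
import opuslib
import paho.mqtt.client as mqtt

SAMPLE_RATE = 16000                      # Hz, mono
FRAME_SAMPLES = 320                      # 20 ms at 16 kHz
TOPIC = "zones/livingroom/mic1/audio"    # hypothetical topic layout

encoder = opuslib.Encoder(SAMPLE_RATE, 1, opuslib.APPLICATION_VOIP)
client = mqtt.Client()
client.connect("localhost", 1883)

def publish_pcm(pcm_bytes: bytes) -> None:
    """Encode one 20 ms PCM frame (640 bytes of int16 samples) and publish it."""
    packet = encoder.encode(pcm_bytes, FRAME_SAMPLES)   # typically tens of bytes
    client.publish(TOPIC, packet)
```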

You should be able to place a satellite on a network, enter a MAC address or use some form of Bluetooth enrolment, place it in the zone of use, set the object messaging templates, and that is it.

Then, rather than talking about the missing fundamental needs of simple multi-zone, multi-satellite systems, we could be talking about collecting local usage, learning and training, and shipping models to satellites OTA so that a system gets better through use.

Your work-around is brilliant but wow why is it so?!

I think that only goes for ESP32 devices, not for Rhasspy Satellites :slight_smile:

That is because Rhasspy Raspi satellites do their own wake word detection. I think that any satellite that doesn't do wake word detection will have this issue. So it doesn't have to be an ESP32, but yes, it isn't relevant to Rhasspy Raspi satellites.
If we can make a micro-controller based satellite that does wake word detection then there will be no need for the work around I mentioned.

KWS on an ESP32 is possible and exists, but KWS + VAD on a standard ESP32 is probably not.

I agree on the vision and the desired functionality but it looks like it will require more skills and work time than the community is ready to contribute at this time. So I suggest we take baby steps. Advance one feature at a time. If advancing forward means coming up with an awkward work-around, then that is fine with me.

It is the wrong direction really, hence the harsh words for effect.

I hate you all! :slight_smile: OK, I will make a fork that works in the manner mentioned, and hopefully someone might take my nasty MS hacks and turn it into something Pythonic.

Currently the API and everything else have been designed downwards, whilst we have a serial audio feed that goes up, and that is just wrong.
It is just wrong to have any system protocol dictate to an HMI mic, as any change will cause embedded firmware to stop working, however unlikely change may be.
But wow, if that chunked-WAV MQTT method stands the test of time, you can send me hats and I will eat the lot.

An ‘awkward work-around’ is not an advance; it is just tying us further to some extremely bad spaghetti.

That is a good one to try indeed!
I am planning on testing with 1 esp32 and 1 “standard” with a Pi running Rhasspy as a sat.

With the previous version I had WakeNet working, but I am now hoping for the Porcupine library.

Incorrect, there is no good library for it (or I cannot find one). WakeNet was working fine, but it only had Alexa or some Chinese wake words, and development seems to have stopped.
Porcupine now has a lib for microcontrollers, but not for the ESP32 (yet).

Tried it, and indeed you are correct. It seems Rhasspy does not handle wake word detection on more than one incoming MQTT audio stream.
I am going to try and fix it.


We actually have posts from atomic (ESP32 Alexa) with TensorFlow ML working; it is just his dataset and classification that are a bit whack and lower accuracy.

The ESP32 has had KWS since Skainet and others, which is at least two years old.

You just take a tflite model and convert as in the above.

Also, I am not sure, as I will have to hunt it down, but I think Google have done a microcontroller frontend for MFCC.
So you just have to point audio chunks at the frontend.

If you use Porcupine you are once again locked into black-box models, even if it was me who made that request.

Google have published CNN models with reasonable latency https://github.com/google-research/google-research/blob/master/kws_streaming/experiments/kws_experiments_paper_12_labels.md#cnn

But still, the dataset could be better, which I am currently working on (following the audio stream first), but the current Google KWS can be created in their framework.
xxd -i converted_model.tflite > model_data.cc
Then model_data.cc is copied into the ESP32 build as the model binary.
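For completeness, the step that produces converted_model.tflite is roughly the standard TFLite conversion; here is a sketch assuming a SavedModel exported from the kws_streaming framework (paths are placeholders):

```python
# Hypothetical sketch of the step that produces converted_model.tflite,
# assuming a trained SavedModel from the kws_streaming framework.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("kws_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional quantization
tflite_model = converter.convert()

with open("converted_model.tflite", "wb") as f:
    f.write(tflite_model)
# Then: xxd -i converted_model.tflite > model_data.cc
```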

The other models benchmark with the /data/local/tmp/benchmark_model_plus_flex command as opposed to /data/local/tmp/benchmark_model, so I know the ones that use the plain benchmark_model are just TensorFlow Lite and convert via xxd.
I haven't played with it, but the MFCC frontend might be included so that all you have to do is forward audio. I am not exactly sure, as I haven't really looked at the ESP32 implementation, but whilst playing with the better AArch64 models, where I can delegate out layers not in TFLite, I have noticed specific code for the ESP32 TFLite model.
The plus_flex version is the Flex delegate, which allows nodes to be delegated out of the TFLite runtime and run on full TF for those 'layers'. The CNNs in the Google repo don't use that, so you can guarantee they are running 100% on TFLite and should convert without a hitch.
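A quick way to confirm a model really needs no Flex delegate is to restrict the converter to built-in TFLite ops, so the conversion fails if anything would have to fall back to full TF; a sketch, with a placeholder path:

```python
# Hypothetical check that a model needs no Flex delegate: restrict the
# converter to built-in TFLite ops so conversion fails loudly if any op
# would have to fall back to full TensorFlow.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("kws_saved_model")
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]
tflite_model = converter.convert()  # raises if a Flex/Select-TF op is required
```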

I cannot remember if Atomic employed a rolling window so that at least one frame gets a full inference, but the Google streaming KWS repo even provides a full streaming CNN for TFLite.
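A rolling window in this sense is just a ring buffer with a hop, so that at least one window contains the whole keyword; a rough sketch, with illustrative sizes and a user-supplied classify() standing in for the model and feature frontend:

```python
# Rough sketch of a rolling inference window: keep the last second of audio
# and re-run a (non-streaming) classifier every hop. Sizes are illustrative;
# classify() stands in for the actual model + feature frontend.
import collections
import numpy as np

WINDOW = 16000   # 1 s of 16 kHz samples
HOP = 3200       # re-run inference every 200 ms of new audio

ring = collections.deque(maxlen=WINDOW)
_samples_since_last = 0

def on_audio_chunk(chunk, classify):
    """Feed int16 samples; returns a keyword score once per HOP, else None."""
    global _samples_since_last
    ring.extend(chunk)
    _samples_since_last += len(chunk)
    if len(ring) == WINDOW and _samples_since_last >= HOP:
        _samples_since_last = 0
        return classify(np.array(ring, dtype=np.int16))
    return None
```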

Also, AMR-WB is available in the ESP32 IDF and reduces bandwidth considerably.
My boxed bribe of goodies to Atomic failed though :slight_smile:

Also, Atomic uses a spectrogram, which has more parameters for slightly less accuracy than MFCC.
In fact, really all he did was copy the audio example from Simple audio recognition: Recognizing keywords | TensorFlow Core and then jump to the ML tutorial for xxd.

There are far better published models and methods available.

The Google command set is the biggest curveball, as it is a benchmark dataset, not a working dataset, and it deliberately contains up to 10% bad samples, as otherwise all state-of-the-art KWS would be producing 100% validation results when presented with clean datasets.
The google-kws-streaming framework does a pretty good job of creating and mixing in background_noise and silence, but the arbitrary range does not take into account its volume in the mix with KW & !KW, and each mix where the background_noise is the louder adds a bad data item to the dataset.
So in the basic framework the volume of background_noise that can be added is limited, as increasing it just increases the chance of bad data (noise volume higher than the foreground).
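Here is a sketch of the constraint being described, i.e. never letting the mixed-in noise become louder than the foreground keyword (an illustration, not the framework's own mixing code):

```python
# Hypothetical sketch of mixing background noise under a keyword sample so
# that the foreground always stays the louder of the two.
import numpy as np

def mix_with_noise(kw: np.ndarray, noise: np.ndarray, max_noise_gain: float = 0.5) -> np.ndarray:
    """Mix float32 audio in [-1, 1]; noise is scaled so its RMS never exceeds
    max_noise_gain times the RMS of the keyword sample."""
    noise = noise[: len(kw)]
    kw_rms = np.sqrt(np.mean(kw ** 2)) + 1e-9
    noise_rms = np.sqrt(np.mean(noise ** 2)) + 1e-9
    gain = min(max_noise_gain * kw_rms / noise_rms, 1.0)
    mixed = kw + gain * noise
    return np.clip(mixed, -1.0, 1.0)
```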

The great thing about the google-kws-streaming framework is that you can use the automatic dataset creation parameters, or supply a folder structure and do your own.
You will still get great results with the automatic dataset because of its structure: 'silence', 'kw' & 'notkw' are badly labelled, as really they are !kw-nonvoice, kw-voice and !kw-voice, and !kw has been partitioned into two classifications, which lowers the cross entropy with kw-voice.
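An illustrative custom folder layout for that three-way partition (folder and label names are made up, not the framework's own):

```python
# Hypothetical three-way label layout for a custom dataset, following the
# partitioning described above. Folder and label names are illustrative only.
DATASET_LAYOUT = {
    "kw_voice":       "data/kw_voice/",        # keyword utterances from the target voices
    "notkw_voice":    "data/notkw_voice/",     # clear speech that is not the keyword
    "notkw_nonvoice": "data/notkw_nonvoice/",  # background noise, silence, non-speech
}
```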

Also, the streaming interface returns an inference envelope that is far more accurate and tolerant to false activations than single-threshold inference.
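One simple way to read an envelope rather than a single-frame threshold is to smooth the streaming scores and require the smoothed value to stay high for several consecutive hops; a sketch with illustrative parameters:

```python
# Hypothetical sketch of using an "envelope" of streaming scores instead of a
# single-frame threshold: require the smoothed score to stay high for several
# consecutive hops before declaring a detection. Parameters are illustrative.
from collections import deque

SMOOTH = 5          # hops to average over
HOLD = 3            # consecutive smoothed hits required
THRESHOLD = 0.8

scores = deque(maxlen=SMOOTH)
hits = 0

def on_score(score: float) -> bool:
    """Feed one streaming keyword probability; return True on a confirmed detection."""
    global hits
    scores.append(score)
    smoothed = sum(scores) / len(scores)
    hits = hits + 1 if smoothed > THRESHOLD else 0
    return hits >= HOLD
```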

The Google kws_streaming repo is google-research doing a research repo to help with the direction of TFLite, Flex delegates and TF-M, as part of the push for mobile and edge KWS with TensorFlow.
It is open source and state of the art, and after using it the only downside is the limit on the background noise level; a dataset builder is easy to accomplish, but the 'auto' dataset is already good.

A model is purely a set of classification filters, and if you pour in too much variation it just garners cross entropy. Partitioning noise into types increases accuracy as it lowers cross entropy, and maybe it even needs some further partitioning, but that also adds the need for more data and a bigger model, and google-research being google-research, it is likely already optimal.
Keep voice out of !kw-nonvoice as much as you can, and only supply clear voice samples to !kw-voice, as it will be mixed with background_noise.

If you want the most accurate KWS, then only add the voices of actual users to KW & !KW-voice, and do not add universal datasets such as the Google command set, which is full of varied voice types the unit will never hear and just increases cross entropy pointlessly.

Microfrontend

The noise reduction and AGC modules are also interesting.


I do not think this topic should be hijacked by that discussion, so I will not reply to your post.
Sorry about that.

I have made a fix but it is not ready yet.

The problem is that there is one wakeword engine running, but the audio buffers are coming in from multiple streams and are therefore mixed.
Wakeword detection fails on that.

I have changed this to have a wakeword engine and audiobuffer per site_id.
This works, but I have some cleanup to do :slight_smile:
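A rough sketch of the idea (not the actual Rhasspy code, and the detector API is hypothetical): keep one detector and one buffer per siteId rather than mixing all incoming frames:

```python
# Rough illustrative sketch (not the actual Rhasspy code) of the fix described
# above: one wakeword detector and one audio buffer per siteId instead of
# mixing every satellite's frames into a single buffer. Detector API is hypothetical.
import collections

class PerSiteWakeWord:
    def __init__(self, make_detector):
        self._make_detector = make_detector                  # factory for a wakeword engine
        self._detectors = {}                                 # site_id -> detector
        self._buffers = collections.defaultdict(bytearray)   # site_id -> audio buffer

    def handle_audio_frame(self, site_id: str, frame: bytes):
        """Called for each hermes/audioServer/<siteId>/audioFrame message."""
        if site_id not in self._detectors:
            self._detectors[site_id] = self._make_detector()
        self._buffers[site_id].extend(frame)
        if self._detectors[site_id].process(self._buffers[site_id]):
            self._buffers[site_id].clear()
            return site_id        # report which site the hotword came from
        return None
```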


Good evening,

I can confirm that with the preview Rhasspy version 2.5.11 and ESP32-Rhasspy-Satellite v7.6.1 I can run multiple Matrix Voice satellites at the same time, and they respond. Great! After the final version is released, my project can go into the next phase. :partying_face: Thanks a lot to all of you.