State of Microphones?

Talking about training: you will need far more than 200 samples to get really accurate, as more is simply better with models, especially samples from the hardware of use.
If you want you can have a go with https://github.com/StuartIanNaylor/Dataset-builder which was just to prove that, if put behind a web interface, training models could be really easy, starting with as few as 20 KW (keyword) and !KW (not-keyword) samples.

It has a record.py, split.py and mix.py to create large datasets by automatically augmenting a few provided samples. It's not as good as a large set of usage samples, but augmenting a few into many is a good enough second best.
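Roughly, the mix step boils down to padding/shifting a clean KW sample and mixing in background noise at random SNRs. A minimal sketch of the idea (not the actual mix.py; the kw/, noise/ and dataset/ folder names are just assumptions):

```python
import glob
import random

import numpy as np
import soundfile as sf  # pip install soundfile

def augment(kw_path, noise_path, out_path, snr_db=10.0):
    """Mix a mono keyword sample with background noise at a given SNR."""
    kw, sr = sf.read(kw_path, dtype="float32")
    noise, _ = sf.read(noise_path, dtype="float32")
    # Loop the noise if it is shorter than the keyword, then random-crop it.
    if len(noise) < len(kw):
        noise = np.tile(noise, int(np.ceil(len(kw) / len(noise))))
    start = random.randint(0, len(noise) - len(kw))
    noise = noise[start:start + len(kw)]
    # Scale the noise so the mix lands at the requested signal-to-noise ratio.
    kw_rms = np.sqrt(np.mean(kw ** 2)) + 1e-9
    noise_rms = np.sqrt(np.mean(noise ** 2)) + 1e-9
    noise *= (kw_rms / noise_rms) / (10 ** (snr_db / 20))
    sf.write(out_path, kw + noise, sr)

# Turn 20-odd recordings into a few hundred augmented variants.
for i, kw in enumerate(glob.glob("kw/*.wav")):
    for j in range(10):
        augment(kw, random.choice(glob.glob("noise/*.wav")),
                f"dataset/kw_{i}_{j}.wav", snr_db=random.uniform(0, 20))
```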

Then you can train the model with https://github.com/StuartIanNaylor/g-kws which is really just the state-of-the-art Google stuff, with a bit of a write-up from me on how to install and get going.

You can create a model on a desktop with or without a GPU and ship out tflite models to run on a Pi3A+. A single model runs at about 20% of a single core on 64-bit Raspberry Pi OS; it's only the training where any extra muscle helps.
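On the Pi side the exported model is just the standard tflite_runtime loop. A rough sketch (the kws.tflite file name, label set and 1-second 16 kHz non-streaming input are assumptions that depend on how you trained):

```python
import numpy as np
import sounddevice as sd  # pip install sounddevice
from tflite_runtime.interpreter import Interpreter  # pip install tflite-runtime

interpreter = Interpreter(model_path="kws.tflite")  # hypothetical file name
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

labels = ["silence", "unknown", "bed_raise_up"]  # hypothetical label set

while True:
    # Grab one second of 16 kHz mono audio and run it through the model.
    audio = sd.rec(16000, samplerate=16000, channels=1, dtype="float32")
    sd.wait()
    interpreter.set_tensor(inp["index"], audio.reshape(inp["shape"]))
    interpreter.invoke()
    scores = interpreter.get_tensor(out["index"])[0]
    if scores.max() > 0.9:
        print("hit:", labels[int(scores.argmax())], scores.max())
```

A streaming-mode g-kws export would feed smaller hops instead of whole seconds, but the Interpreter calls are the same.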

It's not the application, as this is all about voice: what voice commands do you want to use? List them, or the approximate quantity and action of each command, as that will make things much clearer.

I think the kernel is limited to 8 capture devices on Raspberry Pi OS, but with cheap USB soundcards and unidirectional lavalier mics, at approx 20% usage of a core you can have multiple inputs and multiple KWS instances and use the channel with the best KW hit. You only need to train the model once, as they are purely directional instances of the same model.
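The best-hit selection is just one KWS instance per capture device and an argmax over the scores. A toy sketch of the idea (score_keyword() is a hypothetical stand-in for whatever KWS you run, and the device names depend on your system):

```python
import numpy as np
import sounddevice as sd  # pip install sounddevice

DEVICES = ["hw:1,0", "hw:2,0"]  # two cheap USB cards, one mic each

def score_keyword(audio: np.ndarray) -> float:
    """Hypothetical stand-in for your KWS model's confidence output."""
    raise NotImplementedError

def record(device: str, seconds: float = 1.0, sr: int = 16000) -> np.ndarray:
    audio = sd.rec(int(seconds * sr), samplerate=sr, channels=1,
                   dtype="float32", device=device)
    sd.wait()
    return audio[:, 0]

# Score the same window on every channel and keep the strongest hit.
# (A real setup would capture the channels concurrently, e.g. with threads.)
chunks = [record(d) for d in DEVICES]
scores = [score_keyword(c) for c in chunks]
best = int(np.argmax(scores))
print(f"best channel: {DEVICES[best]} (score {scores[best]:.2f})")
```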

Likely for a bed, a stereo pair each covering a side (or more) would work well, with what are relatively tiny lavalier microphones.

There is a shop on eBay with a large range https://www.ebay.co.uk/str/micoutlet/Microphones/_i.html?_storecat=1039746619 but likely shop around and you can get them for less than $5.

The unit can go in or under the bed, as you just need to position the mics, which are tiny.

The main thing with a KW is that the more unique and phonetically complex it is, the easier and more accurately you will be able to detect it. So 'Raise up' is better than 2 KWs of 'Raise' & 'Up', 'Bed raise up' is better than 'Raise up', and so on.

There are loads of ways to do this, but many USB mics have software that doesn't run on Linux, especially high-quality ones such as the Blue Yeti.

The quality varies drastically, but often, purely due to not knowing any better, the ALSA parameters and mic settings are just plain bad to start with and set at low volume.

Beamforming and unidirectional are bad expressions, as no mic picks up a beam; they just have a narrower pickup pattern than all directions.

http://www.learningaboutelectronics.com/Articles/What-are-cardioid-microphones# even the 'beamformers' are just directional cardioids when in use.

Get any mic, the cheaper the better, and start training with a framework, as you will soon start to learn the pitfalls and how much noise can affect accuracy.
Whatever you decide now, before you even try anything, is likely going to change.

Get a cheap soundcard and a cheap omni- or unidirectional mic from eBay or AliExpress just to test some frameworks; it will be no worse than the 2-mic hats, which for me have extremely bad recording profiles.

Also, you never said: is this a universal model for many users or custom for a user of choice? Universal models need many voice actors, and even though we have many ASR datasets, word datasets don't really exist, which is why ASR is often used for what really is KW capture.

Very interesting. I didn't mean a USB microphone, I meant streaming the processed audio chunks from a ReSpeaker Core via USB to another device, such as a RPi. Naturally I recognize that (almost) all of the smart home products on the market communicate over Wi-Fi, and do not themselves do speech recognition. My motivation for not depending on a Wi-Fi LAN is simply that the main use case for the device I'm working on is for individuals with a disability (namely myself), so I would like to depend only on having power to the speech recognition device and the bed itself, which can both be accomplished by a UPS in the case of a power outage.

At this point, the argument could be made that the primary method of control would be a local smart assistant, which could be set up the way you describe: ReSpeaker Core streaming over Wi-Fi to a separate server which then streams commands to the individual functional devices. This method would allow for much richer and more accurate recognition as well as being able to control a variety of other devices throughout the home besides just the bed. That approach has value to me.

The secondary method of control would be a simpler and less accurate speech recognition on the bed control device itself. This could be used whenever the primary method was unavailable (in cases with no power or loss of Wi-Fi connection).

This is really in response to both @sskorol and @rolyan_trauts. To answer @rolyan_trauts' question about specific speakers versus universal speakers, I would much rather support universal speakers.

Both of you seem to be either professionals in this field, or at least expert hobbyists, and I really appreciate your help so far. I myself am a professional graphics programmer for video games, so although I have software development experience there is much more that I don't know about this field than I know.

What I'm gathering from our discussion so far is that I should probably focus on the secondary method of control (because for the time being the primary method can be satisfied using off-the-shelf products until I have time to implement a local solution). Furthermore, the secondary method of control can probably consist of a relatively simple and inexpensive two microphone hat, like the IQAudio one mentioned previously, for example, on top of a RPi. At that point, I would just need to settle on some sort of ASR + NLU system, I'm guessing? I would eventually like to have something that supports a custom wakeword and accepts commands given in a natural language format, like "raise the back of the bed up". However, I have no interest in supporting a wake word chosen by the user at this time. Hopefully that makes sense and you guys can steer me in the right direction.

I forgot to mention, I do currently have a working prototype using Rhasspy -> Node-RED -> my custom control software.

If you want universal speakers then go the ASR route, which collects spoken phonetics and, via a dictionary base and some clever context, can predict a word or sentence.
They are often prebuilt on universal models and you are sort of ready to go, but they can be susceptible to noise, as often they are recorded 'clean'.

If you were going to train a custom KWS then you would not have to have a single KW; your simple command sentences would be a collection of KWs, and likely have better noise resilience and accuracy, as you can record on the device of use.
Each voice actor would have to be recorded though.

The IQaudio has only a single omnidirectional onboard MEMS mic, as without DSP beamforming it is really pointless to have multiple mics providing a single stream: apart from the speed-of-sound delay across the distance between them, the input will be identical but out of phase, which is why they should not be summed, and why it's a bit pointless having more than one unless you do have some DSP algs.
It has a stereo 3.5mm mic jack and stereo aux-in; like most of the hats it steals your GPIO and doesn't provide a pass-through, though it does have pads for an SMD header or soldering direct.
Actually the 3.5mm might be mono, with 'stereo' being the other channel from the onboard mic; I just have never tested.

Oh, I would like GPIO access, at least via a stacking header. I'm wondering why they sell (or people buy) double microphone or microphone arrays without the various algorithms running to clean up the recordings. I mean, obviously I did because I didn't know any better.

Unfortunately, because they sell, Respeaker has a whole range of relatively useless products that sell quite well.
The 'maker' market got profitable and it got lubricated with snake oil.

I suppose someone somewhere was hoping that the algs might be released, but I think it's unlikely, as you need either loads of clock speed or an RTOS where extremely close timings can be guaranteed. The Pi is lacking both.
We would have seen at least one manufacturer boast and release software by now, as it's been a long-time gripe, but we haven't, and that sort of speaks volumes.

But it goes back to fixed hardware platforms where everything is custom trained for, and the relatively false idea that you can just cheaply DIY odds and sods and compete with the likes of Google or Amazon.

There are niches, but with mics they are much the same, and often the advantages of others are not really worth the price difference.
Like I say, it's the catch-22 of the bring-your-own open approach to hardware and also a collection of software.

I was browsing @sskorol's GitHub and noticed he does https://github.com/sskorol/matrix-voice-esp32-ws-streamer

Which might be better than some others, as quite a few on here know I have an utter hatred of raw wavs over MQTT, but even broadcasting raw wavs seems crazy to me, seeing how 20 years ago the iPod made a codec of some type sort of mandatory.
ESP32s or other cheap low-cost mics can be distributed, and you can select the best stream from an array of distributed mics that broadcast from KW hit to silence or kick.

Maybe he might be interested in getting a CNN running on the ESP32 and using AMR-WB as a codec, as g-kws has a CNN ready with an MFCC front end, with goodies to run TF4MC (TensorFlow Lite for Microcontrollers):

Audio "frontend" TensorFlow operations for feature generation

There is the ESP32 Alexa that Atomic made, but he used the spectrogram tutorial and also the benchmark Google Speech Commands dataset, which is deliberately hard, with the expectation that nothing will get 100%, as it is a benchmark KWS dataset; otherwise it's the most lousy, trashy piece of work Google have ever released :slight_smile: .

I actually think MFCC is the killer codec, as for an ASR it's lossless since it's what it uses anyway; the 16:1 compression can also run through gzip and be absolutely tiny, in a similar way to what Google is boasting about with Lyra.
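A rough back-of-envelope of that ratio (a sketch; the 16 kHz test tone, frame settings and crude int8 quantisation are my assumptions, not any spec):

```python
import gzip

import numpy as np
import librosa  # pip install librosa

# One second of 16 kHz int16 PCM = 32000 bytes on the wire.
sr = 16000
audio = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

# 13 MFCCs per 10 ms hop ~= 100 frames x 13 coefficients per second.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=160)

# Crude int8 quantisation, then gzip the lot.
q = np.clip(mfcc / np.abs(mfcc).max() * 127, -127, 127).astype(np.int8)
packed = gzip.compress(q.tobytes())

print(f"raw pcm : {sr * 2} bytes/s")
print(f"mfcc i8 : {q.size} bytes/s")
print(f"gzipped : {len(packed)} bytes/s")
```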

Depending on how comfortable you feel training your own language model, you can also look at doing it all in Node-RED.
I developed and co-developed a few speech-control-related nodes for Node-RED.
I have had really great experiences using DeepSpeech recently in my setup. With a domain-specific language model/scorer it's fast enough to do real-time or faster streaming ASR on a Pi 4.
The good part about DeepSpeech is that it's a lot easier to start training your own language models and add new vocabulary to combine them into a scorer than it is to add vocabulary and train models for Kaldi ASR (Vosk).
You can have a look at this collection of voice-related Node-RED nodes here:

There are also things like a JSGF permutator I wrote (JSGF being the grammar format that is also used as the base for Rhasspy rules), which can be used to quickly create a text corpus for language model training. This can also be used to create a tagged corpus to do very basic fuzzy intent recognition.
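The idea of such a permutator in a few lines (a toy sketch of expanding alternatives/optionals into a corpus, not the actual node; the tiny bed grammar is made up):

```python
from itertools import product

# A tiny JSGF-like rule: each part is a list of alternatives, "" = optional.
RULE = [
    ["raise", "lower"],
    ["the"],
    ["head", "foot", "back"],
    ["of the bed", ""],
    ["up", "down", ""],
]

corpus = sorted(
    " ".join(word for word in combo if word)
    for combo in product(*RULE)
)
for sentence in corpus:
    print(sentence)
# 2 * 1 * 3 * 2 * 3 = 36 sentences for language model training from one rule.
```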

On the microphone side I fully switched to using MAX9814 electret microphone breakouts connected to a USB soundcard, inspired by @rolyan_trauts, as I found them much better than the 2-mic Seeed card or the IQaudio hat. I actually get quite decent range and detection with this setup.

Johannes

Did anyone have a look at SpeechBrain? I went on about it but never did give it a try; I think it tries to make Kaldi easier. It's all very new and I need to have a look some time.

Yeah, I am not really a fan of the IQaudio hat either; as opposed to the Respeaker it's just more flexible with the 3.5mm and aux-in, but as it comes with the onboard omnidirectional MEMS it's just 2x the price of the Respeaker 2-mic.

I think most of it is those cardioid electrets, as they do have reasonable sensitivity and SNR, though boy, it was many I tested. Definitely use a preamp, and the MAX9814, especially the one with its own regulator, seems the best.

Part of the problem, and why I haven't tested the IQaudio hat that much, is the absolute overkill and complexity of their alsamixer config, which is super complex and would seem to be undocumented.

The lavalier into a soundcard is just the non-DIY way to get the advantages of a cardioid: it doesn't cancel background noise, it just picks up better from the front, and that is really useful, as such audio equipment has been in use for decades.

From the PS3 Eye to Respeaker, I have had to repeat time and time again that the secret sauce is the DSP; otherwise you are purchasing PCB-mounted mics that are omnidirectional, often take all the GPIO, and are also near impossible to acoustically isolate.

I presume, though, because @JGKK is custom training with the hardware of use, that the accuracy is actually quite good; he likely doesn't have complete datasets from the hardware of use like the big guys do, but can get very good results.

If I use the term bemused I am sure for some it will bring a grin, but yeah, I find it completely bemusing that a voice application seems to shy away from initial audio processing. Maybe it is just that complex, or there are conflicting interests and an overestimation of the resolution and results DSP beamforming and algs can produce.

I can actually beamform with a stereo soundcard and 2x angled cardioids and use the threshold hit of a KWS to select my stream, but the project doesn't have any method to select the best KW hit and just uses the first in.
USB soundcards are really handy as they don't steal your GPIO and you are not limited to one; with 2x stereo soundcards I can run a single beamformer on a single core of a Pi3-A+, giving omnidirectional coverage with 4x 90° cardioid beams, as I have a KWS that can run in less than 20% single-core load, or at least Google do.

I am just an oddball though, as again and again there is the request for omnidirectional beamforming, and when I look around I never see the need apart from the adverts showing how great it is as a conference mic; in use it never seems to be central, and is often on a table or shelf, plugged in somewhere close to a wall.

Yes, a lot of your speculation about the proliferation of these devices does sound plausible to me. Unfortunately, much of the rest of your message goes completely over my head. :slight_smile: At this point I'm not seeing a particular reason to stick with the Matrix product line, as it seems to be in the very least in hibernation and by @sskorol's analysis inferior to the ReSpeaker Core, anyway.

Thank you for your input. I was actually looking at moving away from using Node-RED. It was expedient to set up my test case, but felt a bit unnecessary in the long run considering it is ultimately driving an executable written in C++.

I honestly have no idea about training my own language model. I'm not against trying it, I just don't really know where to begin.

Thank you for suggesting the microphones. Which USB soundcard did you choose?

Yeah, I don't have a need for beamforming, as far as I know. Noise suppression and accurate recognition from a moderate distance are much more usable to me.

Noise suppression, probably not, as software NS can leave artefacts. It sounds like you're going to use a universal, non-custom-trained model, which will have been trained without NS, so probably just go without.

You can try a version of NS that comes with SpeexDSP; for some reason the libasound2-plugins package is lagging behind in revision on Debian, but you can do an update here.
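For what it's worth, the Speex denoise hooks in as an ALSA PCM plugin; something roughly like this in ~/.asoundrc (a sketch from memory, so check your alsa-plugins version's pcm_speex docs for the exact option names; the hw device is an assumption):

```
# ~/.asoundrc -- route capture through the speex plugin from libasound2-plugins
pcm.denoised {
    type speex
    slave.pcm "plughw:1,0"   # your USB soundcard
    denoise yes              # speex noise suppression
    agc yes                  # automatic gain control
}
```

Then point your framework at the `denoised` device instead of the raw hardware.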

Sounds like you have what you need with your current setup, as after you do go round the houses you tend to come back to 'any mic can do', or at least have to.

Honestly I still feel a bit like my head is underwater. I've gotten tons of great information from several people, but with my lack of experience it's hard to sift through it all and act on it. Previously you mentioned a preference for a unidirectional microphone connected to a USB soundcard. Can you elaborate on specific microphones and soundcards? What kind of software are you using in your setups?

Just plain Rhasspy, but with a cheapo USB card and unidirectional mic.

I use a BOYA BY-MM1 for testing and stuff, as I also use it as a desktop mic; it's on a little mini camera tripod so it's just handy.

Like all China products they can be a bit hit and miss, as sometimes you never know what is inside; if it's an Intel chipset you are out of luck, as they are pretty bad.

The white ones seem to be relatively consistently ID 1b3f:2008 Generalplus Technology Inc.
The black ones, which build-wise are a bit better quality, seem to more often have the bad Intel chipset than the above.

You always know what you're getting with these, as it's a CM108 and they can not hide that, as it's not in a case.

For no-solder jobs, start with cheap but cardioid (unidirectional) mics; cardioid means heart-shaped, and the bottom of the heart is the front of the pickup pattern.
So the very cheap ones are not that sensitive; it works backwards, as a theoretical no-loss mic has a sensitivity of 0 dB.

-52dB ± 2dB, 3.5mm TRS (tip ring sleeve) 3-pole jack plug (phones are 4-pole and the contacts are in a different place; you can get convertors, but hey).

The sensitivity on this one looks really great, but it's a TRRS 4-pole that is for a phone, as an example.

But you can get adapters; make sure it says 'for microphone', as the tip & ring are stereo out.

It does get quite confusing, as they are often labelled wrong; often ignore the description and go off what you are looking at.
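Back on those sensitivity figures, a quick worked conversion shows what the dB numbers mean in output voltage (my assumption being that listings like the -52dB one above are dBV/Pa, the usual electret convention):

```python
# Mic sensitivity in dBV/Pa is relative to 1 V output for 1 Pa (94 dB SPL),
# so 0 dB would be the theoretical no-loss mic and more negative = quieter.
for sens_db in (-52, -42, -30):
    mv_per_pa = 10 ** (sens_db / 20) * 1000
    print(f"{sens_db} dBV/Pa -> {mv_per_pa:.1f} mV/Pa")
# -52 dB -> ~2.5 mV/Pa, which is why cheap electrets want a preamp far field.
```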

The Boya I use:

Or what looks the same unbranded; they are half mics, bigger than lavaliers, where the mount is the camera type, which works well for me as I have a mini desk tripod.

Or you can go full DIY and get a preamp

Make a 3.5mm lead and connect it to a USB soundcard

Electrets you can buy from me, as it will save you buying x25; for price and sensitivity + SNR they seem to be the best. You can try to see if you can get them elsewhere, but it seems quite hard work, which is why I bunged a few up on eBay.

It's just 2 wires to the preamp board and no more components.

Thank you, this gives me a bunch of stuff to try. I will hopefully report back after I've had a chance to try some of this stuff.


The white ID 1b3f:2008 Generalplus Technology Inc. ones seem to have the best hardware AGC and gain of any, and I should say they are worth a punt, as even if you get a wrong one they are very cheap and it's not a major dent to source elsewhere.

Should say I often scour eBay and AliExpress, but PiHut should guarantee it's the correct type.

Yeah, I think I located the same one over at Adafruit here in the States.


If lsusb reports 1b3f:2008 Generalplus Technology, that is the one. They are not the quietest, but the range of the AGC is really good and negates the need for a preamp.
The CM108 probably gives a cleaner signal, but the lower levels make a preamp module preferential for far field; it does add an extra step, but allows you to get more gain a bit more cleanly.

If you want a cheap hub then my 1st buy worked out really well, but stay clear of these.

I thought, hey, that is just like the 1st one but also with a header array of all the USB pins, which is handy, but for some reason they seem to disconnect and freeze on the Pi3A+ I have tested on.

Really cheap though, and they seem to work well; that was my 1st buy of a cheap 'board' hub.
So blue seems OK and purple maybe stay clear, but like all these modules they do seem to vary and maybe I was just unlucky.


Yeah, I plan to check when they arrive.

PS I gave you a bum steer, as I am not a fan of the Respeaker USB so my memory is foggy at best.
It's not a beamformer that has to be applied by software; it's just AEC + AGC.

No worries. I just received the USB soundcard and electret microphone breakout board, but haven't had a chance to test either of them.
