State of Microphones?

Several years ago, when I first started on my speech recognition project, I was just using a Yeti USB microphone. Then the Matrix Voice came along, which looked like exactly what I wanted, so I got one and have been using it for a couple of years now.

But recently, it seems like the community/the project behind it may be dead or dying. So I’m wondering whether it might be time for a change. I have taken a look around, and I have seen that a few people are using Matrix microphones like me, several people are using ReSpeaker microphones, and some people are using USB microphones.

I was hoping to get opinions from people. Of the products that are available now, what do you think are the best ones and why? I would like to have something that can handle recognition from a distance and in the presence of noise, similar to what the big brand assistants can (Amazon Echo, Google Home, etc.). Having LEDs is nice, but not the most important thing to me.

The Yeti is OK if you're using Windows or macOS, but a lot of the goodies and software are missing on Linux.

There is a whole raft of Matrix-like microphones with these arrays that are a worthless duplication of mics unless you move up to the higher-priced USB models with built-in DSP.

One of the biggest problems is noise, and array microphones such as the ReSpeaker with no beamforming algorithms are just that: duplicated omnidirectional mics, so noise from all angles is treated equally.
Unidirectional mics have various polar patterns (cardioid, supercardioid), and it's all about how much they reject from the rear and sides.

If you have no algorithms to create a polar pattern, then get a mic with a polar pattern, as without one they are highly susceptible to noise.
Even mics and packages that come with AEC only cancel what is being played through that capture card / mic.
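To put rough numbers on that rejection: first-order directional mics follow the polar equation gain(θ) = a + (1 − a)·cos θ, where a = 1 is omni, a = 0.5 is cardioid, and a ≈ 0.37 is supercardioid (standard textbook coefficients). A quick sketch:

```python
import math

def pattern_gain(a, theta_deg):
    """First-order polar pattern: gain = a + (1 - a) * cos(theta).
    a = 1.0 -> omni, 0.5 -> cardioid, ~0.37 -> supercardioid."""
    return a + (1 - a) * math.cos(math.radians(theta_deg))

for name, a in [("omni", 1.0), ("cardioid", 0.5), ("supercardioid", 0.37)]:
    side = abs(pattern_gain(a, 90))   # pickup from the side
    rear = abs(pattern_gain(a, 180))  # pickup from directly behind
    print(f"{name:14s} side={side:.2f} rear={rear:.2f}")
```

An omni picks up at full gain from every direction, a cardioid nulls the rear entirely, and a supercardioid trades a small rear lobe for tighter side rejection.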

Budget, no DIY: probably a cheap sound card and a shotgun mic meant for pc/camera/phone. On Linux it will be better than your Yeti, as the mic has a polar pattern and doesn't depend on Linux software that doesn't exist.

Budget and a bit of DIY: solder a jumper wire from a unidirectional electret to a couple-of-dollars mic preamp module and a sound card.

There are some really good mics out there that can hear a mouse fart in Africa from your front room, but add the noise of a HiFi and TV and voice AI recognition goes to hell.

There are the likes of the BOYA BY-MM1 & BOYA BY-MM1+, and if you do some shopping around there are some very similar Chinese imports with the same specs for just over $10. If all else fails, those can be used with a camera / phone or as a desktop broadcast mic, as their 'shotgun' nature projects forward and rejects noise from the sides and rear.
The BOYA BY-MM1 is cardioid & the BOYA BY-MM1+ is supercardioid, which is a polar pattern that just rejects more from the sides and rear.

Sensitivity isn't really a big deal: -36 dB is good, but -46 dB is much the same, as you're likely to need some form of AGC. Sound quickly diminishes with distance, and you can only set a mic gain so high that it doesn't clip up close; without AGC, the far-field audio amplitude will be very poor.
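As a sanity check on that AGC point, here is a minimal and deliberately naive software AGC sketch; the target level, block size and attack constant are arbitrary illustration values, not anything from a real driver:

```python
import math

def agc(samples, target_rms=0.1, max_gain=50.0, attack=0.1):
    """Very simple AGC: slew the gain so each block's RMS approaches
    target_rms, capped at max_gain so silence isn't boosted into pure noise."""
    gain, out, block = 1.0, [], 256
    for i in range(0, len(samples), block):
        chunk = samples[i:i + block]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk)) or 1e-9
        desired = min(target_rms / rms, max_gain)
        gain += attack * (desired - gain)          # smooth gain changes
        out.extend(max(-1.0, min(1.0, s * gain)) for s in chunk)
    return out

# A quiet "far-field" tone gets boosted toward the target level
quiet = [0.01 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(4096)]
louder = agc(quiet)
```

The catch the post describes is visible in the `max_gain` cap: turn it up to hear far-field speech and the same gain amplifies the background noise too.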

I am hoping Rhasspy will support multiple KWS instances and use the best KW hit, as 2x unidirectional mics can point outward at an angle and give both polar noise rejection and wide area coverage. The only advantage is via placement, so that depending on where you are, noise=rear & voice=front, and it is just an incremental improvement over a single mic.
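The "best KW hit" selection itself is trivial logic; a hypothetical sketch (mic names and the threshold are invented for illustration):

```python
def best_kw_hit(hits, threshold=0.5):
    """Given (mic_id, confidence) keyword hits from multiple KWS instances
    listening on separate mics, keep the single best hit above threshold."""
    accepted = [h for h in hits if h[1] >= threshold]
    return max(accepted, key=lambda h: h[1], default=None)

# e.g. rear-facing mic hears mostly noise, front-facing mic hears the voice
print(best_kw_hit([("mic_left", 0.42), ("mic_right", 0.91)]))
```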

The commercial products such as Google Home & Alexa have a clear advantage as the hardware is fixed. If we could hear what they receive we would probably think it is pretty crappy, but they attenuate noise and beamform enough that custom models are made specifically for the hardware platform; the hardware profile is baked into the model, so it works more accurately than many of us can achieve.

I just recorded some samples to highlight the catch-22 of a good microphone: even a cheap sound card and a basic mic can get reasonable results.
Here, though, is the UK, and behind me is a quite noisy fan heater; if I want far field and use AGC, the AGC will ramp up and amplify that noise.
I can use basic Speex noise reduction and AGC, but that gives artefacts, as it is noise reduction and not removal.

So near = 0.3 m with AGC on and NS on
https://drive.google.com/open?id=1zvetgGR7ftyJoHV41fk80gRXLLkXKorI

Turn off NS
https://drive.google.com/open?id=1XYsOqSpHi-9B2HqAxT4IXWIlZmVZZKtM

Far with AGC
https://drive.google.com/open?id=1p2_9BYih4oC905aziNHn7vlh5gURAGVX

Far without AGC
https://drive.google.com/open?id=17TX6bfxVNNWxTRgFyysuf8LSTiUejnSr

The catch-22 for the likes of Amazon & Google is that they can train models for specific hardware setups, and even though that may sound bad for recognition, they provide perfectly good results.
Rhasspy has all sorts of hardware attached as it is open to choice, so its models are recorded relatively clean.
The only way to have the magic to do it all and create relatively clean voice takes much more hardware than Google or Amazon need; RTX Voice on a high-end GPU could provide clean voice against all manner of noise with minimal or no artefacts.

Probably if we all used the same hardware and all used Speex, then the models could be recorded accordingly, and the best mic would be the one the models were recorded with, along with the algorithms in use.
Likely a really cheap and simple setup could give excellent results if the models were all recorded with those hardware profiles.

But from hardware to software, the models included in that software are all different and mainly expect clean, noise-free audio, which if you have RTX cash is quite possible.
At the sub-$100 scale the DSP mics are great for far field and beamforming, but with the introduction of noise it all goes to pot.
So it's "any mic is a good mic", or no mic, unless you're going to splash the cash on some high-end GPU-based RTX Voice-like function, depending on what you deem fit for purpose.

There is a lot of stuff with arrays and pixel rings that really is no advantage over any cheap mic, as we don't have the fixed hardware or the software algorithms that the big guys have.

Thank you for your very detailed response! I’m not going to lie, much of it goes over my head. Are you aware of any array microphone products/projects which do implement beamforming and things like that? I think that this was part of the promise of products like the Matrix Voice, which didn’t come to be.

The USB ReSpeaker, but I'm not really a fan.
The Anker PowerConf doesn't like Linux.
The Acusis S Linear Microphone Array, but it's a bit pricey.

Still all suffer with noise though.

https://www.digikey.com/en/products/detail/antimatter-research-inc./AR-ACS1/13147322

I see. So if I'm understanding correctly, we are essentially "screwed" by the fact that the speech models are hardware agnostic. This may be an advantage for the project, which desires to serve many different pieces of hardware, but a disadvantage if you want the absolute best recognition. If I'm looking to have a form factor similar to a device like an Amazon Echo or Google Home in the less than $100 price range, it sounds like there is no advantage to having array microphones versus a simple one or two microphone hat?

Only if you can train your own KWS & ASR on your hardware with noise profiles…
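For the "noise profiles" part, the usual trick is to mix clean training clips with recordings of your room's own noise at a controlled SNR. A rough sketch (the signals here are synthetic stand-ins, not real recordings):

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals snr_db, then mix.
    Used to augment clean training clips with a room's own noise profile."""
    rms = lambda x: math.sqrt(sum(s * s for s in x) / len(x))
    scale = rms(clean) / (rms(noise) * 10 ** (snr_db / 20))
    return [c + scale * n for c, n in zip(clean, noise)]

# Synthetic stand-ins: a voice-band tone plus fan-heater-style mains hum
clean = [math.sin(2 * math.pi * 300 * n / 16000) for n in range(1600)]
noise = [0.3 * math.sin(2 * math.pi * 50 * n / 16000) for n in range(1600)]
noisy = mix_at_snr(clean, noise, snr_db=10)
```

Training the KWS on clips augmented this way bakes the deployment environment into the model, which is essentially what the fixed-hardware vendors get for free.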

But yeah, you hit the nail on the head: allowing choice of hardware & software with a collection of 3rd-party models does make things a bit mission impossible, unless you provide hardware- and environment-trained models.

@Light the 1- or 2-microphone hats are omnidirectional and at a disadvantage to a simple unidirectional old-school electret plugged into a sound card.

The Raspberry Pi Zero Codec is probably the best choice, as it has a single MEMS omni onboard but also a stereo 3.5mm jack with aux in and aux out; you may find that without beamforming, plugging in a unidirectional mic is better, so it gives choice over a fixed hat.

For me the audio quality that the WM8960 hats give via their omnidirectional MEMS is actually pretty low, as they pick up everything you don't want as well.

I have a preference for a USB sound card and a unidirectional mic, as you can also isolate the mic to an extent that is near impossible on a PCB-mounted hat.

@sskorol managed to get some of the ReSpeaker freeware working for beamforming, AEC, and NS.

Hmmm. I’m not sure where I should go from here. Perhaps I should provide some context.

As many people have done, I am making a device which can control a device in the home. The major difference is that the device being controlled is a hospital-style bed. My current setup with Rhasspy and a Matrix Voice is “acceptable”. I somewhat frequently get false positives on the wake word and falsely recognized intents. I have not extensively adjusted settings, and I understand that some of these issues may be alleviated. However, my primary reason for starting this thread is concern over longevity, as it appears the Matrix Voice days may be numbered. I believe a directional microphone might not be a good choice because it limits the positioning of the device and the use case itself is not fixed (the speaker might be positioned at various locations around the room). There will always be some noise.

So I’m trying to figure out what my best option might be going forward and I really appreciate your advice so far. I’m not against training my own models, but honestly have no idea where to start with that. I took a look at the Raspberry Zero Codec and it has some attractive features for my purposes. I would like to use a 1.2 W speaker, or a pair of them to reduce dependency on large USB speakers for audio feedback.

I really couldn’t make a recommendation for a hospital style bed.

Guess it's just price from there: WM8960s start @ $10
Zero Codec $20
Respeaker DSP USB Mic array $65
Acusis S Linear Microphone Array $100

Or go the other way.

I'm just running scared of making any suggestions for a 'hospital style bed', so these are examples more than suggestions.

Ha. I do have the ability to control it through the Echo, but I’m trying to move away from that because, obviously, it’s not local. I can run all of this on a battery backup so even if Internet or power is lost, it is still good to go.

Don’t worry too much about the hospital style bed. My point was more to illustrate that it’s more like controlling lights in your house, but not something you can typically do with off-the-shelf items, so it’s actually serving a need rather than just being a fun experiment.

I played with both the Matrix Voice (ESP32 version) and the ReSpeaker Core v2 in a smart home context. The latter gives much better results due to the available DSP package from Alango. It's not perfect, but it's better than the nothing you get with Matrix.

Note that I don’t use Rhasspy. I use Vosk (https://alphacephei.com/vosk) as ASR and Spacy (https://spacy.io) as NLU engine for my native language + custom sensors firmware.

If you are interested in a websockets server built on top of librespeaker DSP, check my repo: https://github.com/sskorol/respeaker-websockets

Thank you for your reply! So I have taken a further look, and have some follow-up questions for you.

Have you tried using a ReSpeaker Mic Array v2? It claims to do some of the algorithms such as beamforming, noise suppression and AEC on board. I’m also interested in your choice to use a single board solution rather than a Raspberry Pi + voice add-on. Can you discuss the reasons for doing that?

Why did you choose to use Vosk and Spacy over Rhasspy?

It does not have beamforming, noise suppression and AEC on board, but they do supply librespeaker, and @sskorol has done a brilliant job of setting it up, which is a first as far as I know.

http://respeaker.io/librespeaker_doc/

Probably Russian language, but I will have to ask @sskorol.

edit: the ReSpeaker Core v2 does not, but the USB one does.

AEC only cancels what it plays, as with most implementations. NS can cause artifacts that reduce recognition with clean models, but the beamforming is pretty reasonable; some say it can be a bit hissy, @fastjack.
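For anyone wondering what those beamformers actually do, the simplest variant is delay-and-sum: pick steering delays that match the direction you want, shift each channel by its delay, and average. A toy two-mic sketch (the delays and signal are made up for illustration; real implementations estimate delays from mic geometry and work frame by frame):

```python
import math

def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer: shift each mic channel by its steering
    delay (in samples) and average, so a source arriving with exactly those
    inter-mic delays adds coherently while off-axis noise partly cancels."""
    n = len(channels[0])
    out = []
    for i in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            j = i - d
            acc += ch[j] if 0 <= j < n else 0.0
        out.append(acc / len(channels))
    return out

# Source hits mic 1 one sample later than mic 0; steer with delays (1, 0)
sig = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(160)]
mic0, mic1 = sig, [0.0] + sig[:-1]
steered = delay_and_sum([mic0, mic1], delays=[1, 0])
```

With the matching delays the two copies add coherently; a source from another direction would arrive with different inter-mic delays and be attenuated by the averaging.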

No, I haven't. It's a USB mic, so you have to buy additional hardware to work with it. It'll be more expensive and consume more power and space. Moreover, there are no benefits in pairing a mic array with other boards just for audio pre-processing; it'll be a waste of resources. I prefer standalone Alexa-like boards for DSP and streaming to a dedicated server such as one of the NVIDIA Jetson boards. Assuming you are building a brain for your house, which may potentially include not only ASR/NLU engines but also computer vision stuff, it makes more sense to consider more powerful hardware with a GPU.

I didn't even consider Rhasspy seriously. But the main answer is flexibility, accuracy, performance and great customization. I can raise a good offline ASR server for my language with GPU support in several lines of code. I can train a custom NER or classifier the way I want and not restrict myself to a rule-based approach. I just like to have full control over everything in my home. So that's pretty much the reason.

I was just going based off of what the microphone array product page says. Thank you for the clarification on AEC. It's still helpful to me, as my use case has audio feedback, but an important distinction nonetheless.

So if I’m understanding correctly, you are using your ReSpeaker web sockets library as an input to Vosk, but Vosk is running on separate hardware? Is the ReSpeaker Core v2 powerful enough to do Vosk and Spacy on the device? I have no plans for computer vision.

Correct: Vosk/Spacy/MQTT broker and the other stuff are running on a Jetson Xavier NX / Nano board (I have both). The ReSpeaker Core v2 has only 1 GB RAM and no CUDA (for decreasing ASR latency). The ReSpeaker board is capable of running only lightweight processing logic; ASR/NLU is too heavy for it. It's hard to even program there, as the VSCode C++ tools require at least 4 GB RAM, so remote development is very limited. That's why I just deploy streaming software there to apply DSP and send pre-processed chunks to the ASR server via websockets.

Okay, thank you for explaining that. It's good to know what the capabilities of the device are. I'm guessing you have a couple of ReSpeakers positioned around the home, all streaming to a single Jetson which is doing the processing? Would it be possible to send audio data over a physical connection such as USB instead? Where can I go to learn more about setting up a speech recognition system using Vosk + Spacy?

Thank you to both you and @rolyan_trauts . I feel like I’m learning a lot of useful information thanks to your explanations.

You might actually not need all that: if you just have simple control, then maybe you just need a multiple-keyword KWS and some simple logic of your own, if the control functions are limited.

We are all different, but I found the Nano board quite hard work, as its 2 GB is shared with the GPU; unless you get the 4 GB version, and then it's not such good value for money.
Also with the Nano I didn't think there was all that much difference between its 0.5-TOPS GPU and what I have seen on a Pi4, which I guess could have a Coral AI accelerator, which is 4 TOPS.
I don't suggest a Coral AI as it's very bespoke, and unless you run the demo models and need an extremely fast parrot detector, it's not much use without a lot of work.

I managed to get a Haswell NUC for £80, and I like the reuse slant as well: the AVX2 instructions with the clock speed, a 120 GB SSD, and the 4-TOPS £24 mini-PCIe parrot detector, with a spare slot if I use a USB dongle.

The 1st-gen Intel Core NUCs are approximately the same as a Pi4, but Haswell-generation NUCs can go quite cheap, like the D54250WYK I got.
But if you just need simple control, then just a KWS means just a Pi3 or Pi4-2GB, or you can try the other demo with the parrot detector, which is a voice-activated game of Snake.

Yes, several Respeakers stream the audio to Jetson board when KWS is triggered.

Well, if you use a USB mic, you can write streaming code for the board it's connected to. If it's an RPi 4, you can even run Vosk there, but without GPU, so performance won't be as good as on devices with CUDA.

Vosk has 2 types of models: big (for servers, e.g. a PC or Jetson-like boards) and small (for RPi or Android/iOS). Big models are heavier in terms of resource consumption, but more accurate. Technically, you can run big models even on an RPi 4. But again, if performance matters and you want a super responsive ASR, it's better to run Vosk on boards with a GPU.

If you really want to try it, I'd recommend playing with a desktop version first, just to check the accuracy for your language. You can find docs and a number of examples for different programming languages on their GitHub: https://github.com/alphacep/vosk-api. Note that small models with a dynamic graph allow vocabulary adaptation: you can increase the probability of some words' appearance while transcribing.

In regards to Spacy, I'd agree with @rolyan_trauts that you probably don't even need this stuff if you are building something simple. Yes, Spacy definitely gives good control and flexibility, but as it's a low-level library, it'll always be a hard path.

But if you're really curious, you can try training a simple NER by example in one of my repos: https://github.com/sskorol/ner-spacy-doccano. The most complicated part is the training data preparation and labelling. Not even complicated, but rather boring, as you have to collect about 200 real-life examples (yes, Spacy NER doesn't require thousands of samples) which might potentially be said by you or the users of your software. Then you should label them via a special annotation tool like Doccano or LabelStudio and export them for further training in Spacy.
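To make the export step concrete: assuming doccano's JSONL shape of `{"text": ..., "labels": [[start, end, LABEL]]}`, converting to the `(text, {"entities": ...})` tuples spaCy's NER training examples use takes only a few lines. The sentences and the `BED_PART` label below are invented for illustration:

```python
import json

# Hypothetical doccano-style JSONL export: one object per labelled sentence
raw = """
{"text": "raise the head of the bed", "labels": [[10, 14, "BED_PART"]]}
{"text": "lower the foot a little", "labels": [[10, 14, "BED_PART"]]}
""".strip()

def doccano_to_spacy(lines):
    """Convert doccano-style [start, end, label] spans into the
    (text, {"entities": [(start, end, label)]}) tuples spaCy NER uses."""
    examples = []
    for line in lines.splitlines():
        item = json.loads(line)
        ents = [tuple(span) for span in item.get("labels", [])]
        examples.append((item["text"], {"entities": ents}))
    return examples

train_data = doccano_to_spacy(raw)
```

The character offsets must land exactly on the entity span (here `[10:14]` covers "head" and "foot"), which is why an annotation tool beats hand-editing JSON.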

That’s a screenshot of my recent meetup on this topic which describes the process:

Anyway, feel free to ask questions if you want to go this path. But it’s not an easy one.