Is there no Pi microphone array HAT any more?

Matrix Voice is completely dead, and for ReSpeaker there are only really old drivers from HinTak that don't work with the newest kernel. Are there no alternatives any more?
Will I have to use a USB mic array? Which one works best (with Rhasspy) on a 64-bit OS with kernel 6.x, and will get future updates? I read something about the Anker PowerConf S330, but on the website there are no drivers for Linux!? Is there no more hardware with official Linux support?

I am confused.

My Matrix Voice works fine, but on a very old system. I now want to give Rhasspy 3 a try and don't want to start with an old kernel and 32-bit. :frowning:

Thanks a lot, Mili

Come on … you all have to use a mic! :wink:

Come on … have you read the previous conversations?

To repeat once again … if you use Rhasspy 3 with Home Assistant - that is called HA's Voice Assistant, and support (including setup tutorials) is over in the HA community forum.

I note that HinTak updated his drivers very shortly after the latest RasPi OS was released. Have you tried it?

My latest satellite (on my test bench) uses a USB mic and regular speaker, and seems to do quite OK without the expense and driver issues (excepting that passing USB through to Docker can sometimes be a pain).

Thanks for your answer. Which previous conversation do you mean?

HinTak's last update of the driver is 1 year old. I have tried it and got this issue:

This issue is more than 3 months old and nothing has happened. So I don't know if it will ever be fixed.

I don't use Home Assistant and only want to take a look at version 3.
But I also want a new system for my normal Rhasspy, with the newest software.
And HinTak's work is nice, but what will the future bring? Is there no company left making mic arrays, where I could hope for a bit more support over time?

And you wrote you use a USB mic, but not which one? Also the docs are so old, and only Matrix and ReSpeaker are shown there. Is there no list of well-supported hardware? A normal mic is no alternative for me; my Matrix Voice works at a range of 4-5 m, and I don't want a downgrade.

I have taken to using USB speakerphones for my assistants, one of the main benefits being that they have hardware echo cancellation built in, so your assistant won't hear itself talking. If you use your device to also play music, it can filter that out too, so it can still hear you. The drawback is that the sound can be tinny, especially on cheaper models, but all the ones I've used have had far-field microphones that work well enough for speech-to-text. The tinny sound only bothers me when I'm playing music. You can find them for under $20, although you can certainly spend a lot more.

HinTak has previously commented that he is only updating for newer kernel versions - not doing any updates to functionality - yet I have always been surprised that HinTak just copied seeed's readme, including the links to the old OS and seeed's forum (where they stopped responding to users 5 years ago).

Seeed’s demo includes software features not included in the device driver, and it seems that all companies are viewing these voice processing functions as proprietary :frowning:

As for the recent reSpeaker update, I had simply noticed the commit "Fix /boot/config.txt for bookworm installs" 2 months ago.

And I have no clue about the kernel panic. Lack of ongoing support is a main reason I don't recommend the reSpeaker HATs.

You are totally correct that the info on the Rhasspy website and community is old, with very few updates for about 2 years since Mike started working on Rhasspy 3 and then got the Nabu Casa job. I am hoping that now HA Voice Assist is established he will be able to spend some time documenting generic Rhasspy 3. I suspect that all the code/components are available - they just need more detailed documentation for people to see how they fit together.

As for currently recommended hardware …

  • My generic USB desk mic + speakers are fine while sitting at my desk half a metre away.
  • raspi + reSpeaker is rather expensive for what it actually does.
  • Some conferencing speakers have the extra smarts we want, but are too expensive to deploy all around the house - on a pension in Australia anyway.
  • Home Assistant Voice Assist are currently recommending the Espressif ESP32-S3-BOX device for voice satellites … though they are using the Wyoming protocol.
  • I personally think Nabu Casa could start with the ESP32-S3-BOX and improve the speaker, mic and case acoustic design to bring it closer to alexa/google sound quality.
  • Hopefully @rolyan_trauts has been putting his apparent expertise to use, filling the market gap he sees for good-quality, affordable, off-the-shelf “ears”.

Thanks for the detailed answer! :slight_smile:

@aaronc - Which one do you use exactly? How many metres away from it can you go?

@donburch - Do you know of some conferencing speakers that work fine? I don't need many of them, and therefore the price is not so important for me.

PS. I searched for your mention of the reSpeaker update and with that I found:

Before, I had only found:

I will give the x version a try, thanks a lot for this help :slight_smile:

That bit is quite easy and has been doable for a while, but for some reason upstream uses Whisper, which has been trained on a specific type of audio input: monaural near-field recording with tolerably low-level noise.
You can fine-tune Whisper, but it's a very large model, much work, and still likely worse than a much smaller purpose-trained model.

I keep repeating that the big guys have a hardware advantage because they dictate the ‘ears’ so the models in the cloud are trained on data processed by them.
You can feed Whisper various great filters and speech-enhancement algs and WER actually goes up, as the signatures those algs create were not in the training dataset.
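For readers following along, WER here is word error rate: the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch of the metric - not any particular toolkit's implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1] / max(len(ref), 1)
```

So one substituted word in a four-word command already means a WER of 0.25, which is why even a small mismatch between the mic signature and the training data shows up quickly in the numbers.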

I never suggested the ESP32-S3-Box, as for ears much of its expense is just bloat; dev on just an ESP32-S3 and mics only never took off.
There are various projects, but really the ESP32-S3 is just being used as a wireless mic, with zero pre-processing, that just feeds Whisper.
As usual I am bemused, especially by the one that cracks open Google Nests, rips out a working audio preprocessing system, and just sticks in an ESP32 as a wireless mic.

If you could get upstream just to commit then likely the answer is yes, but for greatest accuracy you need to preprocess the dataset and train an ASR to the audio input's signature.
When you do that, things work really well, like commercial units have been doing for a number of years.
The bring-your-own-to-the-party approach means we have what we have: just single mics feeding Whisper, as no one will dedicate an ASR to specific hardware with a certain signature.

People are just having a bit of fun with the Year of the Voice and the projects that you can make, but there is nothing remotely near the hardware dictate you need to even come close to the big guys' smart speakers.

It's very simple: an upstream ASR trained on data preprocessed with the filters and algs of the hardware in use provides massively better results than the quantum task of building models that can cope with bring-your-own-audio-to-the-party.
Whisper does an excellent job of the latter, which is why it's being used, even if custom-designed solutions with hardware dictate (aka the Nest, Alexa and Siri units in use) work vastly better.

Plugable do a USB stereo soundcard, as do Axagon.

ReSpeaker still do the 2-mic.

All else really is totally pointless, because we don't really have any algs to run on them.
I did do a simple delay & sum beamformer for any stereo mic input that, because of the simple summing, likely has a very small signature, is similar to a single mic, and would work well with Whisper. (You need to test with Whisper, as it may actually not like the speech enhancement of some devices.)

Likely a USB soundcard with AGC and mic is as good as the latest and greatest due to how the Whisper model reacts to the audio input.
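The delay & sum idea mentioned above really is only a few lines: estimate the inter-channel delay, align, and average, so coherent speech sums in phase while diffuse noise partially cancels. A toy NumPy sketch under those assumptions - this is not the 2ch_delay_sum code itself, and a real implementation would use fractional delays and block processing:

```python
import numpy as np

def delay_and_sum(left: np.ndarray, right: np.ndarray, max_delay: int = 16) -> np.ndarray:
    """Align two mic channels by cross-correlation, then average them.

    Speech that is coherent across both mics sums in phase, while diffuse
    noise adds incoherently - that is the (modest) gain of a 2-mic array.
    """
    # Find the integer sample lag that best aligns the two channels.
    corr = np.correlate(left, right, mode="full")
    mid = len(corr) // 2                               # index of zero lag
    window = corr[mid - max_delay: mid + max_delay + 1]
    lag = int(np.argmax(window)) - max_delay
    # Shift the right channel into alignment and average.
    aligned = np.roll(right, lag)
    return 0.5 * (left + aligned)
```

Because the beam simply follows the strongest correlated source, this behaves exactly as described later in the thread: it focuses on whatever is loudest, with nothing to tell it which source is the voice.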

There is some nice new hardware about.
The Radxa Zero 3W (OKdo ROCK ZERO 3W 1GB with Wi-Fi/BLE without GPIO - OKdo) is an £18 SBC that for ML is faster than a Pi 4.
There is also the ultra-cheap Radxa S0, an £11 ultra-low-wattage A35 SBC:
OKdo Radxa ROCK S0 512MB Single Board Computer Rockchip RK3308GB with Wi-Fi 4 / BT5 - OKdo
Strangely both have onboard audio codecs, but I don't think they are on GPIO … !?

Hi Stuart,

I thought this conversation was about the “ears” - getting high quality audio from the distributed devices - which I expect should be quite separate from the Speech-to-Text phase.

I understand that the ESP32-S3-BOX was designed to demo a variety of things that the S3 can be used for, and so is overkill for our needs - but I'm sure it was you who pointed out that the extra instructions in the ESP32-S3 processor should allow more of the audio pre-processing functions (which you are intimately familiar with, and I just refer to as magic) to be done cheaply on-chip. I think that with microwakeword it's the best off-the-shelf option at the moment.

I was also amazed that anyone would rip all the magic out of a device to replace it with a simple processor. I had expected them to use much of the circuitry on the board by disabling the existing CPU and adding their own. I appreciate the audio engineering that must have gone into designing the case to channel the sound from the speaker, and to the microphones, to reduce interference - and again they replaced it with poorly positioned microphones :cry:

Seeed developed the reSpeaker 2-mic HAT, and several clones hit the market … and it would have been the standard by now if seeed hadn’t lost interest in supporting its driver/firmware … but seeed saw it only as a demo to encourage other companies to incorporate and support. The demo showed it has the capability, and it could probably take off again if someone was motivated to build some of that pre-processing magic into its firmware or driver. But RasPi + reSpeaker + case will always be fairly expensive compared to on-chip magic :frowning_face:

There is definitely a need in the market for a good quality, relatively cheap, satellite device; and I feel confident that it would sell well … which in turn makes it worthwhile for others to generate the upstream models for Whisper, etc. I believe you have the expertise to develop such a device, though I assume you see marketing as being the big challenge (that is my big fear). Certainly I believe Nabu Casa could build an ESP32-S3 based voice satellite device (and they have existing market presence and experience with the HA blue/green/whatever boxes).

In the meantime people can pay for proprietary conferencing mics with the magic built-in; or put up with a standard USB microphone doing as good a job :frowning_face:

I won't respond to your comments about Whisper because I'm not sure what point you are trying to make. That Whisper is bad? That the variety of audio input sources makes Whisper less effective?

Nope unfortunately, and it was also a surprise to me, but models such as Whisper are trained to cope with a modicum of noise and room impulse reverberation.

It's sort of set to a near-field mic at probably a max of a metre, the RIRs that produces, and relatively low-level noise.
The noise levels are in the opensource docs of Whisper, but other pretrained ASRs exhibit similar behaviour.
So if you employ a far-field mic to strip RIRs and noise, audio that sounds completely perfect to human ears can, depending on the algs used, produce seriously bad results, as the close-field mic and moderate noise have been trained into the Whisper dataset.
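To illustrate the RIR point: "adding a room" to clean speech is just convolution with a room impulse response, which is also how datasets get reverb-augmented so a model learns to tolerate it. A toy sketch with a made-up two-tap RIR (real RIRs are measured or simulated, not hand-written like this):

```python
import numpy as np

def apply_rir(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve clean speech with a room impulse response.

    The result is what a distant open mic hears; this is also how training
    data is augmented so a model learns to tolerate reverberation.
    """
    return np.convolve(clean, rir)[: len(clean)]  # trim back to input length

# Made-up two-tap RIR: direct path plus one attenuated, delayed reflection.
rir = np.zeros(480)   # 10 ms at 48 kHz
rir[0] = 1.0          # direct sound
rir[320] = 0.4        # single reflection ~6.7 ms later
```

The inverse operation - stripping that reverberation back out - is what dereverberation algs attempt, and it changes the signal's signature away from what the model saw in training.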

The ESP32-S3 has some Espressif blobs for source separation & AEC that could be used on any ESP32-S3 coupled to mics with a good analogue AGC.
There is nothing special in the circuitry of the emptied Nest products; the shell is just being reused for what is just a single-channel network mic.

The 2-mic HAT, due to being a HAT, dictates how and where you place the mics, and is a terribly inflexible design that makes positioning and insulating your mics a near-impossible chore. Most people have the HAT, which is a 2-mic broadside array, pointing at the ceiling, because it is a HAT and designed that way.
Best is to wall-mount it; then at least the mics are pointing your way, as the casing will provide rear rejection.

People might not be able to pay for proprietary conference mics with the magic built in, such as ultra-hi-tech devices like the AnkerWork S600 Multifunctional Speakerphone with Voiceprint Recognition.

Whisper is trained to expect the room impulses of an open mic and the noise that may occur.
Feed it cleaned signals stripped of RIRs and Word Error Rate can climb dramatically.
No, Whisper isn't bad, as it was designed for an open standard-mic scenario; changing that scenario and expecting it to work as well means your knowledge of what you are using is bad.

Nabu Casa likely have market presence, but I have a hunch it is similar to Mycroft, where they present a knowledgeable front.
That they are not allowing an opt-in to collate usage data from collaborating users is a massive red flag, as they are ignoring dataset gold.
The custom wakeword is another red flag, as even big data finds custom wakewords inferior to models they can dictate and control, which return vastly better results.

From: Paulus Schoutsen
Sent: 30 October 2023 20:31
To: Stuart Naylor
Cc: Michael Hansen
Subject: Re: Just read as very short but what has been done or being said makes no sense?

Hey Stuart,

Adding Mike to the CC.

I’ve been reading along with your posts for a while now both in the Rhasspy and the Home Assistant community forums. Although you might intend well, your posts are derailing threads and you need to stop posting. Mike agrees with this.

We are not interested in hearing that everything we did is crap, everything we are working on is crap, or hear about solutions that are unrealistic and unachievable.

Consider this a courtesy warning that these posts need to stop.


I haven't been saying everything is crap, but everything that has been implemented is fundamentally wrong.
The solutions are realistic and achievable, have been available for quite some time, and are being done - just not in HA.

I have been warned not to speak about it, so I will not.
I replied because you asked, but there is little else I can say.

When I do reply, as I did above, I get emails such as the one I have posted.


Please Stuart, take my comments with a grain of salt. They are simply from my personal point of view, trying to appreciate your point of view.

Regarding Paulus’ email first …

Paulus and Mike, sorry if I have “stirred up the Hornets nest” with my post.

Stuart I have long wondered whether you have been deliberately trolling this group to create argument - but have decided it much more likely that you just have difficulty expressing yourself, between your obvious passion for the topic, and a brain that seems to be always jumping from one idea to another. I also find communicating with people difficult, and have recently discovered that can be a symptom of neuro-divergence (eg autism). Unfortunately many of your previous posts came across as excessively critical and/or confused off-topic rambling … but after I’ve cooled down a couple of days and re-read, I have often found gems of knowledge and new perspectives.

You obviously have a lot of passion and expertise in the audio engineering field, and I suggest that focussing on what you can do right (I’m guessing the “ears” niche of the market) would be productive and satisfying for you.

Over the years i have come to appreciate that different is not always “wrong”, and that there should be enough space in our universe for different points of view.

I do understand that having better quality audio used to train Whisper (well, any STT) and feeding it better quality audio should make the whole STT job easier, and of course google/amazon have the money and marketing to get that large base of higher quality audio from a small, controlled selection of input devices.

On the other hand, Open Source STTs have to make do with the audio data that is available. Such is life. At least by promoting a cheap quality ears device the proportion of quality audio samples should grow.

Are you suggesting that HA needs to collect all the audio that users are generating, so we can build a dataset to compete with big business? That seems very much opposite to the local control principle. Or are you suggesting that users could opt-in to providing this data? If so, how many TB of data, and wouldn't too much of it be low quality because of the wide assortment of mics etc.?

And this is the thing that puzzles me most…
If HA is going in the wrong direction, and not open to listening … wouldn’t it be more satisfying to forget HA and instead put your time and effort into the other better solutions ?

It's OK, as I've stopped posting and giving opinions. I have MS, with scar tissue from several exacerbations; my memory and concentration are not what they used to be.
I was using a Pi and voice tech purely as cognitive help.

All models work to the parameters of their dataset, and the reason Whisper works so well with near-field single mics and moderate noise is likely because the dataset contained such audio.
It's a huge unwieldy model to start training yourself, but the problem will always turn up with bring-your-own-to-the-party hardware that may have signatures outside a model's dataset.

Voice tech is a system, a pipeline all working optimally with each other, not just random loosely-licensed opensource that you can package and brand.

There is so much that is likely a cul-de-sac, as getting the datasets for certain hardware is as easy as running the existing datasets through those filters and algs.
This is where the Espressif blobs get less attractive: there is no opensource version of the code, so you cannot create hardware- and filter-alg-preprocessed datasets by running them through on a powerful computer, much faster than realtime.
So unless someone can provide opensource AEC/BSS/KWS for the ESP32-S3, it's likely better to go Victorian engineering and use application SBCs with more power, with algs we have that can scale up for dataset making. They already exist, and I have been repeatedly saying so, but they need to be put together as a system - and if not that, then strong community guidelines need to dictate what to use as hardware.

Also the ears can absorb cost by being multifunction; I suggest Snapcast as a wireless audio system, and maybe even Kodi, so one can plug into a TV and provide more than just ears.

Keep the ASR and LLM central as a brain or a collection of containers routed to.
I have mentioned this many times for some time, but hey.

Maybe someone can try one of these AnkerWork S600 Multifunctional Speakerphone with Voiceprint Recognition units with Whisper, as if WER goes up then there are big problems recommending that a community buy expensive equipment that works no better, and in some cases worse, than 'just a microphone'.

Likely the USB chips in the soundcards I often name-drop could create an HA-branded mic by just adding 2x uni-directional mics with decent analogue AGC in a package, spaced for 48 kHz at approx 65 mm (from memory); USB leads can be quite long, and unlike a HAT you can have several of them…
We have had simple solutions for a long time but they are generally ignored…
Anyone can make one but ready assembled to plug in seems to be what people want.
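The "65 mm for 48 kHz" figure can be sanity-checked from first principles: sound travels at roughly 343 m/s, so one sample period at 48 kHz corresponds to about 7 mm of path difference, and a 65 mm pair gives around ±9 samples of usable inter-mic delay. A quick back-of-envelope check (the numbers are generic physics, not from any datasheet):

```python
SPEED_OF_SOUND = 343.0    # m/s in air at ~20 °C
SAMPLE_RATE = 48_000      # Hz
MIC_SPACING = 0.065       # m - the "approx 65 mm" quoted from memory above

# How far sound travels in one sample period.
metres_per_sample = SPEED_OF_SOUND / SAMPLE_RATE       # ~7.1 mm
# Worst-case inter-mic delay: sound arriving end-on along the mic axis.
max_delay_samples = MIC_SPACING / metres_per_sample    # ~9 samples
print(f"{metres_per_sample * 1e3:.2f} mm per sample, "
      f"max inter-mic delay ~ {max_delay_samples:.1f} samples")
```

Roughly nine samples of delay range is all an integer-sample delay-sum beamformer on such a pair has to work with, which is why mic quality and analogue AGC matter so much at this scale.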

I lost interest after that email, really, so no bother.
I have occasionally replied to existing threads.

Really the KWS should have a preferred KW and a function to capture positive KW hits. Allow the community to submit that dataset gold to HA as opensource with metadata.
Same with command sentences: from patterns of use it's pretty obvious which ones are positive, so capture them and allow an opt-in to send more dataset gold.

Everything is simple, but some real basics such as dataset collection have been missing, even though it's been said again and again.

I have no idea why all this has not been brought together as an optimised system, rather than just whatever opensource they can package with Python…
It is fairly simple, but from what I am seeing the opposite is true, and what is being done is done purely because they can…
Nothing is unrealistic and unachievable; it's purely because they cannot…

I use a Jabra 410 USB speakerphone. They are about $30 on eBay. It is plug-and-play and nothing special to set up if using Docker. You will need to reboot after you plug it in the first time. You may also have to modify a Linux txt file that is noted in another post on this site. Its shape is worthwhile, as you can 3D print a cylindrical enclosure to hold a Pi Zero 2 and set the Jabra on top. I see a lot of comments about cost. I do not want companies listening in on my conversations; that's why Rhasspy is a great alternative. I like cheap too, but what is your privacy worth? There may be cheaper speaker options available, but the Jabra works, and I can spend more time developing my custom skill programs than trying to figure out some cryptic Linux setting for a device that I have no detailed documentation on. Off the soapbox…

Failed to mention: my setup uses aplay and arecord for the audio playing and recording options.

I am an audio snob, and much as I like my Anker PowerConf, the speaker on my speakerphone doesn't really cut it for a smart speaker.
I am one of those that listens to a Nest mini or Echo dot and can only think ‘Urgh!’
The standard Nest & Echo are a minimum for me as ‘Play some music’ is one of the rare occasions I do use a smart speaker.

We do have 2 excellent wireless audio apps: Squeezelite, as it will squeeze into an ESP32, and my own fave, the almost limitlessly configurable Snapcast.
Mounting a small amp board onto the back of a bookshelf speaker can quickly surpass Nest and Echo devices.

By using Squeezelite/Snapcast you get great zonal audio and a platform that it is pretty easy to add a microphone to.
I hacked together a 2-channel delay-sum beamformer (GitHub: StuartIanNaylor/2ch_delay_sum) that will extend far-field, but there is no signal to focus the beamformer on.
So it acts like a conference speakerphone and will beam to whatever is the predominant noise, constantly shifting focus.
The methods of targeting a voice or focusing a beam are totally absent from the opensource we have, even though code and solutions are available.

Speakerphones do have AEC but we also have opensource for that even if not implemented.
With a bit of lateral thought you can get huge improvements and not need AEC at all, by not mounting your mic in the same enclosure as your speaker.

A wired microphone can be small and very discreet - even a dual mic - and you might have more than one in a room to give much better coverage.
Initially I expected 'ears' to be 2x mics on an ESP32-S3 in a small flat panel, approx 75 mm in width, that clips onto a wall or stands on a desk.
It still needs to be powered, and may contain a pixel bar/ring and even have audio out.
That is why I don't see much problem with wired mics either: even wireless network mics still need a PSU, so there is really little difference or advantage.

I think lateral thinking is needed on what an opensource smart speaker is, as with things like Squeezelite or Snapcast opensource can give Sonos-like zonal audio with a huge array of choice.
That zonal audio, where Home Assistant can excel and beat commercial systems on functionality, quality and choice, is a huge selling point that is likely under-implemented and undersold.

For some reason, whilst trying to escape Big Data 'smart speakers', opensource has been blinkered and tries to copy the consumer individual product verbatim.
I don't even think we need a 'smart speaker', just a zonal opensource microphone system that takes apart the commercial notion of a smart speaker into its working components of wireless audio, microphones and pixel indicators, to give choice.
If someone wants to build that in a box, they can, but a modern room could be a very different scenario with a single large pixel indicator, wireless room audio and several dispersed microphones.

I guess it's where you want to go, but HA in terms of smart controls and dashboards represents some of the most cutting-edge smart home tech, whilst USB speakerphones hanging out of mini computers are not much above the Raspberry Pi Google AIY Voice Kit…

There is also other commercial equipment that can fuse functions to become more cost-effective as a device; my Anker C300 dual-mic webcam is great for audio and its far-field isn't bad.
It can provide Frigate video and be a wireless room mic array, used in conjunction with a wireless audio system.

I also tried some of the bi-directional audio pan-tilt cams you can get but found on the ones I got the mic audio is pretty awful.

You can still use the ReSpeaker 2-mic, which eventually always catches up to the latest release. Likely though, with again some lateral thought, USB devices are far more compatible and convenient than the limited application of Pi HATs.

There still are Pi microphone HATs, but really only the 2-mic is of any use; on the other hand there is a whole range of USB devices that work on many platforms, which prob could do with an HA mic for those who don't want to build one.

The hardware and software all exist; it's just not implemented as a system, and so ends up excluded.

Hi Stuart, sorry for that, but could you be more concise, make your answer more understandable, and only answer the question? If you want to share your work and your visions with everybody, that's fine, but I suggest you open a specific thread. We feel that you have expertise on some subjects, but this should not be spread across all the subjects.
I think also that it's useless to make the mail from Paulus public. It does not help the project. It's something between you and Paulus and Michael.
Best regards.

Nope, as that was it and is it.
I doubt I will be contributing much more, and would prefer not to be tagged in future conversations.

No worries, Don :slightly_smiling_face:
I share the same thoughts about Stuart (not tagging him as he asked not to be). In fact, as I delve deeper into building voice satellite hardware for Nabu Casa, I keep coming across posts from Stuart either here or across Github. It’s becoming clear that I wasn’t ready to hear many of his ideas; I’m still trying to catch up.

That’s not to say I 100% agree with all of the negative things that were said, and I believe it’s that negativity that pushed me and others away. Different people (and organizations) have different priorities and different strengths. Despite being in the voice space for a few years now, I only have a rudimentary understanding of audio engineering. But we’ve made a lot of progress around Rhasspy and other projects despite not having audio expertise, though I do think we could’ve made it easier on ourselves with the right knowledge (but when is that not true? :smile: )

Regarding the topic of this thread, we are discussing satellite hardware over here. While that discussion is focused on an ESP-based satellite, I believe very similar hardware could make a great Pi HAT or USB mic. Specifically, the XMOS XU316 with two microphones, audio out (echo cancelled), and some LEDs.

If anyone knows enough circuit design to build a prototype of such a thing, I’d love to chat with you :slightly_smiling_face:


You have to be wary of any hardware or closed software blobs that you cannot run data through faster than realtime; with the size of the datasets, that is some processing time.
You need to be able to preprocess the datasets of use all the way up the chain with the algs and hardware of use, which is easy enough with opensource software that you just run on a fast computer.
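"Faster than realtime" here just means the preprocessing is an ordinary offline batch job over the training set. A toy sketch of that idea, with a simple one-pole high-pass standing in for the real AEC/BSS/enhancement chain of the target hardware (the filter choice is purely illustrative):

```python
import numpy as np

def highpass(audio: np.ndarray, alpha: float = 0.995) -> np.ndarray:
    """One-pole DC-blocking high-pass: a stand-in for the device's real algs."""
    out = np.empty(len(audio))
    prev_x = prev_y = 0.0
    for n, x in enumerate(audio):
        prev_y = alpha * (prev_y + x - prev_x)  # y[n] = a*(y[n-1] + x[n] - x[n-1])
        prev_x = x
        out[n] = prev_y
    return out

def preprocess_dataset(clips):
    """Offline pass over every clip; on a fast machine this runs many times
    faster than realtime, so the exact chain that will run on the device
    can be baked into the ASR's training data."""
    return [highpass(c) for c in clips]
```

With open code the same chain can be run over terabytes of training audio on a workstation; with a closed on-device blob there is no way to do that, which is the objection being made.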

The SpeechBrain SepFormer is an example: for it not to make things worse than normal, they had to fine-tune Whisper on a SepFormer-processed dataset.

You were involved when Mycroft finally created an XMOS Pi HAT that was 2-mic. Surely anyone with a Mycroft II has one; I'm surprised you don't, being a dev at the time.

I do have a few Mark II’s with the SJ201 daughter board (containing an XMOS XVF3510). It works pretty well for audio, but there are a few things I would want to change. The non-right-angle header and 12V power plug alone will probably put a lot of people off :smile: