Hardware that can rival Google Home/Echo in terms of voice recognition and speaker sound if money is not an issue

Hey guys,

A couple of days ago I started my self-hosted voice assistant project. Again. I’m currently experimenting with a Pi 4 and a Matrix Creator and I want to get some feedback from the more experienced users here on this forum. I have to admit that I’m not familiar with how the Linux sound system works, mainly because I never bothered to understand it. While I’m a fairly competent Linux user, I try to avoid dabbling with the sound settings. I wrote Ansible playbooks to set up LXC containers on my Proxmox homelab, but I don’t really know the difference between ALSA and PulseAudio. I hate sound on Linux, but only because I don’t understand it. That being said, I’m willing to learn.

I think I’ve read every thread in the hardware category and my current summary is: there is no good hardware available that works from across the room with a bit of background noise. At least that was my impression, but I can’t imagine that is true, since there are a couple of users of Rhasspy, and I guess some of them live with significant others who are sometimes not as tolerant of tech not working on a day-to-day basis. Sometimes I’m not even that tolerant. For me, the benchmark is slightly below a Google Home or Amazon Echo. And even these are shit sometimes. So I’m willing to accept drawbacks and make compromises to a certain degree. If I have to stand right in front of the mic with no other sound running, well, then I think I’m better off with a tablet than a voice assistant. And I’m not bashing Rhasspy or the efforts that are going into this project, quite the opposite. I’m really impressed with what I’ve seen so far.

So I guess my question is: if money is not an issue, or rather with a total hardware cost of up to 250 bucks per satellite, is there hardware available that could rival a Google Home (Mini) or Echo device? I’m aware that hardware is only one side of the coin, but I’d like to understand where we currently stand in terms of hardware. Is there something available right now that works with Rhasspy and works at least well enough that a non-technical person is not going to murder me?

My ideal scenario would be

  • hardware that runs rhasspy as a satellite
  • mic that can detect a wakeword and command from across the room with ambient background noise
  • an attached speaker that sounds ok for casual listening of radio (and music in the kitchen), which I would be willing to place further away from the mic to increase the chance of voice recognition.
  • alternatively/additionally a good speaker that can be used in a living room for music.
  • runs a snapcast client to play music in a multiroom scenario. This is more of a software question, I know. But if this only works with HDMI, but not with bluetooth, then it becomes relevant for the hardware question.

From everything I’ve read so far, when it comes to speakers, these are the options:

  • use the J2C port, which is only OK for voice responses, not music
  • use the jack. Most likely doesn’t sound good either, but better.
  • use a speaker that connects via HDMI to the Pi. These are either VERY expensive (Klipsch The Fives) or are soundbars that take up too much room. The only real alternative I’ve found so far is the ZR5 from Sony (not cheap either). HDMI would also mean the speaker has to stand relatively close to the mic, since I’m not running a 15m HDMI cable through the living room.
  • Bluetooth speaker. Plenty available, but ones that can be woken up via BLE also seem to be rare. And then there is the issue of using Bluetooth on the Pi 4, which is a pain in the butt in itself.

So, in essence: if I want to have a good speaker in the living room with Rhasspy and Snapcast, and smaller ones for casual listening in the bathroom and kitchen, what are my options? Do I have options?

Thanks for reading, if you’re still with me at this point.

PS: I’m willing to invest time and money into this and play the guinea pig. Most importantly: if we find this combination of hardware, I want to document the hell out of it and create images based on this set of hardware that are ready to be installed by new users. Like I mentioned, the normal sysadmin stuff is where I’m fairly competent and willing to help out. I currently own a couple of Pis (3, 4, Zero), a Matrix Creator hat, a ReSpeaker 4-mic, a ReSpeaker 2-mic and a PSEye camera.

The infrastructure and architecture layout of Rhasspy is currently just wrong for this, as ideally you would have a single central processor, sharing a GPU, that also acts as a whole-house media server (AirPlay, Snapcast …)

Rooms vary in size and mic placement varies, so a singular source is always prone to problems where noise=near / voice=far.
Multiple mics (per room), where the best signal is selected, should be easy but as far as I know are not supported.
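
Selecting the best signal among several mics could be sketched roughly like this. It is only an illustration, not something Rhasspy offers: it assumes each mic that hears the wake word reports one line of the form "<mic-id> <score>", where the score is some confidence or signal-level figure. The best_mic name, the mic names and the report format are all made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: reads "<mic-id> <score>" lines on stdin, one per mic
# that heard the wake word, and prints the id of the mic with the highest score.
best_mic() {
  awk 'NR == 1 || $2 + 0 > max { max = $2 + 0; id = $1 } END { if (NR > 0) print id }'
}

# Example reports (made-up values) from three mics in one room.
printf 'window 0.42\nshelf 0.87\ndoor 0.61\n' | best_mic
```

Only the winning mic would then stream the command that follows, which is why broadcasting the hit alone is enough to make the choice.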

Really, WiFi microcontroller mics should be able to broadcast a KWS (keyword spotting) hit, as that alone would allow best-signal selection with discrete, low-cost devices.
https://speechbrain.github.io/ is the only toolkit that currently includes beamforming routines, and it is currently undergoing private beta testing (you can email them and join).
I think you are stuck with a singular device, and I think one of the best beamformers is https://antimatter.ai/acusis-s (there are others, such as the https://shop.pimoroni.com/products/respeaker-usb-mic-array ), but all still need playback and recording on the same device to produce AEC. (This is why a single high-power central unit can be so much better: if you use latency-adjusted network audio such as Snapcast, then it’s very likely you can do AEC centrally and remove extremely high levels of played audio.)

Many KWS are quite resilient to noise, but the ASR command is less so, and a single high-end mic that isn’t doing playback is likely to fail when noise=near / voice=far, as it’s simple physics.
So even if you brute-force it with high-end DSP, you can still get similar results with simple low-end USB hardware.

As for a Pi, there are no open-source beamforming libs that work, but for some reason there are many pointless multi-mic devices (guess they just sell to those who don’t know).
For mono there is a whole load of pointless hardware too, as a USB soundcard can suffice.
I can get better results using a unidirectional electret with a high-quality preamp and USB soundcard than with any of the Hats I know.
The only Hat of interest is the Raspberry Pi Codec Zero, as it accepts both on-board and external mics and looks a great card. As for the rest, go USB so you can use a better preamp and a directional mic.

For HDMI there are low-cost devices (around $10), but irrespective of cost I have noticed that many auto-sleep and then take anywhere from a fraction of a second to a couple of seconds to wake on playback.
That completely kills notifications, and even a tiny bit missing from a word can cause much confusion, so for me HDMI is a no.

Sure Electronics make some really good quality amp modules https://store.sure-electronics.com/product/AA-AB31184
But to be honest the Sanwu boards https://www.ebay.co.uk/itm/DC-12V-24V-TPA3116-Dual-Channel-Stereo-2x50W-BTL-Mono-100W-Audio-Amplifier-Board/272406367754 are not far behind and cost far less.

With amp modules the ratings are fubar, as they are often quoted at 10% THD into 2 ohms, whilst 4 ohms will be your minimum speaker choice and you certainly don’t want to be listening to 10% THD. A “50 watt” module is in reality about 30 watt with headroom, and that sort of ratio goes across the board.
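
As a quick worked example of that rule of thumb, the sketch below scales an advertised rating by the 30/50 ratio mentioned above. The usable_watts helper is made up for illustration, and the ratio is a rough guide taken from this post, not a measurement of any particular board.

```shell
#!/bin/sh
# Hypothetical helper: scale an advertised wattage by the 30/50 rule of thumb
# quoted above. This is a rough ratio, not a measurement.
usable_watts() {
  awk -v rated="$1" 'BEGIN { printf "%.0f\n", rated * 30 / 50 }'
}

usable_watts 50    # an advertised "2x50W" board, per channel
usable_watts 100   # the same board advertised as "100W" mono
```

The same headroom applies when matching speakers to the amp, so it is worth running both ratings through the ratio before buying.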

Sanwu also do the CM108 usb boards which seem a popular choice https://www.ebay.co.uk/itm/CM108-USB-Drive-Free-USB-Sound-Card-Laptop-Computer-External-Sound-Card-Module-M/152231802823

For Pi boards, the Pi3A+ (bang for buck) or the Pi4-2GB are the best choices.

A unidirectional electret from me will save you buying a pack of 25 from AliExpress https://www.ebay.co.uk/itm/324462527996

That sits on a high-quality preamp with AGC https://www.ebay.co.uk/itm/MAX9814-Electret-Microphone-Amplifier-AGC-Function-Module-Board-DC-For-Arduino/152293733901 and plugs into a USB soundcard.

For speakers, either reuse, as there are some absolute satellite speaker bargains on eBay, and with a Pi3A+ powered via 24V plus an amp board you can wall-mount them and start your WiFi audio on a budget.
Or buy new: 30 watt cones can start at about $10 if you search around ( https://www.visaton.de/en/products/fullrange-systems/frs-8-4-ohm ). As with amp modules, the ratings need some headroom or the reality will be poor sound.

As a last note, my fave USB soundcard is the £7 48 kHz S24_3LE stereo USB one, as it has specs that usually come at prosumer prices, plus a stereo ADC.

The Google Home/Echo really showcase economies of scale and what you can do with custom DSP microcontrollers backed by cutting-edge server farms; like for like, it’s an impossible ask.
With a different infrastructure, a central AI with distributed mics and WiFi audio, it may even be possible to do something better.

PS there is another card, although the above Edimax supposedly also supports S/PDIF. These also have an array of inputs and outputs: https://www.ebay.co.uk/itm/For-PC-USB-Channel-5-1-External-Optical-Digital-Sound-Card-Audio-Adapter/114498545473 S/PDIF capture is already digital and so should be lossless. (The Edimax is supposedly 96 kHz S24_3LE via S/PDIF.)
What you can do is use these as capture cards for items such as a TV and HiFi and then broadcast to a room WiFi audio system, so you share the cost of amp + speaker + WiFi.
(Snapcast is a great latency-adjusted, time-synchronised network audio system; I do suggest taking a look.)

Jesus Christ man, that is a LOT to digest. My hat’s off to you for this vast amount of information for me to understand and recap. I appreciate that.

I just wanted to give you a quick reply to thank you for your answer, although I have to say that a lot of it went over my head. Like they say, I understood some of these words. But this is actually how I learn and prefer to learn. Give me a starting point, something that I have to research further on my own. If I’m not able to recap this information, or at least some of it, in my own words, then I did not understand it.

Give me some time to do a bit of research on the items you mentioned in your post and I’ll come back to this thread.

Again, thank you!

There is quite a bit, but actually the easiest way to dip your toes in is to get a ReSpeaker 2-mic hat, put it on a Pi and play.
Probably not what I would recommend for actual use, but a cheap way to get your bearings.

I already started with this, but it was not easy. Raspberry Pi OS recently upgraded to kernel 5.10 and the drivers for both the Matrix and the ReSpeaker currently only work with 5.4. So I upgraded all packages, downgraded the kernel to 5.4 and then I was good to go. Well, that is, arecord worked, but not recording from the Rhasspy GUI. I tried the wake word just for fun and that worked. Still scratching my head about that, but in the end the wake word gets detected, but only if I sit directly next to the Creator and talk towards it. Word recognition works more or less from across the room, if nothing else is playing.

I’m still working out what the issue is and reading up on it here on the forum. Still, I wanted to see what the choices are in terms of hardware.

But since you mentioned it, is the respeaker 2 actually a better mic than the matrix creator for example? Now that I know how to get rhasspy up and running quickly, I’ll try it out as a satellite system.

The ReSpeaker drivers are fubar. Dump the mixer settings with amixer -cX contents > contents.txt, if I remember correctly.

Use that driver and amixer -cX cset (by id) so it’s set up as the ReSpeaker one was.

They are all multi-mic omnidirectional without beamforming, so just use one channel, which is a single mic!

GitHub seems to be down for me!!! Otherwise I would post a copy of the settings.

With -cX, X is the aplay -l index of your ReSpeaker.

Doh, I am having a memory lapse: it’s amixer controls or contents, but trying both will get you there.
Yep, contents.
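
To find the X for -cX without reading the aplay -l listing by eye, something like the sketch below might help. The card_index helper is made up for this example; it only assumes the usual "card N: id [name], device M: ..." layout of aplay -l output, and that the card line contains some recognisable text such as "wm8960".

```shell
#!/bin/sh
# Hypothetical helper: reads `aplay -l` output on stdin and prints the index of
# the first card whose line contains the given text (for example "wm8960").
card_index() {
  awk -v name="$1" '/^card [0-9]+:/ && index($0, name) { sub(/:$/, "", $2); print $2; exit }'
}

# On the Pi itself (needs the sound card present, so left commented out here):
#   X=$(aplay -l | card_index wm8960)
#   amixer -c "$X" contents > contents.txt
```

The resulting contents.txt lists every control with its numid, which is where the numid values for cset come from.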

My DNS has come back to life, so you can just paste this after install:

amixer -c "wm8960" cset numid=1 34,34
amixer -c "wm8960" cset numid=26 3
amixer -c "wm8960" cset numid=27 4
amixer -c "wm8960" cset numid=30 5
amixer -c "wm8960" cset numid=32 5
amixer -c "wm8960" cset numid=33 5
amixer -c "wm8960" cset numid=34 25
amixer -c "wm8960" cset numid=35 on
amixer -c "wm8960" cset numid=9 3
amixer -c "wm8960" cset numid=8 3
amixer -c "wm8960" cset numid=49 on
amixer -c "wm8960" cset numid=51 on
amixer -c "wm8960" cset numid=37 0
amixer -c "wm8960" cset numid=38 0
amixer -c "wm8960" cset numid=39 5
amixer -c "wm8960" cset numid=48 on
amixer -c "wm8960" cset numid=50 on
amixer -c "wm8960" cset numid=54 on
amixer -c "wm8960" cset numid=16 5
amixer -c "wm8960" cset numid=15 4
alsactl store

This driver doesn’t need kernel freezes (that’s just the ReSpeaker guff), but after install you need to set the basic defaults as above.

PS if using docker add the file share to the docker run command.

Thanks for the links and explanation @rolyan_trauts. Here are my findings so far:

  • The wm8960 drivers from the repository cannot be compiled directly with the current kernel. This is what worked for me:
  1. install Raspberry Pi OS on an SD card
  2. update the packages (sudo apt update && sudo apt upgrade)
  3. install kernel headers and git (sudo apt install raspberrypi-kernel-headers git)
  4. reboot
  5. download, patch and compile the drivers (steps 6 to 9):
  6. git clone https://github.com/pguyot/wm8960.git
  7. inside the cloned directory, replace the existing wm8960.c file with the patched one (wget -O wm8960.c https://raw.githubusercontent.com/pguyot/wm8960/363165814741a8df528d0be1e5960bef9bc6a4d8/wm8960.c)
  8. make
  9. sudo make install
  10. apply your amixer settings
  11. download and install Rhasspy
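
The package and driver steps above could be collected into a small script along these lines. Treat it as a sketch: the run wrapper, the DRY_RUN switch and the two function names are made up here, installing git via apt and passing -O to wget (so the patched file overwrites wm8960.c instead of landing next to it as wm8960.c.1) are my assumptions, and by default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the steps listed above. DRY_RUN defaults to 1 so the commands are
# only printed; set DRY_RUN=0 on the Pi to really run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Package update and headers; reboot by hand once this has finished.
prepare_system() {
  run sudo apt update
  run sudo apt upgrade -y
  run sudo apt install -y raspberrypi-kernel-headers git
}

# After the reboot: download, patch and compile the driver.
build_wm8960() {
  run git clone https://github.com/pguyot/wm8960.git
  run cd wm8960
  # -O makes wget overwrite wm8960.c rather than saving a second copy
  run wget -O wm8960.c https://raw.githubusercontent.com/pguyot/wm8960/363165814741a8df528d0be1e5960bef9bc6a4d8/wm8960.c
  run make
  run sudo make install
}

prepare_system
build_wm8960
```

On a real install you would run prepare_system, reboot, and only then run build_wm8960, followed by the amixer settings and the Rhasspy install.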

I’m using the ReSpeaker 2-mic hat with a Pi 4 (I did start with a Pi Zero, but boy is that one slow) and so far I have to say I’m not impressed, but I think there is room for improvement.

Tests in a relatively small room with no background noise were OK for Porcupine wake word detection: about a 75% hit rate on the first try, I’d say. For some reason, the actual STT detection seems to be better and less error-prone. If I use the wake word button in the GUI, I can stand about 4 meters away and give the command in a normal speaking voice. The wake word seems to need a louder voice in order to be picked up (I could check the debug log from across the room on my monitor).

I then placed the Pi approximately 60cm away from my laptop, where I played a song in a different language at reduced volume. Wake word detection was basically impossible this way. The actual command was recognized, but only after running into a timeout, which is the nature of the voice recognition beast I guess.

This confirms my findings with the matrix creator. Wake word detection is finicky. Is there a way to improve it, apart from the sensitivity? I played with the sensitivity and it doesn’t seem to pick up the wake word any better.

Not sure what to make of these findings, to be honest. It seems to be tricky even without background noise and in a “controlled” environment. I’m not getting my hopes up too much for a real-world scenario.

Not really sure with Porcupine, but it’s highly likely the audio is normalised, so level is less of an issue.
If your signal is really low, though, the noise gets boosted by the same gain when normalised, so a really low level can still be a problem.
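
A tiny worked example of that effect, with made-up numbers: normalising scales everything in the recording by the same gain, so the quieter the voice, the higher the noise ends up afterwards. The noise_after_normalise helper is hypothetical and assumes simple peak normalisation to 1.0, which may not be what Porcupine actually does internally.

```shell
#!/bin/sh
# Hypothetical helper: peak-normalise to 1.0 and report where the noise floor
# ends up. Arguments: voice peak, noise peak (both on a 0..1 scale).
noise_after_normalise() {
  awk -v voice="$1" -v noise="$2" 'BEGIN { printf "%.2f\n", noise / voice }'
}

noise_after_normalise 0.50 0.01   # close to the mic
noise_after_normalise 0.05 0.01   # across the room, same room noise
```

With the same room noise, the far recording ends up with ten times the noise level of the near one once both are normalised.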

The best thing to do is actually see what you’re getting for audio: arecord a WAV from set distances, load them up in Audacity and take a look and listen.
If you scroll through this thread it should give you some tips.
Apologies, I should have told you to install the kernel headers, but there is no need to be locked to a kernel.
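
For the recording test at set distances, a sketch like this prints one arecord command per distance. The device name plughw:1,0, the distances, the file names and the 16 kHz mono format are all assumptions for illustration; check arecord -l for your own card and adjust.

```shell
#!/bin/sh
# Sketch: one short WAV per distance so they can be compared in Audacity.
# DEVICE and the distances are assumptions; pick the card from `arecord -l`.
DEVICE="${DEVICE:-plughw:1,0}"

record_cmd() {
  # Prints the arecord command for one distance: 5 seconds, mono, 16-bit, 16 kHz.
  echo "arecord -D $DEVICE -f S16_LE -r 16000 -c 1 -d 5 test_${1}m.wav"
}

for metres in 1 2 4; do
  record_cmd "$metres"   # printed only; run the line by hand at each distance
done
```

Saying the same phrase at the same volume for each file makes the waveforms in Audacity directly comparable.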

With omnidirectional mics and none of the beamforming algorithms of a Google Home/Echo, 3rd-party noise will just flood the signal and nothing is in place to remove it.
It’s why I prefer unidirectional electrets: it’s not a huge amount, but they have natural built-in directionality (in effect beamforming in the direction you point them, plus noise reduction).

But 3rd-party noise is problematic for all of them once it reaches a certain ratio to the voice, especially without DSP algorithms.
If Rhasspy is playing the audio then we can set up Speex AEC, which does a good job: as we have that signal, we can cancel it from the mic.
With 3rd-party noise we have neither that signal nor the clock sync of having input and output on the same audio card.

Don’t know much about the matrix creator to be honest.

The VAD isn’t the greatest, so with 3rd-party noise it’s likely that it will not detect silence and will time out.