I am an audio snob, and the speaker on a speakerphone such as my Anker PowerConf doesn’t really cut it for a smart speaker.
I am one of those who listens to a Nest Mini or Echo Dot and can only think ‘Urgh!’
The standard Nest & Echo are a minimum for me as ‘Play some music’ is one of the rare occasions I do use a smart speaker.
We do have 2 excellent wireless audio apps: Squeezelite, as it will squeeze onto an ESP32, or my own fave, the almost limitlessly configurable Snapcast.
Mounting a small amp board onto the back of a bookshelf speaker can quickly surpass Nest and Echo devices.
By using Squeezelite/Snapcast you get great zonal audio and a platform that is pretty easy to add a microphone to.
I hacked together a 2 channel delay-sum beamformer (GitHub: StuartIanNaylor/2ch_delay_sum) that will extend far-field, but there is no signal to focus the beamformer on.
So it acts like a conference speakerphone: it will beam to whatever is the predominant noise and be constantly shifting focus.
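To make the "beams to whatever is loudest" behaviour concrete, here is a minimal pure-Python sketch of a 2 channel delay-sum. It is not the code from the repo above, just an illustration under my own assumptions: whole-sample delays only, a brute-force cross-correlation to find the lag, and hypothetical function names (`max_lag_samples`, `estimate_lag`, `delay_and_sum`). Because the lag comes from the strongest correlation peak, it locks onto the dominant source, voice or not.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature


def max_lag_samples(mic_spacing_m, sample_rate_hz):
    """Largest physically possible inter-mic delay, in whole samples."""
    return int(math.ceil(mic_spacing_m / SPEED_OF_SOUND * sample_rate_hz))


def estimate_lag(left, right, max_lag):
    """Return the lag (in samples) that maximises the cross-correlation.

    A positive result means `right` is a delayed copy of `left`.
    Only lags in [-max_lag, max_lag] are searched.
    """
    n = min(len(left), len(right))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_and_sum(left, right, lag):
    """Shift `right` by `lag` samples to line up with `left`, then average."""
    n = min(len(left), len(right))
    out = []
    for i in range(n):
        j = i + lag
        r = right[j] if 0 <= j < n else 0.0  # zero-pad past the edges
        out.append(0.5 * (left[i] + r))
    return out
```

In use you would call `estimate_lag` on each short block of audio and feed the result to `delay_and_sum`. With an assumed mic spacing of 75 mm at 16 kHz, `max_lag_samples(0.075, 16000)` limits the search to 4 samples either side, which is why this is cheap enough for small hardware. What it lacks is exactly what the post is about: nothing tells it which source is the voice.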
Methods of targeting a voice or focusing a beam are totally absent from the opensource we have, even though there is code and there are solutions available.
Speakerphones do have AEC (acoustic echo cancellation), but we also have opensource for that, even if it is not implemented.
A bit of lateral thought can provide huge improvements and remove the need for AEC: just don’t mount your mic in the same enclosure as your speaker.
A wired microphone can be small and very discreet, even a dual mic, and you might have more than just one in a room to give much better coverage.
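As a rough idea of what "more than one in a room" could mean in software, here is a crude sketch that just picks whichever mic heard the current frame loudest. This is my own assumption of one possible approach, with hypothetical names (`rms`, `pick_best_mic`, and the mic labels); a real system would probably want something smarter than raw level, since the loudest mic may simply be the one nearest a noise source.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def pick_best_mic(frames_by_mic):
    """Return the name of the mic whose frame has the highest RMS level.

    `frames_by_mic` maps a (hypothetical) mic name to the samples that
    mic captured for the same time window.
    """
    return max(frames_by_mic, key=lambda name: rms(frames_by_mic[name]))
```

For example, `pick_best_mic({"desk": [0.1, -0.1], "wall": [0.5, -0.4]})` selects `"wall"`.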
Initially I expected ‘Ears’ to be 2x mics on an ESP32-S32 in a small flat panel approx 75mm in width that clips onto a wall or stands on a desk.
It still needs to be powered and may contain a pixel bar/ring and even have audio out.
That is where I don’t see much of a problem with wired mics either, as even wireless network mics still need a PSU, so there is really little difference or advantage.
I think what an open source smart speaker is needs some lateral thinking, as with things like Squeezelite or Snapcast, opensource can give Sonos-like zonal audio with a huge array of choice.
That zonal audio, where Home Assistant can excel and beat commercial systems on functionality, quality and choice, is a huge selling point that is likely under-implemented and undersold.
For some reason, whilst trying to escape Big Data ‘Smart Speakers’, opensource has been blinkered and tries to copy individual consumer products verbatim.
I don’t even think we need a ‘Smart Speaker’, just a zonal opensource microphone system that takes apart the commercial notion of a smart speaker into its working components of wireless audio, microphone and pixel indicators to give choice.
If someone wants to build that in a box, they can, but a modern room could be a very different scenario with a single large pixel indicator, wireless room audio and several dispersed microphones.
I guess it’s where you want to go, but HA in terms of smart controls and dashboards represents some of the most cutting-edge smart home tech, whilst USB speakerphones hanging out of mini computers are not much above the Raspberry Pi Google AIY Voice Kit…
There is also other commercial equipment that can fuse functions to become more cost-effective as a device: my Anker C300 dual mic webcam is great for audio and its far-field isn’t bad.
It can provide Frigate video, be a wireless room mic array and be used in conjunction with a wireless audio system.
I also tried some of the bi-directional audio pan-tilt cams you can get but found on the ones I got the mic audio is pretty awful.
You can still use the Respeaker 2mic, which eventually always catches up to the latest release. Though again, with some lateral thought, USB devices are far more compatible and convenient than Pi hats with their limited application.
There still are Pi microphone hats, but really only the 2mic is of any use, while there is a whole range of USB devices that work on many platforms, and that range could probably do with an HA mic for those who don’t want to build one.
The hardware and software all exist; it’s just not implemented as a system and so ends up excluded.