2023 - Year of Voice

This is my first post and hopefully it’s in the right place.

With the changes coming in v3, is something like the ESP32-S3-BOX-LITE a good option for satellite hardware?

I have a dedicated server room with a TrueNAS Scale server that has tons of compute (AMD EPYC, as well as Tesla cards for acceleration) that can be used to run the main Rhasspy app.

I’m not sure what kind of hardware is best for satellite nodes in each room though. My house currently has 10 Google Home devices. Some are low-end Minis while others are dual-speaker setups for music.

With hardware becoming more affordable, I’m curious about what will end up being the best hardware for 2023.

As are we all :frowning:

Google and Amazon have put lots of money into developing their own excellent hardware devices - which are locked to their cloud services.

Raspberry Pi with the reSpeaker 2-mic HAT seems to have been the most popular non-proprietary option - not because it is particularly wonderful, but because they were cheap, readily available and easy to program. Then came COVID, and chip shortages :sob: There is also the fact that (despite having 2 mics in hardware) neither the driver nor Rhasspy provides any of the Digital Signal Processing (DSP) needed to take advantage of both mics on the reSpeakers.
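For illustration, the missing DSP isn't exotic - below is a toy delay-and-sum beamformer for a 2-mic array. It's a minimal sketch assuming a 16 kHz stereo capture; the mic spacing and steering angle are made-up values, not the HAT's actual spec.

```python
# Toy 2-mic delay-and-sum beamformer (numpy only) - a sketch of the kind
# of DSP the reSpeaker driver doesn't provide. Spacing/angle are assumed.
import numpy as np

SAMPLE_RATE = 16000
MIC_SPACING_M = 0.05      # assumed distance between the two mics
SPEED_OF_SOUND = 343.0    # m/s

def _delay(x: np.ndarray, n: int) -> np.ndarray:
    """Shift x by n samples (n may be negative), zero-padding the gap."""
    out = np.zeros_like(x)
    if n > 0:
        out[n:] = x[:-n]
    elif n < 0:
        out[:n] = x[-n:]
    else:
        out[:] = x
    return out

def delay_and_sum(stereo: np.ndarray, steer_deg: float) -> np.ndarray:
    """Align the right channel to a look direction and average the pair."""
    left = stereo[:, 0].astype(np.float64)
    right = stereo[:, 1].astype(np.float64)
    # Time difference of arrival for the chosen look direction.
    tdoa = MIC_SPACING_M * np.sin(np.radians(steer_deg)) / SPEED_OF_SOUND
    shift = int(round(tdoa * SAMPLE_RATE))  # whole-sample delay, for simplicity
    return 0.5 * (left + _delay(right, shift))

# Usage: steer 30 degrees off broadside on one second of fake stereo audio.
mono = delay_and_sum(np.random.randn(SAMPLE_RATE, 2), steer_deg=30.0)
```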

I know that some forum members are already using S3 devices as satellites. I understand that the ESP32-S3 will be particularly suitable, with some AI capability built in; and up in message #17 above @synesthesiam suggested that the ESP32-LyraT could be a go. It will of course be very dependent on getting the software. My fingers are still crossed.

As for Rhasspy v3, Michael hasn’t let anything slip, so far.

Hello all. I am new here but have extensive experience with Raspberry Pis and, separately, with voice interfaces. I am planning to use Home Assistant with voice this year and perhaps can contribute to the documentation by sharing my “journey”? I’d propose to start by writing a diary of what I am trying to do and what I discover, and then discuss my diary with someone who is taking responsibility for documentation. What I am trying to do is not an uncommon starting point, but I know that it does not fit comfortably with the wake-word/intent model made popular by the Echo.

The standard ESP32, which the LyraT uses, is really tight on which models you can run, and that is what the ESP32-S3 is about: its LX7 microcontroller is very similar apart from the additional vector instructions, which boost ML close to 10x over a LyraT's ESP32.

Espressif often make demo boards rather than dev boards, where they throw in the kitchen sink and everything else as a demo of what you could do, rather than something of real practical use.
The LyraT has all the audio codecs, even mics on board and 2x 3-watt speaker outputs, but it's a demo: due to positioning and spec they are not that useful, and very toy-like.

Even the ESP32-Box-Lite is a demo and toy-like; again they have thrown in the kitchen sink of KWS, audio processing, ASR, TTS & screen, all on one ESP32-S3, as a demo that gives you something to start with and hack away at, where a custom board or standard dev kits and modules would likely be more appropriate.
The ESP32-Box-Lite does a huge amount, not very well, as a demo of potential load; if you made a specific KWS you'd have the 2x I2S ports of the S3 and those vector instructions to make something really low-cost and modular that can run fewer, but better, models.

Espressif have created some interesting audio-processing libs and also ML, where the demo products (LyraT & ESP32-Box-Lite) are a great dev resource but likely not an end product; it's only the S3 that is capable of running the more advanced libs, but as a product you probably wouldn't want what the ESP32-Box-Lite is.

Also those demo boards are not that well priced; if we were able to get the Pi02W, they likely would not get a look in.
But if the community was up for it, a custom S3 KWS that is far more capable than the resource-constrained one in the ESP32-Box-Lite could likely hit the $10 mark if you get the quantities, as an ADC & mics (or I2S mics) is all that is needed; but then there is the dev curve, where the board design is likely the easier part.

When you first mentioned the ESP32-S3-Box I was impressed at their demo - but noted that it is a demo, not a consumer product. But that is the nature of the beast. Neither Seeed nor Espressif is in the business of selling direct to end users - they require system integrators to add value and marketing.

Espressif are at least showing that they can put the hardware combo we want onto a small board for a good price - we just need to add the software - communications, wake word detection, and as much DSP as is available and fits in memory - and a pretty box with speaker and power. Sounds simple, but way over my head technically :frowning:

And that was behind my asking @synesthesiam about discussions at Nabu Casa. Without cheap “ears”, users cannot be expected to move from the cloud-based solutions to any local voice assistant. I was hoping that with the demand for low-cost “ears” and ESPHome as a base, maybe Nabu Casa might take the plunge and develop an S3 into an end-user product - as they already have with Blue and Yellow. I think it's almost inevitable, but will take time.

The ESP32-S3-Box is really impressive, as they have all that running on what is a microcontroller! It's quite an accomplishment, but in use it's not going to be the next Alexa.

I have been meaning to take the plunge, as I don't know how effective the audio-processing chain is, or whether it would provide much better results to run purely as a KWS, apportioning more resources to a model.
Espressif do have:

Espressif Audio Front-End (AFE) integrates AEC (Acoustic Echo Cancellation), VAD (Voice Activity Detection), BSS (Blind Source Separation) and NS (Noise Suppression).

It's part of their GitHub - espressif/esp-skainet: Espressif intelligent voice assistant, but I find the binary blobs a bit off-putting, as I'm not so sure how much you can hack around, tweak & configure; I keep meaning to, but haven't progressed much further than that.
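For anyone wanting to experiment without the blobs, the VAD stage at least has a fully open stand-in in the webrtcvad package - a minimal sketch, assuming 16-bit 16 kHz mono PCM (AEC/BSS/NS would each need their own open components):

```python
# Minimal open-source VAD gate using webrtcvad (pip install webrtcvad),
# standing in for the VAD part of Espressif's closed AFE. Assumes 16-bit
# 16 kHz mono PCM; WebRTC VAD only accepts 10/20/30 ms frames.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 20
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 2 bytes per sample

def voiced_frames(pcm: bytes, aggressiveness: int = 2):
    """Yield only the 20 ms frames the VAD flags as speech."""
    vad = webrtcvad.Vad(aggressiveness)   # 0 = permissive .. 3 = aggressive
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame

# Usage sketch: feed raw PCM and count the voiced frames.
print(sum(1 for _ in voiced_frames(b"\x00" * FRAME_BYTES * 10)))
```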

There are I2S ADC stereo modules for a few $ (https://www.aliexpress.us/item/2251832644497273.html) that could be used with any ESP32-S3 dev kit (https://www.aliexpress.us/item/2255800678628772.html).

So you don't have to go full custom. One is more suited to the job than the other in terms of an easy interface, but as per usual I forget which, and would have to scratch my head over the datasheets once more.

I went through several learning curves getting my Home Assistant / Rhasspy satellites + base setup working, and tried to document as I went. At 30 pages / 1 MB of tutorial I started to think a different approach might be warranted. I am happy to share my document, especially with someone with fresh eyes - but email is probably better than posting here :wink:

I think the problem is mostly that (like all FOSS projects) documentation is written by developers for other developers, and so is rather technical. By stating that “Rhasspy is intended for savvy amateurs or advanced users that want to have a private voice interface to their chosen home automation software”, they conveniently avoid mere “users” - despite non-technical people also wanting a private voice interface.

Honestly I think Michael's Rhasspy documentation is better than Home Assistant's; and I am hoping that (when he comes out of his cave developing Rhasspy v3) Michael will allow some of us users to get involved in developing a proper user-oriented website for the Rhasspy project.


Hi @Petr, welcome :wave:
I’d be curious to hear about what you’re intending to do.

Thanks! Not much to report on v3 right now. Most of my time is going towards the intent recognition for Home Assistant. Fortunately, that will be reusable for Rhasspy too.

As I work on v3, I’m doing my best to keep “mere users” in mind :slight_smile:
Automatic installation of the configured tools is a tough nut to crack. Rhasspy v2’s solution was to try and pack it all (pre-built) into the Docker image. I’d prefer not to do this as the image just gets bigger and more difficult to build over time. I’ve had some success building self-contained binary packages with guix, but it remains to be seen if the ML runtimes can be folded in.

I’m planning to order one of the non-lite versions to test. From what I’ve read, the lite doesn’t support acoustic echo cancellation (AEC), so it would have a harder time hearing you if it’s playing audio.
But the ESP32-S3 is definitely the current target for satellite devices. The big questions I have are:

  1. Can we use our own wake word system with Espressif’s audio framework? (you can’t train a custom wake word with their framework without paying them)
  2. How well does their audio processing work in practice?
  3. How much RAM does their framework take with speech to text turned off?

If the Pi02W can get back down to a reasonable price, I’d much rather go with that. We’d have to roll our own I2S in that though, wouldn’t we?

So would I; at $15 for 4-core perf it really has no competitor, and it's my fave Raspberry product, so I have been tuning in for any Raspberry mention, of which sadly there has been a lack. It could even be 2024 before we see stock.
I still have 2 on order with Farnell; the “available to manufacturer” lead time of 373 days sort of backs up the sad state of affairs.

The company has been prioritizing its commercial customers, with the 100,000 units for enthusiasts containing “Zero W, 3A+ and the 2GB and 4GB variants of Raspberry Pi 4”.

The Pi02W has had zero mention for a while. There is the 3A+, which really is near the same but $10+ more. The original Zero has had a price hike to $10 and the ZeroW is $15, which likely means there isn't a chance of seeing a Pi02W for $15; for now they seem to have shelved it.

We have the 2-mic hats, and Plugable do a stereo ADC USB sound card if you want to wire up a pair of mics.
There are fakes; with my luck I have one on my desk, next to the one I purchased direct (Plugable USB Audio Adapter – Plugable Technologies).

I think one of the above I2S ADCs also works as a standard slave, just like the Adafruit I2S mic driver, but with only a single available I2S port on a Pi the 2-mic or USB options probably have preference.
If anyone can confirm whether the other reSpeakers still have the random-channel issue due to TDM-mode sync problems, there are those too; generally the USB versions act like conference mics rather than a smart-assistant-targeted voice solution. I have the 4- and 6-mic hats on my desk and just despair at the driver status, which seems to break on each kernel update.

As said, the Pi02W is my fave product, but as well as the stock problem, when it comes to ML Raspberry's top-end Pi4 is landing short of a sweet spot across quite a varied range of models from all aspects of KWS, ASR, NLU and TTS. Their long-term partner Broadcom is trying desperately to acquire IP after the fallout of being the biggest backer of the Nvidia takeover, and what seems more of a fallout with RS, who used to manufacture under licence and has switched to Rockchip.
Raspberry at the moment is a complete no-man's land of if and when; the Pi02W has had zero mention, and you have chipsets like the RK3588 kicking Raspberry's butt.
The OrangePi5 I recently got delivered for £86 (4 GB): the CPU alone runs ML at 4x Pi4 speed, and that doesn't even include the Mali G610, which with ArmNN has about 90% of the ML perf of the CPU; and we still haven't mentioned its 3-core 2-TOPS NPU. All in all, if it was fully utilised, maybe a possible 20x ML boost over a Pi4 - a gap that is quite huge irrespective of fan base.
Also, slowly we are seeing some very capable second-user hardware come down in price, where state-of-the-art home AI likely will be very much a thing.

I would probably be more likely to say there is a better chance for the community to provide models that can run on an ESP32-S3 via a model zoo for KWS, with bought-in hardware such as a ZL38042LDF1 from Microsemi; for me that solves the Espressif blobs problem, as the models can run open source via TensorFlow Lite Micro.
If you were going to build something custom, the total solution would likely be around the price of the equivalent Pi HAT, with reference designs and dev kits to clone such as ESP32-LyraTD-MSC Overview | Espressif Systems, where open source can be both fabbed and distributed via the likes of Seeed, as an S3 is pretty much a drop-in replacement for the lower ESP32.

That's why the 2-mic HAT or USB sound card: maybe not optimal, but available. And I have working beamforming code, just no Pis available to purchase - or at least not the one I would prefer, the Pi02W.

The R in Rhasspy is currently both restrictive and not really available, and I have honestly been wondering whether it is still viable, as I am stuck for clear-cut solutions.

I can half answer that: their wake word system is commercial, but TensorFlow Lite Micro (TFLite Micro) is available on the ESP32. It just supports fewer layer types - most notably missing are recurrent layers such as GRU or LSTM - but a CNN or DS-CNN model should run quite well on an ESP32-S3.
So you don't use the Espressif KWS, which seems extremely heavily quantised to fit anyway (and custom wakewords are pay-for, unless you're prepared to run with 'Hi ESP'), and run TFLite Micro instead, alongside their Audio Front End SR.
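To make that concrete, here is a hedged sketch of the kind of DS-CNN a TFLite Micro build for the S3 could run - the layer sizes, the 49x10 MFCC input shape and the label count are illustrative assumptions, not a tuned model:

```python
# Sketch of a DS-CNN keyword model built only from layers TFLite Micro
# supports (Conv2D / DepthwiseConv2D - no GRU/LSTM), then quantised.
import tensorflow as tf

NUM_KEYWORDS = 4   # e.g. wakeword, "stop", "on", "off" - assumed labels

def build_ds_cnn(input_shape=(49, 10, 1)) -> tf.keras.Model:
    inp = tf.keras.Input(shape=input_shape)        # MFCC frames x coeffs
    x = tf.keras.layers.Conv2D(64, (10, 4), strides=(2, 2), padding="same",
                               activation="relu")(inp)
    for _ in range(3):  # depthwise-separable blocks, the "DS" in DS-CNN
        x = tf.keras.layers.DepthwiseConv2D((3, 3), padding="same",
                                            activation="relu")(x)
        x = tf.keras.layers.Conv2D(64, (1, 1), activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    out = tf.keras.layers.Dense(NUM_KEYWORDS, activation="softmax")(x)
    return tf.keras.Model(inp, out)

model = build_ds_cnn()
# Quantise for the microcontroller; full int8 would also need a
# representative dataset, which is omitted in this sketch.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
open("kws_dscnn.tflite", "wb").write(converter.convert())
```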

The ‘Lite’ version has a 2-channel ADC rather than the 4-channel of the ‘non-Lite’, and this is why AEC doesn't work: a 3rd channel is used as a hardware loopback, fed from the DAC, as the AEC reference channel.
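To illustrate why that loopback matters, here's a toy normalised-LMS echo canceller in numpy - a sketch of the principle, not Espressif's actual implementation. Without the reference channel there is simply nothing for the filter to subtract.

```python
# Toy NLMS acoustic echo canceller: the far-end reference (the DAC
# loopback channel) lets an adaptive filter estimate and remove the echo.
import numpy as np

def nlms_aec(mic: np.ndarray, ref: np.ndarray,
             taps: int = 256, mu: float = 0.1, eps: float = 1e-8) -> np.ndarray:
    """Return the mic signal with an adaptive echo estimate subtracted."""
    w = np.zeros(taps)                     # adaptive FIR filter weights
    buf = np.zeros(taps)                   # last `taps` reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_est = w @ buf                 # predicted echo from the reference
        e = mic[n] - echo_est              # error = hopefully just the voice
        w += (mu / (buf @ buf + eps)) * e * buf   # normalised LMS update
        out[n] = e
    return out

# Usage sketch: strip a scaled copy of the reference out of the mic signal.
ref = np.random.randn(16000)
mic = 0.7 * ref + 0.01 * np.random.randn(16000)   # echo + faint voice
clean = nlms_aec(mic, ref)
```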

I still think there is a misnomer here in the term 'satellite', and the community is being blindsided by what is envisaged as commercial units: there is no such thing as magic AEC, as the construction of smart speakers uses some really clever structural methods to isolate the microphones from the speaker to give AEC a chance.
Have a look at the Google Nest Audio teardown.
I actually have an ESP32-S3-Box & ESP32-S3-Box-Lite, and the AEC didn't seem to be that great; I think it's just a port of Speex-DSP, which attenuates but doesn't cancel, and the supplied plastic case is nice but actually acts like a resonant box, guitar-like.

There is some simple lateral thought needed here: just don't stick your mics and speakers in the same box unless you have the resources of big data. Instead, create more friendly, reusable maker products via separates, such as an active wireless speaker and a wireless mic/KWS.
By doing that you give yourself a huge advantage, as the magnitude of the SNR of the resonance through a single case is absolutely huge; I keep desperately trying to explain that we shouldn't tunnel-vision on a 'satellite'.

You can run both client & server audio systems - be it LMS, Snapcast or AirPlay and their clients - and it will run quite happily on a Pi3/4 with Rhasspy, in a box with an amp or speaker, or (my personal favourite for ease) in free air, screwed to the back of a second-user bookshelf speaker(s).

There are so many advantages to employing a modern wireless audio system, where the maker space can actually compete: hook up a home to give that vital reference signal, and there are some really cool solutions that work…

The best people to talk to would be Phillipe & Sebastion from GitHub - sle118/squeezelite-esp32: ESP32 Music streaming based on Squeezelite, with support for multi-room sync, AirPlay, Bluetooth, Hardware buttons, display and more

So my understanding is that the usual way to do this is a microphone → VAD → WakeWord detection → Intent Identification → dialog management / action.

The challenge is that when dealing with multiple rooms there are multiple microphones, and, assuming there is just one voice server, does each microphone get its own VAD and WWD (and possibly intent identification), or does this stuff go on the server? A “satellite” is a mic or two and some mix of the rest, to go in a room. A rough skeleton of the flow I have in mind is sketched below.
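(The stage functions here are trivial stand-ins so the control flow runs end to end - not any particular library's API.)

```python
# Skeleton of mic -> VAD -> wakeword -> ASR -> intent, with stub stages.
def frames_from_mic():
    # Stand-in audio capture: a wakeword, some speech, then silence.
    yield from [b"WAKE", b"turn ", b"on ", b"the light", b""]

def wakeword_detected(frame: bytes) -> bool:
    return frame == b"WAKE"          # stand-in KWS

def is_speech(frame: bytes) -> bool:
    return frame != b""              # stand-in VAD

def transcribe(audio: bytes) -> str:
    return audio.decode()            # stand-in ASR

def recognize_intent(text: str) -> dict:
    return {"intent": "TurnOn", "text": text}   # stand-in NLU

command, listening = b"", False
for frame in frames_from_mic():
    if not listening:
        listening = wakeword_detected(frame)   # only KWS runs continuously
    elif is_speech(frame):
        command += frame                       # buffer the spoken command
    else:
        # Silence after speech: the command is complete, handle it.
        print(recognize_intent(transcribe(command)))
        command, listening = b"", False
```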

I’d like to try out an “always listening” system (hence no cloud) with a large(ish) set of wakewords, and dynamic intents. I am also curious about how a speech recognition “dictionary” is trained.

There isn't really a set way, Petr, as it's very dependent on hardware: that VAD could be before or after the wakeword, or higher up in the system.
Also you could have microphones where beamforming and blind source separation help with far-field reverberation and signal extraction, but that is something often lacking.

Usually it's more convenient to train and utilise a single wakeword that is “always listening” purely for that wakeword, then on trigger transmit a command sentence to the ASR, even if it's only “Stop”.
Some ASRs have a phonetic lexicon as a “dictionary”, whilst later models tend to do a beam search and use sentence context, like OpenAI's Whisper: occasionally it will get things totally wrong, but the sentence will still be logical, and overall it tends to be more accurate with sentences.

There are a whole manner of ways you can do it, but with phonetic lexicons, dictionary sparsity can lead to more accuracy, and that is why I think Wakeword → Skill Router → Skill ASR could be more scalable.
It's more load and latency that way, but the Skill Router is a predicate ASR looking for “Play”, “Show”, “Who”, “Turn”, “Set” that may pass both intent and wav on to a secondary Skill Server, to partition subjects. Some could cover many controls (“Light”, “Heater”), whereas certain skill types, like an audio player, can quickly rack up a huge subject dictionary - so split them into skill servers, which could be a mixture of lexicon and context types, suited to and trained for a certain skill type (see the sketch below).
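A rough sketch of how such a router could look - the predicate table and endpoint URLs are made-up placeholders, just to show the partitioning idea:

```python
# Toy skill router: key on the predicate word the router ASR recognised
# and forward the raw wav to that skill's own ASR. URLs are invented.
SKILL_SERVERS = {
    "play": "http://media-skill.local/asr",   # huge artist/track dictionary
    "show": "http://media-skill.local/asr",
    "who":  "http://info-skill.local/asr",
    "turn": "http://home-skill.local/asr",    # small controls dictionary
    "set":  "http://home-skill.local/asr",
}
DEFAULT_ASR = "http://general-skill.local/asr"

def route(predicate: str, wav: bytes) -> tuple[str, bytes]:
    """Pick the skill ASR for a recognised predicate, else a general one."""
    return SKILL_SERVERS.get(predicate.lower(), DEFAULT_ASR), wav

# Usage: "Play ..." goes to the media skill's dedicated ASR.
url, payload = route("Play", b"\x00\x01")
print(url)
```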

wav2vec is an approximately 'phonetic' type and can be fairly appalling without dictionary backup, but compared to Whisper it is extremely performant.

It is a good example of where the addition of a language model greatly reduces spelling and other errors; a specific skill ASR could maybe even be more accurate than Whisper, and far faster or lighter.
It could even be possible to create a language model from a multi-voice TTS that speaks the band names & track names of an audio media skill - things that can be really hard to recognise with a general model. You could train a specific subject-content model, just as you could for only the controls you have.

You can compare them yourself, as both repos are super easy installs, especially whisper.cpp.
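A quick way to do that side by side from Python (assuming pip installs of openai-whisper and transformers, and your own 16 kHz test clip in place of the placeholder filename):

```python
# Run the same clip through Whisper and a wav2vec2 CTC model and compare
# transcripts. "clip.wav" is a placeholder for your own test recording.
import whisper                       # pip install openai-whisper
from transformers import pipeline    # pip install transformers torch

CLIP = "clip.wav"

whisper_text = whisper.load_model("base.en").transcribe(CLIP)["text"]

wav2vec2 = pipeline("automatic-speech-recognition",
                    model="facebook/wav2vec2-base-960h")
wav2vec2_text = wav2vec2(CLIP)["text"]

print("whisper :", whisper_text)
print("wav2vec2:", wav2vec2_text)
```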

As rolyan says, there is no set way. Rhasspy was intended as a toolkit and framework, so we can mix and match and use the bits we want. They recognised 6 stages in processing a voice command, and provide multiple options at each stage.

A while back Rhasspy was refactored to support Satellite + Base operation - multiple satellites with microphones around the house + a more powerful central computer doing the heavier processing tasks - and leaving it up to the user to decide what stage is performed where.

The most popular configuration seems to be a Raspberry Pi Zero with microphone and speaker performing the audio recording and wakeword detection (so all sound isn't constantly being streamed over the LAN), plus audio output for any response to the user. The speech-to-text, intent recognition, and text-to-speech all require more processing power and are more suitable for a more powerful computer (a RasPi 4 works fine, but a used PC is better).

But the framework is just as effective running all the stages on one computer, or any combination, as @romkabouter explained recently here.

Unfortunately the documentation appears to have had Satellite + Base bolted on, making the important information rather hard to find :frowning:

@donburch I think you might find the single-core 32-bit Zero even a struggle for KWS. I was rubbing my hands with glee when the Pi02W came out @ $15; there is still the more expensive Pi3A+, which does seem to have stock, or, if you can, a second-user Pi3 of any denomination.

PS: with it being the Year of the Voice, has anyone else got any benchmarks / articles appraising last year's best?

I have a habit of going on Hugging Face from time to time and seeing what the most downloaded models are for a certain application type.

So the key here is that satellites are in another room with a microphone. I'd suggest that this is NOT mentioned in the “getting started” section of the manual, but is a separate use-case with a separate entry. Of course it wants to be easy to do, so the “getting started” set-up should have clearly defined “modules” that can be taken off the voice server and put on a satellite machine. Similarly, multiple microphones for beamforming should be an extension-style section of the manual. Finally, given the popularity of Pis, I'd suggest the getting-started part of the manual is all about a Pi3B. The mic-in-another-room(s) bit uses the Pi Zero 2W - even if they are hard to get - because the migration of modules should be easy. And I'd suggest the beamforming part uses the reSpeaker HAT for a Pi, because they will have an interest in getting it working, with flashing lights and so on.

Dunno about Pi stock, Petr; even if they are faves, it probably deserves a discussion of its own, so I created one.

Then…

Alas, while Rhasspy was re-factored to support the base+satellite model, I am of the opinion that the documentation was “bolted on”. I did find all the information in the documentation … somewhere … after at least 3 reads through the whole thing. The first mention of base+satellite is tucked away down in the Tutorials section, and even there it isn't fully explained (e.g. both approaches show Intent Handling as “disabled”).

Unfortunately I think at the time it would have been a major exercise to restructure the documentation to better reflect the base+satellite approach while still satisfying the existing users. Michael also strikes me as an excellent developer, and I doubt he considers documentation fun, or even a core competency.

And now…

Well, I made notes and tried building them into a tutorial for my combination (RasPi ZeroW satellite + base running HAOS with the Rhasspy add-on), and at 30 pages started feeling that there is probably a better approach than one huge document … such as a website (allowing links to audio troubleshooting tips without bogging down the main flow). The other thing is that Rhasspy is more a toolkit, with many ways it can be used … so really wanting multiple tutorials … and it would be great if they could be fitted into one framework.

It looks now as though 90% of users have adopted the base+satellite model, so it is easy to justify re-doing the official documentation; and with lots of new users, maybe all that good technical information can be moved a bit towards the back?

Except that with v3 in the pipeline, is there much point?


BTW, a few days back I tried sending you a private message offering to email my 30-page tutorial. Did you receive it?

Sorry Don, I am travelling at the moment and working off a phone and shared machines. I would love to see your 30-pager, but can I leave it another 2 weeks, when I will have a real machine and more time?

Of course!

There seem to be several new users asking for help setting up, so maybe I should buckle down and finish my tutorial now.
