2023 - Year of Voice

When you first mentioned the ESP32-S3-Box I was impressed by their demo - but noted that it is a demo, not a consumer product. But that is the nature of the beast. Neither Seeed nor Espressif are in the business of selling direct to end users - they require system integrators to add value and marketing.

Espressif are at least showing that they can put the hardware combo we want onto a small board for a good price - we just need to add the software - communications, wake word detection, and as much DSP as is available and fits in memory - and a pretty box with speaker and power. Sounds simple, but way over my head technically :frowning:

And that was behind my asking @synesthesiam about discussions at Nabu Casa. Without cheap “ears”, users cannot be expected to move from the cloud-based solutions to any local voice assistant. I was hoping that with the demand for low-cost “ears” and ESPHome as a base, maybe Nabu Casa might take the plunge to develop an S3 into an end-user product - as they already have with Blue and Yellow. I think it’s almost inevitable, but it will take time.

The ESP32-S3-Box is really impressive, as they have all that running on what is a microcontroller! It’s quite an accomplishment, but in use it’s not going to be the next Alexa.

I have been meaning to take the plunge, as I don’t know how effective the audio processing chain is, or whether it would provide much better results compared to using it purely for KWS and apportioning more resources to a model.
Espressif do have:

Espressif Audio Front-End (AFE) integrates AEC (Acoustic Echo Cancellation), VAD (Voice Activity Detection), BSS (Blind Source Separation) and NS (Noise Suppression).

It is part of their GitHub - espressif/esp-skainet: Espressif intelligent voice assistant, but I find the binary blobs a bit off-putting, as I’m not sure how much you can hack around, tweak and configure. I keep meaning to dig further, but haven’t progressed much beyond that.

There are I2S ADC stereo modules for a few dollars ( https://www.aliexpress.us/item/2251832644497273.html ) that could be used with any ESP32-S3 dev kit ( https://www.aliexpress.us/item/2255800678628772.html ).

So you don’t have to go full custom. One is more suited to the job than the other in terms of an easy interface, but as usual I forget which one and would have to scratch my head over the datasheets once more.
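If you just want to sanity-check one of those ADC boards on an S3 dev kit, a minimal MicroPython read loop is roughly the following sketch; the pin numbers and the 16 kHz rate are only assumptions for illustration, so check the module’s datasheet for the actual wiring.

```python
# Minimal MicroPython sketch for reading an external I2S ADC on an ESP32-S3.
# The sck/ws/sd pins below are assumptions - adjust to match your wiring.
from machine import I2S, Pin

audio_in = I2S(
    0,                      # I2S peripheral id
    sck=Pin(4),             # bit clock (assumed pin)
    ws=Pin(5),              # word/LR clock (assumed pin)
    sd=Pin(6),              # serial data in (assumed pin)
    mode=I2S.RX,
    bits=16,
    format=I2S.STEREO,
    rate=16000,             # 16 kHz is typical for voice pipelines
    ibuf=20000,             # internal DMA buffer size in bytes
)

buf = bytearray(4096)       # raw interleaved L/R 16-bit samples
while True:
    num_bytes = audio_in.readinto(buf)   # blocks until the buffer is filled
    # buf[:num_bytes] now holds PCM audio to feed to KWS / streaming code
```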

I went through several learning curves getting my Home Assistant / Rhasspy Satellites + Base setup working, and tried to document as I went. At 30 pages / 1MB of tutorial I started to think a different approach might be warranted. I am happy to share my document, especially with someone with fresh eyes - but email is probably better than posting it here :wink:

I think the problem is mostly that (like all FOSS projects) documentation is written by developers for other developers, and so is rather technical. By stating that “Rhasspy is intended for savvy amateurs or advanced users that want to have a private voice interface to their chosen home automation software” they conveniently avoid mere “users”, despite non-technical people also wanting a private voice interface.

Honestly I think Michael’s Rhasspy documentation is better than Home Assistant’s; and I am hoping that (when he comes out of his cave developing Rhasspy v3) Michael will allow some of us users to get involved in developing a proper user-oriented website for the Rhasspy project.


Hi @Petr, welcome :wave:
I’d be curious to hear about what you’re intending to do.

Thanks! Not much to report on v3 right now. Most of my time is going towards the intent recognition for Home Assistant. Fortunately, that will be reusable for Rhasspy too.

As I work on v3, I’m doing my best to keep “mere users” in mind :slight_smile:
Automatic installation of the configured tools is a tough nut to crack. Rhasspy v2’s solution was to try and pack it all (pre-built) into the Docker image. I’d prefer not to do this as the image just gets bigger and more difficult to build over time. I’ve had some success building self-contained binary packages with guix, but it remains to be seen if the ML runtimes can be folded in.

I’m planning to order one of the non-lite versions to test. From what I’ve read, the lite doesn’t support acoustic echo cancellation (AEC), so it would have a harder time hearing you if it’s playing audio.
But the ESP32-S3 is definitely the current target for satellite devices. The big questions I have are:

  1. Can we use our own wake word system with Espressif’s audio framework? (you can’t train a custom wake word with their framework without paying them)
  2. How well does their audio processing work in practice?
  3. How much RAM does their framework take with speech to text turned off?

If the Pi02W can get back down to a reasonable price, I’d much rather go with that. We’d have to roll our own I2S in that though, wouldn’t we?

So would I, as at $15 for 4-core performance it really has no competitor and it’s my favourite Raspberry Pi product, so I have been tuning in for any Raspberry Pi mention, of which there has sadly been a lack; it could even be 2024 before we see stock.
I still have x2 on order with Farnell; the quoted ‘available to manufacturer’ lead time of 373 days sort of backs up the sad state of affairs.

The company has been prioritizing its commercial customers, with the 100,000 units for enthusiasts containing “Zero W, 3A+ and the 2GB and 4GB variants of Raspberry Pi 4”.

The Pi02W has had zero mention for a while. There is the 3A+, which really is near the same but $10+ more. The original Zero has had a price hike to $10 and the Zero W is $15, which likely means there isn’t a chance of seeing a Pi02W for $15; for now they seem to have shelved it.

We have the 2-mic HATs, and Plugable do a stereo ADC USB soundcard if you want to wire up a pair of mics.
There are fakes: with my luck I have one on my desk next to the one I purchased direct (Plugable USB Audio Adapter – Plugable Technologies).

I think one of the above I2S ADCs also works as a standard slave, just like the Adafruit I2S mic driver, but with only a single available I2S port on a Pi the 2-mic HAT or USB probably have preference.
If anyone can confirm whether the other ReSpeakers still have the random-channel issue due to TDM-mode sync problems, there are those as well, whilst the USB versions generally act like conference mics rather than a voice solution targeted at smart assistants. I have the 4- and 6-mic HATs on my desk and just despair at the driver status, which seems to break on each kernel update.
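For a quick sanity check of any of those stereo capture devices on a Pi, something along these lines (a sketch using the python-sounddevice package; the “USB Audio” device name is only an example) records a few seconds from both channels:

```python
# Quick 2-channel capture test with python-sounddevice.
# The device name substring "USB Audio" is an example; list devices to find yours.
import sounddevice as sd
from scipy.io import wavfile

print(sd.query_devices())             # find the index/name of your ADC or HAT

rate = 16000
seconds = 3
recording = sd.rec(int(seconds * rate),
                   samplerate=rate,
                   channels=2,
                   dtype="int16",
                   device="USB Audio")  # or an integer device index
sd.wait()                              # block until recording is finished
wavfile.write("stereo_test.wav", rate, recording)
```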

As said, the Pi02W is my favourite product, but as well as the stock issue, when it comes to ML Raspberry Pi’s top-end Pi4 lands short of the sweet spot for quite a varied range of models across KWS, ASR, NLU and TTS. Their long-term partner Broadcom is trying desperately to acquire IP after the fallout of being the biggest backer of the Nvidia takeover, and there seems to have been more of a falling-out with RS, who used to manufacture under licence and have switched to Rockchip.
Raspberry Pi at the moment is a complete no-man’s land of if and when; the Pi02W has had zero mention, and you have chipsets like the RK3588 kicking Raspberry Pi’s butt.
The OrangePi5 I recently got delivered for £86 (4GB version): the CPU alone runs ML at about 4x Pi4 speed, and that doesn’t even include the Mali G610, which with ArmNN has about 90% of the ML performance of the CPU; and we still haven’t mentioned that it has a 3-core 2-TOPS NPU. All in all, if everything was utilised, maybe a possible 20x ML boost over a Pi4, demonstrating a gap that is quite huge irrespective of fan base.
Also we are slowly seeing some very capable second-hand hardware come down in price, where state-of-the-art home AI will likely be very much a thing.

I would probably be more likely to say there is a better chance of the community providing models that can run on an ESP32-S3 via a model zoo for KWS, with bought-in hardware such as a ZL38042LDF1 from Microsemi; for me that solves the Espressif blobs problem, as the models can run open source via TensorFlow4Micro.
If you were going to build something custom, the total solution would likely be around the price of the equivalent Pi HAT, with reference designs and dev kits to clone such as ESP32-LyraTD-MSC Overview | Espressif Systems, where open source can be both fabbed and distributed via the likes of Seeed, as an S3 is pretty much a drop-in replacement for the lower ESP32.

That’s why the 2-mic or USB soundcard may not be optimal, but they are available and I have working beamforming code; there are just no Pis available to purchase, or at least not the one I would prefer, the Pi02W.

The R in Rhasspy is currently both restrictive and not really available, and I have honestly been wondering if it is still viable, as I’m stuck for clear-cut solutions.

I can half answer that: their wake word system is commercial, but TensorFlow4Micro is available on the ESP32. It just supports fewer layer types, most notably lacking recurrent layers such as GRU or LSTM, but a CNN or DS-CNN model should run quite well on an ESP32-S3.
So you don’t use the Espressif KWS, as it seems extremely heavily quantised to fit anyway and custom wake words are pay-for unless you’re prepared to run with ‘Hi ESP’; run TF4Micro instead, alongside their Audio Front End from esp-sr.
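To illustrate the TF4Micro route: a small Keras KWS model can be fully int8-quantised and exported for the S3 along the lines of the sketch below, using only the standard TensorFlow Lite converter, no Espressif blobs. The layer sizes and representative data here are placeholders, not a tuned model.

```python
# Sketch: convert a small Keras KWS model to a fully int8 TFLite flatbuffer
# suitable for TensorFlow Lite for Microcontrollers on an ESP32-S3.
# Layer sizes and the representative data are placeholders, not a tuned model.
import numpy as np
import tensorflow as tf

# Tiny CNN over 49x40 log-mel features (a typical KWS front-end shape).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49, 40, 1)),
    tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),   # e.g. wake / unknown / noise / silence
])
# ... model.fit(...) on your keyword dataset goes here ...

def representative_data():
    # Should iterate over real feature frames from the training set.
    for _ in range(100):
        yield [np.random.rand(1, 49, 40, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("kws_int8.tflite", "wb") as f:
    f.write(converter.convert())
# kws_int8.tflite can then be embedded as a C array and run with the
# TFLite Micro interpreter on the S3.
```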

The ‘Lite’ version has a 2-channel ADC rather than the 4-channel of the ‘non-Lite’, and this is why AEC doesn’t work: a 3rd channel is used as a hardware loopback fed from the DAC as the AEC reference channel.

I still think there is a misnomer here in the term ‘satellite’, as the community is being blindsided by what is envisaged from commercial units. There is no such thing as magic AEC; the construction of smart speakers uses some really clever structural methods to isolate the microphones from the speaker to give AEC a chance.
Have a look at the Google Nest Audio teardown.
I actually have an ESP32-S3-Box and an ESP32-S3-Box-Lite, and the AEC didn’t seem to be that great. I think it’s just a port of SpeexDSP that attenuates but doesn’t cancel, and the plastic case supplied is nice but actually acts like a resonant box, guitar-like.
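For anyone curious what that style of Speex AEC does, you can run the same attenuate-rather-than-cancel experiment on a host with the speexdsp Python bindings. This is only a sketch: it assumes the pip speexdsp package (whose exact API may differ by version), and the file names are placeholders.

```python
# Rough host-side experiment with Speex-style AEC, assuming the `speexdsp`
# Python bindings. near.raw is the mic capture, far.raw the playback reference,
# both 16 kHz 16-bit mono PCM; file names are placeholders.
from speexdsp import EchoCanceller

RATE = 16000
FRAME = 256                      # samples per frame
FILTER_LENGTH = 2048             # echo tail length in samples

ec = EchoCanceller.create(FRAME, FILTER_LENGTH, RATE)

with open("near.raw", "rb") as near, open("far.raw", "rb") as far, \
     open("out.raw", "wb") as out:
    while True:
        near_frame = near.read(FRAME * 2)   # 2 bytes per 16-bit sample
        far_frame = far.read(FRAME * 2)
        if len(near_frame) < FRAME * 2 or len(far_frame) < FRAME * 2:
            break
        out.write(ec.process(near_frame, far_frame))
```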

There is some simple lateral thought needed here, and it’s just: don’t stick your mics and speakers in the same box unless you have the resources of big data. Create a more friendly, reusable maker product via separates, such as an active wireless speaker and a wireless mic/KWS unit.
By doing that you give yourself a huge advantage, as the effect on SNR of the resonance through a single case is absolutely huge, and I keep desperately trying to explain that we shouldn’t tunnel-vision on a ‘satellite’.

You can run both client and server of an audio system, be it LMS, Snapcast or AirPlay, and it will run quite happily on a Pi3/4 with Rhasspy - in a box with an amp or speaker, or (my personal favourite for ease) in free air screwed to the back of second-hand bookshelf speaker(s).

There are so many advantages to employing a modern wireless audio system, where the maker space can actually compete, hook up a home to give that vital reference signal, and build some really cool solutions that work…

The best people to talk to would be Phillipe & Sebastion from GitHub - sle118/squeezelite-esp32: ESP32 Music streaming based on Squeezelite, with support for multi-room sync, AirPlay, Bluetooth, Hardware buttons, display and more

So my understanding is that the usual way to do this is a microphone → VAD → WakeWord detection → Intent Identification → dialog management / action.

The challenge is that when dealing with multiple rooms there are multiple microphones and, assuming there is just one voice server, does each microphone get its own VAD and WWD (and possibly intent identification), or does this stuff go on the server? A “satellite” is a mic or two and some mix of the rest, to go in a room.
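As a rough sketch of the satellite-side split described above (mic frames → VAD gate → wake word → stream to the server), using webrtcvad for the VAD; detect_wake_word() and send_to_server() are hypothetical placeholders for whatever KWS engine and transport you choose:

```python
# Rough sketch of the satellite side: mic frames -> VAD gate -> wake word -> stream to server.
# detect_wake_word() and send_to_server() are hypothetical placeholders.
import collections
import webrtcvad

RATE = 16000
FRAME_MS = 30
FRAME_BYTES = int(RATE * FRAME_MS / 1000) * 2   # 16-bit mono PCM

vad = webrtcvad.Vad(2)                 # aggressiveness 0-3
ring = collections.deque(maxlen=10)    # keep ~300 ms of pre-roll audio

def satellite_loop(frames):
    """frames: iterable of 30 ms PCM byte strings from the microphone."""
    triggered = False
    for frame in frames:
        ring.append(frame)
        if not vad.is_speech(frame, RATE):
            triggered = False          # silence ends the command stream
            continue
        if not triggered and detect_wake_word(b"".join(ring)):   # placeholder KWS
            triggered = True
        if triggered:
            send_to_server(frame)      # placeholder: stream command audio to the base
```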

I’d like to try out an “always listening” system (hence no cloud) with a large(ish) set of wakewords, and dynamic intents. I am also curious about how a speech recognition “dictionary” is trained.

There isn’t really a set way, Petr, as it’s very dependent on hardware; the VAD could be before or after the wakeword, or higher up in the system.
Also you could have microphones where beamforming and blind source separation help with far-field reverberation and signal extraction, but that is something often lacking.

Usually it’s more convenient to train and utilise a single wakeword that is “always listening” purely for that wakeword, and then on detection transmit the command sentence to ASR, even if it’s only “Stop”.
Some ASR systems have a phonetic lexicon as a “dictionary”, whilst later models tend to do a beam search and use sentence context, like OpenAI’s Whisper, which will occasionally get things totally wrong (though the sentence will still be logical) but overall tends to be more accurate with sentences.

There are a whole manner of ways you can do it, but with phonetic lexicons dictionary sparsity can lead to more accuracy, which is why I think Wakeword → Skill Router → Skill ASR could be more scalable.
It’s more load and latency that way, but the Skill Router is a predicate ASR looking for “Play”, “Show”, “Who”, “Turn”, “Set” that may pass both intent and wav to a secondary Skill Server to partition subjects. There could be many controls (“Light”, “Heater”), and certain types of skill such as an audio player can quickly rack up a huge subject dictionary, so you split them into skill servers, which could be a mixture of lexicon and context types suited to and trained for a certain skill type.
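As a toy illustration of that predicate routing (the predicate-to-skill mapping and forward_to_skill() below are made-up examples), the router only needs to match the leading word of its own transcript and hand both the transcript and the wav on to whichever skill server owns it:

```python
# Toy sketch of a predicate-based skill router: match the first word of the
# router ASR transcript and hand off transcript + wav to a skill server.
# The predicate->skill mapping and forward_to_skill() are made-up examples.
SKILL_FOR_PREDICATE = {
    "play": "media_skill",
    "show": "media_skill",
    "turn": "control_skill",
    "set":  "control_skill",
    "who":  "query_skill",
}

def route(transcript: str, wav_bytes: bytes) -> str:
    words = transcript.strip().lower().split()
    predicate = words[0] if words else ""
    skill = SKILL_FOR_PREDICATE.get(predicate, "fallback_skill")
    forward_to_skill(skill, transcript, wav_bytes)   # hypothetical transport call
    return skill

def forward_to_skill(skill: str, transcript: str, wav_bytes: bytes) -> None:
    # Placeholder: in practice this could be MQTT, WebSockets, HTTP, etc.
    print(f"-> {skill}: {transcript!r} ({len(wav_bytes)} bytes of audio)")
```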

wav2vec is an approximately ‘phonetic’ type and can be fairly appalling without dictionary backup, but compared to Whisper it is extremely performant.

It is a good example of where the addition of a language model greatly reduces spelling errors and mistakes; as a specific skill ASR it could maybe even be more accurate than Whisper, and far faster or lighter.
It could even be possible to create a language model from a multi-voice TTS speaking, say, the band names and track names of an audio media skill, which can be really hard to recognise with a general model; you could train a specific subject-content model, just as you could with the controls you have.

You can compare them yourself, as both repos are super easy installs, especially whisper.cpp.
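For reference, the Python side of that comparison is only a few lines with the openai-whisper package; the “base” model and the file name are just example choices:

```python
# Minimal transcription with the openai-whisper package for comparison runs.
# "base" and command.wav are example choices; swap for the model/file you test.
import whisper

model = whisper.load_model("base")
result = model.transcribe("command.wav")
print(result["text"])
```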

As rolyan says, there is no set way. Rhasspy was intended as a toolkit and framework so we can mix and match and use the bits we want. They recognised 6 stages in processing a voice command, and provide multiple options at each stage.

A while back Rhasspy was refactored to support Satellite + Base operation - multiple satellites with microphones around the house + a more powerful central computer doing the heavier processing tasks - and leaving it up to the user to decide what stage is performed where.

The most popular configuration seems to be a Raspberry Pi Zero with microphone and speaker performing the Audio Recording and Wakeword detection (so all sound isn’t constantly being streamed over the LAN) and Audio Output for any response to the user. The Speech-To-Text, intent recognition, and Text-To-Speech all require more processing power and are more suitable for a more powerful computer (a RasPi 4 works fine, but a used PC is better).

But the framework is just as effective running all the stages on one computer, or any combination as @romkabouter explained recently here.

Unfortunately the documentation appears to have had Satellite + Base bolted on, making it rather hard to find the important information in the documentation :frowning:

@donburch I think you might find the single-core 32-bit Zero a struggle even for KWS. I was rubbing my hands with glee when the Pi02W came out at $15; there is still the more expensive Pi3A+, which does seem to have stock, or if you can get one, a second-hand Pi3 of any denomination.

PS: with it being the year of the voice, has anyone else got any benchmarks / articles appraising last year’s best?

I have a habit of going on Hugging Face from time to time and seeing what the most downloaded models are for a certain application type.

So the key here is that satellites are in another room with a microphone. I’d suggest that this is NOT mentioned in the “getting started” section of the manual, but is a separate use-case with a separate entry. Of course it wants to be easy to do, so the “getting started” set-up should have clearly defined “modules” that can be taken off the voice server and put on a satellite machine. Similarly, multiple microphones for beamforming should be an extension-style section of the manual. Finally, given the popularity of Pis, I’d suggest the getting-started part of the manual is all about a Pi3B. The mic-in-another-room(s) bit uses the Pi Zero 2 W - even if they are hard to get, because the migration of modules should be easy. And I’d suggest the beamforming section uses the ReSpeaker HAT for a Pi, because they will have an interest in getting it working, with flashing lights and so on.

Dunno about Pi stock Petr, even if they are faves, but it prob deserves a discussion of its own, so I created

Then…

Alas, while Rhasspy was re-factored to support the Base+Satellite model, I am of the opinion that the documentation was “bolted on”. I did find all the information in the documentation … somewhere … after at least 3 reads through the whole thing. The first mention of base+Satellite is tucked away down in the Tutorials section, and even there it isn’t fully explained (e.g. both approaches show Intent Handling as “disabled”).

Unfortunately I think at the time it would have been a major exercise to restructure the documentation to better reflect the base+satellite approach, while still satisfying the existing users. Michael also strikes me as an excellent developer, and I doubt he considers documentation fun or even a core competency.

And now…

Well, I made notes and tried building them into a tutorial for my combination (RasPi ZeroW satellite + Base running HAOS with the Rhasspy add-on), and at 30 pages started feeling that there is probably a better approach than one huge document … such as a website (allowing links to audio troubleshooting tips without bogging down the main flow). The other thing is that Rhasspy is more a toolkit, with many ways it can be used … so really wanting multiple tutorials … and it would be great if they could be fitted into one framework.

It looks now as though 90% of users have adopted the base+satellite model, so it’s easy to justify re-doing the official documentation, and with lots of new users maybe all that good technical information can be moved a bit towards the back?

Except that with v3 in the pipeline, is there much point?


BTW, a few days back I tried sending you a private message offering to email my 30-page tutorial. Did you receive it?

Sorry Don I am travelling at the moment and working off a phone and shared machines. I would love to see your 30 pager but can I leave it another 2 weeks when I will have a real machine and more time.

Of course!

There seems to be several new users asking for help setting up, so maybe I should buckle down and finish my tutorial now.


Also, perhaps a crazier idea (maybe Rhasspy v4 or v5) would be using plugins compiled to WASM. Honestly, I don’t know much about how it actually works, but from reading and trying to stay on top of it, it seems like you could avoid relying on the OS to call out to CLIs and instead execute the plugins directly.

Weird benefits are that the plugins could be easily downloadable and cross-platform, and maybe cross-architecture. The biggest downside is that there are only a few languages that can currently compile to WASM (C and Rust, and I think Ruby), and I have no idea what interacting with a WASM object from Python looks like.

The tech might be a little too new at the moment - so it might be more of a future endeavour. It just seems to be the current hype machine around application packaging.


Websockets would be great! It would be a lot easier to use for satellite communication.
Also, external training would fix the issue that a Raspberry Pi is sometimes too slow and times out.

It could be just a plain socket, but the ease of use and the guarantee of order and delivery of TCP vs UDP is a big bonus.
WebSockets has already been written and neatly differentiates text and binary frames, which also makes it super easy to separate binary audio from text protocol messages.
WebSockets is low latency and it’s one-to-one, so it doesn’t broadcast across networked nodes, just to the destination.

It’s included in SDKs such as Arduino and ESP32, and it scales all the way up as it’s pretty much universally adopted; there are others, but microcontroller libraries often have less support for them, gRPC for example.
WebSockets really should have been a no-brainer.
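A sketch of how naturally that text/binary split falls out with, say, the Python websockets library on the base side; the host, port and JSON message shape are only examples:

```python
# Sketch of a base-station WebSocket handler: text frames carry protocol JSON,
# binary frames carry raw audio. Host/port and message fields are examples only.
import asyncio
import json
import websockets

def process_audio(pcm: bytes) -> None:
    # Placeholder: feed the PCM chunk to VAD/ASR here.
    print(f"got {len(pcm)} bytes of audio")

# Note: older versions of the websockets library pass a second `path` argument.
async def handler(ws):
    async for message in ws:
        if isinstance(message, bytes):
            process_audio(message)            # binary frame: raw PCM from the satellite
        else:
            event = json.loads(message)       # text frame: control / protocol message
            print("event:", event.get("type"))

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()                # run forever

asyncio.run(main())
```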


I have been testing hardware and updating my simple beamformer code and really we have very few choices that are worthwhile.

The Adafruit Voice Bonnet is good, but probably too expensive for a ‘soundcard’; the ReSpeaker 2-mic is also OK at $10.

The Keyestudio and other 2-mic clones are often noisy and definitely e-waste (not sure where mine is, as I just got another), but be careful where you get your ReSpeaker from, as the 1st one I got was the same as the clones. This 2nd one is working fine.

I have GitHub - StuartIanNaylor/2ch_delay_sum: 2 channel delay sum beamformer. /tmp/ds-out is the current TDOA;
/tmp/ds-in, if it exists, sets the beam to the integer in the file; to clear it, just delete the file.
If anyone has even a touch of C/C++ finesse then feel free to clone and tidy.
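Not the repo’s code, but the underlying delay-sum idea fits in a few lines of NumPy as a conceptual sketch: estimate the TDOA between the two channels by cross-correlation, shift one channel by that delay, and average.

```python
# Conceptual 2-channel delay-and-sum in NumPy (not the 2ch_delay_sum code itself):
# estimate the inter-mic delay by cross-correlation, align, then average.
import numpy as np

def delay_sum(left: np.ndarray, right: np.ndarray, max_delay: int = 16) -> np.ndarray:
    """left/right: float arrays of one audio block; max_delay in samples."""
    # Cross-correlate over a limited lag range to estimate the TDOA.
    lags = range(-max_delay, max_delay + 1)
    corr = [np.dot(left, np.roll(right, lag)) for lag in lags]
    tdoa = lags[int(np.argmax(corr))]

    # Align the right channel onto the left and sum (0.5 keeps the level constant).
    # np.roll wraps around at the block edges, which is ignored here for simplicity.
    aligned = np.roll(right, tdoa)
    return 0.5 * (left + aligned)
```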

There is also Plugable USB Audio Adapter – Plugable Technologies with 2 channels or any el cheapo usb with a mic if you are not going to beamform.