2023 - Year of Voice

I picked up on rolyan’s comment a while back that he abandoned Rhasspy several years ago and has no experience with Rhasspy satellites. Yet despite this total lack of hands-on experience he keeps vehemently repeating allegations that only he seems to believe (like Rhasspy being inextricably linked to the Raspberry Pi), about software that is long since history.

Personally I don’t see much conceptual difference between cheap devices with mic and speaker spread around the house, listening for a keyword, that are called “ears”, and the same devices with the same purpose called “satellites”. Sure, a Rhasspy satellite has the same user interface, but I don’t consider calling modules on a server to do all the CPU-intensive processing “bloat”. Nor do I accept that Rhasspy’s modular client-server architecture somehow prevents KWS from being done on a separate shared server if one so wishes, or on the client so the audio doesn’t have to go through the LAN. He seems so fixated on using his own terminology that he can’t see that Rhasspy is conceptually pretty much what he is promoting. :frowning:

I freely admit that, while Rhasspy’s documentation does contain all the required information, it is not arranged in a way that makes base + satellite configuration clear. I guess there must have been quite a bit of confusion at the time of the transition. And the confusion continues, resulting in new users needing to ask for help on the forum, often after struggling to piece together the necessary information spread through the documentation. Please Michael, don’t take this as an attack - I don’t like writing documentation either, and at the time you were adding satellite support to existing documentation.

I suggest that rearranging the current documentation to make base+satellite the default configuration (and all-on-one as the advanced option) would help. And a comprehensive tutorial for new users … which I started and got to 30 pages before deciding I needed to re-think my approach. Now I’m not sure whether v3 will make it a waste of effort.

Bottom line, I really am puzzled that rolyan spends so much time on the Rhasspy forum, given his extreme prejudice against it. I suspect rolyan could have developed his own system with half the time and effort he has spent trolling Rhasspy.

rolyan, I don’t understand why you would feel responsible for other people’s effort; more so because you are the only one who considers it a waste. If you really believe it to be a waste, why not just move on?


Besides the logo, I don’t like the name Rhasspy either. In my language it sounds the same as “raspi”, which leads to so much confusion that I am almost never able to use the word Rhasspy. I have to say something like “speech thing” instead.

Light is “lámpa”, but in “turn on the lights” it’s “lámpát”, so the original word changes as well; it’s not just “lámpa[t]”. Turning on the lights IN the living room (“nappali”) is “nappaliban”, but in the bathroom (“fürdő”) it would be “fürdőben”, so the ending is different. Gender is not important for grammar. Verb prefixes are sometimes attached and sometimes separated depending on the conjugation, e.g. “turn off the lights” is “kapcsold ki a lámpát”, but “turn off lights” may be “lámpát kikapcsolni”, so templates that work in English may not work with the same placeholders in Hungarian. I could circumvent some of that with variants like [entity](ban|ben), but that doesn’t cover everything.

Oh, and we have “a” and “az” articles (this is for “the”, not for “a”/“an”, which is simply “egy”), depending on whether the word starts with a vowel or not. Yes, it can be written as “(a|az)”, but then a single longer sentence ends up having 6-7 such variants, which makes it difficult to write/read/understand and still not grammatically perfect.
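Just to make the combinatorics concrete, here is a rough, untested sketch of what one such template could look like in Rhasspy’s sentences.ini syntax (the intent name and room list are examples I made up, not a working profile):

```
[HungarianLightState]
room = (nappaliban | fürdőben | előszobában)
kapcsold (fel | le){state} (a | az) lámpát [(a | az)] <room>{room}
```

Even in this tiny example, each room has to be listed with its own locative ending and the article has to be spelled out as a variant, and it is still not grammatically perfect.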

The translation web UI was really difficult to use with Mycroft: it was really slow, and it made it hard to see context and understand how single-word entries would eventually be used. (I was even considering doing it straight in git instead.) OK, there is a word “start”, but in what context and phrase is it used? That makes a difference, considering how endings are applied. Start is “indít”, but “start the timer” would be “indítsd el az időzítőt” and “start a timer” would be “indíts egy időzítőt”.

I found other difficulties when doing translation for Mycroft but can’t recall them off the top of my head. I know there is plenty of supported stuff, like being able to customize numbers (e.g. we say “two” in two ways depending on context, and I think that is handled), but I remember it all being so complicated to get my head around. Perhaps this is not an issue if the core can cover most things and a selected few :slight_smile: can cover the rest. Other services (like ChatGPT) that have language engines (or whatever) that actually know grammar work better than templates in this regard - I guess it’s no wonder that many language-related services only support a select core list of languages.

Not sure how to PM, though I was trying to find it, lol. :slight_smile:

I like this idea, but I would prefer it to be a networked option, so a plugin can be placed on multiple devices according to their resource needs.
The current MQTT solution would mean too much upload and download.
The audio port feature seems like a better solution, where there is a direct connection between the plugins.

Sorry Michael but I’m confuzzed again :frowning: Do you mean that the satellites are part of the Base station? And that the satellite’s Audio Output can be a streaming audio service?
I was thinking that the Rhasspy GUI should be a separate web server running in its own container, which uses API calls to the Base and Satellite processes.

Looking at the recent posts on this forum, many are from people struggling to configure their Base + Satellite Rhasspy.
Moreover, I realise that I often struggle to help because they only give the Base part of their configuration … as though they expect all the configuration to be in one place. I guess using the same UI for both can also add to the confusion for new users.

I note that you have previously suggested auto-discovery of Satellite units, which I assume implies that they be controlled (or at least configured) centrally. I can see that it would be easy to provide a simplified HA Rhasspy Junior user interface using multiple tabs (as suggested here) for Base and each Satellite since only limited options would be provided for each unit.

In theory one web UI for Rhasspy 2.5 or v3 could also use device discovery and call Satellite API routines to provide remote configuration - though all the extra options will make the UI more confusing. I guess there’s not much point going down this path until we see what v3 brings?

This is my first post and hopefully it’s in the right place.

With the changes coming to v3 is something like the ESP32-S3-BOX-LITE a good option for satellite hardware?

I have a dedicated server room with a TrueNAS Scale server that has tons of compute (AMD EPYC, plus Tesla cards for acceleration) that can be used to run the main Rhasspy app.

I’m not sure what kind of hardware is best for satellite nodes in each room though. My house currently has 10 Google Home devices. Some are low-end Minis while others are dual-speaker setups for music.

With hardware becoming more affordable I’m curious about what will end up being the best hardware for 2023.

As are we all :frowning:

Google and Amazon have put lots of money into developing their own excellent hardware devices - which are locked to their cloud services.

A Raspberry Pi with the reSpeaker 2-mic HAT seems to have been the most popular non-proprietary option - not because it is particularly wonderful, but because it was cheap, readily available and easy to program. Then came covid, and chip shortages :sob: There is also the fact that (despite the hardware having 2 mics) neither the driver nor Rhasspy provides any of the Digital Signal Processing (DSP) needed to take advantage of both mics on the reSpeakers.

I know that some forum members are already using S3 devices as satellites. I understand that the ESP32-S3 will be particularly suitable, with some AI capability built in; and up in message #17 above @synesthesiam suggested that the ESP32-LyraT could be a go. It will of course be very dependent on getting the software. My fingers are still crossed.

As for Rhasspy v3, Michael hasn’t let anything slip, so far.

Hello all. I am new here but have extensive experience with Raspberry Pis and, separately, with voice interfaces. I am planning to use Home Assistant with voice this year and perhaps can contribute to the documentation by sharing my “journey”? I’d propose to start by writing a diary of what I am trying to do and what I discover, and then discuss my diary with someone who is taking responsibility for documentation. What I am trying to do is not an uncommon starting point, but I know that it does not fit comfortably with the wake-word/intent model made popular by the Echo.

The standard ESP32 that the LyraT uses is really tight on the models you can run, and that is what the ESP32-S3 is about: the LX7 microcontroller is very similar apart from the additional vector instructions, which boost ML performance close to 10x over a LyraT-class ESP32.

Espressif often make demo boards rather than dev boards, where they throw in the kitchen sink and everything else as a demo of what you could do, rather than something with really good practical use.
It has all the audio codecs, even mics on board, and 2x 3 W speaker outputs, but it is a demo; due to the positioning and spec they are not that useful and rather toy-like.

Even the ESP32-S3-Box-Lite is a demo and toy-like, where again they have thrown the kitchen sink of KWS, audio processing, ASR, TTS and a screen all onto one ESP32-S3. It gives you something to start with and hack away at, but a custom board or standard dev kits and modules would likely be more appropriate.
The ESP32-S3-Box-Lite does a huge amount, none of it very well, as a demo to demonstrate potential load; if instead you made a specific KWS, you have the 2x I2S ports of the S3 and those vector instructions to make something really low-cost and modular that can run fewer but better models.

Espressif have created some interesting audio processing and ML libs, and the demo products (LyraT & ESP32-S3-Box-Lite) are a great dev resource but likely not an end product. Only the S3 is capable of running the more advanced libs, but as a product you probably wouldn’t want what the ESP32-S3-Box-Lite is.

Also those demo boards are not that well priced; if we were able to get the Pi02W then they likely would not get a look-in.
But if the community was up for it, a custom S3 KWS board that is far more capable than the resource-constrained one in the ESP32-S3-Box-Lite could likely hit the $10 mark in quantity, as an ADC and mics (or I2S mics) are all that is needed. There is still the dev curve, though the board design is likely the easier part.

When you first mentioned the ESP32-S3-Box I was impressed at their demo - but noted that it is a demo, not a consumer product. But that is the nature of the beast. Neither Seeed nor Espressif are in the business of selling direct to end users - they require system integrators to add value and marketing.

Espressif are at least showing that they can put the hardware combo we want onto a small board for a good price - we just need to add the software (communications, wake word detection, and as much DSP as is available and fits in memory) and a pretty box with speaker and power. Sounds simple, but way over my head technically :frowning:

And that was behind my asking @synesthesiam about discussions at Nabu Casa. Without cheap “ears”, users cannot be expected to move from the cloud-based solutions to any local voice assistant. I was hoping that with the demand for low-cost “ears” and ESPHome as a base, maybe Nabu Casa might take the plunge to develop an S3 into an end-user product - as they already have with Blue and Yellow. I think it’s almost inevitable, but it will take time.

The ESP32-S3-Box is really impressive, as they have all that running on what is a microcontroller! It’s quite an accomplishment, but in use it’s not going to be the next Alexa.

I have been meaning to take the plunge, as I don’t know how effective the audio processing chain is, and whether apportioning more resources to the model purely as a KWS would provide much better results.
Espressif do have:

Espressif Audio Front-End (AFE) integrates AEC (Acoustic Echo Cancellation), VAD (Voice Activity Detection), BSS (Blind Source Separation) and NS (Noise Suppression).

It is part of their GitHub - espressif/esp-skainet: Espressif intelligent voice assistant, but I find the binary blobs a bit off-putting, as I’m not sure how much you can hack around, tweak and configure. I keep meaning to, but haven’t progressed much further than that.

There are I2S ADC stereo modules for a few dollars (https://www.aliexpress.us/item/2251832644497273.html) that could be used with any ESP32-S3 dev kit (https://www.aliexpress.us/item/2255800678628772.html).

So you don’t have to go full custom. One is more suited to the job than the other in terms of an easy interface, but as per usual I forget which one and would have to scratch my head over the datasheets once more.

I went through several learning curves getting my Home Assistant / Rhasspy Satellites + Base setup working, and tried to document as I went. At 30 pages / 1 MB of tutorial I started to think a different approach might be warranted. I am happy to share my document, especially with someone with fresh eyes - but email is probably better than posting here :wink:

I think the problem is mostly that (like all FOSS projects) documentation is written by developers for other developers, and so is rather technical. By stating that “Rhasspy is intended for savvy amateurs or advanced users that want to have a private voice interface to their chosen home automation software” they conveniently avoid mere “users”; despite non-technical people also wanting a private voice interface.

Honestly I think Michael’s Rhasspy documentation is better than Home Assistant’s; and I am hoping that (when he comes out of his cave developing Rhasspy v3) Michael will allow some of us users to get involved in developing a proper user-oriented website for the Rhasspy project.


Hi @Petr, welcome :wave:
I’d be curious to hear about what you’re intending to do.

Thanks! Not much to report on v3 right now. Most of my time is going towards the intent recognition for Home Assistant. Fortunately, that will be reusable for Rhasspy too.

As I work on v3, I’m doing my best to keep “mere users” in mind :slight_smile:
Automatic installation of the configured tools is a tough nut to crack. Rhasspy v2’s solution was to try and pack it all (pre-built) into the Docker image. I’d prefer not to do this as the image just gets bigger and more difficult to build over time. I’ve had some success building self-contained binary packages with guix, but it remains to be seen if the ML runtimes can be folded in.

I’m planning to order one of the non-lite versions to test. From what I’ve read, the lite doesn’t support acoustic echo cancellation (AEC), so it would have a harder time hearing you if it’s playing audio.
But the ESP32-S3 is definitely the current target for satellite devices. The big questions I have are:

  1. Can we use our own wake word system with Espressif’s audio framework? (you can’t train a custom wake word with their framework without paying them)
  2. How well does their audio processing work in practice?
  3. How much RAM does their framework take with speech to text turned off?

If the Pi02W can get back down to a reasonable price, I’d much rather go with that. We’d have to roll our own I2S in that though, wouldn’t we?

So would I; at $15 for 4-core performance it really has no competitor, and it’s my favourite Raspberry product. I have been tuning in for any Raspberry mention, of which sadly there has been a lack; it could even be 2024 before we see stock.
I still have 2x on order with Farnell, and the “available to manufacturer” lead time of 373 days sort of backs up the sad state of affairs.

The company has been prioritizing its commercial customers, with the 100,000 units for enthusiasts containing “Zero W, 3A+ and the 2GB and 4GB variants of Raspberry Pi 4”.

The Pi02W has had zero mention for a while. There is the 3A+, which really is near the same but $10+ more. The original Zero has had a price hike to $10 and the Zero W is $15, which likely means there isn’t a chance of seeing a Pi02W for $15; for now they seem to have shelved it.

We have the 2-mic HATs, and Plugable do a stereo-ADC USB soundcard if you want to wire up a pair of mics.
There are fakes though; with my luck I have one on my desk next to the one I purchased direct (Plugable USB Audio Adapter – Plugable Technologies).

I think one of the above I2S ADCs also works as a standard slave, just like the Adafruit I2S mic driver, but with only a single available I2S port on a Pi the 2-mic HAT or USB soundcard probably take preference.
If anyone can confirm whether the other reSpeakers still have the random-channel problem due to TDM-mode sync issues, there are those as well, whilst generally the USB versions act like conference mics rather than a smart-assistant-targeted voice solution. I have the 4- and 6-mic HATs on my desk and just despair at the driver status, which seems to break on each kernel update.

As said, the Pi02W is my fave product, but as well as the stock problem, when it comes to ML Raspberry’s top-end Pi4 lands short of a sweet spot for quite a varied range of models across KWS, ASR, NLU and TTS. Their long-term partner Broadcom is trying desperately to acquire IP after the fallout of being the biggest backer of the Nvidia takeover, and there seems to have been more of a fallout with RS, who used to manufacture under licence and have switched to Rockchip.
Raspberry at the moment is a complete no-man’s land of if and when: the Pi02W has had zero mention, and you have chipsets like the RK3588 kicking Raspberry’s butt.
The OrangePi5 I recently got delivered for £86 (4 GB): its CPU alone runs ML at 4x Pi4 speed, and that doesn’t even include the Mali G610, which with ArmNN has about 90% of the ML perf of the CPU, and we still haven’t mentioned its 3-core 2 TOPS NPU. All in all, if it was all utilised, that is maybe a possible 20x ML boost over a Pi4 - a gap that is quite huge irrespective of fan base.
Also, slowly we are seeing some very capable second-hand hardware come down in price, so state-of-the-art home AI will likely very much be a thing.

I would probably say there is a better chance of the community providing models that can run on an ESP32-S3 via a model zoo for KWS, with bought-in hardware such as a ZL38042LDF1 from Microsemi; for me that solves the Espressif blobs issue, as the models can run open-source via TensorFlow Lite Micro.
If you were going to build something custom, the total solution would likely be around the price of the equivalent Pi HAT, with reference designs and dev kits to clone such as the ESP32-LyraTD-MSC (Overview | Espressif Systems), where open source can be both fabbed and distributed via the likes of Seeed, as an S3 is pretty much a drop-in replacement for the lower ESP32.

That’s why the 2-mic HAT or USB soundcard is maybe not optimal, but they are available, and I have working beamforming code - just no Pis available to purchase, or at least not the one I would prefer, the Pi02W.
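(Not my actual code - just a toy delay-and-sum sketch of the basic idea of combining the two mics; a real implementation would estimate the delay per frame, e.g. with GCC-PHAT.)

```python
# Toy delay-and-sum beamformer for a 2-mic array (illustrative only).
# Assumes both channels are the same length and the steering delay is known.
import numpy as np

def delay_and_sum(mic_left: np.ndarray, mic_right: np.ndarray, delay_samples: int) -> np.ndarray:
    if delay_samples >= 0:
        mic_right = np.pad(mic_right, (delay_samples, 0))[: len(mic_right)]
    else:
        mic_left = np.pad(mic_left, (-delay_samples, 0))[: len(mic_left)]
    # speech arriving from the steered direction adds in phase,
    # while off-axis noise is partially cancelled
    return 0.5 * (mic_left + mic_right)
```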

The R in Rhasspy is currently both restrictive and not really available, and I have honestly been wondering if it is still viable, as I am stuck for clear-cut solutions.

I can half answer that: their wake word system is commercial, but TensorFlow Lite Micro is available on the ESP32. It just supports fewer layer types - most notably missing are recurrent layers such as GRU or LSTM - but a CNN or DS-CNN model should run quite well on an ESP32-S3.
So you don’t use the Espressif KWS, as it seems extremely heavily quantised just to fit anyway and custom wake words are pay-for (unless you’re prepared to run with “Hi ESP”); instead run TFLite Micro alongside their Audio Front End / SR.
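As a rough illustration only (the 49x40 MFCC input and the layer sizes are assumptions I picked for the sketch, not a tested recipe), this is the sort of small DS-CNN you could define in Keras and quantise to int8 for TFLite Micro on the S3:

```python
# Sketch of a small DS-CNN keyword model for TFLite Micro on an ESP32-S3.
# Input shape (49 frames x 40 MFCC features) and layer sizes are illustrative;
# in practice the model would be trained on keyword spectrograms before converting.
import numpy as np
import tensorflow as tf

def build_dscnn(num_keywords: int) -> tf.keras.Model:
    inp = tf.keras.Input(shape=(49, 40, 1))
    x = tf.keras.layers.Conv2D(64, (10, 4), strides=(2, 2), padding="same", activation="relu")(inp)
    for _ in range(4):
        # depthwise-separable blocks: cheap on the S3, and no recurrent layers needed
        x = tf.keras.layers.DepthwiseConv2D((3, 3), padding="same", activation="relu")(x)
        x = tf.keras.layers.Conv2D(64, (1, 1), padding="same", activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    out = tf.keras.layers.Dense(num_keywords, activation="softmax")(x)
    return tf.keras.Model(inp, out)

model = build_dscnn(num_keywords=3)  # e.g. wake word, "stop", background

def representative_data():
    # should yield real MFCC frames from the training set; random data is a placeholder
    for _ in range(100):
        yield [np.random.rand(1, 49, 40, 1).astype("float32")]

# Full-integer quantisation so the model runs on the int8 kernels in TFLite Micro
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
open("kws_int8.tflite", "wb").write(converter.convert())
```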

The ‘Lite’ version has a 2-channel ADC rather than the 4-channel one of the non-Lite, and this is why AEC doesn’t work: a third channel is used as a hardware loopback fed from the DAC, acting as the AEC reference channel.

I still think there is a misnomer here in the term “satellite”, and the community is being blindsided by what is envisaged as a commercial unit. There is no such thing as magic AEC: the construction of smart speakers uses some really clever structural methods to isolate the microphones from the speaker to give AEC a chance.
Have a look at the Google Nest Audio teardown.
I actually have an ESP32-S3-Box & ESP32-S3-Box-Lite, and the AEC didn’t seem to be that great; I think it’s just a port of SpeexDSP, which attenuates but doesn’t cancel, and the plastic case supplied is nice but actually acts like a resonant box, guitar-like.

There is some simple lateral thought needed here: just don’t stick your mics and speakers in the same box unless you have the resources of big data. Create more friendly, reusable maker products via separates, such as an active wireless speaker and a wireless mic/KWS unit.
By doing that you give yourself a huge advantage, as the magnitude of the resonance through a single case is absolutely huge in SNR terms, and I keep desperately trying to explain that we shouldn’t tunnel-vision on a “satellite”.

You can run both client & server of an audio system - be it LMS, Snapcast or AirPlay - and it will run quite happily on a Pi3/4 with Rhasspy, in a box with an amp or speaker, or (my personal favourite for ease) in free air screwed to the back of a second-hand bookshelf speaker.

There are so many advantages to employing a modern wireless audio system, where the maker space can actually compete: hook up a home that way and you get that vital reference signal, and some really cool solutions that work…

The best people to talk to would be Phillipe & Sebastion from GitHub - sle118/squeezelite-esp32: ESP32 Music streaming based on Squeezelite, with support for multi-room sync, AirPlay, Bluetooth, Hardware buttons, display and more

So my understanding is that the usual way to do this is a microphone → VAD → WakeWord detection → Intent Identification → dialog management / action.

The challenge is that when dealing with multiple rooms there are multiple microphones, and (assuming there is just one voice server) does each microphone get its own VAD and WWD (and possibly intent identification), or does this stuff run on the server? A “satellite” is a mic or two plus some mix of the rest, to go in a room.
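To make that flow concrete, here is a toy sketch of the pipeline as I understand it (nothing Rhasspy-specific - every function name is a placeholder for whichever VAD / wake word / ASR component you pick, and the comments only show one common way to split the stages between satellite and server):

```python
# Toy sketch of the mic -> VAD -> wake word -> ASR -> intent -> action flow.
# All the component functions below are placeholders, not a real API.

def process_audio_stream(frames):
    awake = False
    command_audio = []

    for frame in frames:                       # ~20-30 ms chunks from the mic (satellite)
        if not is_speech(frame):               # VAD: cheaply drop silence (satellite)
            continue
        if not awake:
            if detect_wake_word(frame):        # KWS on the satellite, so raw audio stays off the LAN
                awake = True
            continue

        command_audio.append(frame)            # buffer the spoken command
        if utterance_finished(command_audio):  # e.g. trailing silence reported by the VAD
            text = speech_to_text(command_audio)    # heavier: typically on the base/server
            intent = recognize_intent(text)         # e.g. {"name": "TurnOnLight", "room": "kitchen"}
            handle_intent(intent)                   # dialogue management / home-automation action
            awake, command_audio = False, []
```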

I’d like to try out an “always listening” system (hence no cloud) with a large(ish) set of wakewords, and dynamic intents. I am also curious about how a speech recognition “dictionary” is trained.

There isn’t really a set way, Petr, as it’s very dependent on hardware; the VAD could be before or after the wake word, or higher up in the system.
Also you could have microphones where beamforming and blind source separation help with far-field reverberation and signal extraction, but that is something often lacking.

Usually it’s more convenient to train and utilise a single wake word that is “always listening” purely for that wake word, and then on detection transmit the command sentence to the ASR, even if it’s only “Stop”.
Some ASRs have a phonetic lexicon as a “dictionary”, whilst later models tend to do a beam search and use sentence context, like OpenAI’s Whisper, which will occasionally get things totally wrong (though the sentence will still be logical) but overall tends to be more accurate with sentences.

There are a whole manner of ways you can do it, but with phonetic lexicons, dictionary sparsity can lead to more accuracy, which is why I think Wakeword → Skill Router → Skill ASR could be more scalable.
It’s more load and latency that way, but the Skill Router is a predicate ASR looking for “Play”, “Show”, “Who”, “Turn”, “Set” that passes both the intent and the wav on to a secondary Skill Server. That partitions subjects: some cover many controls (“Light”, “Heater”), while certain skill types, like an audio player, can quickly rack up a huge subject dictionary - so split them into skill servers, which could be a mixture of lexicon and context ASR types, suited and trained for a certain skill type.
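Purely to illustrate the routing idea (every name here is hypothetical, and a trivial dictionary lookup stands in for the small predicate ASR):

```python
# Hypothetical sketch of a wake word -> skill router -> per-skill ASR layout.
# The router only needs to recognise a handful of predicate words; each skill
# server keeps its own small, subject-specific dictionary / language model.

SKILL_SERVERS = {
    "play": "http://media-skill.local",    # big dictionary: artists, albums, tracks
    "show": "http://media-skill.local",
    "turn": "http://control-skill.local",  # small dictionary: lights, heater, on/off
    "set":  "http://control-skill.local",
    "who":  "http://qa-skill.local",
}

def route_utterance(command_wav: bytes) -> dict:
    predicate = predicate_asr(command_wav)     # tiny ASR trained only on the predicate words
    skill_url = SKILL_SERVERS.get(predicate)
    if skill_url is None:
        return {"intent": "NotRecognized"}
    # Forward both the guessed predicate and the raw audio; the skill's own ASR
    # re-decodes it against its narrower, more accurate dictionary.
    return send_to_skill(skill_url, predicate, command_wav)
```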

wav2vec is an approximately “phonetic” type and can be fairly appalling without dictionary backup, but compared to Whisper it is extremely performant.

It is a good example of where the addition of a language model greatly reduces spelling errors; as a specific skill ASR it could maybe even be more accurate than Whisper, and far faster or lighter.
It could even be possible to create a language model from a multi-voice TTS speaking, say, the band names & track names of an audio media skill - which can be really hard to recognise with a general model. You could train a specific subject-content model, just as you could for the controls you have.

You can compare against Whisper, as both repos are super-easy installs, especially whisper.cpp.

As rolyan says, there is no set way. Rhasspy was intended as a toolkit and framework so we can mix and match and use the bits we want. They recognised 6 stages in processing a voice command, and provide multiple options at each stage.

A while back Rhasspy was refactored to support Satellite + Base operation - multiple satellites with microphones around the house + a more powerful central computer doing the heavier processing tasks - and leaving it up to the user to decide what stage is performed where.

The most popular configuration seems to be a Raspberry Pi Zero with microphone and speaker performing the Audio Recording and Wake Word detection (so all sound isn’t constantly being streamed over the LAN) and the Audio Output for any response to the user. The Speech-to-Text, intent recognition, and Text-to-Speech all require more processing power and are more suited to a more powerful computer (a RasPi 4 works fine, but a used PC is better).

But the framework is just as effective running all the stages on one computer, or any combination as @romkabouter explained recently here.

Unfortunately the documentation appears to have had Satellite + Base bolted on, making it rather hard to find the important information in the documentation :frowning:

@donburch I think you might find even KWS a struggle on the single-core 32-bit Zero; that is why I was rubbing my hands with glee when the Pi02W came out at $15. There is still the more expensive Pi3A+, which does seem to have stock, or if you can find one, a second-hand Pi3 of any denomination.

PS: with it being the Year of the Voice, has anyone else got any benchmarks / articles appraising last year’s best?

From time to time I have a habit of going on Hugging Face and seeing what the most downloaded models are for a certain application type.