2023 - Year of Voice

110% behind this direction; for me it has been a no-brainer for some time.

This will make things far more manageable, and the simpler modules are just building blocks for what is essentially a serial chain of voice modules.
This should have always been decoupled from skill servers. All that is needed is a skill router that allows for the simple or the highly complex: you merely add more skill servers, without needing to maintain or understand the controls and methods of a skill, just passing inference.

A voice system is merely a set of applications / containers / instances that queue and pass to the next module in what is essentially a serial queue.
The less that is embedded into Rhasspy, the bigger the choice of implementation, and the more scalable it is.
The metadata needs for a voice system are extremely simple and that simplicity creates a building block system where complexity is choice.

It will be more manageable, offer more modules and be more scalable, and if it is done right we could start to see plug & play Linux inference-based skill servers that gather bigger herds because they are interoperable and not limited to a single system.

It's as simple as a queue → routes that connect to the next stage, which just advertises whether it is busy or free.
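To make that concrete, here is a toy sketch (my own, not anything Rhasspy ships) of a stage advertising busy/free over a Unix socket; the socket path and message format are made up:

    import socket
    import threading
    from pathlib import Path

    BUSY = threading.Event()  # set by the stage while it is processing work

    def advertise(sock_path: str = "/tmp/asr-stage.sock") -> None:
        # Hypothetical status socket: an upstream router connects and reads
        # a single "busy" or "free" reply before deciding where to route.
        Path(sock_path).unlink(missing_ok=True)
        srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        srv.bind(sock_path)
        srv.listen(1)
        while True:
            conn, _ = srv.accept()
            conn.sendall(b"busy" if BUSY.is_set() else b"free")
            conn.close()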

What you have posted is intents for Home Assistant, and there is absolutely no need for that in a voice system, as it should happen in an HA skill server that is routed to and passed an inference.

  • What if Rhasspy didn’t come with a web UI, just HTTP/Websocket/etc. APIs?
    I’m a little confused, too. I think it would be very important to keep an easy GUI. I can understand that it’s not the point you want to focus on, but it’s an important entry point for all new users.
    It’s also useful if you just want to change a small thing without a remote connection, or on a computer without the right setup.

  • What if Rhasspy had no “plugins”, but only ever called external programs?
    I’m not sure if I have an opinion on that point.

  • What if training in Rhasspy was separated into its own standalone application?
    That would be nice, if we could train our models on more powerful processors with the ability to share them.
    It would also be cool to train only the sentence files for a specific skill, but I don’t think that’s the way it works. Or a server to expand the ability of understanding.


There isn’t really enough info to go on, but even though it’s such a radical change it doesn’t necessarily mean there will not be a web UI, even if, as currently implemented, it is massively insecure.

There is a problem with the current all-in-one infrastructure in what is not just the fastest evolving tech scene, but one that is evolving at unprecedented speed.
Already much of what is contained in Rhasspy is obsolete, as better open source SOTA models exist freely, aimed at platforms from mobile to GPU.
Then we have hardware that in this scene is evolving almost as rapidly, from Apple silicon (under 7 watts at idle with RTX 2080 Ti class ML performance) to the RK3588 and NPU accelerators, alongside problems with the Pi supply chain.

The current all-in-one approach gives a few choices for certain modules, prescribes specific modules elsewhere, and uses a protocol that is specific to Rhasspy.
This means the current system is relatively locked in to a very narrow spectrum, where 100% of the support needs are provided by a small (singular) dev team.

The OP’s questioning of the current infrastructure has been well overdue for some time. In 2023 technology in general is adopting voice methods at a fast pace, and the current all-in-one is just a huge constriction on choice, scalability and security.
The current voice scene is so fast moving that the current modules are already relegated to toy status.

Likely there will be an HA skill server, but the current system and protocol is applied to all modules and is massively over-complex. A voice system is not a control system, and currently there are huge swathes of control protocol on modules without need, purely because they are part of an all-in-one.

Rhasspy’s training currently only works due to the low command volume and relatively unique phonetic collections, as the ASR and NLU methods are quite old, and even with fairly modest additions of ‘subject’ and ‘predicate’ accuracy will plummet.
It works on low-volume predicates like ‘turn’ and subjects such as ‘light’, but a common skill such as an audio server with a modest library could flood an all-in-one with subjects and decimate how the current system garners accuracy.

If Rhasspy and Home Assistant have any ambitions to be more than a toy system, it needs a complete rethink in terms of voice control: like CISC vs RISC, complexity can be built by reusing simple building-block modules that scale.
A model doesn’t get better when you train it on a more powerful processor; a model is what it is, you just train it faster, and currently we are not really training a model, just reorganising its phonetic catchment.
The models we use are part of an all-in-one that is hardware-specific to the Raspberry Pi, and that is why we have the models we have, which is also hugely restrictive.
It doesn’t get better with better hardware, because we have specific models aimed at specific hardware.
Accuracy can be maintained or even increased by partitioning into predicate and subject domains, whilst an all-in-one at any level will do the opposite, which is why the current infrastructure was and is deeply flawed.

But your worries are also misplaced, because we never needed the front-end voice complexity that we currently have.
A very simple zonal, channel-based system of KWS -> KWS/audio processor -> ASR -> skill router -> TTS is all we need, and it’s a very simple serial chain.
The complexity under the hood to create a working voice system was never needed, as it confused control with voice. Partitioning this gives a choice of hardware, model, scale and complexity, and reusing software from larger herds will reduce maintenance and increase support availability, rather than pointlessly refactoring code towards a smaller pool and embedding system specifics.
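As a toy illustration (my own sketch, not Rhasspy code), the whole chain is nothing more than a list of stages, each taking the previous stage’s output plus the zone/channel metadata:

    from typing import Any, Callable, Dict, List

    Payload = Dict[str, Any]  # e.g. {"zone": "kitchen", "channel": 1, "audio": b"..."}

    def run_chain(payload: Payload, stages: List[Callable[[Payload], Payload]]) -> Payload:
        # Strictly serial: each stage's output is the next stage's input.
        for stage in stages:
            payload = stage(payload)
        return payload

    # chain = [kws_audio_processor, asr, skill_router, tts]   # placeholder callables
    # run_chain({"zone": "kitchen", "channel": 1, "audio": wav_bytes}, chain)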


@donburch Hopefully I can clear up some confusion :slight_smile:

I’m not saying Rhasspy will be dropping the web UI, just that it should be optional. Like Hermes/MQTT, having so much baked into Rhasspy’s core has made it difficult to keep up with the pace of change in the voice space (as @rolyan_trauts mentioned).

Regarding Rhasspy 2.5 vs 3.0, I believe for many users that the internal workings are less important than their sentences, slots, and profile settings. I will do my best not to break things unnecessarily, but it may take me a while.

I agree! As we’ve talked about with Rhasspy Junior, I think it’s possible to layer a user-friendly interface on top of something that more advanced users also enjoy. My plan is (loosely):

  • Voice services as regular programs that can still be used independently of Rhasspy
  • Small HTTP/websocket servers that wrap the voice services for satellites
  • Rhasspy’s core, which configures and coordinates the voice services into voice loops (wake → speech to text → etc.)
  • Web UI and other protocols like Hermes on top of the core

One feature the new parser has is that you can embed template pieces into words. In English, for example, you can have turn on the light[s] for both “light” and “lights”.
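Roughly, the expansion behaves like this toy snippet (illustrative only, not the actual parser code):

    import itertools
    import re

    def expand(template: str) -> list:
        # Split on [optional] pieces: even indexes are fixed text, odd ones optional.
        parts = re.split(r"\[([^\]]*)\]", template)
        fixed, optional = parts[0::2], parts[1::2]
        variants = []
        for choices in itertools.product(*[["", opt] for opt in optional]):
            out = fixed[0]
            for choice, rest in zip(choices, fixed[1:]):
                out += choice + rest
            variants.append(out)
        return variants

    print(expand("turn on the light[s]"))
    # ['turn on the light', 'turn on the lights']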

This helps with matching, but responses are usually more difficult. I’d be very interested to hear about what sorts of information needs to be tracked for Hungarian (gender, case, etc.). Please PM me or reply here :slight_smile:

In the year I was at Mycroft, things changed so much! The best idea I’ve had is to lower the barrier to entry for adding a service to Rhasspy. Something as simple as: if your program takes a WAV file and returns text, you can be a speech to text service. No Python, no MQTT, just a program with arguments, standard in, and standard out.
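As a sketch of what the calling side could look like (the program name and arguments here are made up):

    import subprocess

    def transcribe(wav_bytes: bytes) -> str:
        # "my-stt" is a hypothetical executable: anything that reads WAV on stdin
        # and prints the transcript on stdout fits the contract.
        result = subprocess.run(
            ["my-stt", "--language", "en"],
            input=wav_bytes,
            capture_output=True,
            check=True,
        )
        return result.stdout.decode("utf-8").strip()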

But I do want there to be an “easy button” for users, which selects the programs based on some constraints (Pi 4 vs. GPU) and installs them.

And a logo that doesn’t look like it was drawn by a programmer would be nice :grin:

If you look at some of the models that Big AI is producing, such as GPT-3/ChatGPT or Whisper, things are moving at unprecedented speed.
You could do something extremely simple by sharing a host folder that is the output of one container and the input of another, with a simple inotify folder watcher to run a command.
The reciprocal could happen when the folder has been cleared, as an ‘I am free’ signal.
But basically it’s inotify-simple (PyPI) plus whatever the run command is.
My preference would be Unix sockets, as sockets can be both file and network based, and the same inter-process queue/bridge could be used at each step in the chain. A file socket would act the same as above, but with a network-based socket you can have multiple instances to scale to needs.
The only config would be a filename or host:port for the next link in the chain to connect to.
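A minimal sketch, assuming inotify-simple and made-up paths/commands, would be something like:

    import subprocess
    from pathlib import Path

    from inotify_simple import INotify, flags

    WATCH_DIR = Path("/voice/asr-in")  # shared folder written by the previous stage (hypothetical)

    inotify = INotify()
    inotify.add_watch(str(WATCH_DIR), flags.CLOSE_WRITE)

    while True:
        for event in inotify.read():   # blocks until a new file has been fully written
            wav_path = WATCH_DIR / event.name
            # Hand off to the next stage as a plain command ("next-stage" is hypothetical).
            subprocess.run(["next-stage", str(wav_path)])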

There are SOTA models now where, if you wish and intend to buy the hardware, large models such as Whisper, high-fidelity TTS and GPT-style NLU are as valid an option as selecting much lesser models to run on a Pi.
So I don’t think, especially with HA, that you can provide specifics, just the queue/bridging modules to link them, because if you provide for one you exclude another, or have to provide for all.

If you take Whisper, the install is:
    pip install git+https://github.com/openai/whisper.git
It uses ffmpeg:
    sudo apt install ffmpeg
and it runs via:
    whisper audio.flac --model medium
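If you would rather call it from your own glue code than the CLI, the same thing via its Python API (roughly as shown in the project’s README) is:

    import whisper

    # Load the medium model and transcribe a file; result["text"] holds the transcript.
    model = whisper.load_model("medium")
    result = model.transcribe("audio.flac")
    print(result["text"])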

It really doesn’t need a web page to be set up… Its support is on its own web page, and its herd is much larger than Rhasspy’s, with multiple how-tos and alternative refactored code.

Michael, you have indicated that web development isn’t your thing … understood, and I agree that your effort is much better spent on the technicalities (what I think of as “the back end”). So I am seriously considering giving the Junior UI a go myself.

I am particularly suspicious of things like Wi-Fi which are sold as “it just works” like magic - because invariably they don’t.

So separate, but definitely not optional - especially for new users. People need to check that their audio devices are working, set up friendly names for HA devices, check/edit the values for the arguments in intents, and see error messages.

EDIT: body of post moved to a new topic: Home Assistant Rhasspy Integration GUI

Well, not by me then either :wink:


2023 sounds amazing !

Regarding the web UI: if there is a good web API, a UI will follow. There is a big community; someone (me ?) will develop a UI if there is a web API.

Managing it from the command line would be a really great plus.


Another thought …

I appreciate that Rhasspy Satellite is a fairly recent concept which required a significant refactoring not so many versions ago … and so at the time it was considered an advanced option … but has it now proved itself as the best logical approach for Rhasspy going forward ?

Experience has shown that Rhasspy satellites use only audio input, wake word detection, MQTT and audio output modules. By packaging just these modules (yes, I strongly believe it should still be modular) the overhead is reduced.
Would this be a reasonable subset to implement on a cheap ESPHome platform ? Have you had discussions with Nabu Casa’s ESPHome team about audio options ?

How many users have only an all-in-one Rhasspy ? And in these cases would it be reasonable to run separate instances of Rhasspy Base and Rhasspy Satellite on the same machine ? To run Base it must have a reasonable CPU, so would the extra overhead be significant ?

So… am I really suggesting splitting Rhasspy core into 3 or 4 separate but closely tied projects - Rhasspy Satellite, Rhasspy Base, Rhasspy GUI, and Rhasspy training ?


Rhasspy satellite is absolutely terrible bloat and a ridiculous idea, akin to making a Rhasspy keyboard module for input that uses a network-based MQTT broker to receive its keystrokes.

We have always been missing a module - a KWS server / audio processor that sits and queues KWS to an ASR and also contains further filters, VAD or AEC, if that is how you wish to set up your initial audio stream.
Rhasspy talks to the KWS server / audio processor, and the KWS server contains modules where likely the only preferential constraint is that a single zone (room) contains the same model of KWS, so that argmax is comparable, but even that is not essential.
KWS are just ears - extremely simple input devices that are set up as channels in a zone for input audio, mirroring the same system as many of the current wireless audio systems available.
It’s a very simple premise: audio in on a zone provides audio out on that zone…
It has a minimal number of commands, not much more than start and stop, and it doesn’t even have a pixel ring, as a pixel ring is a standalone HA device: a zone may only have a single shared pixel ring, the KWS might even be hidden, whilst the pixel ring could be prominent and central.

There is no such thing as a Rhasspy satellite, as all that was ever needed was wireless audio and wireless KWS in a simple zonal system.
We do need a Rhasspy KWS server, just as RaspiAudio SqueezeLite has an LMS server to coordinate, or Snapcast, AirPlay or even Sonos (not that I know much about that system).

Or MQTT Rhasspy keyboards it is…

A KWS will stream to a KWS server that may filter and apply Rhasspy metadata for the zone and channel of origin, so that TTS output is a simple mapping back to the same; it is purely a bridge, so any KWS device can work with Rhasspy.
Streaming from that point is a strange one, as all the latest and best ASR uses quite long CTC windows and a mixture of phonetics and sentence context to produce highly accurate results, as does, say, OpenAI’s Whisper. Actually trying to stream to such models causes a hike in load and lowers accuracy, as often the context width is reduced, and from playing around the latency is not all that much different.

If you are going to copy consumer e-waste from the likes of Google & Amazon, where each unit is an all-in-one running older, smaller models in a streaming mode because that is all that will fit and run, then maybe streaming mode is a thing.

If you are going to have a modern multi-zone SOTA voice system, you would have a single brain fed by distributed KWS, and models would not stream to garner context but would run far faster than real time, so the latency of the return is not noticeable and they don’t lag on multiple requests.
You only have to get to 2-3 zones and the investment cost of a central, well-powered single brain starts to become more cost effective, as the only addition is the KWS ears, and the audio-out cost can be discounted because it is already encompassed by that room’s wireless audio system. If you went for a Pi 4 with constrained models, the second zone only needs a Pi Zero 2 W for audio in & out whilst processing happens on the first, and here is where argmax comes in, as the KWS server could pick the best stream or a preferential default.

You can still put a centralised system, KWS & audio in a box and use it as an all-in-one, but the all-in-one, peer-to-peer style control network of the Rhasspy satellite is an absolute thunderclap of unnecessary complexity. Why are we copying Google & Amazon when there are clearly better, less e-waste infrastructures that can be easily accomplished and where open source can excel?
It’s even a copy of their single enclosure, yet even Google & Amazon worked out that client-server is the most efficient way, and that should have been copied as a home server, not a single box.

KWS are generic devices that just need a ‘driver module’ installed in the KWS server. Make a brand of one yourself by all means, but they are just mics that auto-broadcast on keyword, with start and stop commands, and literally that is all that is needed.
Voice commands are highly sporadic, and a voice system that spends much of its time idle is absolutely textbook centralised-server territory, yet for some reason we have gone peer-to-peer and lost all the advantages of cost and load that a single home server can provide, where the only clients needed are audio in and out, and which could absolutely kick the ass of Google & Amazon.

From this comment, it honestly doesn’t seem like you really used Rhasspy satellites much. We all want things to be improved, but there’s no need to be so negative about something a lot of people here worked hard on. Especially when many of the complaints were addressed ages ago.

Most satellites used an internal MQTT broker with local KWS and VAD, and just did HTTP calls out to the base station for speech to text, etc. with their siteId (which could contain a “zone”). So no, “keystrokes” were not going over the network. And the satellite didn’t even have to be running anything related to Rhasspy, as long as it could HTTP POST some WAV data.

Again, there is a lot of room for improvement here. Streaming raw audio from satellites over MQTT/UDP is obviously not going to scale with many satellites. And setting up satellites in Rhasspy is unnecessarily complex since it was bolted on later, rather than part of the original design.

As @donburch said, a lot of Rhasspy users are probably using it in base station/satellite mode, so this needs to be at the forefront for designing v3. And I absolutely agree that “design” here should not entail Rhasspy-flavored versions of already existing standards!

This is what I’m thinking, though “Rhasspy Satellite” could just be a configuration in the base station relating an existing streaming audio service to a zone (as @rolyan_trauts has alluded to).

Yes, I’ve talked to Jesse (the ESPHome maintainer) about this some. Paulus has a contact over at Espressif, so I think the plan would be to get their audio framework involved. Espressif has a number of two-mic boards based on the ESP32 that could form the basis of a fairly cheap satellite (that dev board is $20 on Mouser). I don’t know what would be involved with getting ESPHome onto it, and whether it would be possible to still do local KWS and AEC.


You know very well, from multiple comments and from the very start of dev on the ‘satellite’, that I was totally opposed to the bloat and the complete lack of need for it.
I cannot help the fact that people spent a lot of time developing something without functional need; my objection has always been that the unnecessary was developed, and is still unnecessary, whilst a crucial part of audio processing has always been missing.
The satellite mechanism is completely pointless and is just wasted load on what a satellite needs, as it’s purely audio in & out, and it’s not my fault the dev continued whilst I was ignored.

None of the complaints were ever addressed, and there is a constant stream of confusion in the forum history on how to handle very simple multi-KWS zonal systems.

Yes, and it has never been fixed; I have repeatedly posted for a long time how simple the fix is, and you just contradicted yourself in the next sentence - it’s not a fix, it’s a badly fitted bandage.
There is no value or IP in what has been developed; the ‘satellite’ dev veered off in an acute and complex direction, to the detriment of the simple addition of a KWS server where a hugely important load of audio processing could be shared, which could allow even simple microcontrollers to be satellites, and I have been constantly bemused as to why.
VAD can be central; all you need is to be able to tell a KWS mic to start and stop, and once more the fix is that simple. VAD should be able to reside on the satellite or centrally, but currently it cannot, and the supposed fix forces in so much unnecessary peer-to-peer, client-style architecture when a simple client-server would have sufficed.

Is it not time to get it right and fix it?

I will write a relatively brief explanation here: if you partition elements into basic, lowest-common-denominator building blocks, you can just collect those together to create any form, however complex, but there is choice throughout.

If you embed function without need, you will always be shackled to providing for ill-placed function, create a confusing and complex infrastructure, and exclude certain choices.

There are only 2 types of interaction in a voice system: instructions and responses.
An instruction is the original request, and a response is prompted by a TTS question.
An instruction just needs the zone/channel and audio, whilst a response (which turns on a mic) comes from the skill server that got the original instruction, which merely returns that zone/channel metadata but includes which skill server it is, so the response audio can be returned.
The KWS server receives that and turns on the corresponding mic, and the next response audio is shipped and returned to where it is needed, because the skill server data is there.

That’s it - that is how simply the protocol could work, because a voice system does not need to know about control.
It merely ships and routes, which is what a voice server should do, whilst skill servers do control.

There have always been 2 elements missing from the chain: firstly a KWS server and secondly a skill router.

The skill router is an intermediary fed by ASR that uses the attached metadata to do some very simple routing.
It forwards on the predicate to the matching predicate skill server, and if that skill server requires a response it is returned to the same skill server, as there is only need for a single 1-to-1, simple, low-latency connection.
The skill router sends the TTS text to TTS, awaits completion, and then tells the KWS to turn the mic on.

It’s the same for any type of voice interaction; everything is just a repetition of the above. It’s really simple, it partitions the modules into basic functions, and those simple methods can be reused to create whatever is needed, however complex, but that is a choice.
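As a rough sketch of that routing (all names and fields are illustrative, not a spec):

    from dataclasses import dataclass
    from typing import Callable, Dict, Optional

    @dataclass
    class Inference:
        zone: str                           # e.g. "kitchen"
        channel: int                        # KWS channel within the zone
        text: str                           # ASR output
        skill_server: Optional[str] = None  # set on a response so audio can be routed back

    def route(msg: Inference, skills: Dict[str, Callable[[Inference], str]]) -> str:
        # Match on the predicate (first word) and hand the whole inference over.
        predicate = msg.text.split()[0] if msg.text else ""
        handler = skills.get(predicate)
        if handler is None:
            return "Sorry, no skill knows how to handle that."
        return handler(msg)                 # the skill returns TTS text (possibly empty)

    # skills = {"turn": ha_skill_server, "play": audio_skill_server}  # placeholders
    # reply = route(Inference("kitchen", 1, "turn on the lights"), skills)
    # ...send reply to TTS for the same zone/channel, then tell the KWS to open the mic.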

Keep Rhasspy as it is and restart anew with V3 (as something new and separate) with a simple, uniform API to local open source voice tools, as what is needed is exceptionally simple, and huge swathes of the current system have no functional necessity apart from the fact that they already exist.
Then if people want to use what exists because they developed it, they can, but don’t shackle V3 once more to what is, in the majority, functionally unnecessary for a voice system and, even worse, still fail to implement crucial elements such as audio processing.

No it is not. That is just your, as always, totally unnecessary negative opinion.
If you had put as much positive energy into helping Rhasspy become your vision, Rhasspy would now be very close to that.
Instead you have chosen to put a huge amount of negative energy into complaining and whining again and again about what you think is all so terrible.

Why is that? Why do you only choose the negative path on this, instead of putting that same effort into actually changing things to the way you would like to see them? I have asked that a couple of times, but still no answer.
I really do not understand this: when Rhasspy as a whole is so terrible, you still keep posting your negative comments instead of just finding other systems you do like.
Most of the time I skip your lengthy and incoherent posts, but that question always pops up when I scroll past them.


I picked up on rolyan’s comment a while back that he abandoned Rhasspy several years ago and has no experience with Rhasspy satellites. Yet based on this total lack of actual experience he is stuck vehemently repeating allegations that only he seems to believe (like Rhasspy being inextricably linked to the Raspberry Pi), about software that is long since history.

Personally I don’t see much conceptual difference between cheap devices with mic and speaker spread around the house which listen for a keyword that are called “ears”, and the same devices with the same purpose called “satellites”. Sure, a Rhasspy satellite has the same user interface, but I don’t consider calling modules on a server to do all the CPU-intensive processing “bloat”. Similarly, Rhasspy’s modular client-server architecture does not somehow prevent KWS being done on a separate shared server if one so wishes, or on the client so the audio doesn’t have to go over the LAN. He seems so fixated on using his own terminology that he can’t see that Rhasspy is conceptually pretty much what he is promoting. :frowning:

I freely admit that, while Rhasspy’s documentation does contain all the required information, it is not arranged in a way that makes base+satellite configuration clear. I guess there must have been quite a bit of confusion at the time of the transition. And the confusion continues, resulting in new users needing to ask for help on the forum; often having struggled to piece together the necessary pieces of information spread through the documentation. Please Michael don’t take this as an attack - I don’t like writing documentation either, and at the time you were adding satellite to existing documentation.

I suggest that rearranging the current documentation to make base+satellite the default configuration (and all-in-one the advanced option) would help. And a comprehensive tutorial for new users … which I started and got to 30 pages before deciding I needed to re-think my approach. Now I’m not sure whether v3 will make it a waste of effort.

Bottom line, I really am puzzled that rolyan spends so much time on the Rhasspy forum, given his extreme prejudice against it. I suspect rolyan could have developed his own system with half the time and effort he has spent trolling Rhasspy.

rolyan I don’t understand why you could feel responsible for other people’s effort; more so because you are the only one that considers it a waste. If you really believe it to be a waste, why not just move on ?


Besides the logo, I don’t like the name Rhasspy either. In my language it sounds the same as “raspi”, which leads to so much confusion that I’m hardly ever able to use the word Rhasspy. I need to say something like “the speech thing” instead.

Light is “lámpa”, but in “turn on the lights” it’s “lámpát”, so the original word changes as well; it’s not just “lámpa[t]”. Turning on the lights IN the living room (“nappali”) is “nappaliban”, but in the bathroom (“fürdő”) it would be “fürdőben”, so the ending is different. Gender is not important for grammar. Verbs are sometimes attached to the conjugation, sometimes not, e.g. “turn off the lights” is “kapcsold ki a lámpát”, but “turn off lights” may be “lámpát kikapcsolni”, so where templates would work normally in English they may not work with the same placeholders in Hungarian. I could circumvent some of that with variants like [entity](ban|ben), but that doesn’t cover everything. Oh, and we have “a” and “az” articles (this is for “the”, not for “a”/“an”, which is simply “egy”), depending on whether the word starts with a vowel or not. Yes, it can be put as “(a|az)”, but then a single longer sentence ends up having 6-7 of such variants, which makes it difficult to write/read/understand and is still not grammatically perfect.

The translation web UI was really difficult to use with Mycroft; it was really slow and made it difficult to see context and understand how single-word entries would eventually be used. (I was even considering doing it straight in git instead.) OK, there is a word “start”, but in what context and term is it used? It makes a difference considering how endings are applied. Start is “indít”, but “start the timer” would be “indítsd el az időzítőt” and “start a timer” would be “indíts egy időzítőt”.

I found other difficulties when doing translation for Mycroft but can’t recall them off the top of my head. I know there is plenty of supported stuff, like being able to customise numbers (e.g. we say “two” in two ways depending on context, and that I think is handled), but I remember it all being so complicated to get my head around. Perhaps this is not an issue if the core can cover most stuff and a selected few :slight_smile: can cover the rest. Other services (like ChatGPT) that have language engines (or whatever) that know grammar work better than templates in this regard - I guess it’s no wonder that many language-related services only support a select core list of languages.

Not sure how to PM, though I was trying to find it, lol. :slight_smile:

I like this idea, but I would prefer it to be a networked option, so a plugin can be placed on multiple devices according to their resource needs.
The current MQTT solution would mean too much upload and download.
The audio port feature seems like a better solution, where there is a direct connection between the plugins.

Sorry Michael but I’m confuzzed again :frowning: Are you meaning that the satellites are part of the Base station ? And that the satellite’s Audio Output can be a streaming audio service ?
I was thinking that Rhasspy GUI should be a separate web server running in its own container, and uses API calls to Base and Satellite processes.

Looking at the recent posts on this forum, many are from people struggling to configure their Base + Satellite Rhasspy.
Moreover I realise that I often struggle to help because they only give the Base part of their configuration … as though they expect all the configuration to be in one place. I guess using the same UI can also add to confusion for new users.

I note that you have previously suggested auto-discovery of Satellite units, which I assume implies that they be controlled (or at least configured) centrally. I can see that it would be easy to provide a simplified HA Rhasspy Junior user interface using multiple tabs (as suggested here) for Base and each Satellite since only limited options would be provided for each unit.

In theory one web UI for Rhasspy 2.5 or v3 could also use device discovery and call Satellite API routines to provide remote configuration - though all the extra options will make the UI more confusing. I guess there’s not much point going down this path until we see what v3 brings ?

This is my first post and hopefully it’s in the right place.

With the changes coming to v3, is something like the ESP32-S3-BOX-LITE a good option for satellite hardware?

I have a dedicated server room with a TrueNAS Scale server that has tons of compute (AMD EPYC, as well as Tesla cards for acceleration) that can be used to run the main Rhasspy app on.

I’m not sure what kind of hardware is best for satellite nodes in each room, though. My house currently has 10 Google Home devices. Some are low-end Minis while others are dual-speaker setups for music.

With hardware becoming more affordable, I’m curious about what will end up being the best hardware for 2023.

As are we all :frowning:

Google and amazon have put lots of money into developing their own excellent hardware devices - which are locked to use their cloud service.

Raspberry Pi with the reSpeaker 2-mic HAT seems to have been the most popular non-proprietary option - not because it is particularly wonderful, but because they were cheap, readily available and easy to program. Then came COVID, and chip shortages :sob: There is also the fact that (despite having 2 mics in hardware) neither the driver nor Rhasspy provides any of the Digital Signal Processing (DSP) to take advantage of both mics on the reSpeakers.

I know that some forum members are already using S3 devices as satellites. I understand that the ESP32-S3 will be particularly suitable with some AI capability built in; and up in message #17 above @synesthesiam suggested that the ESP32-LyraT could be a go. It will of course be very dependent on getting the software. My fingers are still crossed.

As for Rhasspy v3, Michael hasn’t let anything slip, so far.