Rhasspy is Joining Nabu Casa

In case people missed the announcement at the State of the Open Home 2022 conference today, I will be joining Nabu Casa at the end of November to work on Rhasspy full time :partying_face:

Paulus, the founder of Home Assistant, has declared that 2023 will be the “Year of Voice”. We will be meeting in a few weeks to create a roadmap for exactly what that means, but at the core will be a great “out of the box” experience for Home Assistant (HA) users [1]. This will include auto discovery of Rhasspy base stations and satellites by HA (through the Rhasspy integration), and auto discovery of HA devices/entities by Rhasspy (so you can say “turn off the living room light” with no configuration).

[1] I plan to make this framework general, so other home automation frameworks can also benefit.

For all of us in the Rhasspy community, this will finally mean I have time for updates to Rhasspy. In the short term, my plans are to add Whisper and Coqui STT for speech to text, Mimic 3 and Coqui TTS for text to speech, and upgrade many of the existing services (like Porcupine).

Longer term, I would like to do many different things like:

  • Work with folks like @rolyan_trauts to create a DIY satellite mic with quality audio input (looking at the ESP32-S3 and some I2S mics ATM)
  • Extend language support with more African, Asian, and European languages. Really anything non-English :stuck_out_tongue: The most important part of this is getting the training tools released so community members can start training models.
  • Add wildcards and recursive rules to sentence templates. This would let you have templates like “add * to my shopping list” and make it possible to have commands like “do X and Y and Z and …” (a rough sketch of what wildcard matching could look like follows this list)
  • Re-architect Rhasspy to ensure that all of the services (wake word, STT, TTS) can be easily reused by other open source projects. This is still a bit fuzzy in my head, but I would like to use something like Guix to build and package each service so everything is still usable outside of Rhasspy.
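
To make the wildcard idea concrete, here’s a rough sketch of what matching could look like once a wildcard template is expanded. The `*` syntax and the helper below are purely illustrative (not an existing Rhasspy feature), and a real implementation would go through the intent graph rather than regexes:

```python
import re

def wildcard_to_regex(template: str) -> re.Pattern:
    """Turn a hypothetical single-wildcard template into a regex.

    '*' matches arbitrary free text and is captured as the 'wildcard' group.
    Everything else is matched literally.
    """
    parts = [re.escape(p) for p in template.split("*")]
    pattern = r"(?P<wildcard>.+?)".join(parts)
    return re.compile(rf"^{pattern}$", re.IGNORECASE)

matcher = wildcard_to_regex("add * to my shopping list")
match = matcher.match("add two kilos of rhubarb to my shopping list")
if match:
    print(match.group("wildcard"))  # -> "two kilos of rhubarb"
```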

Let me know your thoughts and questions below, and thank you for your patience over the last year!

25 Likes

That’s great news.

Thanks for your great work.

Looking forward to new releases of rhasspy!

1 Like

This is awesome news ! :relaxed: Home Assistant and Rhasspy will be a great match :+1:

After a lot of work on my own voice assistant services, I think the last missing piece is background noise filtering. I found that although the microphone(s), wakeword and STT services are now pretty good and reliable, surrounding noise (TV, children, home appliances, etc.) almost always prevents the utterance from being identified correctly.

Google has an amazing ML model called VoiceFilter-Lite that seems to really solve all the audio input noise issues (better than all the AEC and noise filtering currently available, without any use of a mic array, and with no CPU load when idle). I do not have the required skills to tackle something like that, but maybe you (and others) can :wink:
https://google.github.io/speaker-id/publications/VoiceFilter-Lite/

I think @rolyan_trauts previously mentioned it somewhere in another post.

Maybe this can also be on the wish/todo list of future improvements.

Congratulations on the new job :relaxed: Cheers.

3 Likes

FANTASTIC NEWS!! Congratulations!! :heart_eyes:

1 Like

There are open target speech separation repos for neural target speech extraction (see “Neural target speech extraction” on YouTube), e.g. GitHub - BUTSpeechFIT/speakerbeam.

Google are holding theirs tight to their chest, as they are with the tiny ASR models they use on the Pixel 6/7.

Still, though, with Raspberry Pi turning their backs on retail, the R in Rhasspy is looking ever more dubious, as you just cannot buy one, with crazy ETAs of September 2023.
Also, when it comes to DSP, of which voice processing has a lot, the same hardware can run in the region of 50x slower than if written in, say, C/C++/Rust, so the Py suffix at times needs a lesser presence.

So that leaves us with Hass, but really all that is needed is a Hass skill server that a Linux voice system can query for an entity list, so it knows which skill server to route inference to.
So that really plays havoc with the naming, and maybe it’s time to adopt a Prince-like symbol.


:slight_smile:

There are also two relatively lite audio processing models based on GitHub - breizhn/DTLN (Tensorflow 2.x implementation of the DTLN real-time speech denoising model, with TF-Lite, ONNX and real-time audio processing support) that sanebow took to Python’s limits; they would be so much more efficient with some fairly simple porting to a language better suited to audio DSP.

I don’t think that is going to happen, as there is no reason for:

This will include auto discovery of Rhasspy base stations and satellites by HA (through the Rhasspy integration)

There is no rationale or reason apart from branding and IP ownership, and your statement of intent is already going in the wrong direction. Yeah, we get who pays your dev time, but it’s actually detrimental to build in dependency when it absolutely isn’t needed; you could keep the IP as non-dependent open source and still prioritise customisation and additions for Hass specifics.

The more you partition each module into standalone functionality that purely routes audio and inference over a choice of link layers, the more you increase interoperability and basic simplicity.
The more you embed system dependency, the opposite happens to interoperability and basic simplicity.
You partition complexity so the complex can be simple, or you end up with a bloated and confusing system.

The most secure way for satellites is button-press (or reboot) Bluetooth pairing, as this combines a simple user finger push with short-range Bluetooth security; there is no rationale for diluting that by broadcasting over a Hass network.

It would be great to see Hass satellites, but we also need to partition the satellite into what it really is: a network KWS, audio RTP, and an indicator, anything from a basic LED to a display screen.
These are all completely separate functions, and the term satellite is a misnomer; again, forcing dependencies will only reduce interoperability and increase satellite complexity, whilst we could be increasing interoperability and making complexity simple.

I suggest you start with a network KWS that is basically a command-sentence audio broadcast on detection of the keyword, and that will run happily on either an ESP32 or, if available, a Pi Zero 2 (which is an awesome amount of board for $15).
On the Pi Zero 2 the cheap 2-mic HAT clones are as good as any other, but maybe start thinking of forking the drivers to take control of quality.
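
Something along these lines is all a network KWS needs to be (just a sketch: `detect_keyword()` is a placeholder for a real KWS model, and the base-station address/port are made up):

```python
import socket
import sounddevice as sd

BASE_ADDR = ("192.168.1.10", 12202)   # hypothetical base-station host/port
RATE, FRAME = 16000, 320              # 16 kHz mono, 20 ms frames

def detect_keyword(frame: bytes) -> bool:
    """Placeholder for a real KWS model (e.g. a small DS-CNN/CRNN)."""
    return False

def main():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    frames_left = 0  # frames still to stream after a detection
    with sd.RawInputStream(samplerate=RATE, blocksize=FRAME,
                           channels=1, dtype="int16") as stream:
        while True:
            frame, _overflowed = stream.read(FRAME)
            frame = bytes(frame)
            if frames_left == 0 and detect_keyword(frame):
                frames_left = (RATE // FRAME) * 5  # stream ~5 s of command audio
            if frames_left > 0:
                sock.sendto(frame, BASE_ADDR)      # broadcast the command sentence
                frames_left -= 1

if __name__ == "__main__":
    main()
```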

It’s very important this time to develop in the direction of the audio processing stream, or otherwise it will likely end up as before: a system that is null and void at even minimal levels of third-party noise, because initial audio processing was totally absent from the design.
You do not build a house until you build a foundation, and maybe this time some relevance may be attached to the input audio stream.

So it may be great news for some, but from what I am reading it is still missing all the things whose absence generally caused low adoption and use, in a forum that became very quiet on the realisation of the produced results.

This is wonderful news. Welcome to the HA family :partying_face::partying_face:
You will help move HA into the future in big ways :heart_eyes:

1 Like

Back at the HA 2022.8 release I was surprised and pleased to see Rhasspy become an official HA integration. I use Home Assistant with Rhasspy (base + 3 satellites), MQTT and Node-RED, so I personally welcome closer integration with HA - but I also know that plenty of us use Rhasspy with other platforms, and so Rhasspy needs to remain an independent product.

I wonder, what percentage of us use Rhasspy with HA ?

The 2022.8 announcement also mentioned Rhasspy Junior:

Rhasspy comes in two flavors. If you want to just try out, you can connect a microphone and speaker to the machine running Home Assistant and install Rhasspy Junior as an add-on. Once installed, you can say any of these pre-defined sentences.

This suggests to me a simplified Rhasspy for testing purposes. Is Rhasspy Junior effectively just a more HA-specific installation script and simplified user interface? The predefined sentences and automatic linking with HA domains/entities suggest a more radical departure from mainstream Rhasspy. There is still almost zero information and documentation about Junior.
Michael, what is your vision for Rhasspy Junior?

My 2c worth:

  • Add Rhasspy to the Home Assistant Analytics database. The Rhasspy integration currently shows it’s used by 401 active installations - but I guess that’s only new installs since it became an official HA integration in 2022.8.
  • Default Rhasspy operation should be base & satellites. (Yes rolyan, I know you think the word “satellite” means “bloat”, whereas I believe everyone else just thinks “remote”. The hardware shouldn’t matter, and Rhasspy is already modular enough to support different communications protocols.) Mainly I see it as rearranging the Rhasspy documentation to bring base-satellite configuration to a more prominent position. Possibly the install starts by asking if server (base), satellite (ears), or standalone (all options on the menu)?
  • Auto discovery of Rhasspy satellites would be good, and then allow base and satellites to be managed from the base’s configuration page (get rid of the web interface on satellites). Is there a use case for multiple base systems? Splitting STT and TTS to separate servers?
  • Incredibly keen for low-cost satellite units, such as the ESP32-S3. Hard to recommend a RasPi with reSpeaker HAT.
  • Maybe Rhasspy Junior as a distinct sub-project (or even fork?) with specific goals:
    1. focus on easy installation and simplified user interface for new users - probably removing some of the more advanced features.
    2. close integration with HA, automatically linking the HA entities with pre-defined verbal commands. I think that Michael already did this in Rhasspy Junior.
    3. When someone wants advanced features they can flip a switch to run a conversion from the Junior configuration to full Rhasspy.
2 Likes

“Auto discovery of Rhasspy satellites would be good” - actually, when it comes to a broadcast mic, that is very bad. There is a reason why Google & Alexa use a local button press and an intermediary mobile application to coordinate a secure link: it’s a very secure method that users know and trust.
A satellite has no reason to know about Hass, just as Hass has no reason to know about a satellite, as control is via inference, not broadcast audio.
I don’t mind a new repo for new KWS-triggered broadcast mics on a range of platforms, but there is zero reason for Hass or auto discovery beyond what Bluetooth already provides.

GitHub - nymea/berrylan (Raspberry Pi WiFi setup) already exists under a GPL-3.0 license and just needs its data exchange extending for all platforms.

ESP32-S3 I2S mics are no different to 2-mic HATs or I2S on a Pi; the stock availability of the Pi is the only consideration, as what we can do on an ESP32-S3 has far more limits. There was zero KWS integration or quality control of the 2-mic’s drivers and settings, and a lack of community knowledge, but in operation no noticeable difference is likely if implemented well.

Personally, I think Rhasspy should continue as is, and it would be far better to start again at the start of the audio processing chain with a series of separate Docker containers for each function, with zero of the current Rhasspy methods, as none are needed.
There are just far better, far simpler and more secure ways to do it that require a different approach.

One thing I do agree on is that there is absolutely a case for splitting each module into separate containers/servers. Satellites -> satellite server -> ASR -> skill router -> skill server -> audio server are singular instances purely queued and routed to.
There is absolutely no need for close integration with HA and the exclusion of interoperability, as all that is needed is a HA skill server that merely returns an entity list so the skill router knows where to direct inference.
A HA skill server obviously has close integration, as it is a skill server that provides control on received inference.
Skill servers process inference and provide entity info, so the voice system is completely interoperable with the addition of skill servers for anything you can think of.
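
Roughly like this (a toy sketch; the server URLs and entity lists are made up, and a real router would fetch each list from the skill servers themselves):

```python
# Hypothetical skill servers and the entities/actions they advertise.
# In a real system each server would be queried for its entity list at startup.
SKILL_SERVERS = {
    "http://ha-skill.local:9000": {"living room light", "kitchen light", "thermostat"},
    "http://media-skill.local:9001": {"radio", "playlist", "volume"},
}

def route_inference(action: str, entity: str) -> str:
    """Return the skill server that advertises the given entity."""
    for url, entities in SKILL_SERVERS.items():
        if entity in entities:
            return url
    raise LookupError(f"no skill server advertises entity {entity!r}")

# Example: the inference "turn off the living room light" is routed purely
# on its entity; the chosen server handles the actual control and any dialog.
print(route_inference("turn off", "living room light"))  # -> http://ha-skill.local:9000
```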

2 Likes

That’s great and thank you for this project as it is now.

Wildcards would benefit my work; currently I’m experimenting with adding to slot files via a cron job to give us a slightly more living vocabulary.

1 Like

I just watched the VOD. That is some great news; I am super thankful that your work is getting funded.

1 Like

@fastjack As well as the other repos, there is GitHub - Rikorose/DeepFilterNet (noise suppression using deep filtering), which already has a Rust implementation and can be loaded as a LADSPA plugin, say with PipeWire.

It’s real-time, but I have never checked what the load is when running it, and I forgot to post.

Also, because KW accuracy is far higher than ASR accuracy, KW detection may still work without certain heavy-load speech enhancement, while the command sentence can be processed centrally before ASR.
Even non-linear acoustic echo cancellation can take place centrally, by passing the reference signal with the captured audio so it is in sync for processing.

1 Like

Congrats to the new job!

Will you only focus on Rhasspy or will you work on voice support for Home Assistant in general? I ask because I was very happy when I read that you had joined MycroftAI, and I bought the Mark II because I was very optimistic that it would work nicely with Home Assistant. Will you still support the Home Assistant integration of Mycroft?

That is great news. :+1:

This will include auto discovery of Rhasspy base stations

Just to be clear, this is MQTT discovery, right? Not some weird flaky get-in-the-way mDNS/broadcast based stuff.

so you can say “turn off the living room light” with no configuration

That’ll be optional as well I hope.

create a DIY satellite mic with quality audio input (looking at the ESP32-S3 and some I2S mics ATM)

Lightweight satellites would be a really good feature.

In general the main improvement I’d love to see in Rhasspy is more reliable wakeword detection, in terms of both false detections and missed detections. Whether the best approach is better filtering, beam forming, having more satellites, a different wakeword system, or even just recommending specific microphone hardware, I don’t know. If this improves recognition accuracy as well, even better.

  • Add wildcards and recursive rules to sentence templates. This would let you have templates like “add * to my shopping list” and make it possible to have commands like “do X and Y and Z and …”

I think I have proposed this before: if Rhasspy could simply switch between training sets, a lot of dialog could already be implemented externally via MQTT. Like “add to shopping list” -> speak “what shall I add?” -> switch training set to groceries -> “rhubarb”.
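
For example, something like this (a rough sketch: it assumes the Hermes `hermes/tts/say` topic plus Rhasspy’s HTTP API `/api/sentences` and `/api/train` on the default port, and the grocery sentences are just illustrative):

```python
import json

import paho.mqtt.publish as publish
import requests

RHASSPY_HTTP = "http://localhost:12101"   # Rhasspy web API (default port)
MQTT_HOST = "localhost"

GROCERY_SENTENCES = """\
[AddGroceryItem]
(rhubarb | apples | milk | bread){item}
"""

def ask_and_switch():
    # 1. Ask the question via Rhasspy TTS over the Hermes MQTT bus.
    publish.single("hermes/tts/say",
                   json.dumps({"text": "What shall I add?", "siteId": "default"}),
                   hostname=MQTT_HOST)

    # 2. Swap in the grocery-only sentences and retrain, so the next
    #    recognition only has to pick from grocery items.
    requests.post(f"{RHASSPY_HTTP}/api/sentences",
                  data=GROCERY_SENTENCES,
                  headers={"Content-Type": "text/plain"})
    requests.post(f"{RHASSPY_HTTP}/api/train")

if __name__ == "__main__":
    ask_and_switch()
```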

1 Like

All of that - and what was missing was quite simple: there was no integration and coordination between KWS & beamforming, where the KWS should set a target for that command sentence. The hardware beamformers merely jumped around to the most prominent noise with no guidance, conference style.
There are better filters; DTLN is one of the lightest and actually is really good, but it doesn’t stand a chance on a microcontroller.
A better wakeword system means employing a better model; they are documented, but much of what is currently employed is antiquated.
One of the biggest increases in accuracy would come from providing on-device training, where a small model is trained on captured data carrying the signature of the user and device, and is used to shift the weights of the larger globally trained model.
From there, more powerful filters and processing can create a far better speech enhancement stream for ASR, but they can use a central, more powerful processor.

The reason it was unreliable was that it was either not very good or missing.

The main problem is not only removing noise (current ASR models are already pretty good despite background noise) but other voices (TV, children, etc.).

I’m currently using the Google KWS streaming CRNN TensorFlow model. The accuracy is pretty good and false positives are now almost non-existent (without any audio DSP). It allows me to know who triggered the wakeword and handle action rights accordingly. With this info it would be possible for something like VoiceFilter-Lite to isolate the voice that triggered the wakeword from other noises and background voices just before the ASR processing. No need for noise filters, AEC, etc.
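
For reference, the rough shape of such a model in Keras is something like this (just an illustrative sketch, not the actual kws_streaming implementation):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_crnn_kws(time_steps=98, n_mfcc=40, n_keywords=3):
    """Tiny CRNN keyword spotter: Conv front-end -> GRU -> softmax."""
    inputs = layers.Input(shape=(time_steps, n_mfcc, 1))
    x = layers.Conv2D(16, (5, 3), strides=(2, 1), activation="relu")(inputs)
    x = layers.Conv2D(32, (5, 3), strides=(2, 2), activation="relu")(x)
    # Collapse the frequency axis so the GRU sees one vector per time step.
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.GRU(64)(x)
    x = layers.Dense(32, activation="relu")(x)
    outputs = layers.Dense(n_keywords + 1, activation="softmax")(x)  # +1 for "no keyword"
    return models.Model(inputs, outputs)

model = build_crnn_kws()
model.summary()
```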

Google VoiceFilter (the first version) is not streaming. VoiceFilter-Lite, on the other hand, is streaming, so I’m confident that this kind of DSP software would solve all noise issues without adding too much delay or CPU load.

Unfortunately I’m rubbish regarding ML :sweat_smile:

You are not going to get VoiceFilter-Lite; it’s Google’s crown jewel and locked away.
It is without doubt the definitive method: rather than trying to negate the unknown and leave what should be voice, it does the opposite and, from a simple enrolment, just extracts that targeted voice.

There is so much you could do here in a system: if you capture media such as TV then you have the reference signal for non-linear AEC, especially if you have a complete audio system.
Even beamforming is not standalone; it needs to be integrated, as the KW TDOA envelope gives you the direction of the spoken KW, and that is held until the end of the command sentence.

The only VoiceFilter-Lite is embedded into Google’s Nest smart speakers and Pixel phones, whilst the only open-source ‘target speech extraction’ I know of is GitHub - BUTSpeechFIT/speakerbeam, but I have doubts it is lite.

There was so much missing that, with basic beamforming alone, massive improvements could be made, and even though the models may not be lite, the command sentence can still be processed centrally on more capable hardware than the satellite.

Because of the sporadic nature of speech-control commands, it’s a perfect fit for diversifying the use of a central shared device that just needs the KWS ears for room coverage, where the argmax across distributed devices can be used to select the best stream.

But also, as the Pixel phones very likely do, capturing usage locally and training a small model to add weight to the globally shipped one (custom training on usage) adds much accuracy.

The great thing about Google’s KWS streaming framework is that the same framework can create models for both Arm application SoCs and microcontrollers. My fave is a CRNN, but no recurrent layers exist for microcontrollers, and recurrent layers (GRU & LSTM) also don’t support quantisation optimisation.
This is why you need to build from the ground up: there are models, libs and methods available, but the devices with the fewest options set the dev path so they are not excluded.
So a DS-CNN rather than a CRNN, and likely much should revolve around what can be done on an ESP32-S3, as Home Assistant devices often are ESP32-based, even though we are using the bigger S3 model because it does have ML vector instructions.

Personally I groan at the idea of C on an ESP32-S3, but we can pick faster dev paths with Python on a Pi as proof of concept, as long as those models and methods include microcontrollers and there is a clear path.

Nice work Mike! I hope both Rhasspy and HA get a boost from this :slight_smile:

2 Likes

Thank you everyone!

I do plan to create a Rhasspy-based image for the Mark II. I really like the hardware, though I do wish they’d also sell the Pi hat board separately. I think a Mark II image that has Rhasspy pre-installed, connects to Home Assistant, and lets you simply display a web page via an MQTT message would be very convenient.

This is an important question. I don’t want Rhasspy to have to be all things to all people. Instead, I’d rather have a foundational set of pieces that can be composed into useful things. And by foundational, I mean likely creating C/C++ or Rust libraries/programs that can be faster and used almost anywhere, and then wrapping them with Python/MQTT/Hermes/etc.

For the HA side of things, it will probably be mDNS. But Rhasspy natively could discover satellites over MQTT, and there’s no reason the Rhasspy HA integration couldn’t do both.
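
As a rough sketch of the mDNS side with python-zeroconf (the `_rhasspy._tcp.local.` service type here is hypothetical; nothing like that is registered yet):

```python
from zeroconf import ServiceBrowser, Zeroconf

SERVICE_TYPE = "_rhasspy._tcp.local."  # hypothetical service type

class SatelliteListener:
    def add_service(self, zc, type_, name):
        info = zc.get_service_info(type_, name)
        if info:
            print(f"Found satellite {name} at {info.parsed_addresses()}:{info.port}")

    def remove_service(self, zc, type_, name):
        print(f"Satellite {name} went away")

    def update_service(self, zc, type_, name):
        pass

zc = Zeroconf()
browser = ServiceBrowser(zc, SERVICE_TYPE, SatelliteListener())
input("Browsing for satellites, press Enter to stop...\n")
zc.close()
```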

This absolutely. With a relatively small dataset from the user, we could simultaneously improve wake word detection, silence detection, and ASR with a little on-device training. If the base station doesn’t have to be a Raspberry Pi, there are many more options available too. A dirt cheap PC with a previous generation’s GPU would run circles around the Pi.

2 Likes

I think we might be able to do on-device training for small open-source models, but for ASR, as it increases in accuracy and complexity, that looks extremely doubtful - or doubtful that there is even a need.
If you take OpenAI’s Whisper, trained on 680,000 hours of 16 kHz sampled multilingual audio, even a small model is likely far beyond available hardware, as it’s just massively huge compared to a KWS.

The investment in KWS accuracy is important because the low-power devices that may be chosen mean certain speech enhancements that can run centrally are likely absent where the KWS runs locally.
The devices you run have hardware specifics and create a signature that will be more accurate with device-specific model training, and nearly all inaccuracies and many noise problems have come from a total lack of initial audio processing, which in the above seems to have been skipped once more?

Also, once more, there is zero reason or need to integrate Rhasspy & HA; the HA skill server needs to integrate with HA, just as a media server may integrate with an audio server, and both are purely driven by inference.
Each particular skill server has its own way of processing inference, be it Almond, web search, audio, video, security and control as in HA, or whatever you can think of, and it should be obvious that when providing a rich set of skills like these, the dependencies and complexity need partitioning into containers so that, as a system, it is manageable and simple.
A Linux voice system is a series of queues and routes, a simple conveyor of incoming voice that ends in inference, where the final queue and route is purely about which skill server the inference is sent to.
What decides that is a simple skill router that uses rudimentary NLU/NLP to extract the predicate and subject of the inference contents and pick which skill server it belongs to based on that predicate and subject.
These are just skill actions and entities that are returned from the skill servers.
Skill servers process dialog, not the voice system; a voice system, as in the name ‘voice’, either converts from voice to inference or from text to voice via the two servers of ASR & TTS.

An array of KWS devices will connect to a KWS server that assigns them to zone(s). The process is serial: the best stream is taken from a zone, and with zones the first in takes precedence, where it is queued to see if the speech enhancement server instances are available before doing the same for ASR.
If concurrent users start to clash you just provide more instances, so it’s scalable.
It’s a voice system and has no need for the complexity of system integration, as that is what skill servers do.
There are also only two types of queues and routes, for either audio or text input, where at any point you can set a debug flag and leave a history of the actual files, without the need for a complex programmatic abstraction such as Hermes, which turned an extremely simple serial workflow of connected process servers into something complex for no reason.
There is no need for such a thing as Rhasspy, just the same repeatable queue routers for any voice module you choose to put in a container.
Many high-quality ones already exist, and it’s just wasted dev time refactoring code into branded versions of the same whilst there are steps in the chain of poor quality, or certain process servers that just don’t exist.
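
As a toy illustration of that conveyor (everything here is a stub; real stages would be separate containers/servers rather than coroutines in one process):

```python
import asyncio

async def stage(name, inbox, outbox, work):
    """Generic queue-to-queue worker: take an item, process it, pass it on."""
    while True:
        item = await inbox.get()
        result = work(item)
        print(f"{name}: {item!r} -> {result!r}")
        if outbox is not None:
            await outbox.put(result)

async def main():
    kws_q, enhance_q, asr_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()

    # Stub processing functions standing in for real servers.
    tasks = [
        asyncio.create_task(stage("enhance", kws_q, enhance_q, lambda a: f"clean({a})")),
        asyncio.create_task(stage("asr", enhance_q, asr_q, lambda a: f"transcript of {a}")),
        asyncio.create_task(stage("router", asr_q, None, lambda t: f"-> skill server for {t}")),
    ]

    await kws_q.put("command-sentence audio")  # a KWS device broadcast arrives
    await asyncio.sleep(0.1)                   # let the conveyor drain
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())
```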

Hi,

Congrats on the new job. I’m new here but from what I can see it is well deserved. You built a nice product and community here.

Some things I hope to see in the future:

I hope we still maintain the ability to deploy Rhasspy independent of Home Assistant. I am an HA user and love it, but I am also a firm believer that sometimes a mix is a good thing.

I hope the Rhasspy addon for Home Assistant will enable it as a media player device in HA. Being able to cast any audio to it would be killer.

Skills management! Going back to my previous point about still being able to keep things separate: I am finding the ability to build skills that hook directly into the Hermes MQTT bus very useful for things that aren’t directly HA - my first one is Jellyfin, but I’m also thinking CalDAV soon. These could probably be done through HA, but doing them through each app’s native API is probably easier and more powerful. But I am finding that once I have the skills created, finding a standard way of hosting one or more of them is difficult at best.
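
For anyone curious, the core of such a Hermes skill is tiny - roughly something like this (a sketch with paho-mqtt; the `PlayJellyfin` intent name is made up and the Jellyfin API call itself is omitted):

```python
import json

import paho.mqtt.subscribe as subscribe

def on_intent(client, userdata, message):
    payload = json.loads(message.payload)
    session_id = payload.get("sessionId")
    # ... call the Jellyfin API here (omitted) ...
    client.publish("hermes/dialogueManager/endSession",
                   json.dumps({"sessionId": session_id,
                               "text": "Okay, playing it on Jellyfin"}))

# Block and handle every PlayJellyfin intent coming over the Hermes bus.
subscribe.callback(on_intent, "hermes/intent/PlayJellyfin", hostname="localhost")
```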

Loving the project so far though. Hoping to be able to contribute some useful hermes apps soon!

1 Like