The Future of Rhasspy

Congratulations, Michael, on the new job - it's great that you are working on something you love, being appreciated for it financially, and getting closer to your family!

I’ve had a quick look at Mycroft, and am also of two minds. Both projects will benefit from a closer association. I just hope that it continues the way you hope.

What is your relationship with Nabu Casa? If they can dedicate a programmer to ESPHome (15% of their user base), then why are they not doing more to provide an offline voice assistant to the 75% who currently use the big commercial ones?

1 Like

Oh nice. Didn't know about it. Will give it a try when I find some time.

Just a suggestion: I think it would be great to add somewhere early on the site that Home Intent is based on Rhasspy (with a link to this community) and is for use with Home Assistant. Or that it is the bridge between Rhasspy and Home Assistant, or something like that. As it is now, I would have thought "oh, a new alternative to Rhasspy" if I randomly found the website.

Do you have any plans to integrate something like "apps" or scripts for functions that are more than just Home Assistant intents? (Maybe just add a simple UI to enter Python scripts that use some of the APIs developed by others in this community.)

Edit: I should have read more of the documentation before asking about the script/app issue. The Component feature is for that. Then my question would be: what about custom components, and some form of repository for those?

Ah, I do actually mention it right away on the GitHub page. I'm not sure where people head first, but yeah, I'll update the homepage of the docs to indicate it's based on Rhasspy!

I do like the idea of a repository of custom components that are accessible via the UI! I don’t know of anyone who has written a custom component yet, but it’s definitely a consideration for down the line.

My pet beef with Mycroft was an uneasy feeling that it presented itself as far beyond what it was actually capable of.
It's only in the Mark II that they actually process the incoming audio with DSP echo cancellation and beamforming, and Rhasspy suffered the same problem: in the presence of noise it was really unusable.
The delays, the cancelled crowdfunding campaigns, and this business with 'patent trolls' just had my alarm bells ringing.
They have a new CEO and we will have to see which way it goes, but there is a strong possibility you could be right.

synesthesiam, like everyone, needs money and has to work, and if you have got to work, you have to work.
I am not a fan of Mycroft at all, because to me it still seems the goal is to present the semblance rather than the real thing, which is why I am dubious.
As for working for them, his core skills are bang on the button and the guy's got to work. Even though I am not a fan of the company, likely 99% of us are not doing exactly what we want but doing our best to earn money.

I have told synesthesiam what I think of Mycroft, but I don't and shouldn't have any opinion on what someone needs to do to earn a living.

2 Likes

I’ve spoken with Paulus (creator of Home Assistant) a few times. He definitely sees the value in local voice assistant tech for HA. I’m hoping to collaborate with Nabu Casa through Mycroft; who knows what will become of that :wink:

Looking at the history of Mycroft, I agree. They were talking about full-on conversational assistants years ago, and the software they have today is nowhere even close to that. The new CEO seems much more down to earth, though he's still forward-thinking.

On this topic, I think people may not have realized that my last job was for the U.S. military. So even with the many reservations folks have about Mycroft, it’s way better than the alternative! I’m grateful to my previous employer for getting me off to a good start, of course, but my work would have ultimately been shut down or relicensed.

6 Likes

Hi @synesthesiam, first of all, congratulations on your new job. It's always good to work on something you have fun with :)

Would be great to continue the concepts we started in the other thread.

If you can work with a non-commercial-only restriction, I might have a nice dataset source for your training.


Regarding your questions about missing features in Mycroft, for me the most important ones, which were also some of the reasons for building Jaco, were:

  • Missing full offline capability
  • Problems with recognition accuracy because of the missing language model adaptation
  • Similar to Snips, the skills only supported Python, not arbitrary languages and dependencies

And now in comparison to Jaco it’s also missing the skill privacy concept and some of the modularity.


By the way, it would be great if you could add Mycroft with your new STT approach to the benchmarks: Jaco-Assistant / Benchmark-Jaco · GitLab

Greetings, Daniel

1 Like

I thought your announcement was a bad dream when you first made it, but alas it is not.

Seems I'm a bit late to the game again: first it was Snips before the Sonos purchase, and now Rhasspy. I guess if you end up shutting Rhasspy down, at least what I have now will still work.

There are a few things that I don't like about Mycroft. First is the obvious: it's cloud-based. From what I've been reading, it appears that, compared to Siri, Mycroft does less to protect its users' personal data than Apple does.
The second is Mycroft's terrible wake word detection. Now, to be fair, the last time I had Mycroft running was nearly 2 years ago, but the experience was terrible. Snips detected everyone in the house for 'Hey Snips'. There wasn't a blasted thing I could do to get Mycroft to detect my wife or kids when they wanted Mycroft's attention. The FAF (Family Approval Factor) was zero, which led me to Rhasspy.

I hope you will have sway with Mycroft, but I fear it won't be enough and we'll lose possibly the best offline assistant around.

1 Like

Can we sadly say that we will never see Rhasspy 2.5.12?

After Snips, and maybe now Rhasspy, which is still far better than anything else, I sadly ask myself whether to go the Google/Alexa route rather than relive having to redo everything one more time … :cry:

Of course Rhasspy still works nicely, but if you are standing still, you are actually going backwards …

Congratulations on your new job!

It seems very positive; Mycroft will probably reuse some parts of Rhasspy and strengthen it.

Mycroft has a business model, plus people paid to work on a voice assistant. That's what is missing from the current Rhasspy project if it is to one day compete with the big ones.

1 Like

Mycroft is small enough that I’m not worried about not having enough sway. Offline speech to text and text to speech are already being promised for the Mark II’s release :slight_smile:

A difference with Rhasspy, of course, is that we need to find some way of funding development. I proposed that we offer to train custom voices or speech to text models for businesses, and use that money to keep the lights on.

I absolutely agree (though I would argue Raven is even worse :laughing:). I've already started working on a new wake word system (based on this paper). Let's hope it performs better in the end!

No, I’m not abandoning the project. I have some stuff in the works for 2.6, in fact! Besides spare time, what’s holding me up right now is that so many things have changed in the past year that need to be integrated.

For example, Larynx has grown up as a TTS system and (as “Mimic 3” under Mycroft) is fast enough to be useful on a Pi. Additionally, Vosk and Coqui STT are now mature and ready for use (though they can’t be re-trained like my Kaldi system).

Another exciting thing I've been working on is a hybrid STT system that can recognize fixed commands and fall back to an open system like Vosk/Coqui for everything else.
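
To illustrate the idea, here is a minimal sketch of what such a two-pass setup could look like with Vosk; it is purely illustrative and not the actual implementation, and the model path, command list, and the exact-match rule for deciding when to fall back are all assumptions:

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Assumed model path and command set - adjust for your own setup.
MODEL_DIR = "vosk-model-small-en-us-0.15"
COMMANDS = [
    "turn on the kitchen light",
    "turn off the kitchen light",
    "what time is it",
]

model = Model(MODEL_DIR)


def transcribe(recognizer: KaldiRecognizer, wav_path: str) -> str:
    """Run a recognizer over a 16 kHz mono WAV file and return the text."""
    with wave.open(wav_path, "rb") as wav:
        while True:
            data = wav.readframes(4000)
            if not data:
                break
            recognizer.AcceptWaveform(data)
    return json.loads(recognizer.FinalResult()).get("text", "")


def hybrid_transcribe(wav_path: str) -> str:
    # First pass: grammar-constrained recognizer over the fixed commands.
    # The "[unk]" entry lets out-of-grammar speech surface as unknown
    # instead of being forced onto the nearest command.
    grammar = json.dumps(COMMANDS + ["[unk]"])
    constrained = KaldiRecognizer(model, 16000, grammar)
    text = transcribe(constrained, wav_path)

    if text in COMMANDS:
        return text  # clean fixed-command match

    # Second pass: open-ended recognizer for everything else.
    open_rec = KaldiRecognizer(model, 16000)
    return transcribe(open_rec, wav_path)


print(hybrid_transcribe("example.wav"))
```

A real system would use word-level confidences or the extent of "[unk]" spans to decide when to fall back, but the two-pass structure is the core of the idea.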

Thanks! I think the future looks bright :slight_smile:

3 Likes

What do you mean by this?
I've been using Coqui and DeepSpeech before this in my homebrew Node-RED pipeline, and I train a custom scorer based on my own domain-specific language model, which also adds new vocabulary. I actually find it a lot easier than Kaldi for this.

I mean re-trained from scratch quickly on-device. You can definitely create a custom scorer for Coqui STT/Deepspeech, but adding new vocabulary/sentences to the pre-trained scorers isn’t possible (as far as I know) without recreating the language model or doing an expensive merge on the n-gram counts.

Both Vosk and Coqui STT let you boost existing vocabulary at runtime, which is awesome. My goal with the hybrid STT is to allow for fast re-training of fixed commands, but have it “know what it doesn’t know” and let Vosk/Coqui do what they do best (open-ended transcription).
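
As a rough illustration of that runtime boosting (a minimal sketch only; the model, scorer, and boost values are placeholder assumptions, and this is Coqui STT's hot-word API rather than anything Rhasspy-specific):

```python
# Minimal sketch: boost words that already exist in the scorer's vocabulary
# at runtime, without rebuilding the language model. Paths and boost values
# are placeholders.
import wave

import numpy as np
from stt import Model  # Coqui STT Python bindings

model = Model("model.tflite")
model.enableExternalScorer("huge-vocabulary.scorer")

# Positive boosts make these words more likely; negative values suppress them.
model.addHotWord("kitchen", 10.0)
model.addHotWord("larynx", 5.0)

with wave.open("example.wav", "rb") as wav:  # assumed 16 kHz mono 16-bit PCM
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))
```

Words the scorer has never seen still require rebuilding the language model, which is exactly the limitation described above.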

1 Like

Dunno, @synesthesiam, phonetic pipeline-based systems are OK, but newer 'end-to-end' ASR seems to be providing better accuracy nowadays, even if it is a single all-in-one model.
'End-to-end' isn't the best description, but that is roughly how the two approaches are differentiated; 'end-to-end' systems can also be lexicon-free and able to handle out-of-vocabulary (OOV) words, so they don't need any re-training.
Such as https://github.com/flashlight/flashlight/tree/main/flashlight/app/asr

If only Google's new Tensor offline ASR were open source, but you can only wish.
Why fixed commands, as aren't they far more inflexible than an end-to-end ASR with intent decoding by NLP?
I thought the infrastructure of Rhasspy was designed for low load, whereas isn't your target now a Pi 4?

ASR would benefit from transfer learning, given dataset and model sizes, where a local capture model can apply weights to a language model.
I think that is what Google is doing with their new ASR, as it supposedly learns a specific user's intonation and word patterns.
I have never really concentrated on ASR, as the input chain seems to have its weaker parts earlier in the pipeline, at capture and initial keyword spotting, so I never really progressed further up.
The quality and consistency of the captured audio relative to the training dataset is really important, and if it's not right at the start, recognition further up the pipeline will degrade.
So the new audio board is an improvement; even though beamforming alone is not what much of the state of the art employs, it is a step forward, and I guess many are wondering where this will take Rhasspy and Mycroft.

I have yet to see any end-to-end models that allow you to add new sentences/vocabulary on-device, and also run efficiently on a Pi. Oh, and don’t forget that more languages than English exist :wink:

To me, the hybrid approach (fixed commands + open local ASR) is what Rhasspy is all about: offline user-defined voice commands. The added flexibility of a fallback is great, but the overall point is to have the user train the system and not the other way around. I want to say a command in whatever way I want, pronouncing words the way I do, and never have it leave my house.

I am focused on the Pi 4 now, but with 2GB of RAM. So the flashlight models you linked (AM + LM) couldn’t even be loaded!

1 Like

Yeah, I guess you're right; the Pi probably isn't the best platform for AI, lacking a decent GPU or NPU.
If an end-to-end model runs with a lexicon, then you can just add to the lexicon. Flashlight is a research framework and C++, so I didn't expect you to be running it; it just has parameters for everything and was posted as one example.

I was opening things up for discussion, as the old system is getting quite dated compared to the rapidly moving voice AI scene, and I was wondering if you were working on something more current and flexible?
2 GB is more than enough for many models; as far as I am aware there weren't any models in the link provided, just some details of where Facebook research is publishing.
The old system could, with a squeeze, reside on the original Zero, which is sort of indicative, whereas a 2 GB Pi 4 has considerably more scope, even if its GPU isn't great for AI acceleration and it lacks an NPU.

I think it's probably had its day, but if you are going to take the effort to maintain it, then great.

The reason Flashlight is of interest is that it is C++, which is interesting because Rhasspy-like infrastructure is now available on microcontrollers. I don't agree that the models are huge: as the ESP32-S3-Box demonstrates, you can do it much more cost-effectively, and it has the added advantage that audio processing lends itself better to the RTOS DSP of a microcontroller than to an application SoC.

I was purely wondering if you were working on anything new, as this arena has been hugely fast-paced and has changed dramatically, with industry-leading offline ASR embedded in mobile phones and simpler systems utilising tiny low-power devices such as the ESP32-S3.

Rolyan, how are you going with your own ESP32-S3-Box? Is it ready for real-world use?

The ESP32-S3 hardware does have distinct benefits for AI applications, and the demo looks impressive (as demos are supposed to). The demo appears to oversell it as capable of acting as both a voice assistant and a full-featured home automation controller … yet I see a big contrast between the espressif/esp-box repository and the activity on the Rhasspy or Home Assistant repositories.

I don’t recall anyone ever suggesting Raspberry Pi was suited for AI. What it is, is affordable and (until recently) a freely available general purpose platform. Rubbish it all you like just because it isn’t your ideal platform, but I don’t see Raspberry Pi going away.

Look, you can be a Pi fan and say it is affordable, and yes, the ESP32-S3-Box works: because it has an audio pipeline containing AEC + BSS, in tests it works better than a Pi running Rhasspy, which lacks even simple audio processing.
It's not just Rhasspy: all the Linux hobby projects have been missing the essential initial audio processing that is an absolute must for what are considered basic voice AI standards.

It is not affordable as a voice AI, as Mycroft clearly demonstrates with a $300 unit that offers little over a $50 unit and is completely inferior to commercially available products in the $50 to $100 range. That is the reality, and yeah, some hobbyists will build one for fun, but that is all they are doing.

There are loads of projects that the Pi does really well, but the lower end, from the original Zero to even the Pi 3, running Python for voice AI doesn't work well, for specific reasons I often mention because I am being objectively honest and not just a fanboy.

You would not run a voice assistant and a full-featured home automation system together, because they are functionally distinct and each benefits from running on hardware suited to it.
Cars don't have toilets because it's generally considered there are better places to take a dump, and bloating a single system is generally bad practice that will often land you in the shit.

But all the above is not the question, nor does it have anything to do with what I was asking. I was presuming, because of Mycroft and as synesthesiam confirmed, that the focus is now a 2 GB Pi 4, which has far more processing power than the initial Zero, and I was asking if there is anything new in the pipeline that is more capable than a system that had very modest roots.
With TTS we have seen this with Larynx, which really needs a minimum of that 2 GB 64-bit Pi 4 to run well; all I am doing is asking if there is anything planned.

The ESP32-S3-Box was purely a demonstration that ASR models are generally getting smaller, and I have no idea why synesthesiam thinks there are models that will not fit in a 2 GB Pi 4.
I mentioned Flashlight to sidestep my own opinion, which is that Vosk is now a better option; other elements have evolved while the core ASR has pretty much stayed the same, even as rapid changes are being made elsewhere.

So how are you doing, Donburch, with your own Rhasspy Pi? Is it ready for real-world use? I don't make any false claims about the ESP32-S3-Box, as some others do with certain hardware and infrastructure.

The thread is 'The Future of Rhasspy', and I was asking because in certain respects it has stayed static.
It will be interesting to hear what Upton says on the 28th (The Pi Cast Celebrates 10 Years of Raspberry Pi: Episodes With LadyAda, Eben Upton, and More | Tom's Hardware), as I am hoping we might get something like a Pi 4A where the A is AI: minus USB 3, the spare PCIe lane could bring an on-board Raspberry NPU, as the Pi is starting to lose huge ground in this area.
But that is just discussing the future and what are likely becoming essential requirements.

I personally feel the big processes of voice AI, TTS and STT, can be shared centrally, be it x86 and a GPU or what I have preordered, a Rock 5 with a 6 TOPS NPU, serving many ears of distributed per-room KWS, to finally get real-world use at low cost.
If an application SoC is not a great platform for audio DSP processing, then partition the processing to what each part is great for; that is how I see the Raspberry Pi plus satellite ESP32-S3 KWS, using both for what they are good at and not for what they are not.
I am also waiting for the Radxa Zero 2, which has a 5 TOPS NPU, but until we get the cost-effectiveness of a Pi with an NPU, the Pi is not a great platform for AI unless the load is light, and that is just fact.

Dan Povey talk from 04:38:33 “Recent plans and near-term goals with Kaldi”
https://live.csdn.net/room/wl5875/JWqnEFNf

Tara Sainath: "End-to-end (E2E) models have become a new paradigm shift in the ASR community"

Do you have anything to share that could be the future of Rhasspy or any cost effective VoiceAI?

Thanks for the link, it was great to hear the latest from Dan. I agree with him on the need for lexicons of some sort, and am happy that their new Kaldi stuff will stay on that path. It’s also becoming clear that I need to seriously consider using byte-pair encoding as an alternative to phonemization.
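
For anyone curious what that switch would look like in practice, here is a minimal sketch using SentencePiece; the corpus file name and vocabulary size are arbitrary assumptions, not Rhasspy settings:

```python
# Minimal sketch: train a small BPE model on a sentence corpus and look at
# the subword units it produces in place of a phoneme lexicon.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="sentences.txt",        # one training sentence per line (placeholder)
    model_prefix="commands_bpe",
    model_type="bpe",
    vocab_size=500,
)

sp = spm.SentencePieceProcessor(model_file="commands_bpe.model")

# Instead of a word-to-phoneme lexicon, the recognizer's output units become
# data-driven subword pieces:
print(sp.encode("turn on the kitchen light", out_type=str))
# e.g. ['▁turn', '▁on', '▁the', '▁kit', 'chen', '▁light'] - actual pieces depend on the corpus
```

The appeal is that unseen words decompose into existing pieces, so out-of-lexicon vocabulary doesn't need hand-written pronunciations.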

Please consider the bigger picture when making such statements. Why is that $50 unit $50 and not $300? Because they can manufacture a million of them and sell them at a loss. Why? Because it's more profitable to spy on you than to just sell a smart speaker.

I think the most important next step for Rhasspy is finding a way for others to more easily contribute updates to existing services or add new services. Changes are happening so rapidly that I obviously can’t keep up.

I’ve struggled for a while to come up with a better architecture that would allow for people to easily download new services, but there are so many unique use cases that I keep scrapping it :confused:

I have, and the bigger picture is that Google's next-gen ASR/KWS system is completely offline and is only online when you use a service, as in grabbing the news or weather, or playing music, YouTube or whatever.
There is no bigger picture when it comes to such a commercial difference: they can subsidize the product with services, but much of the cost has nothing to do with selling them at a loss and is purely down to the huge economies of scale the big guys have.
It's likely we are not far off next-gen smart AI with onboard NPUs, like the Pixel 6 giving approximately 4 TOPS in 5 watts to enable offline, off-grid ASR, but it may be quite a few years yet and we will have to wait and see; sadly there were no new announcements from Raspberry Pi on their 10-year birthday.

It's interesting to watch you go into the Mycroft sales speech, as now I guess you have to, but for me $300+ buys some really cool alternative options that offer more, sound much better, look much better and work much better, and many of them cost less.
That is just my opinion, and I am going to sit back and see how you guys do and how the reviews come in when it's released, but much of the cost of the Mycroft Mark II is down to the design and the economies of scale chosen.
I just have a minimum level of expectation, because I have used the latest full-size Echo and Nest Audio and have a reference point, and even though dev-wise I tinker, a Rhasspy or Mycroft would likely end up causing a strop and going in the bin if it were for actual use.
I eventually went for 2x Nest Audio in a stereo pair that I managed to pick up for just over £100.
I don't use them all that often, but when I do it's mainly music and news while I am doing something, and those services are online anyway; for me that is Spotify Free, and I put up with the adverts.
I think the Echo 4 sounds better and also has a zero-latency aux in, but I went with the Nest Audio because I think the recognition is slightly better, and that is what matters to me with a voice AI; the disparity with open source is huge and is my biggest problem, and privacy becomes a tin-foil concern when things run so badly.
But hey, that is just me, and I keep my interest up purely through developments and what is current in hardware and open source; there is some very interesting stuff out there, but for me it's not Mycroft.
When my privacy is going to cost me $300 and not work to my expectations, while I can't be bothered that Google and Spotify might have an inkling of my taste in music, you can guess what I am going to plump for.

I am here because of my interest in AI and what's generally happening, but your spy scare stories mean very little to me, as I am still likely to use voice AI for online services.
The only offline thing is occasional alarms. The speakers are in the kitchen/lounge of a relatively small flat and meant I could ditch the hi-fi, as for what I use them for they are good enough, which again is where the Mark II is lacking.

I’m not trying to scare you, I’m just saying that spying, etc. is part of the total cost. Like with environmental externalities and poor labor practices, sometimes the final consumer price is not the only thing that matters.

I don’t have to, but I would certainly like to continue working on open source voice tech. It would be especially disappointing to have the Mark II (and Mycroft) fail because people who don’t value privacy over price go around complaining that it’s not an Echo.

1 Like