The Future of Rhasspy

I’ve spoken with Paulus (creator of Home Assistant) a few times. He definitely sees the value in local voice assistant tech for HA. I’m hoping to collaborate with Nabu Casa through Mycroft; who knows what will become of that :wink:

Looking at the history of Mycroft, I agree. They were talking about full-on conversational assistants years ago, and the software they have today is nowhere even close to that. The new CEO seems much more down to earth, though he’s still forward-thinking.

On this topic, I think people may not have realized that my last job was for the U.S. military. So even with the many reservations folks have about Mycroft, it’s way better than the alternative! I’m grateful to my previous employer for getting me off to a good start, of course, but my work would have ultimately been shut down or relicensed.


Hi @synesthesiam, first of all, congratulations on your new job. It’s always good to work on something you have fun with :)

Would be great to continue the concepts we started in the other thread.

If you can work with a non-commercial-only restriction, I might have a nice dataset source for your training.

Regarding your questions about missing features in Mycroft, for me the most important, which also were some of the reasons to build Jaco, were:

  • Missing full offline capability
  • Problems with recognition accuracy because of the missing language model adaptation
  • Similar to Snips, the skills only supported Python, not arbitrary languages and dependencies

And now in comparison to Jaco it’s also missing the skill privacy concept and some of the modularity.

By the way, it would be great if you could add Mycroft with your new STT approach to the benchmarks: Jaco-Assistant / Benchmark-Jaco · GitLab

Greetings, Daniel


I thought your announcement was a bad dream when you first made it, but alas it is not.

Seems I’m a bit late to the game again, first it was Snips before the Sonos purchase and now Rhasspy. I guess if you end up shutting Rhasspy down at least what I have now will still work.

There are a few things that I don’t like about Mycroft. First is the obvious: it’s cloud-based. From what I’ve been reading, it appears that when compared to Siri, Mycroft does less to protect their users’ personal data than Apple.
The second part is the terrible detection Mycroft has for wake words. Now, to be fair, the last time I had Mycroft running was nearly 2 years ago, but the experience was terrible. Snips detected everyone in the house for ‘Hey Snips’. There wasn’t a blasted thing I could do to get Mycroft to detect my wife or kids when they wanted Mycroft’s attention. The FAF (Family Approval Factor) was zero, which led me to Rhasspy.

I hope you will have sway with Mycroft, but I fear it won’t be enough and we’ll lose possibly the best offline assistant around.


Can we sadly assume that we will never see Rhasspy 2.5.12?

After Snips, and maybe now Rhasspy, which is still far better than anything else, I’m sadly asking myself whether to go the Google/Alexa route, just to avoid having to redo everything one more time … :cry:

Of course Rhasspy still works nice, but if you are standing still, you are actually going backwards …

Congratulations on your new job!

It seems very positive; Mycroft will probably reuse some parts of Rhasspy and empower it.

Mycroft has a business model, plus people paid to work on voice assistants. That’s what’s missing from the current Rhasspy project if it is to one day compete with the big ones.


Mycroft is small enough that I’m not worried about not having enough sway. Offline speech to text and text to speech are already being promised for the Mark II’s release :slight_smile:

A difference with Rhasspy, of course, is that we need to find some way of funding development. I proposed that we offer to train custom voices or speech to text models for businesses, and use that money to keep the lights on.

I absolutely agree (though I would argue Raven is even worse :laughing:). I’ve already started working on a new wake word system (based on this paper). Let’s hope it performs better in the end!

No, I’m not abandoning the project. I have some stuff in the works for 2.6, in fact! Besides spare time, what’s holding me up right now is how many things have changed in the past year that still need to be integrated.

For example, Larynx has grown up as a TTS system and (as “Mimic 3” under Mycroft) is fast enough to be useful on a Pi. Additionally, Vosk and Coqui STT are now mature and ready for use (though they can’t be re-trained like my Kaldi system).

Another exciting thing I’ve been working on is a hybrid STT system that can recognize fixed commands and fall back to an open system like Vosk/Coqui for everything else.

Thanks! I think the future looks bright :slight_smile:


What do you mean by this?
I’ve been using Coqui and DeepSpeech before this in my homebrew Node-RED pipeline, and I train a custom scorer based on my own domain-specific language model, which also adds new vocabulary. I actually find it a lot easier than Kaldi for this.

I mean re-trained from scratch quickly on-device. You can definitely create a custom scorer for Coqui STT/Deepspeech, but adding new vocabulary/sentences to the pre-trained scorers isn’t possible (as far as I know) without recreating the language model or doing an expensive merge on the n-gram counts.
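To make that cost concrete, here is a toy pure-Python illustration (my own sketch, not any real toolkit’s API) of why adding sentences forces a rebuild: even a simple bigram model’s probabilities depend on corpus-wide counts, so new vocabulary changes denominators everywhere rather than just appending new entries.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams over a corpus (with a <s> start marker)."""
    bigrams, unigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return bigrams, unigrams

def bigram_prob(bigrams, unigrams, prev, word):
    """Maximum-likelihood P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]
```

For example, training on just “turn on the light” and “turn off the light” gives P(on | turn) = 0.5; adding “turn up the volume” silently changes it to 1/3. Every probability conditioned on an affected word has to be recomputed, which is why you can’t simply bolt new entries onto a pre-built scorer.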

Both Vosk and Coqui STT let you boost existing vocabulary at runtime, which is awesome. My goal with the hybrid STT is to allow for fast re-training of fixed commands, but have it “know what it doesn’t know” and let Vosk/Coqui do what they do best (open-ended transcription).


Dunno @synesthesiam, phonetic pipeline based systems are OK, but newer ‘end-to-end’ ASR seems to be providing better accuracy nowadays, even if it is a single all-in-one model.
‘End-to-end’ isn’t the best description, but that is roughly how the two are differentiated. ‘End2end’ can also be lexicon-free and able to handle out-of-vocabulary (OOV) words, so it doesn’t need any re-training.
Such as

If only Google’s new Tensor offline ASR was open source, but you can only wish.
Why fixed commands, as aren’t they far more inflexible than an end2end ASR with intent decoding by NLP?
I thought the infrastructure of Rhasspy was for low load, whilst isn’t your target now a Pi 4?

ASR would benefit from transfer learning, due to dataset and model size, where a local capture model can apply weights to a language model.
I think that is what Google are doing with their new ASR, as supposedly it learns specific user intonation and word patterns.
I have never really concentrated on ASR, as the input chain seems to have weaker parts of the pipeline at capture and initial keyword, so I never really progressed further up.
The quality and consistency of capture against the dataset is really important, and if it’s not right at the start, recognition further up the pipeline will degrade.
So the new audio board is an improvement; even though beamforming alone is not what much state of the art employs, it is a step forward, and I guess many are wondering where this will take Rhasspy and Mycroft.

I have yet to see any end-to-end models that allow you to add new sentences/vocabulary on-device, and also run efficiently on a Pi. Oh, and don’t forget that more languages than English exist :wink:

To me, the hybrid approach (fixed commands + open local ASR) is what Rhasspy is all about: offline user-defined voice commands. The added flexibility of a fallback is great, but the overall point is to have the user train the system and not the other way around. I want to say a command in whatever way I want, pronouncing words the way I do, and never have it leave my house.
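The dispatch logic behind that hybrid idea can be sketched in a few lines. This is my own hedged illustration, not Rhasspy’s actual API: the recognizer objects, the threshold value, and the function names are all hypothetical stand-ins.

```python
# Sketch of a hybrid STT dispatcher: try the fixed-command recognizer
# first, and treat low confidence as "I don't know this one", handing
# off to an open ASR engine (e.g. Vosk/Coqui) for open transcription.

CONFIDENCE_THRESHOLD = 0.8  # made-up cutoff; would need tuning

def transcribe_hybrid(audio, fixed_recognizer, open_recognizer,
                      threshold=CONFIDENCE_THRESHOLD):
    """Return (text, source) where source is 'fixed' or 'open'.

    Both recognizers are stand-in callables: fixed_recognizer returns
    (text, confidence); open_recognizer returns text.
    """
    text, confidence = fixed_recognizer(audio)
    if confidence >= threshold:
        return text, "fixed"
    # The fixed model "knows what it doesn't know": fall back to open ASR.
    return open_recognizer(audio), "open"
```

The design point is that the fixed-command model stays tiny and user-retrainable, while the open engine never needs retraining at all; only the routing decision sits in between.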

I am focused on the Pi 4 now, but with 2GB of RAM. So the flashlight models you linked (AM + LM) couldn’t even be loaded!


Yeah, guess you’re right, the Pi probably isn’t the best platform for AI, lacking a decent GPU or NPU.
If end-to-end is running with a lexicon, then just add to the lexicon. Flashlight is a research framework in C++, so I didn’t expect you to be running it; it just has parameters for everything and was posted as one example.

I was opening things up to discussion, as the old system is getting quite dated against the rapidly moving voice AI scene, and I was wondering if you were working on something more current and flexible?
2 GB is more than enough for many models; as far as I am aware there weren’t any models in the link provided, just some details of where Facebook research are publishing.
The old system could, with a squeeze, reside on the original Zero, which is sort of indicative; a 2 GB Pi 4 has considerably more scope, even if its GPU isn’t great for AI acceleration and it lacks an NPU.

I think it’s probably had its day, but if you are going to take the effort to maintain the same system, then great.

The reason Flashlight is of interest is that it is C++. Rhasspy-like infrastructure is now available on microcontrollers, and I don’t agree that the models are huge: as the ESP32-S3-Box demonstrates, you can do it, it is much more cost effective, and it has the added advantage that audio processing lends itself better to the RTOS DSP of a microcontroller than to an application SoC.

I was purely wondering if you were working on anything new, as this arena has been hugely fast paced and has changed dramatically, with industry-leading offline ASR embedded into mobile phones and simpler systems utilising tiny low-power devices such as the ESP32-S3.

Rolyan, how are you going with your own ESP32-S3-Box? Is it ready for real-world use?

The ESP32-S3 hardware does have distinct benefits for AI applications, and the demo looks impressive (as demos are supposed to). The demo appears to oversell it as capable of acting as both voice assistant and full-featured home automation controller … yet I see a big contrast between the espressif/esp-box repository and the activity on the Rhasspy or Home Assistant repositories.

I don’t recall anyone ever suggesting Raspberry Pi was suited for AI. What it is, is affordable and (until recently) a freely available general purpose platform. Rubbish it all you like just because it isn’t your ideal platform, but I don’t see Raspberry Pi going away.

Look, you can be a Pi fan and say it is affordable, and yes, the ESP32-S3-Box works; because it has an audio pipeline containing AEC + BSS, in tests it works better than a Pi with Rhasspy, which lacks even simple audio processing.
It’s not just Rhasspy, as all the Linux hobby projects have been missing the essential initial audio processing that is an absolute must for what are considered basic voice AI standards.

It is not affordable as a voice AI, as Mycroft clearly demonstrates with a $300 unit that offers little over a $50 unit and is completely inferior to $50 to $100 commercially available products. That is reality; yeah, some hobbyists will build for fun, but that is all they are doing.

There are loads of projects that the Pi does really well, but the lower end, from the original Zero to even the Pi 3 running Python, doesn’t work well for voice AI, for specific reasons I often mention, because I am being objectively honest and not just a fanboy.

You would not run a voice assistant and a full-featured home automation system together, because they are functionally distinct and benefit from running on distinct hardware.
Cars don’t have toilets because it’s generally considered there are better places to take a dump, and bloating a singular system is generally bad practice that will often land you in the shit.

But all the above is not the question, nor anything to do with what I was asking. I was presuming, because of Mycroft and as synesthesiam confirmed, that the focus is now a Pi 4 2 GB, which has far more processing power than the initial Zero, and I was asking if there is anything new in the pipeline more capable than a system that had very modest roots.
With TTS we have seen this with Larynx, which really needs a minimum of that 2 GB Pi 4 64-bit to run well, and all I am doing is asking if there is anything planned.

The ESP32-S3-Box was purely a demonstration that ASR models are generally getting smaller, and I have no idea where synesthesiam thinks there are models that will not fit in a 2 GB Pi 4.
I mentioned Flashlight to dodge my opinion that I feel Vosk is now a better option; other elements have evolved whilst the core ASR has pretty much stayed the same, whilst elsewhere rapid changes are being made.

So how are you doing, Donburch, with your own Rhasspy Pi? Is it ready for real-world use? I don’t make any false claims about the ESP32-S3-Box, as some others do with certain hardware and infrastructure.

The thread is ‘The Future of Rhasspy’, and I was asking because in certain respects it has stayed static.
It will be interesting what Upton says on the 28th (The Pi Cast Celebrates 10 Years of Raspberry Pi: New Episodes With LadyAda, Eben Upton, and More | Tom's Hardware), as I’m hoping we might get something like a Pi 4A, where the A is AI: minus USB3, the spare PCIe lane could bring an onboard Raspberry NPU, as the Pi is starting to lose huge ground in this area.
But that is just discussing the future and what are likely becoming essential requirements.

I personally feel the big processes of voice AI, TTS & STT, can be shared centrally, be it x86 and a GPU or, as I have preordered, a Rock 5 with a 6 TOPS NPU, employing many ears of distributed room KWS to finally get real-world use at low cost.
If an application SoC is not a great platform for audio DSP processing, then partition the process to what it is great for. That is how I see Raspberry Pi and satellite ESP32-S3 KWS: use both for what they are good at, and not for what they are not.
I’m also waiting for the Radxa Zero2, which has a 5 TOPS NPU, but until we get the cost effectiveness of a Pi with an NPU, unless the load is light, the Pi is not a great platform for AI, and that is just fact.

Dan Povey talk from 04:38:33 “Recent plans and near-term goals with Kaldi”

Tara Sainath: “End-to-end (E2E) models have become a new paradigm shift in the ASR community”

Do you have anything to share that could be the future of Rhasspy or any cost effective VoiceAI?

Thanks for the link, it was great to hear the latest from Dan. I agree with him on the need for lexicons of some sort, and am happy that their new Kaldi stuff will stay on that path. It’s also becoming clear that I need to seriously consider using byte-pair encoding as an alternative to phonemization.
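For anyone curious, byte-pair encoding is simple enough to sketch in a few lines: repeatedly merge the most frequent adjacent symbol pair, so frequent subwords emerge from the data instead of from a phonetic lexicon. A toy version (the vocabulary below is the classic illustrative example, not a real training set):

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across a {spaced-word: frequency} vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge: fuse the chosen pair everywhere it occurs."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Learn an ordered list of merges from word frequencies."""
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

On the toy vocabulary {“l o w”: 5, “l o w e r”: 2, “n e w e s t”: 6, “w i d e s t”: 3} the first learned merges are (e, s) and then (es, t), producing the subword “est” with no pronunciation dictionary involved.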

Please consider the bigger picture when making such statements. Why is that $50 unit $50 and not $300? Because they can manufacture a million of them and sell them at a loss. Why? Because it’s more profitable to spy on you than to just sell a smart speaker.

I think the most important next step for Rhasspy is finding a way for others to more easily contribute updates to existing services or add new services. Changes are happening so rapidly that I obviously can’t keep up.

I’ve struggled for a while to come up with a better architecture that would allow for people to easily download new services, but there are so many unique use cases that I keep scrapping it :confused:

I have, and the bigger picture is that Google’s next-gen ASR / KW system is completely offline; it is only online when you use a service, as in grabbing the news or weather, or playing music, YouTube, or whatever.
There is no bigger picture when it comes to such a commercial difference. Yes, they can subsidize product with services, but much of the cost difference has nothing to do with selling at a loss, purely the huge economies of scale the big guys have.
It’s likely we are not far off next-gen smart AI with onboard NPUs, like the Pixel 6 giving approx 4 TOPS in 5 watts, enabling offline, off-grid ASR, but maybe it’s quite a few years yet and we will have to wait and see. Sadly, there were no new announcements from Raspberry on their 10-year birthday.

It’s interesting to watch you go into the Mycroft sales pitch, as now I guess you have to, but for me what $300+ buys has some really cool alternative options that offer more, sound much better, look much better, and work much better, and many of them cost less.
That is just my opinion, and I am going to sit back and see how you guys do and how the reviews come in when it’s released, but much of the cost of the Mycroft II is down to the design and economies of scale chosen.
I just have a minimum level of expectation, because I have used the latest full Echo & Nest Audio and have a reference; even though dev-wise I tinker, a Rhasspy or Mycroft would likely end up, after a strop, in the bin if it were for actual use.
I eventually went for 2x Nest Audio in a stereo pair that I managed to pick up for just over £100.
I don’t use them all that often, but when I do it’s mainly music and news whilst I am doing something, and those services are online anyway, which for me is Spotify free, and I put up with the adverts.
I think the Echo 4 sounds better and also has a zero-latency aux in, but I went with the Nest Audio because I think the recognition is slightly better, and that is what matters to me in a voice AI. The disparity with open source is huge and is my biggest problem; privacy becomes a tin-foil concern when things run so badly.
But hey, that is just me, and I keep my interest up purely with developments and what is current in hardware and open source, and there is some very interesting stuff out there, but for me it’s not Mycroft.
When my privacy is going to cost me $300 and not work to my expectations, whilst I can’t be bothered that Google & Spotify might have an inkling of my taste in music, you can guess what I am going to plump for.

I am here because of my interest in AI and generally what’s happening, but your spy scare stories mean very little to me, as I am still likely to use voice AI for online services.
The only thing offline is occasional alarms. They are in the kitchen/lounge of a relatively small flat and meant I could ditch the hi-fi, as for what I use them for they are good enough, but again that is where the Mk II is lacking.

I’m not trying to scare you, I’m just saying that spying, etc. is part of the total cost. Like with environmental externalities and poor labor practices, sometimes the final consumer price is not the only thing that matters.

I don’t have to, but I would certainly like to continue working on open source voice tech. It would be especially disappointing to have the Mark II (and Mycroft) fail because people who don’t value privacy over price go around complaining that it’s not an Echo.

It’s not that it’s not an Echo; it’s the stupid choice that it is trying to be an Echo and failing in just about every area, so I might as well have an Echo.

Why is open source trying to copy consumer electronics verbatim, and failing, when with the diversification of use it is very obviously suited to client/server, where the expense can be shared? It’s not that I want an Echo; it’s just that Mycroft have chosen that path, and in every aspect it’s inferior.
Why you have chosen a product model that is likely not efficient anyway has nothing to do with me wanting an Echo, but if I am going to buy something like that, I am not going to buy something that is so inferior.

That is Mycroft’s problem and not mine, and you can try your pitch as much as you wish, but the choices they have made are, for me, bizarre, and success and failure are in the hands of Mycroft.

By client/server, I assume you mean someone buying a server and having a number of satellites that use it?

You don’t even need satellites, you just need KWS ears, and a server nowadays can be an ARM board with an NPU, or even a Pi with a Coral USB, though even then a Pi starts to rapidly become less cost effective than it may first seem.
There is a huge amount of capable and cheap second-hand equipment that is capable of multi-threading.

As for struggling with infrastructure, the problem has always been the adoption of infrastructure without need, mainly for branding.
A more loosely coupled modular system, feeding back to upstream projects with larger audiences, has always been a better option for me, and I have been critical of the choices and bloat for a really long time, but never bothered along that line, as the start of simple low-cost audio processing was always missing from the pipeline, and we are beginning to see products that finally fill that void.
That has been a huge hurdle, as the very start of the audio process of voice AI has been missing, but yeah, my Pi 4 or old x86 refurb machine needs only a single unit to act as a ‘server’, and a room can have initial audio processing done on a $20 microcontroller acting as a KWS ear for each room.
So yeah, a NUC, SBC, or even your old desktop or laptop can be the basis of a central server at extremely low cost and cover numerous rooms.
Voice control and capture is extremely scalable because of its nature, where short, singular, infrequent commands are often the norm.

It’s very possible to add a Coral accelerator to a Pi 4 and broadcast audio to a wireless audio system, rather than embedding that functionality purely to call the branding your own.
It’s likely you could do a better system for three rooms at half the price of the Mk IIs, which come close to $1000 if you want 3x Mk IIs.

And no, not satellites, as I have always argued they are bloat and I want to get away from that bloated term; all that is needed are networked KWS mics (networked ears) and a single station.
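As a rough illustration of the “networked ear” split, here is a minimal sketch. Everything in it is my own assumption, not an existing protocol: the port number, the framing, and the choice of raw 16-bit 16 kHz PCM over plain TCP. The point is only that the satellite side can stay tiny (wake word spotting plus streaming) while one central box does all the heavy ASR.

```python
import socket
import threading

PORT = 17345        # arbitrary made-up port for this sketch
FRAME_BYTES = 640   # 20 ms of 16 kHz 16-bit mono audio

def ear_server(ready, received, host="127.0.0.1"):
    """Central server: accept one connection and collect audio frames.

    `received` is a bytearray the caller provides; `ready` is an Event
    set once the server is listening.
    """
    with socket.socket() as srv:
        srv.bind((host, PORT))
        srv.listen(1)
        ready.set()
        conn, _ = srv.accept()
        with conn:
            # Read until the satellite closes the connection.
            while chunk := conn.recv(FRAME_BYTES):
                received.extend(chunk)

def send_utterance(frames, host="127.0.0.1"):
    """Satellite side: after a wake word hit, stream the captured frames."""
    with socket.socket() as sock:
        sock.connect((host, PORT))
        for frame in frames:
            sock.sendall(frame)
```

A real deployment would of course want reconnect logic, a site identifier per ear, and probably a proper transport (MQTT, WebSocket, or UDP with sequencing), but the cost structure is the point: the ear only needs enough compute for KWS and a network stack.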


Before I comment on this topic I’d like to say a belated congratulations to @synesthesiam on your new job and I hope you can contribute to making Mycroft a real privacy based alternative to the other commercial offerings.

This has been an interesting discussion and you both have good points so I thought I might give another opinion which I think overlaps both points of view.

Personally I value privacy over price but only to a certain degree.
When they came out I bought 10 Echos, of which I now use one as an alarm clock and timer, one other as a timer, and one for general questions; the others basically are not used any more, largely for privacy reasons.
I have been struggling with Rhasspy for the last 2 years now because, while it is an excellent product for setup and flexibility, at least in my environment I just can’t get reliable KWS and accurate voice recognition at a reasonable price. This is due to the quality of the microphones and the cost of building out satellites, which are then fine in a close, quiet environment and become almost useless when you try to use them in real-world conditions.
I am running the base on a NUC, with TTS also running there, and mainly want to be able to control my home automation and music playback to Sonos speakers from my local library, all of which I can do now in my test lab, but I would not put it in other rooms for the reasons I mentioned above.

Also I have been watching Mycroft since the first model and have been reluctant to go near it due to cost, capability, reliance on the cloud and it just doesn’t look professional.

I agree with @rolyan_trauts in what I would like to see:

  • Central processing/server based. If this was around the current price of the MarkII I would gladly pay. I would even pay a bit more if it could do the rest.

  • Low cost modules for microphones/satellites I could deploy to each room. By low cost I mean comparable to the Echos and Nests of the world, or a little more (the privacy and flexibility would be worth it)

  • An easy way to integrate my own data and systems into it (i.e. slots)

  • A standard interface or protocol api/websocket/mqtt (pick one or more) that would allow things like node-red/home assistant or any other system I decide to build to integrate with it for I/O and control
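For what it’s worth, Rhasspy already ships something close to that last point via its Hermes-style MQTT API, where recognized intents are published as JSON on `hermes/intent/<intentName>`. A minimal sketch of consuming such a payload from Node-RED, Home Assistant, or your own code (the helper function is my own invention; the field names follow the Hermes convention, so double-check them against your Rhasspy version):

```python
import json

def parse_hermes_intent(payload: bytes) -> dict:
    """Pull the essentials out of a Hermes-style intent message,
    e.g. one received from an MQTT subscription to hermes/intent/#."""
    msg = json.loads(payload)
    return {
        "intent": msg["intent"]["intentName"],
        "confidence": msg["intent"]["confidenceScore"],
        "site": msg.get("siteId", "default"),
        "slots": {s["slotName"]: s["value"]["value"]
                  for s in msg.get("slots", [])},
    }
```

With a client library such as paho-mqtt, subscribing to `hermes/intent/#` and routing each parsed message to a handler would give the kind of standard I/O integration point described above.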

I like almost everything about Rhasspy except the satellites; I agree with @rolyan_trauts that they are too bloated and the hardware they support is expensive for what should be needed, and I agree with @synesthesiam about paying for my privacy.

If Mycroft could offer something like a base unit for the heavy lifting and packs of “microphone” units for the user to place in rooms I think they could offer a much more cost effective solution and using my own example I would seriously consider paying $1000-1500 for a base and 10 microphone units where I wouldn’t buy a single Mycroft Mark II for $300. I would even stretch my budget higher if enough features were offered on the Mycroft platform.

In the meantime I have Echos for the menial tasks and keep trying to work out how to get Rhasspy working to a level I would consider presenting to my wife rather than inflicting on her.