What do you think about GitHub - pyannote/pyannote-audio: Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding - maybe add it too? And GitHub - kssteven418/Squeezeformer: [NeurIPS'22] Squeezeformer: An Efficient Transformer for Automatic Speech Recognition?
Hi
I’m currently a Mycroft (Picroft) user who’s reviewing his approach to voice assistants
I’ve successfully run through the tutorial (nice - ty, a couple of minor bug reports pending…) and have a few comments/observations (mainly ideas I’ve been turning over for a while)
Firstly, I personally think the 'event to sound' side should almost be broken out. It's nice to have it all in the demo, but conceptually the focus of a speech assistant is the input side, so I'd tentatively agree with another post that a snapclient-based (or snapclient-like) approach to delivering TTS would be reasonably sane and lightweight.
So now, thinking about the whole ‘sound to event’ architecture, my main thought is to question why it is so serial (if I’ve understood it correctly)?
My concern is that this has two drawbacks: user-experience latency, since each stage of processing seems to need to finish before the next one starts; and the fact that control of recording is driven by static configuration rather than dynamically.
May I suggest that a buffer-controller (BC) based approach could be interesting?
Something like this:
Mic audio is constantly written into smallish (5 s) per-mic ring buffers.
The buffer controller (BC) provides (websocket & unix socket) access to the buffer using Wyoming.
Wake detection is connected to the BC socket and listens.
When a wake word is detected, the BC is told to record properly and now just grows the buffer; the timestamp at the end of the wake word is available (speech-start). Wake detection would emit an event with the speech-start timestamp (back to the BC for sending to other connections).
Anyway, the buffer now grows and the event and timestamp are sent to VAD, which requests and reads audio until it eventually detects silence, tells the controller to stop recording, and sends a speech-end event. (So does the BC need to be the event broadcaster that sends that to ASR, or is this just how ASR gets an EOF pointer?)
At the same time, the BC sends speech-start to any ASR(s), which eventually dispatch via intent recognition (Wyoming again) to an intent. The intent can now send a 'send-more' event to the ASR, which sends a start-recording to the BC, which again activates VAD and provides an audio stream to the ASR and a text stream on to the intent.
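To make the idea a bit more concrete, here's a minimal in-process sketch of the ring-buffer part of such a buffer controller (hypothetical class and method names; the real thing would expose start/stop and audio reads over Wyoming on websockets/unix sockets rather than direct calls):

import collections
import threading

class RingBufferController:
    """Keeps the last few seconds of mic audio and switches to a 'growing' buffer
    after a wake word, so ASR can read from the end of the wake word onward."""

    def __init__(self, sample_rate=16000, sample_width=2, seconds=5.0):
        self.bytes_per_sec = sample_rate * sample_width
        # Pre-wake: bounded deque of audio chunks (the ~5 s ring buffer),
        # assuming roughly 20 ms chunks from the mic capture loop.
        self.ring = collections.deque(maxlen=int(seconds * 50))
        self.growing = []          # post-wake: unbounded buffer
        self.recording = False     # True between wake detection and speech-end
        self.lock = threading.Lock()

    def write_chunk(self, chunk: bytes):
        """Called continuously by the mic capture loop."""
        with self.lock:
            if self.recording:
                self.growing.append(chunk)
            else:
                self.ring.append(chunk)

    def start_recording(self):
        """Called on wake detection: keep everything from here on (plus recent context)."""
        with self.lock:
            self.recording = True
            self.growing = list(self.ring)  # includes the wake word itself

    def stop_recording(self) -> bytes:
        """Called by VAD on speech-end; returns the captured utterance."""
        with self.lock:
            self.recording = False
            audio, self.growing = b"".join(self.growing), []
            return audio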
Possibilities:
- More responsive ASR as it doesn’t need to wait for VAD to trigger before processing
- Multiple ASR could listen to the BC
- Audio available for other uses (eg debug capture or false-positive wake word training by a “never mind” intent that asks for the wake-word raw data)
- Almost continuous natural speech since the wakeword algorithm should be able to provide an entry point into the buffer for ASR.
- Non-wake-word-triggered recording: an intent asks for more. Eg my shopping skill responds to “We need some [ and …]”, then (via TTS) says “any more?” and starts listening again, rather than me having to re-utter “We need some …” - this is much more natural.
A lot of the pipelining work still makes sense, btw. If I'm getting it right, the buffer controller would provide Wyoming interfaces over websockets, unix sockets or (via a socket shim) stdin.
Anyway… just thought there may be some ideas in here that would spark more discussion
Feeling a little dumb right now, but how can I actually install the rhasspy3 add-on for Home Assistant?
Here, GitHub - rhasspy/rhasspy3: An open source voice assistant toolkit for many human languages states “Install the Rhasspy 3 add-on” and links to GitHub - rhasspy/hassio-addons: Add-ons for Home Assistant's Hass.IO, but I have already added this to my Home Assistant add-on repositories and only see Rhasspy 2.5.
Edit:
I had to remove the add-on, remove the repository and add it back.
Hi there,
Not sure if I'm in the right topic (forgive me if not!).
I just found a Hugging Face repo on an “improved” Whisper.
As it's a bit too tech-savvy for me: is it a nice way forward, or not applicable, or not open-source enough?
Thanks for sharing though
@synesthesiam do you have a quick update for the community about rhasspy3? Sounds very interesting… unfortunately I'm not a developer.
Indeed @synesthesiam, I didn't get around to getting Rhasspy3 set up over the summer, but I'm coming back around to it and was going to start poking into it as another project again. However, the code doesn't seem to have been updated (at least in master) at all for the last 9-10 months. Any progress? Where can I find the current development branch to try and set up?
@synesthesiam I agree that now would be a good time for an update on Rhasspy 3. I guess that the issue for Rhasspy 3 is that you have been focussed (quite rightly) on your paid employment, which is integrating Rhasspy 3 with HA. I just watched HA Voice Assist chapter 6 video, and it is looking better all the time for those of us using Rhasspy with Home Assistant. Thank you Mike for all your effort.
My understanding is that the Rhasspy 3 code is actually working well with Home Assistant. Those of us using Rhasspy with HA should be migrating to HA Voice Assist, and looking for support in the HA Community forum and Discord?
Rhasspy 3 (and this forum) will therefore become focussed more on those using Rhasspy as a toolkit with other applications?
I guess that the Rhasspy 3 code is already doing most of what is wanted - but the documentation may be lagging behind? And given that the toolkit can be used in very diverse ways, the documentation (including examples) will have to be much more thorough … which takes time.
As Drizzt321 mentioned, the rhasspy/rhasspy3 github hasn’t been updated for some time, but the wyoming, wyoming-satellite, wyoming-piper, wyoming-openwakeword, and piper repositories have all been updated recently.
Is wyoming-satellite the current version of Rhasspy 3, or is it specific to Home Assistant?
If wyoming-satellite is HA-specific, do the wyoming modules reference the rhasspy/rhasspy3 modules (and are thus up to date)?
Hi there!
As I'm about to migrate my home automation system to a newer hardware platform (and a recent Bullseye), I had quite a lot of things to work around with the old (2.5.11) version; the “stable” Rhasspy is finally up and running now (together with mimic3 as TTS server). I don't want to go into details, but as I'm also maintaining the code base for Rhasspy's integration into FHEM, the question came up whether it's worth giving the preview version a try.
But Rhasspy3 atm seems to just support HA? And there hasn't been much happening recently, or did I miss something? Basically: is Rhasspy3 still alive and worth porting the FHEM “adapter” to? (And if so: how would such an external intent handler integrate into the Rhasspy3 ecosystem?)
Rhasspy 3 is yet another dead open source project.
You can look to other solutions.
Any suggestions wrt. “other solutions”?
I didn't do much research on that yet, as the latest Rhasspy 2 can still be installed (at least on Bookworm and Bullseye) and is working pretty well with my automation system (FHEM)… Additionally, @synesthesiam is still active here in the community, so there remains a little hope for future development (or at least updates to the installation packages to overcome some known installation and configuration problems with the 2.x version).
Hi everyone
Sorry as always for the delay in updates, etc. I always have way too many things going on
Rhasspy 3 is not dead, but I need to re-think about its place in the current open source voice ecosystem. My focus has obviously been on Home Assistant, but I purposefully designed the Wyoming protocol for distributed voice processing so no one is locked in to one ecosystem. I’ve been working on adding HTTP APIs to the various Wyoming services too (similar to Rhasspy 2) for additional compatibility.
Rhasspy 2 was an “all-in-one” solution which installed, configured, and trained voice services through a shared web UI. This paradigm started to break down towards the end as I began adding more diverse services with unique training requirements or a complete lack of the ability to train without an expensive GPU.
For Rhasspy 3, I’ve been rethinking the “all-in-one” idea and am considering having it be up to each voice service to do configuration and training. What I really want is what Home Assistant OS does with add-ons: you can do whatever you want in a Docker container, and there’s a common method for discovering, installing, configuring, updating, and starting/stopping each add-on. Something like this could be done with Docker compose, I’m sure, but it sounds very difficult to create and maintain.
In the meantime, my plan is to have a Rhasspy 3 server host multiple pipelines like Home Assistant does today. Each pipeline is made up of Wyoming/HTTP services, and there will be a web UI to create, configure, and test pipelines. For now, I plan to leave the installation and configuration of each voice service up to the user, though the tutorials will show how to set things up with Docker fairly easily.
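Purely as an illustration of what a pipeline assembled from independent Wyoming/HTTP services could look like (hypothetical Python, invented names and example ports/endpoints - not an actual Rhasspy 3 API):

from dataclasses import dataclass

@dataclass
class Service:
    """One stage of a pipeline, reachable over Wyoming (tcp://, unix://, stdio) or HTTP."""
    name: str
    uri: str

@dataclass
class Pipeline:
    """A wake -> asr -> intent -> tts chain built from independently installed services."""
    name: str
    wake: Service
    asr: Service
    intent: Service
    tts: Service

# Example only: service names, ports, and endpoints are placeholders.
living_room = Pipeline(
    name="living-room",
    wake=Service("openwakeword", "tcp://127.0.0.1:10400"),
    asr=Service("faster-whisper", "tcp://127.0.0.1:10300"),
    intent=Service("home-assistant", "http://homeassistant.local:8123/api/conversation/process"),
    tts=Service("piper", "tcp://127.0.0.1:10200"),
)

A server hosting several such pipelines, plus a web UI to create and test them, is essentially what's being described.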
What does everyone think?
Dear @Synesthesiam,
I've never had a chance to thank you personally for Rhasspy; your work has brought me so much enjoyment (and utility within my implementation), so this seemed an opportune time to do so as well as weigh in on Rhasspy 3. I've loved voice technology for a long time - the first tech I used was IBM VoiceType (discrete speech!) - but the Rhasspy/HA combination really lit the fire for me. Learning about Docker, JSON, YAML, and basic Python to make things happen in Rhasspy has been incredibly fun, and has probably consumed more hours of my time than it should :).
Regarding Rhasspy 3, I do like the idea of a “componentized” successor to Rhasspy 2. Allowing folks to “mix & match” whichever services they want/need would be helpful, and potentially easier to maintain (e.g. provided all components use the “lingua franca” of the Wyoming protocol, each could be updated separately). I do wonder if it would be necessary to have them all bundled in a single Docker container; I'm perfectly okay with spinning up multiple containers for different components, which I could update separately or turn up/down depending on need, rather than having a monolithic Docker image where the app pulls in the various functionality.
I do have a request (for the future…): if possible, could you have a component/module that supports MQTT? I've built a lot of my functionality using MQTT (I really like the simple elegance of the protocol) and I think it would be useful for some of the “lightweight” transactions (e.g. an IoT device sending an MQTT message).
Again, thanks for everything!
Jeff
@synesthesiam Thanks for the update … great information on your future direction.
I am puzzled by this. By spinning off HA Voice Assist you have set Rhasspy 3 up as the general toolbox (for use with non-Home Assistant systems) that I thought you had always envisaged for Rhasspy.
I think people want to know the current rhasspy3 status.
Is rhasspy 3 usable now without HA? If so, which github repository should people use? The rhasspy/rhasspy3 github hasn't been updated for some time, but the wyoming, wyoming-satellite, wyoming-piper, wyoming-openwakeword, and piper repositories have all been updated recently.
I understand that you're up to your eyeballs in the detailed code, and that user-level documentation is a time-consuming chore … but could you please spend some time fleshing out the current documentation? Hopefully a couple of Rhasspy users with non-HA projects will document their experience with Rhasspy3 and post here.
Okay, so you have found a different way of confirming that Rhasspy3, in its current state, makes no sense and is in fact dead. But that's okay. No need to waste time on useless projects (time is money).
Yesterday I read some of your responses to users on Reddit, where you recommended “extended_openai_conversation.” Today, since you work for Home Assistant, let me offer you some ideas for the future. The first: create a “micro AI” that allows novice users to request the creation of scripts or automations in human language. In short, a micro AI that is able to make Home Assistant “user friendly,” even using voice. The user says, “I want the light in the garden to turn on at eight o'clock at night and turn off at seven in the morning,” and the AI, after getting confirmation, creates the automation.
Second point: is it possible that it didn’t occur to any of the developers to turn the Home Assistant companion app into a media player? This just boggles my mind, I really don’t understand.
First: I'm really glad to hear you still have (some) focus on your great project Rhasspy!!! Also, big thanks from my side for what's been achieved so far!
From the technical perspective, I'm not able to give any solid feedback on your ideas, and I don't have (and neither need nor want to get) any experience with Home Assistant. So I'm also puzzled about the essence of the “multiple pipelines” idea and so on.
Based on that background, I'll try to add some user-focused thoughts here:
- Doing the Rhasspy integration (at least with FHEM as the automation system) is already a big challenge for quite a lot of users. I'm just talking about things like getting names, categories, colour settings etc. (all kinds of “labels”) bridged from the home automation system to the STT and intent recognition. So most of them really appreciated Rhasspy being a “one stop” solution. Breaking that up into more pieces might raise complexity to an even higher level, so I really doubt this will be “attractive” to a (imo) really huge part of the people interested in home automation!
- Besides the fact that I always disliked the idea of using MQTT as the transport layer for audio, MQTT (or the hermes protocol?) was a relatively “easy to understand” and transparent transport layer. So wrt. user support, advising people to have a look at mosquitto_sub's output was always a helpful hint for everybody. Imo, independent of the automation system used, there should be some comparable (debugging) option offered.
- Why (for the text-based part) a new protocol at all? After having a look into the (rudimentary) documentation, Wyoming seems to be a completely new playground, which was really disappointing and left me wondering if I'd have to reinvent the wheel for the FHEM side as well. That's not a very attractive perspective, to be honest.
- I don't know exactly why, but personally I've always preferred direct installations and avoid using Docker whenever possible. Might be irrational, but that's how it is, and my recent attempt to install Rhasspy 2 dockerized ended up in a mess, so I finally went back to the “classic” way once more and used the (patched) deb. Most likely, I'm not the only one with that kind of experience and feelings…
- For now, we should avoid people getting frustrated by Rhasspy. Imo, the first step should be to keep the “stable” tree (Rhasspy 2) usable for those people already using it. So please first provide (installable) deb versions for at least recent Debian distros and/or Docker images with working mosquitto settings (authorisation disabled for the internal server).
Most likely those fears are just irrational and my personal problems were just “home-made”, so once more: first of all, I'm still glad Rhasspy is still alive!
Looking forward to the things to come!
You’re welcome! I’m very happy to hear that Rhasspy has worked out for you
As I play around more with ESPHome, I'm really liking its architecture (which itself is modeled on early Home Assistant). Components are separated into “domains”, and a “platform” groups components based on something like a protocol. So there may be a mic domain that has a udp platform, and thus a UDP microphone component.
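Just to illustrate that grouping (a hypothetical Python sketch of the domain/platform idea, not ESPHome's actual implementation):

import socket
from abc import ABC, abstractmethod

class Microphone(ABC):
    """The 'mic' domain: a common interface that all mic platforms implement."""
    @abstractmethod
    def read_chunk(self) -> bytes: ...

class UdpMicrophone(Microphone):
    """The 'udp' platform of the mic domain: audio chunks arrive as UDP packets."""
    def __init__(self, host: str = "0.0.0.0", port: int = 12345):
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._sock.bind((host, port))

    def read_chunk(self) -> bytes:
        data, _addr = self._sock.recvfrom(4096)
        return data

# Other platforms (e.g. an ALSA or I2S mic) would implement the same domain interface,
# so the rest of the pipeline only ever talks to 'Microphone'.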
No, I would not say it’s usable at this time. The rhasspy3 repo represents an early vision of what became Wyoming, and I think I can do better now with the perspective gained.
One of the use cases for Rhasspy 2 was people just needing a few tools from the toolbox. They got this by spinning up a Rhasspy server and only enabling some of the services. Then, the Rhasspy API provided access to these services over HTTP and Websocket (the services themselves used MQTT).
Nowadays, I expect people to be able to spin up Docker containers for each service they want like wyoming-piper, wyoming-openwakeword, etc. I am slowly adding HTTP APIs to these services, and will eventually add MQTT/Hermes support. So something like Rhasspy would not be needed if you just want services à la carte.
Where I do see value for Rhasspy 3 is:
- Managing all of these services from a single configuration (likely using docker-compose ultimately)
- Coordinating pipelines across services and exposing high-level APIs for running pipelines and talking to satellites
- Experimenting with pipeline concepts that are too complex for Home Assistant (multiple levels of fallback, multi-language, etc.)
Nope, that’s not at all what I said
We’ve discussed this idea, and in fact it has been done before (HA partnered with a research group that later shut down their servers). I do believe this is possible with an LLM, likely even a “small” LLM with the proper training set.
I don’t know if voice is the best place for this, though. I would prefer to file this under “natural language interaction” instead – for example, you type the automation you want in natural language, and HA presents a filled-out automation for you to confirm/edit.
Wouldn’t it make more sense to add support for an existing media player app to HA?
This is a good point, and I don’t have a great solution for it. Hassil is (in my mind) a spiritual successor to the Rhasspy template language + slot lists. I do think that adding just a bit more complexity to the lists would help out – specifically, parent/child relationships between items (like a device being in an area). Maybe that would be enough complexity for most home automation systems.
I will have HTTP/Websocket/MQTT options for the services as well, so there shouldn’t be any need to reinvent stuff.
I went with a new protocol because of this logic:
- MQTT doesn’t do peer-to-peer
- HTTP is pretty lightweight, but you have to base64 encode binary data (audio) and two-way communication is a pain
- Websockets do most of what I want, but (1) they are complicated enough to need a library and (2) you can’t attach metadata to binary messages
So I thought, here’s all I want: TCP with JSON headers and sometimes some binary data (audio). And thus Wyoming. Initially, the header just had the event type/data and the length of binary data (if any) in a single line of JSON. Then it turned out that Python has a hard-coded limit on the length of lines (WTF), so I moved the event data to a separate section.
Wyoming works over TCP, but it also works great over Unix domain sockets and even standard input/output! So unlike HTTP, Websockets, and MQTT, you can communicate with Wyoming services without even bothering to open a port.
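As a rough illustration of that framing idea (a minimal sketch of the design as described above, not necessarily the exact Wyoming wire format - see the wyoming repo for the real implementation):

import json
from typing import BinaryIO, Optional, Tuple

def write_event(out: BinaryIO, event_type: str, data: dict, payload: bytes = b"") -> None:
    """One line of JSON header, then a separate JSON data section, then optional binary payload."""
    data_bytes = json.dumps(data).encode("utf-8")
    header = {
        "type": event_type,
        "data_length": len(data_bytes),
        "payload_length": len(payload),
    }
    out.write(json.dumps(header).encode("utf-8") + b"\n")
    out.write(data_bytes)
    out.write(payload)
    out.flush()

def read_event(inp: BinaryIO) -> Optional[Tuple[str, dict, bytes]]:
    """Inverse of write_event; returns (type, data, payload), or None at EOF."""
    line = inp.readline()
    if not line:
        return None
    header = json.loads(line)
    data = json.loads(inp.read(header["data_length"])) if header["data_length"] else {}
    payload = inp.read(header["payload_length"]) if header["payload_length"] else b""
    return header["type"], data, payload

Because it is just length-prefixed reads and writes, the same code works over a TCP connection, a Unix domain socket, or plain stdin/stdout, which is the point above about not needing to open a port at all.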
I’m worried about the amount of effort this would actually entail. I can only imagine how many Python libraries in Rhasspy no longer have working versions for Python 3.11+. With my limited time, I’m always struggling to decide between time spent on creating new things (like Rhasspy 3) and time spent supporting older things
As I live and breathe the open source world more and more, I have come to discover that I'm not very good at delegating. This is something I'm continually impressed with in the Home Assistant community, and I feel I'm lacking the necessary skills to make it work here.
Okay, in this last answer you made sense of Rhasspy3.
I'll try to tell you my use case: I use Rhasspy Mobile on a tablet in the living room and on my smartphone, both connected via MQTT to rhasspy2, which only takes care of managing the intents and sending them to Home Assistant. On Rhasspy I use my favorite TTS service with its own defined voice, and, for example, with a Home Assistant automation I can send a wav file (an alert sound) before it tells me what time it is. Another example: every morning another automation sends a signature tune that precedes an almanac with calendar events, weather, news, etc. Well, I would be very happy if I could use the Home Assistant companion app to manage this type of automation, especially on my smartphone, but to date, to my knowledge, it's not possible because the media player service is missing.
If the micro AI project goes through, ask the community whether they are willing to share their automations for training. I would be proud to. I think most will accept.
Thanks a lot for your elaborate answer!
Imo the essential point for interaction between any home automation system and the rest of the services making up something we might call the “wyoming ecosystem” really is some kind of compatibility in the intent recognition results. The JSON blobs in the hermes protocol are not very easy to understand and analyze, but (if the intent recognition system is configured appropriately) they contain all the stuff needed to execute, in most cases, the action wanted by the user.
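For readers who haven't seen them, a hermes-style intent recognition result is roughly shaped like the following (field names written from memory as a Python dict, so treat this as an approximation rather than the exact schema):

# Approximate shape of a hermes/Snips-style intent recognition result:
intent_result = {
    "input": "turn on the kitchen light",          # the recognized text
    "intent": {
        "intentName": "SetLightState",             # which intent matched
        "confidenceScore": 0.97,
    },
    "slots": [                                      # the pieces the handler actually needs
        {
            "slotName": "device",
            "entity": "device",
            "rawValue": "kitchen light",
            "value": {"value": "kitchen_light"},
        },
        {
            "slotName": "state",
            "entity": "state",
            "rawValue": "on",
            "value": {"value": "on"},
        },
    ],
}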
The “strength” of the FHEM solution imo is that it automatically fills all the slots needed for most use cases once you put any “entity” under Rhasspy control. E.g. adding new lights is a piece of cake: activate the Rhasspy service for them and add a (speakable, but not necessarily unique) name for each of them…
Training etc. is then initiated automatically; no need to edit any yaml file (which is what the (HA-orientated?) hassil looks like at first sight, but I may be wrong) or code any parent/child relationships on the intent recognition side. Obviously, this also has the disadvantage of being a little too open with intent recognition results. But this way round it's very easy to handle, especially for users on the learning path…
Wrt. updates for 2.5: imo, atm just two tweaks are needed to get that done (libgfortran version in the deb file, MQTT without user/password in Docker, see “Bullseye refuses to install”). So the effort to spend on that is relatively moderate… But I clearly understand your focus on developing for wyoming!
@synesthesiam thanks for all your work.
From my point of view, I like the idea of using Docker with docker compose. If the documentation is good, an example of each service with a full docker-compose configuration is pretty easy: just copy and paste, then adjust the directories and the services you want to use. In addition, from a maintenance perspective it is very easy; you don't need to worry about new library versions. It is a nice solution with very small overhead.
In my case, I finally gave up on using Rhasspy3 and I'm using Home Assistant with the Wyoming Docker add-ons. In case you are using a Home Assistant Docker installation, that part is not very well documented.
The only thing I miss is the program we have in Rhasspy 2.5 to extract all HA entities for Vosk. In my case, faster-whisper didn't work so well with Spanish. I also tried some other models, but in general the quality is far from the English version.
Finally, let me know if you are interested in a new tutorial with HA and Wyoming addons all in docker.
Part of the problem is the assumption that the SOTA WER levels it posts are the same for all language models.
They're not: the WER rockets. Spanish is the best-supported language in the large model (per OpenAI's graph), but drops to 4th in tiny, where all languages post pretty bad WER.
The 0.37 WER of the tiny model on Spanish is a mile away from the SOTA scores - but so are all languages, not just Spanish. Which raises the question of why Whisper at all, as there are (and were) much better and much smaller models for the general hardware level Rhasspy is aimed at.
Speechly did a good review of Whisper language vs model size, and for the models supplied the news hasn't been good for a long time.
Also, Whisper has been trained on standard mic inputs and works with moderate RIR and noise, but if you use DSP to remove them, WER gets much worse unless you fine-tune Whisper.
So you end up having to fine-tune absolutely huge, monstrously sized models, and so far it's DIY as no training framework has been provided.
There are many languages that have very bad WER even in the large model, and as this guy found out, training a language on a fully open-source ASR framework such as SpeechBrain creates a model 10x smaller than Whisper large whilst being nearly 2x better on WER & CER.
I have not used Vosk, but I'm guessing Vosk has an LM (language model): super-small accompanying vector databases of the text words of the language you want to use (simplistically put).
This is why I really like what they have done at Wenet: confining an ASR to the words you actually use (which, for control such as HA, is a small subset) can massively increase accuracy.
I will let Wenet explain why as they do so far more eloquently than me.
https://wenet.org.cn/wenet/lm.html
Also, here's how to add context biasing to weight the n-grams and further increase accuracy:
https://wenet.org.cn/wenet/context.html
Basically, you can make vastly smaller (faster) and more accurate models by using old tech but being domain-specific. In a domain-specific area such as HA this makes total sense. This could also extend to multi-modal ASR based on predicate detection, or on-the-fly LM creation / domain LM loading for smart devices.
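As a concrete illustration of domain restriction, the vosk Python package lets you pass a grammar (a JSON list of the phrases you actually care about) to the recognizer, which is the kind of constrained decoding being described. A minimal sketch, assuming a downloaded Vosk model; the model path, wav file, and phrase list are just placeholders:

import json
import wave

from vosk import Model, KaldiRecognizer

# Placeholder model and phrases - swap in your own model and HA-specific vocabulary.
model = Model("vosk-model-small-es-0.42")
grammar = json.dumps([
    "enciende la luz de la cocina",
    "apaga la luz de la cocina",
    "[unk]",  # catch-all for out-of-domain speech
])

# Restricting the recognizer to this grammar makes it far more accurate (and faster)
# on exactly these commands than a general-purpose large-vocabulary model.
rec = KaldiRecognizer(model, 16000, grammar)

with wave.open("command.wav", "rb") as wav:
    while True:
        data = wav.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])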
Containers make 110% sense from a dev perspective because of the nature of Python: importing and adopting so much open source from others creates a dependency nightmare.
Library handling in Python is a tad raw, with many projects just pinning absolute versions, and the libs of the distro release often change too.
When you grab the plethora of modules needed to complete a working ASR system, you can soon enter dependency hell, where no matter what you do, changing or updating libs creates new dependency problems that become cyclic.
This is why Docker and containers are so great: you can split modules ('contain' them) into isolated units connected by declared network or /dev devices, or the file system, each with different Python libs, different releases and even different distros.
It allows you to put together a whole range of software that would never work in a single bare-metal environment due to dependency hell, as what one module will run on, another will not.
Docker is a great tool, but an image with all the Docker containers ready and running would also be great for zero-dev, zero-config install users.
Why Whisper is a good question, as is: why does Rhasspy hardcode the ASR?
With the rapidly changing arena of voice recognition and the huge array of languages and functional domains, it doesn't make sense to hard-code it.
We need a pre-prepped ASR container with a transport mechanism that accepts binary speech audio plus the metadata of where it came from, so that it can be passed on.
ASR does not need anything system-specific, and with very little modification any ASR could be used.
ZonalMic->WirelessZonalMicServer/QueueRouter->ASR->SkillServer->TTS->WirelessZonalAudioServer.
The speech recognition process is a simple serial chain; all elements can be agnostic of any particular system and just need a destination to pass binary audio data, text metadata, or both on to.
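A tiny sketch of one link in such a chain: read raw PCM from stdin (e.g. piped from arecord) and forward it over TCP with a small JSON metadata line saying where it came from. The hostname, port, and field names are made up for illustration:

import json
import socket
import sys

# Hypothetical destination: the next hop in the chain (e.g. the mic server / queue router).
DEST_HOST, DEST_PORT = "asr.local", 10555

def relay_stdin_audio(zone: str = "kitchen", rate: int = 16000) -> None:
    """Forward raw 16-bit mono PCM from stdin to the next stage, metadata first."""
    with socket.create_connection((DEST_HOST, DEST_PORT)) as sock:
        # One JSON line of metadata so the receiver knows the origin and audio format.
        meta = {"zone": zone, "rate": rate, "width": 2, "channels": 1}
        sock.sendall(json.dumps(meta).encode("utf-8") + b"\n")

        # Then just stream the binary audio until stdin closes.
        while True:
            chunk = sys.stdin.buffer.read(4096)
            if not chunk:
                break
            sock.sendall(chunk)

if __name__ == "__main__":
    relay_stdin_audio()

Fed by something like "arecord -f S16_LE -r 16000 -c 1 -t raw -", this is roughly the ZonalMic -> mic server hop; each later hop can do the same with text instead of audio.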
It screams individual contained processes (containers) talking over the network, but ALSA and devices such as the file system can also be declared via Docker.
A container can play into a shared ALSA loopback and another container can listen on the other side, recording the mic via the low-latency, C-optimised kernel code of the Advanced Linux Sound Architecture (ALSA).
File-type objects can also be used, or even TCP sockets via netcat.
We should really be able to use any ASR, and route and queue to multiple instances, while a web dashboard remains a single control and config interface - and it could be even easier than it currently is.