Rhasspy 3 Developer Preview

@synesthesiam thanks for the info; that’s too bad about the Seasalt custom wake-word.
I really need it to work on a Pi in addition to x86 VM.
I think you should give some info on how other people can help you with the to-dos…
i.e. what needs to be done: research items, testing configurations, Python code that needs to be written, etc. If you break down the to-do tasks, maybe other people can help out?

What do you mean by “accumulating context”? Can you give some examples?

1 Like

For me, accumulating context in a pipeline by passing it via stdin and stdout just creates complexity and unnecessary overhead in the audio stream for no apparent reason.
An audio stream is an audio stream and an accompanying YAML is an accompanying YAML; it doesn’t need to be passed through each stage of a pipeline, it is just associated with that stream. The same goes for audio files: the pipeline assumes a realtime stream, but it could equally be a queued file that simply shares the same name with a different extension.
Websockets again are great for this as they have two frame types, text and binary, so it’s super easy to separate the two and send varying-length text and binary without having to define a strict protocol, and it’s super lightweight as you are not applying an unnecessary one.
Linux has well-defined streaming protocols, and the Python module for GStreamer looks absolutely perfect, as do websockets, since the network layer is extremely important for linking up instances.
A Linux speech recognition system is just a very simple serial processing chain of streams that you should be able to route, queue and split, so you can easily apply different processing chains to different hardware and scale by running parallel instances.
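
As a rough sketch of what I mean (the port and the YAML keys here are just illustrative, nothing is a spec), the Python websockets lib already gives you the text/binary split for free:

    # illustrative only: one text frame carries the stream's YAML metadata,
    # then plain binary frames carry raw PCM, no extra framing protocol needed
    import asyncio
    import websockets

    async def send_stream(uri, yaml_text, wav_path, chunk_size=4096):
        async with websockets.connect(uri) as ws:
            await ws.send(yaml_text)              # text frame: accompanying YAML
            with open(wav_path, "rb") as f:
                while chunk := f.read(chunk_size):
                    await ws.send(chunk)          # binary frames: raw audio

    asyncio.run(send_stream("ws://localhost:8765",
                            "zone: kitchen\nrate: 16000\n",
                            "utterance.wav"))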

There is zero advantage to embedding context in a totally proprietary protocol when everything can already be accomplished with standard Linux interfaces that are widely supported and more performant.

1 Like

I mean something like having a wake word system that can also identify the speaker, and a speech to text system that has speaker-dependent models. The speech to text portion of the pipeline needs to be able to get the speaker id from the earlier wake word stage.
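
As a rough sketch of the idea (the field names here are just for illustration, not the actual implementation), each stage adds to a shared context that later stages can read:

    # sketch only: the wake stage records who spoke, and the speech to text
    # stage later uses that to pick a speaker-dependent model
    context = {}

    # ... after the wake word stage:
    context["wake"] = {"name": "ok_nabu", "speaker": "alice"}

    # ... later, in the speech to text stage:
    speaker = context.get("wake", {}).get("speaker", "default")
    model_path = f"models/stt-{speaker}"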

I agree, I always feel perpetually behind :stuck_out_tongue:

The reason is so we can just directly call the original programs instead of having to wrap everything in a plugin for a specific framework like GStreamer.

1 Like

That’s my whole point: you don’t call the original program, you just route the audio to it, and you don’t have to use GStreamer to do that (you can if you want).
GStreamer was mentioned because, for some reason, rhasspy3/wyoming.md at master · rhasspy/rhasspy3 · GitHub seems to be just a completely proprietary GStreamer-like protocol that once again creates complexity by embedding control events and the audio stream as one.
That step turns the ease of passing Linux audio as files or streams into something that needs programmatic conversion into this GStreamer-like Python protocol with embedded event info.
You’re embedding audio into a purely proprietary protocol, and unlike GStreamer, Pulse, ALSA and Snapcast, which all have ready-made interfaces to each other in highly optimised and tested code, you are providing completely proprietary audio stream control that has no need to exist!?
Also, to top it off, you don’t call the original program; you just need a simple serial processing chain of streams/files that you can route to the app, as voice applications should already be loaded and waiting for input to minimise latency.
I don’t have to wrap anything in a plugin, as GStreamer is a standard Linux framework with a huge array of ready-made plugins.
The opposite is true here, where you have created this Wyoming protocol that exists nowhere else, and to use it everything needs to adopt it.

That is really easy to accomplish when speech recognition programs sit waiting for files or streams from standard Linux frameworks, which come with a huge range of ready-made tools that can provide this…?

I think it’s time to give Rhasspy 3 a try.

  1. What platform can the current Rhasspy 3 server run on?
  • Your tutorial starts with CLI commands, so I guess it is not intended for HAOS?
  • Will it run on a RasPi, or does it need x86 and Linux? I think I noticed somewhere that one package isn’t available for ARM processors.

In other words, should I use my old RasPi 4 + HAOS system (from before I upgraded to an old PC)? Or create a new VM on my x86 desktop?

  2. Similarly, what platform can the Rhasspy 3 satellite run on?

  3. My objective is to produce a tutorial and/or documentation for non-developers.

I use it on x64 Linux (Fedora Server). It should also work on ARM I think; many submodules are the same as in Rhasspy 2.

1 Like

But the Wyoming protocol and its requirements/dependencies are simple to learn and implement.
And that’s the stated goal of Rhasspy v3: simple for apps/devs to integrate and interoperate.

“For v3, a project goal was to minimize the barrier for programs to talk to Rhasspy.”

1 Like

That is the strange thing, as it is a barrier in itself: it’s a proprietary protocol requiring programs to talk ‘Rhasspy’.
Things seem strangely in reverse for both input and output programs. It’s not a matter of the Linux kernel adopting a Rhasspy protocol: audio is already embedded in the kernel as ALSA (Advanced Linux Sound Architecture), and the various sound servers are built on top of that.
Those are standard, high-performance Linux libraries for passing audio, and now all programs must convert to Wyoming to provide input to Rhasspy.
What has me scratching my head is why do this at all: yes, have a simple protocol, but don’t embed it into standard audio streams and make them non-standard when it’s so easy to use the kernel interfaces (ALSA) and the existing audio servers (Pulse, PipeWire, GStreamer) and just pass the JSON as an external config file (YAML).
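
Something as simple as this would do (a sketch only; the file names and keys are just an example), where the audio stays a plain file or stream and the metadata is a sidecar with the same name:

    # sketch: "same name, different extension" sidecar metadata,
    # the WAV itself stays a perfectly standard audio file
    from pathlib import Path
    import yaml  # pyyaml

    def load_job(audio_path):
        audio = Path(audio_path)
        sidecar = audio.with_suffix(".yaml")
        meta = yaml.safe_load(sidecar.read_text()) if sidecar.exists() else {}
        return audio, meta

    audio, meta = load_job("queue/utterance-0001.wav")
    print(meta.get("zone"), meta.get("speaker"))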

So if you take examples that currently work with Rhasspy, such as Voice-EN AEC, Speex AGC, DeepFilterNet, and in fact any VAD or KWS that uses standard audio interfaces, they will now have to be converted to embed Rhasspy metadata, rendering the standard audio stream proprietary.

So take, for example, rhasspy3/wyoming.md at master · rhasspy/rhasspy3 · GitHub and its event types.

Event Types

  • mic
    • Audio input
    • Outputs fixed-sized chunks of PCM audio from a microphone, socket, etc.
    • Audio chunks may contain timestamps

Even the most basic case of plugging a mic into a Linux machine now doesn’t work with Rhasspy as-is, since it’s no longer a standard Linux kernel audio stream but this ‘chunked’ protocol!?

  • wake
    • Wake word detection
    • Inputs fixed-sized chunks of PCM audio
    • Outputs name of detected model, timestamp of audio chunk

So now every wake-word program needs to strip the Wyoming protocol and return to a standard Linux audio stream to process the audio for a wake event, and then wrap the result back up into the protocol, making it proprietary once more.
Also, why it outputs the name of the detected model and the timestamp of the audio chunk, and what use or function that has further along the pipeline, I have absolutely no idea.

I could go through each step where it creates extra work on the same audio stream and injects metadata that standard programs have no need or use for; in fact, doing so excludes them.
I am not going to bother, but in terms of a dev discussion, ignoring standard Linux kernel methods for audio and creating a proprietary protocol that only Rhasspy needs seems the opposite of:

Event Streams

Standard input/output are byte streams, but they can be easily adapted to event streams that can also carry binary data. This lets us send, for example, chunks of audio to a speech to text program as well as an event to say the stream is finished. All without a broker or a socket!

Each event in the Wyoming protocol is:

  1. A single line of JSON with an object:
  • MUST have a type field with an event type name
  • MAY have a data field with an object that contains event-specific data
  • MAY have a payload_length field with a number > 0
  2. If payload_length is given, exactly that many bytes of payload follow

Example:

{ "type": "audio-chunk", "data": { "rate": 16000, "width": 2, "channels": 1 }, "payload_length": 2048 } <2048 bytes>
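
In Python terms that amounts to something like this (my own sketch based on the quoted description, not their code):

    # sketch from the quoted spec: one JSON header line, then the payload bytes
    import json
    import sys

    def write_event(event_type, data, payload=b""):
        header = {"type": event_type, "data": data}
        if payload:
            header["payload_length"] = len(payload)
        sys.stdout.buffer.write((json.dumps(header) + "\n").encode("utf-8"))
        if payload:
            sys.stdout.buffer.write(payload)
        sys.stdout.buffer.flush()

    chunk = b"\x00\x00" * 1024  # 2048 bytes of 16-bit mono silence
    write_event("audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, chunk)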

I am reading the above and my head is exploding, as the very things it extols as virtues exclude standard Linux protocols and replace them with something more complex, proprietary and, as far as I can tell, not needed for a simple voice system to work.

Say we take “All without a broker or a socket!” and look up what a broker actually is.

Wyoming is a broker, and it’s needed at every stage of the pipeline to convert back to the normal Linux audio streams that the original program, now wrapped in the Wyoming stdin/stdout broker pipeline, likely uses anyway.

This means it has to be implemented by someone who is a dev of the apps, whilst many apps are already complete and work on standard audio streams; what is being said seems paradoxical.

Wyoming seems to be very much a continuation of Hermes-style audio control minus the MQTT, which I never got anyway, as it seems to focus on a load of unnecessary detail whilst ignoring standard Linux processes and what is needed for a modern multi-room voice server system.
If you look at the new Hass Assist, it needs the ASR inference text and absolutely none of what Wyoming/Rhasspy are creating a non-standard protocol and workload for. What is worse, in terms of modern multi-room systems, the protocol lacks the zonal/channel info Assist would need to create a simple second-level interaction message, so that its TTS text response can return as audio to the source input.

I keep stating that because it really is simple, but for some reason Rhasspy forces the input/output audio chain to know Rhasspy, rather than Rhasspy working with standard Linux audio streams, and so it becomes complex.
I think I am going to bail on Rhasspy 3, as for me it is purely a continuation of what I perceived as wrong in 2.5, and I will likely just continue to play with mic hardware and KWS and generally keep an eye on voice technology.
As an example, if you are interested, I will sketch out how things should operate so that it’s not a matter of minimising barriers for programs to talk to a voice server; there simply are no barriers, as it uses standard Linux streams and protocols.

Example interoperable voice system building blocks
Rhasspy seems to make the assumption that mics and KWS should be able to talk to Rhasspy, enforcing a Rhasspy protocol because Rhasspy is central.
Rhasspy has always lacked a KWS/mic server. KWS devices and mics are simple devices that do a specific job and have no need to know any voice server protocol, so a dev could create a KWS/mic device that is interoperable not just with Rhasspy but with various voice applications.

The websocket API (GitHub - rhasspy/rhasspy3: An open source voice assistant toolkit for many human languages) was a good start, but where we do need a simple protocol of “start/stop” and a simple quality metric it is lacking, and it strangely embeds a null in the binary to signify an end.
It is also not part of a KWS/mic server that aggregates input into a single stream per zone by channel, and that, if multiple zones are concurrent, queues one as a file to be processed after the first has completed, or routes it to another instance, since ASR is generally a serial process.
The KWS/mic server is missing, and Rhasspy forces each mic/KWS process to embed metadata into a broker protocol because it processes streams after receipt rather than on input, and it also assumes a single stream, with multi-room connections to be handled later.
If you process connections and streams, local and remote, in a KWS server, then each device doesn’t need to know Rhasspy or carry any embedded protocol, as the KWS server can add it; but it isn’t even needed if the audio is simply streamed, queued and routed.

There is a whole section at the start of the audio processing system that is absent in Rhasspy, and it creates a whole load of complexity to do this after the fact.

Any skill, such as Hass Assist, needs only to send a text message to TTS, but it is missing the important KWS server zone and channel to return the audio to.
Skill metadata that initiates a KWS/mic stream is also absent, so the system cannot simply differentiate between original command streams and secondary response streams; but that is another example of how easy and simple control should be.

It’s that simple, but wow. So I am going back to playing with mics, KWS and voice tech news, and will refrain from giving an honest opinion and getting more than the temporary forum ban I got last time for purely expressing an opinion.

Bemused :slight_smile: is all I will say on this topic. Hass Assist has proven how great inference-based skills are, but a very simple KWS-server protocol seems to be missing, and the complex one proposed is likely not needed… :roll_eyes:

1 Like

I say it right here in the adapters section:

Using events over standard input/output unfortunately means we cannot talk to most programs directly. Fortunately, small adapters can be written and shared for programs with similar command-line interfaces. The adapter speaks events to Rhasspy, but calls the underlying program according to a common convention like “text in, WAV out”.

Most programs do not need to know anything about the Wyoming protocol, because they already follow a convention covered by an adapter.
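
To give a feel for the size of one, here is a rough sketch of a “text in, WAV out” style adapter (the program name and event names below are placeholders, not the actual adapters in the repo):

    #!/usr/bin/env python3
    # sketch: read one text event from stdin, call an ordinary CLI program
    # that prints WAV to stdout, and write the WAV back as an event payload
    import json
    import subprocess
    import sys

    def read_event(stdin=sys.stdin.buffer):
        header = json.loads(stdin.readline())
        length = header.get("payload_length", 0)
        return header, (stdin.read(length) if length else b"")

    def write_event(event_type, data, payload=b"", stdout=sys.stdout.buffer):
        header = {"type": event_type, "data": data}
        if payload:
            header["payload_length"] = len(payload)
        stdout.write((json.dumps(header) + "\n").encode("utf-8"))
        stdout.write(payload)
        stdout.flush()

    event, _ = read_event()
    text = event["data"]["text"]
    # call the underlying program by its own convention: text in, WAV on stdout
    wav = subprocess.run(["some-tts", "--output", "-", text],
                         capture_output=True, check=True).stdout
    write_event("audio-wav", {}, wav)  # placeholder event name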

You received a temporary ban for being unnecessarily negative and non-constructive with your feedback, not for expressing an opinion.

At this point, though, it honestly does feel like no matter what I put out, you will not like it. Which is fine, but it would be much more constructive to criticize something you have actually used and understand.

3 Likes

What adaptors? I have to write an adaptor before I can use any program.
Apologies Michael if I cannot ‘like’ it, but I am looking at an input audio chain going in and out of a protocol via Python, where I have to write adaptors and take the hit in load and dev time; it’s not a matter of liking it, it’s that I have absolutely no idea why.
Seriously, I am sat here thinking Linux has an audio system written in C by people much better than me, and I cannot figure out at all why you would have such mechanisms.
So no problem, I will not use it, and maybe I’ll concentrate on a simple KWS server that feeds an ASR that feeds Hass Assist and, hopefully, other inference-based skills that might arrive.
Just look at the length of the documentation, the adaptors and the protocols, just to get an audio stream to an ASR?!
If that is deemed unnecessarily negative and non-constructive, well, to be honest it is just opinion and feedback on your dev work; your code is always great and tidy, but I honestly have no idea why it always seems so bloated, especially when we are targeting embedded.

I am just going to stick to hardware mics and look at the websockets interface to create a simple KWS server for audio in, with likely squeezelite or Snapcast for audio out, as that also gives multi-room audio. I guess I just don’t need Rhasspy, as I’m struggling to work out why I need Wyoming, adapters and the rest just to accomplish that.

Fine, maybe come back here when all you talk about is ready for testing, that is, in a GitHub repository with full documentation to deploy and test it. I already miss it :slight_smile:

1 Like

I should really have a go at an ESP32-S3 KWS due to the Raspberry Pi stock situation, but I’m procrastinating over the KWS model and over taking the plunge with the Espressif IDF.
I was sort of hoping I would be able to use something existing, but it’s not looking that way.
The multi-room part is a no-brainer really: like the ready-made multi-room audio (squeezelite, Snapcast…), it’s just the zonal client/server infrastructure that is needed, associating the zones of input and output.
You need a bit of a ‘debounce’ with multiple mics in a zone to cope with latency (wait 100 ms for all to reply and pick the best stream).
Things are changing so fast with generative models…

So with everything changing so fast, lack of stock, procrastination, and likely little if anything I could use, I think I am losing interest.
I have been meaning to restart the KWS work that I ditched, but I cannot seem to find the enthusiasm, so maybe not :slight_smile:

As for documentation, the client/server model simplifies operation massively: there are two servers, and both just route and queue.

1. KWS Server (the Skill Server is a client)
2. Skill Server (the KWS Server is a client)

There is no concept of a satellite, as none is needed; devices simply connect and register with a server using a static UID as you set up the system.
When an ‘ear’ connects to the KWS Server, its UID signifies the zone it has been attributed to in a YAML, which is queued and routed with that stream session, be it an actual stream or a file.

The Skill Server contains the ASR and, on inference, routes the text to the most appropriate inference-based skill on the predicate of the inference, with the inference text added to the YAML.
If a response is required, a skill appends the TTS text to the YAML and returns it to the Skill Server.
TTS is a skill, but input zones/channels are mapped to outputs, which is why the Skill Server is a KWS Server client: it forwards the YAML to start a response recording on the initial ear of that zone.
The next inference knows this is the response, so the YAML contains all the data needed to route back.
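
To make that concrete, the accompanying YAML for one session might just grow like this as it passes through (the keys are only an example):

    # sketch: per-session metadata lives beside the audio, each stage appends
    import yaml  # pyyaml

    session = {
        "uid": "ear-kitchen-01",          # static UID the ear registered with
        "zone": "kitchen",
        "channel": 0,
        "audio": "queue/session-0001.wav",
    }
    session["inference"] = "turn on the kitchen light"   # added by the ASR
    session["tts"] = "Turned on the kitchen light"       # added by the skill
    print(yaml.safe_dump(session))    # routed back to the ear of that zone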

Client/server TCP is massively simpler, as delivery is guaranteed and the whole inference structure doesn’t need the timestamps and protocols that a UDP/MQTT-style broadcast network needs, where delivery and ordering cannot be guaranteed.

I am a Snapcast fanboy, but likely you would modprobe ALSA loopbacks, assign one to each room, and just play into it; Snapcast does the rest.

That is it, and it covers everything: a voice server processes voice and skill servers process skills. We just need skills and KWS (ears), and to decide whether to have a fast domain-specific predicate ASR that forwards to a task-based secondary ASR.
I.e. Whisper (conversational ASR), whilst a domain-LM Wav2Vec2 could quickly provide for command skills such as Assist or local media libraries of known entities.

What we lack are decent, cost-effective KWS ‘ears’ with good audio processing, and good inference-based skills.
Hass Assist is great, as it’s the first inference-based skill I have seen, but for some reason it uses an older-tech strict word stemmer that needs declared, formatted YAML, rather than, say, more modern NLP or even an LLM (Large Language Model).

Rhasspy is back to front for me: skills and devices do not need to know a voice server or convert standard audio streams or files into proprietary protocols that were historically needed because of MQTT.
A voice server is the really simple part. Likely we will start seeing LLM-based domain-specific skills, and an ESP32 guru or a Rust/C guru might provide a KWS; it’s far too much for one person to do it all, but I am absolutely certain the way forward is a simple client/server infrastructure and not a broadcast network of embedded strict protocols.

It’s the skills that are the complex part, as in training an LLM, but they are jaw-droppingly amazing and running locally, right now.

As in “Introducing LLaMA voice chat!” by Georgi Gerganov.

Thanks to Georgi :slight_smile:

Rhasspy itself will run anywhere Python can, but most of the voice programs are built for x86_64 and arm64 systems. So, Raspberry Pi 3/4 with 64-bit OS.

There is an HA add-on that includes Whisper and Piper (formerly Larynx 2). But you can’t change anything yet.

32-bit ARM support is getting harder and harder, but everything will work on 64-bit ARM.

The satellite code should work on 32-bit ARM if you use porcupine (as well as 64-bit ARM and x86 of course). It’s also possible to run a satellite that’s just gstreamer or a simple websocket client, since wake word detection can happen on the server too.

I appreciate it, though this might be a bit premature. The biggest missing pieces for non-developers are being able to (1) install services via the web GUI, (2) download models as needed, and (3) control which services are automatically started with the HTTP server.

Maybe it’s too early, but I want to be prepared, so I have a question regarding the recommended hardware platform for a Voice Assistant (VA) in Home Assistant.
If I plan to use a VA, will I need to upgrade my current HassOS host, an Odroid C4 (quad-core Cortex-A55), to a more powerful ARM host (Odroid N2, M1?) or an even more powerful one (x86 Proxmox)?
Or can the voice processing be offloaded to a secondary system (RPi 3) so I can leave my current Hassio system as is?

1 Like

I may have missed it, but I did not find anything about how this version will differ from 2.7. What are the innovations or major changes?

Will openWakeWord also be an option, alongside others, for wake-word recognition?
I tried it recently and the results are just unbelievably precise compared to other embedded options.

Unfortunately I wasn’t able to make an image with Rhasspy 2.5, since it’s based on Debian 10 with Python 3.7, which in combination with the RPi 4’s aarch64 architecture wasn’t compatible with that library, so I had to make a workaround and basically run it outside of Docker. But it would be such a good option.

I’d be happy to contribute though, if that’s a way to go. Especially after I looked at the structure of rhasspy3, which is much clearer compared to 2.5; thanks a lot for that :clap:

2 Likes

What do you think about GitHub - pyannote/pyannote-audio: Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding? Maybe add it too? And GitHub - kssteven418/Squeezeformer: [NeurIPS'22] Squeezeformer: An Efficient Transformer for Automatic Speech Recognition?

Hi
I’m currently a Mycroft (Picroft) user who’s reviewing his approach to voice assistants :slight_smile:

I’ve successfully run through the tutorial (nice - ty, a couple of minor bug reports pending…) and have a few comments/observations (mainly ideas I’ve been turning over for a while)

Firstly, I personally think the ‘event to sound’ stuff should almost be broken out. It’s nice to have it all in the demo, but conceptually I think the focus of a speech assistant is the input side, so I’d tentatively agree with another post that a snapclient-based (or similar) approach to delivering TTS would be reasonably sane and light.

So now, thinking about the whole ‘sound to event’ architecture, my main thought is to question why it is so serial (if I’ve understood it correctly)?

My concern is that this has drawbacks in user-experience latency, as each stage of processing seems to need to finish before the next one starts, and it also means that the control of recording is driven by non-dynamic configuration.

May I suggest that adding a buffer-controller (BC) based approach could be interesting.

Something like this:

mic recording is constantly put into per-mic smallish (5s) ring-buffers.

The buffer controller (BC) provides (websocket & unix socket) access to the buffer using Wyoming.

Wake detection is connected to the BC socket and listens.

When a wake word is detected, the BC is told to record properly and now just grows the buffer; the timestamp at the end of the wake word is available (speech-start). Wake detection would emit an event with the speech-start timestamp (back to the BC for sending to other connections).

Anyway, the buffer now grows, and the BC sends this event and timestamp to VAD, which requests and reads until it eventually detects silence, tells the controller to stop recording, and sends a speech-end event. (So now, does the BC need to be the event broadcaster to send that to ASR, or is this just how ASR gets an EOF pointer?)

At the same time, the BC sends speech-start to any ASR(s), which eventually dispatch via intent recognition (Wyoming again) to an intent. The intent can now send a ‘send-more’ event to the ASR, which sends a start-recording to the BC, which again activates VAD and provides an audio stream to the ASR and a text stream on to the intent.
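
To sketch the buffer side of that (the names and sizes below are purely illustrative, not a proposed API):

    # sketch: a per-mic ring buffer that keeps ~5 s while idle and is
    # "frozen" into a growing recording when the wake word fires
    import collections

    class MicBuffer:
        def __init__(self, max_chunks=250):      # e.g. 250 x 20 ms = 5 s
            self.ring = collections.deque(maxlen=max_chunks)
            self.recording = None                # None = just buffering

        def push(self, chunk: bytes):
            if self.recording is not None:
                self.recording.append(chunk)     # grow beyond the ring
            else:
                self.ring.append(chunk)

        def start_recording(self, keep_last=25): # keep ~0.5 s of pre-roll
            self.recording = list(self.ring)[-keep_last:]

        def stop_recording(self) -> bytes:
            audio = b"".join(self.recording or [])
            self.recording = None
            return audio

The BC would then essentially be this plus the socket plumbing and the speech-start/speech-end events.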

Possibilities:

  • More responsive ASR as it doesn’t need to wait for VAD to trigger before processing
  • Multiple ASR could listen to the BC
  • Audio available for other uses (e.g. debug capture, or false-positive wake word training by a “never mind” intent that asks for the wake-word raw data)
  • Almost continuous natural speech since the wakeword algorithm should be able to provide an entry point into the buffer for ASR.
  • Non-wakeword-triggered recording: an intent says “any more?”. E.g. my shopping skill responds to “We need some [ and …]”, then (via TTS) says “any more?” and starts listening again, rather than me having to re-utter “We need some”; this is much more natural.

A lot of the pipelining work still makes sense btw. If I’m getting it right the buffer-controller would provide Wyoming interfaces over websockets, unix sockets or (via a socket shim) stdin

Anyway… just thought there may be some ideas in here that would spark more discussion

1 Like

Feeling a little dumb right now, but how can I actually install the Rhasspy 3 add-on for Home Assistant?
Here (GitHub - rhasspy/rhasspy3: An open source voice assistant toolkit for many human languages) it states “Install the Rhasspy 3 add-on” and links to GitHub - rhasspy/hassio-addons: Add-ons for Home Assistant's Hass.IO, but I have already added this to my Home Assistant add-on repositories and only see Rhasspy 2.5.

Edit:
I had to remove the add-on, remove the repository and add it back.

Hi there,
Not sure if I’m in the right topic (forgive me if not :roll_eyes:!).
Just found a Hugging Face repo on an “improved” Whisper :face_with_raised_eyebrow:.

As it’s a bit too tech-savvy for me: is it a nice way forward, or not applicable, or not open-source enough?
Thanks for sharing though :slight_smile: