Rhasspy 3 Developer Preview

At long last, a developer preview of Rhasspy 3 is ready :partying_face:
Check out the tutorial to get started!

I’m calling this a “developer preview” because this is not Rhasspy 3’s final form. It’s missing a lot of pieces, including a user-friendly Docker image. But here are some exciting things that do work:

Pipelines

A pipeline in Rhasspy 3 is basically the configuration for an entire Rhasspy 2 system: the mic, wake, asr, vad, etc. systems and their settings. You can have as many pipelines as you want in Rhasspy 3, and they can be run continuously or on-demand in response to HTTP/Websocket calls.

Pipelines can share access to small speech to text or text to speech servers too, so you don’t have to keep separate copies of models in memory.

Satellites

Rhasspy 3 has been designed for satellites from the ground up. Once you have the HTTP server running on your base station, setting up a satellite is pretty easy.

You don’t have to use Rhasspy for your satellites, though! The Websocket API lets you stream raw audio after your wake word is detected, and receive raw audio back with a text to speech response. In fact, you can just create a pipeline on the server for each satellite with the “mic” (audio in) and “snd” (audio out) programs being something like GStreamer.
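
To make that concrete, here is a rough sketch of what a bare-bones satellite client could look like with the Python websockets library. The endpoint URL, port, and chunk size below are placeholders, not the documented API; only the general flow (binary audio frames in, an empty binary frame to finish, binary TTS audio back) comes from the description above.

import asyncio
import sys

import websockets  # pip install websockets

# Placeholder URL: substitute the real Rhasspy 3 WebSocket endpoint and port.
SERVER_URL = "ws://base-station:13331/pipeline/asr-tts"
CHUNK_BYTES = 2048  # raw 16-bit mono PCM read from stdin (e.g. piped from arecord)


async def run_satellite() -> None:
    async with websockets.connect(SERVER_URL) as ws:
        # Stream microphone audio as binary frames; in practice you would start
        # this only after the wake word fires and stop when VAD says so.
        while True:
            chunk = sys.stdin.buffer.read(CHUNK_BYTES)
            if not chunk:
                break
            await ws.send(chunk)

        # An empty binary message tells the server the audio stream is done.
        await ws.send(b"")

        # Binary frames coming back are the raw text to speech response; pipe
        # them to stdout so something like aplay can play them.
        async for message in ws:
            if isinstance(message, bytes):
                sys.stdout.buffer.write(message)


if __name__ == "__main__":
    asyncio.run(run_satellite())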

Where is sentences.ini?

This is currently missing from Rhasspy 3, but for good reason. With the release of faster-whisper, it’s now possible to run the “tiny” model on a Raspberry Pi 4 with decent performance and accuracy. And with the Assist feature in Home Assistant (which I wrote :wink:), you can send the transcript from Whisper directly in without a separate intent recognizer.

Custom sentences and intents will be possible in the future for Rhasspy 3, but I’ve obviously focused on the use case that’s centered around my job at Nabu Casa so I can kind of have a life :laughing:

Feedback and Contributions

I’m looking for feedback, mostly from developers at this stage. The pipeline system is powerful, but lacks some features that I’d like to get design ideas for. For example, there needs to be a way for each stage of the pipeline to pass custom data to the next.

Contributions are welcome, but I'd caution anyone against spending too much time implementing things while the API is still in flux. Bug fixes and discussion about architecture limitations would be best :slight_smile:

15 Likes

I was looking at the benchmarks and thought wow, either faster-whisper is really that fast or whisper.cpp is that slow, but I don't remember whisper.cpp being that slow.
So I had to give whisper.cpp a refresh:

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |

main: processing 'samples/gb1.wav' (3179927 samples, 198.7 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:09.000]   My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:09.000 --> 00:00:18.000]   At 9 o'clock this morning, mission control in Houston lost contact with our space shuttle Columbia.
[00:00:18.000 --> 00:00:24.000]   A short time later, debris was seen falling from the skies above Texas.
[00:00:24.000 --> 00:00:29.000]   The Columbia's lost. There are no survivors.
[00:00:29.000 --> 00:00:37.000]   On board was a crew of seven, Colonel Rick Husband, Lieutenant Colonel Michael Anderson,
[00:00:37.000 --> 00:00:46.000]   Commander Laurel Clark, Captain David Brown, Commander William McCool, Dr. Kultna Shavla,
[00:00:46.000 --> 00:00:52.000]   and Ilan Ramon, a colonel in the Israeli Air Force.
[00:00:52.000 --> 00:00:58.000]   These men and women assumed great risk in the service to all humanity.
[00:00:58.000 --> 00:01:02.000]   in an age when spaceflight has come to seem almost routine.
[00:01:02.000 --> 00:01:06.000]   It is easy to overlook the dangers of travel by rocket
[00:01:06.000 --> 00:01:11.000]   and the difficulties of navigating the fierce outer atmosphere of the Earth.
[00:01:11.000 --> 00:01:17.000]   These astronauts knew the dangers, and they faced them willingly,
[00:01:17.000 --> 00:01:21.000]   knowing they had a high and noble purpose in life.
[00:01:21.000 --> 00:01:26.000]   Because of their courage and daring and idealism,
[00:01:26.000 --> 00:01:39.000]   we will miss them all the more. All Americans today are thinking as well of the families of these men and women who have been given this sudden shock and grief.
[00:01:39.000 --> 00:01:51.000]   You're not alone. Our entire nation grieves with you. And those you love will always have the respect and gratitude of this country.
[00:01:51.000 --> 00:01:55.720]   The cause in which they died will continue.
[00:01:55.720 --> 00:02:04.120]   Mankind is led into the darkness beyond our world by the inspiration of discovery and
[00:02:04.120 --> 00:02:07.000]   the longing to understand.
[00:02:07.000 --> 00:02:11.160]   Our journey into space will go on.
[00:02:11.160 --> 00:02:16.480]   In the skies today, we saw destruction and tragedy.
[00:02:16.480 --> 00:02:22.040]   farther than we can see, there is comfort and hope.
[00:02:22.040 --> 00:02:29.280]   In the words of the prophet Isaiah, "Lift your eyes and look to the heavens."
[00:02:29.280 --> 00:02:31.640]   Who created all these?
[00:02:31.640 --> 00:02:39.260]   He who brings out the story hosts one by one and calls them each by name.
[00:02:39.260 --> 00:02:46.400]   Because of His great power and mighty strength, not one of them is missing.
[00:02:46.400 --> 00:02:53.580]   The same Creator who names the stars also knows the names of the seven souls we mourn
[00:02:53.580 --> 00:02:55.580]   today.
[00:02:55.580 --> 00:03:03.140]   The crew of the shuttle Columbia did not return safely to Earth, yet we can pray that all
[00:03:03.140 --> 00:03:05.820]   are safely home.
[00:03:05.820 --> 00:03:12.640]   May God bless the grieving families and may God continue to bless America.
[00:03:12.640 --> 00:03:22.640]   [BLANK_AUDIO]


whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   655.10 ms
whisper_print_timings:      mel time =  1611.87 ms
whisper_print_timings:   sample time =   436.05 ms /   520 runs (    0.84 ms per run)
whisper_print_timings:   encode time = 82173.55 ms /     8 runs (10271.69 ms per run)
whisper_print_timings:   decode time = 34584.14 ms /   520 runs (   66.51 ms per run)
whisper_print_timings:    total time = 119509.48 ms

real    1m59.635s
user    7m49.414s
sys     0m2.604s

faster_whisper

orangepi@orangepi5:~/faster-whisper$ time OMP_NUM_THREADS=4 python3 my_script.py
Detected language 'en' with probability 0.989548
[0.00s -> 9.00s]  My fellow Americans, this day has brought terrible news and great sadness to our country.
[9.00s -> 18.00s]  At 9 o'clock this morning, mission control in Houston lost contact with our space shuttle Columbia.
[18.00s -> 24.00s]  A short time later, debris was seen falling from the skies above Texas.
[24.00s -> 29.00s]  The Columbia's lost. There are no survivors.
[29.00s -> 37.00s]  On board was a crew of seven, Colonel Rick Husband, Lieutenant Colonel Michael Anderson,
[37.00s -> 44.00s]  Commander Laurel Clark, Captain David Brown, Commander William McCool,
[44.00s -> 52.00s]  Dr. Kultna Shavla, and Ilan Ramon, a colonel in the Israeli Air Force.
[52.00s -> 58.00s]  These men and women assumed great risk in the service to all humanity.
[58.00s -> 63.00s]  In an age when space flight has come to seem almost routine,
[63.00s -> 67.00s]  it is easy to overlook the dangers of travel by rocket
[67.00s -> 72.00s]  and the difficulties of navigating the fierce outer atmosphere of the Earth.
[72.00s -> 78.00s]  These astronauts knew the dangers, and they faced them willingly,
[78.00s -> 83.00s]  knowing they had a high and noble purpose in life.
[83.00s -> 90.00s]  Because of their courage and daring and idealism, we will miss them all the more.
[90.00s -> 96.00s]  All Americans today are thinking as well of the families of these men and women
[96.00s -> 100.00s]  who have been given this sudden shock and grief.
[100.00s -> 105.00s]  You're not alone. Our entire nation grieves with you,
[105.00s -> 112.00s]  and those you love will always have the respect and gratitude of this country.
[112.00s -> 116.00s]  The cause in which they died will continue.
[116.00s -> 121.00s]  Mankind is led into the darkness beyond our world
[121.00s -> 127.00s]  by the inspiration of discovery and the longing to understand.
[127.00s -> 131.00s]  Our journey into space will go on.
[131.00s -> 136.00s]  In the skies today, we saw destruction and tragedy.
[136.00s -> 142.00s]  As farther than we can see, there is comfort and hope.
[142.00s -> 145.00s]  In the words of the prophet Isaiah,
[145.00s -> 151.00s]  lift your eyes and look to the heavens who created all these,
[151.00s -> 155.00s]  he who brings out the story hosts one by one
[155.00s -> 159.00s]  and calls them each by name.
[159.00s -> 163.00s]  Because of his great power and mighty strength,
[163.00s -> 166.00s]  not one of them is missing.
[166.00s -> 169.00s]  The same Creator who names the stars
[169.00s -> 175.00s]  also knows the names of the seven souls we mourn today.
[175.00s -> 180.00s]  The crew of the shuttle Columbia did not return safely to Earth,
[180.00s -> 185.00s]  yet we can pray that all are safely home.
[185.00s -> 189.00s]  May God bless the grieving families
[189.00s -> 194.00s]  and may God continue to bless America.

real    1m22.424s
user    6m13.880s
sys     0m17.584s

Maybe it's the CPU they chose (a Xeon(R) Gold 6226R), but I'm surprised the RK3588S is faster than a Xeon Gold 6226R on 4 cores…
faster-whisper is faster than whisper.cpp, but not >6x faster. The 8-bit conversion seems to run great, but I'm confused by the benchmark figures posted on GitHub. Is anyone else seeing the same?

The tiny model runs at approximately the same ratio as the small model, which I chose first since that was what the published benchmarks used.

whisper.cpp_tiny

real    0m18.860s
user    1m6.639s
sys     0m0.945s

faster_whisper_8bit_tiny

real    0m16.380s
user    1m10.504s
sys     0m2.735s

That is purely curiosity about the benchmarks posted at GitHub - guillaumekln/faster-whisper: Faster Whisper transcription with CTranslate2.

Surprised you started with Whisper though, as even the tiny model isn't the fastest on a Pi, but I will have to see what benchmarks you get with tiny.

Do you have anything in the pipeline based on a custom LM, say with n-grams incorporating Assist entity construction?

I’m getting a similar relative speed-up on my Ryzen 5950X: faster-whisper is about 1.4x faster than whisper.cpp. So I think their claims are a bit overblown, but maybe they were comparing to a really old version of whisper.cpp?

Yes, the HassIL library that Assist uses for intent recognition can generate all possible sentences too (when you have the entity/area lists present). I plan to use the same trick I did in Rhasspy, where I generate the ngram counts directly from the sentence templates (instead of templates → sentences → ngram counts).
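
To show the trick in miniature, here is a toy sketch. It is not HassIL's actual API or template format: a flat list of slots, each with a few alternatives, stands in for the real templates. The point is that bigram counts can be computed directly from the template by weighting each bigram with the number of expansions it appears in, instead of generating every sentence first.

from collections import Counter
from itertools import product
from math import prod

# Toy "template": each slot is a list of alternatives, and an alternative may be
# several words. Real templates (optional parts, lists, rules) are richer.
TEMPLATE = [
    ["turn"],
    ["on", "off"],
    ["the"],
    ["bed light", "kitchen light", "fan"],
]


def bigram_counts_from_template(template):
    """Count bigrams over all expansions without enumerating the sentences."""
    counts = Counter()
    sizes = [len(slot) for slot in template]

    # A bigram inside one alternative occurs once per expansion of the other slots.
    for i, slot in enumerate(template):
        weight = prod(sizes[:i] + sizes[i + 1:])
        for alt in slot:
            words = alt.split()
            for a, b in zip(words, words[1:]):
                counts[(a, b)] += weight

    # A bigram across a slot boundary pairs (last word, first word) of adjacent
    # alternatives, weighted by the expansions of all the remaining slots.
    for i in range(len(template) - 1):
        weight = prod(sizes[:i] + sizes[i + 2:])
        for left in template[i]:
            for right in template[i + 1]:
                counts[(left.split()[-1], right.split()[0])] += weight

    return counts


def bigram_counts_by_enumeration(template):
    """Reference: expand every sentence (templates -> sentences), then count."""
    counts = Counter()
    for combo in product(*template):
        words = " ".join(combo).split()
        counts.update(zip(words, words[1:]))
    return counts


assert bigram_counts_from_template(TEMPLATE) == bigram_counts_by_enumeration(TEMPLATE)

The same weighting idea extends to optional parts and nested rules, which is where skipping the full enumeration pays off once the entity lists get large.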

1 Like

Yeah, I am only getting 1.151x, but whisper.cpp is optimised for the newer Apple ARM silicon, which also seems to suit the RK3588S; I haven't tried a Pi yet.
Yeah, that is what I was thinking, but I'm still wondering whether the option should be open to have a simple, small, fast predicate ASR plus rules to route to a secondary ASR.
An LM of just the predicates/subjects of Assist entities should make for a fast, accurate ASR specific to Assist, where an n-gram model is generated on demand.

That is the weird thing about ASR: Whisper is a great long-sentence conversational ASR, but from playing with it, it can be no better, and even worse, than an ASR with a specific LM, which is conversely just as bad at long-sentence conversational ASR.
So how do you employ a middle layer (a simple predicate ASR) to route to the ASR most suited, or is there a fallback list where, if one fails, it will try a secondary…

Looks like you have been very busy. Assist looks great and would likely benefit much from an ASR model with an entity-subset-specific language model.
Would failover to Whisper work as an option, since a constrained ASR would likely be much faster than Whisper as well?

This seems ideal, yeah. The tricky part is knowing when exactly to fail over, as these systems will always return something from the training set. The “confidence” scores from systems like Kaldi are really just the probability of the guessed sentence relative to the ngram model, not relative to all possible spoken sentences.

I have some prototypes of a constrained ASR system (based on Kaldi) that can guess pretty accurately when something spoken is outside its tiny LM. With that, we should have the best of both worlds :slight_smile:


Another option I’ve considered is using something like string-edit distance to check the Whisper transcript against a small LM, and “repair” the transcript if it’s close enough.

For example, Whisper often hears “turn off the bad light” instead of “turn off the bed light”. That’s only 1 replace op, so it could be repaired.
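
Here is a minimal sketch of that repair idea. The word-level distance, the candidate list, and the one-edit threshold are just illustrative choices, not necessarily how Rhasspy will do it.

def word_edit_distance(a, b):
    """Levenshtein distance over words (insert / delete / substitute)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete wa
                cur[j - 1] + 1,            # insert wb
                prev[j - 1] + (wa != wb),  # substitute
            ))
        prev = cur
    return prev[-1]


def repair_transcript(transcript, known_sentences, max_ops=1):
    """Snap a Whisper transcript to the closest known sentence if it is close enough."""
    words = transcript.lower().split()
    best, best_dist = None, max_ops + 1
    for sentence in known_sentences:
        dist = word_edit_distance(words, sentence.lower().split())
        if dist < best_dist:
            best, best_dist = sentence, dist
    return best if best_dist <= max_ops else transcript


known = ["turn off the bed light", "turn on the bed light"]
print(repair_transcript("turn off the bad light", known))
# -> "turn off the bed light" (one substitution away)

A long conversational sentence would need far more than one edit, so it would fall through untouched (or trigger whatever failover comes next).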

Quick question… I skimmed the Assist documentation but couldn’t find an answer… is it possible to disable the default sentences?
Rhasspy regularly misunderstands me, often under difficult circumstances, like when I have the tap running, but still. So unless the speech recognition miraculously became flawless in Rhasspy 3, I certainly wouldn't want Rhasspy to have the power to freely toggle any entity on or off. I'm fine with Rhasspy occasionally toggling the wrong light, but I certainly wouldn't want it to turn off the heating, unlock the car, falsely log a diaper change, or things like that.

That, and the token length I guess, as the duration of a command versus a conversational sentence is distinct?

I also searched for how you enable/disable entities and whether the default is enabled or disabled, or something along those lines; I guess the documentation just takes time.
Will it be posted to GitHub or here?

1 Like

We’re going to add the ability to decide exactly which entities/domains to expose to Assist. Later, we also plan to add confirmation for specific things, like locks and garage doors.

Yeah, that should work to our advantage since longer conversational sentences should require more “repairs” and exceed some threshold quickly.

1 Like

Whisper is fine for a technology preview, and Assist as an inference-based skill so far looks really good to me.
Congrats on getting it all floated.

1 Like

Pretty interested to know what you put together for this :blush: Congrats on the dev preview. Looks good.

1 Like

The Websocket API is awesome!
I think it will simplify things for most people using a satellite, because you don't have to worry about much on the satellite anymore.

1 Like

Yeah, good to see WebSockets; it could maybe extend into a networked pipeline that runs on other instances and just links them up in a serial chain.
Compression might be a useful option (Compression - websockets 10.4 documentation).
You can check whether the received data is str or binary, so you can easily separate the audio stream from the protocol (Server - websockets 10.4 documentation).

The client can “end” the audio stream by sending an empty binary message.

A start message could also be handy, along with a text protocol ("start"/"stop"…); the control flow of text and binary frames helps differentiation…
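
For instance, here is a minimal server-side sketch with the websockets library that separates text control frames from binary audio; the handler, port, and "start"/"stop" handling are illustrative, not Rhasspy's actual protocol:

import asyncio

import websockets  # pip install websockets


async def handler(websocket):
    """One connection carries interleaved control (text) and audio (binary) frames."""
    audio_chunks = []
    async for message in websocket:
        if isinstance(message, bytes):
            if not message:
                # Empty binary frame: the client finished streaming audio.
                print(f"got {sum(len(c) for c in audio_chunks)} bytes of audio")
                audio_chunks.clear()
            else:
                audio_chunks.append(message)
        else:
            # Text frames carry the control protocol ("start", "stop", JSON, ...).
            print("control message:", message)


async def main():
    # permessage-deflate compression is enabled by default and can be tuned or
    # disabled via the compression argument of serve().
    async with websockets.serve(handler, "0.0.0.0", 10600):
        await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())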
Never been a fan of the websockets documentation though, so it's great just to see something that works :slight_smile: good job.

1 Like

@synesthesiam

Awesome news to see the next gen of Rhasspy coming together. Looks like you’ve been hard at work both here and at Nabu Casa (I’ve been watching the intents project). Thank you for everything you’ve done for this community! Having a truly local voice assistant is the piece I needed to step into the voice assistant world.

As a semi-new member of the community and a potential skill developer, selfishly, I'm wondering if you could share your plans for intent handling in the future? I see Rhasspy 3 works with HA right now, but you seem to say that isn't the long-term plan. Do you plan to go back to MQTT-type connectivity, purely WebSockets, etc.? Very interested. I recently started development on some skills for Grocy and Jellyfin for Rhasspy 2.5 using the Rhasspy Hermes App project as my basis, and released a polished time skill and timer skill, but now it seems you may be moving away from that model? Am I seeing that right? If so, I'm all for learning a new path; I'm just interested in what that path may be so I can get started ASAP.

Also, as someone who has been trying to develop some useful skills on 2.5, a suggestion for the Rhasspy API going forward: if you think of the use case for these "devices", some of the very first things people will want to do with them are basic OS-level tasks, like connecting a Bluetooth headset, turning the volume down, or switching to a different audio output (maybe even turning a screen on and off). If you can build the ability to manipulate those things on all devices in the Rhasspy infrastructure via the API, then intents can be built to take advantage of that. I am also working on something like that for 2.5; I call it a satellite skill, designed to run on the satellite directly, still Rhasspy Hermes App based, but it registers dynamically named skills on the MQTT bus (the name is based on the intent action plus the satellite name, so each satellite gets unique intents to control itself). Right now I have basic volume control working: volume up/down/set/mute. Bluetooth intents were next. Again, I'm looking at how to "migrate" this to Rhasspy 3.

Again thank you for everything! Please don’t take any of this as complaining. Just hoping to catch the train as it’s leaving the station and not be too late!

1 Like

Just hoping to catch the train as it’s leaving the station and not be too late!

Yeah, it's a great time to discuss before things become set in stone. Rhasspy 3.0 is probably a bad name, as it tends to indicate that Rhasspy 2.5 has an end.
Reading @JoeSherman, I'm thinking the reason for the huge divergence in Rhasspy 3.0 is partially the MQTT protocol that 3.0 is trying to make a clean break from.

If your mindset and solution fit the Rhasspy 2.5 model, then develop on Rhasspy 2.5. Rhasspy 2.5 has the more descriptive name, as Raspberry/Hass/Pi has strong connotations, whilst Rhasspy 3.0 is already employing tech like Whisper that stretches any Pi hardware; by their nature, many current SotA models use beam searches and context and are better suited to race-till-idle than streaming.
Also, what Hass has done with Assist (Assist - Talking to Home Assistant - Home Assistant) is absolutely amazing stuff, and Rhasspy is no longer so ill placed. Hass has moved to an inference-based skill that the bigger user and dev community of Hass will supply, maximising support and usage, and it can now utilise any voice system that can supply an inference.

That's a win/win situation, as Hass now gets a great inference-based skill and Rhasspy gets this really great skill, and that is an important difference between the 2.5 and 3.0 models. I am not kidding about the name though; it's 3.0 that maybe should change, so both can continue with clear and distinct directions.

For Hass, the Assist module has just opened up so many avenues, as a dev on any platform only needs to feed it an intent, and there is a clear partition between the voice system and a skill server.

Again, it's a different mindset, but I instantly question why embed audio functionality when great open-source wireless audio projects with bigger user bases, such as Squeezelite or Snapcast, plus a whole load of RTP libraries and audio servers, will do this.
An inference-based skill, just like Hass's, will get more users and likely become stronger because of the obvious interoperability an inference-based skill adds, compared with embedded functionality.

Hass does have an audio system that 3.0 can likely use for this already, if you employed the Hass audio system. I am not a Hass user and read some articles quite a while ago, but dev is ongoing and the rest I have forgotten.
In 3.0 I could tell you how to employ this with wireless audio quite easily: part of the TTS module should be an audio router, and with a ready-made wireless audio system, be it Hass, Squeezelite, or Snapcast, installing clients is not a huge endeavour. Why repeat dev work when we will never achieve what they already offer in wireless audio?

I am not saying the idea is bad, just that it's a bad fit, but reading through the repo, things like the Wyoming protocol are a huge paradigm shift from 2.5; so much so that I wonder whether these are now distinct companion projects?

2.5 might get more Pi-based focus, whilst 3.0 is more a question of how to integrate SotA models and applications that have some freakishly insane ML going on.
There seems to have been a lot of work and a big shift in direction, so it's either understanding 3.0 and ignoring 2.5 until things are clearer, or continuing with 2.5.

Hi @synesthesiam
Thanks for all your hard work with Rhasspy and Home Assistant!
I tested out the HA addon and it works very well, but I noticed one thing that may be a bug.
It seems that when the TTS gets a large chunk of text, you can't hear anything coming back from Larynx, though the log shows it is sent:
DEBUG:rhasspy3.program:client_unix_socket.py ['var/run/faster-whisper.socket']
DEBUG:rhasspy3.program:vad_adapter_raw.py ['--rate', '16000', '--width', '2', '--channels', '1', '--samples-per-chunk', '512', 'script/speech_prob "share/silero_vad.onnx"']
DEBUG:rhasspy3_http_api.pipeline:stream-to-stream: voice started
DEBUG:rhasspy3_http_api.pipeline:stream-to-stream: voice stopped
INFO:faster_whisper_server: What's the weather like?
DEBUG:rhasspy3_http_api.pipeline:stream-to-stream: asr=Transcript(text=" What's the weather like?")
DEBUG:rhasspy3.program:handle_adapter_text.py ['bin/converse.py --language "" "http://supervisor/core/api/conversation/process" "/app/config/data/handle/home_assistant/token"']
DEBUG:rhasspy3.handle:handle: input=Transcript(text=" What's the weather like?")
DEBUG:rhasspy3.handle:handle: Handled(text='Currently the weather is sunny, with a temperature of 43 degrees. Under present weather conditions the temperature feels like 36 degrees. In the next few hours the weather will be more of the same, with a temperature of 43 degrees.')
DEBUG:rhasspy3_http_api.pipeline:stream-to-stream: handle=Handled(text='Currently the weather is sunny, with a temperature of 43 degrees. Under present weather conditions the temperature feels like 36 degrees. In the next few hours the weather will be more of the same, with a temperature of 43 degrees.')
DEBUG:rhasspy3_http_api.pipeline:stream-to-stream: sending tts
DEBUG:rhasspy3.program:client_unix_socket.py ['var/run/larynx2.socket']
Real-time factor: 0.107471 (infer=1.27417 sec, audio=11.856 sec)
DEBUG:rhasspy3_http_api.pipeline:stream-to-stream: tts done

Also noticed after the request my dev console gets flooded with this:
[screenshot]

Figured I’d call it out :slight_smile: When I tested other commands that had shorter text they all seemed to work fine.

EDIT: Opened an issue on GitHub as I figured that was the more proper channel for this :smiley:

1 Like

You’re welcome! This is great to hear :slight_smile:

My hope is that the type of connectivity won’t matter quite so much going forward. Rhasspy 3 pipelines are just made up of programs, so it’s not a problem if one or more of them connect out to MQTT. I could see a small bridge server for Hermes being necessary to make it work smoothly, but nothing requiring a total rewrite of the skills.

No worries :wink: The plan is not to deprecate MQTT/Hermes, but to see it as just one way among many to connect things to a voice assistant.

My hope was to boil skills, etc. down to their bare essentials in Rhasspy v3. For example, the handle domain can take programs that take text in (speech to text transcript) and push text out (text to speech response). So this is a valid “skill”:

cat

which just repeats back whatever you say! You need an adapter, etc. of course for Rhasspy 3, but the complete YAML isn’t too much worse:

  handle:
    repeat:
      command: |
        cat
      shell: true
      adapter: |
        handle_adapter_text.py
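
Swapping cat for a small script gives a slightly more useful (and still hypothetical) handler. Assuming the same text-in/text-out contract as the cat example, the program just reads the transcript on stdin and prints its response:

#!/usr/bin/env python3
# Hypothetical handle program: transcript in on stdin, reply out on stdout,
# the same contract that cat relies on above.
import sys
from datetime import datetime

text = sys.stdin.read().strip().lower()

if "time" in text:
    print(datetime.now().strftime("It is %I:%M %p."))
else:
    # Fall back to repeating, just like cat.
    print(f"You said: {text}")

Point the command in the YAML above at this script instead of cat and the rest of the pipeline stays the same.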

I considered renaming, but to me Rhasspy v3 is delivering more on the promise of Rhasspy being a toolkit for building your own voice assistant.

For example, you can build a satellite now that consists entirely of two GStreamer processes: one streaming mic audio to a base station, and one playing audio from a base station. You can add OPUS or whatever compression, and have an always-on satellite with the wake word running on the base station. Or you can play back audio to a Bluetooth speaker, etc. Just change the snd program :nerd_face:

I still think there are two camps: those who just want a relatively simple web-based setup to suit various hardware, aka various Raspberry Pis.
Then there are those who may be looking at multi-room, home-control, distributed audio employing cutting-edge SotA ML models that could scale into money-is-no-object territory…
I was just wondering if simplicity and complexity were at odds with each other, as distinct paths.

I think the Assist inference-based skill is absolutely amazing and wish more projects supplied an inference interface, as it is super convenient for any voice assistant and so much better for a skill to be supported by the project it's a skill for.
It's very likely my choice will be Hass due to the Assist module now, but my focus on modern distributed audio means an embedded audio system isn't of much interest, as I intend to have a singular base station that will route audio to an audio server and don't really have such a thing as a "satellite".

I like the WebSocket interface and I like the emphasis on being a toolkit for building your own voice assistant, especially since you can wrap a module in the Wyoming protocol and almost drop in a stream to stdout.
Still curious why not something like GStreamer, to try to minimise high-level Python doing DSP.

I never did check that article, but the lib example is much better.
Would it not have been great to have GStreamer ASR & TTS modules?

In fact, I didn't know Kaldi had a GStreamer plugin until I wrote that (Kaldi: Online Recognizers). Whilst I was googling, I found some others:
Gst-nvdsasr — DeepStream 6.2 Release documentation (but Nvidia…)
Using PocketSphinx with GStreamer and Python – CMUSphinx Open Source Speech Recognition

This is what I want to do - preferably running on a Pi or VM hosted in the cloud.
And I think it makes sense to not focus on the home-automation config that HASS already has covered.

:partying_face:

1 Like

There is definitely an overhead, though Rhasspy 3’s core could technically be rewritten in something like Rust for less overhead. But the majority of the time spent on the Python side is in async await around subprocesses, so the only real savings would probably be in fewer memory allocations.

The GStreamer pipeline paradigm seems like it would be a good fit for Rhasspy too, since it can describe an actual graph and not just a simple feed-forward pipeline.
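
As a sketch of why the graph idea appeals (this is plain GStreamer via PyGObject, not anything Rhasspy ships today): one microphone capture teed into two consumers, say a wake word detector and an ASR feeder, with the consumers stubbed out as prints.

import gi

gi.require_version("Gst", "1.0")
from gi.repository import GLib, Gst

Gst.init(None)

# A graph, not a straight pipeline: tee splits the 16 kHz mono capture two ways.
pipeline = Gst.parse_launch(
    "autoaudiosrc ! audioconvert ! audioresample "
    "! audio/x-raw,rate=16000,channels=1,format=S16LE "
    "! tee name=t "
    "t. ! queue ! appsink name=wake emit-signals=true sync=false "
    "t. ! queue ! appsink name=asr emit-signals=true sync=false"
)


def on_sample(appsink, label):
    # Pull the raw PCM buffer and hand it to the consumer (stubbed with a print).
    sample = appsink.emit("pull-sample")
    buf = sample.get_buffer()
    data = buf.extract_dup(0, buf.get_size())
    print(f"{label}: {len(data)} bytes of audio")
    return Gst.FlowReturn.OK


pipeline.get_by_name("wake").connect("new-sample", on_sample, "wake")
pipeline.get_by_name("asr").connect("new-sample", on_sample, "asr")

pipeline.set_state(Gst.State.PLAYING)
try:
    GLib.MainLoop().run()
finally:
    pipeline.set_state(Gst.State.NULL)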

1 Like