Feel a little like I am in limbo now. HA vs Rhasspy

So after the ESH video showing how to add an ESP32 mic for streaming to HA for wakeword integration, I don't know if I should be continuing to use Rhasspy or if I'm meant to just move to HA now.

Rhasspy v3 hasn't had anything new released since April. 2.5 still seems to be used by a lot of people, and suggestions are discussed on GitHub and on here, but there has been no new release since August 2021.

I like Rhasspy and the RPi 3 or 4 satellite setup I have at the moment, but it almost seems like we are all waiting for HA to just catch up, and everything is going to be HA-based after that.

Where do you see Rhasspy fitting in in the future? Will it still be required?
I get that on-device wakeword in HA seems to be concentrating more on ESP devices than Pis, which I don't really want to do, as my Rhasspy satellites also serve other purposes.

But I'm kinda lost at the moment. I don't know what the right choice is. Anyone have a plan on what they intend to do with Rhasspy? Keep using it? Or just move over to HA voice altogether?


So far with HA the KWS runs on Arm; the ESP32 devices are just always-broadcasting wireless mics.
I am not sure why they haven't run it on an ESP32-S3, as it seems to just use layers that are compatible.
It's from this, but run at 16 kHz rather than 8 kHz: https://arxiv.org/pdf/2002.01322.pdf

M5 Atoms are just broadcasting to a central KWS on a Pi

Actually, I don't think it will run on an ESP32, as the audio embedding model uses a layer that isn't available:
Leaky ReLU

Which is a shame, as strangely the ESP32-S3 has an Alexa-certified front end that we don't have on the Pi.
You can install openWakeWord on a Pi: GitHub - dscripka/openWakeWord: An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity.
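A minimal sketch of what that looks like, assuming the pip package and that the pre-trained models it ships with are already downloaded; it expects 16 kHz mono int16 audio in 1280-sample (80 ms) chunks:

# Sketch only: openWakeWord with its default pre-trained models on a Pi.
# pip3 install openwakeword
import numpy as np
from openwakeword.model import Model

oww = Model(inference_framework="tflite")   # load the default pre-trained models
chunk = np.zeros(1280, dtype=np.int16)      # stand-in for 80 ms of real mic audio
scores = oww.predict(chunk)                 # {model_name: score between 0 and 1}
print({name: round(score, 3) for name, score in scores.items()})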

I hear you. It feels like it’s been a very long wait, with Mike focused totally on Home Assistant Voice Assist. And the upgrade paths are not clear.

But consider that the majority of people currently use Rhasspy with Home Assistant. Nabu Casa are paying Mike to develop his next generation of voice assistant - as open source, which he has stated he intends to integrate into Rhasspy.

My upgrade path

I use Rhasspy 2.5 with 3 RasPi satellites - solely for Home Assistant.

It looks to me that HA Voice Assist now does pretty much all I want from Rhasspy; so I have decided that for me now is the time to swap over to HA Voice Assist. The video and blog point to Mike's homeassistant-satellite, which I have running now on a test RasPi 3 unit. It doesn't do the wakeword locally (whereas Rhasspy 2.5 is running Porcupine locally) - but that does seem to be on the way.

Satellite hardware

RasPi has always been an expensive overkill for a voice satellite, and then not available :frowning: More so if it is only passing audio through. Maybe my RasPi Zero won’t be up to running openWakeWord locally - but I only have one of those and can stand one constant audio stream over my wi-fi.

For the last year I have been putting off purchasing (or recommending) any hardware for voice satellites - and I will continue to wait. The Atom Echo is OK as a proof of concept - but IMHO not up to real world use.

I think ESP32-S3 currently looks like the best hardware base, with the ESP32-S3-BOX-3 being a bit of an overkill, but if the price is low… Of course it needs Open Source versions of the audio processing magic … which seems to be the real issue.

I think Nabu Casa are keeping a close eye on options here, and I guess they may even leverage their ESPHome expertise to bring their own satellite hardware to market. When there are more suitable and cheap devices (i.e. comparable to the amazon and google devices) I expect I will gradually swap over and repurpose my RasPis. That’s the great thing with RasPi - there will always be some other task they can be used for :wink:

Rhasspy v3

As for Rhasspy, Mike has always said Rhasspy is a toolkit - not just for Home Assistant. He has stated his intention to incorporate all he has learned and developed for Nabu Casa into Rhasspy v3 when he gets time. Rhasspy will have lost a lot of its user base (like me) to HA Voice Assist - but that will mean Rhasspy’s focus is very much on all those fancy uses for the toolkit.

I guess Mike will be immersed in HA Voice Assist till at least the end of this year; so I wouldn’t expect much Rhasspy activity till then. Then maybe Rhasspy 3.1 all at once (since almost all the work has already been done).

I do hope Nabu Casa keeps him on their ongoing payroll, as I look forward to seeing what Mike comes out with next.

Your options

Bottom line is that if Rhasspy is working for you now, it will continue to work. There is no hurry for you to make any change.


I really don't like the idea of a 24/7 always-on mic WiFi broadcast and have been scratching my head for alternatives.
I guess you could get a PoE splitter (PoE Splitter with MicroUSB Plug - Isolated 12W - 5V 2.4 Amp | The Pi Hut); then at least it's wired, and maybe use a USB Ethernet adapter and have a network dedicated to that audio.

You can find PoE splitters cheaper on AliExpress or eBay, same with injectors or PoE switches.
You can get HATs, but they seem expensive for what they are.

I really like the ESP32-S3 due to its AFE (Audio Front End), but I have never actually listened to what effect it has on the two split BSS streams and haven't managed to get anyone to post a sample.
I also think the ESP32-S3-BOX takes a $7 microcontroller to approximately $50 of bloated overkill, and it would be great if a simple ADC/DAC for any ESP32-S3 was made available via Seeed.
Still though, the ESP32-S3 with its Alexa-certified AFE only has the Espressif KWS, and they are not that great; or again it would have to be a 24/7 broadcast of the two BSS streams.

They do say a Pi 4 can run multiple instances of openWakeWord, so maybe it's back to the two-channel or single-channel USB soundcard with the MAX9814 mic preamp.
Because you have a preamp you can run quite long mic cables and just have one or two mics in tiny little enclosures, with as many soundcards as you can plug into a hub, or as many instances as you can run on a Pi 4 that is hidden away somewhere.

@donburch For example, a single core of a Raspberry Pi 3 can run 15-20 openWakeWord models simultaneously in real-time.

It does say the above even if I am struggling to believe it…
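If you want to sanity-check that claim on your own hardware, a rough sketch like this (timing the default pre-trained models against silence, so not a proper benchmark) gives a per-core load estimate:

# Sketch only: rough load estimate for the default openWakeWord models.
import time
import numpy as np
from openwakeword.model import Model

oww = Model(inference_framework="tflite")   # default pre-trained models
chunk = np.zeros(1280, dtype=np.int16)      # 80 ms of silence at 16 kHz

start = time.perf_counter()
for _ in range(100):                        # 100 chunks = 8 seconds of audio
    oww.predict(chunk)
elapsed = time.perf_counter() - start

n_models = len(oww.models)
print(f"{n_models} models took {elapsed:.2f} s for 8 s of audio "
      f"(~{100 * elapsed / 8:.0f}% of one core)")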

PS: ReSpeaker 2-mic drivers are failing, as the kernel headers for the latest update have not been published yet; dunno why Raspberry Pi does this.

I struggled to get the example script to run, maybe because it uses PyAudio in blocking mode.
Anyway, I replaced it with sounddevice and it actually works really well so far.

# Copyright 2022 David Scripka. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Imports
import sounddevice as sd
import numpy as np
from openwakeword.model import Model
import argparse
import threading

# Parse input arguments
parser=argparse.ArgumentParser()
parser.add_argument(
    "--chunk_size",
    help="How much audio (in number of samples) to predict on at once",
    type=int,
    default=1280,
    required=False
)
parser.add_argument(
    "--model_path",
    help="The path of a specific model to load",
    type=str,
    default="",
    required=False
)
parser.add_argument(
    "--inference_framework",
    help="The inference framework to use (either 'onnx' or 'tflite'",
    type=str,
    default='tflite',
    required=False
)

args=parser.parse_args()

CHANNELS = 1
RATE = 16000
CHUNK = args.chunk_size
sd.default.samplerate = RATE
sd.default.channels = CHANNELS
sd.default.dtype= ('int16', 'int16')

def sd_callback(rec, frames, time, status):
    audio = np.frombuffer(rec, dtype=np.int16)
    prediction = owwModel.predict(audio)

    # Column titles
    n_spaces = 16
    output_string_header = """
        Model Name         | Score | Wakeword Status
        --------------------------------------
        """

    for mdl in owwModel.prediction_buffer.keys():
        # Add scores in formatted table
        scores = list(owwModel.prediction_buffer[mdl])
        curr_score = format(scores[-1], '.20f').replace("-", "")

        output_string_header += f"""{mdl}{" "*(n_spaces - len(mdl))}   | {curr_score[0:5]} | {"--"+" "*20 if scores[-1] <= 0.5 else "Wakeword Detected!"}
        """

    # Print results table
    print("\033[F"*(4*n_models+1))
    print(output_string_header, "                             ", end='\r')

# Load pre-trained openwakeword models
if args.model_path != "":
    owwModel = Model(wakeword_models=[args.model_path], inference_framework=args.inference_framework)
else:
    owwModel = Model(inference_framework=args.inference_framework)

n_models = len(owwModel.models.keys())

# Start streaming from microphone
with sd.InputStream(channels=CHANNELS,
                    samplerate=RATE,
                    blocksize=int(CHUNK),
                    callback=sd_callback):
    threading.Event().wait()
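To run it, save the script as e.g. oww_sounddevice.py (the name is just an example) and start it with python3 oww_sounddevice.py --inference_framework tflite, optionally adding --model_path to point at a single model file.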

Still though, that is from a single instance.

But I guess we are not talking about instances, just multiple KWs.

The ONNX model also works but takes more load, so I would stick with the tflite model.

@donburch AliExpress do cheap PoE splitters (https://www.aliexpress.com/item/1005006017042745.html), but as Pis want 5.1 V it will likely undervolt; I guess just try it, as you could always snip the cable, bypass the polarity diode and put 5 V directly onto the GPIO.

@Siparker I don’t really have a coherent long-term vision for Rhasspy at the moment, unfortunately. As @donburch said, my focus has obviously been on Home Assistant since I’m employed by Nabu Casa. However, all of the services I’ve built are still based on the Wyoming protocol so interoperability isn’t out of the question.

There are several problems I’m facing that are really shaping my work:

  1. I have so many GitHub repos that responding to issues and reviewing PRs would be a full-time job in itself. By the time I finish responding to one issue, a new one has appeared! I need community maintainers, but I don't really have the time to manage and coordinate a group of open source contributors either. So I end up having to ignore things and shift focus to try and move projects along.
  2. People in this space have such unique setups and requirements that it's virtually impossible to have a "recommended path". I agree with @rolyan_trauts that the S3-BOX-3 is quite impressive, and it looks like it's going to be our "target hardware" going forward. But for many people, custom wake words are an absolute requirement before they would switch from Alexa, Google, etc. This isn't easy to do on the S3-BOX-3, so we went with the streaming approach (which has upsides and downsides). But other people want to reuse old hardware instead, so I created homeassistant-satellite. But other people want to use their HA device with a mic, so I created Assist microphone. But other people want to reuse a Nest hub, or a standalone camera, or…and so on.

Besides fixing bugs and creating tools to make it easier for people without Computer Science degrees to use what already exists, there is pressure to create new things. Piper is seeing a lot of use in the open source blind community, and there has been work done to speed it up/make it sound more natural. I also have a whole side project where I get volunteers to record public domain datasets.

Everyone agrees that the HA intent matcher is too rigid, but many fail to appreciate the difficulty of creating an “A.I.” based matcher that works across 50+ languages (of which I only speak one). The vast majority of solutions won’t run in real time on a Raspberry Pi 4, and those that do almost certainly won’t train that fast. If you rename a light in HA, you expect to immediately be able to refer to it in the next voice command.

So I guess what I’m saying is: I’m struggling to prioritize :smile:


@synesthesiam I don't think the ESP32-S3-BOX is impressive; it's a lot of bloated hardware that you don't need for a KWS mic.
The ESP32-S3 as a microcontroller is impressive, as with its vector instructions it can run ML up to 10x faster than a standard ESP32.
Also Espressif's Audio Front-End Algorithms | Espressif Systems is impressive, as they have an Alexa-certified front end in code, ready to go, for the ESP32-S3; but really the rest of the box is just bloat that takes a $7 microcontroller to $50.

There are certain layers being used in openWakeWord that are not supplied by Espressif, and likely someone would have to write custom layers for TFLite4Micro or GitHub - espressif/esp-dl: Espressif deep-learning library for AIoT applications.
Even though openWakeWord is quite light for a Pi 3, I doubt there is any chance of running it on an ESP32-S3: the BSS splits the audio into two streams, and really it has two KWS running to select the right stream. I really doubt the relatively large-parameter openWakeWord will run on an ESP32-S3, never mind twice over. Also, when it comes to people's WiFi, two 16 kHz audio streams from each mic device running 24/7 is likely not going to be acceptable to many.
I guess you could have those boxes streaming audio over WiFi 24/7, but for me that is a bad idea.

I think what GitHub - dscripka/openWakeWord: An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity. has done with openWakeWord is brilliant, but I totally disagree that people needed custom wakewords to swap from Alexa or Google, as people are used to not having custom wakewords.
Having custom wakewords means the already sparse herd of open source users is now even more sparse, as they have been partitioned into custom KWs; if they had one, or even a couple to choose from, open source could have collated data from use.
All it needed was to get people to opt in, and that huge gulf between big data, with the datasets they have, and open source, which has no good datasets of any size, could have been quickly bridged over time, getting more accurate as more opt in to donate data.

I am not really sure why, when Linux already has audio RTP, you created Wyoming, which needs to be implemented for every type of instance like you have in your repos.

It's really curious what you have done and, you know me, I am completely bemused, as you have taken interoperable Linux audio and made it bespoke with Wyoming, and as you can see, each app in the above has its own implementation!?!
This is the weird thing, as you don't need to get volunteers to record public domain datasets: we have huge public datasets for nearly every language. It's just KW single-word datasets we don't have, and still don't.
Also, apart from each entry being a single-sentence MP3, what datasets are you actually recording and why, as all those languages have huge sentence/ASR datasets already?

You might be right about a Pi 4, but on a Pi 5 or RK3588 you can actually run LLMs, and you can use LangChain to feed entity YAML to an LLM to create output.
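A rough sketch of what I mean, with a made-up entities.yaml and a hypothetical query_llm() standing in for whatever local model you run (llama.cpp or similar); the point is that only the entity names are language-specific, while the prompt scaffolding stays English:

# Sketch only: build an English prompt from HA-style entity YAML for a local LLM.
# query_llm() is hypothetical - wire it to llama.cpp or whatever you run locally.
import yaml  # PyYAML

entities_yaml = """
light.wohnzimmer_lampe:
  friendly_name: Wohnzimmer Lampe
  state: "off"
switch.kaffeemaschine:
  friendly_name: Kaffeemaschine
  state: "off"
"""

def build_prompt(command: str) -> str:
    entities = yaml.safe_load(entities_yaml)
    return (
        "You control a smart home. Entities:\n"
        + yaml.safe_dump(entities, allow_unicode=True)
        + f"\nUser said: {command}\n"
        "Reply with exactly one line: <service> <entity_id>\n"
    )

def query_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical: call your local LLM here")

print(build_prompt("Mach die Wohnzimmer Lampe an"))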
For some reason you have created language-based entities, whilst the only thing that needs any specific language is the name a user applies to it! People do not write code in language-based Python rather than English; there is no German Python, it's just Python, and its commands just happen to be English-based, as most code is. Users don't need to know this because it works in the background, so why do entities need to be language-based apart from what a user names them? Creating a multilingual REST API seems as strange a concept as creating multilingual Python…
As far as I know, when a German Python programmer wants to use the print command, he writes print, not drucken.

Bemused as always, but no, I don't think the ESP32-S3-BOX is impressive.

https://sungkim11.medium.com/list-of-open-sourced-fine-tuned-large-language-models-llm-8d95a2e0dc76

I understand totally. And I want to say that your work on HA is super important and will benefit the most people.

How can I help to put together a team of community maintainers? I have some programming experience with Python and have managed some decent-sized teams in the past.

Do you have time one day or evening to just brainstorm with a few others and discuss what Rhasspy potentially should be?

As you say, the variations in setups, being so different on each person's system, are also the reason many people chose Rhasspy in the first place.

That's always been its power, and with v3 being as modular as it is and using Wyoming, it should have a place and be able to work alongside other systems.

If we can have a goal in mind for Rhasspy v3, or even just a guideline on what you think the plan for it should be, then perhaps a small steering team could vet the potential pull requests, keep on top of the issues, and drop you a once-a-week email to summarise? Or accept the obviously good ones?

That way it's still moving forward with, hopefully, very minimal input from you.

I agree the S3 box is definitely a good choice for satellite hardware, as it appears to have enough power and features to be not just a listening device but also to have multiple uses, which has been my goal from the start. I have been trying (sometimes successfully) to have my Rhasspy combined with a network audio player, and to build it some sort of housing to contain a basic amp and speakers so it's not a tinny mess. But again, that's my setup. I saw ESH add a mic to his Everything Presence One last week, and that was then working as a mic for controlling HA Assist.

The voice datasets thing I am also on board with. I set up and was using the Mimic Training Studio, which can be repurposed for seemingly anything as you feed it the words you want. I wanted to retrain the Jarvis wakeword as I get a few too many false positives, but I hadn't got very far as I don't know how to implement a new wakeword in Rhasspy. And I really wanted to be able to export all of the wakeword detections from my satellites so they could be utilised in the new training - a sort of recursive, continual training method. I have discussed this in the past and I have a plan to implement it at a high level, but I need more understanding of the code to get there.

openWakeWord seems to be the way forward for sure, as it has a really simple setup anyone can use, and if, like me, you want a specific wakeword that doesn't exist elsewhere, then it's the best option at the moment and appears to be the main wakeword project everyone is looking at.

I guess most users of any sort of voice control will assume it's going to be ChatGPT but with voice, so they won't understand why it can't match intents in a way that seems obvious. But that, again, I think would be part of the ecosystem: intent recognition AI models that use a decision tree or similar to determine what the user wants. I appreciate that Apple and Google have enormous LLMs running on their expensive hardware to handle the requests, and that replicating this locally will be difficult. But projects like SuperAGI with locally run LLMs (still using a minimum of a 12 GB Nvidia card, but local nonetheless) are making progress with RLHF in their models, and a model built specifically for HA intent recognition isn't too far-fetched an idea. The HA approach of pipelines plays along with this really well.

IMO an HA intent matcher that can work in 50 languages just won't happen; there are too many differences in intents and meanings between languages. A single model per language, downloadable as required, would be both slimmer in terms of space and much more accessible. A plan for what the speakers of a specific language need to contribute in order to train the model could then be drawn up, and people could opt in to send their data to be utilised in the model, or use the Mimic Training Studio, or a variation of it, to add in their contribution.

Anyway, off topic a bit.

@synesthesiam how can we help? There are plenty of people here willing to put in the time and effort to improve an open source project. Some of us are doing so anyway, just in a disjointed way that isn't benefiting the whole.
I love Rhasspy and would like to have a plan and a heading that I felt others were also on, and that we could all contribute to.

And just to be clear, I am just asking for @synesthesiam's reply here now.


That is the main thing, but on a Pi 5 or RK3588 they do, with projects such as GitHub - ggerganov/llama.cpp: Port of Facebook's LLaMA model in C/C++ quantising the 7B-parameter models down to less than 4 GB.
I think the low-power Pi speech pipeline on a Pi 4 dates from the initial concept, when LLMs had not arrived, and was a maker's thing.
Now though, with advances and freely available models that will run on a Pi 5 or better, we are starting to go far beyond the original concept of Rhasspy.
I think Rhasspy as a brand may have come to an end, as its branding maybe only covers the low end of hardware that you might use.
Everything is LLMs; even Whisper is an LLM-based ASR model that gets its accuracy via an LLM, correcting the errors of other stages by sentence context.
Continue with Rhasspy if you wish, but with more modern ASR things are simpler: at the ASR stage it's not a streaming model but file-based, faster-than-realtime, race-till-idle, which doesn't block like a streaming ASR would for the length of the command sentence in real time.
Likely only 'satellites' will stream to a 'satellite server', which will then queue and pass command sentences as files if, on the odd occurrence, two zones trigger at the same time.
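That file-based, race-till-idle pattern is simple enough to sketch; assuming the faster-whisper package, and that the 'satellite server' drops finished command recordings into a queue of WAV file paths:

# Sketch only: file-based, race-till-idle ASR using faster-whisper.
# The queue of WAV paths is whatever your satellite server produces.
from queue import Queue
from faster_whisper import WhisperModel

model = WhisperModel("tiny.en", device="cpu", compute_type="int8")
wav_queue = Queue()                       # satellite server puts file paths here

def transcribe_next() -> str:
    path = wav_queue.get()                # blocks until a command file arrives
    segments, _info = model.transcribe(path)
    return " ".join(seg.text.strip() for seg in segments)

# wav_queue.put("/tmp/zone1_cmd_0001.wav")
# print(transcribe_next())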

Too much was embedded in Rhasspy last time, and for a small crew it's unnecessary when existing code and frameworks already exist for certain functions.
For audio out there is already Squeezelite running on an LMS zonal server, or Snapcast - tried and trusted wireless audio platforms.
You don't need to embed audio out; you merely have to associate an audio-in zone with an LMS zone or Snapcast zone.
If you employ LMS or Snapcast as your audio server, then any audio delivery is as simple as selecting the correct named pipe.
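As a sketch of what "selecting the correct named pipe" means with Snapcast (assuming a pipe source such as source = pipe:///tmp/snapfifo?name=voice in snapserver.conf, and that the WAV already matches the server's sample format, e.g. 48000:16:2), delivering audio to a zone is just writing raw PCM into that FIFO:

# Sketch only: push a WAV into a Snapcast pipe source.
# Assumes snapserver.conf defines something like:
#   source = pipe:///tmp/snapfifo?name=voice
# and that the WAV sample format matches the server's (e.g. 48000:16:2).
import wave

def play_to_zone(wav_path: str, fifo_path: str = "/tmp/snapfifo") -> None:
    with wave.open(wav_path, "rb") as wav, open(fifo_path, "wb") as fifo:
        chunk = wav.readframes(4096)      # Snapcast reads raw PCM frames
        while chunk:
            fifo.write(chunk)
            chunk = wav.readframes(4096)

# play_to_zone("tts_kitchen.wav")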

There is a real problem with a 24/7 always-on broadcast to a satellite server running one or more KWS, as in dense urban areas WiFi channels can become congested.
If I scan WiFi on my mobile phone there are 12 WiFi networks in range all competing for channels. I run my main computer wired, and even when I am playing with Pis it's wired, because my WiFi varies wildly depending on my neighbours.
I don't think I am alone, and apart from some people really disliking a 24/7 broadcast of anything mics can pick up, many don't have great WiFi, and doubling that by sending the two streams from an ESP32-S3 running the Alexa-certified AFE (Audio Front End) is going to exacerbate that, especially if you are thinking of multiple mics in a zone or multiple zones.

I don't think that is a problem, as GitHub - dscripka/openWakeWord: An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity. runs great at about 50% of a single core on a Raspberry Pi Zero 2, where broadcasting only on a KW hit massively reduces WiFi traffic.
I am pretty sure that unless someone writes custom C/C++ layers for TFLite4Micro or ESP-DL, openWakeWord will not run, as it uses layers in TFLite that don't exist in TFLite4Micro.
TFLite4Micro supports ReLU layers but not Leaky ReLU layers, and others such as GRU and, I think, also LSTM.
Cadence do, but they hold those behind a paywall, and that is why I don't think the ESP32-S3-BOX is impressive: unless you have someone who can write C/C++ layers for ML frameworks, as it stands it isn't going to run locally, because TFLite4Micro doesn't have the layers it runs on in TFLite.
The models Espressif have picked are very simplistic convolution models which they have written layers for, and we only have a subset of what a Pi can run with TFLite.
If you wanted to use the ESP32-S3, you would likely have to use the KW models supplied by Espressif, which are far less accurate than openWakeWord, or design another KWS; I think with the layers available your choice is limited to a more conventional CNN or DSCNN, which is probably easier than writing an ML layer for microcontrollers. (Even if you do manage to get someone who can write a framework layer, I still have doubts whether an ESP32-S3 can run two openWakeWord instances simultaneously, as it does with the box firmware.)

So what GitHub - dscripka/openWakeWord: An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity. has written is perfect, as by starting with always-on broadcast satellites or Pi versions you can add code to capture KWS data and, if users opt in, start collating a dataset that could make a KWS that will run on TFLite4Micro. I still think congesting a home wireless network with always-on broadcast, purely to accommodate custom KWs, is a bad choice.
I am at least hoping HA will promote a KW that allows opt-in data capture, so that smaller ESP32-S3 models can be produced, and some form of on-device training (upstream) that allows models to learn through use as well.

Thanks @Siparker! Yes, let’s do this. I’ll PM you and we can work out a day/time to talk about the details.

My “vision” for Rhasspy is for it to be the place people go when HA’s constraints are too tight. HA has adopted the pipeline model, with swappable components for each stage, so it’s pretty flexible already. And using Wyoming lets you run voice services on any machine you want. But more advanced use cases, like having multiple fallback services for each pipeline stage, are probably not going to be supported in HA out of the box.


What would be involved in making something like homeassistant-satellite or Voice Assistant — ESPHome talk to Rhasspy 3?

Rhasspy already does that; it's getting it to act as a Home Assistant mic that needs implementing.
If you just set up Rhasspy as normal, then it will talk to the Rhasspy intent recognition and act accordingly.