Slightly confused about rhasspy-satellite

Like most remote audio software and protocols, you can generally use it as either a standard ALSA or PulseAudio device, and sometimes both.
Where are you going to fit it in? It already presents as a standard ALSA device:

# Make the default device a plug that resamples into the snapcast FIFO
pcm.!default {
    type plug
    slave.pcm rate48000Hz
}

# Resample to 48 kHz, 16-bit before handing off to the file plugin
pcm.rate48000Hz {
    type rate
    slave {
        pcm writeFile # direct to the plugin below, which writes to a file
        format S16_LE
        rate 48000
    }
}

# Write the raw samples into the FIFO that snapserver reads
pcm.writeFile {
    type file
    slave.pcm null
    file "/tmp/snapfifo"
    format "raw"
}

Be it snapcast or roc

https://roc-project.github.io/
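For the snapcast case, the server end just reads the FIFO the config above writes into, and each satellite runs a client. Roughly like this (a sketch from memory; the stream option and flags should be checked against the snapcast docs):

snapserver -s 'pipe:///tmp/snapfifo?name=default'   # on the server, read the FIFO
snapclient -h <server-ip>                           # on each satellite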

There is nothing to fit in, as these present as standard Linux audio devices.
The Hermes protocol obviously is important to you, but rhasspy-satellite has no need to speak to other modules, because rhasspy-satellite should be what the name describes: a satellite of a rhasspy-server, not a clone of a rhasspy-server.
It doesn't need to fit in or need extra software, since on the server it uses standard ALSA configuration. Or are you now going to claim ALSA needs to fit in, and that this is why we need hermes-audio?

Call the repo rhasspy-clone or rhasspy-mesh, but don't deceive and present it as a lightweight satellite of Rhasspy.

I really do not understand where you want to go with this.
There is no presentation of a lightweight satellite; there is just one example of how you could set up satellites and a server.

That’s it, no more no less.

What is the point you want to make in this discussion? I do not get it.
If you can create a server and satellite another way than is currently available, that is perfectly fine.
I have created an MQTT audio streamer on an ESP32, which was what I needed.
The Hermes protocol is indeed important, because the choice was made to use it as the way the various modules communicate. It is not important to me as a person; it was adopted from Snips when Snips was shut down.

Again, I am currently confused about the point you want to make.


No, you are not confused, but yes, it is purely something you want to do.

Many thanks for at last some honesty.

@ulno

Also, it's not just snapcast; it's any RTP audio in a really simple system for satellites, where any audio after KWS detection, capture and playback is just a stream.
A 1-second delay is nothing compared to some systems that wait for end-of-speech silence before presenting complete sentence WAVs to ASR. I am not sure where you get the 1-second delay from, but there will be delay from the KWS unless you have a permanent stream.

Someone or some project will catch on that high-end commercial systems use distributed wide-area microphone arrays feeding server-based audio processing, which for consumer voice AI has extremely strong parallels with wireless audio.
The low end can create low-cost speaker/mics, be it stereo, quad, 5.1 or above, where the expensive DSP sits singly in a server; it is a natural, efficient distribution of cost.
Mic audio streams are extremely important, as distributed wide-array mics are vastly superior to any compact mic array by the pure physics of how sound and capture operate.

Someone is going to catch on, beyond the blinkered focus on singular capture points and badly bloated server systems.
Eventually it will be distributed room systems with a room processor, likely hooked up to a room media centre so it can also echo-cancel any further sources.
If the voice system is the media system, then even that is not needed, and you quickly start to outclass any single-point source.
I mention Snapcast purely because it is amazingly simple and light and gives latency-adjusted streams.

Much of what we have with the current satellite system doesn't do this; it needs a view of partitioning technology and using it cost-effectively, and that will be the ultimate driver.
Also, there seems to be a mania for creating project-specific versions of common, already-available software just to tag it with the project's own branding, which greatly expands unnecessary support needs.

I agree with most of what you say, but I wonder how we can facilitate this into the right direction.
Communities play a big role in open source projects, and re-wrapping and re-inventing things might be hard to avoid, because it all lives through personal projects and the problems occurring in them.

So, what do you think is the best way forward? You have far more experience in sound than I have (I am just a software engineer), and I already suggested to synesthesiam to get you involved in the Precise and Mycroft workgroup. Would you be interested, or can we start some alternative threads/projects to fast-track some of the things you outlined (or at least check if there is a community to support that)?

To be honest, I am no expert in sound, and because I am not heavily involved I am not focused on or blinkered by the project.
I started my career at Advanced Music Systems but that was a long time ago.

I have been playing with the available equipment and have noticed that the beamforming AEC mic arrays are exceptionally good at picking up voice from a distributed noise field.
They are not so good when a single predominant noise source exists.
In industry, noisy backgrounds of many sources are common, whilst in domestic situations the opposite is true.

It is simple physics that a distributed wide-array microphone system gains the advantage through simple positioning, rather than forcing the issue with advanced DSP.
It is also a matter of cost: a common use for Amazon/Google units is to play audio, and quite cost-effectively they can be the audio source, distributed as a speaker system.
This also gives a distributed microphone system where, generally, a mic will be closer to the source than to the noise.

The Rhasspy satellite is new, but it has been completely bloated with all manner of unnecessary features, and it also omits hugely important AEC, or otherwise expects hardware that is very costly singly, never mind in multiple room satellites.

There is no need to do anything but capture and play audio and be a first-tier KWS detection system, with maybe a simple protocol to hand back things like LED status, volume and the like.
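Something as thin as a retained MQTT message would cover that; a hypothetical sketch (the topic layout and JSON fields here are invented purely for illustration):

# a satellite reporting simple state back over MQTT
mosquitto_pub -h rhasspy-server -t 'satellite/kitchen/status' \
    -m '{"led": "listening", "volume": 70}'

Nothing project-specific is needed, just a broker and a couple of agreed topics.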
A satellite is what its name says: it ends in 'lite', and it is usually microscopic with respect to the body it orbits.
This is definitely not currently so.

The best way forward for satellites is to drop much of the bloat, create just the basics, keep it simple, and have AEC of some sort.
The current Rhasspy satellite is probably wrongly named: feature-rich, and probably more of a mesh than a satellite.

It seems to me much delight has gone into what Python functionality can go into it, without regard for how it may be used, how much that use will cost, and who will use it.

Personally, I love what synesthesiam did originally; he had a brilliant insight into partitioning voice AI infrastructure. That has been derailed almost completely by unnecessary bloat across the whole project, whilst it misses some of the essential audio processing needed to create functional living-space voice AI, except under the assumption of expensive hardware.

I started with Mycroft and jumped ship to Rhasspy, but since you mentioned LinTO I had another look, and I am probably going to jump ship again: much here has individual merit, but as a whole it seems to be going off course, as open source is often prone to do.

Are you checking version 2.5 or 2.4?
2.5 is what you describe actually :slight_smile:
The only thing is, both satellite and server are the same Rhasspy Docker image, just with specific settings to act as a master or a satellite.

I haven't bothered to test since my initial look, and I don't think I will bother, to be honest.

Rhasspy is still a work in progress. You have obviously much experience in audio processing, so for you AEC and cheap audio satellites are important, but most people are just fine with using a Raspberry Pi that they have lying around anyway. This doesn’t mean that what you want (cheap satellites with very lightweight software) doesn’t fit in Rhasspy, it just means that it doesn’t exist yet. Hopefully you or other people experienced in these matters can contribute to it. That’s what open source is about, people joining and contributing their expertise to create better software than one person is able to create.


It already exists, and this is another thing that is switching me off: the encapsulation of function just to make it Rhasspy.

It works great and only causes load when media is playing and barge-in is needed.

Snapcast already exists and, for me, is just brilliantly simple to use.

Being open source, the JSON messages in snapcast probably could do with expanding for some simple additional satellite roles, negating the need for any further protocol.
It has a web server that will serve whatever webroot is put in the root folder of each server.
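The control side can already be scripted with nothing more than a shell; a rough sketch (the method name and the default control port 1705 are from my reading of the snapcast docs, so verify there):

# query snapserver's JSON-RPC control interface over TCP
echo '{"id": 1, "jsonrpc": "2.0", "method": "Server.GetStatus"}' | nc <server-ip> 1705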
Mic/Speaker -> Rhasspy server
Rhasspy server -> Speaker/Mic

That's it: apart from KWS and probably VAD, nothing else is needed.

I really like what synesthesiam did with intent2json, but from my viewpoint anything beyond that is the domain of something else: what I call an 'intent processor', or merely an external app.
An intent processor may request TTS, and it includes the source method as a prefix to the following voice-capture intent if needed, or null if it is to return to a first-stage intent.
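As a hypothetical illustration (the payload and the source field are invented here, not any project's actual schema), the whole contract between the voice AI and an intent processor could be as small as a JSON stream:

# an external 'intent processor' consuming one intent from the stream
echo '{"intent": {"name": "LightOn"}, "slots": {"room": "kitchen"}, "source": null}' \
    | jq -r '.intent.name'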

There is a whole glut of stuff going on at the moment, like skills for Rhasspy that will be Rhasspy-only, and in my mind it is the wrong way to go.
An external skill provider should really be able to work with any intent stream, and that would be a great app/project to provide.

I am no audio expert, but you don't have to be to envisage how distributing mics means close proximity to at least one is possible, and if you want to keep things simple, just sum the inputs in an asound.conf on the server.
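In the style of the config earlier in the thread, that summing could look roughly like this (a sketch only: the hw:1,0/hw:2,0 device names are placeholders, clock drift between separate capture devices is ignored, and the plugin and ttable details should be checked against the ALSA docs):

# combine two capture streams into one two-channel device
pcm.mic_pair {
    type multi
    slaves {
        a { pcm "hw:1,0" channels 1 } # capture stream 1 (placeholder name)
        b { pcm "hw:2,0" channels 1 } # capture stream 2 (placeholder name)
    }
    bindings {
        0 { slave a channel 0 }
        1 { slave b channel 0 }
    }
}

# downmix the pair into a single summed mono input for the server
pcm.mic_sum {
    type plug
    slave.pcm "mic_pair"
    ttable {
        0.0 0.5 # client channel 0 takes half of capture 1...
        0.1 0.5 # ...plus half of capture 2
    }
}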

But to be honest, guys, I think I am going to jump ship. I only posted this in response to a request by @ulno: my opinion is that satellites should just be audio processors and RTP streamers feeding a voice AI server, because that seems to provide the most cost-effective use.

Rhasspy lacks its own KWS, it lacks dataset collection and model training, and it also lacks audio processing such as AEC. But it is not for me to set priorities, just to judge what others see as priorities and weigh up whether there may be better alternatives elsewhere.
ASR has to be imported into the project because it is so damn complex, but the rest is fairly simple and specific, and I guess the big giant robot is just not for me.

It really sounds like you should look at Rhasspy's little sister, voice2json, a little more.
You can cherry-pick the wake word, STT and text-to-intent components with none of the protocol overhead, and build your own minimal structure around it using snapcast and MQTT, or really anything else, to tie it together. Something like the sketch below, for example.
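Roughly like this (the commands follow the voice2json documentation as I remember it, so check voice2json.org for the exact flags):

# wake word -> record until silence -> transcribe -> intent, all over pipes
voice2json wait-wake --exit-count 1 && \
    voice2json record-command | \
    voice2json transcribe-wav | \
    voice2json recognize-intent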
Maybe worth a look?
Johannes


http://voice2json.org/ is once again some simple genius provided by synesthesiam, which hopefully will not get the treatment Rhasspy has had.

I guess I could do it all myself, as all that is required is a mixer based on the peak VAD of the last KWS time frame.
It is not rocket science really; in fact it is quite simple and obvious to implement.

However, I am not interested in reinventing the wheel alone whilst there are existing projects that are innovative, show loads of promise and are likely to garner herd support.
voice2json looks extremely good, like most things synesthesiam lays his hands on, but it is looking rather lonely at the moment.

The lack of innovation here, with zero ideas apart from hijacking a project to resurrect a dead one verbatim, means I do have a destination, and even though it is great, it is not voice2json.
It doesn't have to be snapcast or streaming satellites, but it does far more than just rehash what came before.

That the replies here, regarding needed audio processing or whatever the innovations may be, amount to 'maybe find another project' speaks volumes about direction and thinking.

But hopefully synesthesiam may think it is time to split activity and adopt further solutions and directions: https://github.com/synesthesiam/voice2json/issues/13


Well, keep us up to date with your project; it is always nice to hear about different voice assistant software!

I will probably still be around sharing hardware info, but yeah, I will spill the beans; maybe even voice2json, I don't know yet.
The KWS is looking like LinTO: I haven't tested it, but I like what they are doing with datasets, though again I wonder why they seem to have attached a display to a voice satellite :slight_smile:

But the base, https://pypi.org/project/pyrtstools/, might be the KWS solution for Rhasspy, due to how they are approaching datasets.
It actually uses the Mycroft MFCC, though I don't know why their optimised NEON version is missing.


Thank you for the compliment :slight_smile: I’m thinking “simple genius” would make a great epitaph and/or t-shirt!

I see the audiences for Rhasspy and voice2json as quite different. Rhasspy was intended for a less technical user, and I want it to reach as many people as possible (hence the Snips compatibility and multiple ways of doing the same thing).

voice2json is for command-line junkies like me, and is much more opinionated about tooling. I have no intention of “bloating” voice2json with a GUI or any kind of messaging infrastructure like MQTT. Unix pipes are the ultimate composable tool, but they’re unfortunately not for everyone.

@JGKK is working on a node-RED plugin for voice2json that should help people caught between Rhasspy and voice2json :slight_smile:


@synesthesiam Did you manage to have a look at https://github.com/JuliaDSP/MFCC.jl ?

All I am going to say is wow!

With Rhasspy I feel guilty about my opinion, but I just have this gut feeling there is a WordPerfect vs MS Office analogue going on there. To be honest, I am equally critical of Mycroft for it as well.
I see a clear partition, via intents, from the voice AI to an intent-processing project, something Almond-like; specific uses gain much advantage from partitioning. Even media should not be part of a voice AI, as it has a logical platform of its own: a media player.
We all know what happened to WordPerfect with the all-in-one approach; infrastructure-wise, in use, many people will be agnostic to what is underneath.

It is just opinion and gut feeling, so who knows; it often turns out very differently, so enough of that.

In terms of dev and implementation for technical users, what is your opinion on https://github.com/JuliaDSP/MFCC.jl, especially RASTA-PLP, or the jaw-dropping array of audio processing methods it can provide?

I am not worthy, but it might be beneficial to have discourse with https://github.com/JuliaDSP/MFCC.jl/graphs/contributors

It pulls in https://labrosa.ee.columbia.edu/matlab/rastamat/ and more. I don't know whether my question, about implementing SAD in the same way as feacalc(:energy) for MFCC, except that it creates an envelope (similar to ALC) from 'silence' to 'silence' for splitting sentence parts, is just plain stupid or not.