The future of a simple, open-standard Linux voice system matters most to me because it can be accomplished by reusing two tiny, simple applications at each stage of a Linux system.
Incredibly lightweight, its simplicity is akin to RISC: reusing simpler modules means the system can be simple or complex, scaling by choice purely by adding duplicate blocks.
There is a natural serial queue to voice processing:
1. Mic/KWS input
2. Mic/KWS & speech enhancement server
3. ASR
4. NLP skill router
5. TTS
6. Audio server
7. Skill server(s)
Stages 2–5 can work in a really simplistic manner where either audio & metadata or text & metadata are queued until the process in front is clear, and then sent.
That's it in a nutshell: a native Linux voice system can be that simple, as it's just a series of queues, and keeping it simple with native Linux methods rather than embedded programming means it scales from the simple to the complex.
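As a minimal sketch of that serial-queue idea (the stage names, tuple layout, and toy "work" functions are my own assumptions, not part of any existing project), each stage is just a worker pulling from its inbox and pushing downstream:

```python
import queue
import threading

# Hypothetical sketch: each pipeline stage pulls a (payload, metadata) tuple
# from its inbox and pushes the result to the next stage's queue.
def stage(name, inbox, outbox, work):
    while True:
        payload, meta = inbox.get()          # blocks until work arrives
        meta = {**meta, "last_stage": name}  # metadata picks up process-stage data
        outbox.put((work(payload), meta))

asr_in, router_in, tts_in = queue.Queue(), queue.Queue(), queue.Queue()

# Toy "work" functions standing in for real ASR / NLP modules.
threading.Thread(target=stage, args=("asr", asr_in, router_in,
                                     lambda a: f"text({a})"), daemon=True).start()
threading.Thread(target=stage, args=("router", router_in, tts_in,
                                     str.upper), daemon=True).start()

asr_in.put(("audio-frame", {"zone": "kitchen", "channel": 0}))
text, meta = tts_in.get(timeout=2)
print(text, meta)
```

Scaling then really is just "adding duplicate blocks": start a second ASR worker reading the same `asr_in` queue.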
Each Mic/KWS is allocated to a zone (room) and channel, which should live in a standard /etc conf file and will likely mirror the zone & channel of the audio system outputs.
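That mapping could be as plain as an ini-style file (the filename, section names, and keys below are my invention, purely to illustrate):

```python
import configparser

# Hypothetical contents of something like /etc/voice/zones.conf, mapping each
# Mic/KWS device to a zone & channel that mirrors the audio system outputs.
CONF = """
[mic-kitchen-1]
zone = kitchen
channel = 0

[mic-lounge-1]
zone = lounge
channel = 1
"""

cfg = configparser.ConfigParser()
cfg.read_string(CONF)
zone = cfg["mic-kitchen-1"]["zone"]
channel = cfg["mic-lounge-1"].getint("channel")
print(zone, channel)
```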
Distributed Mic/KWS clients connect to the Mic/KWS & speech enhancement server, and on a KW hit the best stream of that zone (the KWS argmax) is selected.
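The argmax selection is trivial; a sketch, assuming each mic reports a KWS confidence score alongside its zone (field names are hypothetical):

```python
# Hypothetical KW-hit records from several mics in the same zone; on a hit
# the server keeps only the stream with the highest KWS confidence (argmax).
hits = [
    {"mic": "mic-kitchen-1", "zone": "kitchen", "kws_score": 0.71},
    {"mic": "mic-kitchen-2", "zone": "kitchen", "kws_score": 0.93},
    {"mic": "mic-kitchen-3", "zone": "kitchen", "kws_score": 0.55},
]

def best_stream(hits, zone):
    in_zone = [h for h in hits if h["zone"] == zone]
    return max(in_zone, key=lambda h: h["kws_score"])

chosen = best_stream(hits, "kitchen")
print(chosen["mic"])
```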
The Mic/KWS & speech enhancement server receives both audio and metadata; the audio is passed on for transcription (ASR) while the metadata is merely forwarded towards the skill router.
The skill router connects to skill servers to collect simple entity data; by basic NLP matching of predicate and subject it routes to a skill server, again purely forwarding the metadata.
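Predicate/subject matching against registered entity lists can stay deliberately dumb. A sketch (the skill names and entity tables are made up; a non-match simply returns nothing and, as described later, gets bounced back):

```python
# Hypothetical entity tables that skill servers might register with the router;
# the router does only basic predicate/subject matching, nothing more.
SKILL_ENTITIES = {
    "lights": {"predicates": {"turn", "dim"}, "subjects": {"light", "lamp"}},
    "music":  {"predicates": {"play", "stop"}, "subjects": {"music", "song"}},
}

def route(text):
    words = set(text.lower().split())
    for skill, ents in SKILL_ENTITIES.items():
        if words & ents["predicates"] and words & ents["subjects"]:
            return skill
    return None  # non-match: bounced back to the caller

print(route("turn on the kitchen light"))
```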
The skill router will also accept text returned from skill servers along with metadata, so the TTS can forward audio to the correct zone & channel. On completion, the calling skill server is added to the metadata and forwarded back to the Mic/KWS & speech enhancement server to initiate a non-KWS mic broadcast.
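The dialog round-trip is just metadata bookkeeping. A sketch (all field names here are assumptions): on TTS completion the caller is stamped into the metadata, so the next, KWS-less utterance routes straight back to it:

```python
# Hypothetical round-trip metadata: after TTS completes, record the calling
# skill server and flag the enhancement server to open the mic without a KW.
def tts_complete(meta, skill):
    out = dict(meta)
    out["caller_skill"] = skill
    out["listen"] = True  # initiate a non-KWS mic broadcast in this zone
    return out

meta = {"zone": "kitchen", "channel": 0}
followup = tts_complete(meta, "timer-skill")
print(followup)
```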
The chain then starts again, and because the initiating skill server's metadata is included, that skill server knows it is the destination of the transcribed dialog.
That's it, and you can add multiple routes at any stage to multiple instances so that it scales.
All that is needed is two tiny bits of code: a low-latency communication server and client. I can even demonstrate how simple that code is, as WeNet has both WebSocket & gRPC examples.
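To give a feel for scale, here is a minimal client/server pair in stdlib asyncio (my own line-delimited JSON framing, not WeNet's protocol): a server that receives one message of metadata plus payload and acknowledges it.

```python
import asyncio
import json

# Minimal sketch of the "two tiny bits of code": a server that reads one JSON
# line of metadata + payload and echoes an acknowledgement back.
async def handle(reader, writer):
    msg = json.loads(await reader.readline())
    msg["ack"] = True                      # a real stage would do its work here
    writer.write((json.dumps(msg) + "\n").encode())
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    # The client side: forward metadata + text and wait for the reply.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write((json.dumps({"zone": "kitchen", "text": "hello"}) + "\n").encode())
    await writer.drain()
    reply = json.loads(await reader.readline())
    writer.close()
    server.close()
    return reply

reply = asyncio.run(main())
print(reply)
```

A WebSocket or gRPC transport drops in behind the same shape; the framing is the only part that changes.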
They are just very simple, standard Linux queues, as a voice system is a near-exclusively serial chain; the only decision is routing to a specific skill server, and nearly none of what we currently have is actually needed.
You purely forward metadata with audio or text if the destination is free, or queue it; scalability is just multiple destinations.
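"Multiple destinations" needs nothing fancier than round-robin (or least-busy) dispatch over duplicate instances; a sketch with invented instance names:

```python
from itertools import cycle

# Hypothetical duplicate ASR instances; scaling is just dispatching each
# queued item to the next destination in turn.
destinations = ["asr-1", "asr-2", "asr-3"]
dispatch = cycle(destinations)

sent = [next(dispatch) for _ in range(5)]
print(sent)
```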
A skill server only needs a very simple entity exchange with the skill router, and from then on neither system needs to know about the other.
The skill router just needs the entity listing to decide what to route to, and it doesn't matter that much, as non-matches are bounced back anyway.
Front of house are standard zonal audio systems (AirPlay, LMS or Snapcast); it is totally open to any of them because there is zero need to embed code.
Likewise with KWS mics: multiple plugins allow multiple KWS systems to simply plug in to the KWS/speech enhancement server, which provides the initial zonal metadata for the voice to begin the serial chain of processing; the metadata then picks up process-stage data for routing decisions.
Adopting a simple client/server standard would mean we can pick & mix open-source voice modules, as there is a plethora available and they are being released at breakneck speed. As with Home Assistant, it would be foolish to also try to be the Tasmota or ESPHome and be everything to everyone, as that is a huge dev cul-de-sac where it has to do everything.
The future of Rhasspy is really strange, as it doesn't provide what we are missing and seems to provide a whole lot of what we don't need.
A simple client/server queue router and a quick rethink of the basic building blocks and requirements of Linux voice systems is all we need, and when you do rethink them, you should quickly realise it is actually a very simple serial chain.
That leaves vendors of all types free to create a whole range of skill servers, as the voice system has zero need to know about them: it purely has a list of entities, routes to the matching server, and after that doesn't care.
This has been a huge flaw: as a system grows, a voice system cannot broadcast across many skill servers on many nodes, as that links all those systems together, and as security flaws go it makes the KWS discovery issue above pale into insignificance. But the answer is simple: skill server to skill router is 1-to-1, and you do not do control, you do voice inference and forward to skill servers.