Almond + Rhasspy?

Thought exercise: It seems that Rhasspy is very modular. I’m wondering if it would be possible to use Rhasspy and Almond together, like Rhasspy for voice recognition and voice synthesis, with Almond for intent recognition or something like that? It seems like both Rhasspy and Almond itself (as opposed to the Home Assistant add-on) are actually developing faster than Ada…


I think Almond would make a great über-skill that acts as a sort of automated NLU.

It should have its own server, with the ASR delivering sentences to it, because boy it can do a lot, and so it has a lot of services and dependencies.
I always say this, but because of complexity and ease of deployment I am against the all-in-one approach and in favour of connecting services at the network layer.

Rhasspy just needs to filter out non-Almond tasks and then forward everything else to an Almond server.

I guess there must be existing Almond services that can accept an intent and send TTS.
I often feel things are the wrong way round: for intent there is only one type, since it’s an intent sentence, while TTS has two types, announcements (no response required) and questions (response expected).

That’s it really. There should be an Almond4Rhasspy as much as a Rhasspy4Almond, and all that needs to be done is relay the data plus the type metadata and the zone/room/device metadata, something like the payload sketch below.
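
For illustration only, the relayed message could be as simple as a small JSON document. The field names below are made up for the sketch and are not an existing Rhasspy or Almond schema:

```python
# Hypothetical relay payloads: one intent type in, two TTS types out.
intent_relay = {
    "type": "intent",              # only one input type: an intent sentence
    "text": "turn on the kitchen light",
    "site": {"zone": "kitchen", "room": "kitchen", "device": "satellite-2"},
}

tts_announcement = {
    "type": "announcement",        # no response expected
    "text": "The washing machine has finished",
    "site": {"zone": "utility", "room": "utility", "device": "speaker-1"},
}

tts_question = {
    "type": "question",            # response expected; re-open listening after playback
    "text": "Should I turn the heating up?",
    "site": {"zone": "lounge", "room": "lounge", "device": "satellite-1"},
}
```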

I have been extremely surprised that voice input isn’t subdivided into voice zones, each with its own server, so that it can be interoperable with many KWS systems and so ASR changes do not force KWS firmware changes.
There should just be a middleware KWS server that can map and translate to many vendor devices, to ensure interoperability and length of life.
Its job is to get the best KW stream: queue and translate vendor KWS to a vendor ASR, and accept a zone question, translating it back to the KWS so the device with the last best KW hit broadcasts its stream, without waiting for a KW, until silence.
That is it, as the data and metadata just need to follow the processing pipe of ASR -> Intent -> Intent service.
On the output side, again to keep interoperability, a server maps and translates to many vendor devices that should have no need to understand the Rhasspy protocol, so you also ensure interoperability and length of life: changes do not affect firmware, purely the mapping and translation services to the output devices.
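
A rough sketch of the kind of per-zone arbitration such a middleware KWS server might do. Everything here is hypothetical and not an existing Rhasspy component; it only shows picking the best keyword hit in a zone so a single stream goes on to the ASR:

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class KwHit:
    """A keyword (KW) detection reported by one satellite device."""
    zone: str
    device: str
    confidence: float
    timestamp: float

class ZoneArbiter:
    """Collects KW hits per zone and picks the best one inside a short window,
    so only a single audio stream per zone gets forwarded to the ASR."""

    def __init__(self, window_s: float = 0.5):
        self.window_s = window_s
        self.hits: Dict[str, List[KwHit]] = {}

    def report(self, hit: KwHit) -> None:
        self.hits.setdefault(hit.zone, []).append(hit)

    def best_device(self, zone: str, now: Optional[float] = None) -> Optional[str]:
        """Device with the highest-confidence recent hit in the zone, if any."""
        now = now if now is not None else time.time()
        recent = [h for h in self.hits.get(zone, [])
                  if now - h.timestamp <= self.window_s]
        if not recent:
            return None
        return max(recent, key=lambda h: h.confidence).device

# Example: two satellites in the same zone both hear the wake word;
# the one with the stronger hit wins and its stream goes to the ASR.
arbiter = ZoneArbiter()
arbiter.report(KwHit("kitchen", "satellite-1", 0.71, time.time()))
arbiter.report(KwHit("kitchen", "satellite-2", 0.93, time.time()))
print(arbiter.best_device("kitchen"))  # -> satellite-2
```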

We should have had an Almond server ages ago; we should have had a simple data & metadata relay, two middleware servers, and a whole load of KWS and media systems to choose from.

Bemused as ever, but Almond would be a great addition if things were not so complex, and strangely so.
@synesthesiam

From what I learned of Almond when Home Assistant went all-in on it, it’s online only. You couldn’t (at the time anyway) run your own instance, and you depended on Stanford to add new sentences and skills.

The underlying tech is very impressive – natural language translated into an actual program (technically a complex event processor). But I’m not going to bother with something that requires an Internet connection; you may have noticed that services like the Google STT are contributed by interested Rhasspy users.

I’m actually surprised you (@rolyan_trauts) like Almond. It smacks of the whole “shove everything into one big pre-trained thingy that is supposed to fit everyone”. You always seemed like more of a “keep it simple” and “train it locally” kind of guy to me.

Yeah, Almond is cool as an example, and the example above shows how simple it could be to add complex services.
The reason it is so hard and complex to do is that it’s all-in-one and not partitioned into dependency-free network layers.

That “home server” just forwards things to a cloud server: https://github.com/stanford-oval/almond-server/blob/982f85e708df99f9afbf23a77ec53ab8a1497371/main.js#L47

Almond could be easily added to Rhasspy. Catch one MQTT message (asr/textCaptured) and produce one other (nlu/intent).
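
A minimal sketch of that bridge, assuming Rhasspy’s Hermes MQTT topics (hermes/asr/textCaptured, hermes/nlu/intent/<intentName>). The Almond HTTP endpoint and its request/response shape below are placeholders, not a documented Almond API:

```python
import json
import requests
import paho.mqtt.client as mqtt

ALMOND_URL = "http://localhost:3000/api/converse"   # placeholder endpoint

def on_connect(client, userdata, flags, rc):
    # Listen for finished transcriptions from Rhasspy's ASR.
    client.subscribe("hermes/asr/textCaptured")

def on_message(client, userdata, msg):
    captured = json.loads(msg.payload)
    text = captured.get("text", "")
    # Hand the transcription to Almond (hypothetical request shape).
    reply = requests.post(ALMOND_URL, json={"command": text}).json()
    # Publish whatever came back as a Rhasspy-style intent message.
    intent_name = reply.get("intent", "AlmondCommand")
    client.publish(
        f"hermes/nlu/intent/{intent_name}",
        json.dumps({
            "input": text,
            "intent": {"intentName": intent_name, "confidenceScore": 1.0},
            "slots": [],
            "siteId": captured.get("siteId", "default"),
        }),
    )

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)
client.loop_forever()
```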

Not from what user support says, as it can run a Home Assistant service locally, but as it says you need to cordon it off.

Almond could be easily added to Rhasspy. Catch one MQTT message (asr/textCaptured) and produce one other (nlu/intent).

Exactly, and I quite like a lot of software, https://spacy.io/ being another, and the main question above is: why is it so hard to be interoperable with other services?
Hence middleware to map and translate, as it’s about choice?

So yeah, from that forum topic it basically can’t do anything offline. It’s impressive work, but I’ll wait for the redux where they realize they can do the same thing with a 10x smaller network. Google just put out a paper where they replaced self-attention in an NLP network with a Fourier transform and got almost the same accuracy.

I’m all about choice, which is why I still support Rhasspy services like Google STT/TTS that I wouldn’t personally use.

It’s not really that hard when the services just do what’s advertised on the hardware you run them on. The vast majority of interoperability complexity in the external Rhasspy services comes from:

  • Credential management (Google needs a magic JSON file)
  • Local caching due to intermittent connections
  • API bullshit because they couldn’t bring themselves to just do text/voice/language strings for TTS parameters (yes, Google, you’re so special that you need magic enum integers)

When using services that are well designed and can actually run locally, interoperability is usually just a matter of (1) changing parameter names, (2) changing transport types between HTTP/MQTT/Websockets. Otherwise, it’s gonna be JSON and PCM audio.
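
For example, the same logical TTS request can move over either transport with only the endpoint or topic changing; the URL and topic below are illustrative placeholders, not a specific API:

```python
import json
import requests
import paho.mqtt.publish as publish

# One logical request: plain text/voice/language strings in a JSON payload.
say = {"text": "Dinner is ready", "voice": "en-gb-soft", "language": "en-GB"}

# HTTP transport (placeholder endpoint)
requests.post("http://localhost:5000/api/tts", json=say)

# MQTT transport (placeholder topic)
publish.single("hermes/tts/say", json.dumps(say), hostname="localhost")
```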


Glad you said it, as I’ve been wondering that for a while.

Not the best example (Almond, but it will do) for the context of the discussion on ASR & streaming. My take is that streaming shouldn’t go direct to ASR/TTS; there should be network-partitioned middleware holding the templates, so no programmatic implementation is needed to convert and connect to other services via 3rd-party modules.
Maximising vendor interoperability and reducing firmware needs seems important to me, and if people want to do otherwise then they can, irrespective of opinion.

That’s my take; I said it in the discussion and I’m glad you agree.

Ideally, something like NodeRED is all we’d need for interoperability. I can see the need to call out to the occasional shell command (e.g., sox), but it should mostly be about gluing things together. The Jaco assistant does this well – everything is separated into containers.


Yeah, separating into containers can cause hassle, but as something grows it can reach a critical mass.

I used to have a lot of interest in a platform called Zentyal Server. It was great, but it grew to so many services that it always collapsed; if they had partitioned it into network-abstracted containers it would have been so easy to maintain, but every update was like dominoes, often bringing things down.

The web interface, as web interfaces are, is just a portal to many services in a singular presentation, so for the end user it would still have been the same.

“HTTP/MQTT/Websockets” is just a larger set than “Otherwise, it’s gonna be JSON and PCM audio”, and that is what I am saying.
There shouldn’t need to be any others; they are just the first, with template mappings and routing modules.

@VoxAbsurdis using Almond effectively would also require the speech to text system to be in open transcription mode, since there are a huge number of sentences it can accept. I’m hoping to include Vosk at some point in Rhasspy, which I think would work well for this.
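
A rough sketch of open transcription with the Vosk Python bindings, assuming a 16 kHz mono WAV and an already-downloaded model directory (the file and model paths are placeholders):

```python
import json
import wave
from vosk import Model, KaldiRecognizer

# Open the recording and build a recognizer with no fixed grammar,
# i.e. free-form (open) transcription.
wf = wave.open("utterance.wav", "rb")
rec = KaldiRecognizer(Model("model"), wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])  # free-form transcription
```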
