Recently I’ve been implementing a POC for interacting with Rhasspy using chat (XMPP) by configuring it as a satellite. The thing worked out surprisingly well when using Hermes (I could even send replies back via chat!!), although I had a small issue. I had two ways (that I knew of) to go for this:
1. send a DialogueStartSession and, after receiving DialogueSessionStarted, send an AsrTextCaptured with the text received from chat
2. send a NluQuery with the text received from chat (emulating the text input in the Rhasspy web UI)
Using option 2 seemed easier, but I’d lose session management, which is essential when the interaction has multiple steps. So I went for option 1.
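To make the two options concrete, here is a rough sketch of the MQTT messages involved. It only builds topic/payload pairs and does not talk to a broker. The topic names and payload fields follow the Hermes protocol as I understand it, so check them against your Rhasspy version; the helper function names are made up for illustration.

```python
import json
import uuid


def start_session(site_id):
    """Option 1, step 1: ask the dialogue manager to open a session."""
    payload = {"siteId": site_id, "init": {"type": "action", "canBeEnqueued": False}}
    return "hermes/dialogueManager/startSession", json.dumps(payload)


def text_captured(site_id, session_id, text):
    """Option 1, step 2: after sessionStarted arrives, inject the chat text
    as if the ASR service had transcribed it."""
    payload = {
        "text": text,
        "likelihood": 1.0,
        "seconds": 0.0,
        "siteId": site_id,
        "sessionId": session_id,
    }
    return "hermes/asr/textCaptured", json.dumps(payload)


def nlu_query(site_id, text):
    """Option 2: query the NLU service directly; no dialogue session exists."""
    payload = {"input": text, "siteId": site_id, "id": str(uuid.uuid4())}
    return "hermes/nlu/query", json.dumps(payload)
```

With option 1 the session id comes from the hermes/dialogueManager/sessionStarted message, which is what keeps multi-step interactions tied together.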
Sending AsrTextCaptured, however, triggers playback of the “recorded” wav file, which costs precious response time. I could have sent a fake PlayFinished to trick Rhasspy into going ahead, but I just wanted to skip that step, so I created a convention for this:
Whenever AsrTextCaptured.wakewordId is null, Rhasspy will assume that the request didn’t come from voice so it will skip playing the “recorded” wav file
I implemented that in my fork. Currently though only the Google ASR module can fill the wakewordId field. Core developers, could this be a reasonable way to go? Can we use this or some other approach? Or am I going totally the wrong way?
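The convention boils down to a single check on the incoming message. This is only a sketch of the idea, not the code from the fork; the function name is hypothetical and the payload is assumed to be the already-decoded JSON of an AsrTextCaptured message.

```python
def should_play_recorded_sound(text_captured: dict) -> bool:
    """Return True only when the captured text appears to come from voice."""
    # Convention from the post: a missing or null wakewordId means the text
    # did not come from a wake word + ASR round trip (e.g. it came from chat).
    return text_captured.get("wakewordId") is not None
```

Note that with this check any ASR module that does not fill wakewordId would also lose the feedback sound, which is the limitation mentioned above.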
A chat is not siteId specific (multiple sessions can happen in the chat at the same time). The current dialogue manager is not fit for this (feedback sounds, asr listening, session timeout, etc.).
I’d create a specific chat dialogue manager (Rhasspy-chat-manager) with specific topics like chat/dialogue/start and leverage the Rhasspy Hermes protocol to convert chat text input to intents internally, using NLU topics like hermes/nlu/query.
Actually I would create a siteId for each user that is chatting (e.g. xmpp_daniele). But I can see it can be somewhat cumbersome to manage (and it might have other caveats).
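A per-user siteId could be derived from the sender's JID roughly like this. The function name and the sanitising rule are assumptions for illustration only.

```python
import re


def site_id_for_jid(jid: str, prefix: str = "xmpp") -> str:
    """Map an XMPP JID such as 'daniele@example.org/phone' to a site id."""
    # Drop the resource part ("/phone"), then the domain ("@example.org")
    bare = jid.split("/", 1)[0]
    local = bare.split("@", 1)[0].lower()
    # Site ids can end up inside MQTT topic names, so keep them simple
    safe = re.sub(r"[^a-z0-9_-]", "_", local)
    return f"{prefix}_{safe}"
```

One caveat that shows up immediately: two users with the same local part on different servers would end up sharing a siteId, so a real implementation would probably need to include the domain as well.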
You surely know Rhasspy better than me, but couldn’t this really be handled via parameters instead of writing a whole new module?
It could, as I think @synesthesiam did some magic tricks with the siteIds, but… it will surely increase the maintenance cost and complexity of the dialogue manager:
- enabling feedback sounds for vocal dialogue but disabling them for chat
- different session timeouts, as 10s is short enough for a vocal utterance but surely too short for keyboard input
- topics that are not useful need to be bypassed, etc.
The chat dialogue protocol is somewhat different from the vocal dialogue one.
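To give an idea of how many knobs this implies, here is a small sketch of per-channel settings. Nothing like this exists in Rhasspy as far as I know; the class, the field names and the 300-second chat timeout are all made up, and only the 10-second voice timeout comes from the discussion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChannelProfile:
    feedback_sounds: bool
    session_timeout: Optional[float]  # seconds; None means no timeout
    uses_asr: bool


# Voice: sounds on, short timeout, ASR topics in play
VOICE = ChannelProfile(feedback_sounds=True, session_timeout=10.0, uses_asr=True)
# Chat: no sounds, a much longer (hypothetical) timeout, ASR topics bypassed
CHAT = ChannelProfile(feedback_sounds=False, session_timeout=300.0, uses_asr=False)


def session_expired(profile: ChannelProfile, idle_seconds: float) -> bool:
    """Decide whether a session has been idle for too long on this channel."""
    if profile.session_timeout is None:
        return False
    return idle_seconds > profile.session_timeout
```

Every one of these fields would become a branch somewhere in the existing dialogue manager, which is the maintenance cost I am worried about.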
I think it will probably be easier (and surely simpler to maintain and improve) to leverage the Hermes protocol (which is pretty appropriate for both, actually) and create a specific chat dialogue manager that users can set up if they need to.
It can embed chat protocols like XMPP directly and serve as interface between a chat app and the Hermes/Rhasspy ecosystem.
I agree with @fastjack that trying to shoehorn a chat dialogue into a voice dialogue manager will become ugly. They are two fundamentally different types of dialogue.
I once actually started doing for Snips exactly what you have done now, creating a chat program by configuring it as a satellite. But I bumped into the same assumptions of the dialogue manager. I think a separate rhasspy-chat-manager or rhasspy-chat-dialogue-hermes or whatever you call it will be a better fit for the architecture.
The trick here is having multiple dialogue managers in configuration, which is not allowed now if I understand correctly. And even if multiple managers can be configured, how would Rhasspy decide which one to use? Based on siteId?
Yes I think this would need to be made more flexible. I’m curious what @synesthesiam thinks about this.
Rasa works with the concept of channels. Maybe we should take a closer look at that. For instance, they have channels to link Rasa to Telegram, Slack, your website and so on.
Aww, c’mon, folks. Can’t we just get ELIZA.BAS and port it to Python for grins and giggles?
[/s]
I’m being TOTALLY facetious, in case you didn’t notice. Although there’s a sister project lurking in this request, if someone has the time to develop it. I have absolutely NO CLUE how exclusively Rhasspy uses the mic/speaker, but I suspect it’s possible to share between multiple applications, as Rhasspy is already doing.
@daniele_athome, I like your solution as a quick hack (wakewordId = null), but I agree with others that it might be useful to have non-voice interactions be first-class citizens in Rhasspy.
Off the top of my head, here are some solutions (in order of difficulty):
1. Use the existing wakewordId = null trick
2. Make the existing dialogue manager listen to audio toggle on/off messages. If you turn audio off before sending your text captured, it should skip the sound.
3. Create a separate message like asr/textCaptured (or add a field to it) that indicates chat input
4. Create a new dialogue manager that does chat stuff instead of ASR stuff
For #4, we might want to resurrect the discussion about running multiple instances of a given system (microphone, ASR, NLU, etc.). One idea would be to have something like this in your profile:
{
  "dialogue": {
    "system": "rhasspy,chat"
  }
}
And then Rhasspy would start two different dialogue systems at the same time. Without anything extra, it would be their responsibility to decide which messages to respond to.
The siteId “magic” @fastjack referred to is just that each Rhasspy service can listen to any number of site ids by adding more --site-id arguments to the command line. This is how satellites are implemented: a base station service listens to both the base and the satellite site ids. Not sure if we want to use site ids here or not.
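As a rough illustration of the site id mechanism, a service that accepts several site ids only needs a membership check on each incoming message. This is a simplified sketch, not the actual Rhasspy code; the class name is hypothetical, and falling back to "default" for a missing siteId is an assumption based on the usual Hermes default.

```python
import json


class SiteFilter:
    """Accept messages for any of the configured site ids."""

    def __init__(self, site_ids):
        # One entry per --site-id argument given on the command line
        self.site_ids = set(site_ids)

    def accepts(self, payload: bytes) -> bool:
        message = json.loads(payload)
        # Assumption: messages without a siteId belong to the "default" site
        return message.get("siteId", "default") in self.site_ids
```

For example, `SiteFilter(["base", "satellite1"])` would handle messages from both the base station and the satellite and ignore everything else.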
I kind of like solution #3 to start, just because session management was a pain to write and it could be reused as long as chat isn’t totally different.
I think that dialogue manager and chat management are really different services that do kind of the same thing.
It might be easier to add a new kind of service (chat) to the mix:
{
  "chat": {
    "system": "rhasspy"
  }
}
It will be easier to configure and maintain, as it only leverages the existing Hermes protocol without impacting the dialogue manager at all.
Plus it will be able to embed a chat protocol like XMPP natively so any compatible chat app could plug in directly from a mobile phone or a desktop computer and exchange with Rhasspy.
Adding a hermes/chat/ subset (e.g. hermes/chat/textInput) can help to avoid adding additional parameters and complexity to the existing topics. (Solution #3)
EDIT: The service should indeed emit topics like hermes/dialogueManager/# for multi turn apps. How to handle sessions will have to be determined though (no or very long timeout? user as siteId, wakewordId?)
I also went a bit further and also implemented voice commands via XMPP by using out-of-band data. Basically the XMPP client on my phone records my voice, uploads the recording somewhere and sends the URL to Rhasspy in a message. The AppDaemon app downloads the file and sends it to Rhasspy via Hermes (audioFrame+stopListening).
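For anyone who wants to try the same thing, the audio part can be sketched like this. It is not my actual AppDaemon code: it assumes the downloaded recording is already a WAV file in a format the ASR accepts, that each audioFrame message carries a small self-contained WAV chunk, and that the topic names below match your Rhasspy version. Downloading the file and publishing over MQTT are left out.

```python
import io
import json
import wave


def wav_to_frames(wav_bytes: bytes, samples_per_chunk: int = 1024):
    """Split one WAV recording into small WAV blobs, one per audioFrame message."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as source:
        channels = source.getnchannels()
        width = source.getsampwidth()
        rate = source.getframerate()
        while True:
            samples = source.readframes(samples_per_chunk)
            if not samples:
                break
            # Wrap the raw samples in their own WAV header
            buffer = io.BytesIO()
            with wave.open(buffer, "wb") as chunk:
                chunk.setnchannels(channels)
                chunk.setsampwidth(width)
                chunk.setframerate(rate)
                chunk.writeframes(samples)
            yield buffer.getvalue()


def audio_frame_topic(site_id: str) -> str:
    return f"hermes/audioServer/{site_id}/audioFrame"


def stop_listening(site_id: str, session_id: str):
    """Sent after the last frame so the ASR finalises the transcription."""
    payload = {"siteId": site_id, "sessionId": session_id}
    return "hermes/asr/stopListening", json.dumps(payload)
```

Each blob from `wav_to_frames` goes out on `audio_frame_topic(site_id)`, followed by the single `stop_listening` message.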
I know I’m doing hacks on hacks, but this whole thing is experimental anyway… I just hope this way I can give some ideas and inspiration to people who can spend more time developing on Rhasspy (it is a great project and, as you probably noticed, I couldn’t spend as much time as I would have hoped).