I thought this might be interesting, but I'm not sure if it's fully supported by Rhasspy. I've added a ChatGPT interface to Jaco, which might be usable with Rhasspy as well. The skill can be found here:
The basic concept is that you first trigger a normal intent (“I have a question”), which triggers the skill to send you an empty question (this allows you to continue with your question and prevents the assistant from triggering other, unwanted intents). The question is then returned directly as greedy text, so no language model tries to match it to a different intent (unlike normal commands), and the skill gets exactly the text the speech-to-text model understood. This text is then sent to ChatGPT and the answer is returned as speech output.
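For anyone who wants to try something similar, here is a rough sketch of just the “greedy text → ChatGPT → answer” step. Only the OpenAI chat completions endpoint and payload format are real here; the surrounding skill framework and the speak() helper are hypothetical placeholders, not the actual skill code.

```python
# Rough sketch of the greedy-text -> ChatGPT -> answer step.
# Only the OpenAI chat completions endpoint is real here; the surrounding
# skill code and the speak() helper are hypothetical placeholders.
import os
import requests

OPENAI_URL = "https://api.openai.com/v1/chat/completions"

def ask_chatgpt(transcript: str) -> str:
    """Send the raw ASR transcript to ChatGPT and return the answer text."""
    response = requests.post(
        OPENAI_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [{"role": "user", "content": transcript}],
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# The skill would then hand this string to the assistant's TTS module, e.g.:
# speak(ask_chatgpt("why is the sky blue"))   # speak() is hypothetical
```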
(What I like most about it is that you can now ask the assistant arbitrary questions, not just some predefined intents :)
Using Jaco's Rhasspy-Skill-Interface skill, it is possible to use most Jaco skills with Rhasspy, which should basically work for this skill as well. The only feature I'm not sure is supported by Rhasspy is the ability to directly return greedy text without restricting it to the intent vocabulary beforehand.
This sounds like a really interesting project, and I like the idea of a skill adaptor to allow skills to be shared between different projects.
It looks like you were able to make some progress on the greedy text intent, although I can’t tell how successfully. I’d love to hear how you have been resolving this.
For version 3 of Rhasspy I have been advocating a skill router.
In an inference we generally have a predicate and a subject: in ‘Turn on the light’, ‘Turn on’ is the predicate and ‘light’ is the subject.
Generally we have very few unique predicates: ‘Turn’ is obviously a control predicate, as are ‘Set’, ‘Show’ and ‘Who is’.
Subjects and ‘greedy’ text can quickly require full language models or become very numerous. A media skill for a local library can quickly have a huge array of track / band names, and the same goes for video.
A skill router is a 1st-stage predicate ASR that forwards the inference audio to skill servers with embedded ASR.
The current version of Rhasspy is really just a Hass skill server with direct audio, not a Linux voice system. The extra step of skill routing by predicate allows more accurate full, complex or simple language models to be embedded into separate, shareable skill servers.
If you were going to chat to ChatGPT, then do it through a state-of-the-art ASR such as Whisper.
Partitioning skill servers means that ‘Turn on the lights’ is processed by a lightweight control ASR such as a Hass control server, whilst conversational AI uses SOTA Whisper.
Then you can have custom-language-model ASR that may self-train on the contents of a media library containing thousands of subjects, as in ‘Play I Am the Walrus by the Beatles’. The predicate is generally at the start of the sentence, so the additional delay and latency should be minimal.
A Linux voice system is a really simple system; it's the skill servers that are the clever bit and need investment in development, and being able to share them would greatly increase choice and create a bigger herd.
The biggest part of current Rhasspy is that it's a Hass skill server, and that specific focus and small herd obstruct the implementation of different, ‘greedy’ language model skills.
Change the infrastructure in V3 to a predicate skill router and embed ASR into standalone skill servers, so that skill choice becomes just like a distro's apt: you install a skill server and assign its predicate and routing in the skill router, in a very similar way to how Rhasspy allocates Hass controls.
The predicate skill router partitions things so that skills have no need of system specifics. This breaks the confinement of current voice servers, which have to provide everything with small herds and dev pools, and swaps in a different infrastructure with just the addition of a skill router based on predicates and routing tables.
A Linux voice system's only skill is a skill router; it's the user space of a voice system's kernel, and you install skill servers that you route inference to.
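To make that concrete, here is a minimal sketch of what such a predicate skill router could look like, assuming a lightweight first-stage predicate ASR and skill servers reachable over the network. Every name, endpoint and the two stub functions are purely illustrative, not an existing implementation.

```python
# Minimal sketch of a first-stage predicate skill router. Everything here is
# illustrative: the endpoints, the tiny predicate ASR and the forwarding stub.
ROUTES = {
    "turn": "ws://hass-control-skill:8600",   # simple control ASR / Hass skill server
    "set":  "ws://hass-control-skill:8600",
    "play": "ws://media-skill:8601",          # complex media language model
    "how":  "ws://chatgpt-skill:8602",        # greedy ASR (e.g. Whisper) + ChatGPT
}

def predicate_asr(audio: bytes) -> str:
    """Stub for the lightweight first-stage ASR that only recognises predicates."""
    return "turn"

def forward_audio(endpoint: str, audio: bytes) -> str:
    """Stub: stream the untouched inference audio to the skill server's own ASR."""
    return f"forwarded {len(audio)} bytes to {endpoint}"

def route(audio: bytes) -> str:
    predicate = predicate_asr(audio)
    endpoint = ROUTES.get(predicate, ROUTES["how"])  # unknown predicates fall back to the greedy skill
    return forward_audio(endpoint, audio)

print(route(b"\x00" * 16000))   # pretend this is a second of raw audio
```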
The skill is already working with Jaco, it's just not working with Rhasspy yet.
That's basically how this skill works with Jaco. The assistant creates a lightweight language model based on the installed skills and uses it to improve the general intent+entity recognition. With the ChatGPT skill it works like this: (1) You trigger the skill with a simple voice command like “I have a question for ChatGPT”, which activates the skill. (2) The skill activates the ASR module again. (3) The user speaks their question, (4) which is transcribed by the ASR module, but this time without using the small language model, so there is no restriction to known vocabulary. (5) The skill receives the transcribed text and sends it directly to ChatGPT. (6) The textual answer is converted back by the TTS module and spoken to the user.
Regarding your comment about Whisper being the SOTA model: Jaco currently uses a ConformerCTC-Large model, which has only slightly lower performance (it was the best open-source model before Whisper), but it can be used on a Raspberry Pi (Whisper is too large and slow for that).
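Just to illustrate the “with / without the small language model” part of step (4): with a CTC model like the Conformer, the difference can be as simple as decoding with or without a skill-vocabulary language model. The sketch below uses pyctcdecode and placeholder inputs purely as an illustration; it is not Jaco's actual decoder code, and the label set and LM path are made up.

```python
# Sketch only: pyctcdecode stands in for whatever decoder Jaco really uses,
# the label set is illustrative and skills.arpa is a placeholder path.
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = [""] + list(" abcdefghijklmnopqrstuvwxyz'")          # "" = CTC blank token
logits = np.random.rand(100, len(labels)).astype(np.float32)  # stand-in for acoustic model output

# Restricted decoding: a small KenLM built from the installed skills' sentences
# biases the beam search towards known commands.
skill_decoder = build_ctcdecoder(labels, kenlm_model_path="skills.arpa")
command_text = skill_decoder.decode(logits)

# Greedy/open decoding: no skill language model, so any vocabulary can come out.
open_decoder = build_ctcdecoder(labels)
question_text = open_decoder.decode(logits)
```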
Most commercial skills break down to an individual subject and map it to a specific skill, as in ‘Turn on the bedroom light’, and it works in reverse: ‘bedroom light’ is mapped to some horrid cloud-based skill that will then use ‘Turn on’ for control. You really don't need to do that with open source; rather than fight for branding, you can just match language model types with the correct skill type.
There are very few clear differences in language model needs, and they can be split into 3 clear groups.
1… ‘Greedy’ (I will use your term): full-fat language models for activities such as conversational AI.
2… ‘Complex domain’: for specific domains such as media audio / video libraries, where specific modelling can greatly increase accuracy and help with size and speed.
3… ‘Simple domain’: such as control models, where the predicate and subject quantity is often small and high-speed, accurate models can be made, such as Rhasspy's.
“I have a question for ChatGPT” → ChatGPT → “OK, what is your question?” → “Why am I asking to ask you a question and not simply using predicate routing?”
In a modern open-source assistant you ask the question “How tall was George Washington?”. You have already set your predicate routings, which use a lightweight Rhasspy predicate language model, and you have a top-level predicate entry for “How” that routes to Whisper ASR / ChatGPT because it has no further matches (the rest are Media or Control predicates). You get to choose and set up your collection of assistant software and skills to work how you want, without needing to specify the skill type in every command sentence.
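As a concrete picture of that setup, the routing table you would fill in once when installing skills might look something like this. All names and backends are hypothetical; the “model” field follows the three groups listed above.

```python
# Illustrative predicate routing table for the example above. Every name and
# backend is hypothetical; the "model" field follows the three groups listed earlier.
PREDICATE_ROUTES = {
    # simple domain: small, fast control model
    "turn": {"model": "simple",  "asr": "rhasspy-control", "skill": "hass"},
    "set":  {"model": "simple",  "asr": "rhasspy-control", "skill": "hass"},
    # complex domain: media library language model
    "play": {"model": "complex", "asr": "media-lm",        "skill": "music"},
    # greedy: full language model for conversational AI
    "how":  {"model": "greedy",  "asr": "whisper",         "skill": "chatgpt"},
    "who":  {"model": "greedy",  "asr": "whisper",         "skill": "chatgpt"},
}

# "How tall was George Washington" -> predicate "how" -> Whisper + ChatGPT,
# with no need to say which skill you want in the command itself.
```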
There is really no such thing as an ‘Assistant’ unless you deliberately choose to define and confine open source to specific software and hardware, as any ‘Assistant’ is merely a collection of software that can run on any platform. It's a collection of serial modules, queued and routed to the next destination, that dictates no software or hardware and needs nothing more than the zone of operation and the return skill.
This is what I have been arguing for some time: there is no reason to define and confine open source to specific software. In fact, certain predicate command subjects are better suited to one type of ASR than others, and multiple ASRs should be routed to, in what is a collection of open-source assistant software that gives choice and doesn't define and confine.
It's an assumption that the single ASR instance that many devices route to would be a Pi, which in the fast-moving world of AI is currently extremely lacking. But even on a Pi, as in Whisper Python performance - Benchmarking · Issue #15 · usefulsensors/openai-whisper · GitHub, people are using Whisper, and it's their choice, including which Whisper model they use.
I have a preference for an RK3588S, as it's nearly 500% faster but in the Pi4-8GB price range, whilst others may not want those limits and may use something cutting-edge like an Apple M2 Mac Mini; the choice is theirs.
Hardware is also a choice when it comes to open source.
Also, when you are using a SOTA automation system such as HASS, you don't have a skill for each subject; it uses a specific ASR type, likely embedded into a HASS inference-based skill server that encompasses all control types.
You have chosen to define and confine, and hopefully Rhasspy V3 will not.
That might work for many requests, but I think there will always be some that are quite similar, like “Can you make me a coffee” → coffee machine skill and “Can you make a coffee with raw beans” → ChatGPT skill. You could send all requests (from your simple/complex domains) that the assistant didn't understand to the ChatGPT skill, but this would require a very low error rate, or else you would frequently get long answers you don't want to hear.
I think it's quite similar to how I'm using the skill currently: “I have a question:” <wait for beep> (the skill won't say anything, so that it doesn't break the speech flow) “can you make coffee with …” <wait for answer>. So the phrase “I have a question” is basically the predicate routing, like your “how”, except that you can continue with any text and are not restricted to some predefined words that your question has to start with, like “how, what, when, …”
Of course, it would be better if the assistant directly knew which skill to trigger, but I think this isn't so easy if you only have a small routing model.
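One possible compromise, purely as a sketch: only fall back to the ChatGPT skill when the intent recogniser's confidence is low. The threshold, the recognise() function and the run_skill() dispatcher below are all made-up stand-ins, not part of any existing assistant.

```python
# Sketch of a confidence-based fallback. recognise() and run_skill() are
# made-up stand-ins for the assistant's intent recogniser and skill dispatcher.
FALLBACK_CONFIDENCE = 0.75   # arbitrary threshold

def recognise(text: str):
    """Stub: return the best-matching intent and its confidence score."""
    return ("make_coffee", 0.42) if "coffee" in text else ("unknown", 0.1)

def run_skill(skill: str, text: str):
    print(f"routing '{text}' to skill '{skill}'")

def dispatch(transcript: str):
    intent, confidence = recognise(transcript)
    if confidence >= FALLBACK_CONFIDENCE:
        run_skill(intent, transcript)        # e.g. the coffee machine skill
    else:
        # Low confidence: treat it as an open question rather than guessing an
        # intent, accepting the risk of an occasional long ChatGPT answer.
        run_skill("chatgpt", transcript)

dispatch("can you make a coffee with raw beans")
```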
I might have created a misunderstanding here: Jaco has no real hardware or software restrictions. You can run it on any device that supports (Linux) Docker and with any software that can be dockerized. And if you need to, you can skip the containerization as well and only use the MQTT interface. You can also run each module on a different device if you want to. The Raspberry Pi is only the current minimal target, and for simplicity I'm using modules that work on both computer and Raspi. But since the different assistant modules are already containerized, it's relatively easy to replace them with a different solution, like using Whisper for ASR or replacing the TTS with a new one by creating a new containerized script with the same MQTT interface. You don't have to rebuild everything; only the module you want to replace needs to be fitted to the right interface. The replacement approach is not perfect yet, I'm aware of that, but I'm trying to improve it… (If you're interested, you can help with that if you like.)
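As an example of what “fitted to the right interface” could mean in practice, a drop-in replacement TTS module can be little more than a small script that speaks the same MQTT topics. The topic names and payload format below are placeholders, not Jaco's real interface, and the synthesize()/play() stubs are where your own engine would go.

```python
# Sketch of a replacement TTS module behind the same MQTT interface.
# Topic names and payload format are placeholders, not Jaco's real interface;
# paho-mqtt 1.x style client shown.
import json
import paho.mqtt.client as mqtt

TTS_REQUEST_TOPIC = "assistant/tts/say"        # hypothetical
TTS_FINISHED_TOPIC = "assistant/tts/finished"  # hypothetical

def synthesize(text: str) -> bytes:
    """Stub: call your replacement TTS engine here."""
    return b""

def play(audio: bytes) -> None:
    """Stub: send the audio to the output device or an output topic."""
    pass

def on_message(client, userdata, msg):
    text = json.loads(msg.payload)["text"]
    play(synthesize(text))
    client.publish(TTS_FINISHED_TOPIC, json.dumps({"ok": True}))

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe(TTS_REQUEST_TOPIC)
client.loop_forever()
```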
It depends, really, as you might feed that output to NLP, which would decide and apply more AI to where the skill is routed.
Being a control action, you would likely use a control predicate rather than ‘make’, as in ‘Turn on the coffee machine’.
That doesn't really matter, as again it's a matter of choice, and this leads to interchangeable modules between various smart assistants; indeed, what is a smart assistant apart from the sum of the modules chosen?
I would guess, without knowing, that you are using Wav2Vec2 for ASR? Which is a great ASR if true, and in some ways better than Whisper, as the language model can be made even more specific to the language subset needed purely for the skills employed.
Whatever ASR you're using, it's ASR; it has no need of being in any way system-specific and could be a module on its own.
Voice audio in → voice text out: that is what ASR does and nothing else, and the previous steps are similar, each either collecting or processing that input audio and passing it along a chain.
There is no need for MQTT, as no messages are being sent, so why batter a series of nodes with MQTT headers when the previous module is just pushing to the next?
It would be far better on a lower-latency, 1-to-1 TCP/IP connection such as websockets than doing RTP over a lightweight message protocol such as MQTT, as what we are pushing, in comparison to normal light MQTT text, is some pretty heavy binary audio in real time or faster.
Every module can have the same transport, which can carry audio as a stream or a file. It's a serial sequence of processes that has no need of MQTT or its QoS pushed to max so you can guarantee delivery. If there is a metadata file, that is also moved to the destination, and that is it; the same method can be used all the way up the chain.
From KWS server to audio processing to ASR it's purely serial, and we just need a DNS name and port for the destination; then devs can spend their time perfecting a standalone module rather than being spread too thin on all-in-one specific systems.
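As a sketch of what one hop in that serial chain could look like with websockets: each module is a small relay that accepts binary audio frames and pushes them to the next hop by DNS name and port. The host names, ports and framing below are arbitrary placeholders.

```python
# Sketch of one hop in a plain serial audio chain over websockets.
# Host names, ports and chunk framing are arbitrary placeholders.
import asyncio
import websockets

NEXT_HOP = "ws://asr-module:8765"   # DNS name of the next module in the chain

async def forward(websocket, path=None):
    """Receive raw audio frames from the previous module and relay them onward."""
    async with websockets.connect(NEXT_HOP) as downstream:
        async for chunk in websocket:       # binary frames carrying PCM audio
            await downstream.send(chunk)    # 1-to-1 TCP, no broker in between

async def main():
    async with websockets.serve(forward, "0.0.0.0", 8764):
        await asyncio.Future()              # run forever

asyncio.run(main())
```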
It's very easy to add DNS to a container. I should read up more on Docker DNS, but off the top of my head I would hazard a guess that on creation each entry could point to something simple like dnsmasq.
No more of these ‘branded’ all-in-one systems: you can pick and mix your serial chain of modules according to your preferred complexity, as that is all it is, a simple serial chain, and it's the endpoint skill servers that are the clever bit.