Skill to talk with ChatGPT

I thought this might be interesting, but I’m not sure if it’s fully supported by Rhasspy. I’ve added a ChatGPT interface to Jaco, which might work with Rhasspy as well. The skill can be found here:

The basic concept is that you first trigger a normal intent (“I have a question”), which triggers the skill to send you an empty question (this allows you to continue with your actual question and prevents the assistant from triggering other, unwanted intents). The question is then returned directly as greedy text, so no language model tries to match it to a different intent (unlike with normal commands), and the skill gets exactly the text the speech-to-text model understood. This text is then sent to ChatGPT and the answer is returned as speech output.
(What I like most about it is that you can now ask the assistant arbitrary questions and not only some predefined intents :)

Using Jaco’s Rhasspy-Skill-Interface skill, it is possible to use most Jaco skills with Rhasspy, which should basically work for this skill as well. The only feature I’m not sure is supported by Rhasspy is the possibility to directly return greedy text without restricting it to the intent vocabulary beforehand.
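If I read the Rhasspy docs correctly, the open transcription setting together with the HTTP API might provide something like this; here is a rough, untested sketch of what I mean (the WAV file is just a stand-in for the recorded question, and the port is Rhasspy’s default):

```python
import requests

RHASSPY = "http://localhost:12101"  # assumption: default Rhasspy web UI port


def greedy_transcribe(wav_bytes: bytes) -> str:
    """Send raw WAV audio to Rhasspy's speech-to-text endpoint.

    With open transcription enabled in the ASR section of the profile,
    the result should not be restricted to the trained intent vocabulary.
    """
    resp = requests.post(
        f"{RHASSPY}/api/speech-to-text",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
    )
    resp.raise_for_status()
    return resp.text


def speak(text: str) -> None:
    """Speak the ChatGPT answer via Rhasspy's text-to-speech endpoint."""
    requests.post(f"{RHASSPY}/api/text-to-speech", data=text.encode("utf-8"))


if __name__ == "__main__":
    with open("question.wav", "rb") as f:  # hypothetical recorded question
        print(greedy_transcribe(f.read()))
```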

Does anyone have experience with this?


This sounds like a really interesting project, and I like the idea of a skill adaptor to allow skills to be shared between different projects.

It looks like you were able to make some progress on the greedy text intent, although I can’t tell how successful it was. I’d love to hear how you resolved this.

For version 3 of Rhasspy I have been advocating a skill router.
In an inference we generally have a predicate and a subject: in ‘Turn on the light’, ‘Turn on’ is the predicate and ‘light’ is the subject.
Generally we have very few unique predicates: ‘Turn’ is obviously a control predicate, as are ‘Set’, ‘Show’ and ‘Who is’.
Subjects and ‘greedy’ text, on the other hand, can quickly demand full language models or become very numerous. A media skill for a local library can quickly have a huge array of track/band names, and the same goes for video.

A skill router is a first-stage predicate ASR that forwards inference audio to skill servers with embedded ASR.
The current version of Rhasspy is really just a Hass skill server with direct audio, not a Linux voice system. The extra step of skill routing by predicate allows more accurate full, complex or simple language models to be embedded into separate, shareable skill servers.
If you were going to chat with ChatGPT, then do it through a state-of-the-art ASR such as Whisper.
Partitioning skill servers means that ‘Turn on the lights’ is processed by a lightweight control ASR such as a Hass control server, whilst conversational AI uses SOTA Whisper.
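As a rough sketch of the mechanics (all endpoints, names and the fast first-stage ASR are hypothetical):

```python
import requests

# Hypothetical routing table: recognized predicate -> skill server with its
# own embedded ASR. The longest matching predicate wins.
ROUTES = {
    "turn on":  "http://hass-control:8000/inference",  # lightweight control ASR
    "turn off": "http://hass-control:8000/inference",
    "play":     "http://media-server:8000/inference",  # domain language model
}
CONVERSATIONAL = "http://whisper-chat:8000/inference"  # SOTA Whisper + ChatGPT


def route(wav_bytes: bytes, predicate_asr) -> str:
    """First stage: a small, fast predicate ASR transcribes just enough to
    pick a route; the original audio is then forwarded to the chosen skill
    server, which re-transcribes it with its own, better-suited model."""
    head = predicate_asr(wav_bytes).lower()
    for predicate in sorted(ROUTES, key=len, reverse=True):
        if head.startswith(predicate):
            url = ROUTES[predicate]
            break
    else:
        url = CONVERSATIONAL  # no control/media match: treat as conversation
    resp = requests.post(url, data=wav_bytes,
                         headers={"Content-Type": "audio/wav"})
    return resp.text
```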

Then you can have custom language model ASR that may self-train on the contents of a media library containing thousands of subjects, as in ‘Play I am a Walrus by the Beatles’. Generally the predicate is always at the start of the sentence, so the additional delay and latency should be minimal.
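For example, a media skill server could regenerate its domain language model corpus from the library metadata along these lines (a minimal sketch; the templates and track fields are made up, and the actual LM training is out of scope):

```python
# Hypothetical corpus generator: turn a media library listing into training
# sentences for a domain language model.
TEMPLATES = [
    "play {title} by {artist}",
    "play something by {artist}",
    "play the album {album}",
]


def corpus_from_library(tracks):
    """tracks: iterable of dicts with 'title', 'artist' and 'album' keys."""
    for track in tracks:
        for template in TEMPLATES:
            try:
                yield template.format(**track)
            except KeyError:
                continue  # skip templates whose fields are missing


if __name__ == "__main__":
    demo = [{"title": "I Am the Walrus", "artist": "The Beatles",
             "album": "Magical Mystery Tour"}]
    for sentence in corpus_from_library(demo):
        print(sentence)
```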
A Linux voice system is a really simple system; it’s the skill servers that are the clever bit and need investment in development. Being able to share them would greatly increase choice and create a bigger herd.

The biggest problem with current Rhasspy is that it’s a Hass skill server, and that specific focus and small herd obstruct the implementation of different and ‘greedy’ language model skills.
Change the infrastructure in V3 to a predicate skill router and embed ASR into standalone skill servers, so that choosing a skill is just like apt on a distro: you install it and assign predicate and routing in the skill router, in a very similar way to how Rhasspy allocates Hass controls.

The predicate skill router partitions the system, so skills have no need of system specifics. This breaks the confinement of current voice servers, which have to provide everything themselves with small herds and dev pools; swapping out to a different infrastructure needs nothing more than a skill router based on predicates and routing tables.
A Linux voice system’s only skill is a skill router: it is the user space of a voice system’s kernel, and you install skill servers that you route inference to.

The skill is already working with Jaco; it’s just not working with Rhasspy yet.


That’s basically how this skill works with Jaco. The assistant creates a lightweight language model based on the installed skills and uses it to improve the general intent+entity recognition. With the ChatGPT skill it works like this: (1) you trigger the skill with a simple voice command like “I have a question for chatgpt”, which activates the skill; (2) the skill activates the ASR module again; (3) the user speaks their question, (4) which is transcribed by the ASR module, but this time without using the small language model, so there is no restriction to known vocabulary; (5) the skill receives the transcribed text and sends it directly to ChatGPT; (6) the textual answer is converted back by the TTS module and spoken to the user.
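For illustration, a heavily condensed sketch of those steps (the skill object and its callbacks are placeholders, not the real Jaco skill API; the ChatGPT call uses the openai Python package):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def on_question_intent(skill):
    """(1)+(2): the trigger intent was recognized; re-activate the ASR,
    this time without the small skill-based language model."""
    skill.say("Ok, what is your question?")
    skill.listen(use_skill_language_model=False)  # (3)+(4): greedy transcription


def on_greedy_transcript(skill, text: str):
    """(5): forward the raw transcript to ChatGPT, then (6): speak the answer."""
    answer = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": text}],
    )
    skill.say(answer.choices[0].message.content)
```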

Regarding your comment about the SOTA Whisper model: Jaco currently uses a ConformerCTC-Large model, which has only slightly lower performance (it was the best open-source model before Whisper), but can run on a Raspberry Pi (Whisper is too large and slow for that).

Most commercial skills break down to an individual subject and map it to a specific skill, as in ‘Turn on the bedroom light’, and it works in reverse: ‘bedroom light’ is mapped to some horrid cloud-based skill that then uses ‘Turn on’ for control. You really don’t need to do that with open source; rather than fight for branding, you can just match the language model type to the correct skill type.

There are very few clear differences in language model needs, and they can be split into three clear groups:

1… ‘Greedy’ (I will use your term): full-fat language models for activities such as conversational AI.
2… ‘Complex domain’: for specific domains such as media audio/video libraries, where specific modelling can greatly increase accuracy and help with size and speed.
3… ‘Simple domain’: for example control models, where predicate and subject quantity is often small and fast, accurate models can be made, as in Rhasspy.

“I have a question for chatgpt” → ChatGPT → “Ok, what is your question?” → “Why am I asking to ask you a question and not simply using predicate routing?”

In a modern open-source assistant you ask the question “How tall was George Washington”: you have already set your predicate routings, which use a lightweight Rhasspy predicate language model, and you have a top-level predicate entry for “How” with a Whisper ASR / ChatGPT routing, because it has no further matches (the rest are media or control predicates). You get to choose and set up your collection of assistant software and skills to work how you want, without needing to specify the skill type in every command sentence.
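What I imagine the user-facing side to be is little more than a routing table you fill in once (all model and skill names here are made up):

```python
# Hypothetical one-time predicate routing setup, assigned much like Rhasspy
# assigns Hass controls today.
PREDICATE_ROUTES = {
    "turn": {"asr": "kaldi-control", "skill": "hass-control"},
    "set":  {"asr": "kaldi-control", "skill": "hass-control"},
    "play": {"asr": "media-domain",  "skill": "media-server"},
    "how":  {"asr": "whisper-large", "skill": "chatgpt"},
}


def resolve(first_word: str) -> dict:
    """'How tall was George Washington' starts with 'how', so it gets the
    Whisper/ChatGPT route with no 'ask chatgpt' preamble needed."""
    # Unmatched predicates fall through to the conversational route.
    return PREDICATE_ROUTES.get(first_word.lower(), PREDICATE_ROUTES["how"])
```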

There is really no such thing as an ‘assistant’, unless you deliberately choose to define and confine open source to specific software and hardware, as any ‘assistant’ is merely a collection of software that can run on any platform. It’s a collection of serial modules, queued and routed to the next destination, that dictates no software or hardware and needs nothing more than the zone of operation and the return skill.

This is what I have been arguing for some time: there is no reason to define and confine open source to specific software. In fact, certain predicate command subjects are more suited to one type of ASR than others, and multiple ASRs should be routed to within what is a collection of open-source assistant software that gives choice and doesn’t define and confine.

It’s an assumption that a single ASR instance that many devices route to would be a Pi, which in the fast-moving world of AI is currently extremely lacking. But even on a Pi, as in Whisper Python performance - Benchmarking · Issue #15 · usefulsensors/openai-whisper · GitHub, people are using Whisper; it’s their choice, including which Whisper model they use.
I have a preference for an RK3588S, as it is nearly 500% faster but in the Pi4-8GB price range, whilst others may not want those limits and may use something cutting edge like an Apple Mac M2 Mini, as the choice is theirs.
Hardware is also a choice when it comes to open source.

Also, when you are using a SOTA automation system such as HASS, you don’t have a skill for each subject; it uses a specific ASR type, likely embedded into a HASS inference-based skill server that encompasses all control types.

You have chosen to define and confine; hopefully Rhasspy V3 will not.