Recognizing untrained sentences / words

Hi,

Is there a way to get a slot for an unknown part of a sentence?

My idea would be to make a reminder intent. I can define the start of the sentence, but I would never know in advance what I will ask it to remind me about.

Example:

Hey Rhasspy, remind me to call my wife in two hours
Hey Rhasspy, call me back in 30 minutes to shutdown the oven
Hey Rhasspy, remind me to take my son out of the freezer in ten minutes

etc …

Any ideas?

Hi,
I do it like this:
Hey Rhasspy.
Listen{actionstt} to me…
At that point I run speech-to-text (with Google Speech Recognition) and recover the sentence:
'add 3 carrots and 2 leeks to the list'
Then I process that phrase according to the words that interest me.
This way we can make requests to Wikipedia, write to files, search YouTube, and so on.

@kookic I like this. Would you be willing to share some of your configs to show how you do things like adding items to the list? This would be an awesome first addition to my "out of the box" setup that isn't doing much yet, lol.

Kind of following this up… is there a history of commands for which no intent was identified?

It would be interesting to go through unrecognized requests and see what the rest of the family is trying to get Rhasspy to do. Maybe I need to add a joke-telling intent if my kids keep asking for jokes, etc.

Would it work with Kaldi?

I'd prefer to keep everything offline, especially away from Google… :roll_eyes:

@KiboOst Of course, but I didn't get good results with Kaldi, and I also use translation and wiki lookups via Yandex and Wolfram, so it's a bit online anyway.

A bit of code:

import os
import speech_recognition as sr

def dire():
    r = sr.Recognizer()
    global CLE_GOOGLE  # your Google Speech API key
    wavefile = "input.wav"
    os.system("arecord -d 5 -f cd -t wav " + wavefile)  # record your voice for 5 seconds
    with sr.WavFile(wavefile) as source:
        audio = r.record(source)
    try:
        retourG = r.recognize_google(audio, language='fr-FR', key=CLE_GOOGLE)
        print('...Traduction GOOGLE: ' + retourG)
        san = sanitize_string(retourG)  # sanitize_string() is defined elsewhere
        return san
    except LookupError:
        print('Cannot understand audio!')

recup_the_sentence = dire()

Sentence examples (in French):
[course]
il (faut | faudrait) faire{carnet:a} les courses
(efface |supprime){carnet:e}[toutes] les courses
(ajoute | rajoute ){carnet:a} des courses
qu’est-ce qu’il [nous] faut{carnet:l} pour les courses
(tu peux me lire){carnet:e} les courses

Then recover the slot value and handle it:

def course(v):
    if v == "a":  # ajoute (add)
        mytts.dire("Je t'écoute, tu veux quoi ?")  # "I'm listening, what do you want?"
        try:
            # dire() records and transcribes the follow-up utterance
            retourG = dire()
            print('...Traduction GOOGLE: ' + retourG)
            with open('course.txt', 'a') as mon_fichier:
                mon_fichier.write(str(retourG) + ".\n")
            mytts.dire("Okay, j'ai ajouté " + retourG)  # "Okay, I added ..."
        except Exception:
            print("probleme enregistrement")  # problem while recording

    if v == "e":  # efface (clear)
        with open('course.txt', 'w') as mon_fichier:
            mon_fichier.write("")
        mytts.dire("Okay, j'ai tout effacé.")  # "Okay, I cleared everything."

    if v == "l":  # lecture (read back)
        with open('course.txt', 'r') as mon_fichier:
            mytts.dire("Bien, pour les courses, il te faut : ")  # "Right, for the groceries you need:"
            mytts.dire(mon_fichier.read())
    print("Fin des courses.")  # end of grocery handling

OK, so you don't say the untrained part in the same sentence, but in a second sentence.

I seriously doubt it would work with Kaldi. It will hear trained words even when what you said is far from them phonetically. It seems it just HAS to find something it knows.

Yes, that's why I didn't waste any time with Kaldi and use Google instead.
With Pocketsphinx…?

One way might be to run a second instance of Rhasspy with open transcription enabled and a special wake word. Not ideal, but at least possible :slight_smile:

When Rhasspy 2.5 comes out (someday, I promise!) you could theoretically run two STT services connected to the same MQTT broker. We’d need to think about how to adjust the dialogue manager to handle multiple responses and decide which one to choose. Just a thought.
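
For anyone who wants to experiment before then, here is a rough sketch of an external "arbiter" sitting on the shared MQTT broker and picking one answer per command. The topic names are standard Hermes, but the two-second window and the "recognized intent wins" rule are just illustrative assumptions, not anything the dialogue manager does today.

import json
import time
import paho.mqtt.client as mqtt

WINDOW_SEC = 2.0   # assumed time for both instances to answer one command
results = []       # (received_at, topic, payload) tuples

def on_message(client, userdata, msg):
    results.append((time.time(), msg.topic, json.loads(msg.payload)))

def best_result():
    # Prefer a recognized intent (closed grammar) over a not-recognized result
    # coming from the open-transcription instance.
    recognized = [r for r in results if r[1].startswith("hermes/intent/")]
    return (recognized or results)[-1]

client = mqtt.Client()  # paho-mqtt 1.x style; 2.x also wants a CallbackAPIVersion argument
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("hermes/intent/#")                  # recognized intents
client.subscribe("hermes/nlu/intentNotRecognized")   # the "other" answer
client.loop_start()

try:
    while True:
        time.sleep(WINDOW_SEC)
        if results:
            print("acting on:", best_result())
            results.clear()
finally:
    client.loop_stop()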

Just a few nights of work. :wink: Thanks for everything.

Just so I understand this correctly: it is currently not possible to configure Rhasspy to just pass custom words to the intent handler? So it is not possible to create a handler, for example, for:

  • play queen on spotify (where “queen” is not part of the trained sentence)
  • who is barack obama (where “barack obama” is not part of the trained sentence)

I'm quite new to the whole voice command ecosystem, so I might be asking for something really difficult, I don't know, but it would be nice to be able to create a more custom handler.

I came here for help with this too. I really need to be able to tell my assistant to add things to a list, or even do fun little things like Simon Says.

Something like

[Groceries]
add [*]{thing} to the list

[SimonSays]
simon says [*]{thing_to_say}

This is difficult with the way Rhasspy approaches speech recognition. Where the "wildcard" occurs in the sentence also affects the complexity.

Let’s take @digitalfiz’s example:

[Groceries]
add [*]{thing} to the list

[SimonSays]
simon says [*]{thing_to_say}

The SimonSays example is easier to implement because the wildcard occurs at the end of the sentence. I would implement this by having the first speech system (e.g., Kaldi) recognize “simon says”, then I would clip the audio from there to the end and send that off to a second system (e.g., DeepSpeech) which is listening for generic English.
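
Very roughly, the clipping step could look something like the sketch below. The (word, start_sec, end_sec) timings and the transcribe_open() stub are stand-ins for the output of the first recognizer and for whatever open-transcription system you plug in; they are not existing Rhasspy APIs.

import wave

def clip_wav(src_path, dst_path, start_sec, end_sec=None):
    # Copy the [start_sec, end_sec) region of a WAV file into a new WAV file.
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        start_frame = int(start_sec * rate)
        end_frame = src.getnframes() if end_sec is None else int(end_sec * rate)
        src.setpos(start_frame)
        frames = src.readframes(end_frame - start_frame)
        with wave.open(dst_path, "wb") as dst:
            dst.setparams(src.getparams())
            dst.writeframes(frames)

def transcribe_open(wav_path):
    # Stand-in for the generic/open STT system (DeepSpeech, etc.).
    raise NotImplementedError

# Example word timings from the first recognizer (placeholder values):
words_with_times = [("simon", 0.41, 0.78), ("says", 0.78, 1.02)]
wildcard_start = words_with_times[-1][2]            # everything after the last trained word
clip_wav("command.wav", "wildcard.wav", wildcard_start)
# thing_to_say = transcribe_open("wildcard.wav")    # hand the clip to the 2nd system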

The Groceries example is harder, though, because the first speech system has to detect both the start and end of the wildcard phrase. One way I know to do this is to add <UNKNOWN> “words” to the first speech system. During recognition, it might recognize “add apples and bananas to the list” as “add <UNKNOWN> to the list” (the <UNKNOWN> may be repeated). Kaldi and Pocketsphinx can give the time window of each recognized word/token, so I could clip out the audio from all the <UNKNOWN> words and send it to DeepSpeech.
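
The extra step for the Groceries case is working out the time window covered by the <UNKNOWN> tokens before clipping. A small helper, again assuming (word, start_sec, end_sec) timings from the first recognizer; the clip itself could reuse clip_wav() from the sketch above.

def unknown_span(words_with_times, unk="<UNKNOWN>"):
    # Return (start_sec, end_sec) covering all <UNKNOWN> tokens, or None if there are none.
    times = [(start, end) for word, start, end in words_with_times if word == unk]
    if not times:
        return None
    return times[0][0], times[-1][1]

# "add <UNKNOWN> <UNKNOWN> to the list" (placeholder timings):
words = [("add", 0.30, 0.52), ("<UNKNOWN>", 0.52, 0.95), ("<UNKNOWN>", 0.95, 1.40),
         ("to", 1.40, 1.52), ("the", 1.52, 1.61), ("list", 1.61, 1.90)]
print(unknown_span(words))   # (0.52, 1.4) -> clip this region and send it to the 2nd system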


A much easier approach to wildcards would be if the number of words were known in advance. If I knew that [*] was one generic English word, I could just add every possible word from the English dictionary as a candidate at that point in the grammar. No need for a second speech system!

If you knew there would be between 1 and 3 words, for example, I could use knowledge from the generic English language model to do something similar. It would be harder than a single word, but easier than adding a second speech system.
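
If you want to play with the single-word case today, Rhasspy's slot files make it possible to build such a "dictionary slot" yourself. A rough sketch, where the dictionary path, the profile path, and the slot name any_word are all assumptions for the example (and a slot this large will make training much slower):

from pathlib import Path

# Build a slot file containing every word from a system dictionary.
words = Path("/usr/share/dict/words").read_text().split()
words = sorted({w.lower() for w in words if w.isalpha()})

slot_file = Path.home() / ".config/rhasspy/profiles/en/slots/any_word"
slot_file.parent.mkdir(parents=True, exist_ok=True)
slot_file.write_text("\n".join(words))

# sentences.ini could then reference it where the wildcard would go:
# [Groceries]
# add ($any_word){thing} to the list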

When Rhasspy 2.5 comes out (someday, I promise!) you could theoretically run two STT services connected to the same MQTT broker.

I’m on 2.5.11 now. Is this still in the works?

I agree that it might not be the right approach to set wildcards or clip the audio.
Sounds like the easiest implementation is to run a separate STT service (Kaldi in Open Transcription mode) and have the base Rhasspy intent manager (Home Assistant in my case) open a new session with that separate STT service.

Basic process flow:

  1. Speak wake word
  • Wake WAV plays
  2. "Play [me a] (song | artist | album | wiki) {type}"
  • Recorded WAV plays (intent recognized)
  • TTS response: "specify name of song"
  • Intent also invokes "/api/listen-for-command" on the 2nd STT service running with Open Transcription (see the sketch after this list)
  3. "song name"
  • Recorded WAV plays (no intent handling)
  • Raw_Text from Open Transcription is passed to a command-line lookup program, along with the intent metadata from the 1st STT service in step 2
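
Very roughly, the handler for step 2 could look something like this. /api/listen-for-command is the existing Rhasspy HTTP endpoint mentioned above; the port of the 2nd instance and the media-lookup command are placeholders for this sketch.

import subprocess
import requests

OPEN_STT_URL = "http://localhost:12102/api/listen-for-command"  # assumed port of the 2nd instance

def handle_play(media_type):
    # nohass=true is assumed here to stop the 2nd instance from handling the
    # intent itself; its JSON response carries the transcription.
    resp = requests.post(OPEN_STT_URL, params={"nohass": "true"}, timeout=30)
    raw_text = resp.json().get("raw_text", "")
    # Pass the free-form text plus the intent metadata from step 2 to a
    # command-line lookup program (step 3).
    subprocess.run(["media-lookup", media_type, raw_text], check=True)

# handle_play("song")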

We’d need to think about how to adjust the dialogue manager to handle multiple responses and decide which one to choose.

Yes, this too.
First we'd need to get the 2nd STT service running and make the dialogue manager aware of it, so it can send the audio input there based on the intent of the initial voice command.

Is there a hacky way to do this for testing?
Or is it better to wait until STT services (like Kaldi) are capable of a true "hybrid" mode (not mixed) that can train on sentences for quick predetermined commands AND be triggered to dynamically switch to Open Transcription mode?

I did this a fair while ago: I installed them both under different user IDs on my NUC, which I use as the server. The main issue I had was how to differentiate the responses from the multiple instances.

I had one running with trained sentences on Kaldi and the other in Open Transcription mode.

Every command would then get multiple responses, and I wasn't happy with trying to handle the output. If I remember correctly, the trained-sentence instance generally responded a bit faster, so I got multiple responses through MQTT to my Node-RED instance and it would try to handle them as though I had said two different phrases.
I had other things on at the time and didn't get my head around how to effectively compare the two responses and then decide between them what to do.

Maybe have one instance call /api/listen-for-command on the other instance.

Thanks for the suggestion, but I don't understand what exactly that would achieve, or how it would help decide which of the two responses I want to use.

I think the only way to do it would be to explicitly call one instance or the other. Only one at a time, so only one response.

You can ask the first one to open a new session in the second instance. Basically, have an intent on trained sentences that invokes /api/listen-for-command on the second instance, which runs with open transcription. This open-transcription instance would NOT have a wake word at all, so it won't respond unless triggered by an explicit (trained) intent from the first instance.

All that said, I haven't tried this out.
I would prefer the dialogue manager to handle dynamically toggling open transcription mode.

This wouldn’t be too difficult to do. The MQTT messages could be extended to allow specifying whether the current voice command should be interpreted in closed or open transcription mode.

For this, though, I would probably just have the STT service load both speech models (open + closed) with one of them being the default. This would be less jarring message-wise over the MQTT bus than having two STT services.
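
Purely as an illustration of that idea, an extended hermes/asr/startListening payload might look something like the sketch below. The openTranscription field is hypothetical; it does not exist in the Hermes protocol today.

import json
import paho.mqtt.client as mqtt

client = mqtt.Client()  # paho-mqtt 1.x style; 2.x also wants a CallbackAPIVersion argument
client.connect("localhost", 1883)
client.publish("hermes/asr/startListening", json.dumps({
    "siteId": "default",
    "sessionId": "abc123",          # placeholder session id
    "openTranscription": True,      # hypothetical flag: use the open model for this command
}))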

What do you think?