Recognizing untrained sentences / words

A bit of code:

import os
import speech_recognition as sr

def dire():
    r = sr.Recognizer()
    global CLE_GOOGLE  # your Google API key
    wavefile = "input.wav"
    os.system("arecord -d 5 -f cd -t wav " + wavefile)  # record your voice for 5 seconds
    with sr.AudioFile(wavefile) as source:  # sr.WavFile is a deprecated alias
        audio = r.record(source)
    try:
        retourG = r.recognize_google(
            audio, language="fr-FR", key=CLE_GOOGLE)
        print("...Google transcription: " + retourG)
        san = sanitize_string(retourG)
        return san
    except LookupError:
        print("Cannot understand audio!")

# usage:
recup_the_sentence = dire()

A sentence example:
[course]
il (faut | faudrait) faire{carnet:a} les courses
(efface |supprime){carnet:e}[toutes] les courses
(ajoute | rajoute ){carnet:a} des courses
qu’est-ce qu’il [nous] faut{carnet:l} pour les courses
(tu peux me lire){carnet:l} les courses

Retrieve the slot value and handle the intent:

def course(v):
    if v == "a":  # ajoute (add)
        mytts.dire("Je t'écoute, tu veux quoi ?")  # "I'm listening, what do you want?"
        # record
        try:
            os.system("arecord -d 5 -f cd -t wav output.wav")
            # transcribe
            retourG = dire()
            print("...Google transcription: " + retourG)
            with open("course.txt", "a") as mon_fichier:
                mon_fichier.write(str(retourG) + ".\n")
            mytts.dire("Okay, j'ai ajouté " + retourG)  # "Okay, I added ..."
        except Exception:
            print("recording problem")

    if v == "e":  # efface (erase)
        with open("course.txt", "w") as mon_fichier:
            mon_fichier.write("")
        mytts.dire("Okay, j'ai tout effacé.")  # "Okay, I erased everything."

    if v == "l":  # lecture (read)
        with open("course.txt", "r") as mon_fichier:
            mytts.dire("Bien, pour les courses, il te faut : ")  # "For the groceries, you need:"
            mytts.dire(mon_fichier.read())
    print("Fin des courses.")  # done handling groceries

OK, so you don't say the untrained part in the same sentence, but in a second sentence.

I seriously doubt it would work with Kaldi. It only understands trained words, even when what was said is far from them in terms of phonemes. It seems it just HAS to find something known.

Yes, that's why I wasted no time with Kaldi and used Google instead.
What about Pocketsphinx…?

One way might be to run a second instance of Rhasspy with open transcription enabled and a special wake word. Not ideal, but at least possible 🙂

When Rhasspy 2.5 comes out (someday, I promise!) you could theoretically run two STT services connected to the same MQTT broker. We’d need to think about how to adjust the dialogue manager to handle multiple responses and decide which one to choose. Just a thought.


Just a few nights of work. 😉 Thanks for everything.

Just so I understand this correctly; it is currently not possible to configure rhasspy to just pass custom words to the intent handler? So it is not possible to create a handler for example for:

  • play queen on spotify (where “queen” is not part of the trained sentence)
  • who is barack obama (where “barack obama” is not part of the trained sentence)

I’m quite new to the whole voice command ecosystem so I might ask for something really difficult, I don’t know but it would be nice to be able to create a more custom handler.

I came here for help with this too. I really need to be able to tell my assistant to add things to a list or even fun little things like Simon says.

Something like

[Groceries]
add [*]{thing} to the list

[SimonSays]
simon says [*]{thing_to_say}

This is difficult with the way Rhasspy approaches speech recognition. Where the "wildcard" occurs also affects the complexity.

Let’s take @digitalfiz’s example:

[Groceries]
add [*]{thing} to the list

[SimonSays]
simon says [*]{thing_to_say}

The SimonSays example is easier to implement because the wildcard occurs at the end of the sentence. I would implement this by having the first speech system (e.g., Kaldi) recognize “simon says”, then I would clip the audio from there to the end and send that off to a second system (e.g., DeepSpeech) which is listening for generic English.

The Groceries example is harder, though, because the first speech system has to detect both the start and end of the wildcard phrase. One way I know to do this is to add <UNKNOWN> “words” to the first speech system. During recognition, it might recognize “add apples and bananas to the list” as “add <UNKNOWN> to the list” (the <UNKNOWN> may be repeated). Kaldi and Pocketsphinx can give the time window of each recognized word/token, so I could clip out the audio from all the <UNKNOWN> words and send it to DeepSpeech.
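
A sketch of the clipping step, assuming the recognizer hands back word-level timings: the `(token, start_sec, end_sec)` tuples and the `<UNKNOWN>` token below are illustrative stand-ins for what Kaldi/Pocketsphinx actually report, and the "samples" are plain integers rather than real audio.

```python
def clip_unknown_spans(samples, sample_rate, words):
    """Collect the audio covered by <UNKNOWN> tokens so it can be
    sent to a second, open-vocabulary recognizer.

    words: list of (token, start_sec, end_sec) from the first pass.
    """
    clipped = []
    for token, start, end in words:
        if token == "<UNKNOWN>":
            clipped.extend(samples[int(start * sample_rate):int(end * sample_rate)])
    return clipped

# Example: 1 kHz dummy "audio"; the first pass heard "add <UNKNOWN> to the list"
rate = 1000
audio = list(range(5 * rate))  # 5 seconds of dummy samples
timings = [("add", 0.0, 0.5),
           ("<UNKNOWN>", 0.5, 2.0),  # e.g. "apples and bananas"
           ("to", 2.0, 2.3),
           ("the", 2.3, 2.5),
           ("list", 2.5, 3.0)]
chunk = clip_unknown_spans(audio, rate, timings)
print(len(chunk))  # 1500 samples = 1.5 s of wildcard audio
```

The clipped chunk would then be written back out as WAV and handed to DeepSpeech (or any open STT system) for free-form transcription.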


A much easier approach to wildcards would be if the number of words were known in advance. If I knew that [*] was one generic English word, I could just add every possible word from the English dictionary as a candidate at that point in the grammar. No need for a second speech system!

If you knew there would be between 1 and 3 words, for example, I could use knowledge from the generic English language model to do something similar. It would be harder than a single word, but easier than adding a second speech system.


When Rhasspy 2.5 comes out (someday, I promise!) you could theoretically run two STT services connected to the same MQTT broker.

I’m on 2.5.11 now. Is this still in the works?

I agree that it might not be the right approach to set wildcards or clip the audio.
Sounds like the easiest implementation is to run a separate STT service (Kaldi in Open Transcription mode) and have the base Rhasspy intent manager (home assistant in my case) open a new session with that separate STT service.

Basic process flow:

  1. Speak wake word
  • Wake WAV plays
  2. "Play [me a] (song | artist | album | wiki) {type}"
  • Recorded WAV plays (intent recognized)
  • TTS response: "specify name of song"
  • Intent also invokes "/api/listen-for-command" on the 2nd STT service running with Open Transcription
  3. "song name"
  • Recorded WAV plays (no intent handling)
  • Raw text from Open Transcription is passed to a command-line lookup program, along with the intent metadata from the 1st STT in step 2
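
That hand-off can be sketched against Rhasspy's HTTP API. The port for the second instance (12102 here; a single instance defaults to 12101) is an assumption for illustration, and `nohass=true` asks Rhasspy to return the result instead of forwarding it to the configured intent handler:

```python
from urllib import parse, request

def build_listen_url(host="localhost", port=12102):
    # Second, open-transcription Rhasspy instance assumed on port 12102.
    # nohass=true: return the transcription instead of handling the intent.
    query = parse.urlencode({"nohass": "true"})
    return f"http://{host}:{port}/api/listen-for-command?{query}"

def ask_for_followup(host="localhost", port=12102, timeout=30):
    # POST with an empty body starts a listen session and blocks
    # until the follow-up phrase has been transcribed.
    req = request.Request(build_listen_url(host, port), data=b"", method="POST")
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```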

We’d need to think about how to adjust the dialogue manager to handle multiple responses and decide which one to choose.

Yes, this too.
First, we need to get the 2nd STT service running and make the dialog manager aware of it, so it can send the audio input to it based on the intent of the initial voice command.

Is there a hacky way to do this, for testing?
Or is it better to just wait until STT services (like Kaldi) are capable of a true "hybrid" mode (not mixed) that can train on sentences for quick predetermined commands AND be triggered to dynamically switch to an open transcription mode?

I did this a fair while ago: I installed them both under different user IDs on my NUC, which I use as the server. The main issue I had was how to differentiate the responses from the multiple instances.

I had one running trained with Kaldi and the other with Open transcription mode.

Everything would respond with multiple answers, and I wasn't happy with trying to handle the output. If I remember correctly, the trained sentences generally responded a bit faster, so I got multiple responses through MQTT to my Node-RED instance, and it would try to handle them as though I had said two different phrases.
I had other things on at the time and didn't get my head around how to effectively compare the two responses and then decide between them what to do.

Maybe have one instance call /api/listen-for-command of the other instance.

Thanks for the suggestion, but I don't understand what exactly that would achieve, or how it would help decide which of the two responses to use.

I think the only way to do it would be to explicitly call one instance or the other. Only one at a time, so only one response.

You can have the first one open a new session in the second instance. Basically, have an intent on trained sentences that invokes /api/listen-for-command on the second instance, which runs with open transcription. This open-transcription instance would NOT have a wake word at all, so it won't respond unless triggered by an explicit (trained) intent from the first instance.

All that said, I haven’t tried this out.
I would prefer a dialog manager to handle dynamically toggling open transcript mode.

This wouldn’t be too difficult to do. The MQTT messages could be extended to allow specifying whether the current voice command should be interpreted in closed or open transcription mode.

For this, though, I would probably just have the STT service load both speech models (open + closed) with one of them being the default. This would be less jarring message-wise over the MQTT bus than having two STT services.
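
As a sketch, the extended `hermes/asr/startListening` payload might carry a flag selecting the model. The `openTranscription` field below is purely hypothetical (this thread's proposal), not an existing Rhasspy message field:

```python
import json

def start_listening_payload(site_id, session_id, open_transcription=False):
    # Hypothetical extension: openTranscription selects which of the two
    # loaded models (closed-grammar default vs. open) handles this session.
    return json.dumps({
        "siteId": site_id,
        "sessionId": session_id,
        "openTranscription": open_transcription,  # hypothetical flag
    })

# An MQTT client (e.g. paho-mqtt) would publish this on the
# "hermes/asr/startListening" topic.
payload = start_listening_payload("default", "session-1", open_transcription=True)
```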

What do you think?

Yes, a single STT is preferred. Just need a way to dynamically toggle between speech models.
Would be nice for the dialog manager to do this within a single session. So I don’t have to say the wake word twice.


I think a good option would be to set up an ordered chain of multiple trained instances, with a confidence level for each one. If the first failed to reach its confidence level, it would pass back the results of the second, and so on down the line until it reached the last one, which would probably be the open transcription. That way you could set up either different sentences at each level, or even different engines with similar lists that could transcribe different combinations.
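
The cascade described above can be sketched as follows; the recognizers are stand-in callables returning `(text, confidence)`, not real Rhasspy APIs:

```python
def cascade(audio, stages):
    """Try each recognizer in order and accept the first result whose
    confidence clears that stage's threshold. The last stage (e.g.
    open transcription) is the unconditional fallback.

    stages: list of (recognize_fn, min_confidence) pairs, where
    recognize_fn(audio) returns a (text, confidence) tuple.
    """
    for recognize, threshold in stages[:-1]:
        text, confidence = recognize(audio)
        if confidence >= threshold:
            return text
    fallback, _ = stages[-1]
    return fallback(audio)[0]
```

Running the stages sequentially avoids the out-of-order responses of parallel instances, at the cost of extra latency whenever the early stages miss.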

When I was testing, I was hoping to move away from one large set that included controls for all my home automation devices and the naming variations used to identify them (e.g., Room7 (AC Zone 7) is also known as John's bedroom, John's study, John's room, etc., and any device in it can be addressed with one of those prefixes, or just something like "John's light"), my music collection's list of artists and albums, a shopping list based on my previous online purchases, and general queries about things like weather, date, time, or the individual status of anything in the home automation system.

My thought was to train different instances with specific subsets of the above and decide which response to use based on confidence level. But running them in parallel would mean getting back out-of-order responses that I would somehow have to relate back to the single request, or else running them one after another, with the overhead of waiting for each to complete in turn, which could mean waiting a considerable amount of time for a valid response. I think something like this may still be achievable, but I haven't had the time to put toward it.

Has anyone had any luck with this? I would really like to have a wildcard intent that captures anything not caught by other intents and sends it off to WolframAlpha.

I’m fine with using Google STT, although in a perfect design I could use Google STT and then if that’s unavailable for whatever reason have a backup offline STT like Kaldi.

This wouldn’t be too difficult to do. The MQTT messages could be extended to allow specifying whether the current voice command should be interpreted in closed or open transcription mode.
For this, though, I would probably just have the STT service load both speech models (open + closed) with one of them being the default. This would be less jarring message-wise over the MQTT bus than having two STT services.

Has this been added to any feature request or todo list?

@synesthesiam, you still seem to have this on your personal roadmap, calling it “a hybrid STT system that can recognize fixed commands and fall back to an open system like Vosk/Coqui for everything else”.

Have you had much time to work on it, given your full-time job at Mycroft AI?