Yes, a single STT is preferred. Just need a way to dynamically toggle between speech models.
It would be nice for the dialog manager to do this within a single session, so I don’t have to say the wake word twice.
I think a good option would be to set up an ordered chain of multiple trained instances, with a confidence level for each one. If the first failed to reach its confidence level, it would fall back to the second, and so on down the line until it reached the last one, which would probably be open transcription. That way you could set up different sentences at each level, or even different engines with similar lists that could transcribe different combinations.
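Just to illustrate the idea, here is a rough sketch of that cascade in Python. It assumes each trained instance exposes a `transcribe()` call returning text plus a confidence score; the engine wrappers and thresholds are made up for the example.

```python
# Rough sketch of the confidence cascade described above.
# Assumes each engine is a callable: transcribe(audio) -> (text, confidence).
# The engine names and thresholds in the example chain are hypothetical.

from typing import Callable, List, Tuple

Engine = Callable[[bytes], Tuple[str, float]]  # returns (text, confidence)

def cascade_transcribe(audio: bytes, chain: List[Tuple[Engine, float]]) -> str:
    """Try each engine in order; return the first result that meets its
    confidence threshold, otherwise fall through to the last engine."""
    result = ""
    for engine, threshold in chain:
        text, confidence = engine(audio)
        result = text
        if confidence >= threshold:
            return text
    # Nothing met its threshold: keep the last (open transcription) result.
    return result

# Example chain: closed home-automation grammar first, then music/shopping
# lists, finally a fully open model with no threshold (always accepted).
# home_automation_stt, music_stt and open_stt are placeholders.
# chain = [(home_automation_stt, 0.85), (music_stt, 0.80), (open_stt, 0.0)]
# text = cascade_transcribe(wav_bytes, chain)
```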
When I was testing, I was hoping to move away from one large set that included: controls for all my home automation devices and the variations used to identify them (e.g. Room7 (AC Zone 7) is also known as John’s bedroom, John’s study, John’s room, etc., and any device in it can be addressed with one of those prefixes, or just something like “John’s light”); my music collection’s list of artists and albums; a shopping list based on my previous online purchases; and general queries about things like weather, date, time, or the status of any individual item in the home automation system.
My thought was to train different instances with specific subsets of the above and decide which response to use based on the confidence level. But running them in parallel would mean getting back out-of-order responses that I would somehow have to relate back to the single request, while running them one after the other would add the overhead of waiting for each to complete in turn, which could mean waiting a considerable amount of time for a valid response. I think something like this may still be achievable, but I haven’t had the time to put toward it.
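For what it’s worth, the out-of-order problem largely goes away if every recogniser’s reply is tagged with the same request id. A rough asyncio sketch, assuming each engine is a blocking callable returning (text, confidence):

```python
# Sketch of running specialised recognisers in parallel and tying the
# results back to one request via a shared request id, so out-of-order
# completion is not a problem. Engine callables are placeholders.

import asyncio
import uuid
from typing import Callable, List, Tuple

Engine = Callable[[bytes], Tuple[str, float]]  # returns (text, confidence)

async def recognise_parallel(audio: bytes,
                             engines: List[Engine]) -> Tuple[str, str, float]:
    request_id = str(uuid.uuid4())  # every reply is tagged with this id

    async def run(engine: Engine) -> Tuple[str, str, float]:
        # Run the blocking recogniser in a worker thread.
        text, confidence = await asyncio.to_thread(engine, audio)
        return request_id, text, confidence

    results = await asyncio.gather(*(run(e) for e in engines))
    # All replies carry the same request_id, so out-of-order completion
    # doesn't matter: just pick the highest-confidence transcription.
    return max(results, key=lambda r: r[2])

# Usage (engines is a list of your placeholder recognisers):
# request_id, text, confidence = asyncio.run(recognise_parallel(wav_bytes, engines))
```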
Has anyone had any luck with this? I would really like to have a wildcard intent that captures anything not caught by other intents and sends it off to WolframAlpha.
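In case it helps, this is roughly what such a fallback could look like, assuming your wildcard handler receives the raw transcription when nothing else matched. The function name is made up, and you would need your own WolframAlpha app id.

```python
# Sketch of a catch-all handler: when no intent matched, forward the raw
# transcription to the WolframAlpha Short Answers API.

import requests

WOLFRAM_APPID = "YOUR-APPID"  # assumption: your own WolframAlpha app id

def fallback_to_wolfram(raw_text: str) -> str:
    """Called when no other intent matched the transcription."""
    resp = requests.get(
        "https://api.wolframalpha.com/v1/result",   # Short Answers API
        params={"appid": WOLFRAM_APPID, "i": raw_text},
        timeout=10,
    )
    return resp.text if resp.ok else "Sorry, I couldn't find an answer to that."
```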
I’m fine with using Google STT, although in a perfect design I could use Google STT and then, if it’s unavailable for whatever reason, fall back to an offline STT like Kaldi.
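That backup could be as simple as a try/except around the cloud call. A minimal sketch, where both transcribe functions are stand-ins for whatever client code you already use:

```python
# Cloud first, offline backup: try Google STT, and if the request fails
# (offline, quota, timeout), fall back to a local Kaldi instance.

def transcribe_google(audio: bytes) -> str:
    """Placeholder for your Google STT client call."""
    raise ConnectionError("pretend the cloud is unreachable")

def transcribe_kaldi(audio: bytes) -> str:
    """Placeholder for your local Kaldi call."""
    return "[offline kaldi transcription]"

def transcribe_with_fallback(audio: bytes) -> str:
    try:
        return transcribe_google(audio)
    except Exception as err:  # network error, quota, timeout, ...
        print(f"Google STT unavailable ({err}); falling back to local Kaldi")
        return transcribe_kaldi(audio)
```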
This wouldn’t be too difficult to do. The MQTT messages could be extended to allow specifying whether the current voice command should be interpreted in closed or open transcription mode.
For this, though, I would probably just have the STT service load both speech models (open + closed) with one of them being the default. This would be less jarring message-wise over the MQTT bus than having two STT services.
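Purely to illustrate the idea (the topic names and the `transcriptionMode` field below are invented for this sketch and are not part of the existing message set), a single STT service could keep both models loaded and pick one per message:

```python
# Hypothetical single STT service: both models stay loaded, and a flag in
# the MQTT payload selects which one handles the current command.

import json
import paho.mqtt.client as mqtt

# Placeholders: substitute your real model loaders (e.g. a grammar-based
# Kaldi model for "closed", a large open-dictation model for "open").
class DummyModel:
    def __init__(self, name): self.name = name
    def transcribe(self, wav_path): return f"[{self.name} transcription of {wav_path}]"

models = {"closed": DummyModel("closed"), "open": DummyModel("open")}
DEFAULT_MODE = "closed"

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload)
    mode = payload.get("transcriptionMode", DEFAULT_MODE)  # invented field
    text = models[mode].transcribe(payload["wavPath"])
    client.publish("stt/textCaptured", json.dumps({"text": text, "mode": mode}))

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("stt/transcribe")  # invented topic for this sketch
client.loop_forever()
```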
Has this been added to any feature request or todo list?
@synesthesiam, you still seem to have this on your personal roadmap, calling it “a hybrid STT system that can recognize fixed commands and fall back to an open system like Vosk/Coqui for everything else”.
Have you had much time to work on it, alongside your full-time job at Mycroft AI?
I just came across this conversation and the problem interests me too. My project bosses don’t want outside access in the final version (I’m using replicant.ai + Ghost for transcription at the moment but will have to drop it later), and we will want wildcards. Hence also Kaldi at present.
The voice transcriptions go into an sqlite3 database at the moment, and I’m writing a crude process to extract keywords for an on-screen word cloud. So I’m thinking I could use cron (or something) to update a slots file with popular keywords.
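A rough sketch of that cron job, with assumed table/column names and slots path:

```python
# Pull recent transcriptions out of the sqlite3 database, count the most
# common words, and rewrite a slots file with them. The database schema,
# stopword list and slots path are assumptions for this sketch.

import sqlite3
from collections import Counter

DB_PATH = "transcriptions.db"            # assumed database path
SLOTS_PATH = "slots/popular_keywords"    # assumed slots file
STOPWORDS = {"the", "a", "an", "to", "is", "of", "and", "in", "on"}

conn = sqlite3.connect(DB_PATH)
rows = conn.execute("SELECT text FROM transcriptions").fetchall()  # assumed schema
conn.close()

words = Counter()
for (text,) in rows:
    words.update(w for w in text.lower().split() if w not in STOPWORDS)

with open(SLOTS_PATH, "w") as f:
    for word, _count in words.most_common(50):
        f.write(word + "\n")
```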
I’m aware that this is partial and crude, but I hope the idea will help someone else.