Unless you’re looking at the messages from the ASR service (specifically, hermes/asr/textCaptured), you’re seeing the NLU service’s confidence score. This is almost certainly going to be 1 with fsticuffs because the ASR is designed to put out exactly what the NLU is trained on. Additionally, fsticuffs ignores words it doesn’t know.
See my discussion below, though, for how we can work around this.
I’ve finally gotten around to adding confidence scores to the Kaldi ASR service (in master). I had to write my own C++ tool, but it seems to work well and even provides word-level timings/confidences too.
For sentence-level confidence, I chose to go with Minimum Bayes Risk (MBR). The documentation describes this as “…the expected WER over this sentence (assuming model correctness).” By computing max(0, 1 - MBR), I get a value between 0 and 1 where 1 is perfect confidence (no “risk”) and 0 is no confidence.
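As a minimal sketch of that mapping (the function name is hypothetical and this is not the actual C++ tool, just the formula above written out in Python):

```python
def sentence_confidence(mbr: float) -> float:
    """Map an MBR value to a confidence score in [0, 1].

    Assumes mbr is non-negative, so 1 - mbr never exceeds 1.
    A risk of 0 gives confidence 1; a risk of 1 or more gives 0.
    """
    return max(0.0, 1.0 - mbr)


if __name__ == "__main__":
    # Illustrative inputs only, not measured values.
    for risk in (0.0, 0.25, 2.5):
        print(risk, sentence_confidence(risk))
```

Clamping at 0 means any sentence with an expected error of one word or more ends up with no confidence at all.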
In tests, I’ve been able to produce a range of confidences with gibberish or wrong words. It’s skewed towards 1, though, when the number of possible sentences is small.
(discussion mentioned above)
Here’s a question for everyone: which service should be the “gate keeper” for sentences below some confidence threshold? In other words: who is responsible for checking confidences and rejecting poor transcriptions?
This is what’s happening now:
- ASR transcribes audio and attaches “likelihood” values to the entire sentence as well as individual words in
- Dialogue manager passes asr/textCaptured along to nlu/query (ignoring likelihood)
- NLU recognizes intent and attaches a “confidence” value in
- Dialogue manager receives intent (ignoring confidence)
It seems to me that the dialogue manager is the best place to check likelihood/confidence values and “reject” things. My first pass would be:
- Check asr/textCaptured likelihood and automatically generate nlu/intentNotRecognized if it’s below threshold
- Check nlu/intent confidence values and generate dialogueManager/intentNotRecognized if it’s below threshold
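To make the proposal concrete, here is a rough sketch of the two checks as the dialogue manager might apply them. Everything here is an assumption for illustration: the function names, the threshold values, and the idea of returning a topic string are hypothetical, and none of this is existing Rhasspy code.

```python
from typing import Optional

# Hypothetical thresholds; no specific values have been proposed.
ASR_LIKELIHOOD_THRESHOLD = 0.5
NLU_CONFIDENCE_THRESHOLD = 0.5


def gate_text_captured(
    likelihood: float, threshold: float = ASR_LIKELIHOOD_THRESHOLD
) -> str:
    """Decide what to publish after receiving asr/textCaptured.

    Below threshold: reject without ever querying the NLU service.
    Otherwise: forward the transcription to nlu/query as happens today.
    """
    if likelihood < threshold:
        return "nlu/intentNotRecognized"
    return "nlu/query"


def gate_intent(
    confidence: float, threshold: float = NLU_CONFIDENCE_THRESHOLD
) -> Optional[str]:
    """Decide what to publish after receiving nlu/intent.

    Below threshold: reject the intent.
    Otherwise: return None, meaning the intent is accepted as-is.
    """
    if confidence < threshold:
        return "dialogueManager/intentNotRecognized"
    return None


if __name__ == "__main__":
    print(gate_text_captured(0.2))
    print(gate_text_captured(0.9))
    print(gate_intent(0.3))
    print(gate_intent(1.0))
```

Keeping both checks in one place means each threshold is a single setting on the dialogue manager rather than something every ASR and NLU service has to implement separately.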
What does everyone think? Should the NLU service check its own confidence threshold, or should it let the dialogue manager do that?