How does the Confidence Score work?

Hi there,

I am currently trying to understand the confidence score within my setup.
I am using Kaldi for STT and fsticuffs with fuzzy matching for intent recognition.
I do NOT have open transcription enabled.

I had expected the confidence to vary between 0 and 1, reflecting the likelihood that the transcribed text matches what I actually said.

However, in my tests I always get a confidence score of 1, whether I speak a command correctly or just feed the microphone noise (e.g. whistling).
So basically I am looking for somebody to explain how this whole confidence score works and how I could get a lower score in both ASR and intent recognition.

I read somewhere that Kaldi does not implement confidence scores yet, which is why it can't produce IntentNotRecognized.


Oh OK, that explains a lot. The only available alternative to Kaldi with similar or better accuracy would currently be DeepSpeech, right?
But DeepSpeech requires the AVX instruction set as far as I know, which is why it doesn't work on my J5005 NUC.

OK, now I just tried the same with Pocketsphinx as the STT engine,
and I get the same result as with Kaldi: everything has a confidence of 1.
Noise, silence, "ladididadia": it all gets recognized as a preprogrammed sentence with a confidence of 1.

Does everybody have the same experience?
If not, do you mind sharing how you set up yours?

No, the wheels they offer are built on TensorFlow with AVX instructions. Even with an older CPU such as my i5-3570 I need specific wheels, but for Intel Core chips they are offered and built.

There are wheels available that do work; it's just a matter of finding one, or compiling it yourself, which can seem daunting but is actually quite simple. The build time can be horrendous, though, especially if you make a mistake and need to restart.
It builds on ARM, so I'm sure it will build on a J5005 NUC; it just takes some know-how, and often it's just a ./configure and the local hardware will be detected.

Same with Kaldi: I haven't tried it, but I think PyKaldi and some others do return a confidence level. This is also the worst thing about Picovoice Porcupine: I don't really care about the lack of custom keywords, but the simple yes/no, with no confidence level on keyword detection, is a major API flaw (yes or nothing, I should say).

Unless you're looking at the messages from the ASR service (specifically, hermes/asr/textCaptured), you're seeing the NLU service's confidence score. This is almost certainly going to be 1 with fsticuffs, because the ASR is designed to output exactly what the NLU is trained on. Additionally, fsticuffs ignores words it doesn't know.
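
To illustrate why gibberish can't lower the score, here is a rough sketch of matching after unknown words are dropped (illustrative only; this is not how fsticuffs is actually implemented):

# Illustrative sketch: fsticuffs-style matching where out-of-vocabulary
# words are ignored before comparing against the trained sentences.
KNOWN_SENTENCES = {("turn", "on", "the", "light")}
VOCABULARY = {word for sentence in KNOWN_SENTENCES for word in sentence}

def recognize(words):
    kept = tuple(w for w in words if w in VOCABULARY)  # unknown words dropped
    if kept in KNOWN_SENTENCES:
        return kept, 1.0  # full match -> full confidence
    return kept, 0.0

# "please" and "now" are out of vocabulary, so they are simply ignored:
print(recognize("please turn on the light now".split()))
# (('turn', 'on', 'the', 'light'), 1.0)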

See my discussion below, though, for how we can work around this.

I’ve finally gotten around to adding confidence scores to the Kaldi ASR service (in master). I had to write my own C++ tool, but it seems to work well and even provides word-level timings/confidences too.

For sentence-level confidence, I chose to go with Minimum Bayes Risk (MBR). The documentation describes this as “…the expected WER over this sentence (assuming model correctness).” By computing max(0, 1 - MBR), I get a value between 0 and 1 where 1 is perfect confidence (no “risk”) and 0 is no confidence.
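
In code, the conversion is just a clamp (a minimal sketch; sentence_confidence is my own name for it here, not a function from the Kaldi service):

def sentence_confidence(mbr: float) -> float:
    # MBR is the expected WER of the decoded sentence (the "risk").
    # Zero risk maps to 1.0; anything at or above 1.0 risk maps to 0.0.
    return max(0.0, 1.0 - mbr)

assert sentence_confidence(0.0) == 1.0    # no risk -> perfect confidence
assert sentence_confidence(0.25) == 0.75  # some risk -> lower confidence
assert sentence_confidence(1.5) == 0.0    # risk above 1 is clamped to 0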

In tests, I’ve been able to produce a range of confidences with gibberish or wrong words. It’s skewed towards 1, though, when the number of possible sentences is small.


(discussion mentioned above)

Here's a question for everyone: which service should be the "gatekeeper" for sentences below some confidence threshold? In other words: who is responsible for checking confidences and rejecting poor transcriptions?

This is what’s happening now:

  • ASR transcribes audio and attaches "likelihood" values to the entire sentence as well as individual words in the asr/textCaptured message (example payload after this list)
  • Dialogue manager passes asr/textCaptured along to nlu/query (ignoring likelihood)
  • NLU recognizes intent and attaches a “confidence” value in nlu/intent message
  • Dialogue manager receives intent (ignoring confidence)
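
For reference, a hermes/asr/textCaptured payload looks roughly like this (the values here are made up, and the exact field set can differ between versions; the word-level values travel alongside these fields in recent builds):

{
  "text": "turn on the light",
  "likelihood": 0.83,
  "seconds": 1.2,
  "siteId": "default",
  "sessionId": "default-session"
}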

It seems to me that the dialogue manager is the best place to check likelihood/confidence values and "reject" things. My first pass (sketched in code after this list) would be:

  • Check asr/textCaptured likelihood and automatically generate nlu/intentNotRecognized if it’s below threshold
  • Check nlu/intent confidence values and generate dialogueManager/intentNotRecognized if it’s below threshold
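
Here is a minimal sketch of that gatekeeper logic using paho-mqtt. Only the topic names above come from the thread; the payload fields, thresholds, and the hermes/intent/# subscription are assumptions I've made to keep the example self-contained:

import json
import paho.mqtt.client as mqtt

ASR_THRESHOLD = 0.8  # assumed likelihood threshold
NLU_THRESHOLD = 0.8  # assumed confidence threshold

def on_connect(client, userdata, flags, rc):
    client.subscribe("hermes/asr/textCaptured")
    client.subscribe("hermes/intent/#")  # assumed topic for recognized intents

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload)
    site_id = payload.get("siteId", "default")
    if msg.topic == "hermes/asr/textCaptured":
        # Gate 1: reject poor transcriptions before the NLU ever sees them.
        if payload.get("likelihood", 1.0) < ASR_THRESHOLD:
            client.publish("hermes/nlu/intentNotRecognized",
                           json.dumps({"input": payload.get("text", ""), "siteId": site_id}))
        else:
            client.publish("hermes/nlu/query",
                           json.dumps({"input": payload["text"], "siteId": site_id}))
    else:
        # Gate 2: reject intents that were recognized with low confidence.
        confidence = payload.get("intent", {}).get("confidenceScore", 1.0)
        if confidence < NLU_THRESHOLD:
            client.publish("hermes/dialogueManager/intentNotRecognized",
                           json.dumps({"siteId": site_id}))

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)
client.loop_forever()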

What does everyone think? Should the NLU service check its own confidence threshold, or should it let the dialogue manager do that?


The dialogue manager is indeed the best place to check the confidence threshold, as it is the service that passes the information from ASR to NLU to intent handling.


I’m going to need to post a “dev” version of Rhasspy to make sure I didn’t break something big with this change…


What would be the best way to update the Kaldi ASR service with this new feature when using the Home Assistant addon (currently on 2.5.9)?
Is there a "dev" version of the Home Assistant addon?

No, but you can easily put this whole folder in your /addons folder (available if you have the Samba addon installed).

Then change the first line in the Dockerfile to point to the correct image.
Best to change the version in config.json as well.
Stop your "official" addon and refresh your addons. A local Rhasspy should pop up.
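
For example (the tag and version string below are illustrative, not a recommendation):

# first line of the local addon's Dockerfile:
FROM rhasspy/rhasspy:2.5.9

# and in config.json, a distinct version string:
#   "version": "2.5.9-local"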


Great tip, thanks!
I did that and basically deleted the reference to version 2.5.9, leaving only

FROM rhasspy/rhasspy

That should give me the master version, correct?

When testing, I am still seeing a likelihood of 1 for any gibberish I'm throwing at it.
However, I am not seeing the individual words in the asr/textCaptured output… making me believe I am still using the wrong version.

No, that will give you this image: https://hub.docker.com/r/rhasspy/rhasspy (untagged version),
which is the latest by default.

There is no Docker dev build (yet); otherwise you would use rhasspy/rhasspy:dev (or whichever tag that dev image is given).

Sorry, I haven’t uploaded a dev Docker image yet. My goal is to get a new release uploaded before the month is out.

We’ve tried for a while to get the Docker build to work on GitHub, but so far have been unsuccessful. It’s a large, complex build due to supporting multiple platforms and a few dozen services. I hope to get it automated some day so we can have a nightly build available.
