Great multi-language KWS word dataset

Not that multi-language support matters much for KWS beyond your chosen KW, but it should help with accuracy, and the dataset covers many languages and is quite large.

There is always the Google command set or

Common Voice also has a single-word dataset, and it includes ‘Hey’, which is quite easy to concatenate with other words to make a great KW.
I did this before: I used sox to get the pitch of each ‘hey’ and exported a CSV of the filenames sorted by pitch from low to high.
I did the same with ‘Marvin’, then stepped through the ‘hey’ files, concatenating each one with the nearest-pitch ‘Marvin’ to make the KW. It worked really well, as the more phones in your KW and the more unique it is, the better.
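For anyone wanting to try the pitch matching, here is a rough Python sketch of the idea rather than the exact script I used. It assumes you already have (filename, pitch) pairs for both words, for example loaded from the CSVs, and that each ‘Marvin’ clip is used only once. The names `pair_by_pitch` and `concat_wavs` are made up for illustration, and `concat_wavs` assumes both clips are plain PCM WAV files with the same channel count, sample width and sample rate.

```python
import bisect
import wave


def pair_by_pitch(hey, marvin):
    """Pair each 'hey' clip with the unused 'marvin' clip nearest in pitch.

    hey, marvin: lists of (filename, pitch_hz) tuples.
    Returns a list of (hey_filename, marvin_filename) tuples.
    """
    pool = sorted(marvin, key=lambda fp: fp[1])
    pitches = [p for _, p in pool]
    pairs = []
    # Step through 'hey' from low to high pitch.
    for name, pitch in sorted(hey, key=lambda fp: fp[1]):
        if not pool:
            break  # ran out of 'marvin' clips
        i = bisect.bisect_left(pitches, pitch)
        if i == len(pitches):
            j = i - 1
        elif i == 0:
            j = 0
        else:
            # Pick whichever neighbour is closer in pitch.
            j = i if pitches[i] - pitch < pitch - pitches[i - 1] else i - 1
        pairs.append((name, pool[j][0]))
        del pool[j]
        del pitches[j]
    return pairs


def concat_wavs(first, second, out_path):
    """Write first + second back to back into out_path (PCM WAV only)."""
    with wave.open(first, "rb") as a, wave.open(second, "rb") as b:
        # Compare channels, sample width and frame rate.
        if a.getparams()[:3] != b.getparams()[:3]:
            raise ValueError("clips must share channels, sample width and rate")
        params = a.getparams()
        frames = a.readframes(a.getnframes()) + b.readframes(b.getnframes())
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        out.writeframes(frames)  # the frame count in the header is fixed on close


hey_clips = [("h1.wav", 120.0), ("h2.wav", 200.0)]
marvin_clips = [("m1.wav", 210.0), ("m2.wav", 115.0), ("m3.wav", 160.0)]
matched = pair_by_pitch(hey_clips, marvin_clips)
```

Each pair in `matched` can then be passed to `concat_wavs` to write the joined KW clip. Trimming silence between the two words is left out here and would probably be needed in practice.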

Download the Single Word Target Segment from Common Voice which has one very useful KW ‘hey’ for KW concatenation.

The very large KW dataset at Multilingual Spoken Words | MLCommons is new to me, so I thought I would post.
There are always loads of ASR sentence datasets, but single-word KW datasets have always been rarer, so it's a great addition.

Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours).

PS: the Common Voice dataset is stored on GitHub, and installing Git LFS (Large File Storage) will likely give much faster access, as it's painfully slow otherwise.

Bit more of a write-up on it here. From testing, the quantity per label varies wildly, but many labels have enough examples for a KW, especially if multiplied by augmentation, while the sheer label count is a huge plus for getting a bigger selection of phonetic spectra for !KW (the non-keyword class).
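As a rough illustration of multiplying a small label with augmentation, here is a minimal sketch that makes time-shifted, quieter copies of one clip. It assumes the clip is already loaded as a list of 16-bit integer samples, and the shift and gain ranges are arbitrary picks of mine, not anything from the dataset write-up. Real pipelines usually also add noise, reverb and pitch or speed changes.

```python
import random


def augment(samples, max_shift, rng):
    """Return a time-shifted, gain-scaled copy of a clip with the same length.

    samples: list of integer PCM samples.
    max_shift: largest shift, in samples, in either direction.
    rng: a random.Random instance, so runs can be repeated.
    """
    n = len(samples)
    shift = rng.randint(-max_shift, max_shift)
    shift = max(-n, min(n, shift))   # never shift by more than the clip length
    gain = rng.uniform(0.7, 1.0)     # only attenuate, so no clipping is possible
    if shift >= 0:
        # Delay the clip: pad the start with silence, drop the tail.
        shifted = [0] * shift + samples[: n - shift]
    else:
        # Advance the clip: drop the head, pad the end with silence.
        shifted = samples[-shift:] + [0] * (-shift)
    return [int(s * gain) for s in shifted]


clip = [1000, -2000, 3000, -4000, 5000]
rng = random.Random(0)
variants = [augment(clip, 2, rng) for _ in range(10)]
```

Ten variants per clip turns a label with 200 recordings into 2,000 training examples, although they are obviously far less varied than 2,000 different speakers would be.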


There is another reason this dataset is great: I tried to do something similar with DeepSpeech and another tool, the Montreal Forced Aligner, and extracting words from a sentence is extremely hard.
I don’t think we realize it when we speak, but we overlap words and run our intonation together, and I gave up in a fit of rage over how much time I had wasted.

From what they said they did, I feel a bit better, as they must have struggled too: if they managed to extract a word without the previous and next word extractions overlapping the target word, it was a keeper; otherwise they would jettison it.
So they must have processed tons and tons of speech to get the current dataset.
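The keep-or-jettison rule, as I read it, can be sketched in a few lines of Python. This is only my interpretation of their description, not their actual pipeline. It assumes the aligner gives you (word, start, end) times in seconds for each sentence, sorted by start time, and `isolated_words` and `min_gap` are names I made up.

```python
def isolated_words(alignment, min_gap=0.0):
    """Keep only words whose interval does not overlap either neighbour.

    alignment: list of (word, start_s, end_s) tuples sorted by start time.
    min_gap: extra silence, in seconds, required on both sides (assumption).
    """
    keep = []
    for i, (word, start, end) in enumerate(alignment):
        # The first and last words have no neighbour on one side.
        prev_end = alignment[i - 1][2] if i > 0 else float("-inf")
        next_start = alignment[i + 1][1] if i + 1 < len(alignment) else float("inf")
        if start - prev_end >= min_gap and next_start - end >= min_gap:
            keep.append((word, start, end))
    return keep


aligned = [
    ("hey", 0.00, 0.30),
    ("marvin", 0.28, 0.80),   # starts before 'hey' has finished
    ("turn", 0.85, 1.05),
    ("on", 1.05, 1.20),
]
kept = isolated_words(aligned)
```

Here ‘hey’ and ‘marvin’ overlap, so both are thrown away and only ‘turn’ and ‘on’ survive. Raising `min_gap` throws away even more, which fits with needing tons of speech to end up with a usable number of clean words.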

If you’re going to do the same with your own captured speech, the majority will likely be thrown away, but at least I now know to expect that and can just have a simple system that slowly collects words.