Multi-language support is not all that important for KWS beyond your chosen KW, but it should help with accuracy, and the dataset covers many languages and is quite large.
There is always the Google Speech Commands set:
http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz or http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
Common Voice also has a single-word dataset, and it includes ‘Hey’, which is quite easy to concatenate with other words for a great KW.
I did this before: I used sox to get the pitch of each ‘hey’ clip and exported a CSV of the filenames sorted by pitch from low to high.
I did the same with ‘Marvin’, then stepped through the ‘hey’ clips, concatenating each with the ‘Marvin’ clip nearest in pitch to form the KW. It worked really well, since the more phones in your KW and the more unique it is, the better.
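A rough sketch of that pairing step in Python, in case it helps. It assumes you already have the (filename, pitch) rows from the CSVs and that the clips have been converted to WAV with the same sample rate, sample width and channel count. The names `pair_by_pitch` and `concat_wavs` are made up for illustration, and this is not the exact script I used.

```python
import bisect
import wave


def pair_by_pitch(first, second):
    """Pair each (filename, pitch_hz) in `first` with the nearest-pitch
    entry in `second`, using each `second` clip at most once."""
    remaining = sorted(second, key=lambda item: item[1])
    pitches = [pitch for _, pitch in remaining]
    pairs = []
    for name, pitch in sorted(first, key=lambda item: item[1]):
        if not remaining:
            break  # ran out of clips to pair with
        i = bisect.bisect_left(pitches, pitch)
        # The nearest pitch is either just below or just above the insert point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(remaining)]
        best = min(candidates, key=lambda j: abs(pitches[j] - pitch))
        pairs.append((name, remaining.pop(best)[0]))
        pitches.pop(best)
    return pairs


def concat_wavs(path_a, path_b, out_path):
    """Write path_a followed by path_b into out_path (formats must match)."""
    with wave.open(path_a, "rb") as a, wave.open(path_b, "rb") as b:
        # Compare channels, sample width and frame rate.
        if a.getparams()[:3] != b.getparams()[:3]:
            raise ValueError("clips must share channels, sample width and rate")
        params = a.getparams()
        frames = a.readframes(a.getnframes()) + b.readframes(b.getnframes())
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        out.writeframes(frames)  # frame count in the header is fixed on close
```

Loop over the pairs and call `concat_wavs` on each to build the ‘hey Marvin’ clips. Using each ‘Marvin’ clip only once is a choice made here to avoid duplicates; drop the `pop` calls if you would rather reuse clips.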
Download the Single Word Target Segment from Common Voice, which has one very useful word, ‘hey’, for KW concatenation.
The very large KW dataset, the Multilingual Spoken Words Corpus from MLCommons, is new to me, so I thought I would post.
There are always loads of ASR sentence datasets, but single-word KW datasets have always been rarer, so it's a great addition.
Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours).
PS: the Common Voice dataset is stored on GitHub, and installing Git LFS (Large File Storage) will likely allow much faster access, as it is painfully slow otherwise.
There is a bit more of a write-up on it here. From testing, the label quantities vary wildly, but many labels have enough clips for a KW, especially if multiplied by augmentation, while the sheer label count is a huge plus for getting a bigger selection of phonetic spectra for !KW.
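If you want to check the label quantities yourself, something like the Python below would do as a quick sketch. It assumes the clips are unpacked as one folder per word under a root folder, which may not match how your download is laid out, and the `pattern` default is just a guess at the file extension. Both function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_clips_per_label(root, pattern="*.wav"):
    """Count audio clips in each label folder directly under `root`."""
    counts = Counter()
    for label_dir in Path(root).iterdir():
        if label_dir.is_dir():
            counts[label_dir.name] = sum(1 for _ in label_dir.glob(pattern))
    return counts


def usable_labels(counts, min_clips):
    """Return the labels that have at least `min_clips` clips, sorted by name."""
    return sorted(label for label, n in counts.items() if n >= min_clips)
```

Pick `min_clips` to suit your own training setup; labels that fall short for a KW can still go into the !KW pool.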