https://github.com/42io/dataset is a great dataset builder for KWS. It is not exactly how I would do things, but with respect to the above, converting wav to MFCC and then storing the result as a NumPy .npz file makes your dataset super tiny compared to a full wav dataset.
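To give a rough feel for the size difference, here is a minimal sketch of packing features into .npz and reading them back. It is not how 42io does it. The clip format (16 kHz, 16-bit mono, 1 sec), the 49x13 MFCC shape, the key names `x` and `y`, and the random placeholder data are all my assumptions.

```python
import io
import numpy as np

SAMPLE_RATE = 16000          # assumed: 16 kHz, 16-bit mono, 1 s clips
N_FRAMES, N_MFCC = 49, 13    # assumed MFCC shape per clip

def pack_dataset(features, labels):
    """Serialise features and labels into compressed .npz bytes."""
    buf = io.BytesIO()
    np.savez_compressed(
        buf,
        x=features.astype(np.float32),  # "x" and "y" are hypothetical key names
        y=labels.astype(np.int8),
    )
    return buf.getvalue()

def unpack_dataset(blob):
    """Load the arrays back from .npz bytes."""
    with np.load(io.BytesIO(blob)) as data:
        return data["x"], data["y"]

rng = np.random.default_rng(0)
n_clips = 100
# Random numbers standing in for real MFCCs, just to get the shapes right.
features = rng.standard_normal((n_clips, N_FRAMES, N_MFCC))
labels = rng.integers(0, 10, size=n_clips)

blob = pack_dataset(features, labels)
x, y = unpack_dataset(blob)

wav_bytes = n_clips * SAMPLE_RATE * 2   # raw 16-bit PCM payload for the same clips
mfcc_bytes = x.nbytes                   # float32 MFCC payload before compression
print(f"wav: {wav_bytes} bytes, mfcc: {mfcc_bytes} bytes")
```

With a file on disk you would pass a filename instead of the `BytesIO` buffer; the whole dataset then loads with a single `np.load` call rather than opening thousands of wav files.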
I guess I will have to try it to see the results, but generally I add noise at 25-35% gain with the KW at 75-85% gain, and leave only a small proportion of samples without noise. I think models trained that way are likely to be more noise resilient.
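As a sketch of the kind of mixing I mean, assuming the percentages are linear amplitude gains applied to peak-normalised signals. The function names, the 10% clean fraction, and the hard clip are my own choices for illustration, not anything taken from 42io.

```python
import numpy as np

def peak_normalise(x):
    """Scale a signal so its largest absolute sample is 1.0."""
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def mix_with_noise(kw, noise, rng, kw_gain=(0.75, 0.85),
                   noise_gain=(0.25, 0.35), clean_fraction=0.1):
    """Mix a keyword clip with a random slice of a noise recording.

    Assumes len(noise) >= len(kw). A `clean_fraction` share of the
    samples is left without noise.
    """
    kw = peak_normalise(np.asarray(kw, dtype=np.float32))
    out = kw * rng.uniform(*kw_gain)
    if rng.random() >= clean_fraction:
        noise = np.asarray(noise, dtype=np.float32)
        start = rng.integers(0, len(noise) - len(kw) + 1)
        segment = peak_normalise(noise[start:start + len(kw)])
        out = out + segment * rng.uniform(*noise_gain)
    # The two gains can sum to more than 1.0, so hard-clip to the valid range.
    return np.clip(out, -1.0, 1.0).astype(np.float32)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
kw = np.sin(2 * np.pi * 440 * t)       # a tone standing in for a keyword clip
noise = rng.standard_normal(48000)     # white noise standing in for a noise file
mixed = mix_with_noise(kw, noise, rng)
```

Picking the gains from a range rather than a fixed value gives a bit of extra variation for free each time a sample is augmented.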
I am still getting to grips with it and have not really tried it yet, but it is so near to what I intended to do that I might just mangle it a little to do things my way. The main thing is storing the dataset as .npz, which is super fast to load and, as said, tiny compared to wav.
What I do like is how it checks through the samples and ejects any suspiciously poor examples, as many are poor even when they are marked as verified.
It also multiplies the samples via augmentation to create a bigger dataset.
Probably because it had not been released at the time, it does not use Multilingual Spoken Words | MLCommons, which brings in far more variation for !kw. I would likely take a ‘Hey’ from Common Voice and concatenate it into a 1 sec sample with “Marvin”, “Sheila”, “House” or others to get a good, phonetically unique KW.
You need to batch them, match pitch as best you can, and do some trimming and sox manipulation, but this could likely be automated and added to the current procedures with input params.
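The trimming and joining part could be sketched roughly like this. It uses plain NumPy rather than sox, leaves out pitch matching entirely, and the amplitude threshold, the 50 ms gap, and the function names are assumptions of mine.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate

def trim_silence(x, threshold=0.02):
    """Drop leading and trailing samples below a simple amplitude threshold."""
    x = np.asarray(x, dtype=np.float32)
    loud = np.flatnonzero(np.abs(x) > threshold)
    if loud.size == 0:
        return x[:0]
    return x[loud[0]:loud[-1] + 1]

def join_to_one_second(first, second, gap_ms=50, sample_rate=SAMPLE_RATE):
    """Trim two words, join them with a short gap, and pad to exactly 1 s."""
    gap = np.zeros(int(sample_rate * gap_ms / 1000), dtype=np.float32)
    joined = np.concatenate([trim_silence(first), gap, trim_silence(second)])
    if len(joined) > sample_rate:
        raise ValueError("trimmed words do not fit in one second")
    pad = sample_rate - len(joined)
    left = pad // 2
    return np.pad(joined, (left, pad - left))  # centre the pair in the clip

def fake_word(duration_s, freq, sample_rate=SAMPLE_RATE):
    """A tone with 0.2 s of silence either side, standing in for a real word."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    tone = 0.5 * np.sin(2 * np.pi * freq * t)
    silence = np.zeros(int(0.2 * sample_rate))
    return np.concatenate([silence, tone, silence])

hey = fake_word(0.25, 300)       # stands in for a 'Hey' clip
marvin = fake_word(0.5, 500)     # stands in for a 'Marvin' clip
sample = join_to_one_second(hey, marvin)
```

Raising an error when the pair does not fit makes it easy to skip those combinations in a batch run instead of silently truncating a word.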
Multilingual Spoken Words | MLCommons is a major breakthrough, as it democratizes KW: anyone with the right tools can produce a KWS of their choice, and tools like the above will likely come to encompass it.
https://github.com/42io/dataset is near perfect. Maybe it could add more varied noise to more of the dataset, and I think !kw should also have noise added in a similar manner. With newer and better datasets becoming available, it looks really good.
42io used the same dataset I used for noise, but I found I had to be very careful about what is used for noise, and I hand-picked the noise files. It is really important not to have speech in there, even singing to an extent, and maybe VAD could be used to auto-eject some. That is being very picky though, as it is very good with the right source datasets and with longer, more phonetically unique KWs, and fewer of them than the 10 KWs of the 0-9 example.
The code and scripts from 42io are beautifully elegant, with a simplicity of design that demotes me to the terrible noob hacker I am.
You may have to comment out some of the SHA checks, as the datasets may have changed. I also think FFT is an algorithm where you cannot guarantee exact replication, even though the difference will be minuscule.
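A small sketch of why a hash check trips on a minuscule difference while a tolerance check does not. The perturbation here is artificial, the tolerances are arbitrary, and `sha256_of` and `features_match` are hypothetical helpers of mine, not part of the 42io scripts.

```python
import hashlib
import numpy as np

def sha256_of(array):
    """Hash the raw bytes of an array, as a checksum step might."""
    return hashlib.sha256(np.ascontiguousarray(array).tobytes()).hexdigest()

def features_match(a, b, rtol=1e-5, atol=1e-5):
    """Compare two feature arrays within a tolerance instead of bit-for-bit."""
    return a.shape == b.shape and bool(np.allclose(a, b, rtol=rtol, atol=atol))

rng = np.random.default_rng(0)
reference = rng.standard_normal((49, 13)).astype(np.float32)
# Artificial tiny difference, standing in for FFT rounding on another machine.
rebuilt = reference + np.float32(1e-6)

print(sha256_of(reference) == sha256_of(rebuilt))   # hashes differ
print(features_match(reference, rebuilt))           # values still agree
```

So if only the SHA of the generated features fails, and not the SHA of the downloaded source files, a tolerance comparison like this is one way to tell a harmless rounding difference from a genuinely changed dataset.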