Yeah I am also not a fan of faster-whisper as the benchmarks they publish are baloney and have a tendency to stand on my soap box whilst running scared from miss-infomation.
I did some benches a while back of faster-whisper vs whisper.cpp and whisper.cpp just edged it, whisper.cpp is what I already knew and it doesn’t post false benchmarks. Also I don’t have a Pi4 as don’t use them anymore as for ML OPi5 is similar price but x4-x5 faster, so I have dumped Raspberry as a bad job for Whsiper.
–beam-size 1 (1 sec) is extremely short and only ever going to capture approx 2-3 syllables, default with faster-whisper is 5 and think from memory whisper.cpp is 10 (5 should be ok for command sentences, just record what you say and check the wav length)
Also the quantisation to Int8 likely loses a tad of accuracy also and its a shame that Whisper seems to of monopolised optimisations.
Size |
Parameters |
English-only |
Multilingual |
tiny |
39 M |
✓ |
✓ |
base |
74 M |
✓ |
✓ |
small |
244 M |
✓ |
✓ |
medium |
769 M |
✓ |
✓ |
large |
1550 M |
x |
✓ |
large-v2 |
1550 M |
x |
✓ |
If you compare to say large models of an alternative language specific model such as jonatasgrosman/wav2vec2-large-xlsr-53-english · Hugging Face
The model size is 315M params irrespective to what its quantised to, to fit into memory.
So the large model of Wav2Vec2 is somewhere between small and medium (bit bigger than small) of Whisper and if you do the WER compare at that level then Wav2Vec2 is more accurate and since Wav2Vec2 further advancements have been made with conformers and like.
So projects such as sherpa — sherpa 1.3 documentation or Welcome to wenet’s documentation! — wenet documentation likely have much better frameworks for a Pi4 and above as all benefit from less parameters for better accuracy, but they are limited on the trained language models avail, but 100% opensource.
Its a shame someone with the GPU’s or access isn’t training as they are excellent and being English I am not excluded from what often is a choice of En or Cn, so it would be great if someone could add additional langs.
You can finetune Whisper GitHub - jumon/whisper-finetuning: [WIP] Scripts for fine-tuning Whisper What I Learned from Whisper Fine-Tuning Event | by bofeng huang | Medium but I don’t see the point as its the equivalent of training a fully opensource model that likely will be more accurate for far less parameters.
I guess you could have a go at finetuning whisper with the specific command sentences you aim to use.
I have started becoming a real fan of what the guys are doing at wenet as its very easy to on the fly create a custom domain specific LM on entity data of a skill say HomeAssistant, but they are very forward thinking and nearly always a couple steps ahead of the rest.