Rhasspy3 Whisper experiences

Hi,
I started testing Rhasspy 3. I like the pipeline design; it makes it easier to understand what's going on. Maybe it was just as easy before, but I only ever had the Docker container running.

But the Whisper recognition in German seems to be really bad, even with the larger models, and I did set the language parameter to German as well, if I did it right. Has anyone had similar experiences? I tested it on a Raspberry Pi 4. Does anyone have better models, maybe fine-tuned for German, ideally already converted?

Whisper for German is supposedly quite good at ASR with its default 30 s window and beam search on the large model.

WER rockets as you go down in model size, and the shorter the sentence, the less context the beam search has, so the same thing happens.
There are likely better, far more optimised and smaller ASR models for the Pi 4, which really struggles with anything other than the tiny model for ASR.

It might just be that German on the tiny model, with the command sentences you are using, is that bad.

You can likely test out vanilla Whisper via ggerganov/whisper.cpp (Port of OpenAI's Whisper model in C/C++), as the models can be quantised to be smaller and GGML is the benchmark for optimisation; a rough sketch of driving it is below.
Many have adopted Whisper for the Pi 4; IMO it's a bad fit, but hey…
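If it helps, here is a minimal sketch of driving whisper.cpp from Python to sanity-check German recognition. It assumes whisper.cpp has been cloned and built with `make` and a ggml model downloaded with the bundled script; every path below is an assumption, and the input should be a 16 kHz mono WAV.

```python
# Rough sketch: call a locally built whisper.cpp binary to transcribe a German wav.
# Assumptions: whisper.cpp cloned to ~/whisper.cpp and built with `make`,
# ggml-base.bin downloaded via models/download-ggml-model.sh, 16 kHz mono input.
import subprocess
from pathlib import Path

whisper_cpp = Path.home() / "whisper.cpp"               # assumed clone location
model = whisper_cpp / "models" / "ggml-base.bin"        # assumed model file
wav = Path.home() / "rhasspy3" / "sample.wav"           # assumed test file

result = subprocess.run(
    [str(whisper_cpp / "main"),
     "-m", str(model),
     "-f", str(wav),
     "-l", "de",   # force German rather than auto-detect
     "-t", "4"],   # threads; the Pi 4 has 4 cores
    capture_output=True, text=True, check=True)

print(result.stdout)   # the timestamped transcript is printed to stdout
```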


Thanks for your informative reply! It turns out that with the example on GitHub it doesn't pick up the command properly. When I tested it with the plain script and sample.wav it performed way better, also when started as a service. I will play around. I think I will wait a little before using it daily and build up a bit more knowledge first.

A quick way to test:
config/programs/asr/faster-whisper/script/wav2text --beam-size 1 --language "de" "/home/pi/rhasspy3/config/data/asr/faster-whisper/tiny-int8" "/home/pi/rhasspy3/sample.wav"
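For reference, this is roughly what that wav2text call should be doing under the hood; a small sketch using the faster-whisper Python API, reusing the paths from the command above (adjust to your install):

```python
# Sketch of the faster-whisper Python API, mirroring the wav2text call above.
from faster_whisper import WhisperModel

model = WhisperModel(
    "/home/pi/rhasspy3/config/data/asr/faster-whisper/tiny-int8",
    device="cpu",
    compute_type="int8",
)

# segments is a generator; decoding happens as it is consumed
segments, info = model.transcribe(
    "/home/pi/rhasspy3/sample.wav",
    language="de",   # skip language detection, force German
    beam_size=5,     # faster-whisper's default beam width
)

print(info.language, info.language_probability)
print(" ".join(segment.text.strip() for segment in segments))
```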

Yeah, I am also not a fan of faster-whisper, as the benchmarks they publish are baloney, and I have a tendency to stand on my soapbox whilst running scared from misinformation.

I did some benchmarks a while back of faster-whisper vs whisper.cpp, and whisper.cpp just edged it; whisper.cpp is what I already knew, and it doesn't post false benchmarks. Also, I don't have a Pi 4, as I don't use them anymore: for ML an OPi5 is a similar price but 4-5x faster, so I have given up on Raspberry as a bad fit for Whisper.

--beam-size 1 is not a duration: it is the number of hypotheses kept during beam search, so 1 means plain greedy decoding and gives the worst accuracy. The default with faster-whisper is 5, which should be fine for command sentences (just record what you say and check the wav so you know what the model is actually being fed).

The quantisation to int8 likely loses a tad of accuracy as well, and it's a shame that Whisper seems to have monopolised the optimisation effort.
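If you want to see how much of the damage comes from the beam width versus the int8 tiny model, a quick A/B sketch along these lines (paths are assumptions) compares greedy decoding against the default beam of 5 on the same file:

```python
# Quick A/B: greedy decoding (beam_size=1) vs the default beam of 5 on one file,
# to separate the effect of beam width from that of the int8-quantised tiny model.
import time
from faster_whisper import WhisperModel

model = WhisperModel("/home/pi/rhasspy3/config/data/asr/faster-whisper/tiny-int8",
                     device="cpu", compute_type="int8")

for beam in (1, 5):
    start = time.perf_counter()
    segments, _ = model.transcribe("/home/pi/rhasspy3/sample.wav",
                                   language="de", beam_size=beam)
    text = " ".join(s.text.strip() for s in segments)  # consuming runs the decode
    print(f"beam_size={beam}: {time.perf_counter() - start:.1f}s  {text}")
```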

Size      Parameters  English-only  Multilingual
tiny      39 M        ✓             ✓
base      74 M        ✓             ✓
small     244 M       ✓             ✓
medium    769 M       ✓             ✓
large     1550 M      x             ✓
large-v2  1550 M      x             ✓

If you compare that to the large model of an alternative, language-specific model such as jonatasgrosman/wav2vec2-large-xlsr-53-english on Hugging Face,
the model size is 315 M params, irrespective of what it is quantised to in order to fit into memory.

So the large Wav2Vec2 model sits somewhere between Whisper's small and medium (a bit bigger than small), and if you compare WER at that level, Wav2Vec2 is more accurate; since Wav2Vec2, further advances have been made with conformers and the like.

So projects such as sherpa or wenet likely have much better frameworks for a Pi 4 and above, as they all get better accuracy from fewer parameters, but they are limited in the trained language models available, while being 100% open source.
It's a shame nobody with the GPUs (or access to them) is training more models, as the frameworks are excellent; being English, I am not excluded by what is often a choice of En or Cn, but it would be great if someone could add additional languages.

You can fine-tune Whisper (see jumon/whisper-finetuning on GitHub, or "What I Learned from Whisper Fine-Tuning Event" by Bofeng Huang on Medium), but I don't see the point, as it's the equivalent effort of training a fully open-source model that will likely be more accurate for far fewer parameters.

I guess you could have a go at fine-tuning Whisper with the specific command sentences you aim to use.
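If you do try it, a rough sketch of the usual Hugging Face recipe (not the linked repo's scripts) is below; the "commands" audiofolder layout and the "transcription" column name are assumptions, and in practice you would add evaluation, augmentation and so on.

```python
# Rough sketch of fine-tuning a small Whisper checkpoint on your own German
# command sentences with Hugging Face transformers. The "commands/" audiofolder
# (wavs + metadata.csv with a "transcription" column) is an assumed layout.
import torch
from datasets import load_dataset, Audio
from transformers import (WhisperProcessor, WhisperForConditionalGeneration,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-tiny", language="german", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

ds = load_dataset("audiofolder", data_dir="commands", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

def prepare(example):
    audio = example["audio"]
    example["input_features"] = processor(
        audio["array"], sampling_rate=16000).input_features[0]
    example["labels"] = processor.tokenizer(example["transcription"]).input_ids
    return example

ds = ds.map(prepare, remove_columns=ds.column_names)

def collate(features):
    # log-mel features are already padded to 30 s, so they can simply be stacked
    inputs = torch.tensor([f["input_features"] for f in features])
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt")
    # ignore padding positions in the loss
    targets = labels["input_ids"].masked_fill(labels["attention_mask"].eq(0), -100)
    # the tokenizer already adds <|startoftranscript|>; drop it so the model's
    # own label shifting does not duplicate it
    if (targets[:, 0] == model.config.decoder_start_token_id).all():
        targets = targets[:, 1:]
    return {"input_features": inputs, "labels": targets}

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="whisper-tiny-de-commands",
        per_device_train_batch_size=8,
        learning_rate=1e-5,
        max_steps=500),
    train_dataset=ds,
    data_collator=collate,
)
trainer.train()
```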

I have started becoming a real fan of what the guys at wenet are doing, as it's very easy to create, on the fly, a custom domain-specific LM from the entity data of a skill, say Home Assistant; they are very forward-thinking and nearly always a couple of steps ahead of the rest.
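As a toy example of the "entity data" side, the sketch below just expands a few made-up command templates over made-up Home Assistant entity names into a plain text corpus, which is the kind of input a domain-specific LM (e.g. built with wenet's LM tooling) would be trained on; none of this touches the Home Assistant or wenet APIs.

```python
# Toy sketch: expand command templates over a skill's entity names into a text
# corpus for a domain-specific LM. Templates and entities are made up.
from itertools import product

entities = ["kitchen light", "bedroom light", "living room heating"]
templates = ["turn on the {e}", "turn off the {e}", "set the {e} to {n} percent"]
levels = [str(n) for n in range(0, 101, 10)]

with open("corpus.txt", "w", encoding="utf-8") as f:
    for template, entity in product(templates, entities):
        if "{n}" in template:
            for n in levels:
                f.write(template.format(e=entity, n=n) + "\n")
        else:
            f.write(template.format(e=entity) + "\n")
```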


Hey. Thank you, that's a lot of stuff to check out. I think there is a whole new world to discover.
The OPi5 sounds good; I'll do a little research, maybe that's an early Christmas present for myself.
Open transcription is still the big problem, because you have to add free-form data to some commands, for example creating appointments.

You've pointed me to some very interesting projects and even given me a hint on hardware I've been looking for lately. Thanks for that. I may be able to give something back to the community when I'm more into it, especially since Rhasspy 3 is still at a very early stage.

The OPi5 is the cheapest; Radxa do a Rock 5B that is getting more direct attention and is the model used for mainlining.
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?h=next-20230803&qt=grep&q=rk3588
Arm have announced they are actively helping to create open-source drivers for all Mali products.

Which should mean PanCSF (a new DRM driver for Mali CSF-based GPUs) is fairly imminent.

It's not just the RK3588(S), although that is definitely one of the best; Arm in general is going mainline and becoming a real challenger to x86, especially due to Apple silicon.
If you really want to give yourself a Christmas treat, a Mac Mini gives jaw-dropping ML performance, well over 10x that of an RK3588, with amazing energy efficiency (about 7 W idle), and strangely for Apple it is quite cost-effective given that performance: bang for buck they are roughly the same.
Still, a cluster of RK3588 SBCs has the possibility to partition ASR/LLM/server duties, with an upgrade path of simply adding another.
The Pi 4, especially with LLMs, just falls short, and even at 4-5x that with an RK3588 you may want to cluster a couple as a central brain; because the load is so varied, it's a perfect client-server setup with distributed mics/KWS.

You can use TTS (text to speech) to create audio for sentences, but I would be extremely careful about how much synthetic speech you add to a dataset; even filters that only produce inaudible artefacts can reduce recognition, as I found out with deep-filter.net and Whisper. It is very easy to preprocess a dataset with whatever filter you use (a sketch is below), but how to apply synthetics to a speech dataset is a harder question.
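If you do preprocess, running every training file through the exact same filter you use at inference keeps things consistent. A minimal sketch; the paths and the placeholder filter are assumptions, so swap in DeepFilterNet or whatever you actually use:

```python
# Minimal sketch: run every wav in a dataset through the same filter used at
# inference, writing results to a parallel directory. The filter below is a
# placeholder (peak normalisation); substitute your real enhancement filter.
from pathlib import Path
import numpy as np
import soundfile as sf

def my_filter(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    peak = float(np.max(np.abs(audio))) or 1.0
    return audio / peak

src = Path("dataset/wavs")
dst = Path("dataset/wavs_filtered")
dst.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(src.glob("*.wav")):
    audio, sample_rate = sf.read(wav_path, dtype="float32")
    sf.write(dst / wav_path.name, my_filter(audio, sample_rate), sample_rate)
```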

You don't need to custom-train a full ASR model though, as you can create custom dictionaries and context biasing on the fly, as documented in the "LM for WeNet" and "Context Biasing" pages of the wenet documentation.

Which for me is great, as I have been banging on about that for a while, whilst procrastinating on what and how to implement it.
I think the two main sources are openslr.org and https://arxiv.org/pdf/1807.10311.pdf, but by searching in your own language I am sure you can find more datasets that you can concatenate into something even bigger.
