Anyone checked out Whisper by OpenAI yet for ASR?
Yeah, I have been doing some tests, and the ASR is really good, but on my CPU-only machine, even with the tiny.en model, it runs at best in real time: it takes around 3s to transcribe a 3s utterance, which IMO is too slow to be usable.
For reference, I'm running a base+satellite configuration, with ASR done on the base, an Intel(R) Core™ i7-6700T CPU @ 2.80GHz with 24 GB RAM as my home server.
$ time whisper setkitchenlights20percent.wav --model tiny.en --language en --fp16 False
[00:00.000 --> 00:03.000] Set the kitchen lights to 20%.
real 0m3.573s
user 0m9.131s
sys 0m1.310s
The tiny.en and base.en models transcribed "cancel the timer" as "pencil the timer". small.en got it right, but took 17 seconds on my CPU.
$ time whisper canceltimer.wav --model small.en --language en --fp16 False --task transcribe
[00:00.000 --> 00:02.000] Cancel the timer.
real 0m17.479s
user 0m52.986s
sys 0m4.118s
My current config is RPi 4b as satellites, the i7 as base (also running Home Assistant, Node-RED, MQTT, etc.). I am really happy with the performance of the "Hey Mycroft" wake word, Mozilla DeepSpeech for ASR, Fsticuffs, and Mimic3 ljspeech TTS (and Node-RED for fulfilment).
DeepSpeech seems resilient to music + microwave running in a kitchen with lots of reverberation.
Hopefully Whisper CPU performance improves, or ... it can run on a Coral TPU? I think many home automation setups already have or want a Coral for running Frigate, so it would be nice to use it for ASR as well.
I created a Google Colab notebook to test with a GPU:
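import time
import whisper  # pip install git+https://github.com/openai/whisper.git

# Load the model once, then time a single transcription.
model = whisper.load_model("small.en")
options = dict(task="transcribe", language="en")
st = time.time()
result = model.transcribe("whatstheweathertoday.wav", **options)
print(time.time() - st)  # elapsed seconds
print(result["text"])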
0.4277827739715576
What's the weather today?
0.4s on GPU vs 17s on CPU.
With the tiny.en model it takes 0.2s on GPU vs 3s on CPU.
The tiny.en model also correctly transcribed "Start a timer for 13 minutes and 33 seconds." whereas Mozilla DeepSpeech (for me) confuses numbers like 33 with 53.
Yo I kind of want to see your full setup with the satellites and base! Sounds like you're using multiple satellites and in the kitchen? What are you using to process utterances to execute commands? NodeRED? What's your timer setup like? Does the timer go off on all the satellites or just the one it came from?
Yes, I'm helping with a university project for older people (of which I'm one). The main thing is Pi4 based so it clearly won't cope (not a criticism, really good transcription needs power).
So I'm thinking about wrapping the whisper executable into a web service/server and doing a "please wait", especially as I have large chunks of voice. If I get anywhere, I'll open source the bits and pieces, more integration than pure development.
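A minimal sketch of what such a wrapper could look like, using only the Python standard library. It assumes the `whisper` executable is on the PATH and prints timestamped lines to stdout as in the runs earlier in the thread; the function names, the JSON response shape and the port are all made up for illustration, not taken from any existing project.

```python
import json
import re
import subprocess
import tempfile
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

# Matches the "[00:00.000 --> 00:03.000] " prefix the whisper CLI prints.
TIMESTAMP = re.compile(r"\[[\d:.]+ --> [\d:.]+\]\s*")


def strip_timestamps(cli_output):
    """Keep only the transcript text from the CLI's timestamped lines."""
    parts = []
    for line in cli_output.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            parts.append(line[match.end():].strip())
    return " ".join(parts)


def whisper_cli_transcribe(wav_bytes, model="tiny.en"):
    """Run the whisper executable on uploaded audio (flags as used above)."""
    with tempfile.TemporaryDirectory() as tmp:
        wav_path = Path(tmp) / "utterance.wav"
        wav_path.write_bytes(wav_bytes)
        done = subprocess.run(
            ["whisper", str(wav_path), "--model", model,
             "--language", "en", "--fp16", "False"],
            capture_output=True, text=True, check=True,
            cwd=tmp,  # the CLI also writes transcript files; keep them in tmp
        )
    return strip_timestamps(done.stdout)


def make_handler(transcribe):
    """Build a handler that passes POSTed audio bytes to `transcribe`."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            text = transcribe(self.rfile.read(length))  # slow on CPU-only boxes
            body = json.dumps({"text": text}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the console quiet

    return Handler


def start_server(transcribe, host="127.0.0.1", port=0):
    """Serve on a background thread; port=0 lets the OS pick a free port."""
    server = HTTPServer((host, port), make_handler(transcribe))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_wav(url, wav_bytes):
    """Client side: POST audio and return the transcript text."""
    request = urllib.request.Request(
        url, data=wav_bytes, headers={"Content-Type": "audio/wav"})
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read())["text"]
```

Something like `start_server(whisper_cli_transcribe, port=8000)` would expose the CLI over HTTP. The request blocks until the transcription finishes, so the "please wait" message would have to come from the client while it waits for the response.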
Hi @Hugh_Barnard - have you seen "openai/whisper - Run with an API on Replicate"?
They offer GPU transcriptions via API. I am thinking of running a simple Flask webserver locally that then posts to the Replicate API, to get access to GPU without having to buy one myself. Pricing is pretty good, $0.00055 per second.
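As a rough sanity check on that price, here are a few lines of Python. They assume billing is simply run time multiplied by the quoted $0.00055 per second; cold starts, minimum charges and later price changes are not accounted for. The 0.4s per request is the Colab GPU timing from earlier in the thread, and the hosted hardware may well differ.

```python
PRICE_PER_SECOND_USD = 0.00055  # price quoted in the post above


def estimate_cost_usd(run_seconds, requests=1, price=PRICE_PER_SECOND_USD):
    """Estimated charge, assuming cost = seconds of run time * price."""
    return run_seconds * requests * price


# 1000 voice commands at roughly 0.4 s of GPU time each (assumed figure)
print(f"${estimate_cost_usd(0.4, requests=1000):.2f}")
```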
Thanks, this certainly suits me for proof of concept work.
I made a custom integration for Whisper based on the Remote HTTP method here https://github.com/seifane/whisper-rhasspy-http.
After testing the integration I found that I had very good performance using CPU only with my AMD Ryzen 7 5800X.
Hi!
Cool stuff you made there! I have a question... I have one of those Coral.ai USB devices that I use for the Frigate NVR image processing and detection, and somewhere in the documentation of the Coral.ai device I saw that this could also be used to process audio...
What would be needed to use the Coral device instead of the CPU / GPU?
Thanks!
Hmm, very good question. I am not familiar with that device. But my guess is that if it can be passed through with Docker and is compatible with PyTorch it should work "out of the box" or almost. Sorry I cannot help more.
In full transparency... I'm a noob in these parts of ML and AI... What I read was that this uses TensorFlow... I've read somewhere we can convert PyTorch to TensorFlow but... Yeah, no idea what I'm doing haha
I guess a VM with a shared GPU will have to do the trick, thanks!
Anyone tried this repo?
High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model:
Plain C/C++ implementation without dependencies
Apple silicon first-class citizen - optimized via Arm Neon and Accelerate framework
AVX intrinsics support for x86 architectures
Mixed F16 / F32 precision
Low memory usage (Flash Attention + Flash Forward)
Zero memory allocations at runtime
Runs on the CPU
C-style API
Supported platforms: Linux, Mac OS (Intel and Arm), Windows (MSVC and MinGW), WebAssembly, Raspberry Pi, Android
To be honest, with stock availability my default Raspberry platform is looking ever more in doubt, but the above is CPU based. I don't know how much faster it is than the OpenAI source, but it is optimised for CPU. I am guessing that at best it will run the tiny model (which is not that great, as it's from the small model up that Whisper excels).
Think I will give it a try on my ageing workstation (Intel(R) Xeon(R) CPU E3-1245) and my new toy, a Rock 5B (Radxa Wiki; OkDo are going to start stocking it), as alternatives to the Pi are becoming very valid, and see which model works before going over real time. I doubt it will do the medium model, but there are tiny, base and small to choose from.
It also has a nice streaming input, https://github.com/ggerganov/whisper.cpp#real-time-audio-input-example, and some interesting examples.
ROCK 5B Rockchip RK3588 ARM Cortex-A76
rock@rock-5b:~/whisper.cpp$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 318.74 ms
whisper_print_timings: mel time = 123.62 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 6228.12 ms / 1038.02 ms per layer
whisper_print_timings: decode time = 758.88 ms / 126.48 ms per layer
whisper_print_timings: total time = 7442.09 ms
Xeon(R) CPU E3-1245
./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 221.60 ms
whisper_print_timings: mel time = 85.55 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 1707.26 ms / 284.54 ms per layer
whisper_print_timings: decode time = 183.90 ms / 30.65 ms per layer
whisper_print_timings: total time = 2211.89 ms
Playing some more: setting threads to the max of 8 and comparing the tiny, base, small & medium models on the Rock5b RK3588:
./main -m models/ggml-tiny.en.bin -f samples/jfk.wav -t 8
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:07.740] And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740] ask what you can do for your country
whisper_print_timings: load time = 1431.40 ms
whisper_print_timings: mel time = 114.11 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 2746.18 ms / 686.54 ms per layer
whisper_print_timings: decode time = 353.36 ms / 88.34 ms per layer
whisper_print_timings: total time = 4663.89 ms
./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 320.30 ms
whisper_print_timings: mel time = 111.54 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 6148.25 ms / 1024.71 ms per layer
whisper_print_timings: decode time = 580.88 ms / 96.81 ms per layer
whisper_print_timings: total time = 7173.88 ms
./main -m models/ggml-small.en.bin -f samples/jfk.wav -t 8
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:08.000] And so, my fellow Americans, ask not what your country can do for you.
[00:00:08.000 --> 00:00:11.000] Ask what you can do for your country.
whisper_print_timings: load time = 644.22 ms
whisper_print_timings: mel time = 122.85 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 24924.77 ms / 2077.06 ms per layer
whisper_print_timings: decode time = 2036.42 ms / 169.70 ms per layer
whisper_print_timings: total time = 27742.79 ms
./main -m models/ggml-medium.en.bin -f samples/jfk.wav -t 8
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 24878.33 ms
whisper_print_timings: mel time = 122.06 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 87195.62 ms / 3633.15 ms per layer
whisper_print_timings: decode time = 4881.50 ms / 203.40 ms per layer
whisper_print_timings: total time = 117097.61 ms
Running from NVMe also helps; the runs above were from the SD card, which I am also booting from.
./main -m models/ggml-medium.en.bin -f samples/jfk.wav -t 8
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 2024.17 ms
whisper_print_timings: mel time = 108.58 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 86100.46 ms / 3587.52 ms per layer
whisper_print_timings: decode time = 4895.51 ms / 203.98 ms per layer
whisper_print_timings: total time = 93143.08 ms
Adding my own results to those from @rolyan_trauts, for the same 11-second WAV file.
On an AMD Ryzen 9 5950X 16-Core Processor:
$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
...
whisper_print_timings: load time = 103.60 ms
whisper_print_timings: mel time = 67.62 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 901.81 ms / 150.30 ms per layer
whisper_print_timings: decode time = 117.51 ms / 19.59 ms per layer
whisper_print_timings: total time = 1198.73 ms
and using a different number of threads:
$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8
whisper_model_load: loading model from 'models/ggml-base.en.bin'
...
whisper_print_timings: load time = 106.87 ms
whisper_print_timings: mel time = 36.87 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 610.55 ms / 101.76 ms per layer
whisper_print_timings: decode time = 132.92 ms / 22.15 ms per layer
whisper_print_timings: total time = 895.48 ms
$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 16
...
whisper_print_timings: load time = 101.06 ms
whisper_print_timings: mel time = 22.97 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 456.84 ms / 76.14 ms per layer
whisper_print_timings: decode time = 199.03 ms / 33.17 ms per layer
whisper_print_timings: total time = 788.38 ms
with the tiny model and 16 threads:
$ ./main -m models/ggml-tiny.en.bin -f samples/jfk.wav -t 16
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
...
whisper_print_timings: load time = 76.03 ms
whisper_print_timings: mel time = 21.28 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 231.51 ms / 57.88 ms per layer
whisper_print_timings: decode time = 100.68 ms / 25.17 ms per layer
whisper_print_timings: total time = 437.92 ms
On a Raspberry Pi 4 (tiny model, 4 threads):
$ ./main -m models/ggml-tiny.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
...
whisper_print_timings: load time = 800.65 ms
whisper_print_timings: mel time = 208.19 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 9256.59 ms / 2314.15 ms per layer
whisper_print_timings: decode time = 708.53 ms / 177.13 ms per layer
whisper_print_timings: total time = 10999.69 ms
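To compare runs like these without eyeballing the logs, the timing lines can be parsed and turned into a real-time factor (processing time divided by audio length). This is only a sketch: the regex is written against the log format shown in this thread and may need adjusting for other whisper.cpp versions, and the function names are made up.

```python
import re

# Matches e.g. "whisper_print_timings: encode time = 9256.59 ms / ..."
TIMING = re.compile(r"whisper_print_timings:\s+(\w+) time\s*=\s*([\d.]+) ms")


def parse_timings(log_text):
    """Return {stage: milliseconds} from whisper_print_timings lines."""
    return {m.group(1): float(m.group(2)) for m in TIMING.finditer(log_text)}


def real_time_factor(timings, audio_seconds):
    """Total processing time / audio length; below 1.0 beats real time."""
    return timings["total"] / 1000.0 / audio_seconds


# The Raspberry Pi 4 run above (tiny model, 11.0 sec jfk.wav sample).
PI4_TINY_LOG = """\
whisper_print_timings: load time = 800.65 ms
whisper_print_timings: mel time = 208.19 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 9256.59 ms / 2314.15 ms per layer
whisper_print_timings: decode time = 708.53 ms / 177.13 ms per layer
whisper_print_timings: total time = 10999.69 ms
"""

timings = parse_timings(PI4_TINY_LOG)
print(timings)
print(round(real_time_factor(timings, 11.0), 3))
```

By this measure the Pi 4 with the tiny model sits almost exactly at real time on this sample, while the 16-thread Ryzen run with the same model comes out at roughly 0.04.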
Yeah, it scales really well on threads/cores, especially if you have a monster PC like yours.
It's something to do with the author, as he is some brilliant scientist or something: he has created his own tensor library for machine learning, https://github.com/ggerganov/ggml, that is CPU based and optimised for Neon & AVX.
I presume that is a really good test for AMD AVX vs Intel, as is it the 5950X (might be the 7950X) that does AVX512 in two 256-bit passes, almost like AVX hyperthreading?
The dev passed me his bench on the Pi4 for the base model, as I no longer have one.
Also from the README, on a MacBook M1 Pro using the medium.en model:
$ ./main -m models/ggml-medium.en.bin -f samples/gb1.wav -t 8
whisper_model_load: loading model from 'models/ggml-medium.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem_required = 2610.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 1644.97 MB
whisper_model_load: memory size = 182.62 MB
whisper_model_load: model size = 1462.12 MB
main: processing 'samples/gb1.wav' (3179750 samples, 198.7 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00.000 --> 00:08.000] My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:08.000 --> 00:17.000] At nine o'clock this morning, Mission Control in Houston lost contact with our Space Shuttle Columbia.
[00:17.000 --> 00:23.000] A short time later, debris was seen falling from the skies above Texas.
[00:23.000 --> 00:29.000] The Columbia's lost. There are no survivors.
[00:29.000 --> 00:32.000] On board was a crew of seven.
[00:32.000 --> 00:39.000] Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark,
[00:39.000 --> 00:48.000] Captain David Brown, Commander William McCool, Dr. Kultna Shavla, and Ilan Ramon,
[00:48.000 --> 00:52.000] a colonel in the Israeli Air Force.
[00:52.000 --> 00:58.000] These men and women assumed great risk in the service to all humanity.
[00:58.000 --> 01:03.000] In an age when space flight has come to seem almost routine,
[01:03.000 --> 01:07.000] it is easy to overlook the dangers of travel by rocket
[01:07.000 --> 01:12.000] and the difficulties of navigating the fierce outer atmosphere of the Earth.
[01:12.000 --> 01:18.000] These astronauts knew the dangers, and they faced them willingly,
[01:18.000 --> 01:23.000] knowing they had a high and noble purpose in life.
[01:23.000 --> 01:31.000] Because of their courage and daring and idealism, we will miss them all the more.
[01:31.000 --> 01:36.000] All Americans today are thinking as well of the families of these men and women
[01:36.000 --> 01:40.000] who have been given this sudden shock and grief.
[01:40.000 --> 01:45.000] You're not alone. Our entire nation grieves with you,
[01:45.000 --> 01:52.000] and those you love will always have the respect and gratitude of this country.
[01:52.000 --> 01:56.000] The cause in which they died will continue.
[01:56.000 --> 02:04.000] Mankind is led into the darkness beyond our world by the inspiration of discovery
[02:04.000 --> 02:11.000] and the longing to understand. Our journey into space will go on.
[02:11.000 --> 02:16.000] In the skies today, we saw destruction and tragedy.
[02:16.000 --> 02:22.000] Yet farther than we can see, there is comfort and hope.
[02:22.000 --> 02:29.000] In the words of the prophet Isaiah, "Lift your eyes and look to the heavens
[02:29.000 --> 02:35.000] who created all these. He who brings out the starry hosts one by one
[02:35.000 --> 02:39.000] and calls them each by name."
[02:39.000 --> 02:46.000] Because of His great power and mighty strength, not one of them is missing.
[02:46.000 --> 02:55.000] The same Creator who names the stars also knows the names of the seven souls we mourn today.
[02:55.000 --> 03:01.000] The crew of the shuttle Columbia did not return safely to earth,
[03:01.000 --> 03:05.000] yet we can pray that all are safely home.
[03:05.000 --> 03:13.000] May God bless the grieving families, and may God continue to bless America.
[03:13.000 --> 03:41.000] Audio
whisper_print_timings: load time = 575.92 ms
whisper_print_timings: mel time = 230.60 ms
whisper_print_timings: sample time = 73.19 ms
whisper_print_timings: encode time = 19552.61 ms / 814.69 ms per layer
whisper_print_timings: decode time = 13249.96 ms / 552.08 ms per layer
whisper_print_timings: total time = 33686.27 ms
The tiny & base models sort of drop off from the accuracy Whisper has, but boy the bigger models are amazing: on small and above, Elon becomes Ilan Ramon, which is actually correct.
It's a strange repo, as Whisper is an absolute accuracy monster, so the model probably needs partitioning to run across GPU/CPU and maybe an NPU if available; you are going to use it for accuracy, and that accuracy is in the bigger models.
From what I have read, the small model really is the minimum. The above is medium on a MacBook Pro M1; @synesthesiam, what does your AMD monster manage, out of curiosity, as a comparison?
I asked the dev for a bench of the Pi4 with base:
Rock5b
rock@rock-5b:~/nvme/whisper.cpp$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 313.91 ms
whisper_print_timings: mel time = 107.60 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 6165.18 ms / 1027.53 ms per layer
whisper_print_timings: decode time = 657.71 ms / 109.62 ms per layer
whisper_print_timings: total time = 7256.87 ms
Pi4
pi@raspberrypi:~/whisper.cpp $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 4
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 1851.33 ms
whisper_print_timings: mel time = 270.67 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 33790.07 ms / 5631.68 ms per layer
whisper_print_timings: decode time = 1287.69 ms / 214.61 ms per layer
whisper_print_timings: total time = 37281.19 ms
The Rock5b is 5.137 times faster than a Pi4, which is interesting; probably due to the Mac optimisation, which targets the ARMv8.2 architecture, and the cores?
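The 5.137 figure is the ratio of the two total times. Breaking the same two base.en logs down per stage shows where the gap comes from. The numbers are copied from the Rock5b and Pi4 runs above; the helper name is just for illustration.

```python
# Timings in ms, copied from the two base.en runs above.
ROCK5B = {"load": 313.91, "mel": 107.60, "encode": 6165.18,
          "decode": 657.71, "total": 7256.87}
PI4 = {"load": 1851.33, "mel": 270.67, "encode": 33790.07,
       "decode": 1287.69, "total": 37281.19}


def speedups(slow, fast):
    """Per-stage ratio slow/fast; 2.0 means `fast` took half the time."""
    return {stage: slow[stage] / fast[stage] for stage in fast}


for stage, ratio in speedups(PI4, ROCK5B).items():
    print(f"{stage:>6}: {ratio:.2f}x")
```

Most of the overall gain comes from the encoder, which is also where nearly all of the run time goes on both boards; the decoder improves by a bit under 2x. Note the Rock5b run used 8 threads and the Pi4 run 4, so this is not a like-for-like core comparison.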
I have been shocked how well it scales; with TensorFlow DTLN, on 2 threads you get an improvement but not 2x, and then 4 threads seems to make little difference over 2.
It's an amazing repo in how it scales on CPU, but really it's a model that you wouldn't want to run on CPU alone, or with the smaller models.
Been extremely impressed by the accuracy of the bigger models, as wow F...
PS going back to the original Whisper code: to get CUDA 11.6 I used one of the Nvidia Docker containers instead.
But install the right torch first:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
Then install Whisper
pip install git+https://github.com/openai/whisper.git
Using https://commons.wikimedia.org/wiki/File:Reagan_Space_Shuttle_Challenger_Speech.ogv
4m:48s
time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model medium.en --threads=8
real 0m42.072s
user 0m46.303s
sys 0m3.591s
time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model small.en --threads=8
real 0m22.323s
user 0m24.127s
sys 0m2.545s
time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model base.en --threads=8
real 0m13.119s
user 0m14.324s
sys 0m2.137s
time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model tiny.en --threads=8
real 0m10.855s
user 0m11.907s
sys 0m2.106s
My GPU is just an RTX 3050 desktop card, but it would be interesting to see, @synesthesiam, what your GPUs will attain; the translation to English might also be interesting on the multilingual models.
It's great what Georgi has done, but would you really want to run on CPU?
I've been testing whisper.cpp, and there is roughly a 1.5x to 2x speedup compared to OpenAI's Whisper implementation, but on my Intel i7-6700T only the tiny model is sort of usable.
$ ./main -m models/ggml-tiny.en.bin -f rhasspy/setkitchenlights20percent.wav -t 8
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem_required = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 84.99 MB
whisper_model_load: memory size = 11.41 MB
whisper_model_load: model size = 73.54 MB
system_info: n_threads = 8 / 8 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing 'rhasspy/setkitchenlights20percent.wav' (48480 samples, 3.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.000] Set the kitchen lights to 20%.
whisper_print_timings: load time = 295.38 ms
whisper_print_timings: mel time = 18.83 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 1026.03 ms / 256.51 ms per layer
whisper_print_timings: decode time = 54.58 ms / 13.65 ms per layer
whisper_print_timings: total time = 1400.99 ms
From my previous tests using the OpenAI model (see my post above):
- tiny: openai 3.6s vs 1.4s with cpp
- small: openai 17s vs 10s with cpp
Interestingly, there seems to be a "startup cost" to the model. For instance, the jfk sentence is 11s in real time and is transcribed in about 1.5s. My utterance of "How much time is there until the laundry timer is done?" is 6s in real time and is transcribed in 1.7s, and the "Activate working" utterance is 2s in real time and takes 1.3s to transcribe. (All with the tiny model.)
I also used ffmpeg to speed up the audio file by 1.5x, and transcription took the same amount of time. (Accuracy of transcription was unaffected.)
I was hoping that the transcription time would be linear with the length of the audio, so that short utterances (like used for a voice assistant) would be fast.
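To put a number on that, here is a small least-squares fit of transcription time against audio length, using the three tiny-model measurements quoted above. Three rounded data points are far too few for anything precise, so treat it as a rough illustration only; the function name is made up.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# (audio length in s, transcription time in s), tiny model, from the post above
measurements = [(11.0, 1.5), (6.0, 1.7), (2.0, 1.3)]

slope, fixed_cost = fit_line(measurements)
print(f"per second of audio: {slope:.3f} s, fixed cost: {fixed_cost:.2f} s")
```

On these numbers the fit gives a fixed cost of around 1.4s and only about 0.02s per extra second of audio, so almost all of the time is spent regardless of utterance length.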
Yeah, I presume the "startup cost" is loading the model: each time we run ./main we are loading it afresh, whereas the code could likely be hacked to retain the model in memory across audio submissions.
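The "load once, reuse for every utterance" idea can be sketched as follows. It is shown with the Python openai-whisper package (the same `load_model` / `transcribe` calls as in the Colab snippet earlier) rather than whisper.cpp's C code. The class name and the injectable `loader` argument are illustrative assumptions, the latter only there so the class can be exercised without the real model.

```python
def _load_whisper(name):
    """Default loader; needs the openai-whisper package installed."""
    import whisper  # imported lazily so the class works with a stub loader
    return whisper.load_model(name)


class ResidentTranscriber:
    """Keep one loaded model in memory and reuse it for every request."""

    def __init__(self, model_name="tiny.en", loader=None):
        self.model_name = model_name
        self._loader = loader or _load_whisper
        self._model = None
        self.load_count = 0  # how many times the model was actually loaded

    def transcribe(self, wav_path):
        if self._model is None:  # pay the load cost on first use only
            self._model = self._loader(self.model_name)
            self.load_count += 1
        result = self._model.transcribe(
            wav_path, task="transcribe", language="en")
        return result["text"].strip()
```

A long-running process would create one `ResidentTranscriber()` at startup and call `transcribe(path)` per utterance. This removes only the model load from each request; it does nothing about the fixed cost of the encoder itself.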
It's a really interesting model but not so great for a voice assistant, as basically it works on a 30 sec sliding window, and its accuracy isn't in what it hears but in how it interprets that context.
That is why sometimes it can get things very wrong but will still make perfect sense in English.
Because of that it really doesn't fit a streaming model; the streaming examples seem to up the load maybe 4-6 times, and CPU-wise I could only get the tiny.en model to work, with results that were much poorer than feeding non-streamed audio, on an Intel(R) Xeon(R) CPU E3-1245 v5.
Is it possible to accelerate whisper with something like a google coral on a raspberry pi?
Yes.
Someone here has converted the model from PyTorch to TFLite and then quantised it to int8, so I am thinking it's likely, but it still depends on the layers, as some layers such as GRU/LSTM don't convert.
It should, however, be faster with standard TFLite CPU inference than with the PyTorch model, and examples can be found here.
PS Whisper.cpp has had some updates and the streaming model is supposedly working much better.