Whisper by OpenAI

Anyone checked out Whisper by OpenAI yet for ASR?


Yeah, I have been doing some tests, and the ASR is really good, but on my CPU-only machine, even with the tiny.en model, it runs at best at realtime: it takes around 3s to transcribe a 3s utterance, which IMO is too slow to be usable.

For reference, I’m running a base+satellite configuration, with ASR done on the base, an Intel(R) Core™ i7-6700T CPU @ 2.80GHz with 24 GB RAM as my home server.

$ time whisper setkitchenlights20percent.wav --model tiny.en --language en --fp16 False

[00:00.000 --> 00:03.000]  Set the kitchen lights to 20%.

real	0m3.573s
user	0m9.131s
sys	0m1.310s

The tiny.en and base.en models transcribed “cancel the timer” to “pencil the timer”. small.en got it right, but took 17 seconds on my CPU.

$ time whisper canceltimer.wav --model small.en --language en --fp16 False --task transcribe
[00:00.000 --> 00:02.000]  Cancel the timer.

real	0m17.479s
user	0m52.986s
sys	0m4.118s
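A handy way to compare runs like these is the real-time factor: processing time divided by audio length, where anything below 1.0 is faster than realtime. Here is a minimal sketch using the `real` values and the end timestamps from the two runs above. The function name is just for illustration, and note that `time` also counts Python startup and model loading, so these figures are pessimistic for a long-running service.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration (< 1.0 is faster than realtime)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


# (wall-clock seconds from `time`, audio seconds from the printed end timestamp)
runs = {
    "tiny.en": (3.573, 3.0),
    "small.en": (17.479, 2.0),
}

for model, (processing_s, audio_s) in runs.items():
    print(f"{model}: RTF = {real_time_factor(processing_s, audio_s):.2f}")
```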

My current config is RPi 4b as satellites and the i7 as base (also running Home Assistant, Node-RED, MQTT, etc…). I am really happy with the performance of the “Hey Mycroft” wake word, Mozilla DeepSpeech for ASR, Fsticuffs, and Mimic3 ljspeech TTS (and Node-RED for fulfilment).

DeepSpeech seems resilient to music + microwave running in a kitchen with lots of reverberation.

Hopefully Whisper CPU performance improves, or … it can run on a Coral TPU? I think many home automation setups already have or want a Coral for running Frigate, would be nice to use for ASR as well.

I created a Google Colab notebook to test with a GPU:

import time

import whisper

model = whisper.load_model("small.en")

options = dict(task="transcribe", language="en")

st = time.time()
result = model.transcribe("whatstheweathertoday.wav", **options)
print(time.time() - st)

print(result["text"])
0.4277827739715576
 What's the weather today?

0.4s on GPU vs 17s on CPU.

With the tiny.en model it takes 0.2s on GPU vs 3s on CPU.

The tiny.en model also correctly transcribed “Start a timer for 13 minutes and 33 seconds.” whereas Mozilla DeepSpeech (for me) confuses numbers like 33 with 53.

Yo I kind of want to see your full setup with the satellites and base! Sounds like you’re using multiple satellites and in the kitchen? What are you using to process utterances to execute commands? NodeRED? What’s your timer setup like? Does the timer go off on all the satellites or just the one it came from?

Yes, I’m helping with a university project for older people (of which I’m one). The main thing is Pi4 based so it clearly won’t cope (not a criticism, really good transcription needs power).

So I’m thinking about wrapping the whisper executable into a web service/server and doing a ‘please wait’ especially as I have large chunks of voice. If I get anywhere, I’ll open source the bits and pieces, more integration than pure development.
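For anyone wanting to try the same idea, here is a rough sketch of what wrapping the executable in a small HTTP service could look like, using only the Python standard library. Everything here is an assumption for illustration: the handler and function names, the port, and posting raw WAV bytes as the request body. It reuses the CLI flags shown earlier in the thread, blocks until whisper finishes (so the client simply waits), and assumes Linux. The CLI may also write transcript files into the working directory.

```python
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_whisper_command(wav_path, model="tiny.en"):
    """Build the same CLI call used earlier in this thread."""
    return ["whisper", wav_path, "--model", model, "--language", "en", "--fp16", "False"]


class TranscribeHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: POST raw WAV bytes, get whisper's stdout back."""

    def do_POST(self):
        # Read the uploaded audio from the request body.
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)

        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            tmp.write(audio)
            tmp.flush()
            # The model is loaded afresh on every request, so this is slow.
            proc = subprocess.run(
                build_whisper_command(tmp.name), capture_output=True, text=True
            )

        ok = proc.returncode == 0
        body = (proc.stdout if ok else proc.stderr).encode("utf-8")
        self.send_response(200 if ok else 500)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8080):
    """Run until interrupted."""
    HTTPServer((host, port), TranscribeHandler).serve_forever()


# Call serve() to start listening; it is left out here so the module can be imported.
```

For large chunks of voice a real version would probably hand the job to a background worker and return a "please wait" response straight away, rather than holding the connection open.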

Hi @Hugh_Barnard - have you seen “openai/whisper – Run with an API on Replicate”?

They offer GPU transcriptions via API. I am thinking of running a simple Flask webserver locally that then posts to the Replicate API, to get access to GPU without having to buy one myself. Pricing is pretty good, $0.00055 per second.
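Some back-of-the-envelope arithmetic on that price, as a sketch only. It assumes the $0.00055 is charged per second of GPU run time (not per second of audio) and that a short utterance takes about 0.4s as in the Colab test above; the hosted hardware and any cold-start time may well differ. The 100 utterances a day figure is made up for illustration.

```python
PRICE_PER_SECOND_USD = 0.00055  # price quoted above


def cost_usd(run_seconds, price_per_second=PRICE_PER_SECOND_USD):
    """Cost of one run, assuming billing per second of run time."""
    return run_seconds * price_per_second


per_utterance = cost_usd(0.4)  # assumed run time, taken from the Colab test
utterances_per_day = 100       # hypothetical usage

print(f"per utterance: ${per_utterance:.5f}")
print(f"per 30 days:   ${per_utterance * utterances_per_day * 30:.2f}")
```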


Thanks, this certainly suits me for proof of concept work.

I made a custom integration for Whisper based on the Remote HTTP method here https://github.com/seifane/whisper-rhasspy-http.

After testing the integration I found that I had very good performance using CPU only with my AMD Ryzen 7 5800X.


Hi!

Cool stuff you made there! I have a question… I have one of those Coral.ai USB devices that I use for the Frigate NVR image processing and detection, and somewhere in the documentation of the Coral.ai device I saw that this could also be used to process audio…

What would be needed to use the Coral device instead of the CPU / GPU?

Thanks! :)

Hmm, very good question. I am not familiar with that device, but my guess is that if it can be passed through to Docker and is compatible with PyTorch, it should work “out of the box”, or almost. Sorry I cannot help more :/

In full transparency… I’m a noob in these parts of ML and AI… What I read was that this uses TensorFlow… I’ve read somewhere we can convert pytorch to TensorFlow but… Yeah no idea what I’m doing haha

I guess a VM with a shared GPU will have to do the trick :) thanks!


Anyone tried this repo?

High-performance inference of OpenAI’s Whisper automatic speech recognition (ASR) model:

Plain C/C++ implementation without dependencies
Apple silicon first-class citizen - optimized via Arm Neon and Accelerate framework
AVX intrinsics support for x86 architectures
Mixed F16 / F32 precision
Low memory usage (Flash Attention + Flash Forward)
Zero memory allocations at runtime
Runs on the CPU
C-style API
Supported platforms: Linux, Mac OS (Intel and Arm), Windows (MSVC and MinGW), WebAssembly, Raspberry Pi, Android

To be honest, with stock availability as it is, my default Raspberry Pi platform is looking ever more in doubt, but the above is CPU based. I don’t know how much faster it is than the OpenAI source, but it is optimised for CPU. I’m guessing that at best it will manage the tiny model (which is not that great, as it’s from the small model up that Whisper excels).

I think I will give it a try on my ageing workstation (Intel(R) Xeon(R) CPU E3-1245) and my new toy, a Rock 5B (Rock5/hardware/5b - Radxa Wiki; OkDo are going to start stocking it), as alternatives to the Pi are becoming very valid, and see which model works before it goes over realtime. I doubt it will do the medium model, but there are tiny, base and small to choose from.

It also has a nice streaming input example, https://github.com/ggerganov/whisper.cpp#real-time-audio-input-example, and some other interesting examples.

ROCK 5B Rockchip RK3588 ARM Cortex-A76

rock@rock-5b:~/whisper.cpp$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   318.74 ms
whisper_print_timings:      mel time =   123.62 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6228.12 ms / 1038.02 ms per layer
whisper_print_timings:   decode time =   758.88 ms / 126.48 ms per layer
whisper_print_timings:    total time =  7442.09 ms

Xeon(R) CPU E3-1245

./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB 
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   221.60 ms
whisper_print_timings:      mel time =    85.55 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  1707.26 ms / 284.54 ms per layer
whisper_print_timings:   decode time =   183.90 ms / 30.65 ms per layer
whisper_print_timings:    total time =  2211.89 ms

Playing some more: setting threads to the maximum of 8 and comparing the tiny, base, small & medium models on the Rock 5B (RK3588):

./main -m models/ggml-tiny.en.bin -f samples/jfk.wav -t 8

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.740]   And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740]   ask what you can do for your country


whisper_print_timings:     load time =  1431.40 ms
whisper_print_timings:      mel time =   114.11 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2746.18 ms / 686.54 ms per layer
whisper_print_timings:   decode time =   353.36 ms / 88.34 ms per layer
whisper_print_timings:    total time =  4663.89 ms

./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   320.30 ms
whisper_print_timings:      mel time =   111.54 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6148.25 ms / 1024.71 ms per layer
whisper_print_timings:   decode time =   580.88 ms / 96.81 ms per layer
whisper_print_timings:    total time =  7173.88 ms

./main -m models/ggml-small.en.bin -f samples/jfk.wav -t 8

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.000]   And so, my fellow Americans, ask not what your country can do for you.
[00:00:08.000 --> 00:00:11.000]   Ask what you can do for your country.


whisper_print_timings:     load time =   644.22 ms
whisper_print_timings:      mel time =   122.85 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 24924.77 ms / 2077.06 ms per layer
whisper_print_timings:   decode time =  2036.42 ms / 169.70 ms per layer
whisper_print_timings:    total time = 27742.79 ms

./main -m models/ggml-medium.en.bin -f samples/jfk.wav -t 8

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time = 24878.33 ms
whisper_print_timings:      mel time =   122.06 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 87195.62 ms / 3633.15 ms per layer
whisper_print_timings:   decode time =  4881.50 ms / 203.40 ms per layer
whisper_print_timings:    total time = 117097.61 ms

Running from NVMe also helps, as the runs above were on the SD card I boot from. The same medium.en run from NVMe:

./main -m models/ggml-medium.en.bin -f samples/jfk.wav -t 8

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =  2024.17 ms
whisper_print_timings:      mel time =   108.58 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 86100.46 ms / 3587.52 ms per layer
whisper_print_timings:   decode time =  4895.51 ms / 203.98 ms per layer
whisper_print_timings:    total time = 93143.08 ms

Adding my own results to @rolyan_trauts’s for the same 11 second WAV file.

On an AMD Ryzen 9 5950X 16-Core Processor:

$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
...
whisper_print_timings:     load time =   103.60 ms
whisper_print_timings:      mel time =    67.62 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =   901.81 ms / 150.30 ms per layer
whisper_print_timings:   decode time =   117.51 ms / 19.59 ms per layer
whisper_print_timings:    total time =  1198.73 ms

and using a different number of threads:

$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8
whisper_model_load: loading model from 'models/ggml-base.en.bin'
...
whisper_print_timings:     load time =   106.87 ms
whisper_print_timings:      mel time =    36.87 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =   610.55 ms / 101.76 ms per layer
whisper_print_timings:   decode time =   132.92 ms / 22.15 ms per layer
whisper_print_timings:    total time =   895.48 ms

$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 16
...
whisper_print_timings:     load time =   101.06 ms
whisper_print_timings:      mel time =    22.97 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =   456.84 ms / 76.14 ms per layer
whisper_print_timings:   decode time =   199.03 ms / 33.17 ms per layer
whisper_print_timings:    total time =   788.38 ms

with the tiny model and 16 threads:

$ ./main -m models/ggml-tiny.en.bin -f samples/jfk.wav -t 16
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
...
whisper_print_timings:     load time =    76.03 ms
whisper_print_timings:      mel time =    21.28 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =   231.51 ms / 57.88 ms per layer
whisper_print_timings:   decode time =   100.68 ms / 25.17 ms per layer
whisper_print_timings:    total time =   437.92 ms

On a Raspberry Pi 4 (tiny model, 4 threads):

$ ./main -m models/ggml-tiny.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
...
whisper_print_timings:     load time =   800.65 ms
whisper_print_timings:      mel time =   208.19 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  9256.59 ms / 2314.15 ms per layer
whisper_print_timings:   decode time =   708.53 ms / 177.13 ms per layer
whisper_print_timings:    total time = 10999.69 ms
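To put the three base.en runs on the Ryzen side by side, here is a small sketch that turns the timings into speedups per stage. One assumption: the first run is treated as 4 threads, since its log is elided but the other default runs in this thread report 4 threads. In these figures the encode stage gets faster with more threads while the decode stage actually gets slower.

```python
# Times in ms copied from the base.en runs above (Ryzen 9 5950X).
# Assumption: the run without -t used 4 threads.
runs = {
    4: {"encode": 901.81, "decode": 117.51, "total": 1198.73},
    8: {"encode": 610.55, "decode": 132.92, "total": 895.48},
    16: {"encode": 456.84, "decode": 199.03, "total": 788.38},
}


def speedup(stage, threads, baseline=4):
    """How many times faster `stage` is at `threads` than at `baseline` threads."""
    return runs[baseline][stage] / runs[threads][stage]


for threads in (8, 16):
    parts = ", ".join(
        f"{stage} x{speedup(stage, threads):.2f}"
        for stage in ("encode", "decode", "total")
    )
    print(f"{threads} threads vs 4: {parts}")
```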

Yeah, it scales really well with threads/cores, especially if you have a monster PC like yours.
It has something to do with the author, who seems to be some brilliant scientist: he has created his own tensor library for machine learning, https://github.com/ggerganov/ggml, which is CPU based and optimised for NEON & AVX.

I presume that is a really good test of AMD AVX vs Intel, as isn’t it the 5950X (it might be the 7950X) that does AVX-512 as two 256-bit passes, almost like AVX hyperthreading?

The dev passed me his bench on the Pi4 for the base model, as I no longer have one.

Also from the README, on a MacBook M1 Pro using the medium.en model:

$ ./main -m models/ggml-medium.en.bin -f samples/gb1.wav -t 8

whisper_model_load: loading model from 'models/ggml-medium.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 4
whisper_model_load: mem_required  = 2610.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 1644.97 MB
whisper_model_load: memory size =   182.62 MB
whisper_model_load: model size  =  1462.12 MB

main: processing 'samples/gb1.wav' (3179750 samples, 198.7 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...

[00:00.000 --> 00:08.000]   My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:08.000 --> 00:17.000]   At nine o'clock this morning, Mission Control in Houston lost contact with our Space Shuttle Columbia.
[00:17.000 --> 00:23.000]   A short time later, debris was seen falling from the skies above Texas.
[00:23.000 --> 00:29.000]   The Columbia's lost. There are no survivors.
[00:29.000 --> 00:32.000]   On board was a crew of seven.
[00:32.000 --> 00:39.000]   Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark,
[00:39.000 --> 00:48.000]   Captain David Brown, Commander William McCool, Dr. Kultna Shavla, and Ilan Ramon,
[00:48.000 --> 00:52.000]   a colonel in the Israeli Air Force.
[00:52.000 --> 00:58.000]   These men and women assumed great risk in the service to all humanity.
[00:58.000 --> 01:03.000]   In an age when space flight has come to seem almost routine,
[01:03.000 --> 01:07.000]   it is easy to overlook the dangers of travel by rocket
[01:07.000 --> 01:12.000]   and the difficulties of navigating the fierce outer atmosphere of the Earth.
[01:12.000 --> 01:18.000]   These astronauts knew the dangers, and they faced them willingly,
[01:18.000 --> 01:23.000]   knowing they had a high and noble purpose in life.
[01:23.000 --> 01:31.000]   Because of their courage and daring and idealism, we will miss them all the more.
[01:31.000 --> 01:36.000]   All Americans today are thinking as well of the families of these men and women
[01:36.000 --> 01:40.000]   who have been given this sudden shock and grief.
[01:40.000 --> 01:45.000]   You're not alone. Our entire nation grieves with you,
[01:45.000 --> 01:52.000]   and those you love will always have the respect and gratitude of this country.
[01:52.000 --> 01:56.000]   The cause in which they died will continue.
[01:56.000 --> 02:04.000]   Mankind is led into the darkness beyond our world by the inspiration of discovery
[02:04.000 --> 02:11.000]   and the longing to understand. Our journey into space will go on.
[02:11.000 --> 02:16.000]   In the skies today, we saw destruction and tragedy.
[02:16.000 --> 02:22.000]   Yet farther than we can see, there is comfort and hope.
[02:22.000 --> 02:29.000]   In the words of the prophet Isaiah, "Lift your eyes and look to the heavens
[02:29.000 --> 02:35.000]   who created all these. He who brings out the starry hosts one by one
[02:35.000 --> 02:39.000]   and calls them each by name."
[02:39.000 --> 02:46.000]   Because of His great power and mighty strength, not one of them is missing.
[02:46.000 --> 02:55.000]   The same Creator who names the stars also knows the names of the seven souls we mourn today.
[02:55.000 --> 03:01.000]   The crew of the shuttle Columbia did not return safely to earth,
[03:01.000 --> 03:05.000]   yet we can pray that all are safely home.
[03:05.000 --> 03:13.000]   May God bless the grieving families, and may God continue to bless America.
[03:13.000 --> 03:41.000]   Audio


whisper_print_timings:     load time =   575.92 ms
whisper_print_timings:      mel time =   230.60 ms
whisper_print_timings:   sample time =    73.19 ms
whisper_print_timings:   encode time = 19552.61 ms / 814.69 ms per layer
whisper_print_timings:   decode time = 13249.96 ms / 552.08 ms per layer
whisper_print_timings:    total time = 33686.27 ms

The tiny & base models sort of drop off from the accuracy Whisper has, but boy, the bigger models are amazing: on small and above, “Elon” becomes “Ilan Ramon”, which is actually correct.
It’s a strange repo, as Whisper is an absolute accuracy monster and the model probably really needs partitioning to run across GPU/CPU, and maybe an NPU if available. You are going to use it for its accuracy, and the bigger models are where that is optimised.
From what I have read, the small model really is the minimum. The above is medium on a MacBook Pro M1; @synesthesiam, what does your AMD monster manage, out of curiosity, as a comparison?

I asked the dev for a bench of the Pi4 with base

Rock5b

rock@rock-5b:~/nvme/whisper.cpp$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   313.91 ms
whisper_print_timings:      mel time =   107.60 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6165.18 ms / 1027.53 ms per layer
whisper_print_timings:   decode time =   657.71 ms / 109.62 ms per layer
whisper_print_timings:    total time =  7256.87 ms

Pi4

pi@raspberrypi:~/whisper.cpp $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 4
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB 
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =  1851.33 ms
whisper_print_timings:      mel time =   270.67 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 33790.07 ms / 5631.68 ms per layer
whisper_print_timings:   decode time =  1287.69 ms / 214.61 ms per layer
whisper_print_timings:    total time = 37281.19 ms

The Rock5b is 5.137 times faster than the Pi4 on total time (37281.19 ms vs 7256.87 ms), which is interesting; probably due to the Mac optimisation, which targets the ARMv8.2 architecture and cores?
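If you want to compare runs without doing the division by hand, here is a small sketch that pulls the numbers out of the `whisper_print_timings` lines and computes the ratio per stage. The regular expression and function name are my own and only match the log format shown in this thread.

```python
import re

# Matches e.g. "whisper_print_timings:   encode time =  6165.18 ms / ..."
TIMING_RE = re.compile(r"whisper_print_timings:\s+(\w+) time =\s*([\d.]+) ms")


def parse_timings(log):
    """Return {stage: milliseconds} for each timing line in a whisper.cpp log."""
    return {stage: float(ms) for stage, ms in TIMING_RE.findall(log)}


rock5b_log = """
whisper_print_timings:     load time =   313.91 ms
whisper_print_timings:      mel time =   107.60 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6165.18 ms / 1027.53 ms per layer
whisper_print_timings:   decode time =   657.71 ms / 109.62 ms per layer
whisper_print_timings:    total time =  7256.87 ms
"""

pi4_log = """
whisper_print_timings:     load time =  1851.33 ms
whisper_print_timings:      mel time =   270.67 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 33790.07 ms / 5631.68 ms per layer
whisper_print_timings:   decode time =  1287.69 ms / 214.61 ms per layer
whisper_print_timings:    total time = 37281.19 ms
"""

rock5b = parse_timings(rock5b_log)
pi4 = parse_timings(pi4_log)

for stage in ("encode", "decode", "total"):
    print(f"{stage}: Pi4 / Rock5b = {pi4[stage] / rock5b[stage]:.3f}")
```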

I have been shocked how well it scales, as with TensorFlow (DTLN) on 2 threads you get an improvement but not 2x, and 4 threads seems to make little difference over 2.
It’s an amazing repo for how it scales on CPU, but really it’s a model that you wouldn’t want to run on CPU alone, or with the smaller models.
I’ve been extremely impressed by the accuracy of the bigger models, as wow F…


PS: going back to the original Whisper code, to get CUDA 11.6 I used one of the Nvidia Docker containers instead.
But install the right torch first:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
Then install Whisper
pip install git+https://github.com/openai/whisper.git
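After installing, it is worth checking that the CUDA build of torch was actually picked up, otherwise Whisper quietly runs on the CPU. A minimal check (the helper name is just for illustration):

```python
def torch_device():
    """Report whether torch is installed and whether it can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(torch_device())
```

If this prints "cpu" inside the container, the wrong wheel or a missing GPU passthrough would be the first things I would look at.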
Using https://commons.wikimedia.org/wiki/File:Reagan_Space_Shuttle_Challenger_Speech.ogv (4m48s):
time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model medium.en --threads=8

real 0m42.072s
user 0m46.303s
sys 0m3.591s

time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model small.en --threads=8

real 0m22.323s
user 0m24.127s
sys 0m2.545s

time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model base.en --threads=8

real 0m13.119s
user 0m14.324s
sys 0m2.137s

time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model tiny.en --threads=8

real 0m10.855s
user 0m11.907s
sys 0m2.106s

My GPU is just an RTX 3050 desktop card, but it would also be interesting to see, @synesthesiam, what your GPUs attain; translation to English with the multilingual models might be interesting too.
It’s great what Georgi has done, but would you really want to run on CPU?

I’ve been testing whisper.cpp, and there is roughly a 1.5x to 2x speedup compared to OpenAI’s Whisper implementation, but on my Intel i7-6700T only the tiny model is sort of usable.

$ ./main -m models/ggml-tiny.en.bin -f rhasspy/setkitchenlights20percent.wav -t 8 
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  84.99 MB
whisper_model_load: memory size =    11.41 MB 
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 8 / 8 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'rhasspy/setkitchenlights20percent.wav' (48480 samples, 3.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.000]   Set the kitchen lights to 20%.


whisper_print_timings:     load time =   295.38 ms
whisper_print_timings:      mel time =    18.83 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  1026.03 ms / 256.51 ms per layer
whisper_print_timings:   decode time =    54.58 ms / 13.65 ms per layer
whisper_print_timings:    total time =  1400.99 ms

From my previous tests using the OpenAI model (see my post above):

  • tiny: openai 3.6s vs 1.4s with cpp
  • small: openai 17s vs 10s with cpp

Interestingly, there seems to be a “startup cost” to the model. For instance, the jfk sentence is realtime 11s, and is transcribed in about 1.5s. My utterance of “How much time is there until the laundry timer is done?” is 6s realtime and is transcribed in 1.7s, and “Activate working” utterance is 2s realtime and takes 1.3s to transcribe. (All tiny model.)

I also used ffmpeg to speed up the audio file by 1.5x, and transcription took the same amount of time. (Accuracy of transcription was unaffected.)

I was hoping that the transcription time would be linear with the length of the audio, so that short utterances (like used for a voice assistant) would be fast.
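To put a rough number on that “startup cost”, here is a quick least-squares line through the three tiny-model measurements above (audio length vs transcription time). It is only three approximate data points, so treat the result as an illustration: the intercept is the fixed cost, the slope is the extra time per second of audio.

```python
# (audio seconds, transcription seconds) for the three tiny-model runs above.
samples = [(11.0, 1.5), (6.0, 1.7), (2.0, 1.3)]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, fixed_cost = fit_line(samples)
print(f"fixed cost: {fixed_cost:.2f} s")
print(f"extra time per second of audio: {slope:.3f} s")
```

With these numbers nearly all of the time is fixed cost and almost none of it depends on the length of the utterance.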

Yeah, I presume the “startup cost” is loading the model: each time we run ./main we load it afresh, whereas the code could likely be hacked to retain the model in memory between audio submissions.
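The same idea, sketched in Python for the original Whisper package rather than whisper.cpp: load each model once and hand back the same object for every later request. `ModelCache` is a made-up helper, not something from either project, and it takes the loader as an argument so it can be tried without Whisper installed.

```python
class ModelCache:
    """Load each model once and reuse it for later requests."""

    def __init__(self, loader):
        self._loader = loader  # e.g. whisper.load_model
        self._models = {}

    def get(self, name):
        if name not in self._models:
            # Slow path, taken only the first time a model is requested.
            self._models[name] = self._loader(name)
        return self._models[name]


# Usage with the openai-whisper package (not run here):
#   import whisper
#   cache = ModelCache(whisper.load_model)
#   result = cache.get("tiny.en").transcribe("canceltimer.wav", language="en", fp16=False)
#   print(result["text"])
```

In a long-running service every request after the first skips the load step entirely.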

It’s a really interesting model, but not so great for a voice assistant, as basically it works on a 30 second sliding window, and its accuracy comes less from what it hears than from how it interprets that context.
That is why it can sometimes get things very wrong and still make perfect sense in English.
Because of that it really doesn’t fit a streaming model, and the streaming examples seem to up the load maybe 4-6 times. CPU-wise I could only get the tiny.en model to work, and the results were much poorer than feeding non-streamed audio, on an Intel(R) Xeon(R) CPU E3-1245 v5.
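The fixed 30 second window also goes some way to explaining why short utterances are not proportionally faster. The reference Whisper code pads (or trims) every input to 30 seconds of 16 kHz audio before the encoder sees it, and the `n_audio_ctx = 1500` in the whisper.cpp logs above is consistent with the same window, though I have not checked its source. A small sketch of the arithmetic, with a list-based stand-in for the padding step:

```python
SAMPLE_RATE = 16_000   # Hz
CHUNK_SECONDS = 30     # fixed window length
HOP_LENGTH = 160       # samples per mel frame (10 ms)
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def pad_or_trim(samples, length=N_SAMPLES):
    """Zero-pad short audio, or cut long audio, to exactly `length` samples."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


# Sample counts taken from the logs above: the 3 s command and the 11 s JFK clip.
for n in (48_480, 176_000):
    padded = pad_or_trim([0.0] * n)
    mel_frames = len(padded) // HOP_LENGTH
    encoder_positions = mel_frames // 2  # the encoder halves the frame rate
    print(f"{n} samples -> {len(padded)} padded, {mel_frames} mel frames, "
          f"{encoder_positions} encoder positions")
```

Both clips end up as the same amount of encoder work, which matches the encode time dominating the totals in the logs.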

Is it possible to accelerate whisper with something like a google coral on a raspberry pi?

Yes.

Someone here has converted the model from PyTorch to TFLite and then quantised it to int8, so I think it’s likely, but it still depends on the layers, as some layers such as GRU/LSTM still don’t convert.
It should, however, accelerate standard TFLite CPU inference over the PyTorch model, and examples can be found here.


PS Whisper.cpp has had some updates and the streaming model is supposedly working much better.
