Whisper by OpenAI

Anyone checked out Whisper by OpenAI yet for ASR?

Yeah, I have been doing some tests, and the ASR is really good, but on my CPU-only machine, even with the tiny.en model it runs at best in real time, so it takes around 3s to transcribe a 3s utterance, which IMO is too slow to be usable.

For reference, I’m running a base+satellite configuration, with ASR done on the base, an Intel(R) Core™ i7-6700T CPU @ 2.80GHz with 24 GB RAM as my home server.

$ time whisper setkitchenlights20percent.wav --model tiny.en --language en --fp16 False

[00:00.000 --> 00:03.000]  Set the kitchen lights to 20%.

real	0m3.573s
user	0m9.131s
sys	0m1.310s

The tiny.en and base.en models transcribed “cancel the timer” as “pencil the timer”. small.en got it right, but took 17 seconds on my CPU.

$ time whisper canceltimer.wav --model small.en --language en --fp16 False --task transcribe
[00:00.000 --> 00:02.000]  Cancel the timer.

real	0m17.479s
user	0m52.986s
sys	0m4.118s

My current config is RPi 4Bs as satellites and the i7 as the base (also running Home Assistant, Node-RED, MQTT, etc.). I am really happy with the performance of the “Hey Mycroft” wake word, Mozilla DeepSpeech for ASR, Fsticuffs for intent recognition, and Mimic3 ljspeech TTS (and Node-RED for fulfilment).

DeepSpeech seems resilient to music + microwave running in a kitchen with lots of reverberation.

Hopefully Whisper CPU performance improves, or … could it run on a Coral TPU? I think many home automation setups already have (or want) a Coral for running Frigate; it would be nice to use it for ASR as well.

I created a Google Colab notebook to test with a GPU:

import time
import whisper

# Load the small English-only model (downloaded on first use)
model = whisper.load_model("small.en")

options = dict(task="transcribe", language="en")

st = time.time()
result = model.transcribe("whatstheweathertoday.wav", **options)
print(time.time() - st)
print(result["text"])

0.4277827739715576
 What's the weather today?

0.4s on GPU vs 17s on CPU.

With the tiny.en model it takes 0.2s on GPU vs 3s on CPU.

The tiny.en model also correctly transcribed “Start a timer for 13 minutes and 33 seconds”, whereas Mozilla DeepSpeech (for me) confuses numbers like 33 with 53.

Yo, I kind of want to see your full setup with the satellites and base! Sounds like you’re using multiple satellites, including one in the kitchen? What are you using to process utterances and execute commands? Node-RED? What’s your timer setup like? Does the timer go off on all the satellites or just the one it came from?

Yes, I’m helping with a university project for older people (of which I’m one). The main device is Pi 4 based, so it clearly won’t cope (not a criticism; really good transcription needs power).

So I’m thinking about wrapping the whisper executable in a web service/server and showing a ‘please wait’, especially as I have large chunks of voice. If I get anywhere, I’ll open-source the bits and pieces; it’s more integration than pure development.
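
Something like this is what I have in mind (just a sketch, assuming Flask for the web service; the endpoint name, form field, and model choice are placeholders rather than anything that exists yet):

# Minimal sketch: wrap the whisper CLI in a local web service.
# Assumes Flask is installed, whisper is on PATH, and the client
# uploads the audio as an "audio" form field in a POST request.
import subprocess
import tempfile

from flask import Flask, request

app = Flask(__name__)

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # Write the uploaded audio to a temporary file for the CLI to read
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        request.files["audio"].save(tmp.name)
        wav_path = tmp.name

    # Run the whisper executable with the same flags used earlier in the
    # thread; on CPU this can take a while, hence the 'please wait'
    proc = subprocess.run(
        ["whisper", wav_path, "--model", "tiny.en",
         "--language", "en", "--fp16", "False"],
        capture_output=True, text=True,
    )
    return {"transcript": proc.stdout.strip()}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)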

Hi @Hugh_Barnard - have you seen openai/whisper – Run with an API on Replicate?

They offer GPU transcription via an API. I am thinking of running a simple Flask webserver locally that then posts to the Replicate API, to get access to a GPU without having to buy one myself. Pricing is pretty good, at $0.00055 per second.
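
The local webserver would basically just forward the audio, something like this (rough sketch using the replicate Python client; the model version string and the input field names are assumptions to check against the openai/whisper model page on Replicate, and REPLICATE_API_TOKEN needs to be set in the environment):

# Rough sketch: send a local WAV to Replicate's hosted Whisper.
# pip install replicate; the version hash and input field names
# ("audio", "model") are assumptions, check the model page.
import replicate

with open("canceltimer.wav", "rb") as audio:
    output = replicate.run(
        "openai/whisper:<version-hash>",  # placeholder, copy from Replicate
        input={"audio": audio, "model": "small"},
    )

print(output)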

Thanks, this certainly suits me for proof-of-concept work.