Using OpenAI's Whisper API for speech-to-text

I am glad that you got it working. Thanks for posting your notes to help others in the future.

From your rhasspy/profiles/ru I’m assuming you’re speaking Russian? My experience of Whisper for speech-to-text has been good, but I am a native English speaker using Whisper in English, which may give better accuracy than it does for Russian.

Whisper’s API docs include a language field. Perhaps add this to the curl command:

  -F language=ru \

to hint that you’re speaking Russian and constrain the model. It will be interesting to see if that improves your experience while music is playing.


BTW, since Whisper will transcribe any speech into text (not just what’s in sentences.ini), you might want to explore using GPT-3 for “smarter” intent resolution.
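As a sketch of that idea (the intent names, prompt wording, and model choice here are all my own assumptions for illustration, not anything Rhasspy ships with):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: ask a GPT chat model to map a free-form Whisper
# transcript onto one of a known set of intents. Intent names and the
# prompt wording are made up for this example.
transcript="turn the kitchen light on please"
intents="ChangeLightState, GetTime, GetTemperature"

# Build the chat-completion request body
payload=$(cat <<EOF
{"model": "gpt-3.5-turbo",
 "messages": [{"role": "user",
  "content": "Known intents: ${intents}. Reply with only the single best matching intent name for: ${transcript}"}]}
EOF
)
echo "$payload"

# Then POST it, e.g.:
# curl -s https://api.openai.com/v1/chat/completions \
#   -H "Authorization: Bearer $OPENAI_API_KEY" \
#   -H "Content-Type: application/json" -d "$payload"
```

The model’s one-word reply can then be matched against your intent handlers instead of relying on an exact sentences.ini match.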


That’s right, I am a native speaker of Russian. Thank you, that helped: it now recognises Russian clearly. I checked recognition while music is playing and it works better than Kaldi. It would be great if something similar existed without the cloud, as a fully local solution.

I’ll leave an example of the code with the Russian language added. You can change the language to your own:

#!/usr/bin/env bash
# First argument is the OpenAI API key; the WAV audio arrives on stdin

wav_file=/home/respeaker/.config/rhasspy/profiles/ru/speech.wav
cat > "$wav_file"

curl https://api.openai.com/v1/audio/transcriptions \
  -X POST \
  -H "Authorization: Bearer $1" \
  -H "Content-Type: multipart/form-data" \
  -F file=@"$wav_file" \
  -F model=whisper-1 \
  -F response_format=text \
  -F language=ru \
  | sed -e "s/[[:punct:]]//g" -e "s/\(.*\)/\L\1/"   # strip punctuation, lowercase
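For anyone adapting the script: the final sed stage matters because Rhasspy matches intents against lowercase, punctuation-free text. Note that \L is a GNU sed extension (BSD/macOS sed would need tr instead). A quick check of what it does:

```shell
# Normalise a transcript the same way the script above does:
# strip punctuation, then lowercase the whole line (GNU sed \L).
echo 'Turn on the Light, please!' | sed -e "s/[[:punct:]]//g" -e "s/\(.*\)/\L\1/"
# → turn on the light please
```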

Glad it’s working better. Thanks for the updated script.

There are local-only options.

Plus Dockerised versions (see my example usage).

The challenge is that the larger models are slow on CPU, and the small/tiny models are not very good (and still slower than Kaldi). You could run on a GPU (but then pay for the power usage), and AFAIK none support the Coral TPU yet.

If you do manage to get a performant, local-only setup, let us know!


I’m very impressed with Whisper. I’m currently using faster-whisper with the tiny.en model on a laptop CPU (damn Pi shortage), and it can hear my requests from 6 metres away on a basic built-in mic quite easily.
I’m so impressed that I’m currently experimenting with doing away with waiting for the hotword trigger, so I can just say “{hotword} play some music” (or whatever) and it does it, instead of saying {hotword}, waiting for the response, then giving the command.

It feels like a much more natural and intuitive way to interact with the system. The only problem I foresee is with VAD in noisy environments, and it may require a more flexible intent parser to handle things better; these are arguably some of the same problems we have with the current system.

In short, I think Whisper is ahead of the competition as an STT engine; you just need to find the right hardware to run it on for your needs.

It’s also part of Rhasspy 3.0.

I am not so sure about Whisper on lower-end hardware, especially a Pi, as WER increases drastically as model size decreases.

Also, those tests didn’t cover command-style phrases such as ‘turn on the light’ or conversational types, where WER increases drastically because there is less context for the built-in NLP.
I am finding that I’d really like to use the small model as the best trade-off, but due to the load I have to use base or tiny.

Mainly I test on whisper.cpp, as many other projects still use its ggml as a base.

If I run the bench test from whisper.cpp, q8_0 quantisation seems to give the best results:

| CPU | OS | Config | Model | Th | Load (ms) | Enc. (ms) | Commit |
| --- | -- | ------ | ----- | -- | --------- | --------- | ------ |
| rk3588 | Ubuntu 22.04 |  NEON | tiny | 4 | 124 | 1196 | d458fcb |
| rk3588 | Ubuntu 22.04 |  NEON | tiny-q8_0 | 4 | 95 | 1031 | d458fcb |
| rk3588 | Ubuntu 22.04 |  NEON | base | 4 | 156 | 2900 | d458fcb |
| rk3588 | Ubuntu 22.04 |  NEON | base-q8_0 | 4 | 118 | 2511 | d458fcb |
| rk3588 | Ubuntu 22.04 |  NEON | small | 4 | 378 | 10874 | d458fcb |
| rk3588 | Ubuntu 22.04 |  NEON | small-q8_0 | 4 | 244 | 7944 | d458fcb |
./main -m models/ggml-tiny-q8_0.bin -f ./samples/jfk.wav
whisper_print_timings:     load time =   114.51 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   287.05 ms
whisper_print_timings:   sample time =    24.77 ms /    25 runs (    0.99 ms per run)
whisper_print_timings:   encode time =  1032.13 ms /     1 runs ( 1032.13 ms per run)
whisper_print_timings:   decode time =   121.98 ms /    25 runs (    4.88 ms per run)
whisper_print_timings:    total time =  1640.46 ms

./main -m models/ggml-base-q8_0.bin -f ./samples/jfk.wav
whisper_print_timings:     load time =   142.35 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   286.55 ms
whisper_print_timings:   sample time =    25.72 ms /    26 runs (    0.99 ms per run)
whisper_print_timings:   encode time =  2493.89 ms /     1 runs ( 2493.89 ms per run)
whisper_print_timings:   decode time =   185.91 ms /    26 runs (    7.15 ms per run)
whisper_print_timings:    total time =  3214.44 ms

./main -m models/ggml-small-q8_0.bin -f ./samples/jfk.wav
whisper_print_timings:     load time =   265.50 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   287.76 ms
whisper_print_timings:   sample time =    26.94 ms /    27 runs (    1.00 ms per run)
whisper_print_timings:   encode time =  7844.49 ms /     1 runs ( 7844.49 ms per run)
whisper_print_timings:   decode time =   475.21 ms /    27 runs (   17.60 ms per run)
whisper_print_timings:    total time =  9036.22 ms
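Working the Enc. column out as ratios, q8_0 buys roughly a 13–27% encoder speedup over the unquantised models at these sizes. A quick check of that arithmetic, with the numbers taken straight from the bench table above:

```shell
# Encoder times (ms) from the whisper.cpp bench table: model, fp16, q8_0
for pair in "tiny 1196 1031" "base 2900 2511" "small 10874 7944"; do
  set -- $pair
  awk -v m="$1" -v f="$2" -v q="$3" \
    'BEGIN { printf "%s: q8_0 is %.0f%% faster\n", m, (1 - q/f) * 100 }'
done
# → tiny: q8_0 is 14% faster
# → base: q8_0 is 13% faster
# → small: q8_0 is 27% faster
```

So the gain grows with model size, which is why q8_0 small is the interesting case for the load trade-off mentioned above.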

Also, it works for many languages, but not all with the same level of accuracy, and that varies hugely with how you run it. The WER levels are not that great compared to the supposed second-best options, Wav2Vec2 and ESPnet2, whilst Kaldi seems to be getting left behind.

I have been looking for a Wav2Vec2.cpp but without success, as you can greatly increase accuracy by splitting Whisper’s awesome all-in-one approach into domain-specific models, which give better accuracy for the same load.

The Raspberry Pi stock situation has been drastic and will likely have a big effect on Raspberry Pi, as it has created a much wider market of alternatives.

I think the Opi5 4GB at £63.08 is probably a great bit of kit for all the great models of late.
It idles at 1.6 watts at the plug, yet has far more CPU power, plus a modern Mali-G610 GPU and a 6 TOPS NPU (which is really 3x 2 TOPS, but hey).

Once you have purchased an SD card, cooling and a PSU, even for a Pi, there are alternatives, and the commercial micro PCs, especially the i3s, are pretty good for the money.
I have a Dell 3050 Micro i3-7100T 8GB, sort of between a NUC and a Mac Mini in form factor, which just needed a second-hand 240 GB NVMe (£15 off eBay) and has a great case, cooling and PSU.
So for around £75, and it could even be a tad cheaper.
Idle goes up to 8 watts, but that is still good, and it is a bit faster than the RK3588.
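For what it’s worth, the idle figures above translate into running cost like this (the 30p/kWh electricity rate is my own assumption; the wattages are from the posts above):

```shell
# Yearly idle cost = watts / 1000 * hours per year * price per kWh
awk 'BEGIN {
  rate  = 0.30        # GBP per kWh (assumed)
  hours = 24 * 365
  printf "Opi5 at 1.6 W idle: %.2f GBP/yr\n", 1.6 / 1000 * hours * rate
  printf "Dell at 8 W idle:   %.2f GBP/yr\n", 8   / 1000 * hours * rate
}'
# → Opi5 at 1.6 W idle: 4.20 GBP/yr
# → Dell at 8 W idle:   21.02 GBP/yr
```

So the extra idle draw of the Dell is real but small money compared with the hardware price difference.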