State of "the art" as it were

Hi all, and thank you for putting into my brain a great number of things that I would never have thought of, even if I don’t always know what the hell you are on about.
Anyway, I am embarking on this (new to me) RPi project to displace a small herd of Alexas, which is probably more than I am capable of, so I need all the help I can get. For starters I would very much like a rundown on where I can expect to get to, and with what hardware/software I would have the best chance of getting there. I have searched under various terms, from OS to microphones, and have learnt a lot of stuff, and I do understand that the knowledgeable may not want to promote a particular item or brand. But an updated 2024 guide, for the willing but less-than-expert person who wants to order some stuff and get started but doesn’t want to end up with a drawer full of junk that he never used, would be very welcome…
Thanks for everything you have already done and thanks in advance for that which you might do in the future.

Hi Johnny, and welcome to the rabbit warren that is voice assistants :wink:

Firstly, you will not get the same experience from any open source project as you do from Alexa. They have invested huge $ in developing their hardware and software, and especially in keeping their key algorithms proprietary. Rhasspy (and other open source projects) are getting closer … but their big advantage is local control, if you can put up with a few rough edges.

Secondly, if you are using Home Assistant I recommend you investigate HA Voice Assist instead. It has been built by Mike Hansen, who developed Rhasspy, and is effectively its successor, closely integrated with Home Assistant - which seems to be what 75% of users were using Rhasspy for anyway.

Thirdly, voice assistant devices are a hot topic. When Rhasspy was developed, Raspberry Pis were cheaper and readily available, and with reSpeaker (and clone) HATs made a good voice assistant satellite … but I no longer recommend them unless you already have a RasPi sitting around unused. Also note that there are plenty of multi-mic boards which lack the firmware or drivers needed to actually take advantage of more than one mic.

Over at Home Assistant, they have done work improving the Espressif ESP32-S3-Box as a voice assistant, and there are hints that Nabu Casa are “looking at” making their own satellite hardware devices. Conference mics can be good, but tend to be rather expensive and assume no background noise.

Basically there is no clear winner in the voice satellite hardware category at the moment.

Thanks for your reply. Yes, I am happy to put up with a few rough edges in exchange for the ability to program out the bad behaviour I experience with Alexa. I find myself regularly shouting “Alexa, ALEXA !! STOP !!” because she “misheard” my song choice and went ahead and played whatever the hell song without any confirmation. I hope I would be able to program into the behaviour of “RaspBarry” (what my kids have started calling the as-yet non-existent Alexa replacement that I keep mentioning) a quick confirmation before playing if there was any doubt. I could program to avoid that kind of thing, couldn’t I?
Yes, I have a Raspberry Pi, bought for a small kiosk task displaying some solar panel stats, but I think it can cope with a voice assistant task as well?
I am not currently using HA, but if it is a good thing to use for a voice assistant then why not. It can operate local-only, and that’s what I want.

The main stumbling block currently is the ASR engine, which uses OpenAI’s Whisper.

The above gives some info on the WER (Word Error Rate).
Whisper is a conversational ASR (Automatic Speech Recognition) system based on an LLM (Large Language Model), and this is where the rated WER and what users actually experience start to depart.
It uses an LLM with a beam search; I have forgotten exactly how many branches and what input length, maybe 10 branches over 20 sec of audio.
The more branches, the more possibilities Whisper will consider, and the longer the audio chunk, the more context the LLM part of Whisper has when choosing its results.
To squeeze it into what was, at the time, the maximum a Pi4 could handle, many of the parameters have been cut down to reduce computational complexity, at the cost of accuracy. WER starts to climb with the smaller models, very short beam searches and small audio chunk sizes; the worst scenario is likely streaming use with very small chunks.
Also, when it comes to language, it generally favours native speakers; regional accents or non-native speech can also increase WER.
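To make the beam trade-off concrete, here is a toy beam search in Python. It is nothing like Whisper’s real decoder; the token table and probabilities are invented purely to show why a wider beam can land on a different (and overall more probable) transcription than a greedy search:

```python
import math

# Toy next-token model: given the last word, return candidate
# continuations with probabilities. Invented values, purely illustrative.
def next_tokens(prefix):
    table = {
        "": [("play", 0.6), ("pray", 0.4)],
        "play": [("the", 0.5), ("a", 0.5)],
        "pray": [("the", 0.9), ("a", 0.1)],
    }
    last = prefix.split()[-1] if prefix else ""
    return table.get(last, [("<eos>", 1.0)])

def beam_search(beam_size, steps=2):
    # Each hypothesis is (text, summed log-probability).
    beams = [("", 0.0)]
    for _ in range(steps):
        candidates = []
        for text, lp in beams:
            for tok, p in next_tokens(text):
                candidates.append(((text + " " + tok).strip(), lp + math.log(p)))
        # Keep only the `beam_size` most probable hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# A greedy search (beam of 1) commits to "play" early; a wider beam keeps
# "pray" alive long enough for its higher overall probability to win.
print(beam_search(beam_size=1)[0][0])
print(beam_search(beam_size=3)[0][0])
```

More beams means more of these parallel hypotheses, which is exactly what gets cut down to fit small hardware.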

The good thing about Whisper is that it was trained on a typical microphone situation, similar to what streamers and gamers would use, which really is near-field (<0.3 m). It is quite tolerant of room echo out to about 1 m, but fails far-field.
So you can pick up a cheap USB mic or use an onboard one, as those were likely in the training dataset, as were audiobooks, TED talks and other material that likely had a transcription.

You don’t need the confusion of Rhasspy to get to the first hurdle, which is where open source departs from commercial home assistants.
You can grab a copy of Whisper; GitHub - ggerganov/whisper.cpp: Port of OpenAI's Whisper model in C/C++ is likely a good start, as it has a large dev community that will probably help you along, and it is littered with references and third-party code.

As for Alexa, I have a 4th Gen Echo, as from time to time I do purchase the latest and greatest to review. That replaced my Nest Audio, which I consider superior. I lost interest with the Gen 5, as even previous versions are still on a different level to what we have here, and as far as I know there is no music player, even though one is mentioned in the API.

Rhasspy and HA currently use Whisper for ASR, so with a simple desk mic for dev trials you can get going with very little outlay - in fact none, as you likely have a headset you can plug into a Pi.
Currently there is no AEC (Acoustic Echo Cancellation) or input audio processing, so the audio being played will be above what Whisper can tolerate, and no matter how many times you shout, it will not hear you.

Whisper is a good start if you want a quick introduction to voice assistants, which can very much be a rabbit warren.
For your ‘barge-in’ scenario, Rhasspy doesn’t have anything implemented and is hoping for magic hardware or conference mics.
Also, with the weird and wonderful band names and tracks we have, these can make for some very unusual sentences for an LLM.
Likely having a play will help you think about context, as “Play me some music by the band The Beatles” is laden with context and will likely work much better than “Play me The Beatles”, as an example, for all bands with weird and wonderful titles.

“Alexa, who is playing” annoys me the most: rather than telling me what the current radio track is, it is even stranger to get the scores from the Bundesliga, especially as I am English.
“Alexa, stop” always works for me, but it does often get the wrong band; still, it would really annoy me to have to listen to an announcement of what was about to be played rather than it just playing.
Again, I presume it’s about context: notice that short band names such as ‘Squid’ or ‘Shame’ are often the most problematic. But you have Whisper, which is the ASR in use, so you can dev and test away without much waste of time.

Home Assistant is very focussed on controlling and automating household operations - such as closing blinds and turning on lights at sunset, and reminding you that you have left the garage door open at 7pm.
Voice Assist is only one tiny component of the Home Assistant project, so I suggest it would be overkill for your requirement. However, that is where I have come from, so I haven’t paid much attention to the other voice assistant projects out there. I expect other people can give better recommendations.

Right. Quite complex then. I’ve read in some of the threads on here about mics spread around the room; would that significantly improve the long-range performance? I will also separate the mic(s) and speaker(s) by a couple of metres.

My Alexa does already announce what is about to play followed by “…by Amazon Music” so if this was more of a question rather than a statement plus a half-second wait for a “yes” or “no” I would be ok with that. I probably only want a confirmation if the certainty of understanding was low, assuming there is some useable measurement of that.

On another topic: I want to have a voice-operated intercom like Alexa should provide but doesn’t anymore. Is this easy enough to do with Rhasspy or HA or am I opening a whole new rabbit warren ?

Haven’t tried it, but look at the options of Whisper.cpp:

usage: ./main [options] file0.wav file1.wav ...

  -h,        --help              [default] show this help message and exit
  -t N,      --threads N         [4      ] number of threads to use during computation
  -p N,      --processors N      [1      ] number of processors to use during computation
  -ot N,     --offset-t N        [0      ] time offset in milliseconds
  -on N,     --offset-n N        [0      ] segment index offset
  -d  N,     --duration N        [0      ] duration of audio to process in milliseconds
  -mc N,     --max-context N     [-1     ] maximum number of text context tokens to store
  -ml N,     --max-len N         [0      ] maximum segment length in characters
  -sow,      --split-on-word     [false  ] split on word rather than on token
  -bo N,     --best-of N         [5      ] number of best candidates to keep
  -bs N,     --beam-size N       [5      ] beam size for beam search
  -wt N,     --word-thold N      [0.01   ] word timestamp probability threshold
  -et N,     --entropy-thold N   [2.40   ] entropy threshold for decoder fail
  -lpt N,    --logprob-thold N   [-1.00  ] log probability threshold for decoder fail
  -debug,    --debug-mode        [false  ] enable debug mode (eg. dump log_mel)
  -tr,       --translate         [false  ] translate from source language to english
  -di,       --diarize           [false  ] stereo audio diarization
  -tdrz,     --tinydiarize       [false  ] enable tinydiarize (requires a tdrz model)
  -nf,       --no-fallback       [false  ] do not use temperature fallback while decoding
  -otxt,     --output-txt        [false  ] output result in a text file
  -ovtt,     --output-vtt        [false  ] output result in a vtt file
  -osrt,     --output-srt        [false  ] output result in a srt file
  -olrc,     --output-lrc        [false  ] output result in a lrc file
  -owts,     --output-words      [false  ] output script for generating karaoke video
  -fp,       --font-path         [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
  -ocsv,     --output-csv        [false  ] output result in a CSV file
  -oj,       --output-json       [false  ] output result in a JSON file
  -ojf,      --output-json-full  [false  ] include more information in the JSON file
  -of FNAME, --output-file FNAME [       ] output file path (without file extension)
  -ps,       --print-special     [false  ] print special tokens
  -pc,       --print-colors      [false  ] print colors
  -pp,       --print-progress    [false  ] print progress
  -nt,       --no-timestamps     [false  ] do not print timestamps
  -l LANG,   --language LANG     [en     ] spoken language ('auto' for auto-detect)
  -dl,       --detect-language   [false  ] exit after automatically detecting language
             --prompt PROMPT     [       ] initial prompt
  -m FNAME,  --model FNAME       [models/ggml-base.en.bin] model path
  -f FNAME,  --file FNAME        [       ] input WAV file path
  -oved D,   --ov-e-device DNAME [CPU    ] the OpenVINO device used for encode inference
  -ls,       --log-score         [false  ] log best decoder scores of tokens
  -ng,       --no-gpu            [false  ] disable GPU

If I remember rightly, Rhasspy uses faster-whisper, but when it gets embedded you lose much of the API that Whisper.cpp has.
That’s why I suggest Whisper.cpp: with the above CLI options, from “number of best candidates to keep” to “word timestamp probability threshold”, likely the answer to your question is yes.
Also, running the biggest and best model with a headset, and seeing the results on some of the command sentences Alexa fails on, should give you a good idea.
Unless you’re going to dedicate a Mac Mini or GPU to Whisper… I am pretty sure only the large model was peer reviewed, and the WER on the smaller ones is extremely optimistic. (Could be my British English vs American.)
You don’t have to think about Mics as you can just use a headset and test what would likely be the very best results you will ever get.
I think if you play with the CLI options of Whisper.cpp and the json output options, you can get something like you are asking for.
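As a sketch of that idea, here is how you might turn per-token probabilities into a “confirm before playing” decision. The JSON layout below only mimics the style of `--output-json-full`; the exact schema varies between whisper.cpp versions, so treat the field names as assumptions to verify against your own output:

```python
import json

# Hypothetical payload in the style of whisper.cpp `--output-json-full`;
# field names are assumptions, check them against your actual output.
raw = json.dumps({
    "transcription": [{
        "text": " Play some music by Squid",
        "tokens": [
            {"text": " Play", "p": 0.98},
            {"text": " some", "p": 0.97},
            {"text": " music", "p": 0.95},
            {"text": " by", "p": 0.96},
            {"text": " Squid", "p": 0.41},  # short band name, low confidence
        ],
    }]
})

CONFIRM_BELOW = 0.6  # ask the user before acting if any token is this unsure

def needs_confirmation(payload, threshold=CONFIRM_BELOW):
    data = json.loads(payload)
    worst = min(
        tok["p"]
        for seg in data["transcription"]
        for tok in seg["tokens"]
    )
    return worst < threshold, worst

ask, worst = needs_confirmation(raw)
print(f"lowest token probability {worst:.2f} -> confirm first: {ask}")
```

So the “only confirm when uncertain” behaviour you described is plausible to build on top of the JSON output, even if the commercial assistants don’t offer it.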

Wasn’t the voice-operated intercom part of the frigate/camera setup and just used the 2way audio of security cams?

It should be very easy to implement if we had a zonal mic and audio system.

PS: with multiple distributed zone mics it is simple physical positioning: you will always be nearer to one of the mics, so you don’t need to try to do long range, or at least you minimise the range.
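A minimal sketch of that nearest-mic idea, assuming each zone delivers short audio frames. A real system would gate on the wake word first, and the sample values here are made up:

```python
import math

# Pick the zone whose mic hears the speaker loudest, using RMS energy
# over a short frame; the nearest mic normally wins.
def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def loudest_zone(frames_by_zone):
    return max(frames_by_zone, key=lambda z: rms(frames_by_zone[z]))

frames = {
    "kitchen": [0.02, -0.01, 0.03, -0.02],   # distant, quiet
    "lounge":  [0.40, -0.35, 0.38, -0.42],   # speaker is nearest this mic
    "bedroom": [0.01, 0.00, -0.01, 0.01],
}
print(loudest_zone(frames))
```

Picking the loudest mic is the crude version; commercial gear does much fancier beamforming, but the zoning logic itself is this simple.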

So a Pi 4 with 8gb ram is probably not powerful enough for the best version of Whisper ? I don’t think I would get anything bigger just yet so if that is the case I will have to make do for the moment.
So I need to look into a zonal mic and audio system.

If I remember rightly, with a Pi4 we were really talking about the tiny or base models, quantised, which gain a bit more WER through the quantisation.
Likely you can run the large model, but the delay for transcription will be substantial; for testing, though, it is possible.
Due to the newer instructions of Armv8.2, the Pi5 & OPi5 can be ~6x faster than a Pi4 for ML, but still only really usable with the small or medium models.
The newer Cortex-A55 little cores, thanks to their mat-mul vector instructions, are faster than the Cortex-A72 big cores of a Pi4.

It has always confused me that, with an emphasis on the Raspberry Pi, Whisper was chosen, especially as this was before the Pi5.
Whisper with the large models posts SotA scores, but when you’re running heavily quantised tiny or base models the WER is no longer the same, by quite a long way.

Have a play, as you can still run the models with 8GB, especially with zram, but transcription will be slow.
With music and wonderful band names, these are quite nonsensical sentences compared to the likely training data of Whisper; give it a test and see how you go.

There are two great zonal audio systems, Snapcast and Squeezelite, which are likely a bit simpler than DIY with PulseAudio or another audio server.

My fave is Snapcast: you create streams, and you can join as many clients to a stream as you wish. Generally a stream is a zone (room) with a single client.
I think it will run on a Pi0, but with so little difference in cost get a Pi02, which will have no problem.
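For illustration, a minimal snapserver.conf sketch with one named-pipe stream per zone; the source URI syntax here is from memory, so check the Snapcast docs for your version:

```ini
# /etc/snapserver.conf (sketch): one PCM stream per zone; clients
# joined to a stream play in sync within that zone.
[stream]
source = pipe:///tmp/snapfifo-kitchen?name=Kitchen
source = pipe:///tmp/snapfifo-lounge?name=Lounge
```

Anything that can write PCM to the named pipe (a music player, TTS output) then plays in that zone.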

Squeezelite is even lighter and also runs on ESP32 boards, but as said I have a preference for Snapcast.
Dunno about zonal microphones, but it’s something I have been going on about for a while: keyword- and server-activated zonal microphones should be a thing.

PS: I have tried a few budget pan-tilt cams with bi-directional audio, and so far they have been pretty useless, with big latency.
I do have an Anker C300, and its dual mic is better than the Anker PowerConf I have.
The advantage of the PowerConf ‘conference mic’ is the built-in audio and AEC.
You can do the same with SpeexAEC, which is open source, using a webcam, and it is likely better than a pan-tilt, as the servos are extremely noisy and the probable cause of the bad audio.
I guess there are premium products that are much better, but they are out of my testing price range.

I have loads of questions: I will go away and do some more study and come back when I’ve got a better idea of what I need to ask you.
Thanks for your input so far.

Quick question though: talking zonal, do we mean zones within the same room for mic response or zones within a house as in for the intercom function ?

Edit: Any difference between 32 bit and 64 bit Pi OS for my purpose ?

for me the whisper error rate is too high. it’s like the old pocketsphinx libs from Carnegie Mellon; my success rate is near 10%

one of my projects uses the google speech reco, which, while cloud, is pretty responsive and accurate.

i made a wyoming asr using it

wyoming uses centralized asr, with satellite speech capture and playback.
but reco is done at the end of the speech not inline word by word

no one has built a good person speaker location dependent engine yet. alexa timers etc are alexa echo box specific… not where i am.

there are some that are trying, but human speaker location is a problem, as you currently have to announce yourself. speaker-dependent recognition is also a ways off. (recognize joe vs sue vs bob)

and then we need the skills type interface.
what words invoke what service to get you the results you want. ( and where you are)

It’s not just the accuracy; inheriting a huge model like that for training is even worse.
Consider the resources needed to finetune Whisper: not many have a spare RTX3090 24GB.

Wenet are quite a bit ahead: by deliberately choosing older tech they can create on-the-fly language models and also do multi-modal ASR.
Certain operations, aka domains, don’t need a general-purpose, complete-language ASR.
A domain such as control, which has a small number of predicates (turn on, set to…) and entities (light, curtain…), can have small language models specific to that domain.
This allows much smaller models to run on lesser hardware, because the language model is purely for a specific domain.
Posted this before but they explain this eloquently.

LM for WeNet
WeNet uses n-gram based statistical language model and the WFST framework to support the custom language model. And LM is only supported in runtime of WeNet.

Why n-gram based LM? This may be the first question many people will ask. Now that LM based on RNN and Transformer is in full swing, why does WeNet go backward? The reason is simple, it is for productivity. The n-gram-based language model has mature and complete training tools, any amount of corpus can be trained, the training is very fast, the hotfix is easy, and it has a wide range of mature applications in actual products.
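As an illustration of why n-gram LMs train instantly and hot-fix easily (this is nothing like WeNet’s actual WFST runtime, just the counting idea), here is a toy bigram model built from a four-sentence “control domain” corpus:

```python
import math
from collections import defaultdict

# Train a toy bigram LM from a tiny "control domain" corpus. Hot-fixing
# (adding a new entity) is just appending a sentence and recounting --
# no GPU, no fine-tuning.
corpus = [
    "turn on the light",
    "turn off the light",
    "turn on the curtain",
    "set the light to red",
]

counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1

def logprob(sentence):
    """Add-one smoothed bigram log-probability of a sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    vocab = len(counts) + 1  # smoothing denominator
    lp = 0.0
    for a, b in zip(words, words[1:]):
        follow = counts.get(a, {})
        total = sum(follow.values())
        lp += math.log((follow.get(b, 0) + 1) / (total + vocab))
    return lp

# In-domain commands score far higher than out-of-domain text.
print(logprob("turn on the light"))
print(logprob("play me some squid"))
```

That gap between in-domain and out-of-domain scores is what lets a tiny domain LM bias recognition towards the commands it actually expects.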

SpeechBrain also, and many others, such as Vosk.

Also, big data doesn’t want your voice in the cloud, as it’s expensive; Google has halted all Assistant dev unless it’s on-device, where the Tensor TPU in Pixel phones and tablets uses the user’s electricity and hardware for free.
Apple have apparently cancelled the Mac Mini M3, as the next release will be an M4, with Siri and generative AI on-device.
Amazon just leaks $ like a sieve and makes a loss on some great sounding Gen4 Echo devices :slight_smile:

Each realises that revenue comes from the services they provide, and they aim to provide the best devices to trap you into their services, as surveillance capitalism has been, and is, failing to create revenue, as we know from the articles published for some time now.

This is old news, but just to add: offline/on-device makes little difference if you are going to use the services they provide, such as Amazon or YouTube Music.
Big data realises it doesn’t need surveillance, as that info is easier to capture via service APIs, which are vastly more efficient to run.

It is even crazy to embed a singular engine in the fast-moving world of AI, and this branding and cloning of consumer smart speakers is not smart at all. We need an open-source Linux Speech Framework, not a collection of hoovered-up, permissively licenced open source refactored and branded as one’s own.
ASR does a single job: it is presented with audio (speech), which it transcribes to text. The ASR module in a Linux Speech Framework is purely a container for any engine, with various methods for collecting audio and sending the transcription to the next container, some form of LLM or skill server.

A smart voice assistant is a serial chain of containers, each providing a specific function, and it is incredibly simple; most processes are completely agnostic of protocol, as all they need to do is pass a binary/text pair: the audio and a YAML of the data collected in that command session.

You don’t even need a skills-type interface, as that is just another container: merely a transcript router that uses the transcript’s predicate to choose which skill server to route to…
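A sketch of such a standalone router; the predicate-to-endpoint map and the endpoints themselves are made-up placeholders:

```python
# Sketch of a standalone transcript router: the predicate (first word)
# of the transcript picks which skill server receives the text and
# metadata. All endpoints here are hypothetical placeholders.
ROUTES = {
    "play": "http://10.0.0.5:8080/music",    # hypothetical music skill
    "turn": "http://10.0.0.6:8080/control",  # hypothetical control skill
    "set":  "http://10.0.0.6:8080/control",
}
FALLBACK = "http://10.0.0.7:8080/chat"

def route(transcript, metadata):
    words = transcript.split()
    predicate = words[0].lower() if words else ""
    endpoint = ROUTES.get(predicate, FALLBACK)
    # A real router would POST {transcript, metadata} to `endpoint`;
    # here we just return the routing decision.
    return endpoint, {"transcript": transcript, **metadata}

endpoint, payload = route("play the beatles", {"zone": "lounge"})
print(endpoint)
```

The point is that the router speaks only transcript + metadata, so any skill server behind any endpoint can be swapped without touching the rest of the chain.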

Depending on the audio out, such as Snapcast or Squeezelite, or the zonal audio-in system we currently don’t have:
A zone is just a logical collection of devices, likely a room, but it is your choice what you define.
With audio out you might have individual channels of a surround sound system that are all separate, or it could be a single amplifier.
Audio in is the same: just a logical collection of mics that may represent a room, or even part of its area.
It simply allows logical collections of audio in that you can associate with audio out; the basic assumption, unless stated otherwise, is that assistant responses return to the same zone they were initiated in.

With ML, unless you’re using a microcontroller, 32-bit halves the SIMD work the CPU can do in one pass.
Much faster engines use quantisation: the smaller the values, the more of them can be packed onto a wide data bus and processed via SIMD (Single Instruction Multiple Data), and the faster it gets, near exponentially.
64-bit ML on Cortex-A (Pi) is near 2x faster than 32-bit, as you can pass twice as many words with a single data instruction.
NEON registers are 128-bit, so for example four 32-bit values can be processed per SIMD instruction.

Microcontrollers are also mostly 32-bit (some are 64-bit), but yeah, it halves the number of values you can pass in a single data instruction.
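The arithmetic is simple enough to sketch, assuming 128-bit NEON registers:

```python
# Lanes per SIMD instruction = register width / element width. Dropping
# from 32-bit floats to 8-bit ints quadruples the values processed per
# instruction, a big part of why quantised models run so much faster.
NEON_BITS = 128  # Armv8 NEON vector register width

def lanes(register_bits, elem_bits):
    return register_bits // elem_bits

for elem in (64, 32, 16, 8):
    print(f"{elem:2d}-bit elements: {lanes(NEON_BITS, elem)} per instruction")
```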

well, you got to have SOMETHING that matches grammar to actions.

“whats the weather” is a lot different than “turn on x”

i currently support smart mirror, a voice driven info panel.
each extension can register its grammar so that the router knows where to send it.

on my list is to replace the hard coded snowboy/reco with wyoming. but it’s all js, so…

I meant it doesn’t need to be part of a system; it is very simple to link predicates (the action grammar) to an IP where the transcription and metadata are sent.
The router can be standalone: it is a router, not an interface to a system, as transcript plus metadata is the protocol.
Routers can range from simple text-matching solutions to LLMs, but as a separate container; that is the next stage in a Linux Speech Framework, so you have choice rather than a dedicated system interface.

PS: as for hardware, likely the state of the art is

It is the only hardware outside of a smart speaker that does targeted voice extraction based on user profiles.
There could be problems with Whisper, as certain filters create a flat, reverb-free voice; again it depends on the alg and the signature of its artifacts.
Whisper could work fine, but after a $138.00 pledge that is just a disclaimer, as that is down to Whisper, not the speakerphone.

Google, with VoiceFilter, published the first paper on targeted voice extraction.
The key tech is that, rather than trying to cancel the unknown (the noise), they extract the voice by creating a voice profile.
This is the direction most on-device voice tech is going, with even ASR learning user voice patterns via a small trained model that shifts the weights of a larger pretrained model.

I had stopped buying tech to dev and review, but my last will be to review the S-600, as I can still publish how well it extracts voice, along with a few samples from other tech recording at the same time.

I will post some recordings when I have reviewed mine.

@rexxdad Have you tried Online ASR with Emformer RNN-T — Torchaudio 2.2.0.dev20240503 documentation? For embedded use it might be a better fit for a Pi than Whisper, and if you read up on RNN-T, then personal LMs and maybe on-device training might be possible.

Or WeNet: if you search, both have pretrained models, but they are very much geared up for devs.
Products such as the above S600, or filters for optimum ASR, need the ASR dataset to be processed so the ASR becomes optimised for that product or filter.
So a lighter ASR with a ready training framework can be a big advantage, especially with dev communities the size of PyTorch’s.

$138.00 is obviously a ridiculous amount just for a microphone and amp, but targeted voice extraction doesn’t seem to have an open-source equivalent, so Anker have got me paying a premium.
It will be interesting to see how well the Anker alg works in the real world, as it has all the secret sauce that open source is missing. $138.00 though!

Yes, and one for each room…

Yeah I know. I have been trying to advocate for a lower-cost system on an ESP32-S3, which, for some reason, given the direction others have taken, they don’t understand.
I have been repeating for a number of years how important open-source datasets are, and we still don’t have them, even though we could be collecting them through use…

The Anker is SotA, and I have been banging on about targeted extraction and on-device training for some time, so I thought sod it, I will buy one.
Some are prepared to pay $100 for any speakerphone, and I guess they will, but for me it is also far too much.

As said, when it is received I will post results, including whether it works with Whisper, but really that is down to Whisper and not all ASR.

So is the overall quality of the audio input and output components paramount, or is it more about the software inside these speakerphone devices? I have a “UE Boom 2” speaker that has a mic in it; the output sound quality is pretty good, but I have never used it for a phone call, so I have no idea what the microphone is like. If the amp/speaker is anything to go by it should be not too bad. From what I can gather, no-one has managed to do anything with the Alexa hardware (yet), have they? I have been looking at ESP32 boards, but I haven’t got a very good grasp of what is required.
I thought I’d found something that would do as a start-up piece of kit, the Korvo Wroover, which was suggested at 22 euro, but after reading the article and getting ready to buy, the link led to a “no longer available” AliExpress page.

It is with noise that the commercial units excel: not just what they themselves are playing, but third-party noise.
This also includes the reverberation that distance to a mic causes; that harmonic mixing can make a waveform look totally different in structure.
Humans have a natural DSP: we filter this automatically, and how we focus on directional audio of interest is quite exceptional.
With Alexa the bootloader is locked, but much of it is the software: the algs and models they have.
I think the Korvo is another weird one, where kit was produced for algs to be developed that never materialised, hence “no longer available”.
Sort of similar to the reSpeaker stuff, where they liberally supplied mics and flashy pixel rings without realising that multiple mics create computation needs that scale steeply with mic count.
The ESP32-S3 can process much DSP near 10x faster than an ESP32 (the ESP32-S3-Korvo still exists), but even then it is still optimistic, as Espressif have only ever released a 2-mic technology demonstrator in the ESP32-S3-Box, which uses a software approach.
However, the ESP32-S3 was highlighted because it is a low-cost (<$10) microcontroller; once it becomes $30-$50 dev kits it has lost all value, as application SBCs such as the Pi compete in that price space.

There is likely enough ‘open source’ to do either an ESP32-S3 or a Pi-like SBC; likely the biggest hindrances are the lack of open-source datasets and the knowledge to coordinate.