State of "the art" as it were


ReSpeaker Microphone Array and ReSpeaker Core.

I’ve just picked up these items for next to nothing and I know they will probably be rubbish, but that said, what can I do with them? A quick search finds that everything on the internet is about the V2 whereas mine are V1. Are they any use for anything? (My first goal is a voice-controlled Wi-Fi intercom.)

I seem to remember, all the way back in the Mycroft days, that the v1.0 was pretty weak, hence the v2.0, and even then the algorithms struggled with full multi-channel audio, but that is from memory and vague; @fastjack might comment.

State of the art is probably the Anker S600, which uses a combo of tech, but the key thing it does is voiceprint-based voice extraction rather than what was, back then, fairly simple beamforming.

I hacked together a 2-channel delay-sum beamformer; if someone looked at the sources, more channels could be added easily.

I mention that as all these algorithms seem to suffer from compute rising with the number of mics in the array. The delay-sum part is just that, a sum, and next to no load, but the TDOA (time difference of arrival) estimation is nearly all of the calculation and fairly intensive.
Every extra mic adds another TDOA estimate to get that mic's delay for the calc, and that is the basis for most of them.
More mics is computationally heavy, and I think that was a problem with the above; even the v2.0 was an element of fail.
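To make that split concrete, here is a rough numpy sketch of the same idea (not the code from my repo, just an illustration): GCC-PHAT does near all the work estimating the lag, and the delay-sum itself is a one-liner. Every extra mic means another `gcc_phat_delay` call.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs, max_tau=None):
    # GCC-PHAT: cross-correlate in the frequency domain with phase-transform
    # weighting; this FFT work is where nearly all the compute goes
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    r = SIG * np.conj(REF)
    cc = np.fft.irfft(r / (np.abs(r) + 1e-12), n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(np.abs(cc))) - max_shift  # lag in samples

def delay_and_sum(ch0, ch1, lag):
    # the cheap part: align the second channel and average
    # (a circular shift is fine for a short sketch like this)
    return 0.5 * (ch0 + np.roll(ch1, -lag))

fs = 16000
block = np.random.randn(2, 1024)  # stand-in for one 2-mic capture block
lag = gcc_phat_delay(block[1], block[0], fs, max_tau=0.0005)
out = delay_and_sum(block[0], block[1], lag)
```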

They also lack coordination with the KWS (keyword spotter), which could give the TDOA of the keyword itself and in essence lock onto the target direction for the command sentence.
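That missing coordination could be as small as the sketch below; `BeamLock` and the 4-second hold are made up for illustration:

```python
# Hypothetical lock-on logic: when the KWS fires, freeze the steering lag
# that was estimated on the keyword and hold it for the command sentence.
LOCK_SECONDS = 4.0  # rough length of a command sentence

class BeamLock:
    def __init__(self):
        self.locked_lag = None
        self.release_at = 0.0

    def on_keyword(self, now, keyword_lag):
        self.locked_lag = keyword_lag      # TDOA measured on the keyword
        self.release_at = now + LOCK_SECONDS

    def steering_lag(self, now, live_lag):
        if self.locked_lag is not None and now < self.release_at:
            return self.locked_lag         # ignore competing sources
        self.locked_lag = None             # sentence over: reset
        return live_lag                    # back to free-running tracking
```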

So here I am, likely about to write one of those baffling long replies, so you can stop here if you wish.
SotA (state of the art) is likely voice extraction, rather than trying to cancel every possible unknown noise.
It is trained to extract the voice matching an embedding taken from a recording of a brief silence and then a spoken paragraph, recorded on your phone.
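I don't know what Anker actually run, but here is a minimal sketch of that enrollment idea using an open speaker-embedding library (resemblyzer); the file names and the 0.75 threshold are my inventions:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav  # pip install resemblyzer

encoder = VoiceEncoder()

# enrollment: the brief silence gives a noise floor, the read paragraph
# gives the voiceprint (a 256-dim speaker embedding)
enrol_wav = preprocess_wav("enrolment_paragraph.wav")   # hypothetical file
voiceprint = encoder.embed_utterance(enrol_wav)

# at run time: does this captured chunk belong to the enrolled speaker?
chunk = preprocess_wav("mic_capture.wav")               # hypothetical file
similarity = float(np.dot(voiceprint, encoder.embed_utterance(chunk)))
is_target = similarity > 0.75   # threshold is a guess; tune on real data
```

A full extraction model conditions on that embedding to pull the matching voice out of the mix rather than just classifying chunks, but the enrollment step is the same.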
Bluetooth has much going for it for audio, with full 44.1 kHz two-channel and multi-channel audio.

You get a BT 5.3 dongle, which I did deliberate over, as the Anker is v5.3 and pairs, which means each additional unit is just another dongle. The extended range of BT 5 and the LC3 codec are good; the devices are expensive, but it's likely the way forward.

Open source gets a shout again, as the Pi 5 would likely blitz the above beamformer, as it does with much ML. It is nearly 6x a Pi 4 due to its Armv8.2 instruction set, especially the vector instructions.
My C/C++ is just pure hacking, so someone could do better, and the Radxa Zero 3 with its A55 little cores and Armv8.2 slugs it out with a Pi 4.

DietPi (Lightweight justice for your SBC!) on an OKdo ROCK ZERO 3W 1GB with Wi-Fi/BLE, without GPIO, from OKdo will give often-Pi-4-beating ML for £18.60.

The vector instructions in Armv8.2 are no longer SotA, but they are a big improvement (3x?) over the A53 counterparts, making these cores great for ML/DSP.

If you're interested in open source then there are a lot of models, especially for RK35xx SBCs such as the RK3566 (Zero 3) and the RK3558 (OrangePiZero), which, ecosystem aside, really outdo a Pi, because Broadcom seem to have lost their way with that type of application SoC.

Anything less, especially something the age of that v1.0, just reminds me of a long rocky road.
You can try, but having the same algorithms scale up on bigger computers, so you can create and manipulate datasets and run training and finetuning much faster than realtime, is a dev godsend.

If open source collated even a modicum of datasets, much would be very easy, and it constantly blows my mind that open source doesn't ship with an option to turn on local data collection.
You can capture keyword and command sentences so easily through use that it would be very easy to contribute data and metadata. Some metadata that would be handy is the region you associate your accent with, as models could then be finetuned for that region.
Age bracket and gender would be great info again, to cherry-pick data for finetuning.
You don't need more than that; you could collect more, but with just those we have a demographic rather than an identity, and that is for user finetuning and, if you opt in, for sharing with others.
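Every field name below is just my guess at a sensible shape for such an opt-in contribution record:

```python
# Hypothetical shape of an opt-in contribution: a demographic, not an identity.
contribution = {
    "audio": "keyword_2024-05-01T10-31-22.wav",  # captured through normal use
    "transcript": "hey jarvis, turn on the kitchen light",
    "accent_region": "en-GB-yorkshire",  # the region you associate with
    "age_bracket": "45-54",
    "gender": "male",
    "share_opt_in": True,                # off by default: user finetuning only
}
```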

Also, I don't get the current implementation of GitHub - rhasspy/wyoming: Peer-to-peer protocol for voice assistants, with it being peer-to-peer with a protocol that IMO is TWAIN (Technology Without An Interesting Name).
It just seems so logical to make Wyoming a WebSockets server and have clients that are associated with zones, or at least add that as an option.
What you want is a server that is separate from the Home Assistant one, with the schedulers set to max performance in a race-till-idle config.
It slows down when idle, but anything queued is handled ASAP.

You can have a Pi 5-like or better upstream server that connects via peer-to-peer Wyoming, and because the assistant path is serial, it's a simple route or queue.
WebSockets, as the underlying transport for Wyoming, would give two distinct data-packet types, text and binary, and the API can detect which is which.
It's a file (a JSON dict?) with markup, plus your audio, without needing much of what is there currently.
Simple route and queue handling are part of the WebSockets API, or at least states such as streaming, persisting and busy can be detected.
Zones are just collections of connections associated with a room: a zone is its ID, and a room is the abstraction over whatever collection of devices is connected.
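As a sketch of what I mean, using the Python websockets package (the port, the zone field and the JSON shape are all invented here; this is not the actual Wyoming protocol, and a recent websockets release is assumed):

```python
import asyncio
import json
import websockets  # pip install websockets

zones: dict[str, set] = {}  # zone id -> connected clients in that room

async def handler(ws):
    # first frame is text/JSON: the client says which zone (room) it is in
    hello = json.loads(await ws.recv())
    zone = hello.get("zone", "default")
    zones.setdefault(zone, set()).add(ws)
    try:
        async for frame in ws:
            if isinstance(frame, bytes):          # binary frame: raw audio
                websockets.broadcast(zones[zone] - {ws}, frame)
            else:                                 # text frame: markup/control
                pass  # route or queue based on the JSON content
    finally:
        zones[zone].discard(ws)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 10700):
        await asyncio.Future()  # run forever

asyncio.run(main())
```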

The same goes for BT devices, as an array of dongles, with BT 5+ coping better with congestion.
It's another radio: a client-server setup of just BT audio streams, with no Wyoming or WebSockets, but still using the same routing and queueing of the file and binary data.

The current input audio processing is more absent than merely not good.

No, please carry on with the long answers, I learn stuff through them, but there should be an emoji for the “it went straight over my head” kind of thing.

Regarding the V1.0 ReSpeaker, I know it's mickey-mouse by today's standards, but 7 years ago it was presented as the bee's knees, so I just want to try and do something with it to start my journey without spending much. It seems that Seeed have basically ditched the V1.0, and the V2.0 isn't exactly supported either, but back in the day, 2017-ish, somebody must have done a few things with these items…
Mine sits on the table and the LEDs light up in the general direction of where the noise is coming from; it's quite cool really, but I haven't managed to get any sound out of it yet.

Seriously, it's a sort of pile of poop, as even when it did work it didn't work that well.
The ReSpeaker range IMO seemed to be devices that look the part and leave the voice algorithms to you, the dev.
It was a bit like the Matrix device: multi-mic and looking the part, without a big enough CPU to do the beamform compute, and so it is just a multi-mic.

Seriously, the V2.0 was a big upgrade and even that didn't get great reviews, never mind the V1.
I guess if you root around the ReSpeaker site the original firmware will be there somewhere.
I guess you can connect it over USB if you have the firmware, but as for audio, don't expect much more than any mic.
As far as I know no one did very much with them, but the GitHub is still there.

If you're talking SotA, then Google do actually publish things and sometimes drop a few hints.

It is likely what they use on-device in their Pixel Tablet.

The paper below might hint at that.

https://arxiv.org/pdf/2401.08864

What they are doing is combining the spatial cues of beamforming with speech enhancement, or voice extraction based on a recorded embedding.
I haven't been able to find a repo or ready-made model of that; just the papers seem to exist, whilst the source code is withheld.
Sooner or later some source code or models will turn up, likely when they have found alternative methods. (Also, maybe another Google search might turn something up.)

Those early devices were very simple multi-mics with 360-degree pickup, whilst with both the Google Nest and the full-blown Echos they realised: hold on, this is a speaker, and it faces you unless it's for conferences.
That means they can cut the number of mics and have forward-facing devices, massively reducing compute.

Still, many are using conference types that work OK, but they bounce around to whatever the predominant sound/voice is, whilst an assistant will lock onto and beam towards a voice for the full command sentence, then reset.
Also, sound separation in a domestic scenario on average only has one clashing noise source, such as a vacuum cleaner, TV, hi-fi or radio…
So binaural, as we humans are, with two mics is becoming more common on embedded devices.

If you're talking a Pi 4/5 then there are various algorithms available, but they tend to be a very tight squeeze on a Pi 3.
That's why I mention the Radxa Zero 3: it's near Pi Zero 2 price but actually outdoes a Pi 4, as its A55 small cores are two generations up from the A53 (the Pi 3 'small core') and hold their own against the A72 (the Pi 4 large cores).
The Pi 5 is three generations up, with an A76 large core.

You get to that sort of level and quite a few options are available, but my memory is foggy: the ReSpeaker v1.0 had an XMOS sound processor, but no license for the closed-source XMOS software?.. I have forgotten.

XMOS XVSM-2000, apparently…

I do eventually want SotA, but I've realised that this is much more complex than I thought, and a lot of this stuff is still very much experimental, much more so than casual googling led me to believe, so I don't want to spend too much and find that the whole SotA thing has moved on before I even get what I've bought up and running.

Baby steps at the moment…

I'd say the next-gen SotA might be out before you get that up and running, and it's still far off.

The only problem is it seems to take a long time for SotA to filter down into open source.

Likely it needs some lateral thought: use what you have to Wi-Fi/BT the audio streams to an upstream, bigger CPU that handles multiple connections.
Don't stick your mics on top of a speaker; just have some wireless ones, or at least wireless units plugged into an amp.

I could post you a load of repos to have a look at for Pi-type boards, but microcontrollers tend to be much lower level and need quite a high level of knowledge.
Maybe use it as-is as a Wi-Fi streamer; dunno about the audio out it has, but it has pins, maybe line level.

The XMOS XVSM-2000 you have; their closed-source, paywalled SDK you don't.

I don't doubt that you are right, but I am a pig-headed sort of a person and I can't stop myself from wasting my time with this ReSpeaker…
At the moment I am up sh*t creek because the firmware is no longer where it is supposed to be (on OneDrive apparently…). I am missing the “Juci frontend, PocketSphinx, Python libraries and Mopidy” but I don't know how to get over this enormous obstacle.

GitHub - mkschreder/juci: JUCI JavaScript Webgui for embedded devices running OpenWRT is what a quick Google turns up, but how they were implemented, who knows.

Thanks for that but I am struggling to actually make sense of it all. I’ve been googling till my brain hurts…

It seems the original firmware has the missing packages, but the updated firmware that suits the Meow King speaker and battery unit doesn't have these items, to “save space”, and without them I don't know how it's supposed to do all its magic. I am getting there slowly, but I think I need the original firmware, which is no longer to be found, and the ReSpeaker page is no longer supported. If I can get something out of it with the original firmware and a separate speaker, then I can try and get the Meow speaker working later, then eventually use it as a satellite to the RPi 4 some day in the future.

I think ‘designed for’ means it's a 4-ohm, 5-watt speaker in a housing that fits the case.
Thing is, I don't think it ever did its magic; I can remember a user had one on the Mycroft forum, but that forum doesn't exist anymore.
Enjoy, as there's nothing I can add: it's an old, obsolete unit whose demo videos never show it working; they used it as a fruit piano instead, as the touch buttons did work, which speaks volumes at least.
I just cannot remember, but I have a vague recollection it got a bad review from the Mycroft user; what was said… dunno, cannot remember.

I know it's a total pile, but it did have some sort of functioning directional-mic thing going on before I started messing. It would light up the LEDs in the general direction of the person speaking…
I would just be looking to get it doing that again and outputting sound to the Meow King thing, and it would give me the inspiration to do something “proper” with the Pi after. Then the ReSpeaker can be a satellite: I definitely need at least one satellite, as the first thing I want to do is have the voice-operated intercom that used to work on the Alexas.

Edit: it still does the directional mic, but I had the 7-way mic thing disconnected, so it wasn't doing it… DOH!

I assume that function is built into the mic array?

Yeah, that must be the XMOS. The TDOA (time difference of arrival) is usually calculated (2ch_delay_sum/tdoa.h at main · StuartIanNaylor/2ch_delay_sum · GitHub), and if it's delay-sum, the output is just the sum of the delay-aligned channels.
That is with broadside arrays; when you have a mic behind a mic (endfire), I think it's a sum again, but the delayed signal is inverted.
I haven't a clue with the geometry you have, and it maybe uses a more complex beamformer such as GSC or MVDR.
It's so long back that I remember very little.
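A toy numpy version of the two geometries, just to show the sum versus the inverted (subtracted) delayed signal; the lags are in samples and made up for illustration:

```python
import numpy as np

def broadside_sum(left, right, lag):
    # broadside: time-align the second mic, then SUM
    return 0.5 * (left + np.roll(right, -lag))

def endfire_null(front, rear, travel_samples):
    # endfire: delay the rear mic by the mic-to-mic travel time, then
    # SUBTRACT: sound from behind cancels, sound from the front does not
    return front - np.roll(rear, travel_samples)
```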

Don't laugh, but I'm looking at a Seeed Snips now. Apparently Snips sold out to Sonos and they blocked it or something. Anyway, I can buy it for peanuts, but is it any better/easier to make use of than the ReSpeaker? Sorry for the idiot questions; I know this thread has become the exact opposite of “State of the Art”, for which I apologise.

It's OK, as Rhasspy and HA don't have any input audio processing; they're completely lacking SotA.
The Seeed Snips is just a Pi 3 B+ and a ReSpeaker 2-mic HAT, and there used to be some C libs and API documentation that I am not sure still exist. Again, there was another thing about licensing, but my memory fades.

You have gone full circle now: you can use a Pi 3 B+ or even a Pi Zero 2, but an £18 Radxa Zero 3 will give Pi 4-like performance.
You don't have to have Snips, it was just software; in fact the ReSpeaker 2-mic HAT, due to being a HAT, is really awkward for isolating the mics, with many users having them pointing upwards.
The 2-channel ADC from the Plugable USB Audio Adapter – Plugable Technologies allows you to mount two of https://www.aliexpress.com/item/1005005348094991.html, as the MAX9814 has a great analogue AGC and they mount using a rubber grommet.

I can do all this with various methods; the problems really are upstream. From beamforming and BSS to filters, much is available and possible, just not particularly SotA, but it's far better than the complete absence of audio processing that Rhasspy and HA have.