This is soon - looking forward to it!
This is a really nice idea! I like it!
Finally, it would be amazing if you could maintain the hotwords for all satellites centrally (e.g. in Rhasspy or so).
I am experimenting with this: Build a Keyword Spotting Model with Your Own Voice in 30K RAM
Already implemented it with a default yes/no because I was lazy.
What I need is a simple tool to upload and train data, which the above seems to provide.
You still need to provide a lot of data for the keywords, but basically “lights on” and “lights off” should be a good start.
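Roughly, the pipeline looks like this minimal sketch (my own illustration, not the tutorial's exact recipe; the feature shape, layer sizes and file name are placeholder assumptions):

```python
# Minimal sketch (not the tutorial's exact recipe): a tiny keyword model
# for two classes like "lights on" / "lights off", quantised to int8 so
# it can fit in a few tens of KB of microcontroller RAM.
import numpy as np
import tensorflow as tf

NUM_CLASSES = 2          # "lights on", "lights off"
FRAMES, MFCC = 49, 10    # typical micro_speech-style feature shape

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(FRAMES, MFCC, 1)),
    tf.keras.layers.Conv2D(8, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(features, labels, epochs=20)  # features from your recordings

# Full-integer quantisation: a representative sample set lets the
# converter pick int8 ranges, which is what keeps the RAM use tiny.
def representative_data():
    for _ in range(100):
        yield [np.random.rand(1, FRAMES, MFCC, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
open("kws_model.tflite", "wb").write(converter.convert())
```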
It does not seem to perform well in my satellite code, but stripped down to the core it works on my M5 Atom Echo.
Maintaining in Rhasspy is a good idea, but I have not given that much thought yet.
There is also something similar for the Pico.
But like everything we have without audio processing libs for beamforming or blind source separation, they return pretty poor results in the presence of noise.
I'm still not sure what the state is with the ESP32-S3, which has far more powerful vector math than simple microcontrollers and is specifically aimed at a complete voice package.
There are some kits out at around £20 and they seem to be going through revisions, mainly engineering samples, but I think everything is being affected by the silicon shortages.
Also, the big thing now is federated learning, which solves huge problems in privacy: a general-purpose global model can be trained and have its weights updated by a locally trained small model, so that user data can deliver user-specific accuracy without privacy problems.
It's all part of Android 12, and especially the new Pixel 6 phones, where through use and correction a local model begins to learn the user's regional accent and expressions. That is all part of the new Tensor chip they have provided, which has quite a strong embedded TPU (vector math).
Yeah, I was wondering about that as well but I am just experimenting a bit
Just noticed a new audio project for Xtensa-based chips.
The box itself is very cheap at $50, and I think it's using the Alexa repo for the ESP32, but I will have to do some more reading.
The devkitC is available for $15 https://www.mouser.co.uk/ProductDetail/Espressif/ESP32-S3-DevKitC-1-N8R8?qs=sGAEpiMZZMv0NwlthflBi%2BzwRcElzYQ0Q0bvCS%2BJ0vw%3D
What is of interest is how well
https://www.espressif.com/en/solutions/audio-solutions/esp-afe
works, as AEC, BSS & NS finally arrive on a platform that makes extremely cost-effective satellites. But unless Hermes audio and, IMO, the manner of satellite implementation have changed, we have the satellites but Rhasspy is still in need of major change.
Still, though, to have mics included, plus case & display, $50 is hard to beat.
I ordered from AliExpress; it cost £45.46 delivered to the UK.
I ordered one as well, to see if I can get a local wakeword running and have something left over for other processes.
At least support for the ESP32 satellite. With local wakeword, audio streaming over MQTT is still not a big issue in my opinion.
Local wakeword for a Pi is already easy, so I still agree that MQTT is not the best for audio streaming alone, but it is still a good choice as glue between all the different functionality. And streaming audio also works well actually, so I still disagree that Rhasspy needs a major change on this.
Ordered one also! Hope this will fill the need.
Yeah, it should run any TensorFlow model @romkabouter, and I keep meaning to have a look at the newer federated learning that is in their Pixel phones.
In the Pixel phones it combines a standard model with a local training model, but how and what exactly, I dunno, as I haven't been active with TensorFlow lately.
I got one just to see how Espressif have it set up and to delve into the code, as it's much easier to hack away than to build up from scratch.
I dunno if the mics are PDM or I2S, and I'm also interested in what they have used.
Also, with Rhasspy I haven't paid much attention, but I'm still calling for audio input/output zones so that multiple satellites and rooms can be provided for by a single brain.
Uncompressed raw audio over any protocol in the 2020s is a horrendous solution in its own right, but to encapsulate that in lightweight control broadcast traffic is only glue in the sense that shit apparently sticks.
When there is local KWS on the satellite, there is only a short while of actually sending uncompressed audio.
I assume that does not outweigh the major change needed for a different solution atm.
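As a back-of-envelope check (assuming the usual 16 kHz / 16-bit / mono PCM that Rhasspy satellites stream):

```python
# Back-of-envelope: raw PCM bandwidth for a typical voice command,
# assuming 16 kHz / 16-bit / mono as commonly used by Rhasspy satellites.
SAMPLE_RATE = 16_000        # samples per second
BYTES_PER_SAMPLE = 2        # 16-bit PCM
COMMAND_SECONDS = 3         # a short command after the wakeword

bitrate = SAMPLE_RATE * BYTES_PER_SAMPLE * 8                 # bits per second
burst = SAMPLE_RATE * BYTES_PER_SAMPLE * COMMAND_SECONDS     # bytes per command

print(f"{bitrate / 1000:.0f} kbit/s while streaming")            # 256 kbit/s
print(f"{burst / 1024:.0f} KiB per {COMMAND_SECONDS}s command")  # ~94 KiB
```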
So while you have a good point in theory about raw audio, the actual problem with it is not so very big in my opinion. Maybe Mike is going to make that major change you so much desire in the future; I do not know, and do not really care either.
Yes, I am curious about that too. Probably no custom keywords, but interesting nonetheless.
Again, Hermes audio and the satellite implementation of Rhasspy did not have local KWS on a satellite, and like the Matrix and ESP products you seem to support (it is your firmware we are talking about here), they are purely always-on broadcast satellites.
Even if a local KWS satellite were used, you seem to forget about the all-important ASR audio that contains the intent, rather than purely a keyword to start broadcasting.
I am aware you do not know and don’t really care even though you often comment on the matter.
Zones are something else fundamental that Rhasspy misses. Call them zones, rooms or whatever, but it's absolutely essential to be able to easily group and associate input/output, and audio is one of those.
It's sort of sad to see what was a very busy and active community turn into the trickle that it is today. That cannot be attributed to some of the community not caring, but hey, at least they tried.
Maybe you should put more effort into actually contributing to Rhasspy to resolve the issues you are so passionate about, instead of always typing the same long texts regarding the matter.
That would be more productive; you post a lot of negativity but have zero contribution towards change.
And Rhasspy actually supports groups by the way.
I did at one stage, but things are so far out of whack, and certain members are far too ready to turn to ad hominem rather than discuss, or to state they don't care.
Rhasspy may have an over-complex programmatic method of providing groups. Like skills, there should really be a plethora of uses for them in existence by now, and the absence speaks volumes about the current implementation versus needs and requirements.
Every time I hear Hermes Audio mentioned, it's a red flag not to waste my time, so I no longer do.
Occasionally, as in the first reply of this thread, I do post on what have been some long-term voids in an adequate voice/smart AI.
I cannot recall that I saw any pull request from you. Ever. But if you did, sorry about that and keep up the good work.
Also, keep up the good work for Rhasspy with regard to filling the missing voids so we can move away from raw audio streaming. Hopefully some pull requests will be seen from you.
I do not disagree with you on raw audio streaming, but I am not a Rhasspy developer, and unlike you I do not have those big red flags on it. I would otherwise most probably have done something to try to change it. Instead my firmware just supports Rhasspy as it is, like it did a long time ago for Snips (where the whole MQTT audio shizzle comes from in the first place).
If Rhasspy changes, the firmware will change along with it.
If I am going to be honest @romkabouter, Rhasspy sort of doesn't fit my needs anymore, and I am also not a developer. Even though I have done quite a bit of work with TensorFlow, my main interest has been getting the audio engineering working, and how on a Raspberry Pi platform we are near totally absent of the critical audio processing stages.
The ESP32-S3 could be a game changer, but how well it works has still to be seen, as it does have AEC (Acoustic Echo Cancellation), BSS (Blind Source Separation) & NS (Noise Suppression).
BSS is an alternative to beamforming, and it will be interesting to see the results, as the best solutions use both; the combination of beamforming & BSS is probably how at least the Google devices work.
Amazon might use multi-mic beamforming alone with a 6-microphone array, whilst Google used 2 but has, I think, a 3-mic linear array in its Nest Audio products.
More microphones increase the resolution and accuracy of beamforming, but each additional microphone sharply increases processing load; pairwise methods scale with the number of mic pairs, so cost grows roughly quadratically.
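To put a number on that: pairwise methods such as GCC-PHAT correlate every microphone pair, so the work grows quadratically with the array size:

```python
# Pair count for pairwise beamforming/correlation methods (e.g. GCC-PHAT):
# every extra microphone adds correlations against all existing ones.
def mic_pairs(m: int) -> int:
    return m * (m - 1) // 2

for m in (2, 3, 6):
    print(m, "mics ->", mic_pairs(m), "pairs")
# 2 mics -> 1 pair, 3 mics -> 3 pairs, 6 mics -> 15 pairs
```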
So how 2 mics with BSS alone will pan out is currently of interest; the algorithms are incredibly math-intensive, but they have advanced.
SpeechBrain and Professor F. Grondin (SpeechBrain: Speech Processing) provide some great examples that hark back to his excellent ODAS app. [2103.03954] ODAS: Open embeddeD Audition System
The other advantage of the ESP32-S3 is that it's a low-cost, vector-math-enabled microcontroller. The devkit board alone has arrived at $15, not far off what the original ESP32 did, so if it is a success then economies of scale should see similar prices, maybe as low as the sub-$5 ESP32 boards.
Due to the low cost, this should allow the possibility of distributed wide-array microphones, where multiple 2-mic arrays can use KW hit probability to select the best signal source for a voice sentence.
This means you get a hybrid of a conference-style wide-array microphone system from low-cost distributed array microphones, to rival and better commercial alternatives.
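A minimal sketch of that selection idea, in Python purely for illustration (all names and message shapes here are my own assumptions, not an existing API):

```python
# Hypothetical sketch: pick the satellite whose local KWS reported the
# highest wakeword confidence, and take the command audio only from it.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class KwHit:
    satellite_id: str   # which 2-mic array heard the wakeword
    confidence: float   # local KWS probability for the keyword
    timestamp: float    # when the hit occurred

def select_best_source(hits: list[KwHit], window_s: float = 0.5) -> KwHit | None:
    """Group hits that arrived within one window (the same utterance heard
    by several arrays) and return the one with the highest confidence."""
    if not hits:
        return None
    first = min(h.timestamp for h in hits)
    same_utterance = [h for h in hits if h.timestamp - first <= window_s]
    return max(same_utterance, key=lambda h: h.confidence)

hits = [KwHit("kitchen", 0.91, 12.30), KwHit("hall", 0.62, 12.41)]
best = select_best_source(hits)
print(best.satellite_id if best else "no hit")  # -> kitchen
```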
You have been providing firmware for microphones that are absent of any audio pre-processing, and so have all the show-stoppers that noise, distance and echo bring to recognition.
What you have been providing is essentially pointless, as it gives no better results than a single el-cheapo mic on a sound card.
You have disliked my posts informing others that, apart from it being a remote microphone, they should not expect anything more, or anything like Google or Amazon performance, as certain critical, essential processes of voice AI are missing from your firmware.
MQTT is a control protocol that is lightweight and should be encrypted. By embedding audio into its payload, and because it is broadcast traffic, nodes that are not audio-specific have to decrypt heavy audio traffic just to reject it as not needed.
This is fine on the audio sender/receiver, but it completely kills the use of lightweight control nodes in that MQTT network.
Audio does not need broadcast traffic: it has a sender, a destination and metadata. WebSockets are far more appropriate, as it is very simple to distinguish binary and string frames (audio & metadata), providing a very efficient way to transmit and separate the two, and traffic only occurs between the sender and receiver nodes rather than across a whole MQTT network.
Also, because only the command sentence is sent after KW authorisation, it is very likely that no encryption is needed, or at least encryption becomes a choice, making the audio transmission even lighter.
Many room/zone systems can work over fairly low-bandwidth WiFi/Ethernet, but transmission should be separate from MQTT. It is pretty essential for any control protocol to be lightweight & encrypted, and filling it with unnecessary audio binaries creates huge load while there is no need for MQTT on a microphone/audio satellite, as with most commercial RTP-style protocols. That way MQTT can remain a lightweight, secure control protocol.
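A minimal sketch of that binary/string separation, assuming a recent version of the Python websockets library (the port and metadata shape are illustrative assumptions):

```python
# Minimal sketch, assuming the Python "websockets" library: a server that
# separates JSON metadata (text frames) from raw audio (binary frames).
import asyncio
import json
import websockets

async def handle(ws):
    async for message in ws:
        if isinstance(message, bytes):
            # Binary frame: a chunk of audio from one satellite to us only,
            # no broker fan-out and no unrelated nodes decrypting it.
            process_audio_chunk(message)
        else:
            # Text frame: lightweight metadata (site id, sample rate, ...).
            meta = json.loads(message)
            print("metadata:", meta)

def process_audio_chunk(chunk: bytes) -> None:
    print(f"got {len(chunk)} bytes of audio")

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```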
That is incorrect. I have actually stated myself on various occasions that the firmware is not more than a remote microphone.
Indeed, it does not give better results than a single cheap mike, but that is also not what I have ever advertised in any way.
The firmware offers a way to have a mike away from the server, without the server hardware needing any sound input.
If that is pointless in your opinion, then that is what it is, your opinion.
I also created it for myself in the first place, because my server is hidden in a closet somewhere and I wanted a remote mike. But I saw interest from others for exactly that reason, so I opened it up on GitHub.
If you or someone else can provide good AEC and NS for my firmware, I would very much appreciate that because I have about zero knowledge about that audio processing stuff.
For MQTT and audio, I agree that it is not ideal, but you really should blame Snips (and now Sonos).
Rhasspy did nothing more than jump in the hole when Snips was sold. Also, MQTT is not a control protocol; it is a message protocol for a large variety of use cases.
And basically, if you have Rhasspy set up with Raspberry Pis or a NUC, you would never have to stream audio over the network. Everything can be local, and no MQTT audio is needed for that at all.
The firmware I wrote sadly does not support local KWS atm, so the only way to achieve a remote mike system was to connect an audio stream. I already had that working for Snips via MQTT, so it was quickly done for Rhasspy.
I found other features more important than finding another way for audio broadcast. And hopefully I can find some time in the future to do local KWS.
That will result in a very short time of audio broadcasting, which is why I think this whole MQTT audio issue is blown very much out of proportion by you.
I had a branch for websockets audio as well, but stopped working on it due to time issues.
If you really need AEC and NS, which I can very much understand, then do not use the ESP32 firmware. But that is true for every mike you blindly connect to Rhasspy, because a simple USB mike does no AEC or NS either.
So can we please leave this pointless discussion behind us and spend our energy in a positive way?