Wakeword engine

sepia-assistant · July 24, 2020, 7:23am

Great to see that you guys are working on an alternative solution!
I’m still using Porcupine but unfortunately what @KiboOst wrote also applies to their support for SEPIA and they will not update the “Hey SEPIA” keyword made back in 2018.
The worst thing is that they are the only ones who have a library that works cross-platform and can be used in the SEPIA HTML client app

I quickly checked the code and was wondering … any chance this could be adapted to work in the browser? Or is there anything Node.js native included?

fastjack · July 24, 2020, 8:00am

I’m using a native module to extract the MFCC features but I think the WebAudio API can be used to do the same in a browser.

From my profiling on the Node version, the heaviest part is the MFCC feature extraction. The DTW does not seem to be that expensive (averaging the templates also helps a lot to reduce calculations).

I’m still convinced that a C++/Rust library that ingests audio chunks, does the pre-emphasis, windowing, feature extraction and DTW, and outputs a prediction will be the optimal path resource-wise (as most of these steps are loops).
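
To make the first two steps of that pipeline concrete, here is a minimal sketch of pre-emphasis and Hamming-windowed framing in plain Python. It is not taken from node-personal-wakeword or Gist: the function names, the 0.97 coefficient and the frame/hop sizes are illustrative assumptions, and a real implementation would run these loops in native code.

```python
import math

def pre_emphasis(samples, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1] (coeff is an assumed value)."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out

def hamming(size):
    """Hamming window coefficients for a frame of `size` samples."""
    if size == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)) for n in range(size)]

def frames(samples, frame_len, hop):
    """Split the signal into overlapping frames and apply the window to each one."""
    window = hamming(frame_len)
    result = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        result.append([s * w for s, w in zip(chunk, window)])
    return result
```

Each windowed frame would then go through an FFT and the mel/DCT steps to produce one MFCC vector; those steps are left out here.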

sepia-assistant · July 24, 2020, 8:11am

That would be this here? const Gist = require('@mathquis/node-gist')

fastjack · July 24, 2020, 9:09am

Yes. The FFT part of the MFCC extraction seems to be the heavier part of the lib.

fastjack · July 24, 2020, 9:56am

Actually, the gnu-ne10-mfcc library is not really required (and after looking at the code, I think it does not really work as advertised or at all)…

I think the essential part is the NE10 library that provides a heavily optimised FFT calculation (complex and real) for Arm-based CPUs equipped with NEON SIMD capabilities:

github.com

projectNe10/Ne10/blob/master/samples/NE10_sample_complex_fft.c

koan · July 24, 2020, 10:56am

Does it require NEON support? Because then it won’t run on the Raspberry Pi Zero W, and that’s the device that needs an optimized library most (cfr. @synesthesiam’s 8-second delay for the wake word detection).

fastjack · July 24, 2020, 11:05am

It does require NEON support, so indeed that won’t work for the Pi Zero, but it can improve performance on ARMv7 and ARMv8 architectures, which would also be nice.

Other libraries can be used for FFT calculations like FFTW (the fastest according to their benchmark) or KISSFFT (the one I’m using with Gist).

fastjack · July 24, 2020, 12:14pm

Maybe this can be useful…

I’m testing the node-personal-wakeword module on a Raspberry Pi Zero W and the CPU usage so far is as follows:

- No keyword: between 4% (no voice detected by VAD) and 48% (voice detected + MFCCs).
- 1 keyword (using 3 templates averaged to 1): between 4% (no voice detected by VAD) and 62% (voice detected + MFCCs + 1 DTW). Detection is immediate.
- 2 keywords (using 3 templates averaged to 1 for each keyword): between 4% (no voice detected by VAD) and 85% (voice detected + MFCCs + 2 DTW). Detection is immediate.
- 3 keywords (using 3 templates averaged to 1 for each keyword): between 4% (no voice detected by VAD) and max (voice detected + MFCCs + 3 DTW). Detection starts to lag but is still acceptable (less than a second).

Adding more keywords will only increase the CPU usage and add to detection lag to the point of triggering audio over-runs.
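
The roughly linear growth per keyword in the numbers above is what you would expect when every keyword template needs its own DTW pass over the same feature frames. The following is only a sketch of a textbook DTW in plain Python, not the code of node-personal-wakeword; `dtw_distance`, `best_keyword` and the Euclidean local cost are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Local cost between two feature vectors (e.g. two MFCC frames)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping cost, O(len(seq_a) * len(seq_b))."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]

def best_keyword(features, templates):
    """Return (name, distance) of the closest template: one DTW per keyword."""
    scores = {name: dtw_distance(features, tpl) for name, tpl in templates.items()}
    name = min(scores, key=scores.get)
    return name, scores[name]
```

With three recorded templates averaged into one per keyword, `templates` holds a single entry per keyword, so the number of DTW passes per audio window equals the number of keywords rather than three times that.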

synesthesiam · July 24, 2020, 2:11pm

I got audio overruns immediately on the Zero, so your code must be quite a bit faster than mine. I wasn’t doing template averaging, but I would still expect a bad delay.

I haven’t done the MFCC optimization you mentioned yet; that will be my next step. From there, maybe I could try moving the DTW calculations out into a .pyx file?

fastjack · July 24, 2020, 3:56pm

@KiboOst I recall you were using Snips hotword before switching to Snowboy. Could you give us feedback about the CPU usage of both (Snips and Snowboy) with personal wakewords on a Raspberry Pi Zero… I wonder if they were doing better CPU wise…

fastjack · July 25, 2020, 7:46am

Found something interesting for FFT speed up on RPi Zero:
http://www.aholme.co.uk/GPU_FFT/Main.htm

KiboOst · July 25, 2020, 9:36am

I have both running so I will look at that. Not at home for a few days but give me a console cmd to look at it please (ssh).

koan · July 25, 2020, 9:39am

If you could just give the %CPU column output of the top command for Snips hotword and Snowboy, that could be helpful.

KiboOst · July 25, 2020, 11:56am

Will do both idle and detecting hotword for both once at home.

rolyan_trauts · July 25, 2020, 3:56pm

When looking for new software and solutions, it’s not a good idea to use something like the Pi Zero as the critical selection criterion.
The Zero is great, but its design and hardware are absolutely antiquated by today’s standards and it will quite likely be superseded.

Rockchip and Allwinner both have similarly priced SoCs that step up to A35 cores or even include codecs or accelerators.
The RockPi-S is a $13.99 quad-core A35 with an audio codec.
I expected a few new releases and boards, but much seems to have gone Covid-shaped in the current uncertainty.

@fastjack I am not sure about gnu-ne10-mfcc, as it’s not used by Linto: to my surprise, their code uses Sonopy (I think that was its name), the Mycroft Python lib.

I have posted this before but https://github.com/JuliaDSP/MFCC.jl looks really awesome and Julia for you Python guys isn’t supposed to be that much of a big deal, but hey that is all I have read

I have been looking at how MFCC & VAD are computed, and they seem to share a common heavy FFT step that for some reason is always split into two libs and threads, whilst an MFCC lib could very well have a few additional parts of the VAD added to it.
It’s only a hunch of mine, but if someone had the talent, it would probably reduce a considerable amount of the heavy processing that feeds the KWS.
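
A rough sketch of that hunch: compute the power spectrum of a frame once and derive both a speech-activity decision and filterbank-style energies from it. Everything here is an assumption for illustration: the naive DFT stands in for a real FFT library, the energy threshold is a toy stand-in for a proper VAD, and the equal-width bands stand in for mel filters.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive O(n^2) DFT power spectrum; a real implementation would use an FFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def is_speech(spectrum, threshold):
    """Toy energy-based activity check that reuses the shared spectrum."""
    return sum(spectrum) > threshold

def band_energies(spectrum, bands):
    """Sum the power in equal-width bands (a crude stand-in for mel filters)."""
    width = max(1, len(spectrum) // bands)
    return [sum(spectrum[i:i + width]) for i in range(0, width * bands, width)]

def analyse(frame, threshold=1.0, bands=4):
    """Run the expensive transform once and feed both consumers from it."""
    spectrum = power_spectrum(frame)
    return is_speech(spectrum, threshold), band_energies(spectrum, bands)
```

The point of the design is only that `power_spectrum` is called once per frame; whether a given VAD can actually work from the same spectrum depends on the VAD.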

If you are going to start programming and adopting something for the future, then the critical choice would be to have NEON support and not to exclude it.
The Ne10 FFT routines are supposed to be really slick, but maybe something more generic like FFTW3 should be used, as it will compile for many platforms; you never know, someone might be bat shit crazy enough to want to use it on x86.

I think that Julia lib does share process with VAD & MFCC but its SAD (Speech Activity Detection) not VAD and don’t joke about being SAD if you wish the author to ever converse again

PS: the GPU FFT is a cul-de-sac, as the memory transfer cost outweighs any benefit. It’s actually on all Pis in the samples in the opt directory and is really a relic.

maxbachmann · July 25, 2020, 10:33pm

I agree that the hardware is not completely up to date, but on the other hand it is used by a ton of people and has reasonable software support, while all the Rockchip/Allwinner chips usually have a pretty small user base. And since it is quite cheap as well, it will be used a lot for satellite devices. That’s why, at least for the wakeword engine, it is important to consider the Pi Zero as well, as long as it is used so much. (In my eyes, what all these “raspberry pi killer boards” do not get is that the reason to use the Pi is not superior performance, but the huge community behind it.)

fastjack · July 25, 2020, 10:53pm

@rolyan_trauts I agree. The FFT part is indeed the heavier piece of the MFCC extraction. Offloading the FFT and maybe the DTW to the GPU might provide quite an improvement.

As I have said above, the gpu-ne10-mfcc lib looks like something that was never really used or even completed. A dead end for me (though the idea to use Neon might be valid)

Using the GPU of the Pi Zero (and of all the other Pis that use VideoCore IV) should not hurt performance and should greatly improve the CPU load (since at the moment everything is done by the CPU).

I’ll be testing FFTW3 on a Pi zero soon to see if it provides improvements on CPU usage (the Gist library I’m using can already use FFTW instead of the default Kiss FFT).

I hope (not a C++ dev so… fingers crossed) to also add GPU FFT into Gist to see the difference between the 3 libs on a Pi Zero… I’ll keep you all posted.

Cheers

rolyan_trauts · July 26, 2020, 10:03am

About the GPU method for the Zero: you can email andrew@aholme.co.uk if you wish, as he is a nice guy and will reply.

Hello Stuart,

Although GPU_FFT was faster than the ARM on the original RPi, Peter Onion told me he got equivalent or better performance from FFTW using NEON instructions on a newer RPi.

I forget exactly what the cause was, although I did do more research. It might be the DMA access of the GPU, but small, rapid frame swaps don’t really suit it, and on anything Pi 2 or above it is slower.

I remember thinking the increases on the Zero are marginal and it’s slower on all the rest, so sod that.
It’s pure daft to waste time on a Pi Zero, as a Pi3A+ gives approx 10x the capacity for just $10 more.

The problem with the Zero now is that it’s not fit for purpose; otherwise you wouldn’t be looking for magic speedups.
Magic speedups don’t exist, and all you are going to do is make something not fit for purpose fractionally faster.

@maxbachmann yeah in general use RaspiOS has much more support but when you have a very specific use like we have here you can be very specific.
I am not even suggesting you use the RockPiS, as I was just using that board to show how antiquated the Pi0 has become when a small-scale supplier can provide Pi3 perf at Zero price.
My suggestion would be to tell users that a Pi3A+ is the minimum supported and there are alternatives such as the RockPiS and Pi0 but don’t waste time hampering the project.

When you are talking $15 satellites, we are actually not in the realm of a general-purpose SoC but much closer to an embedded device with a very specific purpose, and the only criterion for many is price, as costs can quickly add up when talking about the multiples that could be common.
Both the RockPiS & Zero come in at that price, but the Zero instantly hits walls: processing-wise, the answer to any additional audio processing is no. The answer to further security (stunnel) is no, because there is no load headroom left, and all that lovely compatible OS is worth jack shit because the Zero is already maxed out and has no room to fit any more.

It’s totally pointless to be focussing on what is likely soon to be a superseded product that is different from every model since the Pi2.
There are Raspberry products that solve all the problems with the Zero, and it’s much better to be honest and say “yeah, spend $10 more” rather than roll out crap where a community thinks what you supply is rather poor.

You are actually better off with a Pi3 and a USB sound card, as there is very little difference in the end price, with 10x the processing capacity of the Zero.
Or you can use a RockPiS. I have taken the plunge and am doing a systematic evaluation of the cost, function, performance and quality you can expect, and will do a write-up here. And yeah, it’s very much a Zero killer for audio-oriented applications, as that is what it was designed for.
It’s a modern A35 design specifically made for voice AI style apps; it even has a DSP VAD that runs on interrupts for 4 mics and, when in sleep mode, monitors VAD on a single mic to wake.

It’s so specific and at such a good cost price that it’s totally a brain fart of bean overload to ignore, but yeah, for satellites, as I commonly say, my favourite Raspberry SoC by far is the Pi3A+, as it is still a kick-ass piece of kit, whilst the Pi 1-derived Zero really is starting to groan.

The RK3399 became very popular when we were stuck with the Pi3, but the Pi4 has dented the Pine & Radxa offerings quite badly now.
Until we get a Zero 2, the RK3308 is actually a good option, but when the Zero 2 comes out probably less so.
Raspberry being Raspberry, it’s likely to be later, but as a betting man I’d say next year there might be a chance we see all that Pi2-Pi3 IP in a smaller process at Zero prices.
As a betting man, any Zero-specific work you do now probably has a max of 2 years of life.

koan · July 26, 2020, 10:21am

I do agree that the Pi3A+ is an often underestimated part of the Raspberry Pi family, and it makes an excellent (although probably too powerful) satellite device.

However, many people still have Raspberry Pi Zero Ws lying around, once used as Snips satellites. And with the right optimized software, these devices should be perfectly capable of executing this task (Snips managed to do it), and they are cheaper than the Pi 3A+.

I also wouldn’t call the Pi Zero W “soon to be superseded”. The Raspberry Pi Foundation claims that:

Raspberry Pi Zero W will remain in production until at least January 2026

So even if there comes a successor, it’s not like the Pi Zero W will disappear.

So, I wouldn’t focus all our resources on getting the wake word engine on the Pi Zero W working, but I think it’s still worthwhile to have it as a cheap option.

rolyan_trauts · July 26, 2020, 11:30am

The Pi3B+ will be available until January 2023, and at $35 it’s made obsolete by the Pi4 2GB, which is $1 cheaper.

The Pi Zero will run till January 2026, but that doesn’t mean Raspberry will not have a replacement way before then, and the January 2026 date has Zero relevance (like what I did there) to what we might see.
If we don’t there are already alternatives that are far better at the same price or a step up to Pi3A+.
It doesn’t matter what Snips did as its extremely debatable what Snips would do now as they had a team who were extremely forward thinking.

For me who likes EC functionality then the Zero isn’t an option as it just can not do it, full stop.
It’s also not a cheaper option: with HATs and ancillaries the price difference to the RockPiS is minimal, and the RockPiS doesn’t need a HAT codec.
A Pi3A+ with 2x I2S mics can use the 3.5mm and still upgrade to a HDMI2HDMI+Audio extractor without need to jettison the extremely cost effective mics.

Then again, on both it’s still probably better to go for just an extremely cheap sound card coupled with an active mic module, but whether the Zero is cheaper is debatable, as it needs extra $ to make a working solution.

But this is why I am writing: yes, you can say the Pi0W is a cheap option, but as someone who has several gathering dust, would I use one to create a complete satellite?
Hell no, with all the extra cost to complete it: it doesn’t make sense now, never mind in a year or two. And as for the off chance that people may have a Zero W, in the Raspberry niche it’s equally likely that a Pi 2, 3 or even a new Pi 4 is collecting dust.
If your focus is existing owners, it’s not true that the Pi0W is the most common, and if they are already owned they are certainly no cheaper.
But why are you even thinking about the odd occurrence of a Pi owner thinking “I could repurpose that” rather than someone with an interest in voice AI who wants to try a working, best price/performance solution of an open-source design?