Error: overrun! (at least XXX ms long)

Hi everyone,

I'm happily running Rhasspy, but after a few hours (sometimes minutes) I get a weird error out of the blue that looks like this (see the extended log output in Appendix A at the bottom of the post):

overrun!!! (at least 463171.246 ms long)
arecord: xrun:1642: xrun: prepare error: Input/output error

(the number of ms varies from a few minutes to several hours!)

This error seems to originate around https://github.com/rhasspy/rhasspy-wake-porcupine-hermes/blob/91af83bcd5fcbe34e3971fb802dcf0baf6cd7416/rhasspywake_porcupine_hermes/__init__.py#L266

A bit of info:
version: rhasspy 2.5.11 (the hash of rhasspy wake seems to be 91af83b according to github)
platform: raspberry pi zero w
uname -a output: Linux rpi0 5.15.84+ #1613 Thu Jan 5 11:58:09 GMT 2023 armv6l GNU/Linux
how it’s running: I’m running it in a docker container using host’s pulse through alsa as per https://github.com/rhasspy/rhasspy/tree/master/examples/docker-compose-pulseaudio
audio I/O: I’m using a ReSpeaker 6-Mic Circular Array kit for Raspberry Pi and a forked version of the official drivers from here https://github.com/HinTak/seeed-voicecard/tree/v5.16 (nothing major, they’re just applying some patches for forward compatibility)
alsa/pulse host config: default configs provided by the seeed repo here https://github.com/respeaker/seeed-voicecard/tree/8cce4e8ffa77e1e2b89812e5e2ccf6cfbc1086cf/pulseaudio (just commented out minor stuff like suspend-on-idle and rescue-stream as they were respectively not necessary and deprecated)
startup mode: satellite communicating with base over MQTT (snapcast runs as a client as well for what it matters)

In the same way I'm running a snapcast container that, when the overrun occurs in rhasspy, coincidentally throws the following repeatedly:

ALSA lib pulse.c:242:(pulse_connect) PulseAudio: Unable to connect: Connection refused

2023-01-27 13-22-06.726 [Error] (Alsa) Exception in initAlsa: Can't open default, error: Connection refused
ALSA lib pulse.c:242:(pulse_connect) PulseAudio: Unable to connect: Connection refused
2023-01-27 13-22-06.849 [Error] (Alsa) Exception in initAlsa: Can't open default, error: Connection refused

If I restart both containers (rhasspy and snapcast client) everything works again (no need to restart pulse or the Raspberry itself).
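For the record, the workaround is just restarting the two services (the service names below are whatever you called them in your compose file, mine are just illustrative):

docker-compose restart rhasspy snapcast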

From the symptoms it looks like the wake word service keeps listening and filling up the audio buffer without actually reading it until it blows up, but I might be wrong, and maybe someone more familiar with the code can help.
The error seems to originate from alsa-utils arecord itself https://github.com/alsa-project/alsa-utils/blob/v1.1.8/aplay/aplay.c#L1609.
That’s the right version as arecord --version returns

arecord: version 1.1.8 by Jaroslav Kysela <perex@perex.cz>

Also, yes, aplay and arecord are built from the same source file, aplay.c:

$ ls -lah $(which arecord)
lrwxrwxrwx 1 root root 5 Mar 18  2021 /usr/bin/arecord -> aplay

Allow me to reiterate that when everything works, everything works amazingly (both rhasspy and snapcast); just after a few hours (sometimes minutes) this happens out of the blue and breaks both containers.

It's totally possible that I'm asking too much of the Pi and should move to the Zero 2 W, but before getting there I'd like to confirm my suspicions.

Thank you everyone :slight_smile:

edit: I tested it on a Raspberry Pi Zero 2 W with a 64-bit OS and got the same issue. I also tried running the rhasspy container without the snapcast one, and it triggered the issue about 30 s after it started listening for the wake word.

Appendix A: extended log output

[DEBUG:2023-01-26 15:50:37,245] rhasspyspeakers_cli_hermes: ['aplay', '-q', '-t', 'wav', '-D', 'pulse']
[DEBUG:2023-01-26 15:50:38,102] rhasspyserver_hermes: <- NluIntent(input='turn off Kitchen', intent=Intent(intent_name='HassTurnOff', confidence_score=1.0), site_id='satellite1', id=None, slots=[Slot(entity='hass/entities', value={'kind': 'Unknown', 'value': 'Kitchen'}, slot_name='name', raw_value='kitchen', confidence=1.0, range=SlotRange(start=9, end=16, raw_start=9, raw_end=16))], session_id='satellite1-porcupine_raspberry-pi-db4ac4e2-5e1e-440b-acb4-4b8a3918cf64', custom_data='porcupine_raspberry-pi', asr_tokens=[[AsrToken(value='turn', confidence=1.0, range_start=0, range_end=4, time=None), AsrToken(value='off', confidence=1.0, range_start=5, range_end=8, time=None), AsrToken(value='Kitchen', confidence=1.0, range_start=9, range_end=16, time=None)]], asr_confidence=0.999133561, raw_input='turn off kitchen', wakeword_id='porcupine_raspberry-pi', lang=None)
[DEBUG:2023-01-26 15:50:38,142] rhasspyserver_hermes: Sent 660 char(s) to websocket
[DEBUG:2023-01-26 15:50:39,334] rhasspyspeakers_cli_hermes: -> AudioPlayFinished(id='3e94564d-94f8-40f7-b5fc-cab203babc43', session_id='3e94564d-94f8-40f7-b5fc-cab203babc43')
[DEBUG:2023-01-26 15:50:39,351] rhasspyspeakers_cli_hermes: Publishing 99 bytes(s) to hermes/audioServer/satellite1/playFinished
[DEBUG:2023-01-26 15:51:03,431] rhasspymicrophone_cli_hermes: <- AsrStopListening(site_id='satellite1', session_id='satellite1-porcupine_raspberry-pi-db4ac4e2-5e1e-440b-acb4-4b8a3918cf64')
[DEBUG:2023-01-26 15:51:03,446] rhasspywake_porcupine_hermes: <- HotwordToggleOn(site_id='satellite1', reason=<HotwordToggleReason.DIALOGUE_SESSION: 'dialogueSession'>)
[DEBUG:2023-01-26 15:51:03,465] rhasspymicrophone_cli_hermes: Enable UDP output
[DEBUG:2023-01-26 15:51:03,479] rhasspywake_porcupine_hermes: Enabled
[DEBUG:2023-01-26 15:51:03,512] rhasspywake_porcupine_hermes: Receiving audio satellite1
overrun!!! (at least 463171.246 ms long)
arecord: xrun:1642: xrun: prepare error: Input/output error
[DEBUG:2023-01-26 21:15:53,342] rhasspyspeakers_cli_hermes: ['aplay', '-q', '-t', 'wav', '-D', 'pulse']
[DEBUG:2023-01-26 21:15:54,173] rhasspyserver_hermes: <- NluIntent(input='turn off Living Room TV', intent=Intent(intent_name='HassTurnOff', confidence_score=1.0), site_id='satellite1', id=None, slots=[Slot(entity='hass/entities', value={'kind': 'Unknown', 'value': 'Living Room TV'}, slot_name='name', raw_value='living room tv', confidence=1.0, range=SlotRange(start=9, end=23, raw_start=9, raw_end=23))], session_id='satellite1-porcupine_raspberry-pi-7fcc5b72-2d7b-4975-855c-40a1bb715592', custom_data='porcupine_raspberry-pi', asr_tokens=[[AsrToken(value='turn', confidence=1.0, range_start=0, range_end=4, time=None), AsrToken(value='off', confidence=1.0, range_start=5, range_end=8, time=None), AsrToken(value='Living', confidence=1.0, range_start=9, range_end=15, time=None), AsrToken(value='Room', confidence=1.0, range_start=16, range_end=20, time=None), AsrToken(value='TV', confidence=1.0, range_start=21, range_end=23, time=None)]], asr_confidence=0.9881059, raw_input='turn off living room tv', wakeword_id='porcupine_raspberry-pi', lang=None)
[DEBUG:2023-01-26 21:15:56,063] rhasspyspeakers_cli_hermes: -> AudioPlayFinished(id='b3c195ca-6fea-46c6-bf81-3cccf3ef02ec', session_id='b3c195ca-6fea-46c6-bf81-3cccf3ef02ec')
[DEBUG:2023-01-26 21:15:56,095] rhasspyspeakers_cli_hermes: Publishing 99 bytes(s) to hermes/audioServer/satellite1/playFinished
[DEBUG:2023-01-26 21:16:18,770] rhasspywake_porcupine_hermes: <- HotwordToggleOn(site_id='satellite1', reason=<HotwordToggleReason.DIALOGUE_SESSION: 'dialogueSession'>)
[DEBUG:2023-01-26 21:16:18,780] rhasspywake_porcupine_hermes: Enabled
[DEBUG:2023-01-26 21:16:18,764] rhasspymicrophone_cli_hermes: <- AsrStopListening(site_id='satellite1', session_id='satellite1-porcupine_raspberry-pi-7fcc5b72-2d7b-4975-855c-40a1bb715592')
[DEBUG:2023-01-26 21:16:18,794] rhasspymicrophone_cli_hermes: Enable UDP output
[DEBUG:2023-01-26 21:16:18,788] rhasspywake_porcupine_hermes: Receiving audio satellite1
overrun!!! (at least 22620297.518 ms long)
arecord: xrun:1642: xrun: prepare error: Input/output error

additional info: I tried using pyaudio and the error is similar

Expression 'alsa_snd_pcm_poll_descriptors_revents( self->pcm, pfds, self->nfds, &revents )' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 3665
Expression 'PaAlsaStreamComponent_EndPolling( &self->capture, capturePfds, &pollCapture, &xrun )' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 3887
Expression 'PaAlsaStream_WaitForFrames( stream, &framesAvail, &xrun )' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 4285

If I remember rightly, xrun is a buffer underrun.

If you can, start vanilla with Pi OS Lite and the respeaker driver, and test the install with just aplay/arecord.
Then install Rhasspy. Are you using Docker?
You need to share /etc/asound.conf into the container and keep everything at default.

Start simple and build up, but it's probably something to do with the driver, /etc/asound.conf, or Docker if you're using it.

One of those I guess

Hey thanks for getting back!

So, xrun refers to SND_PCM_STATE_XRUN (ALSA project - the C library reference: PCM Interface),
which means overrun when recording and underrun when playing (that's why the X is there). In our case it's the former.

I can confirm I'm running all of this on a clean setup exactly as you mentioned (headless, fresh Raspberry Pi OS Lite) with just docker/docker-compose and the seeed drivers set up.

Also, about my current setup: I don't need to mount /etc/asound.conf into the container since I'm connecting to the host's pulse instance through the alsa plugin, roughly as sketched below (this setup is described in the rhasspy github link in my first post).
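For context, the alsa-to-pulse bridge boils down to an ALSA config along these lines (a generic sketch of the ALSA pulse plugin, not necessarily the exact file shipped in the rhasspy image):

# generic sketch: route ALSA clients through the pulse plugin
pcm.!default {
    type pulse
}
ctl.!default {
    type pulse
}

With something like that in place, anything opening the ALSA "default" device (arecord, aplay) ends up talking to the host pulse instance over the shared socket.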

I also built it up gradually as you said; initially it works fine but breaks after a while. I've been doing some reading and I'm starting to believe it's due to all the stop-and-go when recording and playing.
I found a similar issue on a sister project and they narrowed it down to the "ding" wake word confirmation: https://github.com/evancohen/smart-mirror/issues/287

This might be further corroborated by this statement on the seeed 6-mic wiki page:

Limit for developer using 6-Mic Circular Array Kit(or 4-Mics Linear Array Kit) doing capture & playback the same time:

-1. capture must be start first, or else the capture channels will possibly be disorder.

-2. playback output channels must fill with 8 same channels data or 4 same stereo channels data, or else the speaker or headphone will output nothing possibly.

-3. If you want to play and record at the same time, the aplay music file must be mono, or you can not use this command to play.

which seems to indicate that, for some reason they keep opaque, this device comes with limitations that I can't be entirely sure are respected while using rhasspy (and which rhasspy indeed shouldn't have to care about).

Now: I'm using pulse mostly for its mixing capabilities. It seems I could replace it with an ALSA dmix device (and yes, here I'll need to mount /etc/asound.conf into both of my containers) if I want both rhasspy and snapcast running (apparently I just need the ipc: host directive in my compose manifest). I'm also planning to get AEC working anyway, so I'll try to replicate as much as I can of what's described in Speaker Cancellation.

If I manage to get dmix working I don't see why I'd need pulse, since it would just add CPU overhead. It seems, though, that pulse is a bit more performant and compatible than a pure alsa+dmix setup, so I don't know; maybe someone with a bit more experience can add more context here.

I think the key is getting 2 loopback devices started (one for recording and one for playing) in the right order and with the right setup (as per the seeed wiki).
I believe the source of all the headaches is the continuous start/stop of recording/playing, which the seeed 6-mic module seems to be too fussy about.

If I'm going to be using AEC it seems I'll need these constantly-running devices anyway, so I might kill two birds with one stone.

Yeah, basically the Pi has a single I2S port that I think can normally go up to 192 kHz, but it's 2-channel (there is an L/R clock).
If you want more channels you essentially cheat: you halve the frequency and multiplex the channels, so to get 8 channels we're at 48 kHz max. This is called TDM mode (time-division multiplexing).

There is another limitation: TDM mode isn't hardware supported, and the channels end up in a random order, as there is no sync for which time division the 2-channel words belong to.
I didn't even know about the playback problems, and you know, I have one too and am in the same boat, but I consider at least the 6 mic to be absolute ewaste.
Even if it worked as desired, it has no algorithms to use the mic array to create anything functional that will run on a Pi.
It's just one of those things that was created because they could, but the geometry is based on what fitted and looked good, with an absolute absence of audio engineering.

I always post this as a great 101 to basic beamforming https://invensense.wpenginepowered.com/wp-content/uploads/2015/02/Microphone-Array-Beamforming.pdf

Really, of the Respeaker hats the 2mic is the only one of any use, as the others merely look the part, and generally for the far-field effect many other higher-cost units are not worth the effect they have.
In commercial units the beamforming is used in conjunction with other algorithms in a highly integrated and engineered system; we are sort of trying to do F1 on a box cart.

I have working beamforming code (GitHub - StuartIanNaylor/2ch_delay_sum: 2 channel delay sum beamformer), and GitHub - voice-engine/ec: Echo Canceller, part of Voice Engine project, does a reasonable job of AEC.

You can try with the 6 mic if you wish, but share the /etc/asound.conf; for some reason the respeaker conf has the IPC line but misses the permissions (perm) line.
Unless you use dmix/dsnoop with IPC, alsa is single-use blocking, so any instance opening the hardware directly without dmix/dsnoop will block everything else.
But if you do use an /etc/asound.conf that sets the system default, and use the default everywhere, it should work no problem (rough sketch below; the card name is just illustrative).
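Something along these lines is what I mean, a rough sketch only, with an illustrative card name and arbitrary ipc keys:

# /etc/asound.conf - sketch with dmix/dsnoop and the perm line included
pcm.!default {
    type asym
    playback.pcm "dmixed"
    capture.pcm "dsnooped"
}
pcm.dmixed {
    type dmix
    ipc_key 555555
    ipc_perm 0666    # the permissions line the stock respeaker conf misses
    slave.pcm "hw:seeed8micvoicec"
}
pcm.dsnooped {
    type dsnoop
    ipc_key 666666
    ipc_perm 0666
    slave.pcm "hw:seeed8micvoicec"
}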

Pulseaudio would be great, but it's a pain in a container and can just add to the confusion. I think if the asound.conf is set then it should also just use the default, but again, if you don't, it will create a single-use blocking PCM on alsa, as it sits on top of alsa.

Thanks for the background info about the TDM issue!
I read that because of a production mistake the Pi 3 I2S clock was stuck at 100 kHz with no possibility to change it. Do you know if this is also true for the Pi Zero 2 W? I can't find any info specifically for it.
It seems the Pi 4 can go up to 400 kHz (?): pi 4 - What is Rpi's I2C Maximum Speed? - Raspberry Pi Stack Exchange

About the random channel ordering: I just opened a PR which seems to fix it.
You can find it above my comment here ([Bug]: reference channels shifts (disordered) over multiple recordings · Issue #309 · respeaker/seeed-voicecard · GitHub), and you can check the rest of the discussion for some background. I still don't fully understand why it works, but I actually seem to get consistent channel ordering now!

Also, now that I have consistent channel ordering, I managed to get ec_hw working with the following config:

./ec/ec_hw -i plughw:seeed8micvoicec -c 8 -l 6 -m 0,1,2,3,4,5
aplay -D plughw:seeed8micvoicec -r 16000 16k_s16le_mono_audio.wav

(Yes, ec_hw doesn't provide an input pipe.)
To record you'll need to set up an ALSA or pulse input device with 6 inputs.

If you instead want to verify that the 2 loopback channels are at the end of the group, just run

arecord -D plughw:seeed8micvoicec -f S16 -r 16000 -c 8 to_be_record.wav

and open it in Audacity.

To address a point you made some time ago on the Mycroft forum about why bother with hw AEC when you can just use the playback output:
The point is having a single synchronized clock for input and output so you don't have to bother with delay.
This is the only advantage this type of hardware brings us. They wanted us to do sw AEC from the beginning. I'm still not sure why we have 2 loopback channels (stereo? are we supposed to average them?).
I checked the waveform and it's indeed different even with a mono output.

Also, at a certain point I got a kernel error, BUG: scheduling while atomic, so I had to switch to a 32-bit version of Raspbian, which could be an acceptable workaround for now.

Now my main problem, the last piece of the puzzle: I'm a bit stuck since I created a pulse input with the following config:

# /etc/pulse/default.pa

load-module module-pipe-source source_name=ec.source format=s16 rate=16000 channels=6 file=/tmp/ec.output
[...]
set-default-source ec.source
set-default-sink alsa_output.platform-soc_sound.seeed-2ch

and I also left only the seeed-2ch config

# /usr/share/pulseaudio/alsa-mixer/profile-sets/seeed-voiced.conf

[General]
auto-profiles = no
[Mapping seeed-2ch]
device-strings = hw:%f
channel-map = front-left,front-right,rear-left,rear-right,front-center,lfe,side-left,side-right
exact-channels = false
fallback = yes
paths-output = seeed-2ch
direction = output
priority = 2
[Profile output:seeed-2ch]
output-mappings = seeed-2ch
priority = 4
skip-probe = yes

When I run it through pulse, as soon as I trigger the recording, the whole pulse daemon crashes with

pulseaudio[4762]: Assertion 'pa_frame_aligned(chunk->length, &o->source->sample_spec)' failed at pulsecore/source-output.c:754, function pa_source_output_push(). Aborting.

which comes from here pulseaudio/source-output.c at master · pulseaudio/pulseaudio · GitHub and in turn from pulseaudio/sample-util.c at 3349e1c471f16f46251a51acfc1740cdf012a098 · pulseaudio/pulseaudio · GitHub

Now, I don't understand why the chunk size should be a multiple of the channel sample rate, but that's what seems to be the issue.

It should be the case, since

16000 * 10 / 1000 = 160

which divides evenly into 16000, so I'm not sure :confused:

I also tried to hardcode 256 (a power of two) as the frame size in ec_hw, as you suggested somewhere else, but I still got the same problem. (I've tried to sanity-check my reading of the assertion below.)
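My reading of pa_frame_aligned() (which might be off) is that the chunk's byte length has to be a multiple of the frame size, i.e. channels * bytes per sample, rather than anything to do with the sample rate. A tiny standalone check of that idea, with purely illustrative byte counts:

/* frame_align_check.c - sketch of my reading of pa_frame_aligned():
 * a chunk is "frame aligned" when its byte length is a multiple of
 * channels * bytes_per_sample. Numbers are illustrative, not taken
 * from the actual ec/pulse internals. */
#include <stddef.h>
#include <stdio.h>

static int frame_aligned(size_t len, int channels, int bytes_per_sample) {
    return len % (size_t)(channels * bytes_per_sample) == 0;
}

int main(void) {
    /* s16 = 2 bytes per sample, 6 channels -> 12-byte frames */
    printf("1920 bytes, 6 ch: %s\n", frame_aligned(1920, 6, 2) ? "aligned" : "NOT aligned");
    /* a 4096-byte pipe read is not a multiple of 12 -> the assertion would fire */
    printf("4096 bytes, 6 ch: %s\n", frame_aligned(4096, 6, 2) ? "aligned" : "NOT aligned");
    /* with 1 channel (2-byte frames) the same read size passes */
    printf("4096 bytes, 1 ch: %s\n", frame_aligned(4096, 1, 2) ? "aligned" : "NOT aligned");
    return 0;
}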

I hope this will bring more people up to speed and help us file down those last rough edges. I think we're close to reviving this piece of junk! :smiley:

You seem to be missing the main point of having a multi-mic array, which is so you can do beamforming.
You have just gone through all that and you're still using a single channel (single mic) with Rhasspy, which, as I say, even a couple-of-dollars USB sound card can do; you just plug it in and plug in a mic.
From cost to function you have managed zero advantage for speech processing, but this is great for you, as at least you're able to use the thing, even if you're now stuck on 32-bit and the speedup of 64-bit NEON processing for all ML is halved.

I think it's great you have gone to all this effort, but for anyone else, especially noobs who cannot or won't be bothered with the effort you have just undertaken for no advantage on a £40 hat, it should likely be made clear that this is essentially a piece of pointless e-junk.

Mycroft was my introduction and the 4mic linear was my first hat, and eventually I worked out how to install speexdsp and that the pulseaudio AEC is relatively flaky.
I think it was actually the respeaker 2mic that I might have given away when I sold the 4mic linear; it's my remaining Keyestudio 2mic that is also ewaste, as it's noisy as hell. I just bought another respeaker 2mic to test the difference and it's pretty huge. My memory was that the respeaker was just the same, but it isn't.
It's not just hardware AEC, it's non-linear hardware AEC that is really nice, but I never got the VoiceEN AEC loopback version working.
It's so long ago now, but maybe I just stopped short as the main purpose of beamforming via an array had no way of working.

There is a solution that is much easier, and it's: don't copy commercial units verbatim. You have a choice of the $10 respeaker 2mic or a USB soundcard, as said, from a couple of dollars up to 2-channel ADC ones like the Plugable for $15.
Use some lateral thought: with the 2mic, create a microphone unit on a Pi3 or OrangePi02 (my favourite alternative due to stock) and build a separate active speaker wired to the mic unit, so that via the mic unit it is also a wireless speaker.
Or go the other way and create a wireless speaker unit that has a wired mic via a USB soundcard.

That physical separation of mic and speaker, in conjunction with Speex AEC and 2-channel beamforming, gives pretty reasonable results. Or do like C64 did with the S330, which at £60 for a plug-and-play unit ain't all that bad, and unlike Respeaker, Anker actually update drivers and firmware and actually fix things.

So this is just for anyone new in the community: with all the above and continuing effort, that big lump of a planar mic array hat still isn't doing what an array mic is used for, which is beamforming for far-field. So just don't buy it, irrespective of how loosely you apply the term "revived", as there are much superior, much cheaper and much easier solutions.

You and I have one so we are stuck with it, and I will let you continue and let mine gather dust, as for me IMO it's just a big lump of ewaste: it's still not doing what it was intended for, and the $10 alternative 2mic is better for quite a number of reasons.
PS: with the 2mic, have it vertical so the port isn't pointing at the ceiling, as that again will improve performance, and the PCB itself will provide an element of rear rejection from reverberation.

Not everybody is playing music or wants to be able to tell their Rhasspy to "stop", and you don't have to separate speaker and mic, but building a good housing that has an element of isolation is another huge undertaking, whilst separation makes it so much easier. Probably the main killer is not playback but third-party noise from the TV and other media.
Even then, if you want to combine them, the 2mic, being a quarter of the price, is a much better format and much less work, and it does run on 64-bit. USB cards are just plug and play, and unlike a hat you can add as many as your system will run with a hub; or, like C64, the Anker S330 for $60 looks a reasonable buy.

So a long-winded reply and not really directed at you, but for others who may not have a respeaker 4/6 mic hat: just don't, as without much effort and a knowledge level like jacopo's you're going to have a nightmare, especially with kernel updates. :slight_smile:

I have been waiting for the slow boat from China, as I will likely do some full hardware build articles with the 2mic/USB, probably supply a ready-to-go image of various build types, and demonstrate how much better they really are.
I have been meaning to do it for ages, but my preferred Pi, the Zero 2, seems permanently out of stock; I think Farnell re-sent me invoices last week just to wind me up!
It's back to my previous favourite of the Pi3A+ for a "satellite", or my current favourite alternative of the OrangePi02.

Have you tried the ODAS demo? I just gave it a go and it seems to be able to detect the direction of multiple sources.

I was thinking of gluing EC on top of ODAS and mixing the 6 mics down into 1 after the EC step.
Something similar to what you did here (2ch_delay_sum/ds.h at main · StuartIanNaylor/2ch_delay_sum · GitHub), but with a bit more math (converting azimuth and elevation into an attenuation for each of the 6 mics).

It'll look like an ugly Frankenstein, but on the bright side I can get visual feedback through the webapp for troubleshooting (which will be disabled during normal operation).

To add to this, after having tested the Anker S330 for a full week in two locations in my house, the unit has performed very well for speech recognition and its speaker output is loud, robust, and clear. Installation is painless and at least in a Debian OS, it’s truly plug and play.

I think for around $60, the device is definitely worth the small investment, especially if you are setting up a master and satellite configuration like I am. In fact, I’ve ordered two more so I can tap into Rhasspy nearly anywhere in the house.

@C64ever Yeah, the S330 is half the price I paid for the original Powerconf, and if you want plug and play, stick it in and go.
From my last test I thought audio-wise it was OK; as a subjective review, it's not as good sounding as some of the "mini" variants of commercial smart speakers, but that was the older Powerconf.
I am calling it a day on buying products to review, as over the last 2 years I have created a mountain of failed devices, so the Anker S330 for $60 is a good price.
It is primarily a conference device and doesn't do the targeted beamforming that commercial smart speakers do, and that subtle difference can have an effect.

@jacopo
Yeah, I tried ODAS a long time back; it sort of maxes out a Pi3+. Its tech is pure algorithm only, whilst NN/ML based solutions will likely provide much better results for much less load.
Give it a go if you wish, but I have my sights on being able to use something like the Pi02 or Pi3A+ as networked conference mics to a single, bigger, more capable smart assistant, as voice is absolute textbook client-server since the majority of the time it's idle.

Same thing with load, as currently the only load is a single TDOA calc with 2ch_delay_sum/ds.h at main · StuartIanNaylor/2ch_delay_sum · GitHub; that is the delay calc, and the sum is just that, a simple sum of the channels, and zero load in comparison.
TDOA measures the time difference of arrival between mics. There is no elevation with a planar array, and if it's not planar, yeah, you could "Frankenstein" it to get elevation, but no one really bothers with elevation, as generally with voice we are roughly of similar height and azimuth is the only need. Smart speakers, being speakers, have even started to acknowledge that they are speakers and generally front facing, and even planar 360 is not needed, as they're very rarely central like boardroom table mics.

Delay-sum only lends itself to broadside and endfire arrays, where a hybrid of the 2 can create a forward-facing steerable array, but the geometry of the 6mic is totally fubar for that purpose due to aliasing.
There are more complex algorithms, MVDR and GSC, that calc the delay for individual frequency bins to combat the aliasing problem; they are orders of magnitude more load and maybe what ODAS does.

Beamforming even with a 6mic planar is not very exact, and its biggest function is to dereverberate a target, which is why Apple in its new Siri has reduced the mics to 4 to cut cost, as the effect it gives over a certain level is just not worth it, and that is in a $250 smart speaker from a tech giant like Apple.
You actually only need 3 mics for 360° directional, but 4 likely gives more resolution. I have run C/C++ MVDR/GSC and it got binned because of the load.
I could do a delay-sum on 4 mics, probably using only a single TDOA, as the rear direction is the opposite of the front 2 since it faces the opposite way, and with a known geometry it's Pythagoras, then TDOA, to find the distance between the 2 broadsides to make a hybrid endfire, but the mic geometry is critical.
By pure luck the width of the Pi Zero is relatively close to the broadside geometry of a 2mic array, which only beamforms on the X axis and has no rejection on Y, apart from pointing the mic ports to face incoming voice rather than having them face the ceiling.

Do that simple orientation, and an enclosure that isolates the rear and further ports the forward-facing mics, and that is how many very effective 2mic cam mics work; because of that, reverberation is attenuated, far field is extended, and rear rejection comes from the enclosure.

Generally, the only reason we have certain hats is that you need the hardware before you can do the software: they did build the hardware, but the software turned out a relative flop, as some bright spark realised "hey, why am I trying to beat physics" and just did what they do in real professional conference rooms: simply use multiple single mics.

So yeah, @jacopo, have a play, as it's really interesting; use the great Invensense 101 app note (https://invensense.tdk.com/wp-content/uploads/2015/02/Microphone-Array-Beamforming.pdf) as a base and explore.
For noobs who just want something low cost that will work OK: get a respeaker 2mic and have it vertical, facing the speaker, as on a Pi there is no Magic Mic hat and you will just end up more like Channing Tatum.
In fact, use an existing USB soundcard, as unlike a hat, if you want another mic you just plug in another USB card.

There are 2 really exciting projects that in their current guise again hit load barriers, but if quantised, and with a smattering of low-level optimised coding, maybe they could be very useful for Linux input voice processing.

The 1st one is the basis of why Google has dropped to lower-cost 2mics, but they have taken targeted voice extraction to a whole new level via a short enrollment process, whilst Amazon is on 6 mics and leaking losses like a sieve. Google is actively pushing voice processing local, as otherwise it's a whole load of server load with no revenue; if they own the platform they can still dictate the services that are the revenue earners.
The 2nd is a real shame, as 32 kHz (enough, by Nyquist-Shannon, for 16 kHz bandwidth) would likely be no different in quality than the fullband 48 kHz audio, but the real kicker is the NN framework chosen, Sonos Tract, as it's single-thread only. The great optimised Rust code is hamstrung by those two to varying degrees, but run it on something with enough oomph and it's awesome (with a core much faster than a Pi, which the single thread dictates).

I managed to get both EC and beamforming working by combining ec and ODAS!

I have a small issue with buzzing on both inputs and outputs, which is causing issues with the wake word and speech-to-text.
For instance, when rhasspy talks to confirm the action you hear "Turn livin[bzz] of[bzz]" instead of a clear "Turn livingroom off".
Similarly, if I replay my commands I hear "Turn o[bzz] livin[bzz] light".
Does this sound familiar to anyone?
I'd point my finger at the different frame sizes on the input, but I get the same issue on the output, and that part mostly came from the pre-existing seeed docs, so I'm not really sure it's related.

edit: it seems the buzzing was due to an incorrect sample rate. Changing seeed-voicecard/daemon.conf at 8cce4e8ffa77e1e2b89812e5e2ccf6cfbc1086cf · respeaker/seeed-voicecard · GitHub to 16000 really improved things.
I still have the same issue on the input, so I guess I'll have to tweak the sample rate settings for that as well, probably in multiple places and also in the odas config (?)

====================
From here on is what I did in order to get EC+ODAS working; feel free to skip if not interested.
I added this to the pulse config:

# /etc/pulse/default.pa
load-module module-pipe-source source_name=ec1.source format=s16 rate=16000 channels=1 file=/tmp/ec.output
load-module module-pipe-source source_name=ec.source format=s16 rate=44100 channels=4 file=/tmp/postfiltered.raw
[...]
set-default-source ec.source
set-default-sink alsa_output.platform-soc_sound.seeed-2ch

(The first source is not really used; it's just to let pulse handle the fifo files. Also, 1 channel is not correct, but putting 6 will result in a pulse crash due to i2c sync issues.)

and these are the sections I changed in the odas config provided in the seeed wiki:

raw: {
    fS = 16000;
    hopSize = 128;
    nBits = 16;
    nChannels = 6;

    # Input with raw signal from microphones
    interface: {
        type = "file";
        path = "/tmp/ec.output";
    }
}
# ...
postfiltered: {
    fS = 44100;
    hopSize = 512;
    nBits = 16;

    interface: {
        type = "file";
        path = "/tmp/postfiltered.raw";
    }
};


Now what happens is: ec writes to /tmp/ec.output, odas reads from there and writes the postfiltered output into a pipe at /tmp/postfiltered.raw, from which pulse reads.
The output goes through pulse as per the original instructions; no changes were made there.

I'll write a recap in a github markdown file or a blog post that goes through everything I did, but most of it is already in this thread.

OK, I think I made it! The last thing was to set a good resampling algorithm for pulse.
Just set soxr-vhq here (seeed-voicecard/daemon.conf at master · respeaker/seeed-voicecard · GitHub)
and you should have good quality audio coming both from the microphone and from the speaker!
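For reference, these are roughly the two lines I ended up with in daemon.conf (just the relevant excerpt; everything else stays at the seeed defaults):

; /etc/pulse/daemon.conf (excerpt)
default-sample-rate = 16000
resample-method = soxr-vhq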

Just to recap:
I have a working satellite running on a Raspberry Pi Zero 2 W.
On this satellite I'm running both rhasspy (satellite) and snapcast (client) in docker containers.
Audio (both input and output) is provided by the host pulseaudio instance.
EC and ODAS provide echo cancellation and source detection/beamforming respectively.
CPU load is about 40% (+/- 5%).

I'm going to tidy things up (init scripts, some testing) and put together a writeup that goes through all the steps.

To anyone reading this: 90% of what was necessary to achieve the same result is described above; the remaining 10% was linked or comes from official sources, so don't wait for me if I haven't posted the writeup. I can't promise when it's going to land :slight_smile:

I have just been playing with the voice-en AEC and it really does a good job for the load, in comparison to the delay-sum beamformer load I hacked.

Here is a wav run through AEC:
https://drive.google.com/file/d/1gsRN8jYFZuqkl53J6-p5YU-Dz9jTPERo/view?usp=sharing

Here is the same wav at the same volume with no AEC, direct:
https://drive.google.com/file/d/1Dsi5Qu4owV1_EMFz1tLavrvblhxn2HkE/view?usp=share_link

@jacopo I suppose you're not a wiz with C/C++; I got that beamformer working purely by hacking existing code, but the code is terrible and I have a hunch it could see up to a 400% improvement, maybe even more as it's that bad.

I have to reappraise the Speex AEC: it is attenuation, but wow, it does it with low load, about 20% of a core on a Pi02, if I get the settings correct as above.

So after a bit of testing I noticed the mics are a bit too dynamic: if I spoke from the middle of the room rhasspy didn't pick up the wake word, and if I was too close it saturated when I pumped the mics up too much in alsamixer.

This sounds like a job for a compressor; unfortunately, despite many people suggesting Steve Harris's LADSPA plugins, I wasn't able to apply them to my input.

I therefore implemented a solution to apply a chain of effects to a named pipe, which I called pipefx, by adapting voice-engine/ec.
Now my input is finally levelled and more or less the same volume whether I'm talking next to the mic or from the other side of the room.
The CPU load is pretty much nonexistent (~3% if I apply only the compressor).

The other thing I finally finished and published was a system to drive the LEDs and react to button presses over MQTT. Please give it a go: mqtt respeaker pixel ring.

I suppose you're not a wiz with C/C++; I got that beamformer working purely by hacking existing code, but the code is terrible and I have a hunch it could see up to a 400% improvement, maybe even more as it's that bad.

I don't have much experience profiling C/C++ on Linux; I've done some for node.js/Java/MSVC. I know that on GNU/Linux Valgrind is the state of the art for profiling.
You might want to give it a try to see if you can identify any bottlenecks?

P.S. I have a feeling something's off with voice-engine/EC as well, as I'm still getting some output in my inputs. The examples of Speex AEC I saw were slightly more complex than this implementation; EC just calls speex_echo_cancellation with no preprocess or anything else. It seems like something's missing :confused:
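To illustrate what I mean by "slightly more complex", the fuller speexdsp pattern I had in mind looks roughly like this; it's only a sketch against the speexdsp API with illustrative frame/tail sizes, not what ec currently does:

/* aec_sketch.c - minimal speexdsp AEC + preprocessor loop (illustrative sizes) */
#include <speex/speex_echo.h>
#include <speex/speex_preprocess.h>

#define RATE  16000
#define FRAME 160    /* 10 ms at 16 kHz */
#define TAIL  1024   /* filter length ("tail"), illustrative */

int main(void) {
    SpeexEchoState *echo = speex_echo_state_init(FRAME, TAIL);
    SpeexPreprocessState *pp = speex_preprocess_state_init(FRAME, RATE);
    int rate = RATE;

    speex_echo_ctl(echo, SPEEX_ECHO_SET_SAMPLING_RATE, &rate);
    /* attach the echo state so the preprocessor can do residual echo suppression */
    speex_preprocess_ctl(pp, SPEEX_PREPROCESS_SET_ECHO_STATE, echo);

    spx_int16_t mic[FRAME] = {0};   /* near-end capture frame */
    spx_int16_t play[FRAME] = {0};  /* far-end (loopback/playback) reference frame */
    spx_int16_t out[FRAME];

    /* per-frame processing: cancel first, then post-filter the residual */
    speex_echo_cancellation(echo, mic, play, out);
    speex_preprocess_run(pp, out);

    speex_echo_state_destroy(echo);
    speex_preprocess_state_destroy(pp);
    return 0;
}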

Lols, it doesn't need valgrind, it's just a first-time C/C++ hack and just bad :slight_smile:

Speex is just attenuation, not cancellation, but unlike the WebRTC AEC it doesn't fail above a certain level.
Like in the above 2 wavs, you need to get the delay right due to latency from the speaker, and also try to have some form of damping isolation between mic and speaker, as it can resonate and often create a distorted, reverberated input at the mic.
The above is generally what I expect in attenuation with the Voice-en EC and Speex.

Start ec with -s and it will record just the speaker output whilst playing; then in the util folder run the py scripts to get the delay of the recorded tmp files, which you can then enter.

There is also Speex AGC, but on Debian there is always a mismatch between the version of Speex and what alsa-plugins expects, so compile the release of Speex to replace the release candidate and then recompile alsa-plugins, and you should then have the AGC. That said, I am not a fan, as really it should have a max gain: its only setting is a sort of "rate", and it can be annoying, as it will ramp the noise floor far too high on no input, but lower the current setting and the speed of change will be too slow.
It's fairly easy to add a max gain to the plugin.

I'm not sure I understand what you mean by "Speex is just attenuation, not cancellation".
They call it an Acoustic Echo Canceller here (Speex: SpeexEchoState: Acoustic echo canceller); do you think that's incorrect?

About the delay: you shouldn't need that if you use HW loopback channels, since the echo canceller should be able to look ahead slightly; after all, echo is by definition

a reflection of sound that arrives at the listener with a delay after the direct sound

The whole purpose of having loopback channels is to have a synced clock which shouldn't drift due to slight differences in timing from kernel scheduling.
That's why in the ec_hw version of ec you don't have a delay parameter.
In non-HW ec you can input a delay in order to compensate for kernel drift and reduce the window that the AEC needs to look through.
Also, the mics are located in different places, so they'll always be slightly offset with respect to the output.

I did some debugging on a trimmed-down implementation (simply reading a pre-recorded file and writing the processed output to a file) and it seems the issue with ec_hw was due to the filter_length parameter.
Lowering it from the hardcoded 4096 to a much smaller 160 (i.e. 10 ms) provided much better results (please see the attached image).
The values of 100-500 ms suggested by the speex docs produced intermediate results.
Going under 10 ms didn't produce any meaningful improvement.
The AEC is able to converge much faster after this change.
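In code terms that change is just the second argument of the echo-state init (a sketch using the speexdsp API name; 160 is the value I settled on):

/* frame size 160 (10 ms at 16 kHz), filter_length ("tail") dropped from 4096 to 160 */
SpeexEchoState *st = speex_echo_state_init(160, 160);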

I tried applying different filters with the following code to each individual channel:

       /* den_mc[i] is the per-channel SpeexPreprocessState, st is the SpeexEchoState */
       speex_preprocess_ctl(den_mc[i], SPEEX_PREPROCESS_SET_ECHO_STATE, st);
       n = 1;    /* denoise on */
       speex_preprocess_ctl(den_mc[i], SPEEX_PREPROCESS_SET_DENOISE, &n);
       n = 0;    /* AGC off */
       speex_preprocess_ctl(den_mc[i], SPEEX_PREPROCESS_SET_AGC, &n);
       n = 8000; /* AGC target level (unused while AGC is off) */
       speex_preprocess_ctl(den_mc[i], SPEEX_PREPROCESS_SET_AGC_LEVEL, &n);
       n = 0;    /* dereverb off */
       speex_preprocess_ctl(den_mc[i], SPEEX_PREPROCESS_SET_DEREVERB, &n);
       f = .0;
       speex_preprocess_ctl(den_mc[i], SPEEX_PREPROCESS_SET_DEREVERB_DECAY, &f);
       f = .0;
       speex_preprocess_ctl(den_mc[i], SPEEX_PREPROCESS_SET_DEREVERB_LEVEL, &f);

None of it provided any meaningful improvement, apart from SPEEX_PREPROCESS_SET_ECHO_STATE, and even that was quite negligible (I'd say roughly a 3% additional silencing factor).

You don't get any clock drift if the capture and playback share the same clock. Usually a hardware loopback is needed when capture and playback are from different sources. It's purely processing latency; there may be an offset (delay), but because you are recording and playing from the same clock the offset should be fixed in the stream.

Also, the delay is always going to be there, as it's the time sound travels from speaker to mic, and all a hardware loopback is doing is removing hardware latency, which without it you can just add to the delay.
I use an Alexa on a 3.5mm jack, and there is always a reasonable distance, so I need to set a delay in samples.
Sound will travel 21.43mm per sample at 16 kHz, so you probably want a tail just a bit larger than the frame to give it some wiggle room.
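(Quick arithmetic on that: 343 m/s divided by 16000 samples/s is roughly 21.4 mm per sample, so e.g. a 30 cm speaker-to-mic gap is about 14 samples of acoustic delay before you even count any processing latency.)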

They are all just general terms, as really there is no "echo" as such, just reverberation of sound bouncing off surfaces. Echo and reverberation are the same thing, but reverberation tends to mean many short-delay echoes whilst echo often relates to a single longer-delay echo; really they're just terms.
WebRTC, when you listen, cancels better than Speex at lower SNR and then completely fails above a threshold, whilst Speex continues to attenuate; but it depends on the SNR, as at fairly low volumes it will look like cancellation, while at higher SNR it will continue to attenuate. As you can see in the above, where I am just clipping with the played music.

I call the filter length the "tail"; a 10 millisecond frame searches for that pattern in the "tail" window, and I should have said I usually drop to 2048, as the bigger it is, the more load.
With a "tail" of 160, the same as the frame, it's exact, as if there is no leeway for travel, like some "line cancellers" do for telephones to stop the feedback of played audio; but since I was using an Alexa as an external speaker I am going to get more reverberation due to everything surrounding it, so I have a bigger tail.

If you crank the volume up (which actually I had too loud, as it's clipping and causing more failure just at the peaks), with a bigger tail it attenuates rather than cancels, but likely enough that a KWS would still work.
Clipping is bad, as it creates harmonics and whatever is outside the range is lost, so effectively the waveform has changed; that's probably why you can hear the glitches on the bass beat.

All AEC is attenuation, as none is perfect since the speaker and travel distort the signal, but these are just terms I use to differentiate between WebRTC and Speex: WebRTC IMO is sort of useless, as it fails over an SNR threshold, but it seems to do a better job of cancellation at lower levels. WebRTC cancels but fails above a threshold whilst Speex continues to attenuate, and that's just how I have come to differentiate them.
I always found bass with Speex seems to bleed through more, but I'm sort of confused, as SPEEX_PREPROCESS_SET_ECHO_STATE sets whether AEC is on or not, as in:

? 3%: the dereverb was never completed as far as I know, and the denoise just creates artefacts, but I'm confused: if you turn AEC off do you only get approximately a 3% difference, or am I reading that wrong?

PS, whilst you are tweaking, try setting the frame to a power of 2 rather than 10 ms, say 128 or 256, instead of the 10 ms default it creates at 16000 Hz with the Voice-en EC; that is probably why your "tail" fails below 160, as it becomes smaller than the frame, but you could try a smaller frame with a smaller "tail".
"Power of 2", if I remember rightly, is supposedly more efficient, even if it's hard to tell.
I never did have the speaker and mic combined in an enclosure, as I need to find a better solution to isolate the mics, especially with the bass that seems to resonate and flood the mics, where you're turning the gain down so low that the voice input is near nonexistent to filter, even with pre and post AGC.

You don’t get any clock drift […]

The problem is that kernel scheduling can't be predicted and varies between platforms and load (unless you're on an RTOS where you know exactly what everything is doing on each clock cycle).
So each time you need to roughly estimate the delay between playback and loopback in order to let the ec look through a smaller window.
Having loopback channels is handy in this case, so you don't have to bother estimating this offset.

Also the delay is always going to be there […]

Agreed: and compensating for a small delay window is the EC's job. I don't follow you when you say I need a larger tail, as 10 ms equates to 160 samples and (according to your calculations) should provide enough time for sound to travel 3428.8 mm (roughly 3.4 m), which is way more than the distance between my speaker and the respeaker mics. If anything a smaller window is better, as it would allow the EC to converge much faster (provided it covers the distance between speaker and mics).
So in our case, if my speaker is kept, say, at 5 cm from my mics, 3 samples for the tail should be enough and should help the EC converge much faster.
BTW, I just tried that and it seems to be the case. It actually works even with 1 sample for the tail. Interesting…

If you crank the volume up (which actually I had too loud, as it's clipping) […]

I think there's not much point trying to cancel clipped audio, as basically what you're getting is garbage even leaving harmonics aside: from a simple math standpoint, if you subtract a loopback sample of 0.5 from a mic sample saturated at 1.0 whose actual value is 1.5, you get 0.5 instead of 1.0.

All AEC is attenuation, as none is perfect since the speaker and travel distort the signal […]

And it creates harmonics depending on which surface it bounces off… maybe an AEC based on LIDAR that knows the geometry of the room around it? But even then, different materials reflect wavelengths at different angles and velocities, so you'd still get echo.
You'd need something complex like ray tracing + LIDAR, which would still have a limited number of bounces, and we're looking at something waaaay overkill for what we're trying to achieve here :smiley:

? 3%: the dereverb was never completed as far as I know, and the denoise just creates artefacts, but I'm confused: if you turn AEC off do you only get approximately a 3% difference, or am I reading that wrong?

let me clarify this


As you can see, there's not much difference between the last two waves, but I saw that enabling SPEEX_PREPROCESS_SET_ECHO_STATE only added an additional 10% load on my rpi02w, so I ended up leaving it enabled since it was pretty cheap after all.

PS, whilst you are tweaking, try setting the frame to a power of 2 rather than 10 ms, say 128 or 256, instead of the 10 ms default it creates at 16000 Hz […]

done here


I changed the number of samples I'm reading from the input to 256.
I can see some better muting in certain sections, which I initially attributed to the EC benefiting from the larger window compared to 10 ms (256 > 160), but then I tried with 20 ms (320 > 256) and the power of 2 was still superior. You might be onto something there; it's worth playing with a bit more.
Although I couldn't hear any noticeable difference, the wave is definitely flatter.

BTW, it took me a while to answer because I was working on a big addition to the voice-engine/ec project.
I made this PR (ec_hw auto channel detection by jacopomaroli · Pull Request #27 · voice-engine/ec · GitHub), which adds automatic loopback channel detection and enables the preprocess filters for Speex. Give it a go when you have some time.

Yeah, what you need is pre and post AEC AGC. I have never tried the Speex AGC plugin multichannel, and if it can't do that, maybe it could be hacked to. It would work as a post filter but probably needs some work anyway, as with the alsa-plugin there is only a single control param, which is equivalent to a sort of rate for attack/decay gain; really a max gain should also be included so it doesn't ramp up the noise floor on quiet passages.

The construction and isolation of the mic from the speaker is really important here: if the speaker floods the mic due to close proximity, then the voice level will be well down into the noise floor once the gain is adjusted to stop clipping, while anything with no correlation to the ref signal will continue.
You would want the mics behind the front plane of the speaker and mounted so they don't pick up resonance through the enclosure.

I will have a look some time: ec_hw auto channel detection by jacopomaroli · Pull Request #27 · voice-engine/ec · GitHub