Test Results of Acusis S Linear Beamforming Microphone Array with AEC

Ok, now I understand what you were talking about. The standard AEC functionality works great when you send all of the output audio over the Acusis S to its audio jack and then remove those output audio waves from the audio input. Your idea was to send background noise from other microphones to the Acusis S so it gets filtered out via AEC, which would normally cause those microphone streams to come out of the connected speaker. I don’t know if you could configure the Acusis S to output only a specific system audio stream and not the others (the background noise). It might be possible (maybe only by modifying the XMOS code), but I cannot answer your question, sorry.

Yeah, it was just the Philips BeClear documentation that got me hoping the audio jack was an input, connected to an ADC purely for AEC.
Then you could just split your output and also mix in a noise channel. Dunno why someone hasn’t had that idea, but I don’t really have the rocket-science degree needed to work out why AEC seems so sensitive to clock drift that the norm is to play and do AEC via the same DAC.

The BeClear software can do multi-channel AEC, but what you need for that I don’t know.
I might have a way to do it with a bit of ALSA trickery and an additional sound card, but it’s an off chance due to latency matching, and one of these days I will get round to testing it.
I’m just surprised no system apart from high-end conference equipment seems to have thought of a captured noise channel, presuming the DSP on that kit is far better than anything I can do with software.

Thanks for the replies, as I was curious, but generally, from what I hear, I’m impressed with the Acusis.

:+1:

I received my Acusis S yesterday. I haven’t had a great deal of time to use it yet. In my limited testing so far I have to say it’s much more effortless than the ReSpeakers (drivers) & Matrix Voice, both in terms of being plug and play and in its performance. I moved around the house whispering intents from nearby rooms and it performed admirably.

Having said that, I hooked it into my TV for AEC and found I could not wake Rhasspy at what I would say is my usual volume. I don’t yet know why; I haven’t yet investigated/analysed the recordings. Is AEC not working? Is the noise too great? Have I connected it incorrectly due to my tiredness last night? Is the default config not appropriate? Have I totally misunderstood how AEC works? All possible answers at this moment :sweat_smile:

Edit - I did look at the default configuration. From my brief play I found their software for interacting with the microphone doesn’t work on a Pi. I had to plug it into my Windows laptop for a few moments to look at the configuration. Hopefully I won’t need to change too much…


From what I can gather, AEC on that unit is like any other, and I’m not sure how you could hook it up to a TV :slight_smile:
The USB in is an audio DAC that should carry the output of Rhasspy, and the unit uses that to attenuate the same signal on the mic in.
The audio out jack should go to an amp.

The config software, from what I read, is just for selecting beam coverage, so again I’m guessing, but if you did have a noise source you could set a narrower beam?

It’s the weekend soon, so hopefully you will have time to give it a trial.

I had the same experience when trying to play music via the Acusis S audio jack and record myself talking at the same time. The music or TV gets attenuated to a very low volume level in comparison to your recorded voice, but it seems that the AEC is mainly made for filtering out another person talking in a conference setting. That works like a charm. I had an online meeting with some friends and used the Acusis S and my soundbar for it. They never heard themselves talking, although I had them on a high volume. When I tried playing loud music, my friends did hear it at a low volume level. The reason for this might be that it’s focused on voice frequencies, or that a talking person produces less audio input than a song with four instruments and someone singing, and the device maybe does not have enough processing power to filter that out.

What you could try is getting the wake word recognized and letting Rhasspy lower the volume of the music or TV so all the low-volume sound gets cancelled out while you’re talking. Cancelling out low-volume sound is possible; I had some kind of success when I tried it. If you try this whole idea and get it working, please tell me about it :sweat_smile:
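For anyone wanting to try that ducking idea, here is a minimal host-side sketch. It assumes the pyalsaaudio library and a mixer control named "Master" (both assumptions; Rhasspy could equally shell out to amixer from a hook):

    # Hypothetical ducking helpers: drop the output volume on wake word,
    # restore it once the spoken command has been captured.
    import alsaaudio

    def duck(mixer_name="Master", low=20):
        mixer = alsaaudio.Mixer(mixer_name)
        previous = mixer.getvolume()[0]   # current volume in percent
        mixer.setvolume(low)              # duck while the user is talking
        return previous

    def restore(previous, mixer_name="Master"):
        alsaaudio.Mixer(mixer_name).setvolume(previous)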

AEC, of the type we are using, comes from telephony days, when hearing your own voice back with the delay of transmission wasn’t good.
AEC doesn’t attenuate noise that doesn’t pass through its reference channel, and the only reference channel the unit currently has is what it plays.
Like in telephony, the AEC cancels its own voice, i.e. what Rhasspy plays.
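BeClear’s internals are proprietary, but the principle is the textbook adaptive echo canceller: estimate the echo path from the reference (what was played) and subtract the estimate from the mic signal. A toy normalized-LMS sketch, purely for illustration and not the Acusis implementation:

    import numpy as np

    def nlms_aec(mic, ref, taps=256, mu=0.5, eps=1e-6):
        """Subtract an adaptively filtered copy of ref from mic."""
        w = np.zeros(taps)                 # adaptive estimate of the echo path
        out = np.zeros(len(mic))
        for n in range(taps, len(mic)):
            x = ref[n - taps:n][::-1]      # most recent reference samples
            e = mic[n] - w @ x             # error = mic minus estimated echo
            out[n] = e                     # what survives is the near-end voice
            w += (mu / (x @ x + eps)) * e * x  # normalized LMS update
        return out

Anything that never passes through ref (background chatter, a TV on its own amp) is invisible to the filter and comes straight through, which is exactly the limitation described above.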

It uses Philips BeClear technology and yeah it focuses on voice.

https://www.ip.philips.com/licensing/program/114#:~:text=BeClear%20Speech%20Enhancement%20is%20a,to%205%20meters%20and%20beyond.

It’s really just a distance and clarity beast, but noise is the killer of all, even though the Acusis seems to be the king of current beamformers. I am really sorry, but I have mentioned it so many times: with the technology we have, distributed mics, where one can always be near to you and far from noise, are the only solution.

Amazon & Google train their KWS with noise and also apply noise suppression; part of the reason you have noise-sensitive mics is that the KWS you use is noise sensitive.
Another problem is the choice of KWS in Rhasspy, because in terms of accuracy some are OK and some are truly woeful, and irrespective of that, your comparisons mean nothing if you’re not using the same one.
If you take the technology of, say, Raven, it is extremely easy to train and works well in a silent room, but when it comes to error rate and noise, there are reasons why no modern system elsewhere employs that type of basic audio recognition.
Precise as a name is probably an oxymoron, and, perhaps the worst thing in an open-source project, the best performer is Porcupine, which is closed source.

So hi-tech beamformers are equally pointless if fed into a sub-par KWS, as until the model and audio-processing chain are improved, any mic in the presence of noise is likely to fail at fairly low thresholds.
You have an amazing mic, far in excess of Amazon and Google units, and the failure is due to what happens next with the signal.

One of the reasons I feel a soundbar is the best format for Rhasspy is that you can input the biggest sources of domestic noise, TV & hi-fi, and play them through Rhasspy. If you do connect noise channels to a unit like that, it would absolutely trounce the likes of Google & Amazon, if only we had the tensor cores they have in the cloud :slight_smile:

In the software config, narrow the beam to a small radian and point it away from far noise sources; I bet it’s great fun, but the AI technology to decide what to point at is currently cutting edge, with the likes of Facebook’s https://github.com/facebookresearch/denoiser and https://www.nvidia.com/en-gb/geforce/guides/nvidia-rtx-voice-setup-guide/ and things like voice separation.
https://github.com/facebookresearch/svoice

But in a domestic situation the common noise sources are known, so you purely capture at the source, and it doesn’t take AI to decide that.
I would really love to connect that mic you have to RTX Voice as an experiment, but wow, that is so far out of the reach of my pockets. Still, the mics you have are pretty damn amazing.

Hook your output up to an amp, play some audio and then check how well the AEC is working.
The USB on the mic is USB UAC1 and UAC2 (which allows up to 24-bit/192kHz), and Windows only natively supports the lower Class 1.

I politely disagree. The problem with Precise is that there are not many good models available for download, and the documentation is not very good at all, especially on training. I had to figure out a lot of things myself and the learning curve was a little steep.
I myself have trained very robust models with Precise that are both noise resistant and “precise”.
It just takes a bit of effort. The biggest negative is that it’s a bit CPU heavy and so can’t run on a Pi Zero.
I get good results when I train on 40-plus wake word samples from all household members that are then duplicated with added random noise.
I then train this dataset incrementally against an approximately 10-hour-long collection of 1-minute pieces of random audio that I have accumulated since I started training Precise models.
Those are everything from pieces recorded around our household or while we were watching TV to chopped-up audio from YouTube videos like 1 hour of relaxing coffeeshop noises.
Every time I find audio that triggers major false positives, I just add it to the collection so it’s there the next time I train.
Our model right now has about 3-4 false positives a day and works from a couple of meters away while the TV is running (over 200 wake word samples with and without noise, trained against the collection above).
You can find more details and some useful scripts I use for adding random noise or recording wake word samples here in this post:
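Meanwhile, here is a minimal sketch of the noise-mixing step itself (my own illustration with numpy and soundfile, not the scripts linked above):

    import random
    import numpy as np
    import soundfile as sf

    def add_random_noise(sample_path, noise_path, out_path, snr_db=(5, 20)):
        """Duplicate a mono wake word sample with a noise clip mixed in."""
        speech, sr = sf.read(sample_path, dtype="float32")
        noise, nsr = sf.read(noise_path, dtype="float32")
        assert sr == nsr, "resample the noise to the sample's rate first"
        if noise.ndim > 1:
            noise = noise.mean(axis=1)    # downmix stereo noise to mono
        if len(noise) < len(speech):      # loop short noise to cover the sample
            noise = np.tile(noise, len(speech) // len(noise) + 1)
        start = random.randint(0, len(noise) - len(speech))
        noise = noise[start:start + len(speech)]
        # Scale the noise to hit a randomly chosen signal-to-noise ratio
        target = random.uniform(*snr_db)
        gain = (np.sqrt(np.mean(speech**2)) /
                (np.sqrt(np.mean(noise**2)) + 1e-9)) / 10 ** (target / 20)
        mixed = speech + gain * noise
        peak = np.max(np.abs(mixed))
        if peak > 1.0:                    # avoid clipping after the mix
            mixed /= peak
        sf.write(out_path, mixed, sr)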


That is totally pointless, as effort and ability limit the majority to what we have out of the box, and what we have is what I am discussing. Disagree all you like, but all others have to do is check the results.

And all I’m saying is Precise can give results that are very good; it just takes much more effort, but for that effort you do get a personal wake word of your own choosing that doesn’t expire after 30 days. That matters, since the built-in ones from Porcupine often don’t work very well for non-native speakers like myself.


And the majority, who cannot put in that effort, will get those bad results, since, as you confirm, much effort is needed to get good results.

Also, for your info, Sonopy as a piece of code is flawed; it works, but as a feed to any KWS it’s suboptimal.
The RNN model of Precise is rather old, and faster, lighter and newer architectures have been available for some time.

Why we haven’t provided tools to make training easy is a curious one and why we haven’t adopted newer has left me bemused.

Even before we got Raven I pointed out its poor false-accepts-per-hour (FApH) rate, but still it got included?!?

I have also been banging on about tools like Linto HMG to make model training easy and lightweight, and I’m confused that an open-source licence such as the AGPL-3.0 License is so objectionable.

Also why we haven’t progressed to self training mechanisms is also a bit of mystery and why people keep developing at the far end of the audio chain when the input of the system is currently as it is, is equally bemusing.

With Porcupine you have to use a predefined keyword, or it is like you say, but at least on initial install it works quite well and is lightweight.
It’s why I have created an issue to maybe include ‘Raspberry’ or ‘RaspberryPi’, as maybe that could be the default for users who cannot spare the effort or know-how while simple tools are still missing.

@RaspiManu @Enc3ph4l0n

Did you get any info back? Fingers crossed the USB will have sub-devices connected to a hardware mixer where multi-channel audio can be fed.

I had a go at ‘duping’ outputs as part of how I think I can trick the AEC into using multiple channels, but basic duping just doesn’t seem to work the way the ALSA documents say.
Could probably do it with Pulse or Python, but I dunno why the following sounds so bad.

   # Default device: auto-convert via 'plug' into the duplicating 'both' PCM
   pcm.!default "plug:both"

ctl.!default {
  type hw
  card 0
}

# 'both' duplicates one stereo stream onto two cards via the multi plugin
pcm.both {
  type route;
  slave.pcm {
      type multi;
      slaves.a.pcm "card0";
      slaves.b.pcm "card1";
      slaves.a.channels 2;
      slaves.b.channels 2;
      # expose 4 channels: 0/1 -> card0 stereo, 2/3 -> card1 stereo
      bindings.0.slave a;
      bindings.0.channel 0;
      bindings.1.slave a;
      bindings.1.channel 1;

      bindings.2.slave b;
      bindings.2.channel 0;
      bindings.3.slave b;
      bindings.3.channel 1;
  }

  # copy input L/R (0/1) to card0 ...
  ttable.0.0 1;
  ttable.1.1 1;

  # ... and the same L/R to card1 (multi channels 2/3)
  ttable.0.2 1; # front left
  ttable.1.3 1; # front right
}

ctl.both {
  type hw;
  card 0;
}


# dmix playback device for card 1 (lets multiple streams share the hardware)
pcm.card1 {
   type dmix
   ipc_key 1112231
   slave {
       pcm "hw:1"
       period_time 0
       period_size 1024
       buffer_size 8192
#       buffer_size 65536
#       buffer_time 0
#       periods 128
       rate 48000
       channels 2
    }
    bindings {
       0 0
       1 1
    }
}

# dmix playback device for card 0
pcm.card0 {
   type dmix
   ipc_key 1112230
   slave {
       pcm "hw:0"
       period_time 0
       period_size 1024
       buffer_size 8192
#       buffer_size 65536
#       buffer_time 0
#       periods 128
       rate 48000
       channels 2
    }
    bindings {
       0 0
       1 1
    }
}

ctl.card1 {
   type hw
   card 1
}

ctl.card0 {
   type hw
   card 0
}

In fact, looking at that, it just occurred to me that I never tried another resampler, but hey, dunno at the moment.

Hey @rolyan_trauts,

I got answers to your question, and to a question I had, from the antimatter.ai team. Here they are:

1: Cancel out audio from another mic
Question (Manuel):

Is it possible to input two audio streams, with one being the audio stream that needs to be processed by AEC and sent to the speaker, and one being an audio stream from another microphone capturing background noise somewhere else in the room, which also needs to be processed by AEC but not sent to the speaker?

Answer (Andrew Walters, antimatter.ai audio expert):

It might be possible. You’d need to have something running on the host to route the audio from your other mic to one of the Acusis output channels and then configure the audio output so that you’re not sending that audio to your actual speakers that are plugged into the 3.5mm jack. Note we haven’t tried out a scenario like this, so it’s not certain how correlated and time-aligned the two AEC channels need to be for the BeClear AEC to work properly.

Andrew also included information about the audio channels from an upcoming post:

Configuring Audio Channels: The Basics

By default, Acusis S presents itself as a USB Audio Class 1.0 (UAC1) device with 2 input and 2 output channels, but these channels have a very flexible configuration. In this post, we’ll start with the basics and a few examples of how to configure audio inputs for some common use cases. If you haven’t already, check out the quick start which includes download links for the Acusis S configuration tool, which you’ll use below.

The input channels are configured as a pseudo-stereo microphone. That is, the mono audio that is produced by Acusis’ beamformer is panned based on the detected direction of arrival. So, if you’re standing to Acusis’ left, the left (first) audio channel will have higher volume than the right (second) channel. If you move to the right, the audio will pan toward the right.

When in UAC1 mode, the input channels are configured by the audio_in_map_uac1 parameter, which is accessible through the config tool. This parameter consists of two 4-bit fields, where the value of each field determines what is routed to the input, according to this table:

Value  Channel
0x0    Left stereo channel for communication
0x1    Right stereo channel for communication
0x2    Mono channel for communication
0x3    Mono channel for speech recognition
0x4    Raw microphone 1 (left when looking at Acusis S)
0x5    Raw microphone 2
0x6    Raw microphone 3
0x7    Raw microphone 4 (rightmost)
0x8    Left output channel loopback
0x9    Right output channel loopback
0xf    Mute

We’ll look at just the first four values for now, and delve into the others in future posts.

The default value of audio_in_map_uac1 is 0x10, which gives you the pseudo-stereo input shown above. Note that ‘left’ and ‘right’ here are from the point of view of Acusis S (or of a person or camera sitting behind Acusis S), useful if you’re using Acusis S with a conferencing app. If you want to reverse the sense of left and right, simply reverse the channel settings by setting audio_in_map_uac1 to 0x01:

aconfig --set audio_in_map_uac1 0x01

If you prefer mono audio input for your conferencing app, you can set audio_in_map_uac1 to 0x22. With this, Acusis S will send the same mono audio to both input channels. It’s worth noting that some apps that deal with audio (such as Audacity) allow you to select the number of input channels to record, but this will not automatically downmix the channels. It will simply select the first of the two channels, so the setting of 0x22 is a surefire way to make sure you’re getting mono audio, regardless of whether your app is doing downmixing.

Acusis S also provides audio that is optimized for automatic speech recognition (ASR). To the ear, this channel sounds similar to the communication channel, but some of the noise reduction algorithms are tuned to work better with ASR. So if you’re using Acusis S with a voice assistant app, set audio_in_map_uac1 to 0x33.

We’ll have more about audio channels in future posts!

Hope this answers your question. Setting audio_in_map_uac1 to 0x33 to optimize the Acusis S for use with a voice assistant sounds very interesting to me :slight_smile:
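As an aside, from Andrew’s examples the two 4-bit fields appear to pack with the first (left) input channel in the low nibble and the second (right) in the high nibble. That is my inference, not official documentation, but it reproduces all the values above; a quick Python check:

    # Inferred packing of audio_in_map_uac1 (assumption, see examples above)
    def audio_in_map(left, right):
        return (right << 4) | left

    assert audio_in_map(0x0, 0x1) == 0x10  # default pseudo-stereo
    assert audio_in_map(0x1, 0x0) == 0x01  # left/right reversed
    assert audio_in_map(0x2, 0x2) == 0x22  # mono communication on both
    assert audio_in_map(0x3, 0x3) == 0x33  # mono ASR channel on both

So the voice-assistant setting would just be aconfig --set audio_in_map_uac1 0x33, in the same style as the commands above.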


2: Use LEDs only on wake word detection
Question (Manuel):

Is it possible to connect the Acusis S to a voice assistant in such a way that the LEDs only light up and point towards the speaking person after a wake word has been detected?

Answer (Andrew Walters, antimatter.ai audio expert):

The direction of arrival information can be extracted via the config tool or by communicating directly to the Acusis’ config interface which shows up as a virtual serial port. Also, the LEDs are completely controllable through the config interface, so with a little coding on the host side, you should be able to light up the LEDs and hold them steady when a wake word is detected (assuming your wake word detection is also running on the host, as Acusis S itself doesn’t do wake word detection).
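Since the config interface shows up as a virtual serial port, the host side could look something like this sketch (Python + pyserial; the port path and the command string are placeholders, as the actual protocol hasn’t been published yet):

    import serial

    PORT = "/dev/ttyACM0"   # assumed path of the Acusis virtual serial port

    def on_wake_word(doa_angle):
        """Hypothetical: hold the LEDs at the wake word speaker's direction."""
        with serial.Serial(PORT, 115200, timeout=1) as cfg:
            # 'led hold <angle>' is a made-up command; substitute the real
            # config-interface syntax once antimatter.ai publish it
            cfg.write(f"led hold {doa_angle}\n".encode())
            print(cfg.readline().decode(errors="replace"))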


They will be publishing some useful posts containing the channel part above and information about getting the DOA angle (0-180°), detection of voice activity (0 or 1), etc. Andrew also said that he saw some interesting questions in our discussion and that he might join to answer them.

Have to say the antimatter.ai team do seem rather great, as does the functionality of the Acusis.

It’s great that Andrew is so open. The noise channel, given clock drift and matching the mix to the mic capture, is likely a no, but worth a try. I like that attitude of giving it a go even though the odds are stacked against it; you never know.

For me the Acusis is probably the ultimate soundbar mic, as then we have the source rather than ‘noise’ mics.
But so many great features, and they do need to publish more, as wow, that’s more than my initial diggings on their site excavated :slight_smile:

I now have a ton more questions :slight_smile:

Speaking of audio input…
FYI, I just happened to look at ReSpeaker’s page to see if there have been any firmware updates since I last messed with it and… Q5 in the FAQ got my attention.
How to enable 3.5mm audio port to receive the signal as well as usb port?

I may have to try to get the Mic Array V2 back up and running to try this… if I can figure out how to get audio out to both it and my soundbar at the same time… however, I have a feeling that might be more difficult with the timing.

I think the idea was to use another sound card: capture ‘noise’ on it and also play back on it, while also playing a mix that includes the noise to the Acusis, just not using the Acusis output but the sound card’s.
The Acusis would then just be used for AEC on the mic in against played + noise.

Guess you could also do the same here, but how the AEC behaves with clock drift, and matching the mix to the volumes the mic actually gets, could mean it’s not possible, or at least difficult to assess.

Hey guys.

Is the Acusis S able to output audio playback above 16 kHz (unlike the ReSpeaker Mic Array v2)?

Or is it also limited by the XMOS chip capabilities?

Cheers.

Hi all! This is Andy from Antimatter. It’s great to see a lot of interest and discussion in Acusis S. I’ve been catching up on your discussion thread and beyond my initial answers to @RaspiManu via email, I’ll try to fill in some more details.

Regarding the potential use case of feeding audio from another mic into the AEC, as I said it might be possible but it’s untested and will need a little experimentation.

To set this up, you’d need to have something on the host that generates a 2-channel audio stream, with one channel being the audio to be played out of your speaker (say, channel 1 or the ‘left’ channel) and the other channel being your remote mic audio (channel 2 or the ‘right’ channel). The AEC is stereo, so it should cancel both audio streams, but I don’t know if it expects the two channels to be correlated at all. The other potential issue is that if your remote mic is close enough to the Acusis, it will pick up desired audio, which could then get canceled out, so you’d need some experiments with the positioning of the two mics.

With that setup, on the speaker side of things, you can either physically wire things up so only the ‘left’ channel gets played out, or if that’s not possible, you can configure Acusis S to route that channel to both channels of the jack, effectively giving you mono output. To do that, run:

aconfig --set audio_out_map 0x88

Note that settings like this aren’t automatically saved, so behavior will revert back to default after the Acusis is power-cycled. The Acusis quick start shows how to create and save configurations.
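A rough host-side sketch of that two-channel feed, as I imagine it could look using the sounddevice library (device and file names are placeholders, and clock drift between the two cards is the untested part):

    import numpy as np
    import sounddevice as sd
    import soundfile as sf

    RATE, BLOCK = 48000, 1024
    playback, rate = sf.read("announcement.wav", dtype="float32")
    if playback.ndim > 1:
        playback = playback.mean(axis=1)    # force mono playback audio
    assert rate == RATE

    mic = sd.InputStream(device="hw:2,0", channels=1, samplerate=RATE)  # remote mic
    out = sd.OutputStream(device="Acusis S", channels=2, samplerate=RATE)
    mic.start(); out.start()

    pos = 0
    while pos < len(playback):
        noise, _ = mic.read(BLOCK)          # remote 'noise' capture
        speech = playback[pos:pos + BLOCK]
        speech = np.pad(speech, (0, BLOCK - len(speech)))
        # left = what the speaker plays, right = noise reference for the AEC
        out.write(np.column_stack((speech, noise[:, 0])))
        pos += BLOCK
    mic.stop(); out.stop()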


The Acusis S audio output runs at a 48kHz sampling rate, so you should get high quality audio from the 3.5mm jack (note that the jack is line-level only, so if you plug in some unamplified headphones, it won’t sound so great).

The input side also runs at 48kHz, although the processed mic inputs are bandlimited to 8kHz, so you could downsample to 16kHz without any change in quality. It is possible to access the raw microphone inputs (hinted at in the audio config information above), and these do actually run at full bandwidth.
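So if your ASR wants 16 kHz, a simple 3:1 decimation of the processed channel loses nothing. A sketch with soundfile and scipy (file names assumed):

    import soundfile as sf
    from scipy.signal import resample_poly

    audio, rate = sf.read("acusis_capture.wav")     # 48 kHz processed channel
    assert rate == 48000
    audio_16k = resample_poly(audio, up=1, down=3, axis=0)
    sf.write("acusis_16k.wav", audio_16k, 16000)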


One more thing, I’ve seen some interest in this thread about trying out different open-source speech and audio toolkits. If you need to get at the raw microphone data to try out some of their features, you can put Acusis S into UAC2 mode:

aconfig --setdevicemode uac2

Unlike other settings, the device mode is saved to flash immediately, and the firmware will reboot. Acusis S will then enumerate as a new device with a different name (“Acusis S (UAC2)”) and product ID. The big difference you’ll see is that you now get 8 input channels instead of 2. With everything else set to defaults, the raw mics come through on channels 4 through 7 (numbering from 0): leftmost mic is channel 4, rightmost is 7.
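A quick way to check the raw mics once in UAC2 mode (a sketch with the sounddevice library; the device-name matching may differ on your system):

    import sounddevice as sd
    import soundfile as sf

    RATE = 48000
    rec = sd.rec(int(5 * RATE), samplerate=RATE, channels=8,
                 device="Acusis S (UAC2)", dtype="float32")
    sd.wait()                                    # block until the 5 s capture ends
    sf.write("raw_mics.wav", rec[:, 4:8], RATE)  # keep channels 4-7 (raw mics)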

The config software: does it show up as alsamixer or amixer controls on Linux, or is the config software going to be ported to Linux?

Can you post open-source details of the virtual serial port and a complete register list? DOA & VAD are intensive processes on the Pi, so being able to ‘offload’ them to hardware is a big plus.
If there’s no direct port with that info, maybe someone will provide something.

Also, any chance you might set up a simple forum where users can help out and share info?

Also, as well as 48 kHz, is that S16_LE or S24_3LE word length?