Test Results of Acusis S Linear Beamforming Microphone Array with AEC

I had the same experience when trying to play music through the Acusis S audio jack and record myself talking at the same time. The music or TV gets attenuated to a very low level compared to your recorded voice, but it seems the AEC is mainly designed for filtering out another person talking in a conference setting. That works like a charm. I had an online meeting with some friends and used the Acusis S with my soundbar for it. They never heard themselves talking, even though I had them at a high volume. When I played loud music, though, my friends did hear it at a low level. The reason for this might be that the AEC is focused on voice frequencies, or that a single talking person is a much simpler signal than a song with four instruments and someone singing, and the device maybe does not have enough processing power to filter all of that out.

What you could try is getting the wake word recognized and then having Rhasspy lower the volume of the music or TV, so that the remaining low-volume sound gets cancelled out while you talk. Cancelling out low-volume sound is possible; I had some success when I tried it. If you try this whole idea and get it working, please tell me about it :sweat_smile:
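In case it helps, here is a minimal sketch of that idea, assuming Rhasspy's Hermes MQTT topics, a local broker, and an ALSA 'Master' control; the volume level is just a placeholder:

import subprocess
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # Duck the output as soon as any wake word fires; restoring the volume
    # afterwards (e.g. on hermes/dialogueManager/sessionEnded) is left out.
    subprocess.run(["amixer", "set", "Master", "20%"], check=True)

client = mqtt.Client()  # paho-mqtt 1.x style client
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("hermes/hotword/+/detected")
client.loop_forever()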

The type of AEC we are using comes from telephony, where hearing your own voice back with the transmission delay wasn't good.
AEC doesn't attenuate noise that doesn't pass through its reference channel, and the only reference channel it currently has is what it plays.
Like in telephony, the AEC cancels its own 'voice', i.e. whatever Rhasspy plays.
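To make that concrete, here is a toy NLMS echo canceller in numpy (assuming float arrays at the same sample rate). It only illustrates the principle: the filter can only subtract what it has a reference for (the playback signal), so room noise that never enters the reference channel passes straight through.

import numpy as np

def nlms_aec(mic, ref, taps=256, mu=0.5, eps=1e-6):
    # Adaptive FIR filter estimating the speaker-to-mic echo path.
    w = np.zeros(taps)
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]       # most recent reference samples
        echo_est = w @ x
        e = mic[n] - echo_est           # residual = mic minus estimated echo
        w += mu * e * x / (x @ x + eps)
        out[n] = e                      # echo removed; other noise untouched
    return out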

It uses Philips BeClear technology and yeah it focuses on voice.

https://www.ip.philips.com/licensing/program/114#:~:text=BeClear%20Speech%20Enhancement%20is%20a,to%205%20meters%20and%20beyond.

It's really a distance and clarity beast, but noise is the killer of everything, even though the Acusis seems to be the king of current beamformers. I am really sorry, but I have mentioned it so many times: with the technology we have, distributed mics, where one can always be near the speaker and far from the noise, are the only solution.

Amazon & Google train their KWS with noise and also apply noise suppression; part of the reason you have noise-sensitive mics is that the KWS you use is noise sensitive.
Another problem is the choice of KWS in Rhasspy, because in terms of accuracy some are OK and some are truly woeful, and irrespective of that, your comparisons mean nothing if you're not using the same one.
Take the technology behind, say, Raven: it's extremely easy to train and works well in a silent room, but when it comes to error rate and noise there are reasons why no modern system elsewhere employs that type of basic audio recognition.
Precise as a name is probably an oxymoron and likely the worst performer in an opensource project; the best performer is Porcupine, and that is closed source.

The same goes for hi-tech beamformers fed into a sub-par KWS: until the model and the audio processing chain are improved, any mic in the presence of noise is likely to fail at fairly low thresholds.
You have an amazing mic, far in excess of Amazon and Google units, and the failure is due to what happens next with the signal.

One of the reasons I feel a soundbar is the best format for Rhasspy is that you can input the biggest sources of domestic noise, TV & hi-fi, and play them through Rhasspy. If you connect noise channels to a unit like that, it would absolutely trounce the likes of Google & Amazon, if only we had the tensor cores they have in the cloud :slight_smile:

In the software config, narrow the beam to a small radian and point it away from far noise sources; I bet it's great fun, but the technology of using AI to decide what to point at is currently cutting edge, with the likes of Facebook's https://github.com/facebookresearch/denoiser, https://www.nvidia.com/en-gb/geforce/guides/nvidia-rtx-voice-setup-guide/ and things like voice separation:
https://github.com/facebookresearch/svoice

But in a domestic situation, common noise sources are known, so you can simply capture them at the source, and it doesn't take AI to decide that.
I would really love to connect that mic you have to RTX Voice as an experiment, but wow, that is so far out of the reach of my pockets. The mics you have are pretty damn amazing, though.

Hook your output up to an amp, play some audio, and then check how well the AEC is working.
The USB on the mic is UAC1 and UAC2 (which goes up to 24-bit/192kHz), and Windows only natively supports the lower class 1.

I politely disagree. The problem with Precise is that there are not many good models available for download, and the documentation is not very good at all, especially on training. I had to figure out a lot of things myself and the learning curve was a little steep.
I myself have trained very robust models with Precise that are both noise resistant and "precise".
It just takes a bit of effort. The biggest negative is that it's a bit CPU heavy and so can't run on a Pi Zero.
I get good results when I train on 40-plus wakeword samples from all household members, which are then duplicated with added random noise.
I then train this dataset incrementally against an approximately 10-hour-long collection of 1-minute pieces of random audio that I have accumulated since I started training Precise models.
Those are everything from pieces recorded around our household or while we were watching TV, to chopped-up audio from YouTube videos like "1 hour of relaxing coffee shop noises".
Every time I find audio that triggers major false positives, I just add it to the collection so it's there the next time I train.
Our model right now has about 3-4 false positives a day and works from a couple of meters away while the TV is running (over 200 wakeword samples with and without noise, trained against the collection above).
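A minimal sketch of that duplication-with-noise step (directory names and the mix level range are placeholders rather than my actual scripts, and mono wakeword samples are assumed):

import random
from pathlib import Path

import numpy as np
import soundfile as sf

noise_clips = list(Path("noise").glob("*.wav"))
Path("augmented").mkdir(exist_ok=True)

for wav in Path("wakewords").glob("*.wav"):
    speech, rate = sf.read(wav, dtype="float32")
    noise, _ = sf.read(random.choice(noise_clips), dtype="float32")
    if noise.ndim > 1:
        noise = noise.mean(axis=1)               # downmix stereo noise to mono
    if len(noise) < len(speech):                 # loop short noise clips
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    start = random.randrange(len(noise) - len(speech) + 1)
    gain = random.uniform(0.05, 0.3)             # rough random mix level
    mixed = speech + gain * noise[start:start + len(speech)]
    sf.write(Path("augmented") / wav.name, np.clip(mixed, -1, 1), rate)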
You can find more details and some useful scripts I use for adding random noise or recording wakeword samples here in this post:


That is rather moot, as effort and ability limit the majority to what we have out of the box, and I am discussing what we have. Disagree all you like, but all anyone else has to do is check the results.

And all I'm saying is that Precise can give very good results; it just takes much more effort, but for that effort you get a personal wakeword of your choosing that doesn't expire after 30 days. Also, the built-in ones from Porcupine often don't work very well for non-native speakers like myself.


And the majority, who can not put in that effort, will get the bad results I am describing, because that much effort is needed to get good ones.

Also, for your info: Sonopy as a piece of code is flawed but works; as a feature feed to any KWS, though, it's sub-optimal.
The RNN model of Precise is rather old, and there are faster, lighter, newer architectures that have been available for some time.

Why we haven't provided tools to make training easy is a curious one, and why we haven't adopted newer architectures has left me bemused.

Even before we got Raven I pointed out its poor false-accepts-per-hour (FApH) figure, but still it got included?!?

I have also been banging on about tools like Linto HMG to make model training easy and lightweight, and I'm confused that an opensource licence such as the AGPL-3.0 is so objectionable.

Also, why we haven't progressed to self-training mechanisms is a bit of a mystery, and why people keep developing at the far end of the audio chain, when the input of the system is currently as it is, is equally bemusing.

With Porcupine you have to use a predefined keyword, or it is like you say, but at least on initial install it works quite well and is lightweight.
That is why I have created an issue to maybe include 'Raspberry' or 'RaspberryPi', as that could be the default for users who can not spare the effort or know-how while simple tools are still missing.

@RaspiManu @Enc3ph4l0n

Did you get any info back? Fingers crossed the USB will have sub-devices connected to a hardware mixer where multiple channels can be fed.

I had a go at 'duping' outputs as part of how I think I can trick the AEC into using multiple channels, but basic duplication doesn't seem to work the way the ALSA documents say.
I could probably do it with Pulse or Python, but I don't know why the following sounds so bad.

# Make the duplicating 'both' device (defined below) the default output.
pcm.!default {
  type plug
  slave.pcm "both"
}

ctl.!default {
  type hw
  card 0
}

# 'both' fans each stereo stream out to two slaves (card0 and card1) via the
# multi plugin; the ttable entries below copy channels 0/1 to both slaves.
pcm.both {
  type route;
  slave.pcm {
      type multi;
      slaves.a.pcm "card0";
      slaves.b.pcm "card1";
      slaves.a.channels 2;
      slaves.b.channels 2;
      bindings.0.slave a;
      bindings.0.channel 0;
      bindings.1.slave a;
      bindings.1.channel 1;

      bindings.2.slave b;
      bindings.2.channel 0;
      bindings.3.slave b;
      bindings.3.channel 1;
  }

  ttable.0.0 1;
  ttable.1.1 1;

  ttable.0.2 1; # front left
  ttable.1.3 1; # front right
}

ctl.both {
  type hw;
  card 0;
}


# dmix devices so several streams can share each card (card0 below is the same)
pcm.card1 {
   type dmix
   ipc_key 1112231
   slave {
       pcm "hw:1"
       period_time 0
       period_size 1024
       buffer_size 8192
#       buffer_size 65536
#       buffer_time 0
#       periods 128
       rate 48000
       channels 2
    }
    bindings {
       0 0
       1 1
    }
}

pcm.card0 {
   type dmix
   ipc_key 1112230
   slave {
       pcm "hw:0"
       period_time 0
       period_size 1024
       buffer_size 8192
#       buffer_size 65536
#       buffer_time 0
#       periods 128
       rate 48000
       channels 2
    }
    bindings {
       0 0
       1 1
    }
}

ctl.card1 {
   type hw
   card 1
}

ctl.card0 {
   type hw
   card 0
}

In fact, looking at that, it just occurred to me that I never tried another resampler, but hey, I don't know at the moment.

Hey @rolyan_trauts,

I got answers to your question and to a question I had from the antimatter.ai team. Here they are:

1: Cancel out audio from other mic
Question (Manuel):

Is it possible to input two audio streams, with one being the audio stream that needs to be processed by AEC and sent to the speaker, and the other being an audio stream from another microphone capturing background noise somewhere else in the room, which also needs to be processed by AEC but not sent to the speaker?

Answer (Andrew Walters, antimatter.ai audio expert):

It might be possible. You’d need to have something running on the host to route the audio from your other mic to one of the Acusis output channels and then configure the audio output so that you’re not sending that audio to your actual speakers that are plugged into the 3.5mm jack. Note we haven’t tried out a scenario like this, so it’s not certain how correlated and time-aligned the two AEC channels need to be for the BeClear AEC to work properly.

Andrew also included information about the audio channels from an upcoming post:

Configuring Audio Channels: The Basics

By default, Acusis S presents itself as a USB Audio Class 1.0 (UAC1) device with 2 input and 2 output channels, but these channels have a very flexible configuration. In this post, we’ll start with the basics and a few examples of how to configure audio inputs for some common use cases. If you haven’t already, check out the quick start which includes download links for the Acusis S configuration tool, which you’ll use below.

The input channels are configured as a pseudo-stereo microphone. That is, the mono audio that is produced by Acusis’ beamformer is panned based on the detected direction of arrival. So, if you’re standing to Acusis’ left, the left (first) audio channel will have higher volume than the right (second) channel. If you move to the right, the audio will pan toward the right.

When in UAC1 mode, the input channels are configured by the audio_in_map_uac1 parameter, which is accessible through the config tool. This parameter consists of two 4-bit fields, where the value of each field determines what is routed to the input, according to this table:

Value  Channel
0x0    Left stereo channel for communication
0x1    Right stereo channel for communication
0x2    Mono channel for communication
0x3    Mono channel for speech recognition
0x4    Raw microphone 1 (left when looking at Acusis S)
0x5    Raw microphone 2
0x6    Raw microphone 3
0x7    Raw microphone 4 (rightmost)
0x8    Left output channel loopback
0x9    Right output channel loopback
0xf    Mute

We’ll look at just the first four values for now, and delve into the others in future posts.

The default value of audio_in_map_uac1 is 0x10, which gives you the pseudo-stereo input shown above. Note that ‘left’ and ‘right’ here are from the point of view of Acusis S (or of a person or camera sitting behind Acusis S), useful if you’re using Acusis S with a conferencing app. If you want to reverse the sense of left and right, simply reverse the channel settings by setting audio_in_map_uac1 to 0x01:

aconfig --set audio_in_map_uac1 0x01

If you prefer mono audio input for your conferencing app, you can set audio_in_map_uac1 to 0x22. With this, Acusis S will send the same mono audio to both input channels. It’s worth noting that some apps that deal with audio (such as Audacity) allow you to select the number of input channels to record, but this will not automatically downmix the channels. It will simply select the first of the two channels, so the setting of 0x22 is a surefire way to make sure you’re getting mono audio, regardless of whether your app is doing downmixing.
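To keep the nibble packing straight, here is a tiny hypothetical helper; judging by the 0x10 default (input 1 = left = 0x0, input 2 = right = 0x1), the low nibble appears to select input channel 1 and the high nibble input channel 2:

def audio_in_map(ch1, ch2):
    # Pack the two 4-bit fields from the table above into one byte.
    return (ch2 << 4) | ch1

assert audio_in_map(0x0, 0x1) == 0x10   # default pseudo-stereo
assert audio_in_map(0x3, 0x3) == 0x33   # ASR mono channel on both inputs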

Acusis S also provides audio that is optimized for automatic speech recognition (ASR). To the ear, this channel sounds similar to the communication channel, but some of the noise reduction algorithms are tuned to work better with ASR. So if you’re using Acusis S with a voice assistant app, set audio_in_map_uac1 to 0x33.

We’ll have more about audio channels in future posts!

Hope this answers your question. Setting audio_in_map_uac1 to 0x33 to optimize the Acusis S for use with a voice assistant sounds very interesting to me :slight_smile:


2: Use LEDs only on wake word detection
Question (Manuel):

Is it possible to connect the Acusis S with a voice assistant in a way, that the LEDs only light up and point towards the speaking person after a wake word was detected?

Answer (Andrew Walters, antimatter.ai audio expert):

The direction of arrival information can be extracted via the config tool or by communicating directly to the Acusis’ config interface which shows up as a virtual serial port. Also, the LEDs are completely controllable through the config interface, so with a little coding on the host side, you should be able to light up the LEDs and hold them steady when a wake word is detected (assuming your wake word detection is also running on the host, as Acusis S itself doesn’t do wake word detection).


They will be publishing some useful posts containing the channel part above and information about getting the DOA angle (0-180°), voice activity detection (0 or 1), etc. Andrew also said that he saw some interesting questions in our discussion and that he might join to answer them.

Have to say, the antimatter.ai team seem rather great, as does the functionality of the Acusis.

It's great that Andrew is so open, as the noise channel with clock drift and matching the mix to the mic capture is likely a no, but worth a try. I like the attitude of 'worth a try' even though the odds are stacked against it; you never know.

For me the Acusis is probably the ultimate soundbar mic, as then we have the source rather than 'noise' mics.
But so many great features, and they do need to publish more, as wow, that's more than my initial diggings on their site excavated :slight_smile:

I now have a ton more questions :slight_smile:

Speaking of audio input…
FYI, I just happened to look at ReSpeaker's page to see if there have been any firmware updates since I last messed with it and... Q5 in the FAQ got my attention:
How to enable the 3.5mm audio port to receive the signal as well as the USB port?

I may have to try and get the Mic Array V2 back up and running to try this... if I can figure out how to get audio out to this and my soundbar at the same time... however, I have a feeling that might be more difficult with the timing.

I think the idea was to use another soundcard, capture 'noise' on that and play back on that, but also play a mix with the noise to the Acusis, just not using the Acusis output but the soundcard's.
The Acusis would just be used for AEC on the mic input for played + noise.

Guess you could also do the same, but how the AEC behaves with clock drift and matching the mix to the volumes the mic receives could mean it's not possible, or at least difficult to assess.

Hey guys.

Is the Acusis S able to output audio playback above 16kHz (unlike the ReSpeaker Mic Array v2)?

Or is it also limited by the XMOS chip capabilities?

Cheers.

Hi all! This is Andy from Antimatter. It’s great to see a lot of interest and discussion in Acusis S. I’ve been catching up on your discussion thread and beyond my initial answers to @RaspiManu via email, I’ll try to fill in some more details.

Regarding the potential use case of feeding audio from another mic into the AEC, as I said it might be possible but it’s untested and will need a little experimentation.

To set this up, you'd need to have something on the host that generates a 2-channel audio stream, with one channel being the audio to be played out of your speaker (say, channel 1 or the ‘left’ channel) and the other channel being your remote mic audio (channel 2 or the ‘right’ channel). The AEC is stereo, so it should cancel both audio streams, but I don’t know if it expects the two channels to be correlated at all. The other potential issue is that if your remote mic is close enough to the Acusis, it will pick up desired audio, which could then get canceled out, so you’d need some experiments with the positioning of the two mics.
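Untested, but a host-side sketch of generating that 2-channel stream with python-sounddevice might look like this; the device names are placeholders, the remote mic is assumed mono, and playback.wav is assumed to be 48 kHz:

import numpy as np
import sounddevice as sd
import soundfile as sf

music, rate = sf.read("playback.wav", dtype="float32", always_2d=True)
pos = 0

def callback(indata, outdata, frames, time, status):
    global pos
    chunk = music[pos:pos + frames, 0]            # mono playback audio
    chunk = np.pad(chunk, (0, frames - len(chunk)))
    outdata[:, 0] = chunk                         # ch 1: audio for the speaker
    outdata[:, 1] = indata[:, 0]                  # ch 2: remote 'noise' mic
    pos += frames

with sd.Stream(device=("Remote Mic", "Acusis S"), samplerate=48000,
               channels=(1, 2), callback=callback):
    sd.sleep(int(len(music) / 48000 * 1000))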

With that setup, on the speaker side of things, you can either physically wire things up so only the ‘left’ channel gets played out, or if that’s not possible, you can configure Acusis S to route that channel to both channels of the jack, effectively giving you mono output. To do that, run:

aconfig --set audio_out_map 0x88

Note that settings like this aren’t automatically saved, so behavior will revert back to default after the Acusis is power-cycled. The Acusis quick start shows how to create and save configurations.


The Acusis S audio output runs at a 48kHz sampling rate, so you should get high quality audio from the 3.5mm jack (note that the jack is line-level only, so if you plug in some unamplified headphones, it won’t sound so great).

The input side also runs at 48kHz, although the processed mic inputs are bandlimited to 8kHz, so you could downsample to 16kHz without any change in quality. It is possible to access the raw microphone inputs (hinted at in the audio config information above), and these do actually run at full bandwidth.
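So a simple 3:1 decimation is all that's needed; a quick sketch with scipy (filenames are just examples):

import soundfile as sf
from scipy.signal import resample_poly

audio, rate = sf.read("acusis_48k.wav")          # processed channel, 48 kHz
audio_16k = resample_poly(audio, up=1, down=3, axis=0)
sf.write("acusis_16k.wav", audio_16k, 16000)     # nothing lost below 8 kHz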


One more thing, I’ve seen some interest in this thread about trying out different open-source speech and audio toolkits. If you need to get at the raw microphone data to try out some of their features, you can put Acusis S into UAC2 mode:

aconfig --setdevicemode uac2

Unlike other settings, the device mode will get saved in flash immediately, and the firmware will reboot. Acusis S will then enumerate as a new device with a different name (“Acusis S (UAC2)”) and product ID. The big difference you’ll see is that you now get 8 input channels instead of 2. With everything else set to defaults, the raw mics will come through on channels 4 through 7 (numbering from 0): the leftmost mic is channel 4, the rightmost is channel 7.
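For example, a quick way to pull a few seconds of the raw mics in Python (the device name and dtype here are assumptions; check what your OS reports):

import sounddevice as sd
import soundfile as sf

rec = sd.rec(int(5 * 48000), samplerate=48000, channels=8,
             device="Acusis S (UAC2)", dtype="float32")
sd.wait()                                        # block until recording is done
sf.write("raw_mics.wav", rec[:, 4:8], 48000)     # channels 4-7 = raw mics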

Does the config software show up in alsamixer or amixer controls on Linux, or is the config software going to be ported to Linux?

Can you post opensource details of the virtual serial port and a complete register list? DOA & VAD are intensive processes on the Pi, so being able to 'offload' them to hardware is a big plus.
If there's no direct port with the info, maybe someone will provide something.

Also, any chance you might be setting up a simple forum where users can help and share info?

Also, as well as 48kHz, is that S16_LE or S24_3LE word length?

Hey Andy,
Welcome to the Rhasspy community and thank you very much for joining our discussion to answer some questions :slight_smile:
I tried to figure out where you will be publishing the upcoming posts you showed us a part of. Will they be linked on the Acusis S product page?


@RaspiManu - Yes, the upcoming posts will be on our product page. We’re also working on getting the quick start guide there too, as it’s currently only posted as part of the Datasheet on the DigiKey product page.

@rolyan_trauts - The config tool is a separate tool. There is already an x86_64 Linux version available. The link to download is included in the quick start guide. There is a lot of configuration available, so it’ll take a while to document the virtual serial port protocol for everything. A lot of the configuration things can be done with the config tool and then saved to flash, and I’m working on writing up how to do some of the more interesting ‘real time’ tasks that you might want to do, such as getting DoA and controlling LEDs. Speaking of, here is the upcoming one on DoA:

Using the config tool

Acusis S has a command-line configuration tool for Linux, Mac, and Windows 10 that allows you to view and modify many of the audio parameters, including direction-of-arrival information. See the Acusis S datasheet for information on how to download and set up the tool for your platform.

With the tool (named aconfig or aconfig.exe) installed, you can get the direction-of-arrival (DOA) at any instant with the command:

aconfig --get DOAANGLE

This will print a single line response with an angle between 0 and 180 degrees, for example:

DOAANGLE: 107

If you are speaking head-on to the Acusis S, so that the middle LEDs are lighting up, you’ll get an angle of around 90 degrees. If you move all the way to your left, you’ll get close to 0 degrees, and if you move all the way to your right, you’ll get close to 180 degrees. If you happen to be behind the Acusis S instead of in front of it, you’ll still get an angle of 0 to 180 degrees, just mirrored from when you’re in front of it.

This command will report the most recent DOA of any sound, even sounds that are too quiet to register as voice activity. If you also want to see if voice activity is triggered (that is, LEDs are lit up), you can use the following command:

aconfig --get VOICEACTIVITY

This will respond with a 0 or 1, indicating whether Acusis S is detecting that someone is speaking:

VOICEACTIVITY: 1

Using the virtual serial interface

The config tool above communicates with Acusis S through a virtual serial interface that shows up on most operating systems. If you want to integrate DOA tracking into your own application, you might want to communicate directly with the serial interface. Communicating with this interface is like communicating with any other serial interface, and how to do this will depend on your OS, programming language, and libraries available.

To try it out without coding, first get a hold of a terminal program such as PuTTY, Minicom, or screen. Then find the name of the virtual serial port. If you have the config tool already installed, you can get the name by running:

aconfig -l

On Mac, this will likely be in the form /dev/tty.usbmodemNNNNN, on Linux, /dev/ttyACMN, and on Windows COMN (where N represents a decimal digit).

Use your terminal program to connect to the virtual serial port. You might want to turn on local echo and translate line feeds to CR+LF to make the display more readable. Baud rate and communication settings shouldn’t matter since the port isn’t physical, but 115200 bps, no parity, 8 data bits, 1 stop bit (115200 N81) will work fine if you need to set something.

To get direction of arrival, enter the following in your terminal (followed by CR and/or LF):

rawr15c00400

You will get raw data back in JSON format, such as (followed by LF):

{"data":[107,0,0,0]}

The first byte, 107, indicates the angle in degrees. The remaining 3 bytes are unused and are always 0.

Similarly, you can get the voice activity indicator with:

rawr13e50400

And the response:

{"data":[1,0,0,0]}

…where the first byte is a 0 or 1 indicating whether someone is speaking.
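For instance, a minimal Python sketch with pyserial, using the two raw commands above (the port name is an example; find yours with aconfig -l):

import json
import serial

with serial.Serial("/dev/ttyACM0", timeout=1) as port:
    port.write(b"rawr15c00400\r\n")              # DOA query
    doa = json.loads(port.readline())["data"][0]

    port.write(b"rawr13e50400\r\n")              # voice activity query
    vad = json.loads(port.readline())["data"][0]

print(f"DOA: {doa} degrees, voice activity: {vad}")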


Nice, I will definitely take a look at the product page every now and then and check for new things that might help me customize Acusis S :slight_smile:

Nope. Me neither. Completely lost the plot! I said I was tired :sweat_smile: I quietly “celebrated” a milestone birthday this week too, in lockdown of course, worth mentioning I think as perhaps it’s relevant! :smile:

I’m still playing with the position, and from how this topic has developed I’m interested to play with the configuration of the device.

I’ve read an awful lot of what you’ve written on audio & KWS @rolyan_trauts. I value a lot of what you’ve said across many topics and your ideas. Please forgive me for not replying to everything you’ve said but I want you to know I’ve found it really valuable. My knowledge in audio is basic at best, but I’m definitely beginning to get a grasp of the concepts and your posts have gone a long way to help.

I mute the TV/pause the Roku/Shield when the hotword is detected and that's fine, but Porcupine, whilst convenient, isn't perfect: too many false positives at one end of the sensitivity range, missed detections at the other, and it's easily drowned out with the TV playing or other background noise.

Hi @andrewwalters, thanks for taking the time to post! Some good information you've posted, and I'm looking forward to the posts you plan to make available on the product page. I pulled the config when I first received the device, but struggled to find information about each of the key/value pairs, so it will be good to understand them better and tune the device as necessary.