DTLN Noise Suppression Setup

I recently found the DTLN project, which does quite well on noise suppression and can run on a Pi. They also have DTLN-aec, but it is not as directly usable on a Pi as DTLN is.

I tested DTLN on my Pi 3B+. The quantized tflite model actually works, with CPU load only around 70%. They provide a real_time_dtln_audio.py script that lets you specify an input device and an output device to record while playing. I tested it with a very loud vacuum running: the noise is mostly removed and my voice can still be heard.

Now the problem is that I want to chain it after echo cancellation (voice-engine's ec). One way is to modify the Python script and extract the denoised data, but first I want to see if I can use ALSA fifo devices, like ec does, to make a virtual denoised mic. I modified asound.conf as below:

pcm.!default {
    type asym
    playback.pcm "eci"
    capture.pcm "dnc"
}

pcm.dn {
    type asym
    playback.pcm "dno"
    capture.pcm "eco"
}

pcm.eci {
    type plug
    slave {
        format S16_LE
        rate 16000
        channels 1
        pcm {
            type file
            slave.pcm null
            file "/tmp/ec.input"
            format "raw"
        }
    }
}

pcm.eco {
    type plug
    slave.pcm {
        type fifo
        infile "/tmp/ec.output"
        rate 16000
        format S16_LE
        channels 2
    }
}

# let denoise script output to this device
pcm.dno {
    type plug
    slave {
        format S16_LE
        rate 16000
        channels 1
        pcm {
            type file
            slave.pcm null
            file "/tmp/dn.output"
            format "raw"
        }
    }
}

# use this as a capture device to read denoised audio
pcm.dnc {
    type plug
    slave.pcm {
        type fifo
        infile "/tmp/dn.output"
        rate 16000
        format S16_LE
        channels 1
    }
}
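
The idea is that the denoise script opens the dn device for both capture and playback, so it reads ec's processed audio from the ec.output fifo and writes its result into /tmp/dn.output. Roughly like this (a simplified sketch rather than the actual real_time_dtln_audio.py, assuming PortAudio can open the PCM names defined above; the pass-through callback stands in for the two tflite models):

import sounddevice as sd

BLOCK_SHIFT = 128  # 8 ms at 16 kHz, as in the DTLN real-time script

def callback(indata, outdata, frames, time_info, status):
    if status:
        print(status)
    outdata[:] = indata  # placeholder: the real script runs the two DTLN models here

# capture = "eco" (the ec output fifo), playback = "dno" (the file writer), via the asym "dn"
with sd.Stream(device=("dn", "dn"), samplerate=16000, blocksize=BLOCK_SHIFT,
               channels=1, dtype="float32", callback=callback):
    while True:
        sd.sleep(1000)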

It works in the sense that I can record and hear denoised audio. However, the denoise program generates tons of “input underflow” messages and its CPU usage stays at 100%. As a result the denoised audio sounds choppy. If I run real_time_dtln_audio.py with plughw:0 as both input and output, there are no “input underflow” messages and CPU usage stays around 70%.

I am new to asound.conf and not sure if I did something wrong. Maybe someone familiar with this can help, or come up with a better way to use DTLN.

I got the RNNoise LADSPA plugin to run but its results seemed to be a bit hit and miss.

If you do recompile speex and alsa-plugins, the speex utils do work:

pcm.agc {
    type speex
    slave.pcm "cap"
    agc on
    agc_level 8000
    denoise on
}

But NS always seems to cause artifacts: to us the result sounds better, but it seems to affect recognition. If your models were recorded with the same NS in use, it would probably work well.

So with the above, if you use it on its own with a sound device it's OK, even at 70% load, i.e. plughw: with the auto resample as opposed to hw: (direct).
But when it's used in conjunction with EC, hence the above PCMs, are you just running out of load headroom?
EC + DTLN is starting to get heavy.

With a non-RTOS you often can not run right up to max load: everything stays synced and running because there is time to spare, but when you start nearing load capacity things can start rearing their head.
I know with a Pi3 and TF/TFLite that simply switching to the 64-bit image gives between 2-3x the performance of 32-bit, as TF & TFLite are heavily 64-bit optimised; I think it's the SIMD instructions and NEON optimisation.

It's really interesting, and the 64-bit perf boost might be enough to help if you are running the standard image.
It also looks like it could be further optimised: from experience with the built-in MFCC of the KWS models I have been playing with, Google run the FFT inside the model and it seems approx 2x faster than librosa.

It's interesting and I will have a play, but the Flex delegates and embedded MFCC ops of the google-kws models are beyond me.

I have been having a gander, as it runs at about 20% overall, or 70% of a single core, on a Pi3A+, so I dunno.

I will have a go with EC and see. PS: that is the full TensorFlow, just hacked to import tensorflow.lite as tflite, as it's the same, just a bit bigger to load I guess.
I must have compiled tflite wrong a while back, as both registered as tensorflow, which confused me, so I just use the lite module of the full lib.

So to make them work in stream mode, the requirement is that the block process time must be lower than the block shift time (8 ms in the DTLN and DTLN-aec projects).
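
Checking this on a given device basically comes down to timing interpreter.invoke() per block. A rough sketch (single interpreter fed dummy zero inputs; the model path is a placeholder, and DTLN / DTLN-aec actually run two interpreters per block, so their times add up):

import time
import numpy as np
import tflite_runtime.interpreter as tflite  # or: import tensorflow.lite as tflite

interpreter = tflite.Interpreter(model_path="model_quant_1.tflite")  # placeholder path
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()

times = []
for _ in range(500):
    for d in input_details:  # feed zeros matching each input's shape and dtype
        interpreter.set_tensor(d["index"], np.zeros(d["shape"], dtype=d["dtype"]))
    start = time.perf_counter()
    interpreter.invoke()
    times.append(time.perf_counter() - start)

print("average block time: %.2f ms (needs to stay below the 8 ms shift)"
      % (1000 * np.mean(times)))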

On my Pi 3B+ (armv7l), DTLN with the quantized model has a process time of 5 ms on average, which is slower than what the DTLN repo reports (maybe they ran a 64-bit OS and it's faster?).

I also managed to run DTLN-aec on my Pi. The script provided is file-based, but it can easily be modified to stream mode. However, I measured the process time, and even for the lightest model provided (128 LSTM units) it's 22 ms on average. So maybe a Pi 3B+ or even a 4 cannot handle this. I don't know how to quantize the model, but I have seen an issue on their repo where someone tried and failed because some files are missing. And I doubt even a quantized model can get it under 8 ms on a Pi.

If you want to measure the process time on your device you may use my modified script: https://github.com/SaneBow/DTLN-aec

You will get a 2-3x speedup just by going 64-bit.

Yeah, it's missing the original PB model (the normal full TF model), but maybe you could go round the houses.

If you have the original saved model, or the model loaded, it's quite easy:

import tensorflow as tf

# post-training quantization of a saved model
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic-range quantization
tflite_quant_model = converter.convert()

If we had the pb you could load it with the above.
I can not remember how much post-training quant helped; it wasn't as much as just the OS swap.
It's looking like it could land anywhere from annoyingly close to just faster than real time.
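
One note, from the TF Lite docs rather than the DTLN repo: Optimize.DEFAULT on its own only gives dynamic-range quantization of the weights. Full integer quantization, which is usually what gives the bigger speedup on ARM, additionally needs a representative dataset, roughly like this (the input shape below is only illustrative and has to match the real model signature):

import numpy as np
import tensorflow as tf

saved_model_dir = "dtln_saved_model"  # placeholder path

def representative_dataset():
    # yield a list with one array per model input, matching shape and dtype
    for _ in range(100):
        yield [np.random.rand(1, 1, 257).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_quant_model = converter.convert()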

I bet on 64-bit it's probably an annoying ~10 ms on a Pi3A+.

In the KWS models I sent you there is tflite_stream_state_external, which is the non-quantised model and will give you a comparison.

Lols, 10.823751102664513 ms on a Pi3A+ 64-bit, so close but so far.

I know numpy is fast, but the TF libs are heavily optimised, though probably a headache to swap in:

tf.signal.rfft(
    input_tensor, fft_length=None, name=None
)
tf.signal.irfft(
    input_tensor, fft_length=None, name=None
)
tf.audio.decode_wav(
    contents, desired_channels=-1, desired_samples=-1, name=None
)
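
For the DTLN-style per-block processing that would look roughly like this (just a sketch of the shapes involved, with 512-sample blocks and 257 frequency bins as DTLN uses):

import numpy as np
import tensorflow as tf

block = np.zeros((1, 512), dtype=np.float32)  # one 512-sample time-domain block

spec = tf.signal.rfft(block)        # complex spectrum, shape (1, 257)
mag = tf.abs(spec)                  # magnitude that would feed the first model
# ... the model's predicted mask would be applied to spec here ...
out_block = tf.signal.irfft(spec)   # back to a (1, 512) time-domain block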

Far out of my comfort zone, but embedding them as Keras layers, as in the Google Research code, seems really fast; dunno if it's compiled by TF or something, as it seems much faster than calling similar TF routines from Python outside the models.

The TF routines in Python are also fast: the MFCC calc using the tf python_ops is approx 2x faster than librosa, which is supposedly quite optimised.

The NS is pretty good; I concatenated a few dataset samples as input:

https://drive.google.com/open?id=1GYZ3rnS6EbQczOEYGRpK_OHKFd0LIiF3

and the output:

https://drive.google.com/open?id=1gs-_i6VnN4ttJdG4YTgZS_gX1JGfhvTV

Back to the initial question about ALSA: after the ec install it should just be the default device, so you shouldn't need to set anything up, as the default will be used.
ec sets up some FIFOs: it creates a dummy playback device so it gets the audio being played before it goes to the output, then it grabs the mic, runs the cancellation algorithm to subtract that reference, and exposes the result as a new capture device.

So with NS you should just use the defaults, but it plays its output and wants a source, not a sink.

So probably the best thing to do is:

modprobe snd-aloop

The loopback will turn up as card, device, subdevice, so you can play to it and it will then be available as an input.
You can ignore the subdevice, as the 1st will be used; the subdevices correspond to each other:
what goes in on hw:0,0,2 comes out on hw:0,1,2, but we can just use the 1st and not specify it.
So you play into hw:0,0 and it's available as hw:0,1 (or the other way round; a google will tell you which is the sink/source side).

So I'm slightly confused about the extra asym devices, as what you want to do is chain the audio: set up ec, point the input of NS at the default capture (ec's output), and send its output to a loopback.
ec acts as piggy in the middle and redirects through its fifo files, but in the above, NS doesn't do that; it expects a standard capture/playback source/sink.

Then just try recording from the loopback.

Then you can change the ALSA default asym to be the ec playback & the loopback capture:
use the ec capture as your input in NS, make your output the loopback playback, and that should work for any app using the default device.
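
Something like this for the NS side, as a rough sketch: the same kind of callback loop as the sketch in the first post, but with the devices swapped so capture comes from ec's output and the denoised audio goes into the loopback. Device names here are assumptions ("eco" from the config above; the snd-aloop card usually shows up as hw:Loopback), and the pass-through callback stands in for the DTLN models.

import sounddevice as sd

def callback(indata, outdata, frames, time_info, status):
    if status:
        print(status)
    outdata[:] = indata  # placeholder for the DTLN inference

# capture from ec's processed output, play the denoised result into the loopback
with sd.Stream(device=("eco", "hw:Loopback,0"), samplerate=16000,
               blocksize=128, channels=1, dtype="float32", callback=callback):
    while True:
        sd.sleep(1000)

Anything that wants the denoised mic then records from the other side of the loopback (hw:Loopback,1), or you point the default asym's capture at that.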

I just reinstalled everything on a 64-bit OS and it's now ~11 ms on the Pi 3B+. A bit slower than your 3A+, possibly because my power cable is not that good and the CPU under-voltages sometimes.

I tried to use tflite2tensorflow to quantize the model but there's some error.

Update: the author fixed the bug really quickly. Now I can convert it back to .pb, but quantization still fails; see my comment in the above issue. Not sure about the reason right now.

snd-aloop looks like exactly what I need, thanks, will look into it

I don't think you would get that much from the quantisation, but are the models in DTLN the same?

Also with the AEC maybe we could run each interpreter in its own thread?
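
Something along these lines maybe; stage 1 of the next block only needs its own state, so it could overlap with stage 2 of the current block, one thread per interpreter. A very rough sketch of the pattern, with the interpreters replaced by placeholder FFTs, and assuming invoke() releases the GIL so the threads actually overlap:

import queue
import threading
import numpy as np

def stage1(block):
    return np.fft.rfft(block)   # stand-in for interpreter_1.invoke()

def stage2(spec):
    return np.fft.irfft(spec)   # stand-in for interpreter_2.invoke()

handoff = queue.Queue(maxsize=2)
results = queue.Queue()

def stage2_worker():
    while True:
        item = handoff.get()
        if item is None:
            return
        results.put(stage2(item))

worker = threading.Thread(target=stage2_worker)
worker.start()

for _ in range(1000):                     # stand-in for the real-time block loop
    block = np.zeros(512, dtype=np.float32)
    handoff.put(stage1(block))            # stage 1 here, stage 2 in the worker
handoff.put(None)
worker.join()

It would add a block of extra latency though, and only helps if each stage fits within the 8 ms shift on its own.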

Yeah, no under-volt here, probably thanks to the standard (non-standard 5.1 V) Pi PSU :slight_smile:
With a 40 mm heatsink too, so maybe, but unfortunately it's the wrong side of real time for both of us.
All the inference engines are heavily 64-bit optimised; I guess it just means they can carry more 16/8-bit tensors in one go. TensorFlow is, anyway.

For the DTLN models, quantization takes it from 9.6 ms to 2.2 ms (in their 3A+ evaluation), so I think maybe the same level can be achieved for the DTLN-aec 128 model.

The models are already named quant, so I was thinking it's already done.
Actually no, my memory was off and I had to check: the quant models are only in DTLN, or else they are just not named that way.
Weird that the other repo is empty of full models and training code; is it that it is all in DTLN?

pcm.!default {
    type asym
    playback.pcm "eci"
    capture.pcm "dnc"
}

The asym is really only for when you have to reference playback/capture as one device.
Usually you can set separate input/output devices; the only place I can think it is really necessary is when creating a multihomed default device.

For DTLN, the quantized / tflite / full models are all available. For DTLN-aec it seems like they only released the tflite models.

They have the source code for DTLN and two papers. Also, both the NS and AEC datasets are available, as they were for open contests, see AEC-Challenge. So someone familiar with machine learning could probably reimplement and train the AEC version based on DTLN, but for sure this is beyond me. Also see this issue, where the author said the two model structures are the same.

Yeah, it's a bit out of my comfort zone too, but like the KWS models, custom models are really interesting, as overfitting could be a real advantage: it would likely only work well with your voice and so cancel out others.

The output from it is pretty amazing, but with the NS I also wonder what happens with speaker-specific voice training?

Maybe, though, it might be as simple as upping self.numUnits = 128, but then the DTLN models should work with the AEC script, just badly.

Traceback (most recent call last):
  File "run_aec.py", line 224, in <module>
    process_folder(args.model, args.in_folder, args.out_folder)
  File "run_aec.py", line 210, in process_folder
    os.path.join(new_directories[idx], file_names[idx]),
  File "run_aec.py", line 115, in process_file
    interpreter_1.set_tensor(input_details_1[2]["index"], lpb_mag)
IndexError: list index out of range

I think ‘same structure’ might have been a loose description; probably the processing just creates a different tensor structure.

Still, with the NS I'm wondering what adding ‘own voice’ to the clean set does for results, and I'm starting to wish I had more than a 1050 Ti, but custom models are probably not something many can realistically achieve.

The average time for one training epoch on a Nvidia RTX 2080 TI is around 21 minutes