ArmNN GpuAcc & CpuAcc tflite Wav2Letter example

rolyan_trauts · November 4, 2022, 4:18am

I finally got ArmNN working on a SBC with a Mail GPU ArmNN is optimised for both Arm Cpu’s & Gpu’s.
The tutorial is here

https://developer.arm.com/documentation/102603/2108/Device-specific-installation/Install-on-Odroid-N2-Plus

Its a fairly easy install the model for ASR is pretty bad but its purely to test optimisation and load.
On a RockPi5 RK3588 results are…

rock@rock-5b:~/workspace/armnn/python/pyarmnn/examples/speech_recognition$ python3 run_audio_file.py --audio_file_path samples/hp0.wav --model_file_path tflite_int8/wav2letter_int8.tflite --preferred_backends CpuAcc

Inference End: Avg CPU%=44.22205882352939
Runtime=0:00:05.506307
Realtime=x49.63404910042248

rock@rock-5b:~/workspace/armnn/python/pyarmnn/examples/speech_recognition$ Realtime=x49.63404910042248rock@rock-5b:~/workspace/armnn/python/pyarmnn/examples/speech_recognition$ python3 run_audio_file.py --audio_file_path samples/hp0.wav --model_file_path tflite_int8/wav2letter_int8.tflite --preferred_backends GpuAcc

Inference End: Avg CPU%=6.852573529411753
Runtime=0:00:06.292449
Realtime=x43.43305952896877

As you will see you just switch between CpuAcc or GpuAcc for the --preferred_backends with a Pi with no Mali its CpuAcc only CpuAcc means its been heavily Neon optimised.

Dunno what it is with software side of Arm as there example has prob one of the most load heavy MFCC audio preprocessing I have ever seen and it makes evaluation near impossible and the majority of the load isn’t ArmNN but preprocessing audio.
I have hacked the code so it preprocesses all the audio 1st then loads that into the model so we are only looking at model performance not MFCC code.

# Copyright © 2021 Arm Ltd and Contributors. All rights reserved.
# SPDX-License-Identifier: MIT

"""Automatic speech recognition with PyArmNN demo for processing audio clips to text."""

import sys
import os
import numpy as np
import psutil
import soundfile as sf
script_dir = os.path.dirname(__file__)
sys.path.insert(1, os.path.join(script_dir, '..', 'common'))

from argparse import ArgumentParser
from network_executor import ArmnnNetworkExecutor
from utils import prepare_input_data
from audio_capture import AudioCaptureParams, capture_audio
from audio_utils import decode_text, display_text
from wav2letter_mfcc import Wav2LetterMFCC, W2LAudioPreprocessor
from mfcc import MFCCParams
from datetime import datetime, timedelta

# Model Specific Labels
labels = {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g', 7: 'h', 8: 'i', 9: 'j', 10: 'k', 11: 'l', 12: 'm',
          13: 'n',
          14: 'o', 15: 'p', 16: 'q', 17: 'r', 18: 's', 19: 't', 20: 'u', 21: 'v', 22: 'w', 23: 'x', 24: 'y',
          25: 'z',
          26: "'", 27: ' ', 28: '$'}


def time_float(result):
    seconds = int(result)
    microseconds = int((result * 1000000) % 1000000)
    output = timedelta(0, seconds, microseconds)
    return output

def parse_args():
    parser = ArgumentParser(description="ASR with PyArmNN")
    parser.add_argument(
        "--audio_file_path",
        required=True,
        type=str,
        help="Path to the audio file to perform ASR",
    )
    parser.add_argument(
        "--model_file_path",
        required=True,
        type=str,
        help="Path to ASR model to use",
    )
    parser.add_argument(
        "--preferred_backends",
        type=str,
        nargs="+",
        default=["GpuAcc", "CpuAcc", "CpuRef"],
        help="""List of backends in order of preference for optimizing
        subgraphs, falling back to the next backend in the list on unsupported
        layers. Defaults to [GpuAcc, CpuAcc, CpuRef]""",
    )
    return parser.parse_args()


def main(args, network, input_data):

    current_r_context = ""
    is_first_window = True
    avg_cpu = 0.0
    for input_chunk in input_data:
        # Run inference
        output_result = network.run([input_chunk])

        # Slice and Decode the text, and store the right context
        current_r_context, text = decode_text(is_first_window, labels, output_result)

        is_first_window = False

        display_text(text)
        runtime = datetime.now() - starttime
        print(" " + str(runtime))
        avg_cpu = avg_cpu + psutil.cpu_percent()

    print(current_r_context, flush=True)
    print("Inference End: Avg CPU%=" + str(avg_cpu / len(input_data)))
    return runtime

if __name__ == "__main__":
    args = parse_args()
    # Create the ArmNN inference runner
    network = ArmnnNetworkExecutor(args.model_file_path, args.preferred_backends)
    # Read command line args
    audio_file = args.audio_file_path
    sf_data, samplerate = sf.read(audio_file)
    sf_secs = time_float((len(sf_data) / samplerate))
    # Specify model specific audio data requirements
    audio_capture_params = AudioCaptureParams(dtype=np.float32, overlap=31712, min_samples=47712, sampling_freq=16000,
                                              mono=True)

    buffer = capture_audio(audio_file, audio_capture_params)
    # Extract features and create the preprocessor

    mfcc_params = MFCCParams(sampling_freq=16000, num_fbank_bins=128, mel_lo_freq=0, mel_hi_freq=8000,
                             num_mfcc_feats=13, frame_len=512, use_htk_method=False, n_fft=512)

    wmfcc = Wav2LetterMFCC(mfcc_params)
    preprocessor = W2LAudioPreprocessor(wmfcc, model_input_size=296, stride=160)   
    print("Processing Audio Frames...")
    input_data = []

    for audio_data in buffer:
        # Prepare the input Tensors
        input_data.append(prepare_input_data(audio_data, network.get_data_type(), network.get_input_quantization_scale(0),
                                        network.get_input_quantization_offset(0), preprocessor))
                                        
        
  
    
    starttime = datetime.now()
    runtime = main(args, network, input_data)
    print("Runtime=" + str(runtime))
    print("Realtime=x" + str(sf_secs / runtime))
    starttime = datetime.now()
    runtime = main(args, network, input_data)
    print("Runtime=" + str(runtime))
    print("Realtime=x" + str(sf_secs / runtime))

Both the model and the manner it works isn’t good but this is a perf evaluation which cpu is x50 realtime / gpu x45
There are no Mesa drivers for the Mali G610 and its using a Rockchip blob that I think is underperforming slightly with load of about 70% its a new gen3 vallhall so finger crossed like others it will be added to Mesa.
There are quite a few boards out there now with quite decent multi core Mali GPU’s that as you will see do the heavy lifting and leave the cpu almost load free, that Mesa does have drivers for.
I should really try another board with a Mesa driver as have a hunch this one is a tad slow and hence also a bit more CPU load.