ONNX Runtime Acceleration
ONNX Runtime supports various hardware acceleration libraries through its extensible Execution Provider (EP) framework, enabling optimal execution of ONNX models on different platforms. This flexibility allows developers to deploy ONNX models in diverse environments—cloud or edge—and optimize execution by leveraging the platform’s compute capabilities.
ONNX Runtime Execution Providers
On Genio, the following execution providers are supported and integrated by default:
- CPUExecutionProvider
- XnnpackExecutionProvider
- NeuronExecutionProvider
CPUExecutionProvider and XnnpackExecutionProvider execute models on the
CPU.
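If you only need CPU execution, a session can be created by listing these providers explicitly. The snippet below is a minimal sketch using the standard ONNX Runtime Python API; the model path is a placeholder:

import onnxruntime as ort

# Execution providers registered in this ONNX Runtime build.
print(ort.get_available_providers())

# Create a session that tries XNNPACK first and falls back to the default CPU EP.
session = ort.InferenceSession(
    "your_model.onnx",
    providers=["XnnpackExecutionProvider", "CPUExecutionProvider"],
)
print("Active providers:", session.get_providers())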
NPU Acceleration
MediaTek provides advanced AI capabilities through its state-of-the-art NPU. ONNX models can access the NPU by leveraging the NeuronExecutionProvider, MediaTek's proprietary ONNX Runtime Execution Provider and the official way to run ONNX models on the MediaTek NPU.
When this EP is used, it intelligently deploys the supported portions of the ONNX model to the NPU, and any unsupported operations automatically fall back to the CPU for seamless execution. The NeuronExecutionProvider supports both the Python and C++ APIs.
NeuronExecutionProvider supports both FP32 (executed as FP16) and QDQ INT8
ONNX models.
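If you are starting from an FP32 model and want a QDQ INT8 variant, ONNX Runtime's static quantization tooling can produce one. The snippet below is a minimal sketch: the input name, shape, file names, and the random calibration reader are placeholders you would replace with your own model details and representative calibration data.

import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class RandomCalibrationReader(CalibrationDataReader):
    # Toy calibration reader; replace with batches of representative input data.
    def __init__(self, input_name, shape, num_batches=8):
        self._batches = iter(
            [{input_name: np.random.randn(*shape).astype(np.float32)}
             for _ in range(num_batches)])

    def get_next(self):
        return next(self._batches, None)

quantize_static(
    "your_model_fp32.onnx",                               # placeholder input model
    "your_model_int8_qdq.onnx",                           # placeholder output model
    RandomCalibrationReader("input", [1, 3, 224, 224]),   # hypothetical input name/shape
    quant_format=QuantFormat.QDQ,                         # emit a QDQ-style INT8 graph
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)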
Platform Availability
- Genio 720 / Genio 520: The NeuronExecutionProvider is available by default.
- All other Genio products: These platforms have default access to CPUExecutionProvider and XnnpackExecutionProvider.
- Genio 700 / Genio 510: To enable the NeuronExecutionProvider, follow the steps below (see the verification sketch after this list):
  1. Add the following to your local.conf:
     ENABLE_NEURON_EP = "1"
  2. Build your Rity image:
     bitbake rity-demo-image
- Genio 350: The NeuronExecutionProvider is not available. Only CPUExecutionProvider and XnnpackExecutionProvider are supported on this platform.
- Genio 1200: The NeuronExecutionProvider can be added using the same steps as Genio 700 / Genio 510, but not all ONNX ops may be supported on the NPU, which might impact performance. CPUExecutionProvider and XnnpackExecutionProvider are supported on this platform by default.
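Once the image is running on the board, you can confirm whether the Neuron EP was built in with a quick check. This is a minimal sketch using the standard ONNX Runtime Python API:

import onnxruntime as ort

providers = ort.get_available_providers()
print(providers)

if "NeuronExecutionProvider" in providers:
    print("Neuron EP is available on this image.")
else:
    print("Neuron EP is not registered; check your build configuration.")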
NPU Provider Options
The NeuronExecutionProvider has several important provider options:
- NEURON_FLAG_USE_FP16: Executes FP32 models in FP16 precision. Note: This flag is mandatory, as the NPU does not support FP32 execution. Your model will fail to run without it.
- NEURON_FLAG_MIN_GROUP_SIZE: Sets the minimum number of nodes for a subgraph to be offloaded to the NPU. For larger models, setting this to a value greater than 1 may improve performance.
- NEURON_FLAG_OPTIMIZATION_STRING: Advanced optimization flags (see the sketch after this list), including:
  - --opt=3 (enabled by default)
  - --num-mdla=1 (enabled by default)
  - --reshape-to-4d (enabled by default)
  - --interval-coloring-converage=1.0 (enabled by default)
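As a usage sketch, these options are passed as a dictionary when registering the EP, in the same way as the Python example in the next section. The NEURON_FLAG_OPTIMIZATION_STRING value shown here assumes the compiler flags are combined into a single string; verify the exact format against your BSP release.

import onnxruntime as ort

neuron_options = {
    "NEURON_FLAG_USE_FP16": "1",         # mandatory: the NPU executes FP32 as FP16
    "NEURON_FLAG_MIN_GROUP_SIZE": "10",  # example: offload only subgraphs with >= 10 nodes
    # Assumed format: compiler flags joined into one string (check your release notes).
    "NEURON_FLAG_OPTIMIZATION_STRING": "--opt=3 --num-mdla=1 --reshape-to-4d",
}

session = ort.InferenceSession(
    "your_model.onnx",
    providers=[("NeuronExecutionProvider", neuron_options),
               "CPUExecutionProvider"],
)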
Getting Started: Python API Example
Below is a simple example of how to initialize an ONNX Runtime session with the
NeuronExecutionProvider on Genio 720/520 using Python:
import onnxruntime as ort
import numpy as np

model_path = "your_model.onnx"

neuron_provider_options = {
    "NEURON_FLAG_USE_FP16": "1",       # Allow FP32 to execute as FP16
    "NEURON_FLAG_MIN_GROUP_SIZE": "1"  # Set minimum subgraph size
}

providers = [
    ("NeuronExecutionProvider", neuron_provider_options),
    "XnnpackExecutionProvider",
    "CPUExecutionProvider",
]

session = ort.InferenceSession(model_path, providers=providers)

def generate_dummy_inputs(session):
    # Create random inputs matching the model's input shapes (dynamic dims become 1).
    inputs = {}
    for input_meta in session.get_inputs():
        shape = [dim if isinstance(dim, int) else 1 for dim in input_meta.shape]
        data = np.random.randn(*shape).astype(np.float32)
        inputs[input_meta.name] = data
    return inputs

dummy_inputs = generate_dummy_inputs(session)
output_names = [output.name for output in session.get_outputs()]
outputs = session.run(output_names, dummy_inputs)
print("Inference outputs:", outputs)
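To confirm which providers the session actually registered (for example, that the Neuron EP was not silently dropped), you can query the session object from the example above; get_providers() is standard ONNX Runtime API:

# Providers are listed in priority order; NeuronExecutionProvider should appear
# first if it was registered successfully.
print("Session providers:", session.get_providers())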
Getting Started: C++ API Example
Below is a simple example of how to initialize an ONNX Runtime session with
the NeuronExecutionProvider on Genio 720/520 using C++:
#include <iostream>
#include <vector>

#include "onnxruntime/core/providers/neuron/neuron_provider_factory.h"
#include "onnxruntime/core/session/onnxruntime_cxx_api.h"

int main() {
  const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "example");
  Ort::SessionOptions session_options;
  session_options.SetGraphOptimizationLevel(ORT_ENABLE_BASIC);

  // NeuronExecutionProvider options, passed as parallel key/value arrays.
  std::vector<const char*> options_keys = {"NEURON_FLAG_USE_FP16", "NEURON_FLAG_MIN_GROUP_SIZE"};
  std::vector<const char*> options_values = {"1", "1"};

  if (g_ort->SessionOptionsAppendExecutionProvider(
          session_options,
          "Neuron",
          options_keys.data(),
          options_values.data(),
          options_keys.size()) != nullptr) {
    std::cout << "Failed to append NeuronExecutionProvider." << std::endl;
    return 1;
  }

  Ort::Session session(env, "model.onnx", session_options);

  // Prepare input and output names/tensors as required by your model, e.g.:
  // std::vector<const char*> input_names = { "input" };
  // std::vector<Ort::Value> input_tensors = { /* your input tensor(s) */ };
  // std::vector<const char*> output_names = { "output" };
  //
  // Run inference:
  // auto output_tensors = session.Run(Ort::RunOptions{nullptr},
  //     input_names.data(), input_tensors.data(), input_names.size(),
  //     output_names.data(), output_names.size());

  return 0;
}
Note
The NPU does not support dynamic op shapes, so make sure all dynamic shapes in your ONNX model are made static before running it with the NeuronExecutionProvider (see the sketch below).
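One way to pin dynamic dimensions is to edit the model's input shapes with the onnx Python package before deployment. The snippet below is a minimal sketch that pins every symbolic dimension to 1, which may not be the right size for your model; ONNX Runtime also provides a command-line helper (python -m onnxruntime.tools.make_dynamic_shape_fixed) for the same purpose.

import onnx

model = onnx.load("your_model.onnx")
for graph_input in model.graph.input:
    for dim in graph_input.type.tensor_type.shape.dim:
        if dim.dim_param:      # symbolic (dynamic) dimension, e.g. "batch"
            dim.dim_value = 1  # pin to a fixed size that matches your use case
onnx.save(model, "your_model_static.onnx")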
ONNX Runtime Benchmarking
Python benchmarking scripts are provided to evaluate ONNX model inference performance on different hardware and execution providers.
Benchmarking Script Overview
- onnxruntime_benchmark.py is the Python-based benchmarking script for ONNX models on the CPU, Neuron, and XNNPACK Execution Providers.
- The script benchmarks models on the selected execution providers and produces a performance comparison table when executed in auto mode.
Usage
To see all available options:
cd /usr/share/onnxruntime_benchmark
python3 onnxruntime_benchmark.py -h
usage: onnxruntime_benchmark.py [-h] [--model MODEL] [--auto]
[--directory DIRECTORY] [--num_runs NUM_RUNS]
[--num_threads NUM_THREADS]
[--thread_mode {parallel,sequential}]
[--profile]
[--optimize {DISABLE_ALL,ENABLE_BASIC,ENABLE_EXTENDED,ENABLE_ALL}]
[--execution_provider EXECUTION_PROVIDER]
[--neuron_flag_use_fp16 NEURON_FLAG_USE_FP16]
[--neuron_flag_min_group_size NEURON_FLAG_MIN_GROUP_SIZE]
[--neuron_opt] [--neuron_num_mdla]
[--neuron_reshape_to_4d]
[--neuron_interval_coloring]
options:
-h, --help show this help message and exit
--model MODEL, -m MODEL
Path to ONNX model
--auto Auto-benchmark all ONNX models in directory
--directory DIRECTORY, -d DIRECTORY
Directory to search for ONNX models (used with --auto)
--num_runs NUM_RUNS, -n NUM_RUNS
Number of inference runs for benchmarking
--num_threads NUM_THREADS
Number of threads for parallel/sequential execution
--thread_mode {parallel,sequential}
Threading mode: 'parallel' sets intra_op_num_threads,
'sequential' sets inter_op_num_threads
--profile Enable ONNX Runtime profiling
--optimize {DISABLE_ALL,ENABLE_BASIC,ENABLE_EXTENDED,ENABLE_ALL}
Set ONNX Runtime graph optimization level
--execution_provider EXECUTION_PROVIDER
ONNX Runtime execution provider (e.g.,
CPUExecutionProvider, XnnpackExecutionProvider,
NeuronExecutionProvider)
--neuron_flag_use_fp16 NEURON_FLAG_USE_FP16
Set NEURON_FLAG_USE_FP16 for NeuronExecutionProvider
--neuron_flag_min_group_size NEURON_FLAG_MIN_GROUP_SIZE
Set NEURON_FLAG_MIN_GROUP_SIZE for NeuronExecutionProvider
--neuron_opt Enable NEURON_FLAG_OPTIMIZATION_STRING|--opt=3 (default: enabled)
--neuron_num_mdla Enable NEURON_FLAG_OPTIMIZATION_STRING|--num-mdla=1 (default: enabled)
--neuron_reshape_to_4d
Enable NEURON_FLAG_OPTIMIZATION_STRING|--reshape-to-4d (default: enabled)
--neuron_interval_coloring
Enable NEURON_FLAG_OPTIMIZATION_STRING|--interval-coloring-converage=1.0 (default: enabled)
Sample Execution
Per-Model Benchmarking
CPU Execution Provider:
python3 onnxruntime_benchmark.py --model <model_name.onnx> --num_runs 10 --num_threads 1 --execution_provider CPUExecutionProvider
Xnnpack Execution Provider:
python3 onnxruntime_benchmark.py --model <model_name.onnx> --num_runs 10 --num_threads 1 --execution_provider XnnpackExecutionProvider
Neuron Execution Provider:
python3 onnxruntime_benchmark.py --model <model_name.onnx> --num_runs 10 --num_threads 1 --execution_provider NeuronExecutionProvider --neuron_flag_use_fp16 1 --neuron_flag_min_group_size 1
The script will print model input details, run times for each inference, and a summary including average, min, and max inference times.
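For reference, the kind of measurement the script performs can be reproduced with a few lines of standard ONNX Runtime Python code. The snippet below is a minimal sketch, not the script's actual implementation; the model path, provider choice, and run count are placeholders:

import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("your_model.onnx",
                               providers=["CPUExecutionProvider"])

# Random inputs matching the model's declared shapes (dynamic dims pinned to 1).
inputs = {}
for meta in session.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in meta.shape]
    inputs[meta.name] = np.random.randn(*shape).astype(np.float32)

# Warm-up run, then timed runs.
session.run(None, inputs)
latencies = []
for _ in range(10):
    start = time.perf_counter()
    session.run(None, inputs)
    latencies.append((time.perf_counter() - start) * 1000.0)

print(f"avg {sum(latencies) / len(latencies):.2f} ms, "
      f"min {min(latencies):.2f} ms, max {max(latencies):.2f} ms")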
Auto Mode Benchmarking
To benchmark all ONNX models in a directory on both CPU and Neuron EP:
cd /usr/share/onnxruntime_benchmark
python3 onnxruntime_benchmark.py --auto
The script will find all ONNX models in the specified directory, benchmark each on both CPU and Neuron EP, and generate a summary table:
Summary (Neuron vs CPU):
Model CPU (ms) Neuron (ms) Speedup
----------------------------------------------------------------------
resnet18_quant.onnx 45.19 1.78 25.35x
mobilenet_v2_quant.onnx 12.91 1.27 10.18x
squeezenet_quant.onnx 32.66 8.40 3.89x
Notes
- You can specify provider options such as NEURON_FLAG_USE_FP16 and NEURON_FLAG_MIN_GROUP_SIZE for the NeuronExecutionProvider.
- Use the --profile option to enable ONNX Runtime profiling for advanced analysis (see the sketch after this list).
- For best results, run multiple inference passes (--num_runs) and adjust threading options as needed.
- The XNNPACK EP can be tested by replacing CPUExecutionProvider with XnnpackExecutionProvider in the command.
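Outside the benchmarking script, ONNX Runtime profiling can also be enabled directly through SessionOptions. The snippet below is a minimal sketch using the standard Python API; the model path and provider choice are placeholders:

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.enable_profiling = True  # write a JSON trace of operator execution

session = ort.InferenceSession("your_model.onnx",
                               sess_options=sess_options,
                               providers=["CPUExecutionProvider"])
# ... run inference as usual ...
profile_file = session.end_profiling()  # returns the path of the generated trace file
print("Profile written to:", profile_file)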