ONNX Runtime Acceleration
ONNX Runtime supports various hardware acceleration libraries through its extensible Execution Provider (EP) framework, enabling optimal execution of ONNX models on different platforms. This flexibility allows developers to deploy ONNX models in diverse environments—cloud or edge—and optimize execution by leveraging the platform’s compute capabilities.
ONNX Runtime Execution Providers
On Genio, the following execution providers are supported and integrated by default:
- CPUExecutionProvider
- XnnpackExecutionProvider
- NeuronExecutionProvider
CPUExecutionProvider and XnnpackExecutionProvider execute models on the
CPU.
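If you only need CPU execution, a session can be created by listing these providers explicitly. The snippet below is a minimal sketch using the standard ONNX Runtime Python API; the model path is a placeholder:

import onnxruntime as ort

# Execution providers registered in this ONNX Runtime build.
print(ort.get_available_providers())

# Create a session that tries XNNPACK first and falls back to the default CPU EP.
session = ort.InferenceSession(
    "your_model.onnx",
    providers=["XnnpackExecutionProvider", "CPUExecutionProvider"],
)
print("Active providers:", session.get_providers())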
NPU Acceleration
MediaTek provides advanced AI capabilities through its state-of-the-art NPU. ONNX models can access the NPU by leveraging the NeuronExecutionProvider, MediaTek's proprietary ONNX Runtime Execution Provider and the official way to run ONNX models on the MediaTek NPU.
When this EP is used, it intelligently deploys the supported portions of the ONNX model to the NPU, and any unsupported operations automatically fall back to the CPU for seamless execution. The NeuronExecutionProvider supports both the Python and C++ APIs.
NeuronExecutionProvider supports both FP32 (executed as FP16) and QDQ INT8
ONNX models.
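If you are starting from an FP32 model and want a QDQ INT8 variant, ONNX Runtime's static quantization tooling can produce one. The snippet below is a minimal sketch: the input name, shape, file names, and the random calibration reader are placeholders you would replace with your own model details and representative calibration data.

import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class RandomCalibrationReader(CalibrationDataReader):
    # Toy calibration reader; replace with batches of representative input data.
    def __init__(self, input_name, shape, num_batches=8):
        self._batches = iter(
            [{input_name: np.random.randn(*shape).astype(np.float32)}
             for _ in range(num_batches)])

    def get_next(self):
        return next(self._batches, None)

quantize_static(
    "your_model_fp32.onnx",                               # placeholder input model
    "your_model_int8_qdq.onnx",                           # placeholder output model
    RandomCalibrationReader("input", [1, 3, 224, 224]),   # hypothetical input name/shape
    quant_format=QuantFormat.QDQ,                         # emit a QDQ-style INT8 graph
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)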
Platform Availability
- Genio 720 / Genio 520: The NeuronExecutionProvider is available by default.
- All other Genio products: These platforms have default access to CPUExecutionProvider and XnnpackExecutionProvider.
- Genio 700 / Genio 510: To enable the NeuronExecutionProvider, follow the steps below (see the verification sketch after this list):
  1. Add the following to your local.conf:
     ENABLE_NEURON_EP = "1"
  2. Build your Rity image:
     bitbake rity-demo-image
- Genio 350: The NeuronExecutionProvider is not available. Only CPUExecutionProvider and XnnpackExecutionProvider are supported on this platform.
- Genio 1200: The NeuronExecutionProvider can be added using the same steps as Genio 700 / Genio 510, but not all ONNX ops may be supported on the NPU, which might impact performance. CPUExecutionProvider and XnnpackExecutionProvider are supported on this platform by default.
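Once the image is running on the board, you can confirm whether the Neuron EP was built in with a quick check. This is a minimal sketch using the standard ONNX Runtime Python API:

import onnxruntime as ort

providers = ort.get_available_providers()
print(providers)

if "NeuronExecutionProvider" in providers:
    print("Neuron EP is available on this image.")
else:
    print("Neuron EP is not registered; check your build configuration.")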
NPU Provider Options
The NeuronExecutionProvider has several important provider options:
- NEURON_FLAG_USE_FP16: Executes FP32 models in FP16 precision. Note: This flag is mandatory, as the NPU does not support FP32 execution. Your model will fail to run without it.
- NEURON_FLAG_MIN_GROUP_SIZE: Sets the minimum number of nodes for a subgraph to be offloaded to the NPU. For larger models, setting this to a value greater than 1 may improve performance.
- NEURON_FLAG_OPTIMIZATION_STRING: Advanced optimization flags (see the sketch after this list), including:
  - --opt=3 (enabled by default)
  - --num-mdla=1 (enabled by default)
  - --reshape-to-4d (enabled by default)
  - --interval-coloring-converage=1.0 (enabled by default)
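As a usage sketch, these options are passed as a dictionary when registering the EP, in the same way as the Python example in the next section. The NEURON_FLAG_OPTIMIZATION_STRING value shown here assumes the compiler flags are combined into a single string; verify the exact format against your BSP release.

import onnxruntime as ort

neuron_options = {
    "NEURON_FLAG_USE_FP16": "1",         # mandatory: the NPU executes FP32 as FP16
    "NEURON_FLAG_MIN_GROUP_SIZE": "10",  # example: offload only subgraphs with >= 10 nodes
    # Assumed format: compiler flags joined into one string (check your release notes).
    "NEURON_FLAG_OPTIMIZATION_STRING": "--opt=3 --num-mdla=1 --reshape-to-4d",
}

session = ort.InferenceSession(
    "your_model.onnx",
    providers=[("NeuronExecutionProvider", neuron_options),
               "CPUExecutionProvider"],
)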
Getting Started: Python API Example
Below is a simple example of how to initialize an ONNX Runtime session with the
NeuronExecutionProvider on Genio 720/520 using Python:
import onnxruntime as ort
import numpy as np

model_path = "your_model.onnx"

neuron_provider_options = {
    "NEURON_FLAG_USE_FP16": "1",       # Allow FP32 to execute as FP16
    "NEURON_FLAG_MIN_GROUP_SIZE": "1"  # Set minimum subgraph size
}

providers = [
    ("NeuronExecutionProvider", neuron_provider_options),
    "XnnpackExecutionProvider",
    "CPUExecutionProvider",
]

session = ort.InferenceSession(model_path, providers=providers)

def generate_dummy_inputs(session):
    # Create random inputs matching the model's input shapes (dynamic dims become 1).
    inputs = {}
    for input_meta in session.get_inputs():
        shape = [dim if isinstance(dim, int) else 1 for dim in input_meta.shape]
        data = np.random.randn(*shape).astype(np.float32)
        inputs[input_meta.name] = data
    return inputs

dummy_inputs = generate_dummy_inputs(session)
output_names = [output.name for output in session.get_outputs()]
outputs = session.run(output_names, dummy_inputs)
print("Inference outputs:", outputs)
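To confirm which providers the session actually registered (for example, that the Neuron EP was not silently dropped), you can query the session object from the example above; get_providers() is standard ONNX Runtime API:

# Providers are listed in priority order; NeuronExecutionProvider should appear
# first if it was registered successfully.
print("Session providers:", session.get_providers())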
Getting Started: C++ API Example
Below is a simple example of how to initialize an ONNX Runtime session with
the NeuronExecutionProvider on Genio 720/520 using C++:
#include <iostream>
#include <vector>

#include "onnxruntime/core/providers/neuron/neuron_provider_factory.h"
#include "onnxruntime/core/session/onnxruntime_cxx_api.h"

int main() {
  const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "example");
  Ort::SessionOptions session_options;
  session_options.SetGraphOptimizationLevel(ORT_ENABLE_BASIC);

  // NeuronExecutionProvider options, passed as parallel key/value arrays.
  std::vector<const char*> options_keys = {"NEURON_FLAG_USE_FP16", "NEURON_FLAG_MIN_GROUP_SIZE"};
  std::vector<const char*> options_values = {"1", "1"};

  if (g_ort->SessionOptionsAppendExecutionProvider(
          session_options,
          "Neuron",
          options_keys.data(),
          options_values.data(),
          options_keys.size()) != nullptr) {
    std::cout << "Failed to append NeuronExecutionProvider." << std::endl;
    return 1;
  }

  Ort::Session session(env, "model.onnx", session_options);

  // Prepare input and output names/tensors as required by your model, e.g.:
  // std::vector<const char*> input_names = { "input" };
  // std::vector<Ort::Value> input_tensors = { /* your input tensor(s) */ };
  // std::vector<const char*> output_names = { "output" };
  //
  // Run inference:
  // auto output_tensors = session.Run(Ort::RunOptions{nullptr},
  //     input_names.data(), input_tensors.data(), input_names.size(),
  //     output_names.data(), output_names.size());

  return 0;
}
Note
The NPU does not support dynamic op shapes, so make sure all dynamic shapes in your ONNX model are made static before running it with the NeuronExecutionProvider (see the sketch below).
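One way to pin dynamic dimensions is to edit the model's input shapes with the onnx Python package before deployment. The snippet below is a minimal sketch that pins every symbolic dimension to 1, which may not be the right size for your model; ONNX Runtime also provides a command-line helper (python -m onnxruntime.tools.make_dynamic_shape_fixed) for the same purpose.

import onnx

model = onnx.load("your_model.onnx")
for graph_input in model.graph.input:
    for dim in graph_input.type.tensor_type.shape.dim:
        if dim.dim_param:      # symbolic (dynamic) dimension, e.g. "batch"
            dim.dim_value = 1  # pin to a fixed size that matches your use case
onnx.save(model, "your_model_static.onnx")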
ONNX Runtime Benchmarking
Python benchmarking scripts are provided to evaluate ONNX model inference performance on different hardware and execution providers.
Benchmarking Script Overview
- onnxruntime_benchmark.py is the Python-based benchmarking script for ONNX models on the CPU, Neuron, and XNNPACK Execution Providers.
- The script benchmarks models on the selected execution providers and produces a performance comparison table when executed in auto mode.
Usage
To see all available options:
cd /usr/share/onnxruntime_benchmark
python3 onnxruntime_benchmark.py -h
usage: onnxruntime_benchmark.py [-h] [--model MODEL] [--auto]
[--directory DIRECTORY] [--num_runs NUM_RUNS]
[--num_threads NUM_THREADS]
[--thread_mode {parallel,sequential}]
[--profile]
[--optimize {DISABLE_ALL,ENABLE_BASIC,ENABLE_EXTENDED,ENABLE_ALL}]
[--execution_provider EXECUTION_PROVIDER]
[--neuron_flag_use_fp16 NEURON_FLAG_USE_FP16]
[--neuron_flag_min_group_size NEURON_FLAG_MIN_GROUP_SIZE]
[--neuron_opt] [--neuron_num_mdla]
[--neuron_reshape_to_4d]
[--neuron_interval_coloring]
options:
-h, --help show this help message and exit
--model MODEL, -m MODEL
Path to ONNX model
--auto Auto-benchmark all ONNX models in directory
--directory DIRECTORY, -d DIRECTORY
Directory to search for ONNX models (used with --auto)
--num_runs NUM_RUNS, -n NUM_RUNS
Number of inference runs for benchmarking
--num_threads NUM_THREADS
Number of threads for parallel/sequential execution
--thread_mode {parallel,sequential}
Threading mode: 'parallel' sets intra_op_num_threads,
'sequential' sets inter_op_num_threads
--profile Enable ONNX Runtime profiling
--optimize {DISABLE_ALL,ENABLE_BASIC,ENABLE_EXTENDED,ENABLE_ALL}
Set ONNX Runtime graph optimization level
--execution_provider EXECUTION_PROVIDER
ONNX Runtime execution provider (e.g.,
CPUExecutionProvider, XnnpackExecutionProvider,
NeuronExecutionProvider)
--neuron_flag_use_fp16 NEURON_FLAG_USE_FP16
Set NEURON_FLAG_USE_FP16 for NeuronExecutionProvider
--neuron_flag_min_group_size NEURON_FLAG_MIN_GROUP_SIZE
Set NEURON_FLAG_MIN_GROUP_SIZE for NeuronExecutionProvider
--neuron_opt Enable NEURON_FLAG_OPTIMIZATION_STRING|--opt=3 (default: enabled)
--neuron_num_mdla Enable NEURON_FLAG_OPTIMIZATION_STRING|--num-mdla=1 (default: enabled)
--neuron_reshape_to_4d
Enable NEURON_FLAG_OPTIMIZATION_STRING|--reshape-to-4d (default: enabled)
--neuron_interval_coloring
Enable NEURON_FLAG_OPTIMIZATION_STRING|--interval-coloring-converage=1.0 (default: enabled)
Sample Execution
Per-Model Benchmarking
CPU Execution Provider:
python3 onnxruntime_benchmark.py --model <model_name.onnx> --num_runs 10 --num_threads 1 --execution_provider CPUExecutionProvider
Xnnpack Execution Provider:
python3 onnxruntime_benchmark.py --model <model_name.onnx> --num_runs 10 --num_threads 1 --execution_provider XnnpackExecutionProvider
Neuron Execution Provider:
python3 onnxruntime_benchmark.py --model <model_name.onnx> --num_runs 10 --num_threads 1 --execution_provider NeuronExecutionProvider --neuron_flag_use_fp16 1 --neuron_flag_min_group_size 1
The script will print model input details, run times for each inference, and a summary including average, min, and max inference times.
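For reference, the kind of measurement the script performs can be reproduced with a few lines of standard ONNX Runtime Python code. The snippet below is a minimal sketch, not the script's actual implementation; the model path, provider choice, and run count are placeholders:

import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("your_model.onnx",
                               providers=["CPUExecutionProvider"])

# Random inputs matching the model's declared shapes (dynamic dims pinned to 1).
inputs = {}
for meta in session.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in meta.shape]
    inputs[meta.name] = np.random.randn(*shape).astype(np.float32)

# Warm-up run, then timed runs.
session.run(None, inputs)
latencies = []
for _ in range(10):
    start = time.perf_counter()
    session.run(None, inputs)
    latencies.append((time.perf_counter() - start) * 1000.0)

print(f"avg {sum(latencies) / len(latencies):.2f} ms, "
      f"min {min(latencies):.2f} ms, max {max(latencies):.2f} ms")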
Auto Mode Benchmarking
To benchmark all ONNX models in a directory on both CPU and Neuron EP:
cd /usr/share/onnxruntime_benchmark
python3 onnxruntime_benchmark.py --auto
The script will find all ONNX models in the specified directory, benchmark each on both CPU and Neuron EP, and generate a summary table:
Summary (Neuron vs CPU):
Model CPU (ms) Neuron (ms) Speedup
----------------------------------------------------------------------
resnet18_quant.onnx 45.19 1.78 25.35x
mobilenet_v2_quant.onnx 12.91 1.27 10.18x
squeezenet_quant.onnx 32.66 8.40 3.89x
Notes
- You can specify provider options such as NEURON_FLAG_USE_FP16 and NEURON_FLAG_MIN_GROUP_SIZE for the NeuronExecutionProvider.
- Use the --profile option to enable ONNX Runtime profiling for advanced analysis (see the sketch after this list).
- For best results, run multiple inference passes (--num_runs) and adjust threading options as needed.
- The XNNPACK EP can be tested by replacing CPUExecutionProvider with XnnpackExecutionProvider in the command.
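Outside the benchmarking script, ONNX Runtime profiling can also be enabled directly through SessionOptions. The snippet below is a minimal sketch using the standard Python API; the model path and provider choice are placeholders:

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.enable_profiling = True  # write a JSON trace of operator execution

session = ort.InferenceSession("your_model.onnx",
                               sess_options=sess_options,
                               providers=["CPUExecutionProvider"])
# ... run inference as usual ...
profile_file = session.end_profiling()  # returns the path of the generated trace file
print("Profile written to:", profile_file)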