ONNX Runtime Acceleration
ONNX Runtime supports various hardware acceleration libraries through its extensible Execution Provider (EP) framework, enabling optimal execution of ONNX models on different platforms. This flexibility allows developers to deploy ONNX models in diverse environments—cloud or edge—and optimize execution by leveraging the platform’s compute capabilities.
ONNX Runtime Execution Providers
On Genio, the following execution providers are supported and integrated by default:
- CPUExecutionProvider
- XnnpackExecutionProvider
- NeuronExecutionProvider
CPUExecutionProvider and XnnpackExecutionProvider execute models on the CPU.
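If you are unsure which execution providers your onnxruntime build actually exposes, you can query them from Python. The sketch below uses standard ONNX Runtime APIs (the model path is a placeholder); on Genio 720/520 the first call should also list NeuronExecutionProvider:

import onnxruntime as ort

# Execution providers compiled into this onnxruntime build
print(ort.get_available_providers())

# After creating a session, get_providers() shows the EPs the session
# will actually use, in priority order.
session = ort.InferenceSession("your_model.onnx",
                               providers=["CPUExecutionProvider"])
print(session.get_providers())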
NPU Acceleration
MediaTek provides advanced AI capabilities through its state-of-the-art NPU. ONNX models can access these capabilities through the NeuronExecutionProvider, MediaTek's proprietary ONNX Runtime execution provider for running ONNX models on the NPU.
When this EP is used, it intelligently deploys the ONNX model to the NPU, and any operations the NPU does not support automatically fall back to the CPU for seamless execution. The NeuronExecutionProvider is available through both the Python and C++ APIs.
NeuronExecutionProvider supports both FP32 (executed as FP16) and QDQ INT8 ONNX models.
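If you do not already have a QDQ INT8 model, ONNX Runtime's static quantization tooling can produce one offline (typically on a host PC). The following is only a minimal sketch: the model paths, input name, and input shape are placeholders, and the random calibration reader must be replaced with real calibration data.

# Sketch: produce a QDQ INT8 model with ONNX Runtime static quantization.
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few random samples; replace with real calibration data."""
    def __init__(self, input_name, shape, num_samples=8):
        self._samples = iter(
            [{input_name: np.random.randn(*shape).astype(np.float32)}
             for _ in range(num_samples)])

    def get_next(self):
        return next(self._samples, None)

quantize_static(
    "your_model_fp32.onnx",                              # placeholder input path
    "your_model_int8_qdq.onnx",                          # placeholder output path
    RandomCalibrationReader("input", [1, 3, 224, 224]),  # assumed input name/shape
    quant_format=QuantFormat.QDQ,                        # emit a QDQ model
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8)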
Platform Availability
- Genio 720 / Genio 520: The NeuronExecutionProvider is available by default.
- All other Genio products: These platforms have default access to CPUExecutionProvider and XnnpackExecutionProvider.
- Genio 700 / Genio 510: To enable the NeuronExecutionProvider, follow these steps:
  1. Add the following to your local.conf:
     ENABLE_NEURON_EP = "1"
  2. Build your Rity image:
     bitbake rity-demo-image
- Genio 350: The NeuronExecutionProvider is not available. Only CPUExecutionProvider and XnnpackExecutionProvider are supported on this platform.
- Genio 1200: The NeuronExecutionProvider can be added using the same steps as Genio 700 / Genio 510, but not all ONNX ops may be supported on the NPU, which might impact performance. CPUExecutionProvider and XnnpackExecutionProvider are supported on this platform by default.
NPU Provider Options
The NeuronExecutionProvider has two important provider options:
- NEURON_FLAG_USE_FP16: Executes FP32 models in FP16 precision. Note: this flag is mandatory, as the NPU does not support FP32 execution; your model will fail to run without it.
- NEURON_FLAG_MIN_GROUP_SIZE: Sets the minimum number of nodes a subgraph must contain to be offloaded to the NPU. For larger models, setting this to a value greater than 1 may improve performance.
Getting Started: Python API Example
Below is a simple example of how to initialize an ONNX Runtime session with the NeuronExecutionProvider on Genio 720/520 using Python:
# Example Python script for running inference on an FP32 ONNX model
import onnxruntime as ort
import numpy as np
# Path to your ONNX model
model_path = "your_model.onnx"
# Neuron provider options (set as needed)
neuron_provider_options = {
    "NEURON_FLAG_USE_FP16": "1",       # Allow FP32 to execute as FP16
    "NEURON_FLAG_MIN_GROUP_SIZE": "1"  # Set minimum subgraph size
}
# Create session with NeuronExecutionProvider and options
providers = [("NeuronExecutionProvider", neuron_provider_options),
             "XnnpackExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession(model_path, providers=providers)
# Generate dummy input data based on model input shape and type
def generate_dummy_inputs(session):
    inputs = {}
    for input_meta in session.get_inputs():
        # Replace any dynamic dimensions with 1
        shape = [dim if isinstance(dim, int) else 1 for dim in input_meta.shape]
        data = np.random.randn(*shape).astype(np.float32)
        inputs[input_meta.name] = data
    return inputs
# Prepare inputs and run inference
dummy_inputs = generate_dummy_inputs(session)
output_names = [output.name for output in session.get_outputs()]
outputs = session.run(output_names, dummy_inputs)
print("Inference outputs:", outputs)
Getting Started: C++ API Example
Below is a simple example of how to initialize an ONNX Runtime session with the NeuronExecutionProvider on Genio 720/520 using C++:
#include "onnxruntime/core/providers/neuron/neuron_provider_factory.h"
#include "onnxruntime/core/session/onnxruntime_cxx_api.h"
// Initialize ONNX Runtime environment and session options
const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "example");
Ort::SessionOptions session_options;
session_options.SetGraphOptimizationLevel(ORT_ENABLE_BASIC);
// Set NeuronExecutionProvider options
std::vector<const char*> options_keys = {"NEURON_FLAG_USE_FP16", "NEURON_FLAG_MIN_GROUP_SIZE"};
std::vector<const char*> options_values = {"1", "1"}; // Use FP16, set min group size to 1
// Append NeuronExecutionProvider
if (g_ort->SessionOptionsAppendExecutionProvider(
session_options,
"Neuron",
options_keys.data(),
options_values.data(),
options_keys.size()) != 0) {
std::cout << "Failed to append NeuronExecutionProvider." << std::endl;
return 1;
}
// Create ONNX Runtime session
Ort::Session session(env, "model.onnx", session_options);
// Prepare input and output names/tensors as required by your model
// Example:
// std::vector<const char*> input_names = { "input" };
// std::vector<Ort::Value> input_tensors = { /* your input tensor(s) */ };
// std::vector<const char*> output_names = { "output" };
// Run inference
// auto output_tensors = session.Run(Ort::RunOptions{nullptr},
// input_names.data(), input_tensors.data(), input_names.size(),
// output_names.data(), output_names.size());
Note
The NPU does not support dynamic op shapes, so it is imperative to ensure that all dynamic shapes in your ONNX model have been made static.
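How you make the shapes static depends on how the model was exported; re-exporting with fixed input shapes is the cleanest approach. As a rough sketch (assuming the onnx Python package is installed and the only dynamic dimension is the batch, for which 1 is acceptable), you can also patch the graph inputs directly:

# Sketch: replace symbolic/unknown input dimensions with a fixed size.
import onnx

model = onnx.load("your_model.onnx")
for graph_input in model.graph.input:
    for dim in graph_input.type.tensor_type.shape.dim:
        if dim.dim_param or dim.dim_value <= 0:
            dim.dim_value = 1  # fix the dynamic dimension to 1
onnx.checker.check_model(model)
onnx.save(model, "your_model_static.onnx")

ONNX Runtime also ships a helper for this (python -m onnxruntime.tools.make_dynamic_shape_fixed), which may be more convenient for models with several dynamic dimensions.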