ONNX Runtime Acceleration

ONNX Runtime supports various hardware acceleration libraries through its extensible Execution Provider (EP) framework, enabling efficient execution of ONNX models on different platforms. This flexibility lets developers deploy ONNX models in diverse environments, from cloud to edge, and optimize execution by leveraging each platform's compute capabilities.

ONNX Runtime Execution Providers

On Genio, the following execution providers are supported and integrated by default:

  • CPUExecutionProvider

  • XnnpackExecutionProvider

  • NeuronExecutionProvider

CPUExecutionProvider and XnnpackExecutionProvider execute models on the CPU.
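
To run a model entirely on the CPU, either of these providers can be requested explicitly when a session is created. Below is a minimal sketch (the model path is a placeholder):

# Minimal sketch: run a model on the CPU via XNNPACK, with CPU fallback
import onnxruntime as ort

session = ort.InferenceSession(
    "your_model.onnx",  # placeholder path
    providers=["XnnpackExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # execution providers attached to this session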

NPU Acceleration

MediaTek provides advanced AI capabilities through its state-of-the-art NPU. ONNX models can access the NPU through the NeuronExecutionProvider, MediaTek's proprietary ONNX Runtime Execution Provider.

The NeuronExecutionProvider is the official execution provider for running ONNX models on the MediaTek NPU. When this EP is used, supported parts of the model are deployed to the NPU, and any unsupported operations automatically fall back to the CPU for seamless execution. It is available through both the Python and C++ APIs.

NeuronExecutionProvider supports both FP32 (executed as FP16) and QDQ INT8 ONNX models.
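
To see how a particular model is partitioned between the NPU and the CPU, one option is to raise the session log verbosity and inspect the node placement messages that ONNX Runtime prints during session creation (the exact log output depends on the ONNX Runtime version). A minimal sketch, assuming the model file is present on the target:

# Minimal sketch: use verbose logging to observe which nodes run on the NPU
# (NeuronExecutionProvider) and which fall back to the CPU
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = verbose

session = ort.InferenceSession(
    "your_model.onnx",  # placeholder path
    sess_options=so,
    providers=["NeuronExecutionProvider", "CPUExecutionProvider"],
)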

Platform Availability

  • Genio 720 / Genio 520: The NeuronExecutionProvider is available by default.

  • All other Genio products: These platforms have default access to CPUExecutionProvider and XnnpackExecutionProvider.

  • Genio 700 / Genio 510: To enable the NeuronExecutionProvider, follow the steps below:

    1. Add the following to your local.conf:

      ENABLE_NEURON_EP = "1"
      
    2. Build your Rity image:

      bitbake rity-demo-image
      
  • Genio 350: The NeuronExecutionProvider is not available. Only CPUExecutionProvider and XnnpackExecutionProvider are supported on this platform.

  • Genio 1200: The NeuronExecutionProvider can be enabled using the same steps as for Genio 700/510. However, not all ONNX operators are supported on its NPU, which can impact performance (unsupported operators fall back to the CPU). CPUExecutionProvider and XnnpackExecutionProvider are supported on this platform by default.
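
Whichever platform and image configuration you use, a quick way to confirm which execution providers your build exposes is to query the runtime on the target, for example:

# Minimal sketch: check which execution providers this ONNX Runtime build exposes
import onnxruntime as ort

available = ort.get_available_providers()
print(available)
print("NPU (NeuronExecutionProvider) available:", "NeuronExecutionProvider" in available)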

NPU Provider Options

The NeuronExecutionProvider has two important provider options:

  • NEURON_FLAG_USE_FP16: Executes FP32 models in FP16 precision. Note: This flag is mandatory for FP32 models, as the NPU does not support native FP32 execution; an FP32 model will fail to run without it.

  • NEURON_FLAG_MIN_GROUP_SIZE: Sets the minimum number of nodes for a subgraph to be offloaded to the NPU. For larger models, setting this to a value greater than 1 may improve performance.

Getting Started: Python API Example

Below is a simple example of how to initialize an ONNX Runtime session with the NeuronExecutionProvider on Genio 720/520 using Python:

# Dummy Python Example Script for inferencing FP32 ONNX model

import onnxruntime as ort
import numpy as np

# Path to your ONNX model
model_path = "your_model.onnx"

# Neuron provider options (set as needed)
neuron_provider_options = {
    "NEURON_FLAG_USE_FP16": "1",           # Allow FP32 to execute as FP16
    "NEURON_FLAG_MIN_GROUP_SIZE": "1"      # Set minimum subgraph size
}

# Create session with NeuronExecutionProvider and options
providers = [("NeuronExecutionProvider", neuron_provider_options),
             "XnnpackExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession(model_path, providers=providers)

# Generate dummy FP32 input data based on the model's input shapes
def generate_dummy_inputs(session):
    inputs = {}
    for input_meta in session.get_inputs():
        # Substitute 1 for any dynamic/unknown dimensions in the dummy data
        shape = [dim if isinstance(dim, int) else 1 for dim in input_meta.shape]
        data = np.random.randn(*shape).astype(np.float32)
        inputs[input_meta.name] = data
    return inputs

# Prepare inputs and run inference
dummy_inputs = generate_dummy_inputs(session)
output_names = [output.name for output in session.get_outputs()]
outputs = session.run(output_names, dummy_inputs)

print("Inference outputs:", outputs)

Getting Started: C++ API Example

Below is a simple example of how to initialize an ONNX Runtime session with the NeuronExecutionProvider on Genio 720/520 using C++:

#include "onnxruntime/core/providers/neuron/neuron_provider_factory.h"
#include "onnxruntime/core/session/onnxruntime_cxx_api.h"

int main() {
    // Initialize ONNX Runtime environment and session options
    const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "example");
    Ort::SessionOptions session_options;
    session_options.SetGraphOptimizationLevel(ORT_ENABLE_BASIC);

    // Set NeuronExecutionProvider options
    std::vector<const char*> options_keys = {"NEURON_FLAG_USE_FP16", "NEURON_FLAG_MIN_GROUP_SIZE"};
    std::vector<const char*> options_values = {"1", "1"}; // Use FP16, set min group size to 1

    // Append NeuronExecutionProvider (the C API returns a non-null OrtStatus* on failure)
    if (g_ort->SessionOptionsAppendExecutionProvider(
            session_options,
            "Neuron",
            options_keys.data(),
            options_values.data(),
            options_keys.size()) != nullptr) {
        std::cout << "Failed to append NeuronExecutionProvider." << std::endl;
        return 1;
    }

    // Create ONNX Runtime session
    Ort::Session session(env, "model.onnx", session_options);

    // Prepare input and output names/tensors as required by your model
    // Example:
    // std::vector<const char*> input_names = { "input" };
    // std::vector<Ort::Value> input_tensors = { /* your input tensor(s) */ };
    // std::vector<const char*> output_names = { "output" };

    // Run inference
    // auto output_tensors = session.Run(Ort::RunOptions{nullptr},
    //                                   input_names.data(), input_tensors.data(), input_names.size(),
    //                                   output_names.data(), output_names.size());

    return 0;
}

Note

The NPU does not support dynamic tensor shapes, so make sure that all dynamic shapes in your ONNX model have been made static before running it with the NeuronExecutionProvider.
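
If a model was exported with dynamic dimensions (for example, a symbolic batch size), they can be pinned before deployment. The sketch below is not part of the BSP: it uses the onnx Python package directly, the file names are placeholders, and it assumes a fixed value of 1 is appropriate for every dynamic dimension; adjust as needed. Recent onnxruntime releases also include a helper for this task (onnxruntime.tools.make_dynamic_shape_fixed).

# Minimal sketch: pin dynamic input dimensions to fixed values so the model
# can run on the NPU. File names and the fixed value of 1 are placeholders.
import onnx

model = onnx.load("your_model.onnx")

for graph_input in model.graph.input:
    for dim in graph_input.type.tensor_type.shape.dim:
        # A dimension is dynamic if it carries a symbolic name (dim_param) or no value
        if dim.HasField("dim_param") or not dim.HasField("dim_value"):
            dim.dim_value = 1  # setting dim_value clears dim_param (oneof field)

model = onnx.shape_inference.infer_shapes(model)  # propagate the now-static shapes
onnx.checker.check_model(model)
onnx.save(model, "your_model_static.onnx")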