ONNX Runtime

Introduction

ONNX Runtime is a high-performance, cross-platform engine for scoring and training machine learning models in the ONNX (Open Neural Network Exchange) format. It accelerates inference and training for models from popular frameworks such as PyTorch, TensorFlow/Keras, and classical libraries like scikit-learn, LightGBM, and XGBoost. ONNX Runtime is compatible with various hardware, drivers, and operating systems, leveraging hardware accelerators and advanced graph optimizations for optimal performance.
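
If you are starting from a training framework such as PyTorch, the model first has to be exported to the ONNX format before it can be run with ONNX Runtime. A minimal sketch (the torchvision model, input shape, file name, and opset are placeholder choices):

import torch
import torchvision

# Placeholder network and input shape; replace with your own trained model and a representative input.
model = torchvision.models.mobilenet_v2(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX so the model can be executed with ONNX Runtime.
torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)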

ONNX Runtime on Genio

ONNX models can be executed efficiently on Genio platforms using ONNX Runtime. We currently support ONNX Runtime v1.20.2 on both the Kirkstone and Scarthgap Yocto releases. The ONNX Runtime recipes are included in the meta-mediatek-experimental layer.

To build ONNX Runtime packages as part of your Rity build, follow these steps:

  1. Initialize the repo, using the manifest branch that matches your Yocto release (Kirkstone or Scarthgap):

    repo init -u https://gitlab.com/mediatek/aiot/bsp/manifest.git -b rity/kirkstone -m meta-mediatek-experimental.xml
    repo init -u https://gitlab.com/mediatek/aiot/bsp/manifest.git -b rity/scarthgap -m meta-mediatek-experimental.xml
    
  2. Sync the repo:

    repo sync -j 48
    
  3. Add the meta-mediatek-experimental layer to your bblayers.conf.

  4. Add the ONNX Runtime packages to your image, for example by appending them to IMAGE_INSTALL in your local.conf:

    IMAGE_INSTALL:append = " onnxruntime onnxruntime-examples onnxruntime-staticdev "
    
  5. If you are building on Kirkstone, also add the following to your local.conf so that a sufficiently recent CMake is used for the build:

    PREFERRED_VERSION_cmake = "3.26.4"
    
  6. Build your Rity image as usual:

    bitbake rity-demo-image
    
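After flashing the image, you can quickly confirm that the ONNX Runtime Python bindings are present on the target and see which execution providers were built in:

python3 -c "import onnxruntime as ort; print(ort.__version__, ort.get_available_providers())"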

Executing ONNX Models on Genio

With ONNX Runtime support on Genio, executing ONNX models is straightforward. Below is a sample Python script to benchmark your ONNX models on CPU:

import onnxruntime as ort
import numpy as np
import time

def load_model(model_path):
    """Load the ONNX model."""
    session_options = ort.SessionOptions()
    session_options.intra_op_num_threads = 4
    session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
    session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    execution_providers = ['XnnpackExecutionProvider', 'CPUExecutionProvider']
    session = ort.InferenceSession(model_path, sess_options=session_options, providers=execution_providers)
    return session

def generate_sample_input(session):
    """Generate a random sample input based on the model's input shape."""
    input_name = session.get_inputs()[0].name
    input_shape = session.get_inputs()[0].shape
    # Symbolic or unnamed dynamic dimensions appear as strings or None; use 1 for those.
    input_shape = [dim if isinstance(dim, int) else 1 for dim in input_shape]
    # Assumes a single float32 input tensor.
    sample_input = np.random.random(input_shape).astype(np.float32)
    return {input_name: sample_input}

def run_inference(session, input_data):
    """Run inference and measure time taken."""
    start_time = time.time()
    outputs = session.run(None, input_data)
    end_time = time.time()
    return outputs, end_time - start_time

def benchmark_model(model_path, num_iterations=100):
    """Benchmark the ONNX model."""
    session = load_model(model_path)
    input_data = generate_sample_input(session)
    total_time = 0.0
    for _ in range(num_iterations):
        _, inference_time = run_inference(session, input_data)
        total_time += inference_time
    average_time = total_time / num_iterations
    print(f"Average inference time over {num_iterations} iterations: {average_time:.6f} seconds")

if __name__ == "__main__":
    model_path = "path/to/your/model.onnx"
    benchmark_model(model_path)
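
Replace model_path with the actual path to your model on the device. If the XnnpackExecutionProvider is not available in a given build, ONNX Runtime simply falls back to the next entry in the providers list, here CPUExecutionProvider.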

Examples

Below is an example of running the onnxruntime_test Python script using the EfficientNet-Lite4 ONNX model for image classification on Genio 700:

root@genio-700-evk:~/onnxruntime_example# python3 onnxruntime_test.py -i kitten.jfif -l labels_map.txt -m /home/root/efficientnet-lite4/efficientnet-lite4.onnx
0.88291174  281: 'tabby, tabby cat',
0.093538865  285: 'Egyptian cat',
0.023412632  282: 'tiger cat',
3.1124982e-05  539: 'doormat, welcome mat',
1.9749632e-05  287: 'lynx, catamount',
time: 103.599ms

This example runs on 1 CPU thread using CPUExecutionProvider.

ONNX Runtime Execution Providers

ONNX Runtime supports various hardware acceleration libraries through its extensible Execution Providers (EP) framework, enabling optimal execution of ONNX models on different platforms. This flexibility allows developers to deploy ONNX models in diverse environments—cloud or edge—and optimize execution by leveraging the platform’s compute capabilities.

On Genio, the following execution providers are supported and integrated by default:

  • CPUExecutionProvider

  • XnnpackExecutionProvider

Both providers execute on CPU.
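
To choose a provider explicitly, pass it in the providers list when creating the inference session. The sketch below prefers XNNPACK and keeps the default CPU provider as a fallback; the per-provider thread option follows the upstream XNNPACK execution provider documentation and may vary between ONNX Runtime releases:

import onnxruntime as ort

session_options = ort.SessionOptions()
# When XNNPACK is used, the upstream documentation suggests keeping the session's own
# intra-op thread pool small, since XNNPACK manages its own thread pool.
session_options.intra_op_num_threads = 1

session = ort.InferenceSession(
    "model.onnx",
    sess_options=session_options,
    providers=[
        ("XnnpackExecutionProvider", {"intra_op_num_threads": 4}),  # XNNPACK thread pool size
        "CPUExecutionProvider",  # fallback for operators XNNPACK does not cover
    ],
)
print(session.get_providers())  # lists the providers actually in use for this session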

Quantization Methods

Quantization in ONNX Runtime refers to 8-bit linear quantization of ONNX models. For more information, see the official documentation: ONNX Runtime Quantization

Dynamic Quantization

Dynamic quantization calculates quantization parameters (scale and zero point) for activations dynamically during inference, typically resulting in higher accuracy at the cost of increased inference time.

from onnxruntime.quantization import quantize_dynamic

model_fp32 = 'path/to/the/model.onnx'
model_quant = 'path/to/the/model.quant.onnx'

# Weights are quantized offline; activation scales and zero points are computed at runtime.
quantize_dynamic(model_fp32, model_quant)

Static Quantization

Static quantization uses calibration data to determine quantization parameters, which are then stored in the quantized model. There are two static quantization formats:

  • QOperator (Operator-oriented): Quantized operators have their own ONNX definitions (e.g., QLinearConv, MatMulInteger).

  • QDQ (Tensor-oriented): Inserts QuantizeLinear and DeQuantizeLinear operators between original operators to simulate quantization and dequantization.
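
Before static quantization, the official documentation recommends running ONNX Runtime's quantization pre-processing step (symbolic shape inference and graph optimization) on the float model; the module and flags shown here reflect recent ONNX Runtime releases:

python3 -m onnxruntime.quantization.preprocess --input model.onnx --output model-infer.onnx

The script below then performs static quantization in the QOperator format. It feeds randomly generated calibration data so the flow can be exercised end to end; for meaningful accuracy, replace DummyCalibrationDataReader with a reader that yields representative input samples.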

import onnx
from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantType, QuantFormat
import numpy as np
import argparse

class DummyCalibrationDataReader(CalibrationDataReader):
    def __init__(self, input_name, input_shape, num_samples=100):
        self.input_name = input_name
        self.data = [{self.input_name: np.random.rand(*input_shape).astype(np.float32)} for _ in range(num_samples)]
        self.iterator = iter(self.data)

    def get_next(self):
        return next(self.iterator, None)

def get_input_details(model_path):
    """Read the name and shape of the model's first input."""
    model = onnx.load(model_path)
    input_tensor = model.graph.input[0]
    input_name = input_tensor.name
    # dim_value is 0 for symbolic/unnamed dynamic dimensions; substitute 1 so that
    # random calibration tensors can be generated.
    input_shape = [dim.dim_value if dim.dim_value > 0 else 1
                   for dim in input_tensor.type.tensor_type.shape.dim]
    return input_name, input_shape

def quantize_model(fp32_model_path, int8_model_path):
    input_name, input_shape = get_input_details(fp32_model_path)
    calibration_data_reader = DummyCalibrationDataReader(input_name, input_shape)
    quantize_static(fp32_model_path, int8_model_path, calibration_data_reader, quant_format=QuantFormat.QOperator, weight_type=QuantType.QInt8)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Quantize FP32 ONNX model to INT8 using dummy inputs")
    parser.add_argument("fp32_model_path", type=str, help="Path to the FP32 ONNX model")
    parser.add_argument("int8_model_path", type=str, help="Path to save the INT8 ONNX model")
    args = parser.parse_args()
    quantize_model(args.fp32_model_path, args.int8_model_path)
    print(f"Quantized model saved to {args.int8_model_path}")

Make Dynamic Shapes Static

Many ONNX models use dynamic shapes, which may not always be compatible with certain execution providers. In such cases, it is often necessary to convert dynamic shapes to static shapes. This conversion is typically performed on a host machine, after which the static ONNX model can be deployed to the target device for application use.

For more details, refer to the official documentation: ONNX Runtime: Make Dynamic Shape Fixed

Making Symbolic Dimensions Fixed

Consider a model (as visualized in Netron) with a symbolic dimension called batch for the batch size in input:0. We want to update this symbolic dimension to a fixed value, such as 1.

[Figure: Example model with symbolic dimension]

To convert the symbolic dimension to a static value, use the following Python command:

python3 -m onnxruntime.tools.make_dynamic_shape_fixed --dim_param batch --dim_value 1 model.onnx model.fixed.onnx

After this replacement, the shape for input:0 will be fixed to [1, 36, 36, 3].

[Figure: Example model with fixed dimension]

Making Input Shape Fixed

Some models have unnamed dynamic dimensions, often represented as ? in Netron (for example, in the input x). Since these dimensions are unnamed, you can update the shape using the --input_shape option.

[Figure: Example model with unnamed dynamic dimension]

To fix these dynamic shapes, use the following command:

python3 -m onnxruntime.tools.make_dynamic_shape_fixed --input_name x --input_shape 1,3,960,960 model.onnx model.fixed.onnx

After running this command, the shape for x will be fixed to [1, 3, 960, 960].

[Figure: Example model with fixed dynamic dimension]
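
You can confirm that the conversion produced fully static shapes by inspecting the inputs of the fixed model, for example with a short script like this (the file name is a placeholder):

import onnx

model = onnx.load("model.fixed.onnx")
for inp in model.graph.input:
    # Every dimension should now report a positive dim_value; symbolic names would show up as strings.
    dims = [d.dim_value if d.HasField("dim_value") else d.dim_param
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)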

ONNX TAO

For more information about TAO, please refer to the official documentation: MediaTek Genio TAO Documentation

ORT-GENAI

Genio supports ONNX Runtime GenAI, which is provided as a Python pip package integrated into the Yocto build. Currently, this feature is in the experimental phase, and GenAI models are executed only on the CPU.

With the ONNX Runtime GenAI package, you can run models from the Phi family (text and vision) as well as DeepSeek on Genio. This enables the development of chatbots and experimentation with the conversational capabilities of small language models (SLMs). As of now, version 0.6 has been tested on Genio.
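
As a starting point, the sketch below follows the upstream onnxruntime-genai examples for the 0.5/0.6 Python API and generates a reply to a single prompt. The exact API calls and the prompt template depend on the package version and on the model being used, and the model folder path is a placeholder:

import onnxruntime_genai as og

# Folder containing a GenAI-exported model (for example, a Phi-family model prepared for onnxruntime-genai).
model = og.Model("/home/root/phi-3-mini-4k-instruct-onnx")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>"
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

# Stream tokens as they are generated.
while not generator.is_done():
    generator.generate_next_token()
    token = generator.get_next_tokens()[0]
    print(stream.decode(token), end="", flush=True)
print()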

Note: This section will be updated as more features and support become available.