.. include:: /keyword.rst

=============
ONNX Runtime
=============

Introduction
------------

`ONNX Runtime <https://onnxruntime.ai>`_ is a high-performance, cross-platform engine for scoring and training machine learning models in the ONNX (Open Neural Network Exchange) format. It accelerates inference and training for models from popular frameworks such as PyTorch and TensorFlow/Keras, as well as classical libraries like scikit-learn, LightGBM, and XGBoost. ONNX Runtime is compatible with a wide range of hardware, drivers, and operating systems, leveraging hardware accelerators and advanced graph optimizations for optimal performance.

ONNX Runtime on Genio
---------------------

ONNX models can be executed efficiently on Genio platforms using ONNX Runtime. We currently support ONNX Runtime v1.20.2 on both the Kirkstone and Scarthgap Yocto versions. The ONNX Runtime recipes are included in the `meta-mediatek-experimental` layer.

To build the ONNX Runtime packages as part of your Rity build, follow these steps:

1. Initialize the repo (use the manifest branch that matches your Yocto release):

   .. code-block:: shell

      repo init -u https://gitlab.com/mediatek/aiot/bsp/manifest.git -b rity/kirkstone -m meta-mediatek-experimental.xml
      repo init -u https://gitlab.com/mediatek/aiot/bsp/manifest.git -b rity/scarthgap -m meta-mediatek-experimental.xml

2. Sync the repo:

   .. code-block:: shell

      repo sync -j 48

3. Add the ``meta-mediatek-experimental`` layer to your ``bblayers.conf``.

4. Add the following packages to your ``local.conf``:

   .. code-block:: shell

      " onnxruntime onnxruntime-examples onnxruntime-staticdev "

5. If using Kirkstone, ensure you also add:

   .. code-block:: shell

      PREFERRED_VERSION_cmake = "3.26.4"

6. Build your Rity image as usual:

   .. code-block:: shell

      bitbake rity-demo-image
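After flashing the built image to the board, you can confirm that the ONNX Runtime Python package is present on the target by importing it from Python 3. A minimal check (assuming Python 3 is available in your image):

.. code-block:: python

   import onnxruntime as ort

   # On a successful build this reports the supported version (1.20.2)
   # and lists the CPU-based execution providers described below.
   print(ort.__version__)
   print(ort.get_available_providers())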
Executing ONNX Models on Genio
------------------------------

With ONNX Runtime support on Genio, executing ONNX models is straightforward. Below is a sample Python script to benchmark your ONNX models on the CPU:

.. code-block:: python

   import onnxruntime as ort
   import numpy as np
   import time

   def load_model(model_path):
       """Load the ONNX model."""
       session_options = ort.SessionOptions()
       session_options.intra_op_num_threads = 4
       session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
       session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
       execution_providers = ['XnnpackExecutionProvider', 'CPUExecutionProvider']
       session = ort.InferenceSession(model_path, sess_options=session_options, providers=execution_providers)
       return session

   def generate_sample_input(session):
       """Generate a sample input based on the model's input shape."""
       input_name = session.get_inputs()[0].name
       input_shape = session.get_inputs()[0].shape
       sample_input = np.random.random(input_shape).astype(np.float32)
       return {input_name: sample_input}

   def run_inference(session, input_data):
       """Run inference and measure the time taken."""
       start_time = time.time()
       outputs = session.run(None, input_data)
       end_time = time.time()
       return outputs, end_time - start_time

   def benchmark_model(model_path, num_iterations=100):
       """Benchmark the ONNX model."""
       session = load_model(model_path)
       input_data = generate_sample_input(session)
       total_time = 0.0
       for _ in range(num_iterations):
           _, inference_time = run_inference(session, input_data)
           total_time += inference_time
       average_time = total_time / num_iterations
       print(f"Average inference time over {num_iterations} iterations: {average_time:.6f} seconds")

   if __name__ == "__main__":
       model_path = "path/to/your/model.onnx"
       benchmark_model(model_path)

Examples
--------

Below is an example of running the `onnxruntime_test` Python script using the EfficientNet-Lite4 ONNX model for image classification on Genio 700:

.. code-block:: shell

   root@genio-700-evk:~/onnxruntime_example# python3 onnxruntime_test.py -i kitten.jfif -l labels_map.txt -m /home/root/efficientnet-lite4/efficientnet-lite4.onnx
   0.88291174 281: 'tabby, tabby cat',
   0.093538865 285: 'Egyptian cat',
   0.023412632 282: 'tiger cat',
   3.1124982e-05 539: 'doormat, welcome mat',
   1.9749632e-05 287: 'lynx, catamount',
   time: 103.599ms

This example runs on 1 CPU thread using `CPUExecutionProvider`.

ONNX Runtime Execution Providers
--------------------------------

ONNX Runtime supports various hardware acceleration libraries through its extensible Execution Providers (EP) framework, enabling optimal execution of ONNX models on different platforms. This flexibility allows developers to deploy ONNX models in diverse environments, whether in the cloud or at the edge, and to optimize execution by leveraging the platform's compute capabilities.

On Genio, the following execution providers are supported and integrated by default:

- `CPUExecutionProvider`
- `XnnpackExecutionProvider`

Both providers execute on the CPU.
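To select a provider explicitly, pass the ``providers`` list when creating an inference session; ONNX Runtime falls back to the next entry in the list if a provider is unavailable. A minimal sketch (the model path is a placeholder):

.. code-block:: python

   import onnxruntime as ort

   # Execution providers compiled into this ONNX Runtime build.
   print(ort.get_available_providers())

   # Prefer XNNPACK, falling back to the default CPU provider.
   session = ort.InferenceSession(
       "model.onnx",
       providers=["XnnpackExecutionProvider", "CPUExecutionProvider"],
   )

   # Providers actually assigned to the session, in priority order.
   print(session.get_providers())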
Quantization Methods
--------------------

Quantization in ONNX Runtime refers to 8-bit linear quantization of ONNX models. For more information, see the official documentation: `ONNX Runtime Quantization`_

**Dynamic Quantization**

Dynamic quantization calculates quantization parameters (scale and zero point) for activations dynamically during inference, typically resulting in higher accuracy at the cost of increased inference time.

.. code-block:: python

   import onnx
   from onnxruntime.quantization import quantize_dynamic, QuantType

   model_fp32 = 'path/to/the/model.onnx'
   model_quant = 'path/to/the/model.quant.onnx'

   quantized_model = quantize_dynamic(model_fp32, model_quant)

**Static Quantization**

Static quantization uses calibration data to determine quantization parameters, which are then stored in the quantized model. There are two static quantization formats:

- **QOperator (operator-oriented):** Quantized operators have their own ONNX definitions (e.g., QLinearConv, MatMulInteger).
- **QDQ (tensor-oriented):** Inserts `QuantizeLinear` and `DeQuantizeLinear` operators between the original operators to simulate quantization and dequantization.

.. code-block:: python

   import onnx
   import onnxruntime
   from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantType, QuantFormat
   import numpy as np
   import argparse

   class DummyCalibrationDataReader(CalibrationDataReader):
       def __init__(self, input_name, input_shape, num_samples=100):
           self.input_name = input_name
           self.data = [{self.input_name: np.random.rand(*input_shape).astype(np.float32)} for _ in range(num_samples)]
           self.iterator = iter(self.data)

       def get_next(self):
           return next(self.iterator, None)

   def get_input_details(model_path):
       model = onnx.load(model_path)
       graph = model.graph
       input_tensor = graph.input[0]
       input_name = input_tensor.name
       input_shape = [dim.dim_value for dim in input_tensor.type.tensor_type.shape.dim]
       return input_name, input_shape

   def quantize_model(fp32_model_path, int8_model_path):
       input_name, input_shape = get_input_details(fp32_model_path)
       calibration_data_reader = DummyCalibrationDataReader(input_name, input_shape)
       quantize_static(fp32_model_path, int8_model_path, calibration_data_reader,
                       quant_format=QuantFormat.QOperator, weight_type=QuantType.QInt8)

   if __name__ == "__main__":
       parser = argparse.ArgumentParser(description="Quantize an FP32 ONNX model to INT8 using dummy inputs")
       parser.add_argument("fp32_model_path", type=str, help="Path to the FP32 ONNX model")
       parser.add_argument("int8_model_path", type=str, help="Path to save the INT8 ONNX model")
       args = parser.parse_args()

       quantize_model(args.fp32_model_path, args.int8_model_path)
       print(f"Quantized model saved to {args.int8_model_path}")

Make Dynamic Shapes Static
--------------------------

Many ONNX models use dynamic shapes, which may not always be compatible with certain execution providers. In such cases, it is often necessary to convert dynamic shapes to static shapes. This conversion is typically performed on a host machine, after which the static ONNX model can be deployed to the target device for application use. For more details, refer to the official documentation: `ONNX Runtime: Make Dynamic Shape Fixed`_

**Making Symbolic Dimensions Fixed**

Consider a model (as visualized in Netron) with a symbolic dimension called ``batch`` for the batch size in ``input:0``. We want to update this symbolic dimension to a fixed value, such as 1.

.. figure:: /_asset/dynamic_shape_1.PNG
   :align: center

   Example model with symbolic dimension

To convert the symbolic dimension to a static value, use the following command:

.. code-block:: shell

   python3 -m onnxruntime.tools.make_dynamic_shape_fixed --dim_param batch --dim_value 1 model.onnx model.fixed.onnx

After this replacement, the shape of ``input:0`` is fixed to ``[1, 36, 36, 3]``.

.. figure:: /_asset/dynamic_shape_2.PNG
   :align: center

   Example model with fixed dimension
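You can confirm the result by loading the converted model and inspecting its input shapes. A quick check along these lines (assuming the ``model.fixed.onnx`` produced by the command above):

.. code-block:: python

   import onnxruntime as ort

   # Inspect the input shapes of the converted model.
   session = ort.InferenceSession("model.fixed.onnx", providers=["CPUExecutionProvider"])
   for model_input in session.get_inputs():
       print(model_input.name, model_input.shape)  # e.g. input:0 [1, 36, 36, 3]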
**Making Input Shape Fixed**

Some models have unnamed dynamic dimensions, often shown as ``?`` in Netron (for example, in the input ``x``). Since these dimensions are unnamed, you can update the shape using the ``--input_shape`` option.

.. figure:: /_asset/dynamic_shape_3.PNG
   :align: center

   Example model with unnamed dynamic dimension

To fix these dynamic shapes, use the following command:

.. code-block:: shell

   python3 -m onnxruntime.tools.make_dynamic_shape_fixed --input_name x --input_shape 1,3,960,960 model.onnx model.fixed.onnx

After running this command, the shape of ``x`` is fixed to ``[1, 3, 960, 960]``.

.. figure:: /_asset/dynamic_shape_4.PNG
   :align: center

   Example model with fixed dynamic dimension

ONNX TAO
--------

For more information about TAO, please refer to the official documentation: `MediaTek Genio TAO Documentation`_

ORT-GENAI
---------

Genio supports ONNX Runtime GenAI packages, which are available as Python pip packages integrated into the Yocto build. This feature is currently experimental, and GenAI models are executed only on the CPU.

With the ONNX Runtime GenAI package, you can run models from the Phi family (text and vision) as well as DeepSeek on Genio. This enables the development of chatbots and experimentation with the conversational capabilities of small language models (SLMs). As of now, ONNX Runtime GenAI version 0.6 has been tested on Genio.

*Note: This section will be updated as more features and support become available.*
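To illustrate what running an SLM with ONNX Runtime GenAI looks like, the sketch below follows the upstream ``onnxruntime-genai`` Python examples. The model folder and chat template are placeholders (e.g. a Phi-family CPU/INT4 export), and the exact API may differ slightly between GenAI releases:

.. code-block:: python

   import onnxruntime_genai as og

   # Placeholder path to a folder containing an ONNX GenAI model.
   model = og.Model("path/to/genai-model")
   tokenizer = og.Tokenizer(model)
   tokenizer_stream = tokenizer.create_stream()

   # Placeholder chat template; use the template that matches your model.
   prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"

   params = og.GeneratorParams(model)
   params.set_search_options(max_length=256)

   generator = og.Generator(model, params)
   generator.append_tokens(tokenizer.encode(prompt))

   # Stream the generated tokens to stdout.
   while not generator.is_done():
       generator.generate_next_token()
       new_token = generator.get_next_tokens()[0]
       print(tokenizer_stream.decode(new_token), end="", flush=True)
   print()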